Using LAPACK from C++

I like C++, and after switching from Fortran to C++ at the beginning of the nineties, I wanted to do everything in C++, including linear algebra. My first linear algebra library in C++ was GNUSSL by Robert D. Pierce. It was a nice template library that helped me a lot to understand how one can use templates creatively. Yet at some point I decided to benchmark it against the Fortran libraries. At that time I used a lot of code from Numerical Methods and Software as well as IMSL. While developing my open-source TDLIB, however, I wanted to base it on free libraries, and so I decided to try LAPACK. The test showed that LAPACK is much faster than GNUSSL, and I have used LAPACK ever since. I should mention that LAPACK was faster not because it is written in Fortran but because it uses better algorithms. It may well be that nowadays one can find a linear algebra library written completely in C++ with the same performance (for example, I have heard the announcement of Eigen, but I have not tried it yet). The goal of this section is to show how one can interface LAPACK from within C++ as well as to demonstrate which technologies behind LAPACK are a must for any linear algebra library.

In the LAPACK Users’ Guide (online version) there is a description of the available functions. There are a lot of them, as LAPACK has functions for different types of matrices, and you can speed up your code if you employ a function that fits your particular matrix. The procedure for using LAPACK is as follows. You search for the functions you need, then you open the Fortran code, where you find a description of the function arguments, for example for DGETRF. What is left is to write a declaration and just use the function in C++, as has been shown in the previous section Using decomp and solve from C++. Two basic examples for DGESV and DPOTRI can be found here
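Independently of those examples, the declaration pattern itself can be sketched as follows. It assumes the gfortran name mangling discussed later (lowercase name with a trailing underscore); the helper `colmajor` is my own illustration, not part of LAPACK:

```cpp
#include <cstddef>

// Sketch: declaring LAPACK routines for use from C++. This assumes the
// gfortran name mangling (lowercase name plus a trailing underscore).
// Every argument is passed by pointer, and the data must be column-major.
extern "C" {
    void dgetrf_(const int* m, const int* n, double* a,
                 const int* lda, int* ipiv, int* info);
    void dgetrs_(const char* trans, const int* n, const int* nrhs,
                 const double* a, const int* lda, const int* ipiv,
                 double* b, const int* ldb, int* info);
}

// colmajor is a hypothetical helper: the linear index of element (i, j)
// in a column-major matrix with leading dimension ld, as Fortran stores it.
inline std::size_t colmajor(std::size_t i, std::size_t j, std::size_t ld)
{
    return i + j * ld;
}
```

Linking then needs -llapack -lblas (and -lgfortran when the libraries were built with gfortran), as in the compile commands below.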


Somewhat better examples can be found in TDLIB

where there are inline wrapper functions that simplify the use of LAPACK functions in C++ code (see the ex subdirectory for examples of how to use the header lapack.h).

Below I consider DGETRF for the LU decomposition of a general double precision matrix and DGETRS for the back substitution. To understand the technology I will also interface DGETF2, the level-2 BLAS version of DGETRF. DGETF2 is similar to decomp, while DGETRF uses the block algorithm, and hence we can compare what the block algorithm brings to the LU decomposition performance in LAPACK.
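For reference, here is a minimal sketch of the unblocked algorithm that DGETF2 implements and of the back substitution that DGETRS performs. This is my own illustration on a column-major array, not LAPACK code, and it skips the singularity checks a real routine must do:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Unblocked LU factorization with partial pivoting (what DGETF2 does),
// on an n-by-n column-major array a; ipiv records the row swaps.
void lu_unblocked(int n, std::vector<double>& a, std::vector<int>& ipiv)
{
    for (int k = 0; k < n; ++k) {
        // Pivot: largest entry in column k at or below the diagonal.
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(a[i + k * n]) > std::fabs(a[p + k * n])) p = i;
        ipiv[k] = p;
        if (p != k)  // swap whole rows k and p
            for (int j = 0; j < n; ++j) std::swap(a[k + j * n], a[p + j * n]);
        // Eliminate below the diagonal and update the trailing submatrix.
        for (int i = k + 1; i < n; ++i) {
            a[i + k * n] /= a[k + k * n];
            for (int j = k + 1; j < n; ++j)
                a[i + j * n] -= a[i + k * n] * a[k + j * n];
        }
    }
}

// Back substitution for A x = b using the factors (what DGETRS does);
// b is overwritten with the solution x.
void lu_solve(int n, const std::vector<double>& a,
              const std::vector<int>& ipiv, std::vector<double>& b)
{
    for (int k = 0; k < n; ++k) std::swap(b[k], b[ipiv[k]]);
    for (int k = 0; k < n; ++k)        // forward: L has a unit diagonal
        for (int i = k + 1; i < n; ++i) b[i] -= a[i + k * n] * b[k];
    for (int k = n - 1; k >= 0; --k) { // backward with U
        b[k] /= a[k + k * n];
        for (int i = 0; i < k; ++i) b[i] -= a[i + k * n] * b[k];
    }
}
```

For example, with A = {4, 2; 2, 3} and b = (10, 8) this gives x = (1.75, 1.5).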

I will use my matrix class from the previous section to keep the dense matrix column-wise in C++ (see also Class Matrix). At the end of matrix.h you will find the declarations of DGETRF, DGETRS and DGETF2 as well as the wrapper functions to use them with the Matrix class. In the header there is also a compiler macro MKL that changes the LAPACK function names, as gfortran compiles function names lowercase with a trailing underscore while in MKL on Windows they are uppercase without an underscore. The C++ code is similar to that for decomp and solve.

First I compile LAPACK with the reference BLAS from Netlib, and then I use LAPACK with an optimized BLAS to show the difference. I use gcc/gfortran under Cygwin for the first goal and Microsoft VC + MKL for the second (MKL already includes LAPACK), but it should not be too difficult to switch to other compilers and another optimized BLAS.

To compile LAPACK from Netlib under Cygwin, use the following commands (I see that there is a new version of LAPACK with a new C interface; in principle it should be possible to use it):

$ wget
$ tar zxvf lapack-3.3.0.tgz
$ cd lapack-3.3.0
$ cp make.inc.example make.inc
$ make blaslib
$ make lapacklib
$ ls *.a
blas_LINUX.a  lapack_LINUX.a

I compile only the libraries; just make will also compile the tests. It would be possible to speed up the process by compiling only the double precision version, but at the expense of a couple of extra commands.

The suffix LINUX comes from make.inc, where it could be removed. Now let us just rename the libraries

$ mv blas_LINUX.a libblas.a
$ mv lapack_LINUX.a liblapack.a

and note the directory where the libraries are located (pwd).

The makefile compiles the code first with LAPACK and the reference BLAS and then with Intel MKL. If you use other compilers or another optimized BLAS, please correct the makefile.

The first command to compile the code with gcc and the reference BLAS compiled above is as follows

$ g++ -L $HOME/misc/lib/lapack-3.3.0 -llapack -lblas -lgfortran -o main

You need to change the path after -L to adjust it for your setup. The second command, to use Intel MKL, is

$ cl -O2 -EHsc -D_SECURE_SCL=0 -MD -DUSECLOCK -DMKL mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -Femain2.exe

Here there are a few more options for MS VC to compile the C++ code and also two compiler macros: USECLOCK to use clock() in the Timing class and MKL to change the LAPACK names.

Intel MKL is multithreaded and by default uses all available cores. To run the tests with only one core, I issue under tcsh

$ setenv OMP_NUM_THREADS 1

I will show parallel benchmarks later. In the tables below are the times in seconds on my new HP 8540w notebook with an Intel Core i7 processor. The code is compiled 32-bit; gcc is 4.3.4 under Cygwin 1.7 and Intel MKL is 11.1. To make the comparison with solve and lu_factor, I have run them as well on the new notebook. SciPy is 0.7 with NumPy 1.3 under Python 2.5.

Matrix dimension  dgetf2  dgetrf  dgetf2 (MKL)  dgetrf (MKL)  decomp  lu_factor
500               0.046   0.047   0.031         0.016         0.046   0.02
1000              0.37    0.36    0.28          0.08          0.39    0.12
2000              5.4     2.8     4.1           0.55          6.0     0.81
3000              19.7    10.7    16.8          1.76          20.2    2.2
4000              46.6    24.8    43.8          4.1           48      5.3

The times for decomp, compared with those in the previous sections (Linear Solve in Python and Using decomp and solve from Fortran), went down, but the times for lu_factor are about the same. I do not know how to explain this; it could be because of the new hardware or due to the newer version of gcc.

In any case, the old Forsythe decomp is quite competitive with dgetf2 from the newest LAPACK. The LU decomposition as such has not changed since then; the changes are in using the BLAS and in the block algorithms.

The optimized BLAS by itself does not change much: dgetf2 with MKL is only a bit faster than dgetf2 with the reference BLAS. It is the combination of the block algorithm with the optimized BLAS that makes the difference. Comparing dgetf2 and dgetrf with the reference BLAS, one already sees a difference, and finally dgetrf with the optimized BLAS reduces the time almost tenfold compared with decomp and dgetf2. I should mention that on other hardware I have observed the gap between dgetf2 and dgetrf to be even larger than what the optimized BLAS adds, but this seems to be hardware dependent.

The performance in Python is close to that with Intel MKL. I guess that the SciPy version I employed was built with an old ATLAS, hence some difference in the table above.

Now back to the multithreaded BLAS in Intel MKL. The table below shows the times in seconds with 1, 2 and 4 cores. Four cores cut the time almost by a factor of two.

Matrix dimension  1 core  2 cores  4 cores
4000              4.1     2.9      1.9
10000             67      44       26.7

Finally, I would like to mention an interesting project, FLAME (Formal Linear Algebra Method Environment), by Prof. Robert A. van de Geijn.

There is also a nice book that quite didactically describes the numerics as well as a nice new way to program linear algebra: The Science of Programming Matrix Computations.

For those who know Russian, I have a short description of the book here:


