I like C++, and after I switched from Fortran to C++ at the beginning of the nineties, I wanted to do everything in C++, including linear algebra. My first linear algebra library in C++ was GNUSSL by Robert D. Pierce. It was a nice template library that helped me a lot to understand how one can use templates creatively. Yet at some point I decided to benchmark it against the Fortran libraries. At that time I used a lot of code from Numerical Methods and Software as well as from IMSL. While developing my open source TDLIB, however, I wanted to ground it on free libraries, and so I decided to try LAPACK. The test showed that LAPACK is much faster than GNUSSL, and I have used LAPACK ever since. I should mention that LAPACK was faster not because it is written in Fortran but rather because it uses better algorithms. It may well be that nowadays one can find a linear algebra library completely in C++ with the same performance (for example, I have heard the announcement of Eigen, http://eigen.tuxfamily.org/, but I have not tried it yet). The goal of this section is to show how one can interface LAPACK from within C++ as well as to demonstrate which technologies behind LAPACK are a must for any linear algebra library.
In the LAPACK Users’ Guide (online version http://www.netlib.org/lapack/lug/) there is a description of the available functions. There are a lot of them, as LAPACK has functions for different types of matrices, and you can speed up your code if you employ a function that fits your particular matrix. The procedure of using LAPACK is as follows. You search for the functions you need, then you open the Fortran code and there you find a description of the function arguments, for example for DGETRF. What is left is to write a declaration and just use the function in C++, as has been shown in the previous section Using decomp and solve from C++. Two basic examples for DGESV and DPOTRI could be found here. Somewhat better examples could be found in TDLIB
http://evgenii.rudnyi.ru/soft/tdlib00+/lib/lapack/
where there are inlined wrapper functions to simplify the use of LAPACK functions in the C++ code (see the ex subdirectory for examples of how to use the header lapack.h).
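To give the flavor of the first step, below is a minimal sketch of mine for DGESV, not the code from the examples above: the Fortran routine is declared with extern "C" (the lowercase name with the trailing underscore assumes the gfortran convention) and called on a small matrix stored column-wise.

#include <iostream>

extern "C" void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

int main()
{
  // 2x2 system [3 1; 1 2] x = [9; 8]; the solution is x = (2, 3).
  // LAPACK expects the matrix stored column-wise (Fortran order).
  int n = 2, nrhs = 1, info;
  double a[] = {3., 1., 1., 2.}; // columns (3,1) and (1,2)
  double b[] = {9., 8.};
  int ipiv[2];
  dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);
  std::cout << "info = " << info
            << ", x = " << b[0] << " " << b[1] << std::endl;
  return 0;
}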
Below I consider DGETRF for the LU-decomposition of a general double precision matrix and DGETRS for the back substitution. To understand the technology I will also interface DGETF2, the level 2 BLAS version of DGETRF. DGETF2 is similar to decomp, while DGETRF uses the block algorithm, hence we can compare what the block algorithm brings to the LU-decomposition performance in LAPACK.
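For reference, the declarations for the three routines could look as in the sketch below (again assuming the gfortran naming convention; note the hidden string length argument that gfortran appends for the CHARACTER parameter TRANS of DGETRS), followed by the factor-then-solve sequence in comments.

extern "C" {
  // blocked LU-decomposition with partial pivoting
  void dgetrf_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);
  // unblocked (level 2 BLAS) variant with the same arguments
  void dgetf2_(int *m, int *n, double *a, int *lda, int *ipiv, int *info);
  // back substitution; trans_len is the hidden length of trans
  void dgetrs_(char *trans, int *n, int *nrhs, double *a, int *lda,
               int *ipiv, double *b, int *ldb, int *info, int trans_len);
}

// Factor once, then solve for a right-hand side b (both column-wise):
//   int info;
//   std::vector<int> ipiv(n);
//   dgetrf_(&n, &n, a, &n, &ipiv[0], &info);
//   char trans = 'N';
//   if (info == 0)
//     dgetrs_(&trans, &n, &nrhs, a, &n, &ipiv[0], b, &n, &info, 1);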
I will use my matrix class from the previous section to keep the dense matrix column-wise in C++ (see also Class Matrix). At the end of matrix.h you will find the declarations of DGETRF, DGETRS and DGETF2 as well as the wrapper functions to use them with the Matrix class. In the header there is also a compiler macro MKL that changes the LAPACK function names, as gfortran compiles function names lowercase and with an underscore, while in MKL on Windows they are uppercase and without an underscore. The C++ code is in main.cc, which is similar to main.cc for decomp and solve and to 02lu.py.
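The idea of the macro is simple; a sketch follows (the actual code in matrix.h may differ in detail):

// gfortran produces lowercase symbols with a trailing underscore,
// MKL on Windows uppercase ones without it; the macro adapts the
// names used in the code.
#ifdef MKL
#define dgetrf_ DGETRF
#define dgetf2_ DGETF2
#define dgetrs_ DGETRS
#endif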
First I compile LAPACK with the reference BLAS from Netlib, and then use LAPACK with an optimized BLAS to show the difference. I use gcc/gfortran under Cygwin for the first goal and then Microsoft VC + MKL for the second (MKL already includes LAPACK), but it should not be too difficult to switch to other compilers and another optimized BLAS.
To compile LAPACK from Netlib under Cygwin, use the following commands (I see that there is a newer version of LAPACK with a new C interface; in principle it could be used as well):
$ wget http://www.netlib.org/lapack/lapack-3.3.0.tgz
$ tar zxvf lapack-3.3.0.tgz
$ cp INSTALL/make.inc.gfortran make.inc
$ make blaslib
$ make lapacklib
$ ls *.a
blas_LINUX.a lapack_LINUX.a
I compile only the libraries; a plain make would also compile the tests. It would be possible to speed up the process by compiling only the double precision routines, but at the expense of a couple of extra commands.
The suffix LINUX comes from make.inc.gfortran; it could be removed there. Now let us just rename the libraries
$ mv blas_LINUX.a libblas.a
$ mv lapack_LINUX.a liblapack.a
and note the directory where the files are located (pwd).
The makefile compiles main.cc first with LAPACK and the reference BLAS and then with Intel MKL. If you use other compilers or another optimized BLAS, please adjust the makefile.
The first command, to compile the code with gcc and the reference BLAS built above, is as follows:
$ g++ main.cc -L $HOME/misc/lib/lapack-3.3.0 -llapack -lblas -lgfortran -o main
You need to change the path after -L to adjust it to your setup. The second command, to use Intel MKL, is:
$ cl -O2 -EHsc -D_SECURE_SCL=0 -MD -DUSECLOCK -DMKL main.cc mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -Femain2.exe
Here there are some more options for MS VC to compile the C++ code as well as two compiler macros: USECLOCK to use clock() in the Timing class, and MKL to change the LAPACK names.
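The Timing class itself is in the sources; just to show the idea, a minimal sketch of its clock() branch could look like this (the actual class may differ):

#include <ctime>

// Minimal sketch: measure elapsed processor time with clock().
class Timing
{
  std::clock_t start_;
public:
  Timing() : start_(std::clock()) {}
  double seconds() const
  { return double(std::clock() - start_) / CLOCKS_PER_SEC; }
};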
Intel MKL is multithreaded and by default uses all available cores. To run the tests with a single core, I issue under tcsh
$ setenv OMP_NUM_THREADS 1
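Alternatively, the number of threads can be set from the code itself with the MKL service function mkl_set_num_threads (a sketch, assuming the MKL headers are available):

#include <mkl.h>

int main()
{
  mkl_set_num_threads(1); // force single-threaded BLAS for the benchmark
  // ... run the factorizations and measure the times ...
  return 0;
}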
I will show parallel benchmarks later. The table below gives the times in seconds on my new HP 8540w notebook with an Intel Core i7 processor; the code is compiled 32-bit. gcc is 4.3.4 under Cygwin 1.7 and Intel MKL is 11.1. To make a comparison with decomp and lu_factor, I have run them on the new notebook as well. SciPy is 0.7 with NumPy 1.3 under Python 2.5.
Matrix dimension | dgetf2 | dgetrf | dgetf2 (MKL) | dgetrf (MKL) | decomp | lu_factor |
500 | 0.046 | 0.047 | 0.031 | 0.016 | 0.046 | 0.02 |
1000 | 0.37 | 0.36 | 0.28 | 0.08 | 0.39 | 0.12 |
2000 | 5.4 | 2.8 | 4.1 | 0.55 | 6.0 | 0.81 |
3000 | 19.7 | 10.7 | 16.8 | 1.76 | 20.2 | 2.2 |
4000 | 46.6 | 24.8 | 43.8 | 4.1 | 48 | 5.3 |
The times for decomp, compared with those in the previous sections (Linear Solve in Python and Using decomp and solve from Fortran), went down, but the times for lu_factor are about the same. I do not know how to explain this; it could be because of the new hardware or due to the newer version of gcc.
In any case the old Forsythe’s decomp is quite competitive with dgetf2 from the newest LAPACK. The LU-decomposition as such has not changed since then; the changes are in the use of BLAS and in the block algorithms
http://www.netlib.org/lapack/lug/node66.html
The BLAS as such does not change that much: dgetf2 with the optimized BLAS is just a bit faster than dgetf2 with the reference BLAS. It is the combination of the block algorithms with the optimized BLAS that makes the difference. Already with the reference BLAS one sees the difference between dgetf2 and dgetrf, and finally dgetrf with the optimized BLAS reduces the time almost ten times as compared with decomp and dgetf2. I should mention that on other hardware I have observed that the difference between dgetf2 and dgetrf is even larger than what the optimized BLAS adds
http://groups.google.com/group/matrixprogramming/msg/7c5bbc612652a198
but this seems to be hardware dependent.
The performance in Python is close to that with Intel MKL. I guess that the SciPy version I have employed uses an older ATLAS, hence the remaining difference in the table above.
Now back to the multithreaded BLAS in Intel MKL. The table below shows the times in seconds with 1, 2 and 4 cores. Four cores cut the time by more than a factor of two.
Matrix dimension | 1 core | 2 cores | 4 cores |
4000 | 4.1 | 2.9 | 1.9 |
10000 | 67 | 44 | 26.7 |
Finally I would like to mention an interesting project, FLAME (Formal Linear Algebra Methods Environment) by Prof Robert A. van de Geijn
http://www.cs.utexas.edu/~flame/
There is also a nice book that quite didactically describes the numerics as well as a nice new way to program linear algebra: The Science of Programming Matrix Computations. For those who know Russian, I have a short description of the book here:
http://blog.rudnyi.ru/ru/2010/05/science-of-programming-matrix-computations.html