I like C++ and after I have switched from Fortran to C++ at the beginning of the nineties, I wanted to do everything in C++ including linear algebra. My first linear algebra in C++ was GNUSSL by Robert D. Pierce. It was a nice template library that helped me a lot to understand how one can use templates creatively. Yet, at some point I have decided to benchmark it with the Fortran libraries. At that time I have used a lot the code from Numerical Methods and Software as well as IMSL. Yet, while developing my open source TDLIB, I wanted to ground it on free libraries and then I have decided to try LAPACK. The test has shown that LAPACK is much faster than GNUSSL and I use LAPACK since then. I should mention that LAPACK was faster not because it is written in Fortran but rather because it uses better algorithms. It well might be that nowadays one can find a linear algebra completely in C++ with the same performance (for example I have heard announcement about eigen, http://eigen.tuxfamily.org/, but I have not tried it yet). The goal of this section is to show how one can interface LAPACK from within C++ as well as to demonstrate what technologies behind LAPACK are must for any linear algebra library.

In LAPACK Users’ Guide (online version http://www.netlib.org/lapack/lug/) there is a description of available functions. There a lot of them as LAPACK has functions for different types of matrices and you can speed up your code if you employ a function that fits your particular matrix. The procedure of using LAPACK is as follows. You search for functions you need, then you open the Fortran code and there you find a description of function arguments, for example for DGETRF. Then what is left is to write a declaration and just use it in C++ as has been shown in the previous section Using decomp and solve from C++. Two basic examples for `DGESV`

and `DPOTRI`

could be found here

Somewhat better examples could be found in TDLIB

http://evgenii.rudnyi.ru/soft/tdlib00+/lib/lapack/

where there are inlined wrapper functions to simplify the use of LAPACK functions in the C++ code (see `ex`

subdirectory for examples on how to use the header `lapack.h`

).

Below I consider `DGETRF`

for the LU-decomposition of the general double precision matrix and `DGETRS`

for the back substitution. To understand the technology I will also interface `DGETF2`

, the level 2 BLAS versions of `DGETRF`

. `DGETF2`

is similar to `decomp`

and `DGETRF`

uses the block algorithm and hence we can compare what the block algorithm brings to the LU-decomposition performance in LAPACK.

I will use my matrix class from the previous section to keep the dense matrix column-wise in C++ (see also Class Matrix). At the end of matrix.h you will find declaration of `DGETRF`

, `DGETRS`

and `DGETF2`

as well as the wrapper functions to use them with the `Matrix`

class. In the header there is also a compiler macro `MKL`

that changes the LAPACK function names as gfortran compiles function names lowercase and with underscore and in the MKL on Windows they are uppercase and without underscore. The C++ code is in main.cc that is similar to main.cc for decomp and solve and 02lu.py.

I compile first LAPACK with the reference BLAS from Netlib and then use LAPACK with the optimized BLAS to show the difference. I use `gcc/gfortran`

under Cygwin for the first goal and then Microsoft VC + MKL for the second (MKL already includes LAPACK) but it should be not too difficult to change the compilers and the optimized BLAS.

To compile LAPACK from Netlib under Cygwin use the next commands (I see that there is a new version of LAPACK and the new C interface, it could be in principle possible to use it):

`$ wget http://www.netlib.org/lapack/lapack-3.3.0.tgz`

$ tar zxvf lapack-3.3.0.tgz

$ cp INSTALL/make.inc.gfortran make.inc

$ make blaslib

$ make lapacklib

$ ls *.a

blas_LINUX.a lapack_LINUX.a

I compile only libraries. Just `make`

will also compiles tests. It could be possible to speed up the process by compiling only the double version but then with expense of couple of extra commands.

The suffix `LINUX`

comes from `make.inc.gfortran`

, it could be possible to remove it there. Now let us just rename the libraries

`$ mv blas_LINUX.a libblas.a`

$ mv lapack_LINUX.a liblapack.a

and mention the directory where the files are located (`pwd`

).

The makefile compiles main.cc with the LAPACK and reference BLAS and then with Intel MKL. If you use other compilers or another optimized BLAS, please correct the makefile.

The first command to compile the code with `gcc`

and the reference BLAS compiled above is as follows

`$ g++ main.cc -L $HOME/misc/lib/lapack-3.3.0 -llapack -lblas -lgfortran -o main`

You need to change the path after `–L`

to adjust it for your setup. The second command to use Intel MKL is

`$ cl -O2 -EHsc -D_SECURE_SCL=0 -MD -DUSECLOCK -DMKL main.cc mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib -Femain2.exe`

Here there are some more options for MS VC to compile the C++ code and also two compiler macros: `USECLOCK`

to use `clock()`

in the `Timing`

class and `MKL`

to change the LAPACK names.

Intel MKL is multithreaded and by default uses all available cores. To make the tests only with one core, I issue under tsch

`$ setenv OMP_NUM_THREADS 1`

I will show parallel benchmarks later. In the tables below there are times on my new HP 8540w notebook with Intel Core i7 processor. The code is compiled 32-bit. `gcc`

is 4.3.4 under Cygwin 1.7 and Intel MKL is 11.1. To make comparison with `solve`

and `lu_factor`

, I have run then as well on my new notebook. SciPy is 0.7 with NumPy 1.3 under Python 2.5.

Matrix dimension | dgetf2 | dgetrf | dgetf2 (MKL) | dgetrf (MKL) | decomp | lu_factor |

500 | 0.046 | 0.047 | 0.031 | 0.016 | 0.046 | 0.02 |

1000 | 0.37 | 0.36 | 0.28 | 0.08 | 0.39 | 0.12 |

2000 | 5.4 | 2.8 | 4.1 | 0.55 | 6.0 | 0.81 |

3000 | 19.7 | 10.7 | 16.8 | 1.76 | 20.2 | 2.2 |

4000 | 46.6 | 24.8 | 43.8 | 4.1 | 48 | 5.3 |

The times for `decomp`

compared with those in the previous sections (Linear Solve in Python and Using decomp and solve from Fortran) went down but the times for `lu_factor`

are about the same. I do not know how to explain this. It could be because of the new hardware or it is due to the newer version of `gcc`

.

In any case the old Forsythe’s decomp is quite competitive with `dgetf2`

from the newest LAPACK. The LU-decomposition as such has not changed since then. The changes are in using BLAS and in the block algorithms

http://www.netlib.org/lapack/lug/node66.html

The BLAS as such does not change much. `dgetf2`

is just a bit faster as `dgetf2`

with the reference BLAS. It is a combination of the block algorithms with the optimized BLAS that makes the difference. When we compare `dgetf2`

and `dgetrf`

with the reference BLAS then one already sees the difference. Finally `dgetrf`

with the optimized BLAS reduces time almost 10 times as compared with `decomp`

and `dgetf2`

. I should mention that on another hardware I have observed that the difference between `dgetf2`

and `dgetrf`

is even more than what adds the optimized BLAS:

http://groups.google.com/group/matrixprogramming/msg/7c5bbc612652a198

but this seems to be hardware dependent.

The performance in Python is close to that with Intel MKL. I guess that in the SciPy version that I have employed they have used old ATLAS, so some difference in the table above.

Now back to multithreaded BLAS in Intel MKL. The table below shows the performance with 1, 2 and 4 processors in seconds. Four cores cut time almost twice.

Matrix dimension | 1 core | 2 cores | 4 cores |

4000 | 4.1 | 2.9 | 1.9 |

10000 | 67 | 44 | 26.7 |

Finally I would like to mention an interesting project FLAME (Formal Linear Algebra Method Environment) by Prof Robert A. van de Geijn

http://www.cs.utexas.edu/~flame/

There is also a nice book that pretty didactically describes numerics as well as a new nice way to program linear algebra: The Science of Programming Matrix Computations

For those who knows Russian, I have some short description of the book here:

http://blog.rudnyi.ru/ru/2010/05/science-of-programming-matrix-computations.html

## Previous:

Overview

Linear Solve in Python (NumPy and SciPy)

Using decomp and solve from Fortran

Using decomp and solve from C++

RSS