====== Optimizing HPC programs ======
\\
Optimizing a code is a full-time job. This page gathers a few tips and the main compiler options to get good performance.
Many tutorials are available on the Internet if you want to go further.
Before looking into aggressive optimization, developers should hunt down common mistakes such as poor memory management, inefficient I/O, loop-invariant operations left inside loops, loop stripping, etc. The algorithms used should also be reviewed.
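As an illustration, here is a minimal Fortran sketch (array names and sizes are made up for the example) of two of these mistakes and their fix: a loop-invariant expression recomputed at every iteration, and a loop nest that walks a Fortran array against its column-major storage order.
<code fortran>
! Illustrative sketch only: names and sizes are hypothetical.
program loop_mistakes
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable :: a(:,:), b(:,:)
  real(8) :: c
  integer :: i, j

  allocate(a(n,n), b(n,n))
  call random_number(b)

  ! Bad: the expression 2.0d0*acos(-1.0d0)/n is loop-invariant but is
  ! recomputed n*n times, and the inner loop runs over the second index,
  ! so the column-major arrays are accessed with stride n.
  do i = 1, n
     do j = 1, n
        a(i,j) = b(i,j) * (2.0d0*acos(-1.0d0)/n)
     end do
  end do

  ! Better: hoist the invariant out of the loops and make the inner loop
  ! run over the first (contiguous) index; this also makes the inner loop
  ! easy for the compiler to vectorize.
  c = 2.0d0*acos(-1.0d0)/n
  do j = 1, n
     do i = 1, n
        a(i,j) = b(i,j) * c
     end do
  end do

  print *, a(1,1)   ! use the result so the compiler cannot drop the loops
end program loop_mistakes
</code>
Compilers can often hoist such invariants themselves at -O2/-O3, but writing the loop cleanly keeps the intent clear and does not rely on the optimizer.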
I try to keep these pages up to date, but some flags may be deprecated.
===== A few words =====
* Do not use exotic optimization flags you do not understand. Prefer basic flags and clean code; this lets the compiler make good choices.
* Be extremely careful with some flags: not all of them preserve numerical precision. For mathematical calculations, prefer conservative options.
* Try to maximize the use of vector (SIMD) instructions by giving the compiler clean code (simple, contiguous inner loops).
* Most of the time, plain binary I/O is more efficient than I/O through specialized libraries. With MPI, prefer writing one file per process (and assembling them later into the desired format) rather than one shared file for all processes. Note also that C I/O is often faster than Fortran I/O; it can be called from Fortran through interfaces (see the sketch after this list).
* There are many ways to optimize a code. You can use external libraries, tune the algorithm, etc. Try to think globally and don't focus only on the code.
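Below is a minimal sketch of such an interface: calling the C standard library's fopen/fwrite/fclose from Fortran through ISO_C_BINDING. The module name, array size, and file name are made up for the example, and error handling is kept to a strict minimum.
<code fortran>
! Minimal sketch: module name, array, and file name are hypothetical.
module c_binary_io
  use iso_c_binding
  implicit none
  interface
     function fopen(path, mode) bind(C, name="fopen") result(fp)
       import :: c_ptr, c_char
       character(kind=c_char), dimension(*), intent(in) :: path, mode
       type(c_ptr) :: fp
     end function fopen
     function fwrite(buf, itemsize, nitems, fp) bind(C, name="fwrite") result(nwritten)
       import :: c_ptr, c_size_t
       type(c_ptr), value :: buf, fp
       integer(c_size_t), value :: itemsize, nitems
       integer(c_size_t) :: nwritten
     end function fwrite
     function fclose(fp) bind(C, name="fclose") result(ierr)
       import :: c_ptr, c_int
       type(c_ptr), value :: fp
       integer(c_int) :: ierr
     end function fclose
  end interface
end module c_binary_io

program write_binary
  use iso_c_binding
  use c_binary_io
  implicit none
  real(c_double), allocatable, target :: a(:)
  type(c_ptr) :: fp
  integer(c_size_t) :: nwritten

  allocate(a(1000000))
  call random_number(a)

  ! Open the file in binary write mode and dump the whole array in one call.
  fp = fopen("data.bin"//c_null_char, "wb"//c_null_char)
  if (.not. c_associated(fp)) stop "fopen failed"
  nwritten = fwrite(c_loc(a), c_sizeof(a(1)), int(size(a), c_size_t), fp)
  if (fclose(fp) /= 0) stop "fclose failed"
  print *, "wrote", nwritten, "elements"
end program write_binary
</code>
Writing the whole array in a single call avoids the per-record overhead of many small Fortran writes.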
===== Optimization order =====
The recommended order for optimization is:\\
Algorithm optimization (keeping future parallelization in mind) => Code optimization => Parallelization
===== External resources =====
http://wiki.gentoo.org/wiki/GCC_optimization/en \\
https://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler
===== Basics =====
To get the best performance from the compilers, and assuming you compile on the same computer architecture (CPU, motherboard, etc.) as the one running the calculations, use the following options:
==== Gnu gfortran ====
** Standard :**
-O3 -march=native -mtune=native
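For example, to build a single-file program (the file name prog.f90 is just a placeholder):
gfortran -O3 -march=native -mtune=native -o prog prog.f90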
** Hard optimization ** (use carefully: these flags may slow the code down or give wrong results; note that GCC treats -O4 and higher as -O3) **:**
-O4 -ffast-math -fforce-addr -fstrength-reduce -frerun-cse-after-loop -fexpensive-optimizations -fcaller-saves -funroll-loops -funroll-all-loops -fno-rerun-loop-opt
==== Intel ifort ====
-O3 -fast -xHost
If you run into problems with **-fast** (often because of missing static libraries), replace it with **-O3 -xHost -no-prec-div -ipo**. If you still have problems at link time, replace **-ipo** with **-ip**. If none of these work properly, **-O3 -xHost** is enough.
**Warning:** if you need precision (e.g. long computations: CFD, etc.), do not use **-no-prec-div**, only **-xHost**. **-no-prec-div** may reduce the precision of division operations.
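For example, a conservative build of a single-file program (prog.f90 is again a placeholder):
ifort -O3 -xHost -o prog prog.f90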
===== Vectorize =====
==== GCC (gcc/gfortran) ====
To get information on what is vectorized by the compiler, add:
-ftree-vectorizer-verbose=2
To change the verbosity, change the number at the end (0 to 6). Note: recent GCC releases deprecate this flag in favour of -fopt-info-vec.
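For example, with a recent gfortran (the source file name test.f90 is just a placeholder, as in the Intel example below):
gfortran -O3 -march=native -fopt-info-vec test.f90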
==== Intel Compiler (icc/ifort) ====
Note: -vec-report is now deprecated; use -qopt-report instead. The report is saved in an .optrpt file.
To get information on what is vectorized by the compiler, add:
-qopt-report=1
To change the verbosity, change the number at the end (0 to 5), e.g.:
-qopt-report=5
To see which SIMD instructions are actually used, add -fcode-asm -Faasm.s to get the assembly generated by the compiler:
ifort -O3 -qopt-report=1 test.f90 -fcode-asm -Faasm.s
Then take a look at asm.s. For example, SSE2 instructions look like this:
009a5 f2 44 0f 58 c7 addsd %xmm7, %xmm8 #test.f90:35.83
009aa f2 44 0f 59 c0 mulsd %xmm0, %xmm8 #test.f90:35.83
And if you used -xHost or similar optimization options, AVX and FMA may show up:
00942 c4 42 e9 a9 d8 vfmadd213sd %xmm8, %xmm2, %xmm11 #test.f90:34.25
00991 c5 4d 58 0b vaddpd (%rbx), %ymm6, %ymm9 #test.f90:35.61