====== Debugging HPC programs ======
\\
This page describes debugging methods for HPC codes. New HPC developers should know these basic options, which save a lot of time.
All the methods here aim at tracing a bug, that is, finding the exact line of code that triggers it.
I try to keep this page up to date, but some flags may be deprecated.
Beware: some of the things called bugs here are not bugs but legitimate mathematical/physical results. For example, a calculation may end with a result too large to be represented in a 64-bit floating-point number. This is not really a bug, only a limitation: the code and the calculation are correct.
If you are new to HPC programming or to debugging, a short tutorial on how to use the following flags is available: see [[software:development:debug:help|help debug]]. It also contains examples for FPE and uninitialized values debugging. All the other methods follow the same philosophy.
For reference :\\
Compilers used :
* gcc and gfortran 4.8.2 (from Ubuntu 14.04 x86_64)
* icc and ifort 14.0.3 (from Intel Parallel Studio 2013 SP1 x86_64)
Tools used :
* valgrind 3.10.0
* gdb 7.7
Files used to simulate the most common bugs : [[software:development:debug:debf|deb_f.f90]] , [[software:development:debug:debc|deb_c.c]].
===== Main types of bugs =====
When developing HPC programs, the bugs encountered are often the same. Here is a list of the most common ones :
* [[#floating_point_exception|Floating point exceptions, or FPE (Invalid, Overflow, Zero) ]]
* [[#Uninitialized_variables|Reading uninitialized values]]
* [[#Allocation/deallocation_issues|Allocation/deallocation issues]]
* [[#Array_out_of_bound_reading/writing|Array out of bound reading/writing]]
* [[#IO_issues|IO issues]]
* [[#Memory_leak|Memory leak]]
* [[#Stack_overflow|Stack overflow]]
* [[#Buffer_overflow|Buffer overflow]]
There are many other types of bugs, but these are the most common and the easiest to solve with the appropriate tools.
===== When could there be a bug ? =====
The first step is to identify the presence of a bug :
* Program returns an error message
* Program returns an error exit code (other than 0)
* Program finishes with NaN or +Inf values
* Program ends unexpectedly
* Other cases: many scenarios are possible
How to get the exit code of a program ?
* $? gives you the exit code of the last executed command.
* A value other than 0 means something went wrong, and the code may help you understand why.
~$ gfortran myokprog.f90
~$ ./a.out
Hello world !
~$ echo $?
0
~$ gfortran mybugprog.f90
~$ ./a.out
Program received signal SIGSEGV: Segmentation
fault - invalid memory reference.
Backtrace for this error:
#0 0x7FFC993C87D7
#1 0x7FFC993C8DDE
#2 0x7FFC9901FC2F
Segmentation fault (core dumped)
~$ echo $?
139
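The exit code 139 shown above follows the convention used by bash and similar shells: a process killed by signal N is reported as 128 + N, and SIGSEGV is signal 11 on Linux. Here is a minimal C sketch that decodes such a code. The helper name ''signal_from_exit_code'' is hypothetical, and the 128 + N rule is a shell convention, not a guarantee on every system.

```c
#include <signal.h>

/* Hypothetical helper: returns the signal number encoded in a shell exit
 * code (128 + N convention of bash and similar shells), or 0 when the code
 * does not look like a termination by signal. */
int signal_from_exit_code(int exit_code)
{
    if (exit_code > 128 && exit_code <= 255) {
        return exit_code - 128;
    }
    return 0;
}
```

With the session above, ''signal_from_exit_code(139)'' gives 11, which can be compared with ''SIGSEGV'' from ''signal.h''.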
===== How to find them =====
Here is the list of debug flags/tools to use to trace the bugs discussed above.
The first part is generic (Quick debug strategy), while the second part is specific to each bug.
==== Quick debug strategy ====
Most of the time, these compilation options will find your bug (except with gcc, which has only a few debug options) :
^ Compiler ^ Compiler options ^
| gfortran | -Wuninitialized -O -g -fbacktrace -ffpe-trap=zero,underflow,overflow,invalid -fbounds-check -fimplicit-none -ftrapv |
| gcc | -g -Wall |
| ifort | -g -traceback -fpe0 -check all -ftrapuv -fp-stack-check -warn all -no-ftz |
| icc | Test 1 : -g -traceback -check=uninit -fp-stack-check -no-ftz\\ Test 2 : -g -traceback -check-pointers=rw |
For C code, also try the FPE strategy (see below).
If this is not enough, compile with :
^ Compiler ^ Compiler options ^
| gfortran | -g -fbacktrace |
| gcc | -g |
| ifort | -g -traceback |
| icc | -g -traceback |
And launch the program with valgrind :
~$ valgrind ./myprog.exe
Most of the time this will catch the error.
==== Floating Point exception ====
There are three types of FPE :
* **Zero** : when you divide by zero, very common in HPC. For example : A/0.0 = +∞ (for A > 0)
* **Invalid** : when the operation is mathematically impossible. For example : acos(10.0) = NaN
* **Overflow/Underflow** : when the result is larger (or smaller) in magnitude than what the floating-point format can hold. For example : exp(10E15) = a huge number = +Inf
**Behavior :** by default, an FPE does not generate an error at runtime or at compilation time (GCC/INTEL). The program silently continues, typically with Inf or NaN values.
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | Compiler flags : **-g -fbacktrace -ffpe-trap=zero,underflow,overflow,invalid**.\\ The fpe will be explicitly displayed at runtime. |
| ifort | Compiler flags : **-g -traceback -fpe0**.\\ The fpe will be explicitly displayed at runtime. |
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc and icc | Add **#define _GNU_SOURCE** followed by **#include %%<fenv.h>%%** at the top of the main source file, then call **feenableexcept(FE_DIVBYZERO|FE_INVALID|FE_OVERFLOW);** at the very beginning of main (feenableexcept is a glibc extension).\\ Compiler flags : **-g** (link with **-lm**).\\ The FPE will raise a floating point exception (SIGFPE) at runtime. Then use gdb to get information on the line of code generating the FPE. |
==== Uninitialized variables ====
This bug occurs when you read a variable that was never initialized. The program may not stop, and all the following calculations will be based on a random value. This is common with MPI programs (ghosts, etc.).\\
Three main types of uninitialized variables :
* **Static variable** : the uninitialized variable is static
* **Dynamic variable** : the uninitialized variable is dynamic
* **Not allocated variable** : a dynamic variable is used before being allocated
**Behavior :**
* Static variable : no error at runtime
* Dynamic variable : no error at runtime
* Not allocated variable : segmentation fault at runtime
Valgrind's Memcheck lets the program run and use uninitialized values, while keeping track of these operations. It only complains when such a value "goes out" of the program (printing in the terminal, writing to a file, etc.). The error is reported at the line of this print/write. To get more information on the uninitialized variable, use %%--%%track-origins=yes as Valgrind flag.
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | - static variable : Compiler options : **-Wuninitialized -O -g -fbacktrace**. Will display a warning at compilation time.\\ To get more information, use **Valgrind**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - dynamic variable : Compiler options : **-g -fbacktrace**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - not allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime.|
| ifort | - static variable : Compiler options : **-check all**. The error will be explicitly displayed at runtime.\\ Uninitialized values can also be replaced by a huge number with **-ftrapuv**. |
| | - dynamic variable : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - not allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime.|
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | - static variable : Compiler options : **-Wuninitialized** or **-Wall**. Will display a warning at compilation time.\\ To get more information, use **Valgrind**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - dynamic variable : Compiler options : **-g**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - not allocated variable : Compiler options : **-Wuninitialized** or **-Wall**. Will display a warning at compilation time.\\ To get more information, use **Valgrind**. The error will be a “Conditional jump or move depends on uninitialised value(s)”\\ For even more detail, use **gdb** and ask for a **backtrace**. |
| icc | - static variable : Compiler options : **-Wuninitialized**. Will display a warning at compilation time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |
| | - dynamic variable : Compiler options : **-g -traceback**.\\ Use **Valgrind %%--%%track-origins=yes**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - not allocated variable : Compiler options : **-Wuninitialized**. Will display a warning at compilation time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |
==== Allocation/deallocation issues ====
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | - free a non allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| | - not freed memory : Compiler options : **-g -fbacktrace**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| ifort | - free a non allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |
| | - not freed memory : Compiler options : **-g -traceback**. \\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | - free a non allocated variable : Compiler options : **-Wuninitialized** or **-Wall**. Will display a warning at compilation time.\\ To get more information, use **Valgrind**. The error will be a “Conditional jump or move depends on uninitialised value(s)” |
| | - allocate an already allocated variable : Compiler options : **-g** (-fbacktrace is a gfortran option and does not apply to C).\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| | - not freed memory : Compiler options : **-g**.\\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| icc | - free a non allocated variable : Compiler options : **-Wuninitialized**. Will display a warning at compilation time.\\ **-g -traceback -check=uninit**. The error will be explicitly displayed at runtime. |
| | - allocate an already allocated variable : Compiler options : **-g -traceback**. \\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| | - not freed memory : Compiler options : **-g -traceback**. \\ Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
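On the C side, several of the issues listed above (freeing twice, using a freed pointer, forgetting to free) can be made easier to trace with a small allocation discipline. This is a minimal sketch, not a replacement for Valgrind. The helper names ''alloc_doubles'' and ''free_doubles'' are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: allocates count doubles, all set to zero, so that no
 * element is ever read uninitialized. Returns NULL on failure. */
double *alloc_doubles(size_t count)
{
    return calloc(count, sizeof(double));
}

/* Hypothetical helper: frees *ptr and resets it to NULL. A second call is
 * then harmless (free(NULL) does nothing), and a later use of the pointer
 * is a clean NULL dereference that gdb and Valgrind locate immediately. */
void free_doubles(double **ptr)
{
    if (ptr != NULL) {
        free(*ptr);
        *ptr = NULL;
    }
}
```

A caller always checks the result of ''alloc_doubles'' against NULL before use, and releases memory with ''free_doubles(&array)'' rather than a bare ''free''.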
==== Array out of bound reading/writing ====
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace -fbounds-check** (-fbounds-check is a deprecated alias of -fcheck=bounds). The error will be explicitly displayed at runtime. |
| ifort | Compiler options : **-g -traceback -check all** (or -check bounds). The error will be explicitly displayed at runtime.|
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **Valgrind**, the error will be an "Invalid read/write of size 8/16".\\ Or patch gcc and recompile it with bounds checking (http://sourceforge.net/projects/boundschecking/) |
| icc | Compiler options : **-g -traceback -check-pointers=rw**. The error will be explicitly displayed at runtime. \\ Warning : when activated, check-pointers=rw disables all the other debugging options, be careful.|
==== IO issues ====
IO errors are often very explicit, and no debugging tool is needed. However, Valgrind and the FPE options can detect some related errors (a bad read leads to a badly initialized value or to an FPE, etc.).
Do not forget to set **-g -fbacktrace** (gfortran) or **-g -traceback** (icc/ifort) to get useful error information.
Simply be careful and secure every read/write: get the return code and check it.
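Securing a read in C means checking the return value of every call. Here is a minimal sketch, assuming a text file of whitespace-separated numbers. The helper name ''read_doubles'' is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to max_count whitespace-separated doubles
 * from the file at path into values. Returns the number of values read,
 * or -1 if the file cannot be opened or a read error occurs. */
long read_doubles(const char *path, double *values, long max_count)
{
    FILE *file = fopen(path, "r");
    long count = 0;

    if (file == NULL) {
        perror(path);            /* explicit message: missing file, rights */
        return -1;
    }
    /* fscanf returns the number of items converted: anything other than 1
     * means end of file, a malformed number, or a read error. */
    while (count < max_count && fscanf(file, "%lf", &values[count]) == 1) {
        count++;
    }
    if (ferror(file)) {
        perror(path);
        fclose(file);
        return -1;
    }
    fclose(file);
    return count;
}
```

The caller compares the returned count with the number of values it expects. A short count caught here is much easier to trace than the uninitialized value or FPE it would cause later in the calculation.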
==== Memory leak ====
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| ifort | Compiler options : **-g -traceback**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
| icc | Compiler options : **-g -traceback**. Use **Valgrind %%--%%leak-check=full**. Look for LEAK SUMMARY, definitely lost. |
==== Stack overflow ====
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it with backtrace, but gives little information. |
| ifort | Compiler options : **-g -traceback**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it with backtrace, but gives little information. |
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it with backtrace, but gives little information. |
| icc | Compiler options : **-g -traceback**. Use Valgrind. Look for "Stack overflow in thread X" or "Access not within mapped region".\\ **gdb** will also catch it with backtrace, but gives little information. |
==== Buffer overflow ====
=== Tracing in Fortran ===
^ Compiler ^ Way to trace bug ^
| gfortran | Compiler options : **-g -fbacktrace**. The error will be explicitly displayed at runtime. |
| ifort | Compiler options : **-g -traceback**. The error will be explicitly displayed at runtime. |
=== Tracing in C ===
^ Compiler ^ Way to trace bug ^
| gcc | Compiler options : **-g**. Use **gdb**. Ask for a **backtrace** after the error, it gives a lot of information. |
| icc | Compiler options : **-g -traceback -check-pointers=rw**. The error will be explicitly displayed at runtime. \\ Warning : when activated, check-pointers=rw disables all the other debugging options, be careful.|