\\
===== Debug using compilation options =====
Consider this Fortran program (C programmers should find it easy to follow):
Program Bug
implicit none
real, allocatable, dimension(:) :: tab
allocate(tab(1:10))
tab(:) = 1.0
call Buggy()
deallocate(tab)
contains
Subroutine Buggy()
print *, tab(11)
End Subroutine Buggy
End Program Bug
If you compile it with no options, or with optimization options only, and then execute it:
gfortran bug.f90
./a.out
You get the following output, with no errors or warnings:
1.85398793E-40
The ifort compiler behaves the same way. You know this result is absurd, but you want to locate the error. Now compile with debug options:
gfortran bug.f90 -g -Wuninitialized -O -fbacktrace -fbounds-check -ffpe-trap=zero,underflow,overflow,invalid -ftrapv -fimplicit-none -fno-automatic
./a.out
You get :
At line 15 of file bug.f90
Fortran runtime error: Array reference out of bounds for array 'tab', upper bound of dimension 1 exceeded (11 > 10)
Backtrace for this error:
+ function buggy (0x400A70)
at line 15 of file bug.f90
+ function bug (0x400BE4)
at line 9 of file bug.f90
+ /lib/libc.so.6(__libc_start_main+0xfd) [0x7fa49f04ac4d]
This report is simple to use: the error is at line 15 of file bug.f90, where the array tab is accessed at index 11 although its size is only 10 (in Fortran, array indices start at 1 by default).
Now, using ifort :
ifort bug.f90 -g -debug -traceback -check all -implicitnone -warn all -fpe0 -fp-stack-check -ftrapuv -heap-arrays -gen-interface -warn interface
./a.out
forrtl: severe (408): fort: (2): Subscript #1 of the array TAB has value 11 which is greater than the upper bound of 10
Image PC Routine Line Source
a.out 000000000046AA2E Unknown Unknown Unknown
a.out 00000000004694C6 Unknown Unknown Unknown
a.out 0000000000422242 Unknown Unknown Unknown
a.out 0000000000404AFB Unknown Unknown Unknown
a.out 0000000000405011 Unknown Unknown Unknown
a.out 000000000040356E bug_IP_buggy_ 15 bug.f90
a.out 0000000000403252 MAIN__ 8 bug.f90
a.out 0000000000402B8C Unknown Unknown Unknown
libc.so.6 00007F9CBFFFBEA5 Unknown Unknown Unknown
a.out 0000000000402A89 Unknown Unknown Unknown
This is also easy to understand: using the Line and Source columns, you can see that the main program called buggy at line 8, and that buggy triggered the error at line 15.
Using these methods, you can locate most bugs.
If this is not enough, or if your bug disappears when you use these options (it can happen), then you may need to use a debugger.
\\
===== Debug using a debugger =====
\\
Some say gdb is better, others say valgrind is better. In fact, both are good. I am just used to valgrind, so I will present this one. Note that valgrind can also be used to profile the code, check for memory leaks, test cache use, etc. We will see that in the optimization section. Note also that valgrind supports MPI if it was built with MPI support.
Last point: valgrind slows your execution down A LOT and is extremely verbose. If the bug appears only after a long run time and you know which part of the code it occurs in, you can use special flags to tell valgrind to monitor only that part (see the valgrind documentation).
Let's reuse our previous code. To use valgrind, you have to compile with the -g option, combined with the optimization flags your code normally uses.
gfortran bug.f90 -g -O3
valgrind ./a.out
==25150== Memcheck, a memory error detector
==25150== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==25150== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==25150== Command: ./a.out
==25150==
==25150== Invalid read of size 4
==25150== at 0x4F13EF0: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==25150== by 0x4F15AAE: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==25150== by 0x4F165FE: ??? (in /usr/lib/x86_64-linux-gnu/libgfortran.so.3.0.0)
==25150== by 0x40093B: MAIN__ (bug.f90:15)
==25150== by 0x4007AC: main (bug.f90:9)
==25150== Address 0x5c634e8 is 0 bytes after a block of size 40 alloc'd
==25150== at 0x4C2CD7B: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25150== by 0x4008B1: MAIN__ (bug.f90:6)
==25150== by 0x4007AC: main (bug.f90:9)
==25150==
0.00000000
==25150==
==25150== HEAP SUMMARY:
==25150== in use at exit: 0 bytes in 0 blocks
==25150== total heap usage: 23 allocs, 23 frees, 12,076 bytes allocated
==25150==
==25150== All heap blocks were freed -- no leaks are possible
==25150==
==25150== For counts of detected and suppressed errors, rerun with: -v
==25150== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)
OK, this is more difficult to understand, but valgrind locates nearly everything and is made for more advanced users, so you will have to deal with it. Here, the invalid read comes from bug.f90:15 and touches the address located 0 bytes after the end of a 40-byte block (10 reals of 4 bytes each) allocated at bug.f90:6, i.e. element 11 of tab.
Some errors reported by valgrind are:
* Invalid read of size 4: you tried to read 4 bytes at an invalid address, typically a non-existing value of float/real(4) or 4-byte integer type.
* Invalid read of size 8: you tried to read 8 bytes at an invalid address, typically a non-existing value of double/real(8) type.
* Invalid write of size 4: you tried to write 4 bytes at an invalid address, typically a non-existing value of float/real(4) or 4-byte integer type.
* Invalid write of size 8: you tried to write 8 bytes at an invalid address, typically a non-existing value of double/real(8) type.
* Conditional jump or move depends on uninitialised value(s): a decision in your program depends on a value (int/integer, float/real, etc.) that has not been initialized (it has no defined value).
* Access not within mapped region at address... Stack overflow: this error often appears with multithreading. It means your program overflowed its stack, i.e. you tried to allocate too much on the stack. Locate huge arrays and allocate them on the heap. Do not forget that in multithreaded implementations such as OpenMP, each thread keeps its own copies of private variables on a stack of limited size; with large private arrays or too many threads, the stack will not be enough.
Some last remarks on valgrind:
To use it in parallel with MPI:
mpirun -np 4 valgrind ./myprog.exe
Note that valgrind may display many identical errors even when there is only one bug (because the faulty code may be executed many times). Find the first error and use its message as a starting point.
But beware! Some libraries (MPI libraries, for example) also contain bugs, often at startup, and valgrind will report them. I strongly suggest adding a print statement at the very beginning of your code (as its first line) and then, when analyzing the valgrind output, ignoring the errors reported before this print.
===== Floating Point exception =====
==== Tracing in Fortran ====
Example :
program myprog
implicit none
real(8) :: d1,d2,d3
d2 = 10.0d0
d3 = 0.0d0
d1 = d2 / d3
end program myprog
~$ gfortran -g -fbacktrace -ffpe-trap=zero,underflow,overflow,invalid myprog.f90
~$ ./a.out
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7FC0FE8877D7
#1 0x7FC0FE887DDE
#2 0x7FC0FE4DEC2F
#3 0x4006DD in myprog at myprog.f90:7
Floating point exception (core dumped)
Bug is at line 7.
==== Tracing in C ====
Example :
#define _GNU_SOURCE /* feenableexcept is a GNU extension */
#include <fenv.h>
int main(int argc, char **argv)
{
feenableexcept(FE_DIVBYZERO| FE_INVALID|FE_OVERFLOW);
double d1,d2,d3;
d2 = 10.0;
d3 = 0.0;
d1 = d2 / d3;
}
~$ gcc -g myfile.c -lm
~$ ./a.out
Floating point exception (core dumped)
~$ gdb a.out
(gdb) run
Starting program: /home/spehn/a.out
Program received signal SIGFPE, Arithmetic exception.
0x0000000000400637 in main (argc=1, argv=0x7fffffffdf78) at myfile.c:9
9 d1 = d2 / d3;
(gdb)
Bug is at line 9.
===== Uninitialized variables =====
==== Tracing in Fortran ====
Example :
program myprog
implicit none
real(8) :: d1,d2
d1 = d2*10.0d0
end program myprog
~$ gfortran -Wuninitialized -g -fbacktrace myprog.f90
myprog.f90: In function ‘myprog’:
myprog.f90:6:0: warning: ‘d2’ is used uninitialized in this function [-Wuninitialized]
d1 = d2*10.0d0
^
~$ ifort -fpp -Duninitstatic myprog.f90 -g -check all
~$ ./a.out
forrtl: severe (193): Run-Time Check Failure. The variable 'myprog_$D2' is being used without being defined
Image PC Routine Line Source
a.out 0000000000402336 Unknown Unknown Unknown
libc.so.6 00007F3785537EC5 Unknown Unknown Unknown
a.out 0000000000402229 Unknown Unknown Unknown
The error comes from variable D2. Adding -traceback would provide line information.
Another example:
program myprog
real(8), allocatable, dimension(:) :: d1,d2
allocate(d1(1:10), d2(1:10))
d1(3) = d2(4)*10.0d0
print *,d1(3),d2(4)
deallocate(d1, d2)
end program myprog
~$ ifort myprog.f90 -g -traceback
~$ valgrind --track-origins=yes ./a.out
[...]
==21655== Conditional jump or move depends on uninitialised value(s)
==21655== at 0x448595: cvt_ieee_t_to_text_ex (in /home/sphen/Downloads/a.out)
==21655== by 0x426F22: for__format_value (in /home/sphen/Downloads/a.out)
==21655== by 0x40AD5A: for_write_seq_lis_xmit (in /home/sphen/Downloads/a.out)
==21655== by 0x4025C6: MAIN__ (myprog.f90:7)
==21655== by 0x402335: main (in /home/sphen/Downloads/a.out)
==21655== Uninitialised value was created by a heap allocation
==21655== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21655== by 0x406518: for_alloc_allocatable (in /home/sphen/Downloads/a.out)
==21655== by 0x4024C5: MAIN__ (myprog.f90:5)
==21655== by 0x402335: main (in /home/sphen/Downloads/a.out)
[...]
The uninitialized value is used at line 7 (the print statement), and it was created by the allocation at line 5.