One of the best moments in my professional career, from a pure personal perspective, came about 6 weeks ago when I was able to find out the cause of a memory leak one of our programs was suffering, and it turned out to be a problem in a standard library function completely unrelated to memory management. I am very proud of that moment because it took me a lot of time to find the problem and I had to apply a good amount of knowledge I did not expect to apply as a professional. When I could finally prove the memory leak was in the library function, using a short (20 lines) demonstration program, I felt simply happy. The drama was over.
Our program has soft real-time requirements and is written mostly in Ada, with a few calls to C library functions via pragmas. Due to the nature of the program, it avoids memory allocation as frequently as possible and uses static-sized arrays for its data structures, and O(1) algorithms whenever possible to behave properly in the context it is being used.
I will completely skip the part of the story dealing with analysis of the source code in vain to find where the program was leaking memory, but I can tell you it took a lot of time and did not give any positive results, making us quite angry and desperate. We would have been unable to find the problem this way. As I said before, the leak was in a standard library function from an expensive software development kit for Ada, and had nothing to do with memory allocation functions. To be more precise, the memory leak was in Text_IO.Reset, a procedure that resets the state of a text file, very similar to rewind in C. I will head to the final steps, what I considered interesting.
The program runs on Solaris, so we monitored the process using pmap. This gave us precise information and told us clearly that the memory region that was growing was the heap, where memory allocation happens. I thought that, if our program was barely doing any memory allocation operations, and normally it should be doing none, according to the code, we had a good chance of catching it leaking memory. When a program in Unix needs more memory, it calls either mmap, brk or sbrk. I could not come up with more system calls that allocated memory. Normally, when you program in C or C++ you use malloc, free, new and delete. These language operators or library functions in turn manage memory blocks but request more memory to the operating system with the previous system calls I mentioned. It is explained in many books and tutorials over the Internet but I would say, and maybe I am wrong, that it is not exactly common knowledge.
My first approach, which did not work, involved creating a shared library that I would load using LD_PRELOAD, which intercepted calls to sbrk, brk and mmap. When intercepted, it would call pstack on the current process (a program that prints the call stack of any process given its pid), save the call stack to a text file and proceed with the normal system call. Hackish and clever, I thought while laughing like a maniac when I was coding that. Well, that did not work, I repeat. While the program was indeed calling sbrk as confirmed by truss (for Linux users, truss is very similar to strace), it was not calling, apparently, any function called sbrk that I could intercept. I created a test program to see if my library worked and it did, but it did not create any stack trace for the program in question.
Still, I had already started using truss to verify the program was allocating memory with sbrk, so I dived into the truss manual to see if I could use it for something else. This way I discovered that truss was able to stop the program execution when it made any system call I specified. My new approach was, then, tracing the program with truss, stopping in sbrk, then calling pstack on the PID and then telling the program to continue running. This almost worked. The printed stack did not have any symbols, probably due to the Ada compiler not populating the executable file with the debugging information as the C compiler did. So close yet so far. Our programs were indeed compiled with debugging information, and a minor change to the strategy was enough. Instead of printing the stack with pstack, I would attach the Ada debugger to the program and print the call stack. This way, I finally witnessed the program leaking memory in what seemed to be a call to Text_IO.Reset.
I thought this could be wrong, so I created a test program that read a file over and over again, calling Text_IO.Reset when reaching EOF. The test program, indeed, leaked memory at an alarming rate. Case closed, smile and surprise in my face. Well, to be honest, we replaced the calls to Text_IO.Reset with something else and tested again, to confirm the program had stopped leaking memory. But I already knew the problem had been found after running the test program.
When I came home I wondered if I could have done something similar in my Linux system. I read the man page for strace to see if it could stop programs when a specific system call was made, but I found no way of doing so. Apparently, the solution and the strategy I had employed was Solaris (or maybe UNIX) specific. Several Google searches did not give me any clue about doing the same in Linux.
Yesterday, however, GDB 7.0 was released. I took a look at the new features and I found this little sentence at the end of it:
* New command to stop execution when a system call is made
According to the documentation, you only need to set a cathpoint for the program like catch syscall sbrk to achieve the same. Two months ago, I would have read the features and forgotten them five minutes later. But, yesterday, I again smiled, then laughed like a maniac and shouted “BEGONE, MEMORY LEAKS!!!”.