Sunday, November 30, 2014

Make system library source code available to gdb on Ubuntu

While debugging an application crash, I worked out the following steps that allow gdb to step through the source code of system libraries:
  1. Run gdb to find out the name of the system libraries whose source code we want to step through.
  2. Install the corresponding debug package of the system library.
  3. Download the source package of the system library, and tell gdb the path to find the source code.
  4. (Optional) Store the source path setup commands in a gdb command script file, and reuse the commands in the future.
The Ubuntu package management tool apt-get is used throughout these steps.

The following two examples show how these steps are used to access the source code of the standard C and C++ libraries.

Example 1 - libc6

$ gdb program -c core
...
Program terminated with signal 11, Segmentation fault.
 0xb7347876 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0  0xb7347876 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1  0x00008090 in ?? ()
So gdb provides the useful information that the segmentation fault happened in libc at address 0xb7347876. But I would rather read source code than a memory address. After some research I found that I could install the debug package for libc:
$ sudo apt-get install libc6-dbg
With the libc debug package installed, gdb now knows the filename and line number of the code that caused the segmentation fault:
$ gdb program -c core
...
Program terminated with signal 11, Segmentation fault.
#0  free_check (mem=0x40, caller=0x40) at hooks.c:246
246 hooks.c: No such file or directory.
The next step is to help gdb find the source code. The source code of libc6 can be downloaded with the apt-get source libc6 command:
$ mkdir /var/tmp/sources/libc6/
$ cd /var/tmp/sources/libc6/
$ apt-get source libc6
$ find . -name hooks.c
./eglibc-2.15/malloc/hooks.c
So we need to tell gdb that the path to hooks.c is /var/tmp/sources/libc6/eglibc-2.15/malloc/. This is done with the dir command. After setting the source directory, gdb is able to list the source code:
$ gdb program -c core
...
Program terminated with signal 11, Segmentation fault.
#0  free_check (mem=0x40, caller=0x40) at hooks.c:246
246 hooks.c: No such file or directory.
(gdb) dir /var/tmp/sources/libc6/eglibc-2.15/malloc/
Source directories searched: /var/tmp/sources/libc6/eglibc-2.15/malloc:$cdir:$cwd
(gdb) list
241   mchunkptr p;
242 
243   if(!mem) return;
244   (void)mutex_lock(&main_arena.mutex);
245   p = mem2chunk_check(mem, NULL);
246   if(!p) {
247     (void)mutex_unlock(&main_arena.mutex);
248 
249     malloc_printerr(check_action, "free(): invalid pointer", mem);
250     return;
To spare the effort of entering the source directory every time, I place the command in a text file gdb.setup, and ask gdb to execute it before starting the interactive session with the -x option. Note that gdb executes the command file after it complains about not finding the source file, so it might look like the command didn't work; but gdb lists the source file content just fine:
$ echo dir /var/tmp/sources/libc6/eglibc-2.15/malloc/ > gdb.setup
$ gdb program -c core -x gdb.setup
...
Program terminated with signal 11, Segmentation fault.
#0  free_check (mem=0x40, caller=0x40) at hooks.c:246
246 hooks.c: No such file or directory.
(gdb) list
241   mchunkptr p;
242 
243   if(!mem) return;
244   (void)mutex_lock(&main_arena.mutex);
245   p = mem2chunk_check(mem, NULL);
246   if(!p) {
247     (void)mutex_unlock(&main_arena.mutex);
248 
249     malloc_printerr(check_action, "free(): invalid pointer", mem);
250     return;

Example 2 - libstdc++

$ gdb program -c core
...
(gdb) frame 3
#3  0xb752a51f in operator delete(void*) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
Next we install the libstdc++ debug package. First, find out which libstdc++ packages are installed:
$ dpkg -l | grep libstdc++
ii  libstdc++6                             4.6.3-1ubuntu5                          GNU Standard C++ Library v3
ii  libstdc++6-4.6-dev                     4.6.3-1ubuntu5                          GNU Standard C++ Library v3 (development files)
So we are looking for the source code of version 4.6.3-1ubuntu5. Next we check what is available:
$ apt-cache search libstdc++|grep 4.6
libstdc++6-4.6-dbg - GNU Standard C++ Library v3 (debugging files)
libstdc++6-4.6-dev - GNU Standard C++ Library v3 (development files)
libstdc++6-4.6-doc - GNU Standard C++ Library v3 (documentation files)
lib64stdc++6-4.6-dbg - GNU Standard C++ Library v3 (debugging files)
libhfstdc++6-4.6-dbg-armel-cross - GNU Standard C++ Library v3 (debugging files)
libsfstdc++6-4.6-dbg-armhf-cross - GNU Standard C++ Library v3 (debugging files)
libstdc++6-4.6-dbg-armel-cross - GNU Standard C++ Library v3 (debugging files)
libstdc++6-4.6-dbg-armhf-cross - GNU Standard C++ Library v3 (debugging files)
libstdc++6-4.6-dev-armel-cross - GNU Standard C++ Library v3 (development files)
libstdc++6-4.6-dev-armhf-cross - GNU Standard C++ Library v3 (development files)
libstdc++6-4.6-pic - GNU Standard C++ Library v3 (shared library subset kit)
libstdc++6-4.6-pic-armel-cross - GNU Standard C++ Library v3 (shared library subset kit)
libstdc++6-4.6-pic-armhf-cross - GNU Standard C++ Library v3 (shared library subset kit)
By matching the name of the installed libstdc++ development package against the available debug packages, we can tell the package we need is libstdc++6-4.6-dbg. So we install it:
$ sudo apt-get install libstdc++6-4.6-dbg
As a sanity check, verify that the package version is indeed 4.6.3-1ubuntu5:
$ dpkg -l|grep libstdc++6-4.6-dbg
ii  libstdc++6-4.6-dbg                     4.6.3-1ubuntu5                          GNU Standard C++ Library v3 (debugging files)
Installing the libstdc++ debug package gives gdb the filename and line number information:
$ gdb program -c core
...
(gdb) frame 4
#4  0xb751199b in deallocate (__p=0xb1f05488 "\303\001", this=) at /build/buildd/gcc-4.6-4.6.3/build/i686-linux-gnu/libstdc++-v3/include/ext/new_allocator.h:98
98 /build/buildd/gcc-4.6-4.6.3/build/i686-linux-gnu/libstdc++-v3/include/ext/new_allocator.h: No such file or directory.
Next we fetch the source package of libstdc++.
$ mkdir /var/tmp/sources/libstdc++/
$ cd /var/tmp/sources/libstdc++/
$ apt-get source libstdc++6
$ find . -name new_allocator.h
./gcc-4.6-4.6.3/gcc-4.6-4.6.3/gcc-4.6.3/libstdc++-v3/include/ext/new_allocator.h
Since the libstdc++ debug package records the source file's location in the build tree, such as /build/buildd/gcc-4.6-4.6.3/build/i686-linux-gnu/libstdc++-v3/include/ext/new_allocator.h, we can use gdb's set substitute-path command to map the build tree location to our local source tree:
$ gdb program -c core
...
(gdb) set substitute-path /build/buildd/gcc-4.6-4.6.3/build/i686-linux-gnu/libstdc++-v3/ /var/tmp/sources/libstdc++/gcc-4.6-4.6.3/gcc-4.6-4.6.3/gcc-4.6.3/libstdc++-v3/
(gdb) frame 4
#4  0xb751199b in deallocate (__p=0xb1f05488 "\303\001", this=) at /build/buildd/gcc-4.6-4.6.3/build/i686-linux-gnu/libstdc++-v3/include/ext/new_allocator.h:98
98       { ::operator delete(__p); }
(gdb) list
93       }
94 
95       // __p is not permitted to be a null pointer.
96       void
97       deallocate(pointer __p, size_type)
98       { ::operator delete(__p); }
99 
100       size_type
101       max_size() const throw() 
102       { return size_t(-1) / sizeof(_Tp); }

Sunday, August 24, 2014

Python slows down MySQL task, in a good way

Another day, another way Python helps with my job. This time, Python helps slow down a MySQL task.

We needed to delete a large number of rows from a MySQL table without interrupting the service. Without going into the details of how the approach was chosen, I was asked to repeatedly delete a small number of rows from the table, sleeping for some time between batches.

After searching the web for references, I wrapped the database operations in the MySQL class:
class MySQL:
    def __init__(self):
        self.connector = mysql.connector.connect(
            user='usr',
            password='pwd',
            host='127.0.0.1',
            port='20033',  # tunnel to server
            database='db')
        self.cursor = self.connector.cursor()

    def delete(self, num_rows):
        t1 = datetime.now()
        query = ('SELECT id FROM events WHERE source_id=3651 LIMIT {}'.
                 format(num_rows))
        self.cursor.execute(query)
        ids = tuple(chain.from_iterable(self.cursor.fetchall()))
        actual_rows = len(ids)
        if actual_rows > 0:
            query = 'DELETE FROM events WHERE id in {}'.format(
                ids if actual_rows > 1 else '({})'.format(ids[0]))
            self.cursor.execute(query)
            self.connector.commit()
        t2 = datetime.now()
        print 'delete {} rows took {} seconds'.format(
            actual_rows, (t2 - t1).total_seconds())
        return actual_rows == num_rows

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.cursor.close()
        self.connector.close()


The constructor __init__() calls mysql.connector.connect() to connect to the MySQL server. The user and password are used to log in to the database server. The host and port specify the location of the database: I forward my local port 20033 to a remote database server, so the connection is made to localhost at port 20033. This is a one-off script, so I hard-coded the parameters.

The database connection is closed in the __exit__() method, and I instantiate the class with Python's "with" statement to ensure the connection is closed at the end.
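The mechanics of the with statement can be sketched with a tiny hypothetical Resource class standing in for the connector (the names here are made up for illustration):

```python
class Resource:
    """Hypothetical stand-in for a database connection."""
    def __init__(self):
        self.closed = False

    def __enter__(self):
        # The value returned here is what "with ... as name" binds.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block ends, normally or via an exception.
        self.closed = True

with Resource() as r:
    assert not r.closed  # still open inside the block
assert r.closed          # closed automatically on exit
```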

The database deletion is done in the delete() method. A database operation is performed by passing a string containing the SQL statement to Cursor.execute(), and its result can be accessed through the cursor. The first operation queries for a set of ids. I fetch all the results and store them in a Python tuple with the following line:
ids = tuple(chain.from_iterable(self.cursor.fetchall()))
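fetchall() returns one tuple per row, so a single-column SELECT comes back as a list of 1-tuples; chain.from_iterable() flattens them into one sequence. A small sketch with made-up row values:

```python
from itertools import chain

# fetchall() on a single-column SELECT yields one 1-tuple per row
rows = [(101,), (102,), (103,)]

ids = tuple(chain.from_iterable(rows))
assert ids == (101, 102, 103)
```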

The second database operation deletes rows using the ids from the previous query. I tried the following SQL statement:
query = 'DELETE FROM events WHERE id in {}'.format(ids)
This works except when ids contains only one element. A one-element tuple carries a trailing comma to distinguish it from a plain parenthesized expression, but SQL treats that trailing comma as a syntax error. So I format a single-element tuple by hand before passing it to SQL:
actual_rows = len(ids)
...
ids if actual_rows > 1 else '({})'.format(ids[0])
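The difference is easy to see in isolation (made-up ids; this sketches only the formatting, not the full query):

```python
def in_clause(ids):
    # str() of a 1-tuple is "(7,)" -- SQL rejects the trailing comma,
    # so format the single element by hand.
    return '({})'.format(ids[0]) if len(ids) == 1 else str(ids)

assert '{}'.format((7,)) == '(7,)'       # the form SQL rejects
assert in_clause((7,)) == '(7)'
assert in_clause((7, 8, 9)) == '(7, 8, 9)'
```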

MySQL.delete() returns a boolean indicating whether a full batch was deleted, i.e. whether there may be more rows left to delete.

The main program instantiates MySQL, and repeatedly calls MySQL.delete() and sleep() until MySQL.delete() can't find any more rows to delete:

  with MySQL() as mySQL:
    more_to_delete = mySQL.delete(rows)
    time.sleep(sleep_seconds)
    while more_to_delete:
      more_to_delete = mySQL.delete(rows)
      time.sleep(sleep_seconds)

That's all coding needed. The following is everything put together:

#!/usr/bin/env python
from datetime import datetime
import mysql.connector
from itertools import chain
import time

class MySQL:
    def __init__(self):
        self.connector = mysql.connector.connect(
            user='usr',
            password='pwd',
            host='127.0.0.1',
            port='20033',  # tunnel to server
            database='db')
        self.cursor = self.connector.cursor()

    def delete(self, num_rows):
        t1 = datetime.now()
        query = ('SELECT id FROM events WHERE source_id=3651 LIMIT {}'.
                 format(num_rows))
        self.cursor.execute(query)
        ids = tuple(chain.from_iterable(self.cursor.fetchall()))
        actual_rows = len(ids)
        if actual_rows > 0:
            query = 'DELETE FROM events WHERE id in {}'.format(
                ids if actual_rows > 1 else '({})'.format(ids[0]))
            self.cursor.execute(query)
            self.connector.commit()
        t2 = datetime.now()
        print 'delete {} rows took {} seconds'.format(
            actual_rows, (t2 - t1).total_seconds())
        return actual_rows == num_rows

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self.cursor.close()
        self.connector.close()

if __name__ == '__main__':
    rows = 1000
    sleep_seconds = 5
    with MySQL() as mySQL:
        more_to_delete = mySQL.delete(rows)
        time.sleep(sleep_seconds)
        while more_to_delete:
            more_to_delete = mySQL.delete(rows)
            time.sleep(sleep_seconds)
With the 5-second sleep time we used, it took more than 24 hours to execute the script. To avoid interruption by loss of connection, the script was run in a GNU screen window. Over a weekend, the script slowly but steadily deleted over 32 million rows without service interruption.

Saturday, July 26, 2014

Python threading application

This is a story about how Python threading reduced a script's execution time by more than 20 times.

This is how it all started. Last week, somebody asked me to check the timestamp of the last file written to a specific directory on a remote machine.

Sure. I only needed to run the following:
$ ssh remote ls -lrt /mnt/data/ | tail -1
Then came the follow-up question: how about checking the whole realm?

Sure. A database query returned the hostnames contained in the realm, so I had a text file with 87 lines that looked like the following:
remote1
remote2
...
remote87
I opened the text file with vim and joined all lines into one, replacing each newline with a space:
:%s/\n/ /g
And added some code to turn this into the following bash script:
for s in remote1 remote2 ... remote87
do
    ssh $s ls -lrt /var/lib/jelli/audio-recording/ | tail -1
done
We ran the script and found some problematic hosts, fixed those hosts, and everybody went home feeling relieved.

The following day brought another request: how about monitoring the realm for a week? And other realms too?

So I created the following Python script, which reads the hostnames from a text file and checks the hosts one by one:
#!/usr/bin/env python
import argparse
import subprocess

def check_one_station(station):
    cmd = 'ssh {0} ls -alrt /var/lib/jelli/audio-recording/ | grep -v ^d | tail -1'.format(station)
    p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (out, err) = p.communicate()
    idx = err.find('cannot access')
    if idx >= 0:
        cmd = 'ssh {0} ls -alrt /var/lib/jelli/{0}/audio-recording/ | grep -v ^d | tail -1'.format(station)
        p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        (out, err) = p.communicate()
        idx = err.find('cannot access')
        if idx >= 0:
            print '###', station
        else:
            print station, out[:-1]
    else:
        print station, out[:-1]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', help='file containing station names')
    args = parser.parse_args()
    f = open(args.filename, 'r')
    line = f.read()
    stations = line.split()
    for station in stations:
        check_one_station(station)
This script gets the job done, but it takes quite some time to check 87 stations. There must be a faster way.

To reduce the execution time, I tried multithreading. I learned Python multithreading from a tutorial and modified my script to follow its example.

Right away, I found that printing to standard output from the worker threads results in interleaved text from different threads, making the output unreadable. To avoid this, I made all worker threads post their output messages to a queue; the main thread is the only thread that takes messages from the queue and prints them to standard output. The output looks good after introducing the message queue.
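The queue pattern itself is small. A minimal sketch (the worker function and message text are made up; the queue module is imported under both its Python 2 and Python 3 names so the sketch runs on either):

```python
import threading

try:
    import queue           # Python 3
except ImportError:
    import Queue as queue  # Python 2

q = queue.Queue()

def worker(n):
    # Workers never print directly; they only post to the queue.
    q.put('message from worker {}'.format(n))

workers = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()

# Only the main thread drains the queue and prints, so lines
# from different workers can never interleave.
messages = []
while not q.empty():
    messages.append(q.get())
assert len(messages) == 4
```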

Then I found that joining the worker threads in creation order is not ideal. When one worker thread takes a long time to finish, worker threads created after it might have finished already, but the main thread is blocked joining the slow thread and can't reap the younger but faster ones. So I added a 5 millisecond timeout to the join: if a worker thread can't be joined within 5 milliseconds, the main thread moves on to the next one. When the main thread reaches the end of the thread list, it starts over from the beginning and tries to join the threads that remain. It keeps sweeping until all worker threads are joined.
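The sweep-with-timeout loop can be sketched on its own, with plain Thread objects and made-up sleep times standing in for the ssh work:

```python
import threading
import time

def task(seconds):
    time.sleep(seconds)

# A slow thread created first, followed by faster ones.
pending = [threading.Thread(target=task, args=(s,))
           for s in (0.3, 0.01, 0.01, 0.01)]
for t in pending:
    t.start()

joined = 0
while pending:
    # Sweep over a copy of the list so we can remove finished
    # threads while iterating; the 5 ms timeout keeps the slow
    # thread from blocking the join of the faster ones.
    for t in pending[:]:
        t.join(0.005)
        if not t.is_alive():
            pending.remove(t)
            joined += 1
assert joined == 4
```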

The following is the script with message queue and multithreading:
#!/usr/bin/env python
import argparse
import subprocess
import threading
import Queue
import string
import time

class CheckOneStation(threading.Thread):
    _err_marker = '###'

    def __init__(self, station, q):
        super(CheckOneStation, self).__init__()
        self._station = station
        self._q = q

    def run(self):
        result = self.do_work()
        self._q.put(result)

    def do_work(self):
        cmd = 'ssh {0} ls -alrt /var/lib/jelli/{0}/audio-recording/ | grep -v ^d | tail -1'.format(self._station)
        p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        (out, err) = p.communicate()
        if self.cmd_failed(err):
            cmd = 'ssh {0} ls -alrt /var/lib/jelli/audio-recording/ | grep -v ^d | tail -1'.format(self._station)
            p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            (out, err) = p.communicate()
            if self.cmd_failed(err):
                msg = '{0}{1}'.format(self._err_marker, self._station)
            else:
                msg = '{0}{1} {2}'.format('' if check_file_time(out) else self._err_marker, self._station, out[:-1])
        else:
            msg = '{0}{1} {2}'.format('' if check_file_time(out) else self._err_marker, self._station, out[:-1])
        return msg

    def cmd_failed(self, err):
        if err.find('cannot access') >= 0:
            return True
        if err.find('Connection timed out') >= 0:
            return True
        return False

now = time.localtime()

def check_file_time(line):
    seg = string.join(line.split()[5:8])
    try:
        file_time = time.strptime(seg, '%b %d %H:%M')
        if (file_time.tm_mon == now.tm_mon and
                file_time.tm_mday == now.tm_mday and
                file_time.tm_hour == now.tm_hour and
                abs(file_time.tm_min - now.tm_min) < 5):
            return True
    except ValueError:
        pass
    return False

def print_msg(q):
    while not q.empty():
        s = q.get()
        print s

if __name__ == '__main__':
    q = Queue.Queue()
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', help='file containing station names')
    cmd_args = parser.parse_args()
    f = open(cmd_args.filename, 'r')
    line = f.read()
    stations = line.split()
    threads = []
    for station in stations:
        t = CheckOneStation(station, q)
        t.start()
        threads.append(t)
    while len(threads) > 0:
        working_set = threads[:]  # copy, so we can remove while iterating
        for t in working_set:
            t.join(0.005)
            if not t.isAlive():
                threads.remove(t)
        print_msg(q)
How much faster does the multithreaded script run? The following shows the elapsed time of the single- and multi-threaded versions:

single-threaded    multi-threaded
95.594 sec.        4.161 sec.

The speedup is 95.594/4.161 = 22.974, roughly 23 times.