Memory reclaiming in Python

Running GC-based languages on embedded systems always makes it a challenge to limit the physical memory used by the processes. Python is a good example of what can happen with long-running processes and which problems you may face. Let me show my research and the way I fixed the memory consumption.

First, a trivial example that is often used on the Internet, and which is actually wrong because it doesn't show the problem:

import gc
import os
iterations = 1000000
pid = os.getpid()


def rss():
    with open('/proc/%d/status' % pid, 'r') as f:
        for line in f:
            if 'VmRSS' in line:
                return line


def main():
    print 'Before allocating ', rss(),

    l = []
    for i in xrange(iterations):
        l.append({})

    print 'After allocating  ', rss(),

    # Ignore optimizations, just try to free whatever possible

    # First kill
    for i in xrange(iterations):
        l[i] = None

    # Second kill
    l = None

    # Control shot
    gc.collect()

    print 'After free        ', rss(),

if __name__ == '__main__':
    main()


Running it shows that everything is fine (here and below I use Python 2.6.8):


Before allocating  VmRSS:    3344 kB
After allocating   VmRSS:  149216 kB
After free         VmRSS:    4748 kB


But let's use a dictionary object instead of the list now:


import gc
import os
iterations = 1000000
pid = os.getpid()


def rss():
    with open('/proc/%d/status' % pid, 'r') as f:
        for line in f:
            if 'VmRSS' in line:
                return line


def main():
    print 'Before allocating ', rss(),

    l = {}
    for i in xrange(iterations):
        l[i] = {}

    print 'After allocating  ', rss(),

    # Ignore optimizations, just try to free whatever possible

    # First kill
    for i in xrange(iterations):
        l[i] = None

    # Second kill
    l.clear()

    # Third kill
    l = None

    # Control shot
    gc.collect()

    print 'After free        ', rss(),

if __name__ == '__main__':
    main()


Let's run it:


Before allocating  VmRSS:    3348 kB
After allocating   VmRSS:  179800 kB
After free         VmRSS:  155300 kB



That doesn't look good, does it? Obviously Python handles dictionaries in a different way, and unfortunately that's not good news for us.

The first guess is that Python uses PyMalloc, which doesn't free the memory but reuses it later. That's fine for desktop/server systems, but not so good for embedded systems, because other processes need memory too. Please notice that an operating system can behave differently on embedded systems and might not send special signals to the processes to reclaim the memory (as in my case). Also, the free-list memory pool for integers and floats is never given back to the operating system at all.

The second attempt is to recompile Python without PyMalloc:

$ ./configure --without-pymalloc
$ make -sj4
$ ./python test.py

Before allocating  VmRSS:    3304 kB
After allocating   VmRSS:  180112 kB
After free         VmRSS:  155748 kB


Mostly the same numbers, so it looks like PyMalloc has nothing to do with it. And that's actually true: this is not PyMalloc behaviour, but glibc's. See the bug: http://bugs.python.org/issue11849 In a few words, if the process allocates a lot of small objects, glibc uses a different approach, and to force it to release the memory pool one should use malloc_trim(). The patch has been applied to Python 3.3 to improve the situation (the sample code should be slightly modified to be compatible with Py3k, I skipped it here):

$ python3.3 test.py

Before allocating  VmRSS:    4780 kB
After allocating   VmRSS:  193776 kB
After free         VmRSS:   83288 kB
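The Py3k port of the dictionary test is not shown in this post. A minimal sketch of what it might look like is below; the function names rss_kb(), measure() and main() are my own and are only assumptions about how one could structure it. It keeps the same three "kills" and the same /proc-based measurement, and it returns None for the RSS on systems without /proc.

```python
import gc
import os


def rss_kb():
    """Return the resident set size in kB, or None if /proc is unavailable."""
    try:
        with open('/proc/%d/status' % os.getpid()) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
    except (IOError, OSError):
        pass
    return None


def measure(iterations=1000000):
    """Allocate a dict of empty dicts, free it, and return three RSS readings."""
    before = rss_kb()

    d = {}
    for i in range(iterations):      # range() replaces xrange() in Python 3
        d[i] = {}
    allocated = rss_kb()

    for i in range(iterations):      # first kill
        d[i] = None
    d.clear()                        # second kill
    d = None                         # third kill
    gc.collect()                     # control shot

    return before, allocated, rss_kb()


def main():
    labels = ('Before allocating', 'After allocating', 'After free')
    for label, value in zip(labels, measure()):
        # print is a function in Python 3
        print('%-18s VmRSS: %s kB' % (label, value))
```

Adding a call to main() at the end of the file prints the same three lines as the Python 2 script; the exact numbers will depend on the interpreter version and the C library.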

But as the Python 3.3 numbers show, the problem still exists. Calling malloc_trim() manually in the Python memory allocator doesn't sound like a good solution; however, another solution can be applied: replace the glibc memory allocator with a third-party one. In my case it's jemalloc:


$ sudo apt-get install libjemalloc1
$ LD_PRELOAD=/usr/local/lib/libjemalloc.so ./python test.py 
Before allocating  VmRSS:    3692 kB
After allocating   VmRSS:  197780 kB
After free         VmRSS:    3984 kB
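When a preload path is wrong, the script may simply keep running with the default allocator, so it can be worth confirming from inside the process that the library really got mapped. Here is a small sketch of such a check, assuming a Linux /proc filesystem; the helper name loaded_libraries() is hypothetical and not part of any library.

```python
def loaded_libraries(substring, maps_path='/proc/self/maps'):
    """Return the sorted paths of mapped files whose path contains substring."""
    found = set()
    try:
        with open(maps_path) as f:
            for line in f:
                # Fields: address perms offset dev inode [pathname]
                fields = line.split(None, 5)
                if len(fields) == 6 and substring in fields[5]:
                    found.add(fields[5].strip())
    except (IOError, OSError):
        # No /proc here (non-Linux system): report nothing found
        pass
    return sorted(found)
```

Calling loaded_libraries('jemalloc') at the top of test.py should return a non-empty list when the preload worked, and an empty list otherwise.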



Presto, problem solved! It won't be so easy in certain environments, especially when the allocator is built with a prefixed API, but it's a start. Also, a custom memory allocator can be applied to other processes, not only Python, and eventually it can save a lot of memory in your embedded system.
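If preloading another allocator is not an option, the malloc_trim() call mentioned earlier can at least be tried from the script itself through ctypes. This is only a sketch under the assumption that the process runs on glibc; the wrapper name trim_heap() is my own, and on C libraries without malloc_trim() it just returns None.

```python
import ctypes
import ctypes.util


def trim_heap(pad=0):
    """Ask glibc to return free heap memory to the operating system.

    Returns 1 if some memory was released, 0 if not, and None when the
    C library cannot be loaded or has no malloc_trim() (non-glibc systems).
    """
    # Assumption: fall back to the usual glibc soname if lookup fails
    name = ctypes.util.find_library('c') or 'libc.so.6'
    try:
        libc = ctypes.CDLL(name)
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return malloc_trim(pad)
```

A call such as trim_heap() right after gc.collect() in the test script would show whether trimming alone brings VmRSS down on a given system; I have not measured that here.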

Comments

  1. Bravo! Very interesting.

    I ran your second snippet, minus the first and second kills, on Python 2.7.3 with libjemalloc and got the same results. This probably doesn't surprise you, but I thought I'd mention it.
