Memory reclamation in Python
Running GC-based languages on embedded systems is always a challenge: you have to limit the amount of physical memory your processes take. Python is a good example of the problems you can face with long-running processes. Let me show my research and the way I fixed the memory consumption.
First, a trivial example that circulates on the Internet, and which is actually misleading because it doesn't show the problem:
import gc
import os

iterations = 1000000
pid = os.getpid()

def rss():
    # Read this process's resident set size from /proc (Linux-only)
    with open('/proc/%d/status' % pid, 'r') as f:
        for line in f:
            if 'VmRSS' in line:
                return line

def main():
    print 'Before allocating ', rss(),
    l = []
    for i in xrange(iterations):
        l.append({})
    print 'After allocating ', rss(),
    # Ignore optimizations, just try to free whatever possible
    # First kill
    for i in xrange(iterations):
        l[i] = None
    # Second kill
    l = None
    # Control shot
    gc.collect()
    print 'After free ', rss(),

if __name__ == '__main__':
    main()
Running it shows that everything is fine (here and below I use Python 2.6.8):
Before allocating VmRSS: 3344 kB
After allocating VmRSS: 149216 kB
After free VmRSS: 4748 kB
But let's use a dictionary object instead of the list now:
import gc
import os

iterations = 1000000
pid = os.getpid()

def rss():
    # Read this process's resident set size from /proc (Linux-only)
    with open('/proc/%d/status' % pid, 'r') as f:
        for line in f:
            if 'VmRSS' in line:
                return line

def main():
    print 'Before allocating ', rss(),
    l = {}
    for i in xrange(iterations):
        l[i] = {}
    print 'After allocating ', rss(),
    # Ignore optimizations, just try to free whatever possible
    # First kill
    for i in xrange(iterations):
        l[i] = None
    # Second kill
    l.clear()
    # Third kill
    l = None
    # Control shot
    gc.collect()
    print 'After free ', rss(),

if __name__ == '__main__':
    main()
Let's run it:
Before allocating VmRSS: 3348 kB
After allocating VmRSS: 179800 kB
After free VmRSS: 155300 kB
That doesn't look good, does it? Obviously Python handles dictionaries differently, and unfortunately that's not good news for us.
The first guess is PyMalloc: Python's small-object allocator doesn't return freed memory to the OS but keeps it for later reuse. That's fine on desktop and server systems, but not so good on embedded systems, where other processes need memory too. Note that an embedded operating system can behave differently and may never send a process any signal asking it to give memory back (as in my case). On top of that, CPython's free lists for integers and floats are never returned to the operating system at all.
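The free-list effect can be observed with a small variation of the same measurement. This is only a rough sketch, written to run on both Python 2 and 3; the exact numbers, and whether the "freed" figure drops back, depend heavily on the interpreter version and platform:

```python
import gc
import os

def rss_kb():
    # VmRSS in kB from /proc; returns None on non-Linux systems.
    try:
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS'):
                    return int(line.split()[1])
    except IOError:
        pass
    return None

before = rss_kb()
numbers = list(range(1000000, 2000000))  # a million distinct integers
after = rss_kb()
del numbers
gc.collect()
freed = rss_kb()
print('before=%s after=%s freed=%s' % (before, after, freed))
```

On Python 2.x the integers land on the int free list, so the "freed" figure stays close to the "after" one even though the objects are gone from the program's point of view.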
The second attempt is to recompile Python without PyMalloc:
$ ./configure --without-pymalloc
$ make -sj4
$ ./python test.py
Before allocating VmRSS: 3304 kB
After allocating VmRSS: 180112 kB
After free VmRSS: 155748 kB
Mostly the same numbers, so it looks like PyMalloc has nothing to do with it. And indeed it doesn't: this is not PyMalloc behaviour but glibc's. See the bug: http://bugs.python.org/issue11849 In short, when a process allocates a lot of small objects, glibc serves them from arenas it is reluctant to shrink, and to force it to release the pool you have to call malloc_trim(). A patch improving the situation was applied in Python 3.3 (the sample code needs slight modifications to run on Py3k; I skip them here):
$ python3.3 test.py
Before allocating VmRSS: 4780 kB
After allocating VmRSS: 193776 kB
After free VmRSS: 83288 kB
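For the curious, malloc_trim() can be invoked from a running Python process via ctypes. This is only a sketch: it is glibc-specific and degrades to a no-op on other C libraries:

```python
import ctypes
import ctypes.util

def trim_heap():
    # Ask glibc to return free heap pages to the OS.
    # Returns True if malloc_trim() was actually called, False otherwise.
    path = ctypes.util.find_library('c')
    if not path:
        return False
    libc = ctypes.CDLL(path)
    if not hasattr(libc, 'malloc_trim'):
        # Not glibc (e.g. macOS, musl) - nothing to do.
        return False
    # The argument is how much free space to leave at the top of the heap.
    libc.malloc_trim(0)
    return True
```

Calling trim_heap() after dropping the dictionaries should shrink VmRSS on glibc systems, but sprinkling such calls through application code is exactly the kind of workaround a decent allocator should make unnecessary.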
But as you can see, the problem still exists. Calling malloc_trim() manually inside the Python memory allocator doesn't sound like a good solution, but there is another one: replace the glibc allocator with a third-party one. In my case it's jemalloc:
$ sudo apt-get install libjemalloc1
$ LD_PRELOAD=/usr/local/lib/libjemalloc.so ./python test.py
Before allocating VmRSS: 3692 kB
After allocating VmRSS: 197780 kB
After free VmRSS: 3984 kB
Presto, problem solved! It won't be this easy in every environment, especially with a prefixed jemalloc API, but it's a start. A custom memory allocator can also be applied to other processes, not only Python, and eventually it can save a lot of memory on your embedded system.
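To check whether the preloaded allocator really ended up in the process, you can scan its memory maps. This is a Linux-only sketch, and the library names it looks for are just examples:

```python
def loaded_allocators():
    # Shared objects mapped into this process that look like custom allocators.
    found = set()
    try:
        with open('/proc/self/maps') as f:
            for line in f:
                parts = line.split()
                if parts and ('jemalloc' in parts[-1] or 'tcmalloc' in parts[-1]):
                    found.add(parts[-1])
    except IOError:
        pass
    return sorted(found)

print(loaded_allocators())
```

Run under LD_PRELOAD, the list should contain the jemalloc shared object; without it, the list is empty.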
Bravo! Very interesting.
I ran your second snippet, minus the first and second kills, on Python 2.7.3 with libjemalloc and got the same results. This probably doesn't surprise you, but I thought I'd mention it.