
Progress on the Gilectomy


By Jake Edge
May 24, 2017
Python Language Summit

At the 2016 Python Language Summit, Larry Hastings introduced Gilectomy, his project to remove the global interpreter lock (GIL) from CPython. The GIL serializes access to the Python interpreter, so it severely limits the performance of multi-threaded Python programs. At the 2017 summit, Hastings was back to update attendees on the progress he has made and where Gilectomy is headed.

He started out by stating his goal for the project. He wants to be able to run existing multi-threaded Python programs on multiple cores. He wants to break as little of the existing C API as possible. And he will have achieved his goal if those programs run faster than they do with CPython and the GIL—as measured by wall time. To that end, he has done work in four areas over the last year.

He noted that "benchmarks are impossible" by putting up a slide that showed the different CPU frequencies that he collected from his system. The Intel Xeon system he is using is constantly adjusting how fast the cores run for power and heat considerations, which makes it difficult to get reliable numbers. An attendee suggested he look into CPU frequency pinning on Linux.

Atomicity and reference counts

CPU cores have a little bus that runs between them, which is used for atomic updates among other things, he said. Gilectomy's garbage collection updates its reference counts with atomic increment and decrement instructions at a high rate, which creates a performance bottleneck because of the inter-core traffic needed to keep the caches consistent.
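
To make the contention concrete, here is a minimal standalone sketch, not code from Gilectomy or CPython, of several threads doing nothing but atomic increments and decrements on one shared count; the thread and iteration counts are arbitrary. On a multi-core machine, the contended cache line makes this run far slower per thread than the single-threaded equivalent.

    /* Minimal sketch (not Gilectomy code): several threads hammering one
     * shared "reference count" with atomic increments and decrements.
     * Every atomic operation on the same cache line forces inter-core
     * traffic, which is the bottleneck described above. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define NITERS   1000000

    static atomic_long refcnt = 1;   /* stand-in for an object's ob_refcnt */

    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < NITERS; i++) {
            atomic_fetch_add_explicit(&refcnt, 1, memory_order_relaxed);
            atomic_fetch_sub_explicit(&refcnt, 1, memory_order_relaxed);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        printf("final count: %ld\n", (long)atomic_load(&refcnt));
        return 0;
    }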

[Larry Hastings]

So he looked for another mechanism that could maintain the reference counts without all of that overhead. He consulted The Garbage Collection Handbook, which has a section on "buffered reference counting". The idea is to push all of the reference-count updating to a dedicated commit thread, which is the only entity that can look at or change the reference counts. The other threads write their reference-count changes to a log; the commit thread reads the log and applies those changes to the counts.

That works, but there is contention for the log between the threads. So he added a log per thread, but that means there is an ordering problem between operations on the same reference count. It turns out that three of the four possible orderings can be swapped without affecting the outcome, but an increment followed by a decrement needs to be done in order. If the decrement is processed first, it could reduce the count to zero, which might result in the object being garbage collected even though there should still be a valid reference.

He solved that with separate increment and decrement logs; the decrement log is only processed after all of the increments. This implementation of buffered reference counting has been in Gilectomy since October and is now working well. He did some work on the Py_INCREF() and Py_DECREF() macros that are used all over the CPython code; the intent was to cache the thread-local storage (TLS) pointer and reuse it over multiple calls, rather than looking it up for each call.
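
The log discipline can be illustrated with a rough, single-threaded sketch; the structures and function names below are invented for illustration and are not Hastings's implementation. Worker code appends to separate increment and decrement buffers, and the commit pass drains every buffered increment before any buffered decrement, so an "incref then decref" pair can never transiently drop a live object's count to zero.

    /* Sketch only (not Gilectomy code): per-thread increment and
     * decrement logs, drained by a commit pass that applies all
     * increments before any decrements. */
    #include <stdio.h>

    #define LOG_SIZE 1024

    struct object {
        long refcnt;              /* only the commit pass touches this */
        const char *name;
    };

    struct refcount_log {
        struct object *incs[LOG_SIZE];
        struct object *decs[LOG_SIZE];
        int ninc, ndec;
    };

    static void log_incref(struct refcount_log *log, struct object *ob)
    {
        log->incs[log->ninc++] = ob;      /* no atomics, no locks */
    }

    static void log_decref(struct refcount_log *log, struct object *ob)
    {
        log->decs[log->ndec++] = ob;
    }

    /* Commit pass: increments first, then decrements, so a buffered
     * incref/decref pair cannot free a live object. */
    static void commit(struct refcount_log *log)
    {
        for (int i = 0; i < log->ninc; i++)
            log->incs[i]->refcnt++;
        for (int i = 0; i < log->ndec; i++) {
            struct object *ob = log->decs[i];
            if (--ob->refcnt == 0)
                printf("freeing %s\n", ob->name);
        }
        log->ninc = log->ndec = 0;
    }

    int main(void)
    {
        struct object ob = { .refcnt = 1, .name = "ob" };
        struct refcount_log log = { .ninc = 0, .ndec = 0 };

        /* The order-sensitive case: an incref followed by a decref.
         * Because increments are drained first, the count never
         * transiently reaches zero. */
        log_incref(&log, &ob);
        log_decref(&log, &ob);
        commit(&log);

        printf("refcnt is now %ld\n", ob.refcnt);
        return 0;
    }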

Buffered reference counts have a weakness: they cannot provide realtime reference counts. It could be as long as a second or two before the reference count actually has the right value. That's fine for most code in Gilectomy, because that code cannot look at the counts directly.

But there are places that need realtime reference counts, the weakref module in particular. Weak references do not increment the reference count but can be used to reference an object (e.g. for a cache) until it is garbage collected because it has no strong references. Hastings tried to use a separate reference count to support weakref, but isn't sure that will work. Mark Shannon may have convinced him that resurrecting objects in __del__() methods will not work under that scheme; it may be a fundamental limitation that might kill Gilectomy, Hastings said.

More performance

Along the way, he came to the conclusion that the object allocation routines in obmalloc.c were too slow. The object allocation scheme has different classes for different sizes of objects, so he added per-class locks. When that was insufficient, he added two kinds of locks: a "fast" lock for when an object exists on the free list and a "heavy" lock when the allocation routines need to go back to the pool for more memory. He also added per-thread, per-class free lists. As part of that work, he added a fair amount of statistics-gathering code but went to some lengths to ensure that it had no performance impact when it was disabled.
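
The shape of that allocator scheme, reduced to a single size class, is sketched below; the names and structure are invented for illustration and do not come from obmalloc.c. The per-thread free list needs no locking at all, the shared free list sits behind the "fast" lock, and the "heavy" lock is taken only when a fresh chunk has to be carved out of newly allocated memory.

    /* Sketch only (invented names, not obmalloc.c): one size class with
     * a lock-free per-thread free list, a shared free list under a
     * "fast" lock, and a "heavy" lock for going back for more memory.
     * Compile with C11 (for _Thread_local) and -pthread. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE   64
    #define CHUNK_BLOCKS 256

    struct block { struct block *next; };

    static _Thread_local struct block *thread_free;  /* no lock needed */

    static struct block *shared_free;                /* guarded by fast_lock */
    static pthread_mutex_t fast_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t heavy_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Heavy path: carve a new chunk into blocks and push them onto the
     * shared free list (the chunk is never returned in this sketch). */
    static void refill(void)
    {
        char *chunk = malloc((size_t)BLOCK_SIZE * CHUNK_BLOCKS);
        if (!chunk)
            abort();
        pthread_mutex_lock(&fast_lock);
        for (int i = 0; i < CHUNK_BLOCKS; i++) {
            struct block *b = (struct block *)(chunk + i * BLOCK_SIZE);
            b->next = shared_free;
            shared_free = b;
        }
        pthread_mutex_unlock(&fast_lock);
    }

    static void *class_alloc(void)
    {
        /* Fastest path: the calling thread's own free list. */
        if (thread_free) {
            struct block *b = thread_free;
            thread_free = b->next;
            return b;
        }
        for (;;) {
            /* "Fast" lock: pop a block from the shared free list. */
            pthread_mutex_lock(&fast_lock);
            struct block *b = shared_free;
            if (b)
                shared_free = b->next;
            pthread_mutex_unlock(&fast_lock);
            if (b)
                return b;
            /* "Heavy" lock: go back to the system for more memory,
             * then retry the fast path. */
            pthread_mutex_lock(&heavy_lock);
            refill();
            pthread_mutex_unlock(&heavy_lock);
        }
    }

    static void class_free(void *p)
    {
        /* Freed blocks go back onto the caller's per-thread list. */
        struct block *b = p;
        b->next = thread_free;
        thread_free = b;
    }

    int main(void)
    {
        void *p = class_alloc();
        printf("allocated %p\n", p);
        class_free(p);
        return 0;
    }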

There are lots of places where things are pulled out of TLS; profiling the code showed 370 million calls to get the TLS pointer over a seven-to-eight-second run of his benchmark. To minimize that, he has added parameters to pass the TLS pointer down into the guts of the interpreter.

An attendee asked if it made sense to do that for the CPython mainline, but Hastings pointed out that those calls come from code he has added; CPython with a GIL does not suffer that performance degradation. Another attendee thought that getting the TLS pointer should only require a single assembly instruction and that there is a GCC extension for doing so. Hastings said that he tried that but could not get it to work; he would be happy to have help, since it should be possible to make the lookup faster.
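
For reference, the sketch below (with invented names) contrasts the approaches being discussed: the pthreads-style lookup that a portable TLS layer provides, a C11 _Thread_local variable (the standard spelling of GCC's __thread extension) that the compiler can often lower to a single segment-relative load on x86, and simply passing the pointer down as an explicit parameter so that hot code does no lookup at all.

    /* Sketch comparing TLS access strategies; the interp_state type and
     * the function names are invented for illustration. */
    #include <pthread.h>
    #include <stdio.h>

    struct interp_state { int thread_id; };

    /* Portable pthreads-style TLS: every lookup is a library call. */
    static pthread_key_t state_key;

    static struct interp_state *get_state_portable(void)
    {
        return pthread_getspecific(state_key);
    }

    /* C11 _Thread_local (GCC's __thread): typically compiled down to a
     * segment-relative load, with no function call. */
    static _Thread_local struct interp_state *state_tls;

    static struct interp_state *get_state_fast(void)
    {
        return state_tls;
    }

    /* The pattern described above: look the pointer up once and pass it
     * down explicitly, so callees perform no TLS lookup at all. */
    static void hot_inner_code(struct interp_state *st)
    {
        printf("running as thread %d\n", st->thread_id);
    }

    int main(void)
    {
        static struct interp_state st = { .thread_id = 0 };

        pthread_key_create(&state_key, NULL);
        pthread_setspecific(state_key, &st);
        state_tls = &st;

        hot_inner_code(get_state_portable());  /* library-call lookup */
        hot_inner_code(get_state_fast());      /* compiler-assisted TLS */
        hot_inner_code(&st);                   /* explicit parameter */
        return 0;
    }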

The benchmark that he always uses is a "really bad recursive Fibonacci". He showed graphs of how various versions of Gilectomy fare versus CPython. Gilectomy is getting better, but is still well shy of CPython speed in terms of CPU time. But that is not what he is shooting for; when looking at wall time, the latest incarnation of Gilectomy is getting quite close to CPython's graph line. The "next breakthrough" may show Gilectomy as faster than CPython, he said.

Next breakthrough

He has some ideas for ways to get that next breakthrough. For one, he could go to a fully per-thread object-allocation scheme. Thomas Wouters suggested looking at Thread-Caching Malloc (TCMalloc), but Hastings was a bit skeptical. The small-block allocator in Python is well tuned for the language, he said. Wouters replied that tests have been done showing TCMalloc to be no worse than Python's existing allocator, while fragmenting memory less and behaving better with multiple threads. Hastings concluded that TCMalloc was "worth considering" going forward.

He is thinking that storing the reference count separately from the object might improve performance. Changing object locking might also help, since most objects never leave the thread they are created in. Objects could be "pre-locked" to the thread that created them, with a mechanism for threads to register their interest in other threads' objects.
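
As a back-of-the-envelope illustration of the "pre-locked" idea (purely a sketch of the proposal, with invented names, and not anything in the Gilectomy tree), an object could record the thread that created it, let accesses from that thread skip locking, and fall back to a real lock for access from anywhere else. A real implementation would also need the registration mechanism Hastings mentioned, so that the owning thread starts taking the lock once another thread has registered interest in the object.

    /* Sketch of an owner-thread fast path (invented for illustration,
     * not Gilectomy code).  Note that the fast path is only safe while
     * the object has not been shared; the "register interest" step
     * mentioned above is what would make the owner start locking too. */
    #include <pthread.h>
    #include <stdio.h>

    struct obj {
        pthread_t owner;          /* the thread the object was created in */
        pthread_mutex_t lock;     /* only used for cross-thread access */
        long value;
    };

    static void obj_init(struct obj *ob)
    {
        ob->owner = pthread_self();
        pthread_mutex_init(&ob->lock, NULL);
        ob->value = 0;
    }

    static void obj_update(struct obj *ob, long delta)
    {
        if (pthread_equal(ob->owner, pthread_self())) {
            /* Fast path: the object never left its creating thread. */
            ob->value += delta;
        } else {
            /* Slow path: a foreign thread takes the object's lock. */
            pthread_mutex_lock(&ob->lock);
            ob->value += delta;
            pthread_mutex_unlock(&ob->lock);
        }
    }

    int main(void)
    {
        struct obj ob;
        obj_init(&ob);
        obj_update(&ob, 1);       /* same-thread access: no lock taken */
        printf("value = %ld\n", ob.value);
        return 0;
    }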

The handbook where he found buffered reference counting actually says little about reference counting in general; it is mostly focused on tracing garbage collection. So one thought he has had is to do a "crazy rewrite" of the Python garbage collector. That would be a major pain and would break the C API, but he has ideas on how to fix that as well.

Guido van Rossum thought that working on a GIL-less Python and C API would be much easier in PyPy (which has no GIL), rather than CPython. Hastings said that he thought having a multi-threaded Python would be easier to do using CPython. Much of the breakage in the C API comes simply from adding multi-threading into the mix at all; if you want multi-core performance, those things are going to have to be fixed no matter what.

But Van Rossum is concerned that all of the C-based Python extensions will be broken in Gilectomy. Hastings thinks that overstates the problem and has some ideas on how to reduce the breakage. Someone had suggested only allowing one thread into a C extension at a time (a limited GIL, in effect), which might help.

The adoption of PyPy "has not been swift", Hastings said; he thinks that, since CPython is the reference implementation of Python, it will be the winner. He does not know how far he can take Gilectomy, but he is sticking with it; he asked Van Rossum to "let me know if you switch to PyPy". Van Rossum said that he is happy with CPython as it is. Wouters, on the other hand, pointed out one good reason to keep experimenting with CPython: because the implementation is the one the core developers already know well, they will be able to offer thoughts and suggestions.

Hastings also gave a talk about Gilectomy status a few days later at PyCon; a YouTube video is available for those interested.

[I would like to thank the Linux Foundation for travel assistance to Portland for the summit.]

Index entries for this article
Conference: Python Language Summit/2017



Progress on the Gilectomy

Posted May 24, 2017 22:12 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Guido van Rossum thought that working on a GIL-less Python and C API would be much easier in PyPy (which has no GIL)
Whut?

PyPy most definitely has a GIL: http://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-g...

It's implemented a bit differently from CPython, but it's there. There was a project to add STM to PyPy but it went nowhere.

Progress on the Gilectomy

Posted May 25, 2017 0:48 UTC (Thu) by jake (editor, #205) [Link]

> PyPy most definitely has a GIL

so i see ... i must have misunderstood Guido's point somehow ... thanks for the correction ...

jake

Progress on the Gilectomy

Posted May 25, 2017 18:25 UTC (Thu) by lhastings (guest, #66451) [Link]

PyPy definitely has a GIL, which IIUC is used for stop-the-world tracing garbage collection cycles.

I couldn't tell you what I said in the rush of the moment, and I probably phrased it badly, so let me clarify. I don't think it'd be easier for CPython to get rid of its GIL per se. I think it'll be easier for CPython to get rid of its GIL *and still run Python extensions written in C*. It's easier for PyPy to get rid of its GIL, but it's enormously harder for them to run Python extensions written in C. If the end goal is "have a Python interpreter without a GIL that runs C extensions", then yes *that* is easier to achieve for CPython than it is for PyPy.

Progress on the Gilectomy

Posted May 25, 2017 16:45 UTC (Thu) by samlh (subscriber, #56788) [Link]

Guido was probably referring to the fact that PyPy has a tracing garbage collector already.

PALLOC2

Posted May 25, 2017 10:38 UTC (Thu) by linuxrocks123 (subscriber, #34648) [Link]

In addition to TCMalloc, he also might want to look into PALLOC2, since part of its goal is good performance across threads. I posted the code for it here: https://github.com/linuxrocks123/palloc2

I did tests with Python when I was writing PALLOC2. I remember seeing modest gains.

PALLOC2

Posted May 25, 2017 11:25 UTC (Thu) by linuxrocks123 (subscriber, #34648) [Link]

I'm also working on a Boehm GC-like garbage collected version of PALLOC2 this summer. He may also be interested in that, because of the reference counting issues. I sent an email to (ignore narf) larry narf at narf hastings narf dot narf org narf, but I'm commenting here in case he doesn't check that address but does check this article.

Progress on the Gilectomy

Posted May 25, 2017 13:54 UTC (Thu) by Sesse (subscriber, #53779) [Link]

If you have “calls” to get a TLS pointer, you're doing something wrong. At least on x86, it's a simple move from memory (offset by the FS or GS segment, depending on your platform); no need to call out to pthreads. Both modern C and C++ have a “thread_local” keyword to help you here.

Also, having multithreaded Fibonacci as the only benchmark? And going a year down an optimization project not knowing you can turn off frequency scaling to get more reliable microbenchmarks? Seriously?

Progress on the Gilectomy

Posted May 25, 2017 18:33 UTC (Thu) by lhastings (guest, #66451) [Link]

I used the existing CPython TLS library, which on Linux uses pthread system calls. Yes, there are non-standard extensions for both the compiler and the platform that, used together, allow TLS lookups to be only as costly as an extra pointer dereference. But I couldn't get it to play well with Python's build or something and gave up after a couple frustrating hours. It's worth revisiting in the future--really, CPython's abstracted TLS API should use it directly, and then I wouldn't have to touch anything.

However: IIUC there are some supported platforms where TLS is only available as an expensive function call, so it's better to cut down on the number of TLS lookups I need to do anyway, as that'll be a win on all systems.

> Also, having multithreaded Fibonacci as the only benchmark? And going a year down an optimization project not knowing you can turn off frequency scaling to get more reliable microbenchmarks? Seriously?

Yup. I'm actually a really bad person to undertake this project. It's just that I'm the only person who is both willing and can find the time to do so. CPython is essentially an all-volunteer project, and we're just people doing the best we can.

Progress on the Gilectomy

Posted May 25, 2017 19:11 UTC (Thu) by linuxrocks123 (subscriber, #34648) [Link]

> I used the existing CPython TLS library, which on Linux uses pthread system calls.

Are you _SURE_ it's actually making a system call? Just because you're calling pthreads doesn't mean glibc actually implements that function with a system call. If it's possible to do a MOV from FS/GS to implement the pthreads function, that's what glibc should be doing, because it's drastically more expensive to do a system call. If glibc is unnecessarily making a system call to do TLS on Linux, either glibc is broken, or we're all missing something.

In your position, I would verify glibc is doing something we think is stupid, and then ask the right mailing list "why is it doing this thing". Then you can either fix glibc, improving your own project and also every other one that uses thread-local storage, or you'll learn why glibc is implemented the way it is, and that will give you a deeper understanding of TLS that might affect how you proceed with Gilectomy. Either way, you win.

Progress on the Gilectomy

Posted May 25, 2017 19:56 UTC (Thu) by Sesse (subscriber, #53779) [Link]

pthreads isn't making a system call for TLS, but you pay the price for a shared library call (which includes spilling most of your state to the stack) and then potentially a two-level table lookup.

Progress on the Gilectomy

Posted May 25, 2017 19:56 UTC (Thu) by Sesse (subscriber, #53779) [Link]

> Yes, there are non-standard extensions for both the compiler and the platform that, used together, allow TLS lookups to be only as costly as an extra pointer dereference.

Non-standard? They're in standard C11.

Progress on the Gilectomy

Posted May 26, 2017 10:26 UTC (Fri) by ehiggs (subscriber, #90713) [Link]

Python is still stuck on C89 since they want to support building with Visual Studio. VS as of 2008 hadn't implemented C99[1], let alone C11, so it's 'non standard' if you constrain the definition to 'non standard for the standard being targeted'.

Microsoft claims that Visual Studio 2015 has support for C99[2] except for libraries which aren't used in C++ (e.g. tgmath.h). But again, this is C99 and not C11.

[1] https://mail.python.org/pipermail/python-dev/2008-July/08...
[2] https://msdn.microsoft.com/en-us/library/hh409293.aspx

Progress on the Gilectomy

Posted May 26, 2017 10:50 UTC (Fri) by Sesse (subscriber, #53779) [Link]

Well, let's restrict ourselves to a 28 year old language standard to support our fancy multicore boxes. :-)

(Visual Studio supports nearly all of C99 in recent versions; strangely enough, things have happened since 2008. They also support thread_local through C++11, or you can do #define thread_local __declspec(thread) and be done with it. Python already uses MSVC-specific constructs such as __declspec(dllexport).)

Progress on the Gilectomy

Posted May 26, 2017 12:25 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

The silliness is that, for Python 2.7 at least, it uses the 2008 toolchain, but uses 2013 or so project files, so you can't just get one Visual Studio and have it work. I haven't had to build Python 3 on Windows yet, but it'd be much better if Python moved to a build generator like CMake (patches exist to do so) or the like. Meson, waf, and scons get you in a bootstrapping problem, but they'd work too otherwise.

Progress on the Gilectomy

Posted May 28, 2017 5:18 UTC (Sun) by gutworth (guest, #96497) [Link]

Since Python 3.6, CPython uses C99. See https://www.python.org/dev/peps/pep-0007/#id4

Progress on the Gilectomy

Posted Jun 2, 2017 3:06 UTC (Fri) by lhastings (guest, #66451) [Link]

Sadly the Gilectomy branch was forked before the change permitting C99 code. At this point it'd be an actively bad idea to try and catch up. I am at times tempted to modify the Gilectomy build process so GCC will permit C99-isms, but this would mean staining my soul with the dark arts of automake.

Progress on the Gilectomy

Posted Jun 2, 2017 4:26 UTC (Fri) by gutworth (guest, #96497) [Link]

CPython doesn't use automake either.

Progress on the Gilectomy

Posted May 25, 2017 22:22 UTC (Thu) by pboddie (subscriber, #50784) [Link]

Yup. I'm actually a really bad person to undertake this project. It's just that I'm the only person who is both willing and can find the time to do so. CPython is essentially an all-volunteer project, and we're just people doing the best we can.

It's great that you are willing to spend your own time doing this work, but would it not be in the Python community's interest that such necessary work be funded? The Python Software Foundation has a fair budget to play with (noted in this article) but chooses to spend less than 10% of it - maybe not much more than 5% of it - on directly improving Python. Growing the community by funding Python events is not a bad thing, but it is all for nothing if Python itself has perceived deficiencies that cause people to use other things instead.

I'd also add that despite various companies relying on Python for their success, those companies - particularly the larger ones - always seem to hold back on contributing substantial improvements to Python, leaving various initiatives as personal or spare time projects of their employees which inevitably run out of steam because they are obviously not adequately resourced. I guess it just shows that merely entertaining corporate activity around a project like Python doesn't automatically translate to lots of benefits for the project, perhaps because it is easy for a corporate decision-maker to point at the volunteers doing all the necessary work for them for free already, and so they can justify investing nothing in the project themselves (beyond paying for a PSF sponsorship and thus buying more publicity for themselves).

And, ominously for Python, if one of those successful corporate users of Python feels that the volunteers aren't coming through with essential improvements, they can always migrate to something else. If such entities are supposed to drive Python development, there needs to be some way of getting them to invest properly in it. Otherwise, the PSF has to step into the vacuum and see that things get done regardless of whether they step on corporate toes in getting them done. But I don't see either of these things happening.

Progress on the Gilectomy

Posted May 26, 2017 1:57 UTC (Fri) by JdGordy (subscriber, #70103) [Link]

Python isn't special in this regard. This is why the Core Infrastructure Initiative was created, of course, as a reaction to Heartbleed and the obvious problem that the big companies are relying on heaps of community code without any sort of financial backing.

Progress on the Gilectomy

Posted May 26, 2017 11:17 UTC (Fri) by pboddie (subscriber, #50784) [Link]

Python is somewhat special in that it has a foundation dispensing a six-figure dollar total to worthy causes, whereas things like OpenSSL and GnuPG were apparently being sustained primarily out of their contributors' own generosity.

Progress on the Gilectomy

Posted Jun 3, 2017 3:18 UTC (Sat) by njs (subscriber, #40338) [Link]

This is an ongoing discussion in the community; in fact there's a PSF board election happening right now and several of the candidates have variants on "hey lets spend money on python development" in their platforms.

But:

- historically there's been a pretty strong split between the PSF as handling the legal/social side of things and python-dev handling technical matters, so the PSF has lots of expertise and infrastructure for supporting conferences and meetups but not as much for technical stuff. This is obviously fixable but as you know changing volunteer-run institutions is never simple.

- funding development is expensive and if you mess it up then it directly impacts people's lives. The PSF has a substantial budget relative to other similar organizations, but a "six-figure dollar total" isn't necessarily enough to pay for one developer! And the PSF has been extremely conservative about hiring, because they don't want to ever be in a position where they have to be like "oops, that sponsor dropped out so surprise, you don't get paycheck next month". (OTOH if you have an idea for something that can be done for a fixed chunk of money in the $1k-$10k range then you can totally apply for a grant; e.g. the PSF gave twisted a grant to help with their py3 porting. But I'm guessing $5k wouldn't make a difference to how much energy Larry has to put into the GILectomy work, given he already has a full time job.)

- Logistically it's not as simple as just throwing money at someone. There are community concerns about how hiring core developers could create perceptions of unfairness, drive away volunteers, etc.; everyone's heard scary stories about that time Debian tried it and it was a disaster. How do you provide oversight for the first technical employee – it's a bit problematic to ask volunteers to evaluate them in their spare time. And so forth. I think this is going away over time as things like the Django fellows program demonstrate success, but it's a thing.

- Probably moon-shot projects like the GILectomy would not be the first priority for funding anyway; more likely would be Python's packaging infrastructure, or stuff like bug/PR triage.

Progress on the Gilectomy

Posted Jun 3, 2017 20:00 UTC (Sat) by pboddie (subscriber, #50784) [Link]

This is an ongoing discussion in the community; in fact there's a PSF board election happening right now and several of the candidates have variants on "hey lets spend money on python development" in their platforms.

I'll have to take a look. I admit that I don't follow things like PSF board elections any more.

The PSF has a substantial budget relative to other similar organizations, but a "six-figure dollar total" isn't necessarily enough to pay for one developer! And the PSF has been extremely conservative about hiring, because they don't want to ever be in a position where they have to be like "oops, that sponsor dropped out so surprise, you don't get paycheck next month".

I don't think you'd ever see the PSF hire someone as a developer. Then again, the PSF does appear to have some full-time positions, some of them administrative, others related to the conference/event orientation of the organisation, and I don't know whether those positions are competitively paid or not.

(OTOH if you have an idea for something that can be done for a fixed chunk of money in the $1k-$10k range then you can totally apply for a grant; e.g. the PSF gave twisted a grant to help with their py3 porting. But I'm guessing $5k wouldn't make a difference to how much energy Larry has to put into the GILectomy work, given he already has a full time job.)

As I think I noted, grants take up a rather small proportion of the PSF's spending, and one can even argue that they only seem to be done out of necessity, such as fixing up the packaging infrastructure before something really bad happens, or to support the Python 3 migration vision that hasn't been realised all by itself. And then, as you note, you have people wanting to do the work having to fit that work in around their day job, leading up to the somewhat perverse example of various core developers being hired to "work on Python" actually not really spending their work time, or very much of it, "on Python" as such.

All of this suggests a structural problem with the way people expect stuff to get done. Which ends up being expressed as "we need more volunteers", but where those volunteers can only really nibble away at the smaller problems because the focus is on funding sprints or hackathons (or whatever) as opposed to projects, initiatives and, ultimately, people. And the solution that involves getting people hired to work on Python, thus avoiding awkward issues (see next paragraph), doesn't really seem to work out in general (see previous paragraph), or it involves other obligations like doing a doctorate which can be substantial distractions in their own right.

There are community concerns about how hiring core developers could create perceptions of unfairness, drive away volunteers, etc.

I think people are justifiably worried about the response to hiring people because there is a culture of everything having to be a zero-sum game in the Python community. Someone will gladly pipe up and say how their ultra-successful, heavily-promoted (at the expense of everything else even tangentially related) project is being undermined if the PSF opens its chequebook and it is not for their benefit. And there may actually be sponsors who don't want people to work on certain projects if those projects are in any way in competition with their products.

The PSF will have to face up to this eventually, though. The only question is whether it happens before its target audience have decided that they prefer something else.

Progress on the Gilectomy

Posted Jun 4, 2017 3:48 UTC (Sun) by njs (subscriber, #40338) [Link]

Sorry, I was trying to give some perspective on why the PSF currently works the way it does and which direction it's currently heading. I'm not really sure what point you're trying to make now; there seem to be a lot of insinuations that someone is doing something wrong and that bad things will happen if they don't shape up, but I can't quite tell what you think is wrong or what should be done instead. All I can say is that from where I sit, it looks like people doing their best to chip away at some hard problems, and that things are getting slowly better over time.

Progress on the Gilectomy

Posted Jun 4, 2017 14:20 UTC (Sun) by pboddie (subscriber, #50784) [Link]

I was going to respond to this, but given that I then found myself repeating what I had already written, and given that I am apparently just sharing "insinuations", I doubt that there is any productive dialogue to be had.

Progress on the Gilectomy

Posted Jun 4, 2017 23:35 UTC (Sun) by njs (subscriber, #40338) [Link]

I'm referring to comments like 'grants take up a rather small proportion of the PSF's spending', 'core developers being hired to "work on Python" actually not really spending their work time, or very much of it, "on Python"', 'culture of everything having to be a zero-sum game', etc. - none of this matches my experience at all. For example, the PSF spends basically all of its discretionary money on grants, and last I heard they had more money than proposals, so it's hard to blame them if the things you want to be funded aren't being funded. Major projects like static typing and the pypi rewrite are currently being supported by corporations paying employees to work on them. Etc.

Progress on the Gilectomy

Posted Jun 5, 2017 14:16 UTC (Mon) by pboddie (subscriber, #50784) [Link]

PSF grants funding Python development in 2016 amounted to 6% of the total expenditure.

I rely on public knowledge to estimate how much time core developers spend working on, not working with, Python in their day jobs. And even if some people are doing so, other evidence of people's employment translating to significant progress on critical implementation issues (see the article for an example) is rather thin on the ground. (Making other people's lives harder is, apparently, a different matter.)

But these are merely "insinuations", apparently. That's why other people at those corporations consider going their own way instead. Nothing to see here, I guess.

(And as for the zero-sum game culture, I guess you've never encountered anyone who, upon being told that you're working on something similar to them and want to share perspectives, flat out asked you why you don't just work on their project instead. Start with the passive aggression towards Python 2 at the very top and then work your way down through all the people belittling each other's projects. There's plenty to see if you want to.)

Progress on the Gilectomy

Posted May 25, 2017 19:56 UTC (Thu) by excors (subscriber, #95769) [Link]

> The benchmark that he always uses is a "really bad recursive Fibonacci".

That sounds like it might produce quite misleading results for the new reference counting approach. Atomic increment/decrement can be relatively fast when the data stays in the core's cache, it only gets really expensive when ping-ponging between multiple cores. Writing all refcount operations to a big per-thread log will always be quite expensive, since that always has to go out to RAM, and I guess you need to write at least a 64-bit pointer each time.

On my CPU (4 cores plus hyperthreading) I get about 200M atomic increments per second with one thread, or 50M/sec with 8 contending threads (i.e. 32x slower per thread), or about 700M/sec with 8 non-contending threads (i.e. approximately scaling with number of physical cores used). I can write about 800M/sec 64-bit values to RAM with one thread, and about 1200M/sec with 8 threads. (These are all just very rough figures, not accurately measured at all.)

That suggests code that is constantly touching the same object in many threads will potentially be much faster with the log approach. But for single-threaded code, or multi-threaded code that mainly uses objects in a single thread at once, the difference is much less clear (especially since the log approach will slow everything else down by thrashing the cache (unless it's designed very carefully to avoid that), can use up a significant chunk of the system's memory bandwidth, and needs another thread to read and process the whole log perhaps every few msecs (if you don't want the log to grow to hundreds of megabytes when the system is very busy)).

I'd guess most real applications will be more like the second kind with only a limited amount of shared state (and most of that protected by mutexes anyway), because the only way to preserve your sanity while writing (and debugging) multi-threaded code is to avoid shared state, and they will behave very differently to microbenchmarks specifically designed to stress the refcounting system. (Or perhaps real Python applications do a lot of refcounting of read-only global class objects and function objects and stuff? I have no idea how that really works.)

Progress on the Gilectomy

Posted Jun 2, 2017 2:55 UTC (Fri) by lhastings (guest, #66451) [Link]

> > The benchmark that he always uses is a "really bad recursive Fibonacci".
> That sounds like it might produce quite misleading results for the new reference counting approach. Atomic increment/decrement can be relatively fast when the data stays in the core's cache, it only gets really expensive when ping-ponging between multiple cores. Writing all refcount operations to a big per-thread log will always be quite expensive, since that always has to go out to RAM, and I guess you need to write at least a 64-bit pointer each time.
[...]
> That suggests code that is constantly touching the same object in many threads will potentially be much faster with the log approach. But for single-threaded code, or multi-threaded code that mainly uses objects in a single thread at once, the difference is much less clear

First, it'd be the rare Python program that didn't share objects across threads. Many built-in objects are singletons in Python: None, True, False, empty string, empty tuple, and the "small integers" (-5 through 256 inclusive). Also, modules, classes, and functions are all singletons, and they all have reference counts too. And then the constants *compiled into* those modules, classes and functions are all shared and refcounted. I expect that even a program designed around running well in multiple threads, sharing as few objects across cores as possible, will still implicitly share a lot of objects under the hood. My bad benchmark is admittedly a pathological case of object sharing, but it's not a world of difference.

Second, while I concede I don't genuinely know how atomic incr/decr works inside the CPU, obviously there must be *some* synchronization going on under the covers. The core you're running on doesn't know in advance whether or not another core is examining the memory you want to atomically incr/decr, so it *has* to synchronize somehow with the other cores. That synchronization itself seems to be expensive, and the more cores you have the more synchronization it needs to do. I assume this synchronization is a primary cause of the worse-than-linear scaling I observed when benchmarking the Gilectomy. And that's why, even with its obvious overhead, reference count logging seems to be a big performance win.

Why am I so sure it's this synchronization rather than the cache invalidation that's important? Because the cache invalidation is *still happening* with the reference count logging. The commit thread for the reference count log is continually changing the reference counts on the shared objects: the small ints, the function, the code object, the module object, etc. Those writes invalidate the cache of those objects for all the other cores. And yet I observe a big improvement win with the reference count logger. It seems to me that all the reference count log is really doing is getting rid of the synchronization overhead.

If you'd like to test your theory that atomic incr/decr isn't really so bad, you could try it with the Gilectomy. Here's how I'd do it. For a test with N cores, I'd have N modules, each with their own separate implementation of fib(). That'd ensure they weren't sharing the functions and code objects, the constants inside the code objects, or the modules. I'd then change the function so the lowest it went was 377, aka fib(14). That'd cut down on use of the small integers, though the code would still use 1 and 2, and the fib recursion level (e.g. the 14 in fib(14)) would still be using small ints. That's about as good as you're going to get with fib in Python. I predict you'll see improved scaling over my original fib benchmark but it'll still be markedly worse-than-linear, and not as gentle as with the reference count log.

Progress on the Gilectomy

Posted Jun 3, 2017 7:22 UTC (Sat) by rghetta (subscriber, #39444) [Link]

Is there some convergence with Eric Snow's separate subinterpreters work [https://lwn.net/Articles/650489/]? While the two projects approach the GIL problem from almost opposite directions, could one benefit from the work done in the other?

Progress on the Gilectomy

Posted May 27, 2017 19:25 UTC (Sat) by flussence (subscriber, #85566) [Link]

Good luck to them. I don't envy the journey they have ahead, having seen what it took Perl to get a future-proof language.

Progress on the Gilectomy

Posted May 29, 2017 9:42 UTC (Mon) by Yhg1s (subscriber, #101968) [Link]

I'd like to clarify the comment I made at the end: the reason I think Gilectomy is worthwhile, even if it ends up easier and faster to use PyPy or another implementation, is that it provides us with the experience to evolve the CPython API in ways that make it easier for any Python implementation to deal with. Experimenting with the Gilectomy has made it obvious to me how CPython's approach to, for example, object allocation, non-opaque PyObject structs, and borrowed references is painfully problematic, even though Larry has barely scratched the surface on these. Since CPython is authoritative, we can actually evolve the C API and deprecate the old mechanisms; it'll take years to do the deprecations, of course, but at least it gives extension modules the ability to work on more than just CPython.

Progress on the Gilectomy

Posted Jun 2, 2017 3:12 UTC (Fri) by lhastings (guest, #66451) [Link]

> Hastings tried to use a separate reference count to support weakref, but isn't sure that will work. Mark Shannon may have convinced him that resurrecting objects in __del__() methods will not work under that scheme; it may be a fundamental limitation that might kill Gilectomy, Hastings said.

Some clarification on this.

First: my approach for weakref works fine. I'm not sure how Jake got it in his notes that I wasn't sure. It's probably my fault, I'm sure I explained it badly!

Second, regarding resurrecting objects under __del__. The complicated problem was figuring out how, for a resurrected object, to know that we shouldn't free the object, and how we could safely call __del__ a second (and third...) time. I came up with a mildly complicated but safe / correct / workable scheme with relatively-little overhead--at the very least, a good starting point. And then! It turns out that there are new semantics for this as of Python 3.4, courtesy of PEP 442 "Safe object finalization". This guarantees that __del__ will only be called for an object exactly once. If an object is resurrected inside __del__, the second time its reference count drops to zero, the object is simply freed--__del__ isn't called again. These semantics are actually quite easy to support, particularly for the Gilectomy. Long story short, resurrection under __del__ is no problem.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds