malloc(3) (hopefully) set for 7.0
Jason Evans
jasone at freebsd.org
Wed Mar 28 22:36:38 UTC 2007
=== Introduction ===
Those of you who watch the cvs commit logs may have noticed several
rather significant changes to malloc over the past few weeks. I took a
fresh look at malloc after letting the details fade over the past year,
and saw some significant opportunities for improvement. I appreciate
the FreeBSD community's tolerance of these changes, which were perhaps
of questionable wisdom given that we are now in code slush. =)
Since these changes obsolete several aspects of the paper I presented
last year at BSDcan, it seems like a good idea to provide a public
summary of these changes and their rationale. The changes focus on two
main issues: 1) mapped page reduction during low memory usage phases of
application execution, and 2) fragmentation reduction.
=== 1) Mapped page reduction ===
First, here is a bit of relevant background. jemalloc breaks memory
into 1MB chunks that are aligned at multiples of the chunk size. This
makes it possible to find metadata for small and large user objects in
constant time. (For huge objects, it takes lg(n) time, where n is the
number of huge allocations.) The point is that chunks are a fundamental
aspect of what makes jemalloc tick. However, there is a downside to
managing memory in chunks: unless we use madvise(2), every page of a
chunk that is touched remains physically backed until the entire chunk
can be unmapped. Thus, even a single in-use page can cause retention of
up to 1MB of memory. This is a bit of a catch-22, since madvise(2) is
too expensive to enable by default, but if we really need it, then we
really should be using it for all applications, since paging is a
system-wide issue. There is no simple way out of this conundrum, which
is why I have been working to reduce the number of chunks that remain
mapped for programs that typically operate at well below their peak
memory usage.
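The constant-time metadata lookup mentioned above follows directly from the chunk alignment: since chunks are 1MB and aligned to 1MB boundaries, the chunk containing any pointer can be recovered with a single mask. Here is a minimal sketch of that idea; the names (chunk_base, chunk_offset) are illustrative, not jemalloc's actual identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chunks are 1MB and aligned to multiples of the chunk size, so the
 * chunk that contains any pointer is found with one mask operation.
 * Illustrative names, not jemalloc's real identifiers. */
#define CHUNK_SIZE ((uintptr_t)1 << 20)

static void *chunk_base(void *ptr) {
    /* Clear the low bits to land on the chunk boundary. */
    return (void *)((uintptr_t)ptr & ~(CHUNK_SIZE - 1));
}

static size_t chunk_offset(void *ptr) {
    /* The low bits are the offset within the chunk. */
    return (size_t)((uintptr_t)ptr & (CHUNK_SIZE - 1));
}
```

With per-chunk metadata stored at the chunk base, this mask is all it takes to go from a user pointer to its metadata, which is why chunks are so fundamental to the design.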
In order to reduce the number of mapped pages, we need some way to
preferentially pack data into certain chunks, so that others can drain.
There is a simple allocation policy that has been demonstrated in
practice to work very well: prefer low addresses. jemalloc already
did this where convenient (a lesson learned from phkmalloc), but the
recent changes push this policy throughout the allocator, almost to the
maximum extreme possible.
Each arena manages a set of chunks that are carved into runs. These
runs back small and large allocations. jemalloc now uses the lowest fit
for run allocation. Equal-sized small allocations are grouped together
into runs since it is possible to fit multiple small objects in a
single run. This means that such a run may have very low utilization,
so we try to keep runs as full as possible in order to reduce the total
number of runs in use for any given size class. Previously, jemalloc
used an overly complicated run promotion/demotion system that was
intended to place several consecutive allocation requests in the same
run, reduce the overhead of switching between runs, and try to keep the
runs moderately full. However, experiments demonstrated that run
switching overhead has at most a small impact on allocator performance,
and furthermore that the promotion/demotion algorithms were not very
effective at decreasing run switching anyway. That being the case, I
experimented with increasingly expensive run switching policies.
Finally, I tried always switching to the lowest run with any unused
space, which was the ideal layout policy given my goals. This change
had the intended effect of tending to let high chunks drain. Much to my
surprise though, there was no significant impact on runtime.
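The "lowest fit" policy described above can be sketched as follows: among all free runs large enough for the request, take the one at the lowest address, so that runs in high chunks tend to drain. jemalloc tracks runs in red-black trees; the linear scan here is only for clarity, and the struct layout is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of lowest-fit run selection.  A real allocator
 * would use an ordered tree, not a linear scan. */
struct run {
    uintptr_t addr;   /* run base address */
    size_t    npages; /* run size in pages */
};

static struct run *lowest_fit(struct run *runs, size_t nruns, size_t need) {
    struct run *best = NULL;
    for (size_t i = 0; i < nruns; i++) {
        /* Candidate must fit; among candidates, prefer lowest address. */
        if (runs[i].npages >= need &&
            (best == NULL || runs[i].addr < best->addr))
            best = &runs[i];
    }
    return best;
}
```

Note that this is a placement policy, not a best-fit policy: a lower-addressed run wins even if a higher-addressed run would leave less slack.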
I put this last experiment off for a long time, because my intuition was
that it would be far too expensive to add so many red-black tree
modifications. I think there are two contributing factors to why I was
wrong: 1) the highest run switching rate I saw for any application was
~1% of the total allocation rate (though carefully crafted code could
push that up to nearly 50%), and 2) cache locality improvements tend to
help a bit. My initial expectation was that such a layout policy would
*decrease* locality, since we have the potential to spread consecutive
allocations across a large number of independent runs. However, in the
steady state of a long-running application, we have a hard time dealing
with this problem anyway, and by packing objects as low as possible in
memory, we actually tend to do better than if we were to use non-full
runs in MRU order. Anyway, the reason I dedicate so much space to this
issue is that it still makes me uneasy, but I have been unable to find
any application for which the layout policy causes significant
performance degradation. If you think you may have found such a case,
you will be able to gain some insight by looking at the statistics
output from setting MALLOC_OPTIONS=P. Look at the 'reruns' column (# of
run switches) versus the 'requests' column in order to determine what
percentage of requests suffer the overhead of run switching. If you see
very high run switch rates for a particular application (say, >5%),
please tell me so that I can run further experiments.
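The diagnostic arithmetic is simple: divide 'reruns' by 'requests'. A tiny helper makes the threshold check explicit; the column names come from the MALLOC_OPTIONS=P statistics output described above, and the sample values below are made up.

```c
#include <stdint.h>

/* Run-switch rate as a whole percentage: reruns / requests * 100.
 * Counter widths are an assumption for illustration. */
static unsigned switch_pct(uint64_t reruns, uint64_t requests) {
    return requests == 0 ? 0 : (unsigned)(reruns * 100 / requests);
}
```

A result above 5 for a given size class would be worth reporting.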
=== 2) Fragmentation reduction ===
We want to fit user objects in memory as tightly as possible. jemalloc
was doing a substandard job in a couple of regards. First, it used a
binary buddy scheme to manage runs within chunks, meaning that runs were
constrained to be 2^n pages (with 4kB pages and 1MB chunks, n can range
from 0 to 8). This caused significant
internal fragmentation for, say, 20kB objects. To improve the
situation, I switched to using a straightforward extent-based system.
This was initially intended only as a fix for the problem just
described, but it turned out to also be useful for runs that back small
allocations. Small allocation runs can now be more precisely sized such
that the run header overhead is barely below some threshold. This
allows the bitmaps that track which small objects are in use to be as
short as possible while still keeping metadata overhead under control.
Previously, all run headers for small object runs were the same size,
regardless of how many bits were actually needed in the header bitmap to
track the run's objects. This was clearly wasteful. My memory is fuzzy
regarding the original rationale for fixed-size run headers, but I think
it had to do with the complexity of calculating run header size, number
of runs, and run size all at the same time. Indeed, this proved to be
tricky, but well worth the effort.
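The sizing problem described above is circular: run size determines object count, object count determines bitmap size, and bitmap size feeds back into overhead. One way to cut the knot is to iterate: grow the run a page at a time until header overhead drops below the threshold. This is a sketch of that approach under assumed constants (fixed header size, 1% threshold), not jemalloc's actual calculation, and for simplicity it ignores the pages consumed by the header itself.

```c
#include <stddef.h>

#define PAGE_SIZE 4096u
/* Illustrative constants: a fixed header portion plus one bitmap bit
 * per object, with total header kept below ~1% of the run. */
#define HDR_FIXED 32u
#define MAX_HDR_PCT 1u

static size_t hdr_bytes(size_t run_bytes, size_t obj_size) {
    size_t nobjs = run_bytes / obj_size;
    return HDR_FIXED + (nobjs + 7) / 8;  /* one bit per object, in bytes */
}

/* Smallest run (in pages) for which header overhead is under threshold. */
static size_t run_pages(size_t obj_size) {
    size_t pages = 1;
    while (hdr_bytes(pages * PAGE_SIZE, obj_size) * 100 >
           pages * PAGE_SIZE * MAX_HDR_PCT)
        pages++;
    return pages;
}
```

The intuition matches the text: runs of large-ish small objects need only a few bitmap bits, so one page suffices, while runs of tiny objects must be bigger to amortize their longer bitmaps.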
=== Summary ===
With these changes in place, all of my experiments indicate substantial
improvements in memory usage, as well as overall speed improvements. We
still have several months before the 7.0 release to verify that I didn't
miss anything important. Even so, I tested these changes more
thoroughly than any previous changes (excepting Kris Kennaway's testing
before jemalloc entered cvs), so I think the risk of major problems is
reasonably low.
Other than bug fixes, I do not intend to make any more changes before
7.0. I have developed some novel algorithms for essentially eliminating
thread contention on SMP systems, but it is too late in the development
cycle to introduce such changes (not to mention that I lack the hardware
to evaluate the algorithms). Thanks again for your patience and
support. Please let me know if I can be of help in diagnosing suspected
malloc issues.
Jason