Recently Marc blogged about some performance optimisations we implemented at Stack Overflow to work around some Garbage Collection (GC) issues.

This post evoked many emotions and opinions both on his Blog and on Hacker News. In this post I would like to cover some of the history involved with this particular problem, explain how we found it and eventually fixed it.

Jeff has a stopwatch in his brain.

When Stack Exchange developers look at our sites we get a very different view. In particular, on every page we visit there is a little red box that shows us how long the page took to render, along with other goodies. We run MiniProfiler in production; performance is a feature.

One day Jeff started noticing something weird:

Once in a while I visit a page and it takes a whole second to render, but MiniProfiler is telling me it only took 30ms. What is going on?

If the problem was not showing up in local measurements, it must be a network or HAProxy problem, we deduced. We were wrong.

I responded: “try editing your hosts file, perhaps there is a DNS issue”. This fixed nothing.

In hindsight we realised this only happened 1 out of 50 or so times he visited the page, which correlates with my stopwatch theory.

But, I am jumping a bit ahead of myself here, let me tell you a bit of history first.

The tag engine of doom

I love SQL. I am pretty good at it. I know how to think in sets and pride myself in some of the creative solutions I come up with.

One day, I discovered that we had a problem that simply can not be solved efficiently in SQL. Square peg, round hole.

We allow people to cut through our list of 2.2 million questions in 6 different sort orders. That is a problem that is pretty trivial to solve in SQL. We also present users with an aggregation of related tags on each of our sorts, something that is also reasonably simple to denormalize and cache in SQL.

However, stuff gets tricky when you start allowing rich filtering. For example if you are interested in SQL performance questions you can easily filter down the list and maintain all your sort orders.

Historically we found that the most efficient way for SQL Server to deal with these queries was using the full text search (FTS). We tried many different techniques, but nothing came close to the full text engine.

We would run queries such as this to get lists of tags to build a related tag list:

SELECT Tags
    FROM Posts p
    WHERE p.PostTypeId = 1 
    AND p.DeletionDate IS NULL 
    AND p.ClosedDate IS NULL 
    AND p.LockedDate IS NULL  
    AND CONTAINS(Tags, 'ésqlà AND  éperformanceà ')   

A similar query was used to grab question ids.

There were a series of problems with this approach:

  1. It did not scale out; it placed a significant load on the DB.
  2. It caused us to ferry a large amount of information from the DB to the app servers on a very regular basis.
  3. SQL Server FTS does not play nice with existing indexes. The FTS portion of a query always needs to grab the whole result set it needs to work on, prior to filtering or ordering based on other non FTS indexes. So, a query asking for the top 10 c# .net questions was forced to work through 50k results just to grab 10.

To solve these issues we created the so called tag engine. The concept is pretty simple. Each web server maintains a shell of every question in the site, with information about the tags it has, creation date and other fields required for sorting. This list is updated every minute. To avoid needing to scan through the whole list we maintain various pre-sorted indexes in memory. We also pre-calculate lists of questions per tag.
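To make the shape concrete, here is a minimal sketch under assumed names (QuestionShell, TagEngine and all members below are hypothetical illustrations, not our actual code):

using System;
using System.Collections.Generic;

class QuestionShell
{
    public int Id;
    public DateTime CreationDate;
    public int Score;
    public int[] TagIds;
}

class TagEngine
{
    // one lightweight record per question, refreshed every minute
    QuestionShell[] questions;

    // pre-sorted views over the questions, one per sort order
    QuestionShell[] sortedByCreationDate;
    QuestionShell[] sortedByScore;

    // pre-calculated per-tag question lists
    Dictionary<int, QuestionShell[]> questionsByTag;

    public List<QuestionShell> Newest(int tagId, int take)
    {
        // cheap path: the per-tag list is assumed to be pre-sorted by creation date
        QuestionShell[] inTag = questionsByTag[tagId];
        int count = Math.Min(take, inTag.Length);
        var result = new List<QuestionShell>(count);
        for (int i = 0; i < count; i++)
            result.Add(inTag[i]);
        return result;
    }
}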

We layer on top of this a cache and a bunch of smart algorithms to avoid huge scans in memory and ended up with a very fast method for grabbing question lists.

And by fast, I mean fast. The tag engine typically serves out queries at about 1ms per query. Our questions list page runs at an average of about 35ms per page render, over hundreds of thousands of runs a day.

Prior to the introduction of the tag engine the question listing pages on Stack Overflow were the single biggest bottleneck on the database. This was getting worse over the months and started impacting general unrelated queries.

There was another twist: the tag engine was always very close to “lock free”. When updating the various structures we created copies of the engine in memory and swapped in the updated copy, discarding the out-of-date one.
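In code, the copy-and-swap looks roughly like this (again a hypothetical sketch, reusing the TagEngine type from the sketch above):

class TagEngineHost
{
    // readers always see a complete, consistent engine through this reference
    private volatile TagEngine current;

    public TagEngine Current { get { return current; } }

    // called from a background timer, roughly once a minute
    public void Refresh()
    {
        TagEngine updated = BuildUpdatedEngine(current); // build the new copy off to the side
        current = updated; // reference assignment is atomic; the old copy becomes garbage
    }

    private TagEngine BuildUpdatedEngine(TagEngine previous)
    {
        // rebuild the shells, sorted views and per-tag lists here
        return previous; // placeholder
    }
}

The catch, as the next section explains, is that the discarded copy is exactly the kind of garbage a generational collector handles worst.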

The worst possible abuse of the garbage collector

The .NET garbage collector, like Java’s, is a generational garbage collector. It is incredibly efficient at dealing with objects that have a short life span (provided they are less than 85,000 bytes in size). During an object’s life it is promoted up the managed heaps from generation 0 to 1 and finally to generation 2.

The .NET garbage collector runs generation 0 sweeps most frequently, generation 1 sweeps less frequently and generation 2 sweeps least frequently. When the Server GC in .NET 4.0 runs a generation 2 GC (which is a full GC) it has to suspend all managed threads. I strongly recommend reading Rico’s excellent Garbage Collector Basics and Performance Hints. .NET 4.5 will improve this situation; however, even with the background GC in .NET 4.5 the GC may suspend threads in certain cases, a fact confirmed by Maoni, the chief GC architect, in her Channel 9 talk.

If you happen to have a data structure full of internal references that allocates a huge number of objects which only just make it into generation 2 before you abandon them, you are in a pickle.

Your blocking GC is now forced to do lots of work finding the roots of the objects. Then it needs to deallocate them, and possibly defragment the heap.

Our initial tag engine design was a GC nightmare, the worst possible type of memory allocation: lots of objects with lots of references that lived just long enough to get into generation 2 (which was collected every couple of minutes on our web servers).

Interestingly, the worst possible scenario for the GC is quite common in our environment. We quite often cache information for 1 to 5 minutes.
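The pattern typically looks something like this (a hypothetical example using MemoryCache purely for illustration; the types and the five minute policy are stand-ins):

using System;
using System.Collections.Generic;
using System.Runtime.Caching;

class RelatedQuestion
{
    public int Id;
    public string Title;
    public string[] Tags;
}

static class RelatedQuestionsCache
{
    // Each miss allocates a graph of many small objects that lives about five
    // minutes: long enough to be promoted to generation 2, then abandoned in bulk.
    public static List<RelatedQuestion> Get(int questionId)
    {
        string key = "related-" + questionId;
        var cached = (List<RelatedQuestion>)MemoryCache.Default.Get(key);
        if (cached != null) return cached;

        List<RelatedQuestion> fresh = LoadRelated(questionId); // hypothetical, hits the database
        MemoryCache.Default.Set(key, fresh, DateTimeOffset.UtcNow.AddMinutes(5));
        return fresh;
    }

    static List<RelatedQuestion> LoadRelated(int questionId)
    {
        return new List<RelatedQuestion>();
    }
}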

Diagnosing the problem

This issue was first detected by Kyle in July. We performed a few optimisations assuming it was CPU related and heavily reduced CPU usage on the tag engine; as a side effect we also reduced memory churn, which lessened the impact a bit. At the time we did not suspect a GC issue; instead we assumed there was a rogue thread slowing everything down.

Fast forward about 3 weeks: I enabled some basic profiling on our web servers that I had baked in a while ago:

[Figure: basic profiling report]

This report is based on a simple Stopwatch that starts at the beginning of a request and stops at the end. It measures the web server’s request duration and puts it into neat time buckets partitioned by controller action.
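A rough sketch of the idea (the module, the bucket boundaries and the use of the request path as a stand-in for controller action are all hypothetical):

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Web;

public class RequestTimingModule : IHttpModule
{
    // per-action counters; each slot counts requests that fell in a time bucket
    static readonly ConcurrentDictionary<string, int[]> Buckets =
        new ConcurrentDictionary<string, int[]>();

    static readonly int[] BucketUpperBoundsMs = { 25, 50, 100, 500, 1000, int.MaxValue };

    public void Init(HttpApplication app)
    {
        app.BeginRequest += (s, e) => app.Context.Items["timer"] = Stopwatch.StartNew();

        app.EndRequest += (s, e) =>
        {
            var timer = app.Context.Items["timer"] as Stopwatch;
            if (timer == null) return;
            timer.Stop();

            string action = app.Context.Request.Path; // stand-in for controller/action
            int[] counts = Buckets.GetOrAdd(action, _ => new int[BucketUpperBoundsMs.Length]);

            for (int i = 0; i < BucketUpperBoundsMs.Length; i++)
            {
                if (timer.ElapsedMilliseconds <= BucketUpperBoundsMs[i])
                {
                    Interlocked.Increment(ref counts[i]);
                    break;
                }
            }
        };
    }

    public void Dispose() { }
}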

This report showed me something rather odd. The time the web server was reporting was wildly different to the time we were measuring on our load balancer. It was about 60ms off on average.

I knew this because of Jarrod’s awesome work. We store all our HAProxy web logs in a SQL Server instance. This has been a godsend when it comes to data analysis. I ran a simple average on the time HAProxy observed and noticed the discrepancy right away.

When I graphed the response time for any route on our site, using the free and open source Data Explorer, we noticed spikes.

[Figure: response time spikes]

The above graph demonstrates how a limited number of requests would “spike” in response time, many of which took longer than 1 second.

The frequency of the spikes correlated to the Gen 2 Collections performance counter.
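The counter itself can be read with System.Diagnostics (the category and counter names are the standard CLR ones; the “w3wp” instance name here is just an example):

using System.Diagnostics;

class Gen2Counter
{
    static readonly PerformanceCounter Gen2Collections =
        new PerformanceCounter(".NET CLR Memory", "# Gen 2 Collections", "w3wp", readOnly: true);

    // sample this periodically; a jump between samples means a full GC ran in between
    public static float Sample()
    {
        return Gen2Collections.NextValue();
    }
}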

Equipped with this feedback we set out on a mission to decrease the spikes which today look like this (for the exact same number of requests).

[Figure: response times after optimisation]

These spikes had a very interesting characteristic. We now store the time measured in ASP.NET in our logs as well, so we can graph that too.

[Figure: response times with ASP.NET timings overlaid]

ASP.NET did not even notice that the spikes happened. This may indicate that ASP.NET is aware a GC is about to happen and is careful to serve out all its existing requests prior to running a Gen 2 GC. Regardless, the fact that MiniProfiler could not catch the spikes was consistent with our empirical evidence.

Note the blue “spikes” in the graph above are caused by a process that rebuilds related question lists. In general this is done in a background thread; however, in rare conditions it may happen on first render. The GC spikes are the areas in the graph where there is a pile of yellow dots on top of a blue line. Clearly, it is not that problematic in this graph, which was created from data collected after we had heavily improved the situation.

.NET Memory Profiler

I have always been a huge fan of .NET Memory Profiler. It is an awesome tool. Over the years Andreas has kept up with all the latest features. For example, the latest version allows you to attach to a .NET 4 application using the new ICorProfiler attach API.

During the process of optimisation, after every round of changes, I would pull a server off the load balancer and analyze the managed heaps:

[Figure: .NET Memory Profiler snapshot]

For example, the profile above was taken after we refactored the tag engine; it uncovered another spot with 500k objects that were alive for only a couple of minutes.

.NET Memory Profiler allows you to gather and compare snapshots of memory, so you can properly measure churn in your managed heaps. Churn in your generation 2 managed heap or large object heap is a major source of GC “freezes”.

Techniques for mitigating GC cost

When I fully understood this issue I quickly realised what the solution had to be.

Why should the tag engine cause requests that do not even consume its services to slow down? That is crazy talk. The tag engine should be isolated in its own process; then we could use a communication channel to query it, and even distribute it amongst several machines, issuing simultaneous queries from our web servers to mitigate GC freezes. The first tag engine to respond would be served to the user. Problem solved.

When I pitched this to the team there was violent objection: now we are going to need to maintain an API; deploys are going to be more complex; the list goes on. Services, SLAs and the rest of Yegge’s recommended cures for the API battle were not going to fly here.

Arguments aside, I did realise that by isolating the tag engine you only limit the impact of the issue; you do not solve the problem. Also, there is a cost to all the new complexity this would introduce.

We also considered Microsoft’s recommendation:

If you do have a lot of latency due to heavy pause times during full garbage collections, there is a feature that was introduced in 3.5 SP1 that allows you to be notified when a full GC is about to occur. You can then redirect to another server in a cluster for example while the GC occurs.

I was not particularly keen on this suggestion:

  • We did not have a channel built that would allow us to coordinate this activity.
  • There was a non-trivial risk that multiple machines would be forced to run full GC collections at the same time, leading to complex coordination efforts (we were seeing a full GC every couple of minutes).
  • It does not solve the underlying problem.
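For completeness, the notification API the recommendation refers to looks roughly like this (a minimal sketch; the thresholds and the load balancer callbacks are placeholders, and the feature requires concurrent GC to be disabled):

using System;
using System.Threading;

class FullGcWatcher
{
    public static void Start(Action takeOutOfRotation, Action putBackInRotation)
    {
        // only supported when concurrent GC is disabled in the host configuration
        GC.RegisterForFullGCNotification(maxGenerationThreshold: 10, largeObjectHeapThreshold: 10);

        new Thread(() =>
        {
            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                    takeOutOfRotation();   // e.g. ask the load balancer to drain this server

                if (GC.WaitForFullGCComplete() == GCNotificationStatus.Succeeded)
                    putBackInRotation();   // resume serving traffic
            }
        }) { IsBackground = true }.Start();
    }
}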

So Marc and I set off on a 3 week adventure to resolve the memory pressure.

The first step was rewriting the tag engine.

Index arrays as opposed to object arrays

In the tag engine we have to keep a global index of all our questions in one of six different sort orders. Trivially we used to use:

Question[] sortedByCreationDate; 

The issue with this general pattern is that it is GC unfriendly. Having this array around means that each question now has one extra reference the GC is going to have to check.

Our tag engine is essentially immutable after construction or refresh, so we could simply replace the 2.2 million question list with:

int[] sortedByCreationDate;  

An array containing indexes into the main immutable question list.

This is a pattern many applications use internally, including Lucene.

For large, immutable object sets this is a great solution. Nothing particularly controversial here.
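Building such an index array is cheap; a sketch (assuming the hypothetical QuestionShell type from the earlier sketch):

using System;

static class QuestionIndexes
{
    // sort indexes, never the (large, shared) question array itself
    public static int[] BuildSortedByCreationDate(QuestionShell[] questions)
    {
        var indexes = new int[questions.Length];
        for (int i = 0; i < indexes.Length; i++)
            indexes[i] = i;

        // newest first; the comparison reaches into the shared question array
        Array.Sort(indexes, (a, b) =>
            questions[b].CreationDate.CompareTo(questions[a].CreationDate));

        return indexes;
    }
}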

Classes to Structs where needed

The above optimisation got us a fair distance; however, we were left with a 2.2 million question list full of class pointers. The GC still had to work really hard crawling each Question object to deallocate it when we swapped in the new question list. This was partially mitigated by reusing large chunks of our data during the refresh; however, it was not solved.

The cheapest solution from the GC’s perspective was a simple struct array that it could allocate and deallocate in a single operation.

Structs however come with big warnings from Microsoft:

Do not define a structure unless the type has all of the following characteristics:

  • It logically represents a single value, similar to primitive types (integer, double, and so on).
  • It has an instance size smaller than 16 bytes.
  • It is immutable.
  • It will not have to be boxed frequently.

Migrating these millions of objects to structs was going to violate all of these rules. The classes were bigger than 16 bytes, mutable during refresh, and represented multiple values.

The unspoken reality is that Microsoft violates these rules in data structures we use daily, like Dictionary.

Working efficiently with structs means you need a deep understanding of C# semantics; for example, understanding this code should be second nature if you are to use structs.

struct MyStruct { public int A; }

void DoSomething(MyStruct item) { item.A = 2; }          // receives a copy of the element
void DoSomethingElse(ref MyStruct item) { item.A = 2; }  // receives a reference to the element

var stuff = new MyStruct[10];
var thing = stuff[0];          // copies the element into a local
thing.A = 1;
Console.WriteLine(stuff[0].A); // returns 0 (only the local copy changed)

stuff[0].A = 1;
Console.WriteLine(stuff[0].A); // returns 1 (indexing into the array mutates in place)

DoSomething(stuff[0]);
Console.WriteLine(stuff[0].A); // returns 1 (can not do anything, it got a copy)

DoSomethingElse(ref stuff[0]);
Console.WriteLine(stuff[0].A); // returns 2 (passed by ref, so the element changed)

You have to be really careful not to operate on copies of the data as opposed to the real data. Assignments copy. With this in mind, when dealing with huge collections of entities you can use structs.

As a general rule: I would only consider porting to structs for cases where there is a large number (half a million or more) of medium lived objects. By medium lived, I mean, objects that are released shortly after they reach generation 2. If they live longer, you may be able to stretch it a bit. However, at some point, the sheer number of objects can be a problem, even if they are not deallocated. The days of 64 bit computing are upon us, many enterprise .NET apps consume 10s of gigs in the managed heaps. Regular 2 second freezes can wreak havoc on many applications.

Even with all the upcoming exciting changes with the GC, the reality is that doing a ton less work is always going to be cheaper than doing a lot of work. Even if the work is done in the background and does not block as much.

Huge medium to long term object stores in a generational GC require careful planning, so the same principles apply to Java (even if its GC is less blocking).

.NET and other generational garbage collected frameworks are incredibly efficient at dealing with short lived objects. I would not even consider converting any of our short lived classes to structs. It would not make one iota of a difference.

Avoiding large numbers of int arrays

On our Question object we have a tags array.

struct Question
{
    public int[] Tags { get; set; }
}

Even after we got rid of these 2.2 million refs we were still stuck with a large number of pointers to arrays that live in the struct.

To eliminate this we used a custom struct that pre-allocates 10 spots for tags and provides similar array semantics. Our implementation uses fixed int data[MaxLength]; as the internal backing store. However, you could implement something safe that is identical by having N int fields on the internal embedded struct.
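A rough sketch of such a structure (the names are hypothetical; the fixed buffer needs an unsafe context, and a safe variant with ten plain int fields works the same way):

using System;

unsafe struct TagList
{
    public const int MaxLength = 10;

    private fixed int data[MaxLength];
    private int count;

    public int Count { get { return count; } }

    public int Get(int index)
    {
        if (index < 0 || index >= count) throw new IndexOutOfRangeException();
        fixed (int* p = data) { return p[index]; }
    }

    public void Add(int tagId)
    {
        if (count >= MaxLength) throw new InvalidOperationException("tag list is full");
        fixed (int* p = data) { p[count] = tagId; count++; }
    }
}

Either way the ten slots live inline in the question struct itself, so the separate per-question tag array disappears entirely.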

The fixed-size structure is a trade-off: we potentially take up more space, but we avoid a lot of small memory allocations and deallocations.

I also experimented with Marshal.AllocHGlobal but managing lifetime for an embedded structure was way too complicated.

The above techniques are the primitives we use to implement our system, all of this is wrapped up in cleaner object oriented APIs.

In this spirit, if I were to implement a managed version of, say, memcached in C#, I would use huge byte[] arrays to store data for multiple entries. Even though the performance of a managed memcached will never match a C++ implementation, you could design a managed store in C# that does not stall.

Conclusion

Some of our techniques may seem radical on first read; others a lot less so. I cannot stress enough how important it is to have queryable web logs. Without proper visibility and the ability to analyze progress, progress cannot be made.

If you face similar problems in a high-scale .NET application, you may choose the large block allocation solution, rewrite components in C++, or isolate problem services.

Here is a graph plotting our progress over the months; it shows the percentage of times Question/Show takes longer than 100ms, 500ms or 1000ms.

[Figure: progress report]

The remaining delays these days are mostly GC related; however, they are way shorter than they used to be. We have practically erased all the 1 second delays. Our load these days is much higher than it was months ago.


I take full credit for any mistakes in this article and approach, but praise should go to Marc for implementing many of the fixes, to Kyle and Jeff for communicating the urgency and scale of the problem, and to Jarrod for making our web logs queryable.

I am glad Jeff has a stopwatch in his brain.

Comments

Shawn_Neal over 12 years ago
Shawn_Neal

That was really interesting. It's not every day I get to read an article about .NET GC “tuning”; it seems like the JVM exposes more in the way of tuning the GC.

I would be interested in hearing more about your queryable web logs. How are they configured? What kind of reporting and monitoring do you use?

Sam Saffron over 12 years ago
Sam Saffron

Thanks,

We have a background process that parses the raw logs and bulk inserts all the records into a very large, dedicated SQL Server instance. The hardest thing here is keeping up with the rate the data is generated. The deep analysis is done with a dedicated Data Explorer instance ( http://data.stackexchange.com/ ); it is an open source project.

Basic daily analysis is done using custom pages on our various management interfaces.

The last report on this blog took a few hours to generate, we really need to index it a bit better.

A key trick with the SQL log is one table per day. (we could use partitioned tables as well, but it requires a more expensive license)

Zz over 12 years ago
Zz

I'm not really close to this level of C# proficiency, but I would have assumed that something like a tag cache would be implemented using manual memory management (I seem to remember being able to do that in C#)

Sam Saffron over 12 years ago
Sam Saffron

That is the Marshal.AllocHGlobal option (or c++ interop), it gets very hard to maintain and is fairly risky.

Gregor_Suttie over 12 years ago
Gregor_Suttie

When working on a large project recently we were seeing 70MB of Gen 2 memory being used, and every 20 minutes this would be cleared out. It took us an age to work out what was going on – in this instance it was related to Oracle's Coherence and our setup.

We used Ants Memory Profiler to look at the Memory Issues which helped greatly.

Mark_Johnston over 12 years ago
Mark_Johnston

Maybe it is time for you guys to drop the .NET stack for the high performance stuff?

Mwp over 12 years ago
Mwp

At some point, you have to ask yourself if all this extra complexity and effort is worth it? If you're having to use all sorts of hacks to get around the intended behaviour of the system, why keep fighting it? .NET is clearly not suitable for certain types of systems.

Sam Saffron over 12 years ago
Sam Saffron

I agree it is not suitable for all systems, but these days we are running at an average of 32ms to serve our most popular page on the site, it barely ever spikes over 500ms. It is working quite well for us. Our web servers are all running at ridiculously low load due to the efficiency of the system.

I don’t really consider much of our designs hacks, it is simply the proper way to design storage of large object graphs, similar approaches exist in c++.

Marc_Gravell over 12 years ago
Marc_Gravell

@Mark / @MWP seriously? High performance code is always something you need to put effort into in any language/platform, and it works very nicely indeed in our current architecture. As a company, we'll happily use any appropriate tech that gets the job done – and .NET is not the issue here. C++ (for example) would introduce a lot more complexity and risk, for pretty little gain.

The fact that it works so well says to me: actually, it is perfectly suitable. But as with any architecture; at scale, you need to address a few issues. Which issues varies between platform, but there is no magic bullet simply by choosing another one.

Marc_Gravell over 12 years ago
Marc_Gravell

Or to put that another way:

.NET simply isn't slow; sorry if that derails your FUD, but we know this platform extremely well, and any myths of poor performance have been greatly exaggerated.

Tim_Meers over 12 years ago
Tim_Meers

Being a .NET developer I can only hope I will get the opportunity to flex the framework like you guys have. As for all the naysayers here, obviously .NET is capable of the kind of performance they are looking for. After all, they're doing it, day in and day out. Sometimes you really have to go that extra mile to make it work how it's expected. Keep up the good work guys.

Jerry_Hewett over 12 years ago
Jerry_Hewett

[rises from his chair to give Marc Gravell a standing ovation]

Aamir over 12 years ago
Aamir

@ .NET Slayers: You really need to talk to somebody at Facebook/Twitter/Flickr etc who does similar kind of tuning to their Apps which, as a matter of fact, are not .NET based.

As Marc said, this will happen with any app which is scaled to such a proportion. I have done similar kind of stuff with COM+ servers. There is no Silver bullet as far as performance is concerned.

Lucas_Vogel over 12 years ago
Lucas_Vogel

Another important yet often overlooked aspect of GC tuning is how it behaves with applications in Debug mode vs. Release mode. I've got a .NET library in C++/CLI where I provide a .NET object layer over some high perf C++ code, and I've had to set up build-level GC optimization code based on whether my library is in Debug or Release mode.

Js over 12 years ago
Js

Most XNA games avoid GC when dealing with lots of objects by re-using them… is there a reason that approach would not have worked here? It seems like it would be simpler.

Jd_Conley over 12 years ago
Jd_Conley

Maybe you thought this was redundant because of Marc's post, but the whole point of this refactor was that the struct/array enabled you to shove all these things in the LOH as one big set and thus avoid impacting the Gen 2 GC sweep. And by the way, I love seeing you guys show the world that .NET doesn't suck.

Joel_Spolsky over 12 years ago
Joel_Spolsky

Obligatory Old Timey Story: in the 1980s garbage collection was a real performance problem on Lisp machines, which were lucky to have a meg of memory. There were a few elite Lisp hackers who practiced what they called “cons-less programming” – cons is the only thing that allocates memory in Lisp – to avoid GC. Since cons is really the fundamental building block of just about everything you do in Lisp this was kinda difficult, but the technique is almost identical to what you're describing here.

Christopher_Allen_Poole over 12 years ago
Christopher_Allen_Poole

What's with the accent egu and accent gave?

ésqlà AND éperformanceà

It's present in the page source too.

Unicode issue? (Using Firefox 7, Win. 7)

Sam Saffron over 12 years ago
Sam Saffron

actually it's a workaround for stop words; we allow for tags such as and, or, etc. … so this makes it much easier to maintain.

Steve_J over 12 years ago
Steve_J

Having experienced similar “medium-lived object” issues with the JVM before concurrent GC, I'd say you handled this exceptionally well. I wouldn't trade a GC for segfault wack-a-mole, but it does make debugging things like this more fun!

Ken_Egozi over 12 years ago
Ken_Egozi

First – very deep explanation. As people noted here – it is not very common to see such a deep GC analysis in .NET – lots more of this happen in Java world. So it's great to have such a detailed explanation ready for people to read and learn.

However, per your original problem – what if, in order to overcome the shortcomings of FTS, you were to use Lucene for that index? It already deals with the GC problem (as you mention in your post), should perform well for such queries, has well-known scale-up and scale-out techniques, and afaik is already used in your system for regular search.

Sam Saffron over 12 years ago
Sam Saffron

We already use Lucene for search. The thing is, our custom tag engine is just way faster than Lucene in this case because it is custom built and optimised for one very specific problem.

Mark_Johnston over 12 years ago
Mark_Johnston

@Marc: The fact is that when you have to go through these hoops something is not right – either your tools are wrong (managed) or your general architecture is wrong (ever thought about scaling some stuff out instead of jamming everything in the ASP.NET process?).

A pure unmanaged stack will not suffer from GC related perf issues and, if made correctly, it will perform vastly better than any managed stack around.
There is a reason that managed stacks are avoided if you want continuous near-real time performance.

Chris over 12 years ago
Chris

If anyone wants a taste of performance-critical .NET without requiring a website that is in the top 10 on the web as Marc and the team have, then writing an iPhone app in MonoTouch is along the same lines. Everything is AOT, with 46MB of RAM for your process, forcing you away from the usual hardware-is-cheaper-than-software server model and forcing you (or me anyway) to learn a lot about the CLR.

Artsrc over 12 years ago
Artsrc

Why not have a cache process, and then create a new one, switch and blow away the old one?

One downside of the separate process approach is the IPC.

Ideally they never do a full GC, and you throw the whole thing away rather than pick through references.

Sam Saffron over 12 years ago
Sam Saffron

The “process isolation” approach was shot down by the team, and this is essentially the same idea. It seemed to be avoiding the problem as opposed to solving it.

Alex over 12 years ago
Alex

Out of curiosity, have you guys considered using some 3rd party distributed caching products to serve as a backend for the tag engine? You do mention memcached about halfway through the article but what about products like NCache, ScaleOut or Microsoft's recent AppFabric/Velocity? Granted, most of them are not free but I am wondering about the tradeoff between all the custom development/debugging you had to do to get this online and working.

We've had a fair amount of success using some of them for a similar problem space within a commercial application.

Andy_Rondeau over 12 years ago
Andy_Rondeau

Would you ever consider putting the cache in a separate AppDomain, and then using a new AppDomain every time you rebuild the cache? My understanding is that releasing an AppDomain is much cheaper than running a Gen 2 GC on the objects within it. The overhead is that you'd still have to binary serialize on the barrier between your main AppDomain and the instance of your cache.

Sam Saffron over 12 years ago
Sam Saffron

Andy,

I actually tested this. You only get one GC thread per process (in the server GC) regardless of the number of app domains you have, so if you have a massive graph in any app domain you are risking GC freezes. It is possible that deallocation is faster; however, we needed an expensive process prior to dumping the graph, so even with GC notifications we could not be sure the new data would be ready in time.

These days we reuse much of the graph so the approach of dump old cache with app domain would not work.

Olivier_Deheurles over 12 years ago
Olivier_Deheurles

Hi Sam,

Thank you for the interesting story ;)

I will try to add some constructive comments..

I think you can attack GC2 from different angles:

  • Making it quicker: reducing the number of references, using CPU cache friendly data structures like arrays, or using arrays of structs that the GC can skip straight away, etc. Basically, reducing the amount of work the GC has to do over the heap during a Gen 2 collection.
  • Reducing their frequency, or ultimately preventing GC2 from happening at all.

The rule with GC is quite simple: objects should live very briefly (in the case of a web server, for the duration of a request) or very long (for the duration of the application). Once the application is in steady state, breaking this rule will result in gen2 collections at some point.

You designed your cache as an immutable data structure for concurrency reasons, I imagine: a background thread creates the cache data structure and uses a CAS to make it visible atomically to the threads handling the requests. The problem is that immutability does not play nicely with GC: since you cannot mutate the cache, you have to throw it away (every 1-5 minutes) and replace it with a new one, and this adds pressure on your gen2 heap.

What I was wondering is whether you could proceed like this:

  • at application startup you allocate 2 caches, C1 and C2, fill them, and make one of them visible to your request threads,
  • when your cache has to be updated, the background thread takes C2 and modifies what changed during the last 1-5 minutes instead of creating a new one,
  • the background thread Interlocked.Exchanges the 2 caches to make the new cache visible to request threads, and you keep a reference to C1 that you will reuse for the next cache update.

The cache does not need to be immutable, it just needs to not be modified while accessed by request threads.

Not sure if it is applicable in your case, there might be evils hidden in the detail ;)

The following paper is an interesting reading about object recycling, their system does 0 GC but you don't have to go that far in your case and can use some of the concepts: http://download.microsoft.com/download/9/9/C/99CA11E6-774E-41C1-88B5-09391A70AF02/RapidAdditionWhitePaper.pdf

Sam Saffron over 12 years ago
Sam Saffron

Very interesting idea. The first big dip in our “progress report” at the end of my post happened when we added data structures that reuse huge portions of the graph after a refresh, avoiding massive deallocations. A huge win.

Essentially, virtually nothing changes every refresh (20 out of millions of questions) … maximizing reuse was key.

The struct foo was there to take us the rest of the way. We still attempt to reuse as many data structures as we can after a refresh.

Marius over 12 years ago
Marius

I see that you are already using Redis for other tasks. Did you try using Redis as your tag engine? It has built in set operations and you would avoid GC.

Sam Saffron over 12 years ago
Sam Saffron

There are no Redis primitives that would fit our custom cache required for the tag engine.

Olivier_Deheurles over 12 years ago
Olivier_Deheurles

It's quite funny how people react as soon as you propose something a bit out of the road ;)

I really like as well how people call something they do not understand “a hack”. Let's be clear: just because you do not understand something at your level does not mean it is not trivial for somebody else. Understanding how the GC works and what the pros and cons of value types are, and using them properly, is not a hack; it is what any decent .NET developer should be able to do.

Nowadays servers are massive beasts, generally completely under-used CPU-wise and memory-wise, and scale out is often NOT the right solution. Except if you sell hardware ;)

A properly architected system can serve huge numbers of transactions (100K+/sec) from a single box. Of course you can shard if required (scale out) but there are not many systems out there doing that many transactions…

In an ideal world, all your data should live within your process and you can serve all requests without ever blocking your request threads (thread waiting on a lock or on I/O).

Calling out to a distributed store may seem quick (and solves the GC issue) but it raises several other issues: you are blocking your request thread to synchronously call another process. Blocking is bad. Under heavy load you will ultimately starve your thread pool and response times will go through the roof. If the cache API maintains an in-memory copy of the data you can have contention issues: multiple threads are accessing the same piece of data, this data has to be protected, and some mutex will be sitting there serializing the access of all your request threads. Or the API will be using immutable data structures and you will hit the GC wall.

As soon as you have to interact synchronously with an external system (a database, a cache, an app server, etc.) you quickly face scalability issues. I believe that as soon as you are in such context you should switch to an asynchronous model with push over HTTP (COMET, WebSocket, etc.) or asynchronous response mechanism and interact asynchronously with external systems.

A few other interesting links: http://www.infoq.com/presentations/LMAX http://martinfowler.com/articles/lmax.html

Olivier

Ashwin_Jayaprakash over 12 years ago
Ashwin_Jayaprakash

Good explanation. You'll be surprised to see how many Java/.Net devs have no idea how to correlate GC logs with application slowness.

BTW, was there something like a measurable “heap size” that your app was using other than just process mem? A few GBs?

If you are interested about reading what Java guys do – http://javaforu.blogspot.com/2011/09/offloading-data-from-jvm-heap-little.html

Marc_Gravell over 12 years ago
Marc_Gravell

@Olivier you are right on the money CPU wise – our web tier runs with the CPUs barely idling (I'm not kidding), and memory to spare – scaling out to more machines is unnecessary. Re blocking – agree fully, which is why we wrote our own redis client to be fully pipelined / multiplexed rather than blocking request/response/request/response.

Re the data-structures – we mix both; semi-immutable data (mutated appropriately, but carefully – with only the bare minimum done with full mutex), and a few custom list-like structures to allow thread-safe (lock-less) extension into the lists spare capacity, etc.

Simon_Woods over 12 years ago
Simon_Woods

Slightly tangentially, have you looked at the Disruptor work at LMAX? (http://code.google.com/p/disruptor/)

BTW, I'm not offering it as an alternative solution because it may not be directly relevant (and I'm pretty sure you'd be loath to move away from using a DB at this stage! ;–) ), but rather as another (very interesting, imv) way in which perf problems pertaining to locks/GC were approached.

Olivier_Deheurles over 12 years ago
Olivier_Deheurles

The Disruptor is not directly relevant to this post but can definitely help in an async architecture. The .NET port is available here: http://code.google.com/p/disruptor-net – I am maintaining it, so feel free to ask if there is any question.

Sam Saffron over 12 years ago
Sam Saffron

Nice one! Very insightful comments btw.

Simon_Woods over 12 years ago
Simon_Woods

@Olivier … sorry I hadn't noticed your reference to LMAX above!

Gerardo_Orozco over 12 years ago
Gerardo_Orozco

Excellent read, Sam.

We had a similar issue: having naively coded a tree structure that typically required hundreds of millions of nodes, we experienced SERIOUS Gen 2 GC delays!

We solved it in a similar fashion to you: first we converted the tree node class to a struct, then we used managed arrays to hold the nodes.

We still got bit by the limitation that no array can be bigger than 2GB even in a 64 bit environment – but we solved it by using multiple arrays. The 32 bit node index is split: a part of it indexes the managed sub-array containing the node, while the other part indexes the actual node inside the array. Of course we had to create an auxiliary structure to hold the indexes of the “free” nodes to be reused, and as a downside we had to explicitly free them (somehow defeating the “managed” spirit of things, but completely quenching the GC problem, as now we only held references to a few managed arrays).

Fortunately, this double dereferencing operation introduced a negligible performance hit, and I think the solution could be seen as a managed pattern for tree-like structures that might grow large and exist for a long time.

I'm going to experiment rewriting the tree using managed C++, moving again to a (non-managed) instance-per-node solution, hoping it gives me the best of both worlds!

Thanks for sharing,

Cheers!

Sam Saffron over 12 years ago
Sam Saffron

Awesome and thanks, I did not know about that 2GB limitation.

Deleo over 12 years ago
Deleo

You basically refactored your solution from using the HEAP to the STACK. All arrays are reference types, which means they are located on the heap. You wanted to remove the GC, which operates on the heap, and you did the right thing. You created an array inside a value type, and declared it fixed. What this does is tell the compiler to allocate the array on the stack, kinda like alloc(). Great post! This is not a hack at all, but a clever use of the .NET compiler. However, when you start using stack rather than heap, you put a lot of stress on the server. It needs a lot of memory. So in situations where speed is crucial and RAM capacity is not an issue, go for it :)

Sam Saffron over 12 years ago
Sam Saffron

keep in mind … just because it is a value type, it does not mean it is stored on the stack … c# - Arrays, heap and stack and value types - Stack Overflow

and the canonical http://blogs.msdn.com/b/ericlippert/archive/2010/09/30/the-truth-about-value-types.aspx

Bryan_Migliorisi over 12 years ago
Bryan_Migliorisi

I read this blog entry a few days ago and then this morning stumbled into this article about ways to avoid garbage collection. It seems to explain a bit of the problem you encountered (and solved by switching to int's)

http://www.simple-talk.com/dotnet/.net-framework/5-tips-and-techniques-for-avoiding-automatic-gc-collections/

Sam Saffron over 12 years ago
Sam Saffron

good find, thanks for the link

Nick_Masao over 12 years ago
Nick_Masao

This is software engineering at its best. You guys are good at what you do and Stack Overflow is really, really fast, even for us in Africa with speeds of 456Kbps. Keep up the good work. Cheers

Abelito over 12 years ago
Abelito

Even as a java developer, I find your posts on .NET development fascinating. Very interesting read, thank you for it!

Scott Koon over 9 years ago
Scott Koon

I’m the last one to see this excellent blog post, I think. I ended up here via Matt Warren’s excellent performance posts. It’s great to see people really dig into the CLR and memory management. I believe that devs should know what’s going on under the hood, even if they never have to fine-tune their application.

