Field Notes: ElasticSearch at Petabyte Scale on AWS
Jun 17, 2016

I manage a somewhat sizable fleet of ElasticSearch clusters. How large? Well, “large” is relative these days. Strictly in ElasticSearch data nodes, it’s currently operating at the order of:

  • several petabytes of provisioned data-node storage
  • thousands of Xeon E5 v3 cores
  • 10s of terabytes of memory
  • indexing 10s of billions of events a day / >50TB of new data a day

And growing. Individual clusters tend to range anywhere from 48TB to over a petabyte. When I said “petabyte scale”, that includes individual clusters:

img

It’s not that large in terms of modern systems from a resource perspective, but it seems fairly large in the ElasticSearch world. A little over a year of managing these systems (with ES versions 0.90.x - 2.2.x) has yielded countless designs, operations strategies and other discoveries that I feel are worth sharing.

Every topic discussed in this post could be exploded into its own blog post, or translate to several days’ or weeks’ worth of testing for the reader. Diving into every area in great detail would make this post 100 pages, so my goal is to hit the high level areas I find important when operating ElasticSearch at some level of scale.

I’m going to assume you’ve heard twelve times over that ElasticSearch is basically an abstraction layer of distributed and replicated database mechanics on top of Apache Lucene with a (great) search API strapped on. That an index is a Lucene instance. What replicas are. Whether or not to have a quorum of dedicated masters. Etc. I’m more interested in sharing general thought patterns about how we operate that are (hopefully) adaptable to your own situation.

So anyway, let’s do it!

Capacity Planning

and general cluster design

I primarily spec around two drivers: I have an indexing driven workload and specific requirements around data retention (in days). Conversations around shard counts tend to devolve into references to black magic or art, but it’s actually a bit more straightforward than that (…or is it?).

Basically, consider a shard a unit of performance. It has a particular performance profile over time (e.g. under constant indexing, resource demand increases in relation to the segment sizes during merging, until leveling out at the max segment size). Then there are expected costs based on activity (e.g. larger merges more heavily stress the garbage collector, querying larger data sets occupies the query thread pool for more time, added query complexity and regex burn more CPU, etc.). Finally, every operation lives within a world of explicit limits. Merging, search, indexing, etc. - all require threads. These threads are specifically defined (and mostly configurable) in count. We know how many threads per shard or per node we can consume, and each activity type has its own particular resource demand characteristics (CPU, disk, memory, or combinations). This means everything is measurable.

The crux of ElasticSearch, or any complex system for that matter, is that capacity planning feels like black magic or an art because most things go unmeasured or uncontrolled.

This is our theme: measuring and controlling. Here’s how I approach this with our clusters.

(Roughly) Measuring

Our standard data node is the d2.2xlarge instance. It has 8 cores of Xeon E5-2676v3, 61GB of memory, and 6x 2TB direct-attach “probably good SATA or nearline-SAS” drives. It’s configured with a 6-way LVM stripe for 11TB usable storage (after fs and other overhead) that can sustain a solid, consistent ~900MB/s sequential read/write*.

*(AWS doesn’t over-provision storage on d2s; the full hypervisor has 24 spindles, all of which you’re allocated at the largest 8xl instance. Each size down halves your access to the resources, so you get 6 disks / 8 cores / 61GB with the 2xl. In testing, the hypervisor HBA also supported simultaneous saturation of all disks - hence the consistent performance you’d see.)

Each data node is basically a “shard container” for our “units of performance”. Take your standard build, whatever it is, stick a single-shard index on it and light it up. You’ll see an ebb and flow consistency to the resource utilization. For instance, if we saturate it with indexing and wait until the max merged segment sizes are hit, we’ll see a cycle of CPU, storage IO, CPU, storage IO. That’s compression/decompression inside segment merging (CPU) and flushing the new, larger segments out to disk (big, sequential IO). When the merges hit max load, you’ll see a specific number of cores saturated on merge threads (the size of the merge thread pool; if not explicitly defined, it’s calc’d based on CPU cores. Check _nodes/_local/hot_threads - you’ll see a bunch of [[index-name][shard-num]: Lucene Merge Thread #xxx threads.). And when the merge flushes trigger, you’ll see the disks saturated for a somewhat consistent period of seconds. Merge, flush, merge, flush. That’s indexing.

Indexing is much more predictable than query. You’ll find some docs/sec. rate (your doc size and structure really matter, too) that you can safely index into a single shard until the indexing/bulk threads queued hover near their limit (if you had to monitor one thing, stats from _cat/thread_pool would probably be it) or your bulk indexing time is simply too high (which I care more about). In our case, we’ll say that’s roughly 2,500 docs/sec. per shard. At this rate, we will saturate 3 cores on merging (let’s say we locked our merge thread pool at 3 per shard for predictability / scaling purposes, which I’ll mention shortly) and occasionally peg the storage on new segment flushes for about 3-5s. While capacity planning revolves around your own requirements, I have an indexing priority workload; this leaves 5 open cores and a storage system that’s not overloaded. I can slap another shard on this node and squeeze out 5K docs/sec. per node. That’s roughly our per-node indexing capacity.

Single shard at peak utilization, locked at 3 merge threads. New line separated hot_threads output. img

So, these numbers may actually be made up, but the concept is real. Follow this ramble for a second. Rather than getting lost right off the bat thinking about arbitrary cluster sizes or trial-and-error shard counts, you establish performance baselines per shard as it runs on your standard build. You then look at your desired workload and determine the number of shards necessary to fulfill it. Finally, you think about what workload density you want to run per box (controlled via the shards-at-peak-utilization-per-box count). How much headroom do you want to leave on your nodes for other activity - query, rebalances, fault tolerance, replication? Let the calculated shard counts and headroom requirements dictate node counts.

Want to run 40K docs/sec. and don’t care too much about query?

40,000 / 2,500 = 16 shards per index / 2 shards-per-node = 8 nodes

Want replication? Roughly double it, go with 16 nodes (although the scaling isn’t exactly linear - there’s additional memory, networking and other overhead; this is easily a whole post).

Seriously, this mostly works. I roughly scale by speculation using this sort of calculation, then later scale-to-fit according to the realized performance characteristics (query for instance varies greatly across clusters). It’s been successful from 500 doc/sec. clusters to well over 150K.
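The back-of-envelope math above reduces to a tiny calculator. A sketch using the same illustrative numbers (remember, the per-shard rate is something you measure on your own standard build, and the replication multiplier is a floor, not an exact figure):

```python
import math

def nodes_needed(target_docs_per_sec, per_shard_docs_per_sec=2_500,
                 shards_per_node=2, replicas=0):
    """Back-of-envelope node count from per-shard indexing baselines.

    The 2,500 docs/sec/shard default is the illustrative baseline from
    the text, not a universal constant.
    """
    shards = math.ceil(target_docs_per_sec / per_shard_docs_per_sec)
    nodes = math.ceil(shards / shards_per_node)
    # Replication roughly multiplies node count; as the text notes, the
    # real scaling isn't exactly linear, so treat this as a minimum.
    return nodes * (1 + replicas)

print(nodes_needed(40_000))              # 16 shards / 2 per node -> 8
print(nodes_needed(40_000, replicas=1))  # roughly double -> 16
```

Then scale-to-fit later once you see the realized performance characteristics.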

For instance, if we saturated our 2 shards/node example at 5K docs/sec., we’d eventually leave 6 cores pegged solid when the max segment size merges are at play (+ bulk thread overhead*). Are 2 free cores enough for your query load? Maybe. Definitely not if you’re expecting to regularly perform complex regex queries against 50GB of index data.

higher workload density, less headroom: img

same workload at lower density, more headroom: img

One major caveat is that the observed indexing resource demand depends on many factors. In order to see the described behavior, you would likely require specific tuning that deviates from the ES default. As mentioned, my workload is indexing priority. For performance and scaling purposes, it’s preferable to get events to disk and let merging run as fast as possible. My resource demand patterns behave this way, intentionally, largely due to configuring index.translog.durability to ‘async’ and indices.store.throttle.type to ‘none’, in addition to favoring fewer/larger bulk requests and having large event sizes. If you’re using defaults, have small events and issue many requests with great concurrency, you’ll likely burn CPU on bulk threads and would experience a very different scaling pattern.
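For reference, the two settings credited above live at different levels: translog durability is set per index, while store throttling is a cluster-wide setting in these ES versions. A sketch of the request bodies (values are the ones named in the text):

```python
# Per-index setting: don't fsync the translog on every request.
# PUT this body to /<index>/_settings
index_settings = {"index": {"translog": {"durability": "async"}}}

# Cluster-wide setting: never throttle merge/store IO.
# PUT this body to /_cluster/settings
cluster_settings = {"transient": {"indices.store.throttle.type": "none"}}
```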

The goal of capacity planning and coming up with a cluster spec is understanding the general mechanics of ElasticSearch and how they relate to and compete for resources. At least from my perspective, it’s essentially a game of establishing per-shard performance baselines and building a cluster large enough to house the required number of “scaling units” while leaving enough headroom for other activity.

Unfortunately this has all been the easy part, and we got to enjoy it by isolating one single function of ElasticSearch (indexing). Sizing becomes more difficult when you introduce other user and cluster activities such as queries, shard rebalances, and long GC pauses. The recently mentioned “headroom for other activity” is that difficult part, and the best approach to sanity is controlling inputs; the second half of our theme and next topic.

Controlling

ElasticSearch stability is all about controlling inputs. While in the measuring section I talked about capacity planning in order to handle known input volumes such as indexing, it’s desirable to manage all resource-demanding activity, beyond indexing. Otherwise figuring out a spec at all sort of doesn’t matter, unless you just round way up (and own flip flops made out of hundred dollar bills).

I consider there to be 3 primary categories of control to think about:

  • indexing
  • query
  • general cluster activity (shard replication, rebalance, etc.)

I’ll just hit each of these in order.

For indexing: we use a homebrew indexing service. I consider it absolutely critical that indexers have configurable or self-tuning rate limiters (be it at the indexer itself or at an upstream throttling service feeding the indexers downstream), as well as controls on batch indexing parameters (such as batch size / timeout flush triggers). You’ve probably read about using bulk indexing - it’s all true, follow that advice. One important piece I’d like to add that is probably less often considered is the total number of outstanding indexing requests your indexers can hold against a cluster. If you reference the capacity planning section, your nodes each have an active and queued bulk thread limit; your total number of outstanding bulk index requests shouldn’t exceed this.
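Our indexing service is homebrew, but the outstanding-request cap is easy to illustrate. A minimal sketch, with hypothetical names (send_fn stands in for whatever actually POSTs to _bulk):

```python
import threading

class BoundedBulkIndexer:
    """Sketch: cap the number of in-flight bulk requests against ES.

    max_in_flight should not exceed the cluster-wide bulk thread pool
    capacity (active + queued), per the text.
    """

    def __init__(self, send_fn, max_in_flight):
        self._send_fn = send_fn
        self._slots = threading.Semaphore(max_in_flight)

    def submit(self, batch):
        # Block the caller instead of overrunning the cluster's
        # bulk queues; backpressure propagates upstream naturally.
        self._slots.acquire()
        try:
            return self._send_fn(batch)
        finally:
            self._slots.release()
```

Rate limiting and batch size/timeout flush triggers would sit alongside this in a real indexer.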

If we spec a cluster for 60K docs/sec., we lock our indexers at 60K and queue upstream if the total ingest exceeds the ES capacity specification (and of course fire off an espresso and scale the cluster). Indexing control is incredibly valuable, yet I tend to see it skipped quite often and addressed instead by slapping on more resources.

Performance note from the previous section mentioning the cost of bulk threads: while playing with your bulk request rate vs size, I would balance the performance you’re squeezing from your indexers against the bulk thread utilization on the ES data nodes. For instance if you max everything out, see how far you can scale back your request rates at the indexers without impacting total performance. You’ll likely open up some CPU capacity on the ES cluster.

For query: basically the same story. We have a homebrew query service that treats queries just like indexing. Indexing and query are both resource-demanding workloads, right? We only allow so many concurrent queries, with a TTL’d query queue. One caveat: until ElasticSearch gains improved running-query controls, the actual impact of a single query is an almost non-deterministic workload if you’re exposing it to external users. Unless you syntactically prevent it, there’s currently nothing stopping a user from submitting a single query that will chew up 40 minutes on 1,100 cores followed by cataclysmic garbage collections. Query rate and concurrency, on the other hand, are at least approachable control points.
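The TTL’d query queue idea might look roughly like this (a sketch, not our actual service; names and the explicit clock parameter are for illustration):

```python
import time
from collections import deque

class TTLQueryQueue:
    """Sketch: bounded query concurrency plus a TTL on queued queries.

    Queries that wait past their TTL are dropped rather than run late.
    """

    def __init__(self, max_concurrent, ttl_seconds):
        self.max_concurrent = max_concurrent
        self.ttl = ttl_seconds
        self.active = 0
        self.waiting = deque()  # (enqueue_time, query)

    def enqueue(self, query, now=None):
        self.waiting.append((now if now is not None else time.time(), query))

    def next_runnable(self, now=None):
        """Expire stale queries, then admit one if a slot is free."""
        now = now if now is not None else time.time()
        while self.waiting and now - self.waiting[0][0] > self.ttl:
            self.waiting.popleft()  # TTL exceeded: drop it
        if self.waiting and self.active < self.max_concurrent:
            self.active += 1
            return self.waiting.popleft()[1]
        return None

    def finish(self):
        self.active -= 1
```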

For cluster activity: I think this one is pretty neat. In the early days of ElasticSearch in production (the version 0.90 era), there’d be fantastic outages triggered by, say, an initial imbalance of primary shards over the available data nodes. One data node would hit the high watermark and kick off a shard rebalance on an older index, which I of course left totally unthrottled. The combined double indexing load (the cluster wasn’t designed for 2 primaries on a single node in this scenario) and shard migration would cause full GC pauses that exceeded the cluster timeout, ejecting the node from the cluster.

Then you start writing wonky one-liners. “What’s the shard distribution over my nodes?”:

$ curl -s localhost:9200/_cat/shards | awk '{if($3~/p/) p[$(NF)]++}; {if($3~/r/) r[$(NF)]++} END {for(x in p) if(p[x] >1) print x,p[x]," primaries"} END {for(x in r) if(r[x] >1) print x,r[x]," replicas"}'
xxx-search05 3  primaries
xxx-search06 2  primaries
xxx-search07 6  primaries
xxx-search08 3  primaries
xxx-search05 3  replicas
xxx-search06 2  replicas
xxx-search07 3  replicas
xxx-search08 4  replicas
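If awk isn’t your thing, the same check translates to a few lines of Python over the _cat/shards output (the third column is p/r, the last is the node name):

```python
from collections import Counter

def shard_skew(cat_shards_text):
    """Same check as the awk one-liner: count primaries and replicas
    per node, and report nodes holding more than one of either."""
    primaries, replicas = Counter(), Counter()
    for line in cat_shards_text.splitlines():
        cols = line.split()
        if len(cols) < 4 or cols[2] not in ("p", "r"):
            continue  # skip headers, blank lines, unassigned shards
        (primaries if cols[2] == "p" else replicas)[cols[-1]] += 1
    return ({n: c for n, c in primaries.items() if c > 1},
            {n: c for n, c in replicas.items() if c > 1})
```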

The real fix was to simply control this activity. A lot of it came down to changes in settings (e.g. the primary shard isolation weight [now deprecated I believe]). The rest was solved by quickly whipping together a settings scheduler service I’ve named sleepwalk. It allows us to apply transient cluster settings on a time schedule. For instance, “allow 0 rebalances and more aggressively limit transfer rates between 9AM and 9PM, otherwise, go nuts.”
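The core of a sleepwalk-style scheduler is just picking which transient settings body to PUT to _cluster/settings based on the clock. A sketch (the windows and values are illustrative, the setting names are real dynamic cluster settings):

```python
# Daytime lockdown vs. overnight free-for-all, in the spirit of sleepwalk.
DAY = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 0,
        "indices.recovery.max_bytes_per_sec": "20mb",
    }
}
NIGHT = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 5,
        "indices.recovery.max_bytes_per_sec": "2gb",
    }
}

def settings_for(hour):
    """Pick the settings body for the current hour: locked down during
    peak indexing (9AM-9PM here), otherwise go nuts."""
    return DAY if 9 <= hour < 21 else NIGHT
```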

Since most workloads fluctuate with daytime highs to nighttime lows, it was easy to establish an index lifecycle. I lock down general cluster activity during peak indexing, optimize and replicate (more on that later) the index using other tooling, then open the cluster up for whatever it needs to do:

img

Here, we allow 5 concurrent shard rebalances at fairly unrestricted transfer rates. Then right around the ‘sleepwalk end’ annotation, our tasks finish and the indexing rate starts to kick back up.

Naturally, if you set out to control cluster activity as a load inducing input, you’ll figure it’s equally important to monitor things like shard movements or recoveries like you would GCs or storage utilization. Definitely monitor that stuff. Tracking shard movements/states is tremendously useful.

For a quick-n-easy hack, it works incredibly well and completes our 3rd control point. Having all of these measures implemented (in addition to a lot of time testing and tuning between instance types, storage and cluster topologies, JVM foolery), we’ve brought tremendous performance and stability improvements to our fleet.

Other considerations

You probably still have some thoughts in mind. “So should I use SSDs?”, “Do I need 128GB of memory for the page cache?”

I’m not a fan of prescribing generalized performance recommendations without context. Why not? Let me conjure up a personal quote on systems design:

Saying “an HDD is too slow” is like saying “steel is too heavy, it can’t float”, concluding you shouldn’t build a ship from it.

Context and specifications matter. Think about requirements in terms of “search x must return in 100ms” or “max size segment flushes should never take more than 5 seconds”. Start designing your system while considering how each component helps you tick off those reqs.

In the case of SSDs of course, no matter what, the latency (and very likely sequential IO) will be significantly faster than HDDs. But does it matter for you? Take my workload for instance, it’s indexing driven and retention is everything. 200TB of ElasticSearch data is borderline pocket change. We have relatively large documents at ~2KB+ and absolutely burn cores on indexing (lz4 decompression/compression cycles or DEFLATE in ES 2.x). If I saturate a standard box on indexing where every core is pegged solid, the disks are mostly idle, saturated with a big sequential write, idle, saturated, idle, etc. I could swap the storage for SSDs and probably chop that flush time from 5s to 2s. But do I care? Not when data retention is my secondary capacity spec and I just ramped storage costs up by some multiple, yet acquired no meaningful gain according to the expected user experience standards.

What about query, you ask? I would think about your worst and average case latency tolerances. The app that my ES clusters back is asked the type of questions you might ask a Hadoop based app, so 5s query responses are relatively “fast” (having a single cluster with an aggregate sequential read throughput of >70GB/s is sort of cool & helpful). Most really long query responses are almost always complex regex that’s CPU rather than storage bound. That said, I tried i2 instances from the 2xlarge up to the 8xlarge. Basically, it generally sucked and definitely wasn’t worth it for us; the balance of storage to CPU perf was totally off. The indexing performance was worse than c3 instances with EBS optimized channels and GP2 storage, simply because of the doc rate/size and associated merge compression overhead. But the i2 may be the best instance for you.

But don’t get me wrong about hardware. Personally, my house is full of SSD everything because flash is awesome. The fact is that usually someone is paying you to design systems rather than to be awesome. My advice is to stick to thinking about systems design rather than subscribing to generalizations.

Ultimately, if I did have to generalize: you’ll probably need more CPU resources than most people would initially guess. On AWS, c4 and d2 instances are your friends. If you have low data volumes, lots of cash or are query latency sensitive: yeah, slap on the fastest storage you can afford.

An additional factor in your general cluster design that should dictate data node spec is minimum transfer times across the cluster. A number every system builder should remember is that with perfect saturation, you’re going to move a theoretical max of no more than 439 GB/hr. for each 1Gb of network throughput. If you thought stockpiling 20TB per node with 1Gb links was a good idea because it covers your indexing/query bandwidth and will save rack space, you’re asking for it. If you experience a node fault and need to evacuate the entire data set from it, measuring recovery in days sounds frightening.
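That arithmetic is worth keeping handy. A sketch using the clean theoretical figure of 450 GB/hr. per 1Gb of link (the ~439 above presumably bakes in a little overhead):

```python
def evacuation_hours(data_tb, link_gbit, efficiency=1.0):
    """Best-case hours to move data_tb terabytes over a link_gbit
    gigabit link. 1 Gb/s = 0.125 GB/s = 450 GB/hr, theoretical max."""
    gb_per_hour = link_gbit * (1e9 / 8) * 3600 / 1e9 * efficiency
    return data_tb * 1000 / gb_per_hour

print(round(evacuation_hours(20, 1), 1))  # 20TB over 1Gb: nearly two days
```

Which is exactly why 20TB per node behind a 1Gb link is asking for it.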

In our larger cluster designs, there’s definitely a factor of weighing the storage density and total volumes against attributes like cluster aggregate storage throughput and bisection network bandwidth. I think about questions like “What timelines should I expect if I have to re-replicate these 20 indices that hold 300TB of data?”.

Misc Reliability Ideas

and further performance thoughts

Reliability in many ways is similar to capacity planning, and they definitely relate. I’ll talk about two subclasses of reliability - stability and durability.

In terms of stability, this was mostly addressed in the previous section (size right and control inputs). I’d say the largest stability factor outside of that is query complexity and volume. I feel like I see two categories of query workloads:

  • many and small
  • few and large

The many and small being those apps that serve up “____ thousand queries per second” but the query complexity and time ranges (or referenced data set sizes) are relatively small. The few and large is what I deal with: “let’s take these 25 billion docs and make a 7 layer regex dip out of it”. The greatest impact from really complex queries targeting large datasets is sustained CPU usage and the follow up massive GCs and long pauses. If you can’t syntactically prevent it (let’s say those wild queries are actually required), the best bet is to probably go with the query throttling method I previously mentioned and aggressively restrict the active query concurrency.

If you truly want to scale up higher volumes of complex queries or really high rates of small queries, there’s a variety of tiered node cluster architectures that are often implemented. For instance, if you had a relatively low indexing volume and huge query volume: build a pool of indexing nodes (defined via tags) and a separate pool of query nodes, and crank up the replica count. At this point, you could view the replica nodes as a single chunk of capacity (cores, heap memory) set aside for query. Need more query capacity? Grow the pool from 240 cores / 960GB heap to 360 cores / 1.44TB heap and spin up the replica count.

In regards to durability, ElasticSearch’s most obvious control functionality is shard replication. Excluding the galactic topic of Byzantine-style failures, replication strategy has long reaching performance and total capacity considerations. Out of the box, I’m going to guess most people are cutting indexes with the default 1 replica per primary setting. The obvious cost to this is doubling the storage consumption. The less obvious cost is that indexing into a replicated index performs worse than writing into a non-replicated one. And what’s less obvious still, until you measure it, is that performance consistency with replication has always been erratic, at least in my experience. Indexing latency and resource usage fluctuate over a greater range than they do writing into a non-replicated index. Is it important? Maybe, maybe not.

It did give me some ideas, though. It sounds crazy but you should ask yourself: “Do I actually need the active index to be replicated?”. If it’s not a hard yes, having no replicas is actually pretty incredible. Cluster restart times are really fast (this has diminished in newer versions of ElasticSearch due to improved recovery mechanics) and you can seriously squeeze awesome indexing performance per node. And the more predictable performance consistency of a non-replicated index means that I can run data nodes closer to the edge of my headroom / spare capacity line without randomly flipping too far into it and starving out other workload demands. Sounds dangerous, right? Of course it is. Basically, have a good replay mechanism and checkpoint often by rolling the active index and replicating the one you just phased out. Lose a node? Cut another index to pick up new writes, allocate the missing shard in the impacted index and replay your data. Older replicated indices will be recovering on their own in the background.

Similarly, you should ask yourself: “Does my data lose value as it ages?”. If so, don’t carry replicas for the entire lifecycle of the index. Off-cluster storage (such as s3 or the massive warm storage in your datacenter) very likely has a lower $/GB cost. If you hold on to 90 days worth of indices but 95% of your queries hit the most recent 7 days, consider sending snapshots off-cluster and flipping older indices to no replication. If it fits in your SLO, it’s an easy shot at slashing the footprint of a huge cluster.
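Sketched as tooling, that age-out policy is just a loop emitting snapshot and replica-count calls. The repo name and cutoff below are hypothetical; fit them to your SLO:

```python
def age_out_actions(index_ages_days, replica_cutoff_days=7,
                    snapshot_repo="my_backups"):
    """Sketch: for indices older than the hot window, emit the API calls
    you'd make -- snapshot off-cluster, then drop replicas."""
    actions = []
    for index, age in index_ages_days.items():
        if age > replica_cutoff_days:
            # Snapshot the index to the off-cluster repository...
            actions.append(("PUT", f"/_snapshot/{snapshot_repo}/{index}",
                            {"indices": index}))
            # ...then flip it to no replication to halve its footprint.
            actions.append(("PUT", f"/{index}/_settings",
                            {"index": {"number_of_replicas": 0}}))
    return actions
```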

Lastly, both of these topics talk a lot about playing with shard counts. I should caveat that shards are not free (even when they carry no data) in that every shard occupies some space in the cluster state. All this metadata is heartbeated to every node in the cluster, and I’ve definitely experienced some strange reliability problems in high shard / high node count clusters. Always aim for the fewest shards you can get away with, all things considered.

Upgrades

and other live cluster mutations

Distributed database upgrades, everyone’s favorite. I think one of the first questions I always get about our fleet (strangely) is “How long do your rolling restarts take? Ours take days!” Oh man. That’s unreasonably slow.

Option 1 is obvious. If you’re super SLA relaxed and can schedule an outage window, do a full stop > upgrade > start, and it’s good to go in ~3 minutes (I made it sound simple, but there actually are some pre-flight sanity checks and other operations stuff that happen).

Option 2, if you live in reality and zero downtime is your goal, I actually dislike the rolling upgrade/restart method. So much data shuffling, coordination and time. Option 2 is to:

  1. Splice new nodes into your cluster (with either rebalances set to 0 or some allocation exclusion on them so rebalances don’t kick off willy nilly).

img

  2. Cut a new index that’s allocated strictly to the new nodes. Start writing inbound data to this index.
  3. Then, set exclusions on the older indices and throttle the rebalances (using the cluster.routing.allocation.node_concurrent_recoveries and indices.recovery.max_bytes_per_sec settings).

img

  4. When the original nodes are clear of data, remove them.

img
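For reference, the settings bodies behind that dance might look like this (node names are hypothetical; both settings are real dynamic settings applied via _cluster/settings):

```python
# Drain the deprecated nodes: exclude them from allocation so their
# shards migrate off to the freshly spliced-in nodes.
exclude_old_nodes = {
    "transient": {
        "cluster.routing.allocation.exclude._name": "old-node-01,old-node-02"
    }
}

# Throttle the resulting recoveries so the migration doesn't starve
# live indexing/query workloads; spin these knobs (or schedule them
# with something sleepwalk-like) as capacity allows.
throttle_recoveries = {
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "indices.recovery.max_bytes_per_sec": "200mb",
    }
}
# PUT each body to /_cluster/settings; when _cat/allocation shows the
# excluded nodes empty, terminate them.
```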

The new nodes could be simply new configurations, new version of ElasticSearch (but don’t try this across major versions ;)), or even part of a recovery operation (we don’t try to fix nodes; at the first sight of a fault, we simply stick a new one in the cluster and set an exclusion on the faulted node).

Once you have this operationally wired, it’s pretty incredible to watch. Chuck in the new nodes and simply spin (or schedule with sleepwalk) the node_concurrent_recoveries and recovery.max_bytes_per_sec knobs, wait until the deprecated nodes are clear, then terminate.

We have a ton of ElasticSearch tricks, but I have to say this is probably the most powerful relative to how simple it is. From my perspective, it was the best way to slide around petabytes of ES multiple times with zero downtime.

Final Words

Hopefully this post struck an interesting balance between helpful systems ideology and some light ranting. It’s a little glimpse into one of the many incredibly cool projects that I’ve been lucky enough to work on.

