
Making the LinkedIn experimentation engine 20x faster

Co-authors: Alexander Ivaniuk, Jingbang Liu

At LinkedIn, we like to say that experimentation is in our blood because no production release at the company happens without experimentation; by “experimentation,” we typically mean “A/B testing.” The company relies on its employees to make decisions by analyzing data, and experimentation is the data-driven foundation of that decision-making process: it helps measure the precise impact of every change and release, and evaluate whether expectations meet reality.

LinkedIn’s experimentation platform operates at an extremely large scale:

  • It serves up to 800,000 QPS of network calls,
  • It serves about 35,000 concurrently running A/B experiments,
  • It handles up to 23 trillion experiment evaluations per day,
  • Average latency of experiment evaluation is 700 ns and the 99th percentile is 3 μs,
  • It is used in about 500 production services.

At the heart of this platform is the LinkedIn experimentation engine, or more simply, the “Lix Engine.” With the high number of evaluations and QPS that the Engine is expected to handle, and its wide adoption across the company, the library must be highly performant, resource-efficient, and backed by a stringent testing, verification, and release process. We know that even a small optimization of the Engine can have a big impact on the performance of production services across the company. Recently, we completed a major revamp to meet our growing needs.


Figure 1. LinkedIn services and Lix Engine

Figure 1 shows how LinkedIn services interact with the Lix Engine. First, the Lix Engine evaluates the A/B test request and returns either treatment or control. Based on the result from the Lix Engine, services return the corresponding version of the feature page to the member.

What is the Lix Engine and what does it do?

Conceptually, the Lix Engine is software that understands a domain-specific language for experimentation (i.e., Lix DSL), which is able to perform three functions:

  1. Randomized splitting of the population,
  2. Segmentation of the population,
  3. Assignment of an experiment variant for a given entity involved in experimentation.

One of the prerequisites of running A/B testing is the ability to split test populations into random, independent buckets. The Lix Engine can perform such a split seamlessly when provided relative weights of population buckets.
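
For illustration, here is a minimal sketch of how such deterministic, weighted bucketing can be implemented by hashing the experiment name together with the member ID. This is only a toy example using CRC32 and is not the actual hashing scheme used by the Lix Engine:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Toy sketch of deterministic, weighted bucketing (not the actual Lix hashing scheme). */
public final class WeightedSplitSketch {

    /** Maps (experiment, memberId) to a bucket according to the relative weights. */
    public static String assign(String experiment, long memberId,
                                String[] buckets, double[] weights) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        // Hash experiment + member id into a deterministic point in [0, total).
        CRC32 crc = new CRC32();
        crc.update((experiment + ":" + memberId).getBytes(StandardCharsets.UTF_8));
        double point = (crc.getValue() % 10_000) / 10_000.0 * total;

        // Walk the cumulative weights until the point falls into a bucket.
        double cumulative = 0;
        for (int i = 0; i < buckets.length; i++) {
            cumulative += weights[i];
            if (point < cumulative) {
                return buckets[i];
            }
        }
        return buckets[buckets.length - 1];
    }

    public static void main(String[] args) {
        // The 1:4.5 weighting from Figure 2, i.e. roughly 18.2% treatment / 81.8% control.
        String variant = assign("my-experiment", 123456L,
                new String[] {"treatment", "control"}, new double[] {18.2, 81.8});
        System.out.println(variant); // The same member always gets the same variant.
    }
}

Because the bucket is derived purely from the hash of the experiment name and member ID, the same member always lands in the same bucket for a given experiment, which is what makes evaluations deterministic.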


Figure 2. Randomized population splitting with 1:4.5 weighting

Another important aspect of experimentation is the ability to target a particular sub-population to release a feature to (e.g., all new college grads). The Lix Engine can split the population into multiple cohorts (called segments) and independently perform randomized splitting for each segment:


Figure 3. Segmentation of population

Both segmentation and randomized splits of population are wrapped in a package of Lisp-like experimentation DSL. For example, the randomized split from Figure 2 would be expressed in the DSL as (ab (all) [treatment 18.2 control 81.8]), and the segmented example from Figure 3 as (ab (is-student) [treatment 50 control 50] (is-job-seeker) [treatment 25 control 75]).

Lix DSL has multiple benefits:

  1. It is flexible: Lix DSL programs are delivered to services independently of code and configuration deployments, which allows the experimentation lifecycle to run independently of the code release lifecycle. Since Lix DSLs are shipped to production lightning-fast, they are also used as feature flags and for traffic routing configuration when speed of change or rollback is paramount.
  2. It is deterministic: Given the same experiment and experimentation DSL, a member will always be assigned the same treatment after evaluation, which means that we do not have to keep track of prior member allocations or perform remote calls to retrieve such information.
  3. It is restrictive on purpose: The DSL can only perform an essential set of operations and, for example, cannot execute loops or make recursive calls. This prevents language abuse and allows for easier static analysis.

Status two years ago

The Lix Engine was created around 2012 and was originally written in Clojure. Since then, it has been running at LinkedIn and evaluating experiments. Over time, it began to suffer from a number of different issues that stemmed from design flaws and the specifics of the chosen implementation language. These challenges are areas that we worked to fix in our recent revamping of the system.

Weak typing, edge cases, and permissive syntax
In order to support experiment evaluations, the Clojure Lix Engine defined the following data types in the DSL, including primitives and collections:


Figure 4. Legacy type system

This looked like a well-structured and well-defined type system, but it did not work well for two reasons:

  • There was no type-checking in the Engine core. All validations had to be performed at every step, which did not scale well.
  • There was no support for polymorphism, either static or dynamic. This made it hard to handle comparisons like (= “string” “string”) or (= 1 2.0)—all of these had to go through the same entry point in the code, and developers were responsible for properly identifying all possible combinations of input values and handling them in the proper order.
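
To illustrate the kind of code this forced, here is a hypothetical reconstruction (not actual engine code) of a single untyped “=” entry point that has to discover argument types by hand:

/** Hypothetical illustration (not actual engine code) of a single untyped "=" entry point. */
final class UntypedEqualsExample {

    static boolean dslEquals(Object left, Object right) {
        // Numeric comparisons must be widened by hand so that (= 1 2.0) can be evaluated.
        if (left instanceof Number && right instanceof Number) {
            return ((Number) left).doubleValue() == ((Number) right).doubleValue();
        }
        // String comparisons, e.g. (= "string" "string").
        if (left instanceof String && right instanceof String) {
            return left.equals(right);
        }
        // Every new type combination needs another branch, in exactly the right order,
        // and nothing stops a caller from passing incompatible or unexpected types.
        return left != null && left.equals(right);
    }
}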

Problems with memory, garbage collection, and execution speed. Over the years, we identified multiple performance issues in the Clojure Lix Engine. 

Memory. One major inefficiency was memory management. Due to the nature of Clojure’s lazy evaluation, a significant number of temporary objects (e.g., lazy sequences) were created in the JVM’s heap. These temporary objects were occupying at least 30% of the Java heap of the services using the Clojure Engine. 

Garbage Collection. Clojure’s immutable data structures were taking too much space in memory compared to regular Java collections. Large numbers of objects and a significant memory footprint caused sizeable GC pauses in production (~0.5s), as temporary objects could survive multiple GC cycles and move to the JVM’s old generation. 

Speed. There was not a single issue that led to bad performance; rather, it was a combination of:

  • Frequent use of reflection APIs,
  • Dynamic object and method discovery,
  • Weak typing and lazy evaluation on the Clojure side, and
  • Inefficient type casting, exception handling, and locking on the Lix DSL side.

No Clojure developers = no development. The other problem was the absence of proficient Clojure developers. There were very few engineers at the company who could understand and, more importantly, write Clojure code, so it became a drag on our productivity.

Rewrite

Goals
Maintaining one of the most frequently used libraries in the company made us care a lot about things that were missing or flawed in the first version of the Engine, so we started developing v2 with the following goals in mind.

The Engine must be fast, with:

  • Low memory and garbage footprint.
  • Fast execution speed.

The Engine must be easy to develop and maintain:

  • The code must be easy to understand.
  • Adding a new operation into the language must not take long.

The Engine must be safe:

  • Type safety is a must.
  • It must be hard for developers to make mistakes.
  • It must have extensive test coverage.

Choices
Before we started, we had to make a number of decisions:

  • Language choice. We made a proof-of-concept language parser and evaluator in Java. The results were astonishing: our code achieved 2-3 times better performance than the previous version without much optimization work! Given that Java is the most widely used language in the company and has the most extensive technology stack, the decision to go with Java became very easy for us.
  • Interpretation vs. byte-code generation. We ran a number of benchmarks and understood that, while byte-code generation could offer us the best performance, Lix DSL interpretation was good enough and would take us significantly less time to implement. As a result, we decided to interpret DSLs by parsing them into evaluation trees and then executing them.
  • Compile-time code generation vs. Java Reflection APIs. We decided to support operation overloads and to automatically route execution to the method that is compatible with the parameters and whose signature is the “best” match for them. According to our benchmarks, purpose-built type resolution code is about 3 times faster than Java reflection, at roughly 15 ns versus 45 ns per call.

Implementation

During the development of the Engine v2, we came up with many interesting ideas and made a lot of discoveries. We’ll share some of the most relevant ones here. 

Specification
To make the new implementation type safe, we had to specify the behavior of each element of the language, which included their contracts (namely: input argument types and return types for the operations). Such a specification also contains additional metadata, which makes it possible, for example, to read it and provide an editing experience in our experimentation UI. See the gist here for an example.

The specification file is used to auto-generate a Java implementation for the operations, which includes abstract methods for all defined operation overloads as well as the parameter resolution code, which will be discussed in a later section of this post.

Evaluation tree
Similar to the DSL syntax tree, we introduced a DSL evaluation tree, where each node can compute a value based on an input entity (e.g., member/guest), context, and values returned from subtrees. 

The tree is defined in the form of two different classes: AbstractEvalTree and AbstractEvalNode. The evaluation tree is a more advanced implementation than a conventional node-based tree because, instead of storing children in each node’s object, we moved all of that information into the evaluation tree class and stored it in the form of three arrays (an example can be found in Figure 5):

  • AbstractEvalNode[] nodes, which contain all nodes of the tree and where the root node is always at index 0,

  • byte[] childCounts, where i-th element is the number of children of node nodes[i] from the above array,

  • short[] childListStartPositions, where i-th element defines the start index of a list of children of node nodes[i] in the array nodes. Every node’s children occupy contiguous space in the array nodes, so in order to iterate over all the children of nodes[i], one must walk from nodes[childListStartPositions[i]] to nodes[childListStartPositions[i] + childCounts[i] -1].

Let’s consider the evaluation tree for the following DSL expression:

(and (= (string-property “osVersion”) “1.2.3”) (in (country-code) [“us” “gb”]))


Figure 5. Evaluation tree

The evaluation tree is stored in the following data structures.


Figure 6. Storage of evaluation tree
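
In code, this layout might look roughly like the following condensed sketch, with simplified class and field names; the real AbstractEvalTree and AbstractEvalNode classes contain considerably more machinery:

import java.util.function.IntConsumer;

/** Condensed sketch of the flattened evaluation tree layout (simplified names). */
abstract class EvalNode {
    /** Computes this node's value from the entity, the context, and its children's values. */
    abstract Object eval(EvalTree tree, int nodeIndex, Object entity);
}

class EvalTree {
    final EvalNode[] nodes;                 // all nodes of the tree; the root is always at index 0
    final byte[] childCounts;               // childCounts[i] = number of children of nodes[i]
    final short[] childListStartPositions;  // start index of nodes[i]'s children inside 'nodes'

    EvalTree(EvalNode[] nodes, byte[] childCounts, short[] childListStartPositions) {
        this.nodes = nodes;
        this.childCounts = childCounts;
        this.childListStartPositions = childListStartPositions;
    }

    /** The children of nodes[i] occupy a contiguous range of 'nodes', so iteration is a simple walk. */
    void forEachChild(int i, IntConsumer visitor) {
        int start = childListStartPositions[i];
        for (int c = start; c < start + childCounts[i]; c++) {
            visitor.accept(c);
        }
    }

    Object evaluate(Object entity) {
        return nodes[0].eval(this, 0, entity);
    }
}

For the expression in Figure 5, nodes[0] would hold the “and” node, and its two children would occupy a contiguous range of the nodes array starting at childListStartPositions[0].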

In the gist here, we consider a naive approach of implementing the tree in the form of nodes with the following structure.

Compared to the naive one, our current approach has a number of advantages:

  • Low memory overhead. Java is quite inefficient in terms of the memory representation of objects. The overhead for implementing the tree above in a 32-bit JVM, or a 64-bit JVM with compressed OOPs enabled, will be 60 bytes for the current approach:

                12 bytes per object

                48 bytes for 3 arrays (16 bytes per array)

        and 208 bytes for the naive approach:

                16 bytes per object

                192 bytes for 4 array lists (48 bytes per array list).

  • Faster execution. It is surprising, but we measured a 2.5x improvement in performance by switching from the naive approach to the current one, because:
    CPUs prefer sequential memory access and smaller data structures, as CPU caches are small and main memory access is quite slow;
    By switching to three plain arrays, we also eliminated the overhead of virtual calls on the Java ArrayList data structure;
    Our nodes generally perform lightweight operations, so tree traversal speed matters a lot.

Type resolution and code generation
Each evaluation node has its specific processing logic, but it also has to be built to handle any situation, which means that:

  1. The node at compile time does not know what types of values will be returned from children nodes.
  2. The node should be able to resolve overloads of an operation either at the time of building the evaluation tree or at run-time.

As we discussed before, we decided not to use the Java Reflection API and instead to have code that resolves the appropriate overload method based on the value types returned by children nodes. To avoid repeatedly writing boilerplate code, we decided to auto-generate stubs for operation overloads in the form of abstract methods and to auto-generate the parameter resolution code based on the DSL language specification.

Using auto-generated code significantly reduces implementation complexity, as developers only need to implement processing logic for specific argument types. We are also able to adjust the behavior of all DSL operations’ implementations, which saves us a lot of time.
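
As a rough sketch of the idea (with hypothetical class and method names rather than the actual generated code), a generated base class for the “=” operation could look like this, where the abstract overload stubs come from the specification and the evaluate method contains the generated parameter resolution logic:

/**
 * Rough sketch of an auto-generated operation class (hypothetical names).
 * The generator emits one abstract method per overload declared in the specification,
 * plus the parameter resolution code that picks the matching overload at run time.
 */
abstract class EqualsOperationBase {

    // Generated stubs: developers only implement the processing logic for concrete types.
    protected abstract boolean apply(String left, String right);
    protected abstract boolean apply(double left, double right);

    /** Generated dispatch: routes execution to the overload matching the argument types. */
    final boolean evaluate(Object left, Object right) {
        if (left instanceof String && right instanceof String) {
            return apply((String) left, (String) right);
        }
        if (left instanceof Number && right instanceof Number) {
            return apply(((Number) left).doubleValue(), ((Number) right).doubleValue());
        }
        throw new IllegalArgumentException(
                "No '=' overload for " + typeName(left) + " and " + typeName(right));
    }

    private static String typeName(Object o) {
        return o == null ? "null" : o.getClass().getSimpleName();
    }
}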

Remote call, short circuiting
Being in full control of the language, we were also able to perform some advanced optimizations. Let’s discuss one of these.

To perform segmentation, operations usually require member attribute data, which results in a network call and increased evaluation latency. While a fully local evaluation can be executed as fast as 50 ns, the remote call has a p99 latency of 4 ms. For example, the is-student operation we talked about in Figure 3 needs to fetch members’ education data, but some operations don’t require a remote call, so we can use that to our advantage and avoid the expensive query. In the example below, we will not perform the remote call if string-property “osVersion” does not return “7.1.1”:

(ab (and (ge (connection-count) 30) (= (string-property “osVersion”) “7.1.1”)) [treatment 50]) 

Technically, the optimization is to disable short-circuiting of “and” and “or” operations during local execution and to try to find at least one child branch of them that can be fully executed locally and returns “false.” 
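
A simplified way to express this two-pass idea for an “and” node is sketched below; the interface and method names are hypothetical, and the production logic is more involved:

import java.util.List;

/** Sketch of the remote-call optimization for an "and" node (hypothetical API). */
final class AndNodeSketch {

    /** Minimal child interface for the sketch; real evaluation nodes are far richer. */
    interface EvalChild {
        boolean canEvaluateLocally();                           // true if no member-attribute fetch is needed
        boolean evaluateLocally(Object entity);                 // valid only when canEvaluateLocally() is true
        boolean evaluateWithRemoteDataIfNeeded(Object entity);  // may trigger the expensive remote call
    }

    static boolean evaluateAnd(List<EvalChild> children, Object entity) {
        // Pass 1: with short-circuiting disabled, evaluate every local-only child and look for a
        // definite "false"; if one is found, the remote call is never made.
        for (EvalChild child : children) {
            if (child.canEvaluateLocally() && !child.evaluateLocally(entity)) {
                return false; // e.g. osVersion != "7.1.1": skip fetching member attributes entirely
            }
        }
        // Pass 2: nothing ruled the expression out locally, so pay for the remote call.
        for (EvalChild child : children) {
            if (!child.evaluateWithRemoteDataIfNeeded(entity)) {
                return false;
            }
        }
        return true;
    }
}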

Figure 7 shows the evaluation process with and without the remote call optimization.


Figure 7. Evaluation with and without remote call optimization

Performance optimization results
After all the work we did and all the optimizations we performed, we achieved the following improvements:

  • 20x faster sequential evaluation and 14.7x faster concurrent evaluation,
  • 10.2x memory footprint improvement,
  • 6x smaller temporary object generation rate and no GC hiccups > 50ms.

Verification

Arguably, the problem of verifying a programming language is a hard one because you have to deal with testing a highly dynamic system with an infinite number of possible states. Therefore, we performed the verification and release of the new Engine carefully and gradually.

Unit tests. We started by writing a huge number of unit tests to reach line coverage of at least 80-90% for the Engine runtime code.

Test case generator. In order to declare full feature parity with the old code, we had to prove that v1 and v2 of the Engine generate equal results. We figured that a perfect way of doing that was to take all production DSLs and run the v1 and v2 Engines on them while triggering all possible execution branches of the evaluation trees. This brought us to the idea of automatically generating test data, which turned out to be quite simple: by knowing the structure of a program and how each operator should work, we can parse the tree and recursively generate input data to trigger different program branches.

As an example, we can take the following DSL expression from Figure 5:

(and (= (string-property “osVersion”) “1.2.3”) (in (country-code) [“us” “gb”])) 

The following execution sequences are possible:

  1. (= (string-property “osVersion”) “1.2.3”) returns false and the return value of (in (country-code) [“us” “gb”]) does not matter.
    Property “osVersion” should then be assigned a random value other than “1.2.3” to trigger the branch.
  2. (= (string-property “osVersion”) “1.2.3”) returns true but (in (country-code) [“us” “gb”]) returns false.
    Property “osVersion” should be assigned “1.2.3”
    “Country-code” member attribute should be set to anything but “us” or “gb.”

  3. Both (= (string-property “osVersion”) “1.2.3”) and (in (country-code) [“us” “gb”]) return true.
    Property “osVersion” should be assigned “1.2.3”
    “Country-code” member attribute should be set to either “us” or “gb.”


Figure 8. Auto-generated test cases

By building a test suite for each of the parse tree branches and combining them recursively, we were able to build a comprehensive set of tests, which triggered 99.9% of all DSL execution branches, with 99% of branches returning more than one distinct value.
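
The following is a heavily simplified sketch of this recursive generation idea for a toy expression tree with only “and” and property-equality nodes. The types are hypothetical and do not reflect the real generator, which operates on the full Lix DSL:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Heavily simplified sketch of recursive test-case generation (hypothetical node types).
 * For each node we produce member attribute assignments that force it to evaluate to a
 * desired value, combining children recursively; e.g. an "and" is false if any child is false.
 */
final class TestCaseGeneratorSketch {

    interface Node {}
    record EqualsProperty(String property, String literal) implements Node {}
    record And(List<Node> children) implements Node {}

    /** A test case is simply a set of member attributes/properties to inject before evaluation. */
    static List<Map<String, String>> casesFor(Node node, boolean desired) {
        List<Map<String, String>> cases = new ArrayList<>();
        if (node instanceof EqualsProperty eq) {
            Map<String, String> attrs = new HashMap<>();
            attrs.put(eq.property(), desired ? eq.literal() : eq.literal() + "-not");
            cases.add(attrs);
        } else if (node instanceof And and) {
            if (desired) {
                // All children must be true: merge one "true" case per child.
                Map<String, String> merged = new HashMap<>();
                for (Node child : and.children()) {
                    merged.putAll(casesFor(child, true).get(0));
                }
                cases.add(merged);
            } else {
                // Any single "false" child is enough: emit one case per child.
                for (Node child : and.children()) {
                    cases.addAll(casesFor(child, false));
                }
            }
        }
        return cases;
    }
}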

Distributed offline verification in Hadoop. Having 99.9% branch coverage was not good enough for us, so we took 4,000 distinct combinations of runtime parameters and computed their cross product with 2,000,000 distinct sets of members’ attributes in our production Hadoop cluster, which resulted in 8,000,000,000 test cases. We used those to run a distributed verification on thousands of cores, which compared v1 and v2 of the Engine and reached 99.9998% branch coverage of all production DSLs.

Ramping

Finishing the implementation is not the end of the story, however. In fact, rolling out the library and migrating hundreds of LinkedIn services to use v2 of the Engine was one of the most challenging parts of this project.

We developed a rollout plan for the library with the following goals:

  1. Migration is transparent to experimentation platform users. 
  2. Ramping is controllable by the team.
  3. Impact is measurable.

To control the ramping in a centralized manner, we released a version of the experiment client library with an internal A/B test to switch between v1 and v2 of the Engine. The new version was rolled out to all the services that use the library, so we were able to control the ramping and measure the impact with the A/B test.
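
Conceptually, the client library did something like the following sketch (with hypothetical names): it evaluates an internal experiment first and then delegates each call to either the old or the new Engine, which makes the ramp controllable and measurable like any other A/B test:

/**
 * Conceptual sketch of the centralized ramp switch (hypothetical names).
 * The client library evaluates an internal experiment and delegates each evaluation to
 * either the legacy engine (v1) or the new engine (v2), so the ramp can be controlled
 * centrally and its impact measured like any other A/B test.
 */
final class EngineSwitchSketch {

    interface Engine {
        String evaluate(String experiment, long memberId);
    }

    private final Engine engineV1;      // legacy Clojure engine
    private final Engine engineV2;      // new Java engine
    private final Engine rampSelector;  // evaluates the internal ramp experiment

    EngineSwitchSketch(Engine engineV1, Engine engineV2, Engine rampSelector) {
        this.engineV1 = engineV1;
        this.engineV2 = engineV2;
        this.rampSelector = rampSelector;
    }

    String evaluate(String experiment, long memberId) {
        // "lix-engine-ramp" is a made-up name for the internal switching experiment.
        boolean useV2 = "v2".equals(rampSelector.evaluate("lix-engine-ramp", memberId));
        return (useV2 ? engineV2 : engineV1).evaluate(experiment, memberId);
    }
}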


Figure 9. Rollout process

It took us 37 iterations and 8 months to ramp, measure, update, and iterate. During the rollout, we found a few issues with integration of the Engine code into target services that we were not able to catch locally, even after such rigorous testing. We also understood the importance of remote call caching midway through the ramping process and A/B tested up to three different “flavors” of caching logic in production simultaneously.

Impact

Given that our library is one of the most heavily used in LinkedIn’s infrastructure, any change we make to it has a tremendous impact. In this case:

  • We improved the performance of all API endpoints in the company by 0.5% on average;
  • Some critical services became as much as two times faster;
  • We freed up more than 4 terabytes of memory across all the services;
  • We achieved consistently better garbage collection performance across many thousands of machines.

Acknowledgements

We would like to thank all members of the T-REX team—without their hard work on the experimentation platform, this project would not have been possible. Big thanks to Igor Perisic, Kapil Surlaker, Ya Xu, Suja Viswesan, Vish Balasubramanian, and Shaochen Huang from the management team and T-REX alumnus Shao Xie for their continued investment and guidance. Many teams across LinkedIn collaborated with us to roll out the Engine, and we are thankful for their collaboration. Special thanks to the Experiment Data Science team and the EPC SRE team for the support.