MotionMark: A New Graphics Benchmark

Sep 21, 2016

by Jon Lee

Co-written with Said Abou-Hallawa and Simon Fraser

Today, we are pleased to introduce MotionMark, a new graphics benchmark for web browsers.

We’ve seen the web grow in amazing ways, making it a rich platform capable of running complex web apps, rendering beautiful web pages, and providing user experiences that are fast, responsive, and visibly smooth. With the development and wide adoption of web standards like CSS animations, SVG, and HTML5 canvas, it’s easier than ever for a web author to create an engaging and sophisticated experience. Since these technologies rely on the performance of the browser’s graphics system, we created this benchmark to put it to the test.

We’d like to talk about how the benchmark works, how it has helped us improve the performance of WebKit, and what’s in store for the future.

Limitations of Existing Graphics Benchmarks

We needed a way to monitor and measure WebKit rendering performance, and looked for a graphics benchmark to guide our work. Most graphics benchmarks measured performance using frame rate while animating a fixed scene, but we found several drawbacks in their methodology.

First, some test harnesses used setTimeout() to drive the test and calculate frame rate, but that could fire at more than 60 frames per second (fps), causing the test to try to render more frames than were visible to the user. Since browsers and operating systems often have mechanisms to avoid generating frames that will never be seen by the user, such tests ran up against these throttling mechanisms. In reality, they only tested the optimizations for avoiding work when a frame was dropped, rather than the capability of the full graphics stack.

Second, most benchmarks we found were not written to accommodate a wide variety of devices. They failed to scale their tests to accommodate hardware with different performance characteristics, or to leave headroom for future hardware and software improvements.

Finally, we found that benchmarks often tested too many things at once. This made it difficult to interpret their final scores. It also hindered iterative work to enhance WebKit performance.

The Design of MotionMark

We wanted to avoid these problems in MotionMark. So we designed it using the following principles:

Peak performance. Instead of animating a fixed scene and measuring the browser’s frame rate, MotionMark runs a series of tests and measures how complex the scene in each test can become before falling below a threshold frame rate, which we chose to be 60 fps. Conveniently, it reports the complexity as the test’s score. And by using requestAnimationFrame() instead of setTimeout(), MotionMark avoids drawing at frame rates over 60 fps.
Test simplicity. Rather than animating a complicated scene that utilized the full range of graphics primitives, MotionMark tests draw multiple rendering elements, each of which uses the same small set of graphics primitives. An element could be an SVG node, an HTML element with CSS style, or a series of canvas operations. Slight variations among the elements avoid trivial caching optimizations by the browser. Although fairly simple, the chosen effects aim to reflect techniques commonly used on the web. Tests are visually rich, and are designed to stress the graphics system rather than JavaScript.
Quick to run. We wanted the benchmark to be convenient and quick to run while maintaining accuracy. MotionMark runs each test within the same period of time, and calculates a score from a relatively small sample of animation frames.
Device-agnostic. We wanted MotionMark to run on a wide variety of devices. It adjusts the size of the drawing area, called the stage, based on the device’s screen size.

Mechanics

MotionMark’s test harness contains three components:

The animation loop
The stage
A controller that adjusts the difficulty of the test

The animation loop uses requestAnimationFrame() to animate the scene. It measures the frame rate from the difference between consecutive frame timestamps, which it takes with performance.now().
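As a rough illustration of this kind of measurement (not MotionMark's actual harness code), the sketch below computes an average frame rate from a list of frame timestamps. The names averageFrameRate and runAnimationLoop, and the options they take, are hypothetical. The scheduler and clock are passed in so the sketch can also run outside a browser.

```javascript
// Hypothetical helper: average frame rate (fps) from frame timestamps in
// milliseconds, such as those returned by performance.now().
function averageFrameRate(timestamps) {
  if (timestamps.length < 2) return 0; // need at least one frame interval
  const elapsedMs = timestamps[timestamps.length - 1] - timestamps[0];
  if (elapsedMs <= 0) return 0; // guard against a clock that did not advance
  return ((timestamps.length - 1) * 1000) / elapsedMs;
}

// Hypothetical loop: `schedule` stands in for requestAnimationFrame and
// `now` for performance.now(); `drawFrame` renders one frame of the scene.
function runAnimationLoop({ schedule, now, drawFrame, frameLimit, onDone }) {
  const timestamps = [];
  function step() {
    timestamps.push(now());
    drawFrame();
    if (timestamps.length < frameLimit) {
      schedule(step);
    } else {
      onDone(averageFrameRate(timestamps));
    }
  }
  schedule(step);
}

// In a browser, this could be wired up as:
// runAnimationLoop({
//   schedule: (callback) => requestAnimationFrame(callback),
//   now: () => performance.now(),
//   drawFrame: () => { /* update and render the scene */ },
//   frameLimit: 120,
//   onDone: (fps) => console.log(fps),
// });
```

Because the callbacks are driven by requestAnimationFrame() in the browser case, the loop never asks for more frames than the display can show.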

For each frame in the animation loop, the harness lets the test animate a scene with a specified number of rendering elements. That number is called the complexity of the scene. Each element represents roughly the same amount of work, but may vary slightly in size, shape, or color. For example, the “Suits” test renders SVG rects with a gradient fill and a clip, but each rect’s gradient is different, its clip is one of four shapes, and its size varies within a narrow range.

The stage contains the animating scene, and its size depends on the window’s dimensions. The harness classifies the dimensions into one of three sizes:

Small: 568 x 320, targeting mobile phones
Medium: 900 x 600, targeting tablets and laptops
Large: 1600 x 800, targeting desktops
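The article does not spell out how the harness maps window dimensions to a stage size. As one plausible sketch, the function below picks the largest of the three stages that fits entirely inside the window and falls back to the small stage otherwise. The function name chooseStageSize and the fitting rule are assumptions for illustration only.

```javascript
// Stage sizes from the article, ordered from largest to smallest.
const STAGE_SIZES = [
  { name: "large", width: 1600, height: 800 },
  { name: "medium", width: 900, height: 600 },
  { name: "small", width: 568, height: 320 },
];

// Hypothetical rule (an assumption, not MotionMark's documented behavior):
// choose the largest stage that fits within the window, else use "small".
function chooseStageSize(windowWidth, windowHeight) {
  for (const size of STAGE_SIZES) {
    if (windowWidth >= size.width && windowHeight >= size.height) {
      return size;
    }
  }
  return STAGE_SIZES[STAGE_SIZES.length - 1];
}
```

Under this rule, a 1920 x 1080 window would get the large stage and a 1024 x 768 window the medium one.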

The controller has two responsibilities. First, it monitors the frame rate and adjusts the scene complexity by adding or removing elements based on this data. Second, it reports the score to the benchmark when the test concludes.

MotionMark uses this harness for each test in the suite, and takes the geometric mean of the tests’ scores to report a single score for the run.
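For readers unfamiliar with it, the geometric mean of n scores is the nth root of their product. A minimal sketch follows. The function name is ours, and it sums logarithms to avoid overflow when multiplying many large scores.

```javascript
// Geometric mean of positive scores, computed via logarithms so that the
// product of many large scores does not overflow.
function geometricMean(scores) {
  if (scores.length === 0) {
    throw new RangeError("at least one score is required");
  }
  let logSum = 0;
  for (const score of scores) {
    if (score <= 0) {
      throw new RangeError("scores must be positive");
    }
    logSum += Math.log(score);
  }
  return Math.exp(logSum / scores.length);
}
```

Unlike an arithmetic mean, the geometric mean keeps a single test with a very large score from dominating the result: a given percentage improvement in any one test changes the overall score by the same factor.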

The Development of MotionMark

The architectural modularity of the benchmark made it possible for us to do rapid iteration during its development. For example, we could iterate over how we wanted the controller to adjust the complexity of the tests it was running.

Our initial attempts at writing a controller tried to arrive at the exact threshold, or change point, past which the system could not maintain 60 fps. For example, we tried having the controller perform a binary search for the right change point. However, the measurement noise inherent in testing graphics performance at the browser level required the controller to run for a long time, which conflicted with our requirement that the benchmark be quick to run. In another example, we programmed a feedback loop using a technique found in industrial control systems, but we found the results unstable on browsers that behaved differently when put under stress (for example, dropping from 60 fps directly down to 30 fps).

So we changed our focus from writing a controller that found the change point at the test’s conclusion, to writing one that sampled a narrow range which was likely to contain the change point. From this we were able to get repeatable results within a relatively short period of time and on a variety of browser behaviors.

The controller used in MotionMark animates the scene in two stages. First, it finds an upper bound by exponentially increasing the scene’s complexity until the frame rate drops significantly below 60 fps. Second, it goes through a series of iterations, repeatedly starting at a high complexity and ending at a low complexity. Each iteration, called a ramp, crosses the change point: the scene animates slower than 60 fps at the higher bound, and animates at 60 fps at the lower bound. With each ramp, the controller tries to converge the bounds so that the test runs across the most relevant complexity range.
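The two stages can be sketched roughly as follows. This is a simplified illustration, not the controller's real code: measureFps is a hypothetical callback that animates the scene at a given complexity and returns the observed frame rate, the 0.8 ratio is our stand-in for "significantly below", and the narrowing of bounds between ramps is omitted.

```javascript
const TARGET_FPS = 60;

// Stage 1 (sketch): double the complexity until the measured frame rate
// falls well below the target. `dropRatio` and `maxComplexity` are assumed.
function findUpperBound(measureFps, dropRatio = 0.8, maxComplexity = 1 << 20) {
  let complexity = 1;
  while (
    complexity < maxComplexity &&
    measureFps(complexity) >= TARGET_FPS * dropRatio
  ) {
    complexity *= 2;
  }
  return complexity;
}

// Stage 2 (sketch): one ramp that steps from `high` down to `low` in equal
// increments and records a (complexity, fps) sample at each step.
function runRamp(measureFps, high, low, steps) {
  if (steps < 2) {
    throw new RangeError("a ramp needs at least two steps");
  }
  const samples = [];
  for (let i = 0; i < steps; i++) {
    const complexity = Math.round(high - ((high - low) * i) / (steps - 1));
    samples.push({ complexity, fps: measureFps(complexity) });
  }
  return samples;
}
```

With a simulated browser that holds 60 fps up to 300 elements and then loses 0.1 fps per extra element, findUpperBound stops at 512, and a ramp from 512 down to 128 yields samples on both sides of the change point.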

With the collected sample data the controller calculates a piecewise regression using least squares. This regression makes two assumptions about how increased complexity affects the browser. First, it assumes the browser animates at 60 fps up to the change point. Second, it assumes the frame rate either declines linearly or jumps to a lower rate when complexity increases past the change point. The test’s score is the change point. The score’s confidence interval is calculated using a method called bootstrapping.
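To make the regression concrete, here is a simplified sketch that covers only the "declines linearly" case. It assumes the frame rate is 60 fps up to the change point and then follows a straight line that starts at 60 fps. For each candidate change point it computes the least-squares slope in closed form and keeps the candidate with the smallest squared error. The function name, the sample shape, and the choice of candidates are our assumptions. The "jumps to a lower rate" case and the bootstrapped confidence interval are not shown.

```javascript
// Fit fps(c) = targetFps                          for c <= changePoint
//        fps(c) = targetFps + slope * (c - cp)    for c >  changePoint
// by trying every sampled complexity as the change point.
function fitChangePoint(samples, targetFps = 60) {
  const candidates = [...new Set(samples.map((s) => s.complexity))].sort(
    (a, b) => a - b
  );
  let best = null;
  for (const changePoint of candidates) {
    // Least-squares slope for a line through (changePoint, targetFps).
    let sumXY = 0;
    let sumXX = 0;
    for (const s of samples) {
      if (s.complexity > changePoint) {
        const x = s.complexity - changePoint;
        sumXY += x * (s.fps - targetFps);
        sumXX += x * x;
      }
    }
    const slope = sumXX > 0 ? sumXY / sumXX : 0;

    // Total squared error of both segments for this candidate.
    let error = 0;
    for (const s of samples) {
      const x = Math.max(0, s.complexity - changePoint);
      const residual = s.fps - (targetFps + slope * x);
      error += residual * residual;
    }
    if (best === null || error < best.error) {
      best = { changePoint, slope, error };
    }
  }
  return best; // null when there are no samples
}
```

In this sketch the returned changePoint plays the role of the test's score.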

MotionMark’s modular architecture made writing new tests fast and easy. We could also replicate a test visually but use different technologies including DOM, SVG, and canvas by substituting the stage.

Creating a new test required implementing the rendering element and the stage. The stage required overriding three methods of the Stage class:

animate() updates the animation and renders one frame. This is called within the requestAnimationFrame() loop.
tune() is called by the controller when it decides to update the complexity of the animation. The stage is told how many elements to add or remove from the scene.
complexity() simply returns the number of rendering elements being drawn in the stage.
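As an illustration of the shape of such a stage, the sketch below defines a stand-in Stage base class and a toy subclass. MotionMark's real Stage class has its own constructor and helpers, which are not reproduced here. The MovingDotsStage name, the dot data, and the convention that tune() receives a signed count are all assumptions, and actual drawing is left out.

```javascript
// Stand-in base class for illustration; not MotionMark's actual Stage.
class Stage {
  animate() { throw new Error("animate() must be overridden"); }
  tune(count) { throw new Error("tune() must be overridden"); }
  complexity() { throw new Error("complexity() must be overridden"); }
}

// Toy stage whose rendering elements are dots moving horizontally.
class MovingDotsStage extends Stage {
  constructor(width) {
    super();
    this.width = width;
    this.dots = [];
  }

  // Advance every dot by one frame so the scene changes on each call.
  animate() {
    for (const dot of this.dots) {
      dot.x += dot.vx;
      if (dot.x < 0 || dot.x > this.width) {
        dot.x = Math.min(Math.max(dot.x, 0), this.width); // clamp to stage
        dot.vx = -dot.vx; // bounce off the edge
      }
      // A real stage would draw the dot here (DOM, SVG, or canvas).
    }
  }

  // Assumed convention: positive count adds elements, negative removes.
  tune(count) {
    if (count > 0) {
      for (let added = 0; added < count; added++) {
        const i = this.dots.length;
        // Slight per-element variation in position and speed.
        this.dots.push({ x: (i * 37) % this.width, vx: 1 + (i % 3) });
      }
    } else if (count < 0) {
      this.dots.splice(Math.max(0, this.dots.length + count));
    }
  }

  complexity() {
    return this.dots.length;
  }
}
```

Swapping the drawing code inside animate() while keeping tune() and complexity() unchanged is one way to render the same test with a different technology.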

Because some graphics subsystems try to reduce their refresh rate when they detect a static scene, tests had to be written such that the scenes changed on every frame. Moreover, the amount of work tied to each rendering element had to be small enough that all systems could handle animating at least one of them at 60 fps.

What MotionMark’s Tests Cover

MotionMark’s test suite covers a wide variety of graphics techniques available to web authors:

Multiply: CSS border radius, transforms, opacity
Arcs and Fills: Canvas path fills and arcs
Leaves: CSS-transformed <img> elements
Paths: Canvas line, quadratic, and Bezier paths
Lines: Canvas line segments
Focus: CSS blur filter, opacity
Images: Canvas getImageData() and putImageData()
Design: HTML text rendering
Suits: SVG clip paths, gradients and transforms

We hope to expand and update this suite with more tests as the benchmark matures and graphics performance improves.

Optimizations in WebKit

MotionMark enabled us to do a lot more than just monitor WebKit’s performance; it became an important tool for development. Because each MotionMark test focused on a few graphics primitives, we could easily identify rendering bottlenecks and analyze the tradeoffs of a given code change. In addition, we could ensure that changes to the engine did not introduce new performance regressions.

For example, we discovered that WebKit was spending time just saving and restoring the state of the graphics context in some code paths. These operations are expensive, and they were happening in critical code paths where only a couple of properties, such as the transform, were being changed. We replaced the operations with setting and restoring those properties explicitly.

On iOS, our traces on the benchmark showed a subtle timing issue with requestAnimationFrame(). CADisplayLink is used to synchronize drawing to the display’s refresh rate. When its timer fired, the current frame was drawn, and the requestAnimationFrame() handler was invoked for the next frame if drawing completed. If drawing did not finish in time when the timer fired for the next frame, the timer was not immediately reset when drawing finally did finish, which caused a delay of one frame and effectively cut the animation speed in half.

These are just two examples of issues we were able to diagnose and fix by analyzing the traces we gathered while running MotionMark. As a result, we were able to improve our MotionMark scores.

Conclusion

We’re excited to be introducing this new benchmark, and using it as a tool to improve WebKit’s performance. We hope the broader web community will join us. To run it, visit http://browserbench.org/MotionMark. We welcome you to file bugs against the benchmark using WebKit’s bug management system under the Tools/Tests component. For any comments or questions, feel free to contact the WebKit team on Twitter at @WebKit or Jonathan Davis, our Web Technologies Evangelist, at @jonathandavis.