The in-game performance of OpenRW was worse than it should be. I investigated why and made great performance improvements on an important platform: laptop GPUs. This post covers the investigation and the changes made.

Testing Performance

OpenRW has a “test mode” that spawns several vehicles and pick-ups in a central location. Performance in this spot isn’t great, so I started there. I took samples from two angles, looking at the horizon and looking at the ground. There was quite a difference in performance between the two situations.

View      Frame time   Draws
Ground    7.3 ms       191
Horizon   17.7 ms      1035

CPU Profiling with Perf

To understand what was happening, I began profiling with perf. I ran it as shown below: perf spawns sleep but records samples from rwgame (the process name for OpenRW), which is an easy way to record for a fixed length of time.

$ perf record -p $(pidof rwgame) sleep 15

Once sleep exits, perf record writes out a file containing the samples it took. The data can be viewed in the terminal by running perf report. I recorded a few sessions looking at the horizon, and the report was roughly the same each time.

Perf Report Output

The two most-sampled functions are within the Intel GPU driver, followed by Bullet, the physics library (I’m not sure what causes the odd symbol names like 0x0000000000007e45). It is clear that a lot of time is spent in the driver, more than I would expect for such a simple scene.

Since the CPU profile shows a lot of time in the driver, I suspected that the OpenGL usage was far from optimal. To see what was happening between OpenRW and the driver, I needed to take a look at the GL calls. There are some good tools for this these days, RenderDoc and apitrace being the most portable.

OpenGL debugging with apitrace

OpenRW doesn’t do anything complicated so using apitrace to inspect the raw OpenGL calls is easy enough. I captured a trace of the Horizon scene like so:

$ apitrace trace -o test_mode.trace ./rwgame/rwgame -t

The trace can be viewed with the GUI trace viewer QApitrace, or inspected with command line tools.

QApitrace showing repeated calls to glEnable(GL_BLEND)

This trace shows a couple of issues:

  • glEnable(GL_BLEND) is called repeatedly to no effect: a bug.
  • BufferData is called for each draw to upload data: expensive.
  • Each draw requires a new VAO to be bound: expensive.

State Caching

Calling glEnable(GL_BLEND) repeatedly isn’t going to do us any favours, and could be obscuring bugs elsewhere. This was the first thing I decided to fix, as it shouldn’t have been happening in the first place. The rendering system is intended to cache the state it has set and only change it when it has to, but the method responsible, OpenGLRenderer::setBlend(bool), didn’t even check whether we were trying to turn blending on: it just called glEnable(GL_BLEND) no matter what. I corrected that and immediately a few things broke:

Transparent elements not rendering correctly

The issue was just down to the map and particle system not enabling the blending state, even though they relied on it. This was overlooked previously because the bug in setBlend left the state enabled. After updating those call sites to set the correct state, it was fixed.
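The caching behaviour described in this section can be sketched roughly as follows. This is a minimal illustration, not OpenRW's actual code: FakeDriver, BlendStateCache and countStateChanges are hypothetical names, and FakeDriver stands in for the real glEnable(GL_BLEND) and glDisable(GL_BLEND) calls so that the example runs without a GL context.

```cpp
// Hypothetical stand-in for the driver so the sketch runs without a GL
// context. A real renderer would call glEnable(GL_BLEND) or
// glDisable(GL_BLEND) where setBlendEnabled is called.
struct FakeDriver {
    bool blendEnabled = false;  // GL_BLEND is disabled by default
    int stateChanges = 0;       // calls that actually reached the "driver"

    void setBlendEnabled(bool on) {
        blendEnabled = on;
        ++stateChanges;
    }
};

// Illustrative cache, not OpenRW's actual OpenGLRenderer.
class BlendStateCache {
public:
    explicit BlendStateCache(FakeDriver& driver) : driver_(driver) {}

    void setBlend(bool enable) {
        if (enable == cached_) {
            return;  // state already matches: skip the driver call
        }
        driver_.setBlendEnabled(enable);
        cached_ = enable;
    }

private:
    FakeDriver& driver_;
    bool cached_ = false;  // mirrors the driver's initial state
};

// Usage sketch: a transparent pass has to request blending itself,
// and repeated requests never reach the driver.
inline int countStateChanges() {
    FakeDriver driver;
    BlendStateCache cache(driver);
    cache.setBlend(true);   // particle pass: enables blending
    cache.setBlend(true);   // redundant: skipped
    cache.setBlend(false);  // opaque pass: disables blending
    cache.setBlend(false);  // redundant: skipped
    return driver.stateChanges;
}
```

With a cache like this, any code path that relies on blending has to request it explicitly, which is why the map and particle system stopped rendering correctly once the redundant call was removed.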

Buffer Updates

The next issue on the list is the per-draw data updates. Calling BufferData for each draw is expensive. I consulted the “Approaching Zero Driver Overhead” presentation for guidance, and decided to implement a simple solution:

  1. Determine the largest UBO size available from the driver.
  2. Decide how many draw calls worth of data can fit inside a single UBO.
  3. Then, per draw:
    1. If we have run out of space in our UBO, reset the counter and get new storage with BufferData.
    2. Use MapBuffer with the offset for our current draw data, copy the data into the buffer and unmap.
    3. Bind the UBO at the offset for the current draw, and Draw.

This isn’t as extreme as the persistent mapping that the AZDO presentation suggests, but it should at least relieve some of the pressure of re-allocating, for each draw, storage that the driver is still trying to use. It also has the benefit of not requiring OpenGL 4.4 (or the ARB_buffer_storage extension), unlike persistent mapping.
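The bookkeeping behind the steps above might look something like the sketch below. It only tracks offsets and allocations, so it runs without a GL context. DrawDataRing and alignUp are hypothetical names rather than OpenRW's own. The alignment handling is also an assumption on my part: in a real renderer the two limits would most likely come from GL_MAX_UNIFORM_BLOCK_SIZE and GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, since offsets passed to glBindBufferRange for a UBO must be a multiple of the latter.

```cpp
#include <cassert>
#include <cstddef>

// Rounds size up to the next multiple of alignment (alignment must be > 0).
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical helper that hands out per-draw offsets inside one UBO.
class DrawDataRing {
public:
    // maxBlockSize and offsetAlignment are values a real renderer would
    // query from the driver; here they are plain parameters.
    DrawDataRing(std::size_t maxBlockSize, std::size_t offsetAlignment,
                 std::size_t drawDataSize) {
        assert(offsetAlignment > 0 && drawDataSize > 0);
        stride_ = alignUp(drawDataSize, offsetAlignment);
        capacity_ = maxBlockSize / stride_;  // draws that fit in one buffer
        assert(capacity_ > 0 && "draw data must fit in the buffer");
        next_ = capacity_;  // start "full" so the first draw allocates
    }

    // Returns the byte offset at which the next draw's data should go.
    std::size_t nextOffset() {
        if (next_ == capacity_) {
            // Out of space: a real renderer would request fresh storage
            // here (BufferData with a null data pointer).
            ++allocations_;
            next_ = 0;
        }
        // A real renderer would now map this range, copy the draw data,
        // unmap, and bind the range before issuing the draw.
        return stride_ * next_++;
    }

    std::size_t stride() const { return stride_; }
    std::size_t capacity() const { return capacity_; }
    std::size_t allocations() const { return allocations_; }

private:
    std::size_t stride_ = 0;
    std::size_t capacity_ = 0;
    std::size_t next_ = 0;
    std::size_t allocations_ = 0;
};
```

For example, with a 16384-byte limit, a 256-byte offset alignment and 200 bytes of data per draw, each draw occupies a 256-byte slot and 64 draws share one allocation before new storage is requested.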

The result is a significant increase in performance, slicing several milliseconds off most frames. Revisiting the original test shows the impact of the buffer update strategy: the horizon view drops from 17.7 ms to 12.2 ms per frame, a pretty reasonable 30% improvement.

View      Frame time   Draws
Ground    6.7 ms       250
Horizon   12.2 ms      1063
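As a quick check of the arithmetic, the two tables can be compared with a couple of small helpers. The function names here are mine, and the comparison is only approximate because the draw counts differ slightly between the two runs.

```cpp
// Percentage reduction in frame time between two measurements.
constexpr double reductionPercent(double beforeMs, double afterMs) {
    return (beforeMs - afterMs) / beforeMs * 100.0;
}

// Frames per second equivalent of a frame time in milliseconds.
constexpr double framesPerSecond(double frameMs) {
    return 1000.0 / frameMs;
}

// Figures taken from the two tables in this post.
constexpr double horizonReduction = reductionPercent(17.7, 12.2);
constexpr double groundReduction = reductionPercent(7.3, 6.7);
```

The horizon view improves by about 31% (roughly 56 to 82 frames per second), while the ground view, which issues far fewer draws, improves by about 8%.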

What’s Next

While 30% is a modest improvement in performance, there is more to be gained from further changes to how rendering works:

  • Allocating one buffer to hold multiple sets of vertex and index data, to reduce buffer changes.
  • Copying draw data to buffers in larger batches.

But these can wait. Performance has improved enough already to make OpenRW playable so it is more important to fix gameplay issues and make sure it’s actually worth playing.