Performance Testing a Game Engine


Context
The context behind this blog post is that, while creating my Debugging Gizmos (part 2) blog, I got the chance to run my engine on a friend's PC (thanks Ben :D). This opportunity surfaced a fair few issues with the engine's launch process - the code was making assumptions that it was not safe to make. But it also highlighted that the current performance of the engine leaves a lot to be desired. The performance is not terrible, but for what is being rendered, it should be a lot better.
This brings us to the point at which I am implementing the code for my Frustum Culling blog, which is one of the more obvious performance improvements that needed to be made. But I was not happy with leaving it at that, so the idea for this blog was born.
The flow of this blog is: first profile the current state of the engine, identify potential performance issues, make changes, and then re-profile to see whether the changes have made any improvements.
Profile Setup
Before actually profiling the code and gathering data, a test setup needs to be established so that the tests are roughly the same every time. To that end, I created the following scene:
The scene contains a debug camera as well as three game cameras viewing a cluster of cubes, plus a couple of high-mesh-count drones. The goal was a scene that is currently intensive, but which realistically shouldn't be difficult to run: each of the three game cameras renders at 1080p, the editor camera renders at 1440p, and there are very few triangles in view.
Data Gathering
The data is going to be gathered by using NVIDIA Nsight Graphics to launch the project, and then using its frame debugger to capture a frame's worth of data. From this frame capture, the timings and render commands for the frame can be inspected, which makes it possible to see which sections are taking the most time, and whether there is an excess of certain commands being run.
First Profile
For this first profile, I am taking a capture on my PC. Then, after the first improvement, I am going to gather data on a friend's PC as well. (Ideally this would have been done right from the start, but the launch process had some issues with depth texture formats that took a while to fix.) After all changes have been made, I am going to re-capture on my friend's PC to ensure that the performance improvements are not just local to my system. For all captures in between the first and last, the only PC being profiled will be mine.
Here are the PC specs and timings gathered for the initial profile:
PC-1:
Specs
- i7-6700K (4GHz, 4 cores)
- RTX 3070 Ti (8GB VRAM)
- 32GB RAM (3200MHz DDR4)
Data
Total frame render time: 5.965ms, split into:
| Area | Benchmark Time (ms) |
| --- | --- |
| Opaque models render | 2.1 |
| AO generation | 3.4 |
| Deferred buffers collation | 0.11 |
| OIT Skip | 0.18 |
| Post Processing | 0.11 |
| ImGui | 0.09 |
| Overall | 5.965 |
Areas to Investigate
After looking through the events window in NVIDIA Nsight, there were some glaringly obvious issues. These were:
- Many, many, MANY calls to vkMapMemory and vkUnmapMemory
- Creation and deletion of image views for every model rendered
- Lots of pipeline barriers in use
It is worth stating here that none of these issues were created intentionally - in fact, the opposite is true. Whilst writing the rendering flow, steps were taken, however unsuccessfully, to pre-emptively reduce all of these problem areas.
Improvements
Memory Mapping
The first thing on the agenda for improving performance is to remove the approximately 2800 calls to vkMapMemory or vkUnmapMemory every frame.
This looked to be by far the most performance-draining element of frame creation, taking around 1.5ms on the CPU simply to run the commands. This is NOT good. Ideally there would be zero mapping calls, as memory can be mapped once and only unmapped before deletion.
After investigating the code, it turned out that the flow the buffers were going through for a data update was as follows:
1. Map the buffer section that is to be updated
2. Copy the new data into that mapped section
3. Unmap the buffer
This flow results in the 2800 calls to vkMapMemory or vkUnmapMemory for the test frame. It also means that every buffer's memory is allocated with the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT flag, which essentially means it is not the fastest VRAM available on the GPU, and would run slower than a better memory type.
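For reference, the per-update flow looked roughly like the following - a simplified sketch with hypothetical names, not the engine's exact code:

```cpp
#include <cstring>
#include <vulkan/vulkan.h>

// Old flow (simplified): map -> copy -> unmap for every single buffer update.
// This also forces the buffer's memory to be VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT.
void UpdateBufferOldFlow(VkDevice device, VkDeviceMemory bufferMemory,
                         VkDeviceSize offsetToUpdate, VkDeviceSize updateSize,
                         const void* newData)
{
    void* mapped = nullptr;
    vkMapMemory(device, bufferMemory, offsetToUpdate, updateSize, 0, &mapped);
    std::memcpy(mapped, newData, static_cast<size_t>(updateSize));
    vkUnmapMemory(device, bufferMemory);
}
```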
What this flow should be, and what it has been swapped to after applying the changes, is:
1. Copy the new data into the pre-mapped staging buffer
2. Use vkCmdCopyBuffer to copy the staging buffer data into another buffer which uses the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT flag
This sounds like a fairly straightforward change, but in reality it was very finicky to get working correctly. It required a bunch of changes to how models are loaded and how the scene performs frustum culling - not to mention rewriting most of my buffer handling code :(
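In code, the new flow looks roughly like this - again a simplified sketch with hypothetical names, as the engine's actual buffer handling is more involved:

```cpp
#include <cstring>
#include <vulkan/vulkan.h>

// New flow (simplified): the staging buffer's memory is mapped once at creation
// and stays mapped, so no vkMapMemory/vkUnmapMemory calls happen per frame.
void UpdateBufferNewFlow(VkCommandBuffer transferCmd,
                         void* stagingMappedPtr, VkBuffer stagingBuffer,
                         VkBuffer deviceLocalBuffer, // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                         VkDeviceSize stagingOffset, VkDeviceSize offsetToUpdate,
                         VkDeviceSize updateSize, const void* newData)
{
    // 1. Copy the new data into the persistently-mapped staging buffer.
    std::memcpy(static_cast<char*>(stagingMappedPtr) + stagingOffset, newData,
                static_cast<size_t>(updateSize));

    // 2. Record a GPU-side copy from staging into the device-local buffer
    //    that the shaders actually read from.
    VkBufferCopy region{};
    region.srcOffset = stagingOffset;
    region.dstOffset = offsetToUpdate;
    region.size      = updateSize;
    vkCmdCopyBuffer(transferCmd, stagingBuffer, deviceLocalBuffer, 1, &region);
}
```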
Re-Profiling
After making the changes outlined above, there was another profiling session using the same test scene. The aim of this test was to ensure that the visual result was the same, and that the performance had improved. And, I am very happy to report, the 2800 calls to a memory mapping function are no longer there, which significantly reduces the number of events registered.
Here is the performance data for before and after the changes.
| Area | Before Change (ms) | After Change (ms) | Time Saved (ms) |
| --- | --- | --- | --- |
| Opaque models render | 2.1 | 0.52 | 1.58 |
| AO generation | 3.4 | 2.91 | 0.49 |
| Deferred buffers collation | 0.11 | 0.10 | 0.01 |
| OIT Skip | 0.18 | 0.17 | 0.01 |
| Post Processing | 0.11 | 0.10 | 0.01 |
| ImGui | 0.09 | 0.06 | 0.03 |
| Overall | 5.965 | 3.48 | 2.485 |
The largest improvement came in the rendering of the opaque models, which went down by a whopping 1.58ms. Seeing as it was only 2.1ms to begin with, that change is massive!
The general improvements across the other areas likely came from a combination of the cameras in the test scene not being in exactly the same position and rotation, and the buffers they pull their data from now living in a faster, more optimal memory type, speeding everything up overall. And, with there being no memory map calls, there is a lot less syncing between the CPU and GPU, which also speeds everything up.
So far these changes look very impressive!
Here are the specs for the second PC used in the testing, as well as its timings for the test frame (taken after the memory mapping improvements).
PC-2 Specs
- Ryzen 9 7900X (5.6GHz, 12 cores)
- RX 7900 XTX (24GB VRAM)
- 32GB RAM (6000MHz DDR5)
PC-2 Data:
| Area | Benchmark Time (ms) |
| --- | --- |
| Opaque models render | 0.10 |
| AO generation | 0.25 |
| Deferred buffers collation | 0.01 |
| OIT Skip | 0.01 |
| Post Processing | 0.04 |
| ImGui | 0.01 |
| Overall | 0.779 |
Image Views
Something else that was noticed while analyzing the frame data is that, for every opaque render, new image views were being created and destroyed. At first this was baffling, as that is not how the code works - it goes to great lengths to re-use data such as image views. However, this didn't seem to be reflected in the output.
After looking into it, it turned out that the issue was entirely not my fault!
The issue was with how AMD's FidelityFX handles image views. It was creating a new image view for every image passed in, on every frame, and then destroying them all again. This is probably an issue with the version of FFX that I am using (it's a couple of updates out of date), but if I were to update to a newer version then it would need to be re-hooked into the engine, which is not a fun process, and all of the custom modifications I have made would need to be re-applied. So instead I went into the ambient occlusion calculation code and un-commented a large chunk of code that was there to fix the exact issue I was having. And... it didn't change anything. I guess there was a good reason for it to be commented out. The best solution here is likely to update to the most recent version of AMD FFX, but that is not something I want to do anytime soon.
So this improvement is going to have to wait for a future blog post.
Pipeline Barriers
Pipeline barriers are synchronization commands that you can insert into a command buffer to achieve a bunch of different goals. In this context, they are being used to transition an image from one layout to another.
For example going from:
VK_IMAGE_LAYOUT_GENERAL
to being:
VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
Why this matters is that you state what layout an image is going to be in at the start of a render-pass. And, if the image provided is not in that layout when you start the pass, then you get a validation error.
So just have the textures in the right layout, right?
It's not quite that simple. Sometimes it is very hard to know what layout an image is currently in, or is going to be in. This is due to certain parts of the render flow being skipped in certain situations, or different branches being followed.
To bypass this issue, I had been relying on pipeline barriers to transition images from one layout to the other. The issue with this is that they are, by definition, a barrier in the render pipeline, which makes them not the most ideal thing to use very often, as each one adds a sync point to the frame's rendering.
To rectify this, I had to become more familiar with the existing render flow from an image layout's perspective, ensure that images are in the right layout at the right time, and make sure that render-passes use a logical start/final layout combination to hook into the rest of the frame.
There are no real examples of the engine's code changes to show here, as it was more of a general pass to remove unneeded calls that add pipeline barriers.
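For context, though, here is roughly what one of these layout-transition barriers looks like - a generic sketch (the stage and access masks depend on the surrounding work), not the engine's actual code:

```cpp
#include <vulkan/vulkan.h>

// Generic sketch: transition an image from GENERAL to COLOR_ATTACHMENT_OPTIMAL.
void TransitionToColorAttachment(VkCommandBuffer commandBuffer, VkImage image)
{
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_GENERAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    // Each of these is a sync point in the frame - hence wanting as few as possible.
    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,          // work that wrote the image
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // work that uses it next
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```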
Re-Profiling
After making the changes, the number of barriers in a frame went from 189 down to 61 (as there are still places where they are required). But in terms of time saved, the improvement was very minimal.
| Area | Before Change (ms) | After Change (ms) | Time Saved (ms) |
| --- | --- | --- | --- |
| Opaque models render | 0.52 | 0.48 | 0.04 |
| AO generation | 2.91 | 2.85 | 0.06 |
| Deferred buffers collation | 0.10 | 0.08 | 0.02 |
| OIT Skip | 0.17 | 0.15 | 0.02 |
| Post Processing | 0.10 | 0.09 | 0.01 |
| ImGui | 0.06 | 0.06 | 0.00 |
| Overall | 3.48 | 3.34 | 0.14 |
Lowering Quality of Features
Another potential avenue for improving frame times is to lower the quality of different elements within the frame. The only real options for the engine at this point would be to lower the resolution of the cameras' images (which I am not going to do for the time being), or to lower the ambient occlusion (AO) generation quality.
Currently the AO quality setting is FFX_CACAO_QUALITY_HIGHEST, but I felt like seeing how much of an impact lowering the setting would make. Here are the timings for the AO buffer at each setting:
| Setting | Time (ms) | Time Saved Over Default (ms) |
| --- | --- | --- |
| FFX_CACAO_QUALITY_HIGHEST | 2.85 | 0.00 |
| FFX_CACAO_QUALITY_HIGH | 2.27 | 0.58 |
| FFX_CACAO_QUALITY_MEDIUM | 1.63 | 1.22 |
| FFX_CACAO_QUALITY_LOW | 1.17 | 1.68 |
| FFX_CACAO_QUALITY_LOWEST | 0.77 | 2.08 |
Additionally, there is a separate flag called 'useDownsampledSsao', which is recommended to be turned on to improve performance. It is currently turned off, and turning it on actually resulted in worse performance. Why that is, I can only guess. My best idea is that I am running it on an NVIDIA GPU, not an AMD one, so the driver optimizations and hardware support that AMD cards would provide simply are not there on my GPU.
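For reference, here is roughly how the quality level and flag get fed to CACAO. The entry point and field names are assumptions based on the FidelityFX SDK headers - the FFX version hooked into the engine may differ - so treat this as a sketch rather than the engine's code:

```cpp
// ASSUMPTION: ffxCacaoUpdateSettings and these fields are per the FidelityFX
// SDK's ffx_cacao.h; the out-of-date FFX version in the engine may differ.
FfxCacaoSettings settings = currentCacaoSettings; // the engine's existing settings (hypothetical)
settings.qualityLevel = FFX_CACAO_QUALITY_MEDIUM; // ~1.22ms saved per the table above

// Despite the recommendation to enable useDownsampledSsao, it measured
// slower on my (NVIDIA) GPU, so it stays off.
ffxCacaoUpdateSettings(&cacaoContext, &settings, /*useDownsampledSsao=*/false);
```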
For now I am going to leave the AO generation on FFX_CACAO_QUALITY_HIGHEST with the downsample flag turned off, and look into whether NVIDIA provides an SDK for AO, so that at some point in the future the engine can have the hardware support advantage on both AMD and NVIDIA.
Multi-threading
One final modification that was really tempting me was to make the render thread use a thread pool instead of generating the command buffers sequentially. What this would mean is that instead of the engine running this list of events one after the other:
```cpp
enum class RenderedFrameSegmentType : unsigned int
{
    IMAGE_AVALIABLE = 0,
    OPAQUE_UI_RENDER = 1,

    // If opaque meshes on screen //
    OPAQUE_MODELS_RENDER = 2,
    AO_GENERATION = 3,
    DEFERRED_BUFFERS_COLLATED = 4,
    TEXT = 5,

    // No opaque meshes on screen //
    OPAQUE_MODELS_ALTERNATE_FLOW = 6,

    // If transparent meshes on screen //
    COPY_ACROSS_DEPTH_BUFFER = 7,
    TRANSPARENT_MODELS_RENDER = 8,
    OIT_COLLATE = 9,

    // No transparent meshes on screen //
    TRANSPARENT_MODELS_ALTERNATE_FLOW = 10,

    SCREEN_SPACE_REFLECTIONS = 11,
    POST_PROCESSING = 12,

#ifdef _ENGINE_BUILD
    DEBUG_GIZMOS = 13,
    IMGUI_RENDER = 14,
#endif
};
```
it would run the functions associated with each segment in parallel. This means a lot less waiting around for the CPU to finish building a frame, allowing for a lower frame time overall. But since thinking up this idea, I have come up with a far more ambitious one that will be the target of the next blog post!
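Still, to illustrate the shape of the thread pool version, here is a minimal sketch using std::async as a stand-in for a proper pool. The segment type and RecordCommands function are hypothetical, not the engine's actual code:

```cpp
#include <future>
#include <vector>

// Hypothetical segment type - each segment records into its own VkCommandBuffer,
// so no two threads ever touch the same command buffer (or command pool) at once.
struct RenderedFrameSegment
{
    void RecordCommands() { /* record this segment's command buffer */ }
};

void RecordFrameInParallel(std::vector<RenderedFrameSegment>& segments)
{
    std::vector<std::future<void>> tasks;
    tasks.reserve(segments.size());

    // Kick off the recording of every active segment in parallel.
    for (RenderedFrameSegment& segment : segments)
        tasks.push_back(std::async(std::launch::async, [&segment] { segment.RecordCommands(); }));

    // Wait for all recording to finish; submission still happens in segment order.
    for (std::future<void>& task : tasks)
        task.wait();
}
```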
Other Potential Changes
As a side-note to this section, it is also recommended to keep the number of command buffer submits to a minimum, and at the moment one frame consists of around 8 submits. This can be reduced a bit by combining multiple frame segments into single command buffers. For example, if OPAQUE_MODELS_RENDER runs, then AO_GENERATION and DEFERRED_BUFFERS_COLLATED will also always run, so there is no reason for them to be in separate command buffers.
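Even before merging the recordings themselves, those segments' command buffers can at least share one submit. A sketch, with hypothetical command buffer and queue names:

```cpp
#include <vulkan/vulkan.h>

// One vkQueueSubmit for segments that always run together, instead of one each.
void SubmitOpaquePass(VkQueue graphicsQueue, VkFence renderFence,
                      VkCommandBuffer opaqueModelsCmd,
                      VkCommandBuffer aoGenerationCmd,
                      VkCommandBuffer deferredCollateCmd)
{
    VkCommandBuffer segmentBuffers[] = { opaqueModelsCmd, aoGenerationCmd, deferredCollateCmd };

    VkSubmitInfo submitInfo{};
    submitInfo.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 3;
    submitInfo.pCommandBuffers    = segmentBuffers;

    vkQueueSubmit(graphicsQueue, 1, &submitInfo, renderFence);
}
```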
As another side-note, it is not actually required to re-record every command buffer every frame. If nothing has changed from the previous frame then the same command buffer can just be re-submitted with no CPU processing. This has not been added in this change, but will be looked into at some point in the future.
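The shape of that future change might look something like the following - entirely hypothetical names, shown only to illustrate the idea, and assuming the command pool was created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT:

```cpp
// Only re-record a segment's command buffer when its inputs changed; otherwise
// the buffer recorded on a previous frame is re-submitted with no CPU work.
if (segment.isDirty)
{
    vkResetCommandBuffer(segment.commandBuffer, 0);
    // ... re-record segment.commandBuffer here ...
    segment.isDirty = false;
}
// segment.commandBuffer is then submitted either way.
```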
Final Profile
As mentioned in the 'First Profile' section, I have re-captured the timings on my PC as well as on my friend's PC.
Here is the data after all of the changes, along with an analysis of those changes.
PC-1:
Data
| Area | Benchmark Time (ms) | Final Time (ms) | Time Saved (ms) |
| --- | --- | --- | --- |
| Opaque models render | 2.1 | 0.47 | 1.63 |
| AO generation | 3.4 | 2.03 | 1.37 |
| Deferred buffers collation | 0.11 | 0.08 | 0.03 |
| OIT Skip | 0.18 | 0.15 | 0.03 |
| Post Processing | 0.11 | 0.09 | 0.02 |
| ImGui | 0.09 | 0.06 | 0.03 |
| Overall | 5.965 | 2.48 | 3.485 |
Overall the time taken to render each frame is now 41.5% of what it was before! Or in other words, 58.5% of the render time has been saved.
It is a real shame that the AMD ambient occlusion generation takes up so much of the render frame. It accounts for 81.8% of the total render time, and if it were simply assisted by the drivers, it would be much faster.
PC-2
Data
| Area | Benchmark Time (ms) | Final Time (ms) | Time Saved (ms) |
| --- | --- | --- | --- |
| Opaque models render | 0.10 | 0.04 | 0.06 |
| AO generation | 0.25 | 0.17 | 0.08 |
| Deferred buffers collation | 0.01 | 0.01 | 0.00 |
| OIT Skip | 0.01 | 0.01 | 0.00 |
| Post Processing | 0.04 | 0.03 | 0.01 |
| ImGui | 0.01 | 0.01 | 0.00 |
| Overall | 0.779 | 0.395 | 0.384 |
For this PC setup, the overall time for each frame is now 50.7% of what it was originally - and that original profile was already taken after the memory mapping improvements. On this PC, the AMD ambient occlusion generation took 43% of the overall render time, which is much better than on the other PC, but still not ideal.
Conclusion
To conclude: PROFILE YOUR CODE!!
When creating the original versions of what has now been optimized in this blog post, I fully thought that what I was writing was super efficient, and that this would be reflected in the frame times. I was wrong. Only through profiling and looking at exactly what the issues were could I fix them and get the performance back up to a point I was happy with.
The tools I recommend for profiling code are:
- The built-in Visual Studio profilers (for memory leaks and CPU performance)
- RenderDoc for diagnosing larger-scale issues with rendering artifacts
As for the performance of the engine now: given that it only takes around 46% of the original time to render on average across both PCs tested, I am very happy with the improvements overall.
Thanks for reading :D