Friday, February 17, 2017

Stingray Renderer Walkthrough #5: RenderDevice

Stingray Renderer Walkthrough #5: RenderDevice

Overview

The RenderDevice is essentially our abstraction layer for platform specific rendering APIs. It is implemented as an abstract base class that various rendering back-ends (D3D11, D3D12, OGL, Metal, GNM, etc.) implement.

The RenderDevice has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of which are fairly straightforward so I won’t cover them in this post, instead I will put my focus on the two dispatch functions consuming RenderResourceContexts and RenderContexts:


class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;

    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

Resource Management

As covered in the post about RenderResourceContexts, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called RenderDevice::dispatch() handing over the RenderResourceContexts as their representation gets created on the RenderDevice side.

All implementations of a RenderDevice have some form of resource management that deals with creating, updating and destroying of the graphics API specific representations of resources. Typically we track the state of all various types of resources in a single struct, here’s a stripped down example from the DX12 RenderDevice implementation called D3D12ResourceContext:


struct D3D12VertexBuffer
{
    D3D12_VERTEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12IndexBuffer
{
    D3D12_INDEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12ResourceContext 
{
    Array<D3D12VertexBuffer> vertex_buffers;
    Array<uint32_t> unused_vertex_buffers;

    Array<D3D12IndexBuffer> index_buffers;
    Array<uint32_t> unused_index_buffers;

    // .. lots of other resources

    Array<uint32_t> resource_lut;
};

As you might remember, the linking between the engine representation and the RenderDevice representation is done using the RenderResource::render_resource_handle. It encodes both the type of the resource as well as a handle. The resource_lut is an indirection to go from the engine handle to a local index for a specific type (e.g vertex_buffers or index_buffers in the sample above). We also track freed indices for each type (e.g. unused_vertex_buffers) to simplify recycling of slots.

The implementation of the dispatch function is fairly straight forward. We simply iterate over all the RenderResourceContexts and for each context iterate over its commands and either allocate or deallocate resources in the D3D12ResourceContext. It is important to note that this is a synchronous operation, nothing else is peeking or poking on the D3D12ResourceContext when the dispatch of RenderResourceContexts is happening, which makes our life a lot easier.

Unfortunately that isn’t the case when we dispatch RenderContexts as in that case we want to go wide (i.e. forking the workload and process it using multiple worker threads) when translating the commands into API calls. While we don’t allow allocating and deallocating new resources from the RenderContexts we do allow updating them which mutates the state of the RenderDevice representations (e.g. a D3D12VertexBuffer).

At the moment our solution for this isn’t very nice, basically we don’t allow asynchronous updates for anything else than DYNAMIC buffers. UPDATABLE buffers are always updated serially before we kick the worker threads no matter what their sort_key is. All worker threads access resources through their own copy of something we call a ResourceAccessor, it is responsible for tracking the worker threads state of dynamic buffers (among other things). In the future I think we probably should generalize this and treat UPDATABLE buffers in a similar way.

(Note: this limitation doesn’t mean you can’t update an UPDATABLE buffer more than once per frame, it simply means you cannot update it more than once per dispatch).

Shaders

Resources in the D3D12ResourceContext are typically buffers. One exception that stands out is the RenderDevice representation of a “shader”. A “shader” on the RenderDevice side maps to a ShaderTemplate::Context on the engine side, or what I guess we could call a multi-pass shader. Here’s some pseudo code:


struct ShaderPass
{
    struct ShaderProgram
    {
        Array<uint8_t> bytecode;
        struct ConstantBufferBindInfo;
        struct ResourceBindInfo;
        struct SamplerBindInfo;
    };
    ShaderProgram vertex_shader;
    ShaderProgram domain_shader;
    ShaderProgram hull_shader;
    ShaderProgram geometry_shader;
    ShaderProgram pixel_shader;
    ShaderProgram compute_shader;

    struct RenderStates;
};

struct Shader
{
    Vector<ShaderPass> passes;
    enum SortMode { IMMADIATE, DEFERRED };
    uint32_t sort_mode;
};

The pseudo code above is essentially the RenderDevice representation of a shader that we serialize to disk during data compilation. From that we can create all the necessary graphics API specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).

As discussed in the last post the sort_key encodes the shader pass index. Using Shader::sort_mode, we know which bit range to extract from the sort_key as pass index, which we then use to look up the ShaderPass from Shader::passes. A ShaderPass contains one ShaderProgram per active shader stage and each ShaderProgram contains the byte code for the shader to compile as well as “bind info” for various resources that the shader wants as input.

We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.

Render Context translation

Let’s move on and look at the dispatch for translating RenderContexts into graphics API calls:

class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

The first thing all RenderDevice implementation do when receiving a bunch of RenderContexts is to merge and sort their Commands. All implementations share the same code for doing this:

void prepare_command_list(RenderContext::Commands &output, unsigned n_contexts, RenderContext **contexts);

This function basically just takes the RenderContext::Commands from all RenderContexts and merges them into a new array, runs a stable radix sort, and returns the sorted commands in output. To avoid memory allocations the RenderDevice implementation owns the memory of the output buffer.

Now we have all the commands nicely sorted based on their sort_key. Next step is to do the actual translation of the data referenced by the commands into graphics API calls. I will explain this process with the assumption that we are running on a graphics API that allows us to build graphics API command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.

Before we start figuring out our per thread workloads for going wide, we have one more thing to do; “instance merging”.

Instance Merging

I’ve mentioned the idea behind instance merging before [1,2], basically we want to try to reduce the number of RenderJobPackages (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray “similar enough” basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages, the only thing that is allowed to differ are constant buffer variables. (Note: by todays standards this can be considered a bit old school, new graphics APIs and hardware allows to tackle this problem more aggressively using “bindless” concepts. )

The way it works is by filtering out ranges of RenderContexts::Commands where the “instance bit” of the sort_key is set and all bits above the instance bit are identical. Then for each of those ranges we fork and go wide to analyze the actual RenderJobPackage data to see if the instance_hash and the shader are the same, and if so we know its safe to merge them.

The actual merge is done by extracting the instance specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic RawBuffer that gets bound as input to the vertex shader.

Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger in itself is not graphics API specific and is isolated in its own system, it just happens to be the responsibility of the RenderDevice to call it. The interface looks like this:

namespace instance_merger {

struct ProcessMergedCommandsResult
{
    uint32_t n_instances;
    uint32_t instanced_batches;
    uint32_t instance_buffer_size;
};

ProcessMergedCommandsResult process_merged_commands(Merger &instance_merger, 
    RenderContext::Commands &merged_commands);

}

Pass in a reference to the sorted RenderContext::Commands in merged_commands and after the instance merger is done running you hopefully have fewer commands in the array. :)

You could argue that merging, sorting and instance merging should all happen before we enter the world of the RenderDevice. I wouldn’t argue against that.

Prepare workloads

Last step before we can start translating our commands into state / draw / dispatch calls is to split the workload into reasonable chunks and prepare the execution contexts for our worker threads.

Typically we just divide the number of RenderContext::Commands we have to process with the number of worker threads we have available. We don’t care about the type of different commands we will be processing and trying to load balance differently. The reasoning behind this is that we anticipate that draw calls will always represent the bulk of the commands and the rest of the commands can be considered as unavoidable “noise”. We do, however, make sure that we don’t do less than x-number of commands per worker threads, where x can differ a bit depending on platform but is usually ~128.

For each execution context we create a ResourceAccessors (described above) as well as make sure we have the correct state setup in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous serial sweep over all the commands to find bigger state changing commands (such as RenderContext::set_render_target).

This is where the Command::command_flags bit-flag comes into play, instead of having to jump around in memory to figure out what type of command the Command::head points to, we put some hinting about the type in the Command::command_flags, like for example if it is a “state command”. This way the serial sweep doesn’t become very costly even when dealing with large number of commands. During this sweep we also deal with updating of UPDATABLE resources, and on newer graphics APIs we track fences (discussed in the post about Render Contexts).

The last thing we do is to set up the execution contexts with create graphics API specific representations of command lists (e.g. ID3D12GraphicsCommandList in DX12),

Translation

When getting to this point doing the actual translation is fairly straight forward. Within each worker thread we simply loop over its dedicated range of commands, fetch its data from Command::head and generate any number of API specific commands necessary based on the type of command.

For a RenderJobPackage representing a draw call it involves:

  • Look up the correct shader pass and, unless already bound, bind all active shader stages
  • Look up the state blocks (Rasterizer, Depth stencil, Blending, etc.) from the shader and bind them unless already bound
  • Look up and bind the resources for each shader stage using the RenderResource::render_resource_handle translated through the D3D12ResourceAccessor
  • Setup the input assembler by looping over the RenderResource::render_resource_handles pointed to by the RenderJobPackage::resource_offset and translated through the D3D12ResourceAccessor
  • Bind and potentially update constant buffers
  • Issue the draw call

The execution contexts also holds most-recently-used caches to avoid unnecessary binds of resources/shaders/states etc.

Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve on this by doing at least parts of this tracking when building the RenderContexts but haven’t gotten around looking into it yet.

Execute

When the translation is done we pass the resulting command lists to the correct queues for execution.

Note: In DX12 this is a bit more complicated as we have to interleave signaling / waiting on fences between command list execution (ExecuteCommandList).

Next up

I’ve deliberately not dived into too much details in this post to make it a bit easier to digest. I think I’ve manage to cover the overall design of a RenderDevice though, enough to make it easier for people diving into the code for the first time.

With this post we’ve reached half-way through this series, we have covered the “low-level” aspects of the Stingray rendering architecture. As of next post we will start looking at more high-level stuff, starting with the RenderInterface which is the main interface for other threads to talk with the renderer.

Tuesday, February 14, 2017

Stingray Renderer Walkthrough #4: Sorting

Stingray Renderer Walkthrough #4: Sorting

Introduction

This post will focus on ordering of the commands in the RenderContexts. I briefly touched on this subject in the last post and if you’ve implemented a rendering engine before you’re probably not new to this problem. Basically we need a way to make sure our RenderJobPackages (draw calls) end up on the screen in the correct order, both from a visual point of view as well as from a performance point of view. Some concrete examples,

  1. Make sure g-buffers and shadow maps are rendered before any lighting happens.
  2. Make sure opaque geometry is rendered front to back to reduce overdraw.
  3. Make sure transparent geometry is rendered back to front for alpha blending to generate correct results.
  4. Make sure the sky dome is rendered after all opaque geometry but before any transparent geometry.
  5. All of the above but also strive to reduce state switches as much as possible.
  6. All of the above but depending on GPU architecture maybe shift some work around to better utilize the hardware.

There are many ways of tackling this problem and it’s not uncommon that engines uses multiple sorting systems and spend quite a lot of frame time getting this right.

Personally I’m a big fan of explicit ordering with a single stable sort. What I mean by explicit ordering is that every command that gets recorded to a RenderContext already has the knowledge of when it will be executed relative to other commands. For us this knowledge is in the form of a 64 bit sort_key, in the case where we get two commands with the exact same sort_key we rely on the sort being stable to not introduce any kind of temporal instabilities in the final output.

The reasons I like this approach are many,

  1. It’s trivial to implement compared to various bucketing schemes and sorting of those buckets.
  2. We only need to visit renderable objects once per view (when calling their render() function), no additional pre-visits for sorting are needed.
  3. The sort is typically fast, and cost is isolated and easy to profile.
  4. Parallel rendering works out of the box, we can just take all the Command arrays of all the RenderContexts and merge them before sorting.

To make this work each command needs to know its absolute sort_key. Let’s breakdown the sort_key we use when working with our data-driven rendering pipe in Stingray. (Note: if the user doesn’t care about playing nicely together with our system for data-driven rendering it is fine to completely ignore the bit allocation patterns described below and roll their own.)

sort_key breakdown

Most significant bit on the left, here are our bit ranges:

MSB [ 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
      ^ ^       ^  ^                                   ^^                 ^
      | |       |  |                                   ||                 |- 3 bits - Shader System (Pass Immediate)
      | |       |  |                                   ||- 16 bits - Depth
      | |       |  |                                   |- 1 bit - Instance bit
      | |       |  |- 32 bits - User defined
      | |       |- 3 bits - Shader System (Pass Deferred)
      | - 7 bits - Layer System
      |- 2 bits - Unused

2 bits - Unused

Nothing to see here, moving on… (Not really sure why these 2 bits are unused, I guess they weren’t at some point but for the moment they are always zero) :)

7 bits - Layer System

This 7-bits range is managed by the “Layer system”. The Layer system is responsible for controlling the overall scheduling of a frame and is set up in the render_config file. It’s a central part of the data-driven rendering architecture in Stingray. It allows you to configure what layers to expose to the shader system and in which order these layers should be drawn. We will look closer at the implementation of the layer system in a later post but in the interest of clarifying how it interops with the sort_key here’s a small example:


default = [
  // sort_key = [ 00000000 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"]
     depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }

  // sort_key = [ 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer"
     profiling_scope="decal" sort="EXPLICIT" }

  // sort_key = [ 00000001 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { resource_generator="lighting" profiling_scope="lighting" }

  // sort_key = [ 00000010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
  { name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
    sort="FRONT_BACK" profiling_scope="emissive" }
]

Above we have three layers exposed to the shader system and one kick of a resource_generator called lighting (more about resource_generators in a later post). The layers are rendered in the order they are declared, this is handled by letting each new layer increment the 7 bits range belonging to the Layer System with 1 (as can be seen in the sort_key comments above).

The shader author dictates into which layer(s) it wants to render. When a RenderJobPackage is recorded to the RenderContext (as described in the last post) the correct layer sort_keys are looked up from the layer system and the result is bitwise ORed together with the sort_key value piped as argument to RenderContext::render().

3 bits - Shader System (Pass Deferred)

The next 3 bits are controlled by the Shader System. These three bits encode the shader pass index within a layer. When I say shader in this context I refer to our ShaderTemplate::Context which is basically a wrapper around multiple linked shaders rendering into one or many layers. (Nathan Reed recently blogged about “The Many Meanings of “Shader””, in his analogy our ShaderTemplate is the same as an “Effect”)

Since we can have a multi-pass shader rendering into the same layer we need to encode the pass index into the sort_key, that is what this 3 bit range is used for.

32 bits - User defined

We then have 32 user defined bits, these bits are primarily used by our “Resource Generator” system (I will be covering this system in the post about render_config & data-driven rendering later), but the user is free to use them anyway they like and still maintain compatibility with the data-driven rendering system.

1 bit - Instance bit

This single bit also comes from the Shader System and is set if the shader implements support for “Instance Merging”. I will be covering this in a bit more detail in my next post about the RenderDevice but essentially this bit allows us to scan through all commands and find ranges of commands that potentially can be merged together to fewer draw calls.

16 bits - Depth

One of the arguments piped to RenderContext::render() is an unsigned normalized depth value (0.0-1.0). This value gets quantized into these 16 bits and is what drives the front-to-back vs back-to-front sorting of RenderJobPackages. If the sorting criteria for the layer (see layer example above) is set to back-to-front we simply flip the bits in this range.

3 bits - Shader System (Pass Immediate)

A shader can be configured to run in “Immediate Mode” instead of “Deferred Mode” (default). This forces passes in a multi-pass shader to run immediately after each other and is achieved by moving the pass index bits into the least significant bits of the sort_key. The concept is probably easiest to explain with an artificial example and some pseudo code:

Take a simple scene with a few instances of the same mesh, each mesh recording one RenderJobPackages to one or many RenderContexts and all RenderJobPackages are being rendered with the same multi-pass shader.

In “Deferred Mode” (i.e pass indices encoded in the “Shader System (Pass Deferred)” range) you would get something like this:

foreach (pass in multi-pass-shader)
  foreach (render-job in render-job-packages)
    render (render-job)
  end
end

If shader is configured to run in “Immediate Mode” you would instead get something like this:

foreach (render-job in render-job-packages)
  foreach (pass in multi-pass-shader)
    render (render-job)
  end
end

As you probably can imagine the latter results in more shader / state switches but can sometimes be necessary to guarantee correctly rendered results. A typical example is when using multi-pass shaders that does alpha blending.

Wrap up

The actual sort is implemented using a standard stable radix sort and happens immediately after the user has called RenderDevice::dispatch() handing over n-number of RenderContexts to the RenderDevice for translation into graphics API calls.

Next post will cover this and give an overview of what a typical rendering back-end (RenderDevice) looks like in Stingray. Stay tuned.

Friday, February 10, 2017

Stingray Renderer Walkthrough #3: Render Contexts

Stingray Renderer Walkthrough #3: Render Contexts

Render Contexts Overview

In the last post we covered how to create and destroy various GPU resources. In this post we will go through the system we have for recording a stream of rendering commands/packages that later gets consumed by the render backend (RenderDevice) where they are translated into actual graphics API calls. We call this interface RenderContext and similar to RenderResourceContext we can have multiple RenderContexts in flight at the same time to achieve data parallelism.

Let’s back up and reiterate a bit what was said in the Overview post. Typically in a frame we take the result of the view frustum culling, split it up into a number of chunks, allocate one RenderContext per chunk and then kick one worker thread per chunk. Each worker thread then sequentially iterates over its range of renderable objects and calls their render() function. The render() function takes the chunk’s RenderContext as one of its argument and is responsible for populating it with commands. When all worker threads are done the resulting RenderContexts gets “dispatched” to the RenderDevice.

So essentially the RenderContext is the output data structure for the second stage Render as discussed in the Overview post.

The RenderContext is very similar to the RenderResourceContext in the sense that it’s a fairly simple helper class for populating a command buffer. There is one significant difference though; the RenderContext also has a mechanics for reasoning about the ordering of the commands in the buffer before they get translated into graphics API calls by the RenderDevice.

Ordering & Buffers

We need a way to reorder commands in one or many RenderContexts to make sure triangles end up on the screen in the right order, or more generally speaking; to schedule our GPU work.

There are many ways of dealing with this but my favorite approach is to just associate one or many commands with a 64 bit sort key and when all commands have been recorded simply sort them on this key before translating them into actual graphics API calls. The approach we are using in Stingray is heavily inspired by Christer Ericsson’s blog post “Order your graphics draw calls around!”. I will be covering our sorting system in more details in my next post, for now the only thing important to grasp is that while the RenderContext records commands it does so by populating two buffers. One is a simple array of a POD struct called Command:

struct Command
{
    uint64_t sort_key;
    void *head;
    uint32_t command_flags;
};
  • sort_key - 64 bit sort key used for reordering commands before being consumed by the RenderDevice, more on this later.
  • head - Pointer to the actual data for this command.
  • command_flags - A bit flag encoding some hinting about what kind of command head is actually pointing to. This is simply an optimization to reduce pointer chasing in the RenderDevice, it will be covered in more detail in a later post.

Render Package Stream

The other buffer is what we call a RenderPackageStream and is what holds the actual command data. The RenderPackageStream class is essentially just a few helper functions to put arbitrary length commands into memory. The memory backing system for RenderPackageStreams is somewhat more complex than a simple array though, this is because we need a way to keep its memory footprint under control. For efficiency, we want to recycle the memory instead of reallocating it every frame, but depending on workload we are likely to get some RenderContexts becoming much larger than others. This creates a problem when using simple arrays to store the commands as the workload will shift slightly over time causing all arrays having to grow to fit the worst case scenario, resulting in lots of wasted memory.

To combat this we allocate and return fixed size blocks of memory from a pool. As we know the size of each command before writing them to the buffer we can make sure that a command doesn’t end up spanning multiple blocks; if we detect that we are about to run out of memory in the active block we simply allocate a new block and move on. If we detect that a single command will span multiple blocks we make sure to allocate them sequentially in memory. We return a block to the pool when we are certain that the consumer of the data (in this case the RenderDevice) is done with it. (This memory allocation approach is well described in Christian Gyrling’s excellent GDC 2015 presentation Parallelizing the Naughty Dog Engine Using Fibers)

You might be wondering why we put the sort_key in a separate array instead of putting it directly into the header data of the packages written to the RenderPackageStream, there are a number of reasons for that:

  1. The actual package data can become fairly large even for regular draw calls. Since we want to make the packages self contained we have to put all data needed to translate the command into an graphics API call inside the package. This includes handles to all resources, constant buffer reflections and similar. I don’t know of any way to efficiently sort an array with elements of varying sizes.

  2. Since we allocate the memory in blocks, as described above, we would need to introduce some form of “jump label” and insert that into the buffer to know how and when to jump into the next memory block. This would further complicate the sorting and traversal of the buffers.

  3. It allows us to recycle the actual package data from one draw call to another when rendering multi-pass shaders as we simply can inject multiple Commands pointing to the same package data. (Which shader pass to use when translating the package into graphic API calls can later be extracted from the sort_key.)

  4. We can reduce pointer chasing by encoding hints in the Command about the contents of the package data. This is what we do in command_flags mentioned earlier.

Render Context interface

With the low-level concepts of the RenderContext covered let’s move on and look at how it is used from a users perspective.

If we break down the API there are essentially three different types of commands that populates a RenderContext:

  1. State commands - Commands affecting the state of the rendering pipeline (e.g render target bindings, viewports, scissoring, etc) + some miscellaneous commands.
  2. Rendering commands - Commands used to trigger draw calls and compute work on the GPU.
  3. Resource update commands - Commands for updating GPU resources.

1. State Commands

“State commands” are a series of commands getting executed in sequence for a specific sort_key. The interface for starting/stopping the recording looks like this:

class RenderContext
{
    void begin_state_command(uint64_t sort_key, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void end_state_command();
};
  • sort_key - the 64 bit sort key.
  • gpu_affinity_mask - I will cover this towards the end of this post but, for now just think of it as a bit mask for addressing one or many GPUs.

Here’s a small example showing what the recording of a few state commands might look like:

rc.begin_state_command(sort_key);
for (uint32_t i=0; i!=MAX_RENDER_TARGETS; ++i)
    rc.set_render_target(i, nullptr);
rc.set_depth_stencil_target(depth_shadow_map);
rc.clear(RenderContext::CLEAR_DEPTH);
rc.set_viewports(1, &viewport);
rc.set_scissor_rects(1, &scissor_rect);
rc.end_state_command();

While state commands primarily are used for doing bigger graphics pipeline state changes (like e.g. changing render targets) they are also used for some miscellaneous things like clearing of bound render targets, pushing/poping timer markers, and some other stuff. There is no obvious reasoning for grouping these things together under the name “state commands”, it’s just something that has happened over time. Keep that in mind as we go through the list of commands below.

Common commands

  • set_render_target(uint32_t slot, RenderTarget *target, const SurfaceInfo& surface_info);

    • slot - Which index of the “Multiple Render Target” (MRT) chain to bind
    • target - What RenderTarget to bind
    • surface_info - SurfaceInfo is a struct describing which surface of the RenderTarget to bind.
    struct SurfaceInfo {
        uint32_t array_index; // 0 in all cases except if binding a texture array
        uint32_t slice;       // 0 for 2D textures, 0-5 for cube maps, 0-n for volume textures
        uint32_t mip_level;   // 0-n depending on wanted mip level
    };
    
  • set_depth_stencil_target(RenderTarget *target, const SurfaceInfo& surface_info); - Same as above but for depth stencil.

  • clear(RenderContext::ClearFlags flags); - Clears currently bound render targets.

    • flags - enum bit flag describing what parts of the bound render targets to clear.
    enum ClearFlags {
        CLEAR_SURFACE   = 0x1,
        CLEAR_DEPTH     = 0x2,
        CLEAR_STENCIL   = 0x4
    };
    
  • set_viewports(uint32_t n_viewports, const Viewport *viewports);

    • n_viewports - Number of viewports to bind.
    • viewports - Pointer to first Viewport to bind. Viewport is a struct describing the dimensions of the viewport:
    struct Viewport {
        float x, y, width, height;
        float min_depth, max_depth;
    };
    

    Note that x, y, width and height are in unsigned normalized [0-1] coordinates to decouple render target resolution from the viewport.

  • set_scissor_rects(uint32_t n_scissor_rects, const ScissorRect *scissor_rects);

    • n_scissor_rects - Number of scissor rectangles to bind
    • scissor_rects - Pointer to the first ScissorRect to bind.
    struct ScissorRect {
        float x, y, width, height;
    };
    

    Note that x, y, width and height are in unsigned normalized [0-1] coordinates to decouple render target resolution from the scissor rectangle.

A bit more exotic commands

  • set_stream_out_target(uint32_t slot, RenderResource *resource, uint32_t offset);
    • slot - Which index of the stream out buffers to bind
    • resource - Which RenderResource to bind to that slot (has to point to a VertexStream)
    • offset - A byte offset describing where to begin writing in the buffer pointed to by resource.
  • set_instance_multiplier(uint32_t multiplier);
    Allows the user to scale the number instances to render for each render() call (described below). This is a convenience function to make it easier to implement things like Instanced Stereo Rendering.

Markers

  • push_marker(const char *name)
    Starts a new marker scope named name. Marker scopes are both used for gathering RenderDevice statistics (number of draw calls, state switches and similar) as well as for creating GPU timing events. The user is free to nestle markers if they want to better group statistics. More on this in a later post.
  • pop_marker(const char *name)
    Stops an existing marker scope named name.

2. Rendering

With most state commands covered let’s move on and look at how to record commands for triggering draw calls and compute work to a RenderContext.

For that we have a single function called render():

class RenderContext
{
    RenderJobPackage *render(const RenderJobPackage* job,
        const ShaderTemplate::Context& shader_context, uint64_t interleave_sort_key = 0,
        uint64_t shader_pass_branch_key = 0, float job_sort_depth = 0.f,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

job

First argument piped to render() is a pointer to a RenderJobPackage, and as you can see the function also returns a pointer to a RenderJobPackage. What is going on here is that the RenderJobPackage piped as argument to render() gets copied to the RenderPackageStream, the copy gets patched up a bit and then a pointer to the modified copy is returned to allow the caller to do further tweaks to it. Ok, this probably needs some further explanation…

The RenderJobPackage is basically a header followed by an arbitrary length of data that together contains everything needed to make it possible for the RenderDevice to later translate it into either a draw call or a compute shader dispatch. In practice this means that after the RenderJobPackage header we also pack RenderResource::render_resource_handle for all resources to bind to all different shader stages as well as full representations of all non-global shader constant buffers.

Since we are building multiple RenderContexts in parallel and might be visiting the same renderable object (mesh, particle system, etc) simultaneously from multiple worker threads, we cannot mutate any state of the renderable when calling its render() function.

Typically all renderable objects have static prototypes of all RenderJobPackages they need to be drawn correctly (e.g. a mesh with three materials might have three RenderJobPackages - one per material). Naturally though, the renderable objects don’t know anything about in which context they will be drawn (e.g. from what camera or in what kind of lighting environment) up until the point where their render() function gets called and the information is provided. At that point their static RenderJobPackages prototypes somehow needs to be patched up with this information (which typically is in the form of shader constants and/or resources).

One way to handle that would be to create a copy of the prototype RenderJobPackage on the stack, patch up the stack copy and then pipe that as argument to RenderContext::render(). That is a fully valid approach and would work just fine, but since RenderContext::render() needs to create a copy of the RenderJobPackage anyway it is more efficient to patch up that copy directly instead. This is the reason for RenderContext::render() returning a pointer to the RenderJobPackage on the RenderPackageStream.

Before diving into the RenderJobPackage struct let’s go through the other arguments of RenderContext::render():

shader_context

We will go through this in more detail in the post about our shader system but essentially we have an engine representation called ShaderTemplate, each ShaderTemplate has a number of Contexts.

A Context is basically a description of any rendering passes that needs to run for the RenderJobPackage to be drawn correctly when rendered in a certain “context”. E.g. a simple shader might declare two contexts: “default” and “shadow”. The “default” context would be used for regular rendering from a player camera, while the “shadow” context would be used when rendering into a shadow map.

What I call a “rendering pass” in this scenario is basically all shader stages (vertex, pixel, etc) together with any state blocks (rasterizer, depth stencil, blend, etc) needed to issue a draw call / dispatch a compute shader in the RenderDevice.

interleave_sort_key

RenderContext::render() automatically figures out what sort keys / Commands it needs to create on it’s command array. Simple shaders usually only render into one layer in a single pass. In those scenarios RenderContext::render() will create a single Command on the command array. When using a more complex shader that renders into multiple layers and/or needs to render in multiple passes; more than one Command will be created, each command referencing the same RenderJobPackage in its Command::head pointer.

This can feel a bit abstract and is hard to explain without giving you the full picture of how the shader system works together with the data-driven rendering system which in turn dictates the bit allocation patterns of the sort keys, for now it’s enough to understand that the shader system somehow knows what Commands to create on the command array.

The shader author can also decide to bypass the data-driven rendering system and put the scheduling responsibility entirely in the hands of the caller of RenderContext::render(), in this case the sort key of all Commands created will simply become 0. This is where the interleave_sort_key comes into play, this variable will be bitwise ORed with the sort key before being stored in the Command.

shader_pass_branch_key

The shader system has a feature for allowing users to dynamically turn on/off certain rendering passes. Again this becomes somewhat abstract without providing the full picture but basically this system works by letting the shader author flag certain passes with a “tag”. A tag is simply a string that gets mapped to a bit within a 64 bit bit-mask. By bitwise ORing together multiple of these tags and piping the result in shader_pass_branch_key the user can control what passes to activate/deactivate when rendering the RenderJobPackage.

job_sort_depth

A signed normalized [0-1] floating point value used for controlling depth sorting between RenderJobPackages. As you will see in the next post this value simply gets mapped into a bit range of the sort key, removing the need for doing any kind of special trickery to manage things like back-to-front / front-to-back sorting of RenderJobPackages.

gpu_affinity_mask

Same as the gpu_affinity_mask parameter piped to begin_state_command().

RenderJobPackage

Let’s take a look at the actual RenderJobPackage struct:

struct RenderJobPackage
{
    BatchInfo batch_info;
    #if defined(COMPUTE_SUPPORTED)
        ComputeInfo compute_info;
    #endif

    uint32_t size;                          // size of entire package including extra data

    uint32_t n_resources;                   // number of resources assigned to job.
    uint32_t resource_offset;               // offset from start of RenderJobPackage to first RenderResource.

    uint32_t shader_resource_data_offset;   // offset to shader resource data
    RenderResource::Handle shader;          // shader used to execute job

    uint64_t instance_hash;                 // unique hash used for instance merging

    #if defined(DEVELOPMENT)
        ResourceID resource_tag;            // debug tag associating job to a resource on disc
        IdString32 object_tag;              // debug tag associating job to an object
        IdString32 batch_tag;               // debug tag associating job to a sub-batch of an object
    #endif
};

batch_info & compute_info

First two members are two nestled POD structs mainly containing the parameters needed for doing any kind of drawing or dispatching of compute work in the RenderDevice:

struct BatchInfo
{
    enum PrimitiveType {
        TRIANGLE_LIST,
        LINE_LIST
        // ...
    };
    enum FrontFace {
        COUNTER_CLOCK_WISE = 0,
        CLOCK_WISE = 1
    };

    PrimitiveType primitive_type;
    uint32_t vertex_offset;         // Offset to first vertex to read from vertex buffer.
    uint32_t primitives;            // Number of primitives to draw
    uint32_t index_offset;          // Offset to the first index to read from the index buffer
    uint32_t vertices;              // Number of vertices in batch (used if batch isn't indexed)
    uint32_t instances;             // Number of instances of this batch to draw
    FrontFace front_face;           // Defines which triangle winding order
};

Most of these are self explanatory, I think the only thing worth pointing out is the front_face enum. This is here to dynamically handle flipping of the primitive winding order when dealing with objects that are negatively scaled on an uneven number of axes. For typical game content it’s rare that we see content creators using mesh mirroring when modeling, for other industries however it is a normal workflow.

struct ComputeInfo
{
    uint32_t thread_count[3];
    bool async;
};

So while BatchInfo mostly holds the parameters needed to render something, ComputeInfo hold the parameters to dispatch a compute shader. The three element array thread_count containing the thread group count for x, y, z. If async is true the graphics API’s “compute queue” will be used instead of the “graphics queue”.

resource_offset

Byte offset from start of RenderJobPackage to an array of n_resources with RenderResource::Handle. Resources found in this array can be of the type VertexStream, IndexStream or VertexDeclaration. Based on the their type and order in the array they get bound to the input assembler stage in the RenderDevice.

shader_resource_data_offset

Byte offset from start of RenderJobPackage to a data block holding handles to all RenderResources as well as all constant buffer data needed by all the shader stages. The layout of this data blob will be covered in the post about the shader system.

instance_hash

We have a system for doing what we call “instance merging”, this system figures out if two RenderJobPackages only differ on certain shader constants and if so merges them into the same draw call. The shader author is responsible but not required to implement support for this feature. If the shader supports “instance merging” the system will use the instance_hash to figure out if two RenderJobPackages can be merged or not. Typically the instance_hash is simply a hash of all RenderResource::Handle that the shader takes as input.

resource_tag & object_tag & batch_tag

Three levels of debug information to make it easier to back track errors/warning inside the RenderDevice to the offending content.

3. Resource updates

The last type of commands are for dynamically updating various RenderResources (Vertex/Index/Raw buffers, Textures, etc).

The interface for updating a buffer with new data looks like this:

class RenderContext
{
    void *map_write(RenderResource *resource, render_sorting::SortKey sort_key,
        const ShaderTemplate::Context* shader_context = 0,
        shader_pass_branching::Flags shader_pass_branch_key = 0,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

resource

This function basically returns a pointer to the first byte of the buffer that will replace the contents of the resource. map_write() figures out the size of the buffer by casting the resource to the correct type (using the type information encoded in the RenderResource::render_resource_handle). It then allocates memory for the buffer and a small header on the RenderPackageStream and returns a pointer to the buffer.

sort_key & shader_context & shader_pass_branch_key

In some rare situations you might need to update the same buffer with different data multiple times within a frame. A typical example could be the vertex buffer of a particle system implementing some kind of level-of-detail system causing the buffers to change depending on e.g camera position. To support that the user can provide a bunch of extra parameters to make sure the contents of the GPU representation of the buffer is updated right before the graphics API draw calls are triggered for the different rendering passes. This works in a similar way how RenderContext::render() can create multiple Commands on the command array referencing the same data.

Unless you need to update the buffer multiple times within the frame it is safe to just set all of the above mentioned parameters to 0, making it very simple to update a buffer:

void *buf = rc.map_write(resource, 0);
// .. fill bits in buffer ..

Note: To shorten the length of this post I’ve left out a few other flavors of updating resources, but map_write is the most important one to grasp.

GPU Queues, Fences & Explicit MGPU programming

Before wrapping up I’d like to touch on a few recent additions to the Stingray renderer, namely how we’ve exposed control for dealing with different GPU Queues, how to synchronize between them and how to control, communicate and synchronize between multiple GPUs.

New graphics APIs such as DX12 and Vulkan exposes three different types of command queues: Graphics, Compute and Copy. There’s plenty of information on the web about this so I won’t cover it here, the only thing important to understand is that these queues can execute asynchronously on the GPU; hence we need to have a way to synchronize between them.

To handle that we have exposed a simple fence API that looks like this:

class RenderContext
{
    struct FenceMessage
    {
        enum Operation { SIGNAL, WAIT };
        Operation operation;
        IdString32 fence_name;
    };
    void signal_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void wait_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

Here’s a pseudo code snippet showing how to synchronize between the graphics queue and the compute queue:

uint64_t sort_key = 0;

// record a draw call
rc.render(graphics_job, graphics_shader, sort_key++);

// record an asynchronous compute job
// (ComputeInfo::async bool in async_compute_job is set to true to target the graphics APIs compute queue)
rc.render(async_compute_job, compute_shader, sort_key++);

// now lets assume the graphics queue wants to use the result of the async_compute_job,
// for that we need to make sure that the compute shader is done running
rc.wait_fence(IdString32("compute_done"), sort_key++, GRAPHICS_QUEUE);
rc.signal_fence(IdString32("compute_done"), sort_key++, COMPUTE_QUEUE);

rc.render(graphics_job_using_result_from_compute, graphics_shader2, sort_key++);

As you might have noticed all methods for populating a RenderContext described in this post also takes an extra parameter called gpu_affinity_mask. This is a bit-mask used for directing commands to one or many GPUs. The idea is simple, when we boot up the renderer we enumerate all GPUs present in the system and decide which one to use as our default GPU (GPU_DEFAULT) and assign that to bit 1. We also let the user decide if there are other GPUs present in the system that should be available to Stingray and if so assign them bit 2, 3, 4, and so on. By doing so we can explicitly direct control of all commands put on the RenderContext to one or many GPUs in a simple way.

As you can see that is also true for the fence API described above, on top of that there’s also a need for a copy interface to copying resources between GPUs:

class RenderContext
{
    void copy(RenderResource *dst_resource, RenderResource *src_resource,
        render_sorting::SortKey sort_key, Box *src_box = 0, uint32_t dst_offsets[3] = 0,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT,
        uint32_t gpu_source = GPU_DEFAULT, uint32_t gpu_destination = GPU_DEFAULT);
};

Even though this work isn’t fully completed I still wanted to share the high-level idea of what we are working towards for exposing explicit MGPU control to the Stingray renderer. We are actively working on this right now and with some luck I might be able to revisit this with more concrete examples when getting to the post about the render_config & data-driven rendering.

Next up

With that I think I’ve covered the most important aspects of the RenderContext. Next post will dive a bit deeper into bit allocation ranges of the sort keys and the system for sorting in general, hopefully that post will become a bit shorter.

Wednesday, February 1, 2017

Stingray Renderer Walkthrough #2: Resources & Resource Contexts

Stingray Renderer Walkthrough #2: Resources & Resource Contexts

Render Resources

Before any rendering can happen we need a way to reason about GPU resources. Since we want all graphics API specific code to stay isolated we need some kind of abstraction on the engine side, for that we have an interface called RenderDevice. All calls to graphics APIs like D3D, OGL, GNM, Metal, etc. stays behind this interface. We will be covering the RenderDevice in a later post so for now just know that it is there.

We want to have a graphics API agnostic representation for a bunch of different types of resources and we need to link these representations to their counterparts on the RenderDevice side. This linking is handled through a POD-struct called RenderResource:

struct RenderResource
{
    enum {
        TEXTURE, RENDER_TARGET, DEPENDENT_RENDER_TARGET, BACK_BUFFER_WRAPPER,
        CONSTANT_BUFFER, VERTEX_STREAM, INDEX_STREAM, RAW_BUFFER,
        BATCH_INFO, VERTEX_DECLARATION, SHADER,
        NOT_INITIALIZED = 0xFFFFFFFF
    };

    uint32_t render_resource_handle;
};

Any engine resource that also needs a representation on the RenderDevice side inherits from this struct. It contains a single member render_resource_handle which is used to lookup the correct graphics API specific representation in the RenderDevice.

The most significant 8 bits of render_resource_handle holds the type enum, the lower 24 bits is simply an index into an array for that specific resource type inside the RenderDevice.

Various Render Resources

Let’s take a look at the different render resource that can be found in Stingray:

  • Texture - A regular texture, this object wraps all various types of different texture layouts such as 2D, Cube, 3D.
  • RenderTarget - Basically the same as Texture but writable from the GPU.
  • DependentRenderTarget - Similar to RenderTarget but with logics for inheriting properties from another RenderTarget. This is used for creating render targets that needs to be reallocated when the output window (swap chain) is being resized.
  • BackBufferWrapper - Special type of RenderTarget created inside the RenderDevice as part of the swap chain creation. Almost all render targets are explicitly created by the user, this is the only exception as the back buffer associated with the swap chain is typically created together with the swap chain.
  • ShaderConstantBuffer - Shader constant buffers designed for explicit update and sharing between multiple shaders, mainly used for “view-global” state.
  • VertexStream - A regular Vertex Buffer.
  • VertexDeclaration - Describes the contents of one or many VertexStreams.
  • IndexStream - A regular Index Buffer.
  • RawBuffer - A linear memory buffer, can be setup for GPU writing through an UAV (Unordered Access View).
  • Shader - For now just think of this as something containing everything needed to build a full pipeline state object (PSO). Basically a wrapper over a number of shaders, render states, sampler states etc. I will cover the shader system in a later post.

Most of the above resources have a few things in common:

  • They describe a buffer either populated by the CPU or by the GPU
  • CPU populated buffers has a validity field describing its update frequency:
    • STATIC - The buffer is immutable and won’t change after creation, typically most buffers coming from DCC assets are STATIC.
    • UPDATABLE - The buffer can be updated but changes less than once per frame, e.g: UI elements, post processing geometry and similar.
    • DYNAMIC - The buffer frequently changes, at least once per frame but potentially many times in a single frame e.g: particle systems.
  • They have enough data for creating a graphics API specific representation inside the RenderDevice, i.e they know about strides, sizes, view requirements (e.g should an UAV be created or not), etc.

Render Resource Context

With the RenderResource concept sorted, we’ll go through the interface for creating and destroying the RenderDevice representation of the resources. That interface is called RenderResourceContext (RRC).

We want resource creation to be thread safe and while the RenderResourceContext in itself isn’t, we can achieve free threading by allowing the user to create any number of RRC’s they want, and as long as they don’t touch the same RRC from multiple threads everything will be fine.

Similar to many other rendering systems in Stingray the RRC is basically just a small helper class wrapping an abstract “command buffer”. On this command buffer we put what we call “packages” describing everything that is needed for creating/destroying RenderResource objects. These packages have variable length depending on what kind of object they represent. In addition to that the RRC can also hold platform specific allocators that allow allocating/deallocating GPU mapped memory directly, avoiding any additional memory shuffling in the RenderDevice. This kind of mechanism allows for streaming e.g textures and other immutable buffers directly into GPU memory on platforms that provides that kind of low-level control.

Typically the only two functions the user need to care about are:

class RenderResourceContext
{
public:
  void alloc(RenderResource *resource);
  void dealloc(RenderResource *resource);
};

When the user is done allocating/deallocating resources they hand over the RRC either directly to the RenderDevice or to the RenderInterface.

class RenderDevice
{
public:
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc, uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

Handing it over directly to the RenderDevice requires the caller to be on the controller thread for rendering as RenderDevice::dispatch() isn’t thread safe. If the caller is on any other thread (like e.g. one of the worker threads or the resource streaming thread) RenderInterface::dispatch() should be used instead. We will cover the RenderInterface in a later post so for now just think of it as a way of piping data into the renderer from an arbitrary thread.

Wrap up

The main reason of having the RenderResourceContext concept instead of exposing allocate()/deallocate() functions directly in the RenderDevice/RenderInterface interfaces is for efficiency. We have a need for allocating and deallocating lots of resources, sometimes in parallel from multiple threads. Decoupling the interface for doing so makes it easy to schedule when in the frame the actual RenderDevice representations gets created, it also makes the code easier to maintain as we don’t have to worry about thread-safety of the RenderResourceContext.

In the next post we will discuss the RenderJobs and RenderContexts which are the two main building blocks for creating and scheduling draw calls and state changes.

Stay tuned.

Stingray Renderer Walkthrough #1: Overview

Stingray Renderer Walkthrough #1: Overview

Introduction

When we started writing Bitsquid back in mid 2009 all platforms we intended to run on were already multi-core architectures. This and the fact that we had some prior experience trying to get our last engine to run efficiently on the PS3 answered the question how not to architecture an efficient renderer that scales to many cores. We knew we needed more than functional parallelism, we wanted data-parallelism.

To solve that we divide the CPU view of a rendered frame into three stages:

  1. Culling - Filter out visible renderable objects with respect to a camera from a potentially huge set of different type of objects (meshes, particle systems, lights, etc).
  2. Render - Iterate over the filtered result from Culling and “record” an intermediate representation of draw calls/state switches to a command buffer.
  3. Dispatch - Take result from Render and translate that into actual render API calls (D3D, OGL, Metal, GNM, etc).

As you can see each stage pipes its result into the next. Rendering is typically very simple in that sense; we tend to have a one way flow of our data: [[user input or time affects state, state propagates into changes of the renderable objects (transforms, shader constants, etc), figure out what need to be rendered, iterate over that and finally generate render API calls. Rinse & Repeat :]]

If we ignore the problem of ordering the final API calls in the rendering backend it’s fairly easy to see how we can achieve data parallelism in this scenario. Just fork at each stage splitting the workload into a n-chunks (where n is however many worker threads you can throw at it). When all workers are done for a stage take the result and pipe into the next stage.

In essence this is how all rendering in Stingray works. Obviously I’ve glanced over some rather important and challenging details but as you will see they are not too hard to solve if you have good control over your data flows and are picky about when mutation of the data happens.

Design Philosophies & Concepts

The rendering code in Stingray tends to be heavily influenced by Data Oriented Programming principles. When designing new systems our biggest efforts usually goes into structuring our data efficiently and thinking about its flow through the systems, more so than writing the actual code that transforms the data from one form to another.

To achieve data-parallelism throughout the rendering code the first thing to realize is that we have to be very picky about when mutation of the renderable objects happens. Multiple worker threads will run over our objects and its not unlikely that more than one thread visits the same object at the same time, hence we must not mutate the state of our objects in its render function. Therefore all of our render() functions are const.

To further guard ourselves from the outer world (i.e gameplay, physics, etc) the renderer operates in complete isolation from the game logics. It has its own representation of the data it needs, and only the data relevant for rendering. While the gameplay logics usually wants to reason about high-level concepts such as game entities (which basically groups a number of meshes, particle systems, lights, etc together), we on the rendering side don’t really care about that. We are much more interested in just having an array of all renderable objects in a game world, in a memory layout that makes it efficient to access.

Another nice thing with decoupling the representation of the renderable objects from the game objects is that it allows us to run simulation in parallel with rendering (functional parallelism). So while simulation is updating frame n the renderer is processing frame n-1. Some of you might argue that overlaying rendering on top of simulation doesn’t give any performance improvements if the work in all systems is nicely parallelized. In reality though this isn’t really the case. We still have systems that don’t go wide, or have certain sections where they need to do synchronous processing (last generation graphics APIs: e.g DX11, OpenGL are good examples). This creates bubbles in the frame slowing us down.

By overlaying simulation and rendering we get a form of bubble filling among the worker threads which in most cases gives a big enough speed improvement to justify the added complexity that comes from this architecture. More specifically:

  1. Double buffering of state - since the simulation might mutate the state of an object for frame n at the same time as the renderer is processing frame n-1 any mutable state needs to be double buffered.
  2. Life scope tracking of immutable data - while immutable/read only state such as static vertex and index buffers are safe to read by both simulation and renderer we still need to be careful not pulling the rug under the renderers feet by freeing anything still being in use by the renderer.

Here’s a conceptual graph showing the benefits of overlaying simulation and rendering:

So basically what we got here is two “controller threads”: simulation and render both offloading work to the worker threads. In the case that a controller thread is blocked waiting for some work to finish it will assist the worker threads striving to never sit idle. One thing to note is that to prevent frames from stacking up, we never allow the simulation thread to run more than one frame ahead of the render thread.

As a comparison here’s the same workload with simulation and rendering running in sequence.

As you can see we get significantly more idle time (bubbles) on the worker threads due to certain parts of both the simulation and rendering not being able to go wide.

Next up

I think this pretty much covers the high level view of the core rendering architecture in Stingray. Now lets go into some more detail.

Since Andreas Asplund recently covered both how we handle propagation of state from simulation to the renderer (we call this “State reflection” in Stingray): http://bitsquid.blogspot.se/2016/09/state-reflection.html as well as how our view frustum culling system(s) works: http://bitsquid.blogspot.se/2016/10/the-implementation-of-frustum-culling.html I won’t be covering that in this series.

Instead I will jump straight into how creating and destroying GPU resources works, and from there go through all the building blocks needed to implement the second stage Render mentioned above.

Stingray Renderer Walkthrough

Stingray Renderer Walkthrough

Welcome

To simplify knowledge transferring inside the Autodesk development teams and in an attempt to improve my writing skills I’ve decided to do a walkthrough of the Stingray rendering architecture. The idea is to do this as a series of blog posts over the coming weeks starting from the low-level aspects of the renderer chewing my way up to more high-level concepts as I go.

I’ve covered some of these topics before in various presentations over the years but those have been more focused on how our data driven aspects of the renderer works and less on the core architecture behind it. This is an attempt to do a more complete walk-through of the entire rendering architecture.

When I started thinking about this it felt like an almost impossible undertaking considering how much slower I am at expressing myself in text than in code, but after spending a couple of days going through the entire stingray code base doing some spring cleaning it felt a bit more manageable so I’ve now decided to give it a try.

(Note: this has nothing at all to do with me feeling the pressure from Niklas Frykholm who’s currently doing a complete walk-through of the entire Stingray engine code base (well everything except rendering) as a series of youtube videos [1]. Not at all… I feel no pressure, no guilt, nothing… I promise… Thanks Niklas for pushing me!)

Outline

Below is some kind of outline of what I intend to cover and in what order, I might swap things around as I go if I discover it makes more sense. This post will work as an index and I will link to the posts as they come online.

  1. Overview
  2. Resources & Resource Contexts
  3. Render Contexts
  4. Sorting
  5. RenderDevice
  6. RenderInterface
  7. render_config & data-driven rendering
  8. Shaders & Materials
  9. Stingray-renderer & mini-renderer
  10. Renderer PluginAPI

Tuesday, October 4, 2016

The Implementation of Frustum Culling in Stingray

Overview

Frustum culling can be an expensive operation. Stingray accelerates it by making heavy use of SIMD and distributing the workload over several threads. The basic workflow is:

  • Kick jobs to do frustum vs sphere culling
    • For each frustum plane, test plane vs sphere
  • Wait for sphere culling to finish
  • For objects that pass sphere test, kick jobs to do frustum vs object-oriented bounding box (OOBB) culling
    • For each frustum plane, test plane vs OOBB
  • Wait for OOBB culling to finish

Frustum vs sphere tests are significantly faster than frustum vs OOBB. By rejecting objects that fail sphere culling first, we have fewer objects to process in the more expensive OOBB pass.

Why go over all objects brute force instead of using some sort of spatial partition data structure? We like to keep things simple and with the current setup we have yet to encounter a case where we've been bound by the culling. Brute force sphere culling followed by OOBB culling is fast enough for all cases we've encountered so far. That might of course change in the future, but we'll take care of that when it's an actual problem.

The brute force culling is pretty fast, because:

  1. The sphere and the OOBB culling use SIMD and only load the minimum amount of needed data.
  2. The workload is distributed over several threads.

In this post, I we will first look at the single threaded SIMD code and then how the culling is distributed over multiple threads.

I'll use a lot of code to show how it's all done. It's mostly actual code from the engine, but it has been cleaned up to a certain extent. Some stuff has been renamed and/or removed to make it easier to understand what's going on.

Data structures used

If you go back to my previous post about state reflection, http://bitsquid.blogspot.ca/2016/09/state-reflection.html you can read that each object on the main thread is associated with a render thread representation via a render_handle. The render_handle is used to get the object_index which is the index of an object in the _objects array.

Take a look at the following code for a refresher:

void RenderWorld::create_object(WorldRenderInterface::ObjectManagementPackage *omp)
{
    // Acquire an `object_index`.
    uint32_t object_index = _objects.size();

    // Same recycling mechanism as seen for render handles.
    if (_free_object_indices.any()) {
        object_index = _free_object_indices.back();
        _free_object_indices.pop_back();
    } else {
        _objects.resize(object_index + 1);
        _object_types.resize(object_index + 1);
    }

    void *render_object = omp->user_data;
    if (omp->type == RenderMeshObject::TYPE) {
        // Cast the `render_object` to a `MeshObject`.
        RenderMeshObject *rmo = (RenderMeshObject*)render_object;

        // If needed, do more stuff with `rmo`.
    }

    // Store the `render_object` and `type`.
    _objects[object_index] = render_object;
    _object_types[object_index] = omp->type;

    if (omp->render_handle >= _object_lut.size())
        _object_lut.resize(omp->handle + 1);
    // The `render_handle` is used
    _object_lut[omp->render_handle] = object_index;
}

The _objects array stores objects of all kinds of different types. It is defined as:

Array<void*> _objects;

The types of the objects are stored in a corresponding _object_types array, defined as:

Array<uint32_t> _object_types;

From _object_types, we know the actual type of the objects and we can use that to cast the void * into the proper type (mesh, terrain, gui, particle_system, etc).

The culling happens in the // If needed, do more stuff with rmo section above. It looks like this:

void *render_object = omp->user_data;
if (omp->type == RenderMeshObject::TYPE) {
    // Cast the `render_object` to a `MeshObject`.
    RenderMeshObject *rmo = (RenderMeshObject*)render_object;

    // If needed, do more stuff with `rmo`.
    if (!(rmo->flags() & renderable::CULLING_DISABLED)) {
        culling::Object o;
        // Extract necessary information to do culling.

        // The index of the object.
        o.id = object_index;

        // The type of the object.
        o.type = rmo->type;

        // Get the mininum and maximum corner positions of a boudning box in object space.
        o.min = float4(rmo->bounding_volume().min, 1.f);
        o.max = float4(rmo->bounding_volume().max, 1.f);

        // World transform matrix.
        o.m = float4x4(rmo->world());

        // Depending on the value of `flags` add the culling representation to different culling sets.
        if (rmo->flags() & renderable::VIEWPORT_VISIBLE)
            _cullable_objects.add(o, rmo->node());
        if (rmo->flags() & renderable::SHADOW_CASTER)
            _cullable_shadow_casters.add(o, rmo->node());
        if (rmo->flags() & renderable::OCCLUDER)
            _occluders.add(o, rmo->node());
    }
}

For culling MeshObjects and other cullable types are represented by culling::Objects that are used to populate the culling data structures. As can be seen in the code they are _cullable_objects, _cullable_shadow_casters and _occluders and they are all represented by an ObjectSet:

struct ObjectSet
{
    // Minimum bounding box corner position.
    Array<float> min_x;
    Array<float> min_y;
    Array<float> min_z;

    // Maximum bounding box corner position.
    Array<float> max_x;
    Array<float> max_y;
    Array<float> max_z;

    // Object->world matrix.
    Array<float> world_xx;
    Array<float> world_xy;
    Array<float> world_xz;
    Array<float> world_xw;
    Array<float> world_yx;
    Array<float> world_yy;
    Array<float> world_yz;
    Array<float> world_yw;
    Array<float> world_zx;
    Array<float> world_zy;
    Array<float> world_zz;
    Array<float> world_zw;
    Array<float> world_tx;
    Array<float> world_ty;
    Array<float> world_tz;
    Array<float> world_tw;

    // World space center position of bounding sphere.
    Array<float> ws_pos_x;
    Array<float> ws_pos_y;
    Array<float> ws_pos_z;

    // Radius of bounding sphere.
    Array<float> radius;

    // Flag to indicate if an object is culled or not.
    Array<uint32_t> visibility_flag;

    // The type and id of an object.
    Array<uint32_t> type;
    Array<uint32_t> id;

    uint32_t n_objects;
};

When an object is added to, e.g. _cullable_objects the culling::Object data is added to the ObjectSet. The ObjectSet flattens the data into a structure-of-arrays representation. The arrays are padded to the SIMD lane count to make sure there's valid data to read.

Frustum-sphere culling

The world space positions and sphere radii of objects are represented by the following members of the ObjectSet:

Array<float> ws_pos_x;
Array<float> ws_pos_y;
Array<float> ws_pos_z;
Array<float> radius;

This is all we need to do frustum-sphere culling.

The frustum-sphere culling needs the planes of the frustum defined in world space. Information on how to find that can be found in: http://gamedevs.org/uploads/fast-extraction-viewing-frustum-planes-from-world-view-projection-matrix.pdf.

The frustum-sphere intersection code tests one plane against several spheres using SIMD instructions. The ObjectSet data is already laid out in a SIMD friendly way. To test one plane against several spheres, the plane's data is splatted out in the following way:

// `float4` is our cross platform abstraction of SSE, NEON etc.
struct SIMDPlane
{
    float4 normal_x; // the normal's x value replicted 4 times.
    float4 normal_y; // the normal's y value replicted 4 times.
    float4 normal_z; // etc.
    float4 d;
};

The single threaded code needed to do frustum-sphere culling is:

void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set)
{
    const auto all_true = bool4_all_true();
    const uint32_t n_objects = object_set.n_objects;

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    // Test each plane of the frustum against each sphere.
    for (uint32_t i = 0; i < n_objects; i += 4)
    {
        const auto ws_pos_x = float4_load_aligned(&object_set->ws_pos_x[i]);
        const auto ws_pos_y = float4_load_aligned(&object_set->ws_pos_y[i]);
        const auto ws_pos_z = float4_load_aligned(&object_set->ws_pos_z[i]);
        const auto radius = float4_load_aligned(&object_set->radius[i]);

        auto inside = all_true;
        for (unsigned p = 0; p < 6; ++p) {
            auto &n_x = planes[p].normal_x;
            auto &n_y = planes[p].normal_y;
            auto &n_z = planes[p].normal_z;
            auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
            auto plane_test_point = n_dot_pos + radius;
            auto plane_test = plane_test_point >= planes[p].d;
            inside = vector_and(plane_test, inside);
        }

        // Store 0 for spheres that didn't intersect or ended up on the positive side of the
        // frustum planes. Store 0xffffffff for spheres that are visible.
        store_aligned(inside, &visibility_flag[i]);
    }
}

After the simd_sphere_culling call, the visibility_flag array contains 0 for all objects that failed the test and 0xffffffff for all objects that passed. We chain this together with the OOBB culling by doing a compactness pass over the visibility_flag array and populating an indirection array:

{
    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        float4_splat(frustum.planes[0].n.x),
        float4_splat(frustum.planes[0].n.y),
        float4_splat(frustum.planes[0].n.z),
        float4_splat(frustum.planes[0].d),

        float4_splat(frustum.planes[1].n.x),
        float4_splat(frustum.planes[1].n.y),
        float4_splat(frustum.planes[1].n.z),
        float4_splat(frustum.planes[1].d),

        float4_splat(frustum.planes[2].n.x),
        float4_splat(frustum.planes[2].n.y),
        float4_splat(frustum.planes[2].n.z),
        float4_splat(frustum.planes[2].d),

        float4_splat(frustum.planes[3].n.x),
        float4_splat(frustum.planes[3].n.y),
        float4_splat(frustum.planes[3].n.z),
        float4_splat(frustum.planes[3].d),

        float4_splat(frustum.planes[4].n.x),
        float4_splat(frustum.planes[4].n.y),
        float4_splat(frustum.planes[4].n.z),
        float4_splat(frustum.planes[4].d),

        float4_splat(frustum.planes[5].n.x),
        float4_splat(frustum.planes[5].n.y),
        float4_splat(frustum.planes[5].n.z),
        float4_splat(frustum.planes[5].d),
    };


    // Do frustum-sphere culling.
    simd_sphere_culling(planes, object_set);

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    // Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
    Array<uint32_t> indirection(n_aligned_objects);

    const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin());
}

Where remove_not_visible is:

uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection)
{
    const uint32_t *visibility_flag = object_set.visibility_flag.begin();
    uint32_t n_visible = 0U;
    for (uint32_t i = 0; i < count; ++i) {
        if (visibility_flag[i]) {
            output_indirection[n_visible] = i;
            ++n_visible;
        }
    }

    const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
    const uint32_t last_visible = n_visible? output_indirection[n_visible- 1] : 0;

    // Pad out to the simd alignment.
    for (unsigned i = n_visible; i < n_aligned_visible; ++i)
        output_indirection[i] = last_visible;

    return n_visible;
}

n_visible together with indirection provides the input for doing the frustum-OOBB culling on the objects that survived the frustum-sphere culling.

Frustum-OOBB culling

The frustum-OOBB culling takes ideas from Fabian Giesen's https://fgiesen.wordpress.com/2010/10/17/view-frustum-culling/ and Arseny Kapoulkine's http://zeuxcg.org/2009/01/31/view-frustum-culling-optimization-introduction/.

More specifically we use the Method 2: Transform box vertices to clip space, test against clip-space planes that both Fabian and Arseny write about. But we also go with Method 2b: Saving arithmetic ops that Fabian mentions. I won't dwelve into how the culling actually works, to understand that please read their posts.

The code is SIMDified to process several OOBBs at the same time. The same corner of four multiple OOBBs is tested against one frustum plane as a single SIMD operation.

To be able to write the SIMD code in a more intuitive form a few data structures and functions are used:

struct SIMDVector
{
    float4 x; // stores x0, x1, x2, x3
    float4 y; // stores y0, y1, y2, y3
    float4 z; // etc.
    float4 w;
};

A SIMDVector stores x, y, z & w for four objects. To store a matrix for four objects a SIMDMatrix is used:

struct SIMDMatrix
{
    SIMDVector x;
    SIMDVector y;
    SIMDVector z;
    SIMDVector w;
};

A SIMDMatrix-SIMDVector multiplication can then be written as a regular matrix-vector multiplication:

SIMDVector simd_multiply(const SIMDVector &v, const SIMDMatrix &m)
{
    float4 x = v.x * m.x.x;     x = v.y * m.y.x + x;    x = v.z * m.z.x + x;    x = v.w * m.w.x + x;
    float4 y = v.x * m.x.y;     y = v.y * m.y.y + y;    y = v.z * m.z.y + y;    y = v.w * m.w.y + y;
    float4 z = v.x * m.x.z;     z = v.y * m.y.z + z;    z = v.z * m.z.z + z;    z = v.w * m.w.z + z;
    float4 w = v.x * m.x.w;     w = v.y * m.y.w + w;    w = v.z * m.z.w + w;    w = v.w * m.w.w + w;
    SIMDVector res = { x, y, z, w };
    return res;
}

A SIMDMatrix-SIMDMatrix multiplication is:

SIMDMatrix simd_multiply(const SIMDMatrix &lhs, const SIMDMatrix &rhs)
{
    SIMDVector x = simd_multiply(lhs.x, rhs);
    SIMDVector y = simd_multiply(lhs.y, rhs);
    SIMDVector z = simd_multiply(lhs.z, rhs);
    SIMDVector w = simd_multiply(lhs.w, rhs);
    SIMDMatrix res = { x, y, z, w };
    return res;
}

The code needed to do the actual frustum-OOBB culling is:

void simd_oobb_culling(const SIMDMatrix &view_proj, const culling::ObjectSet &object_set, uint32_t n_objects, const uint32_t *indirection)
{
    // Get pointers to the necessary members of the object set.
    const float *min_x = object_set->min_x.begin();
    const float *min_y = object_set->min_y.begin();
    const float *min_z = object_set->min_z.begin();

    const float *max_x = object_set->max_x.begin();
    const float *max_y = object_set->max_y.begin();
    const float *max_z = object_set->max_z.begin();

    const float *world_xx = object_set->world_xx.begin();
    const float *world_xy = object_set->world_xy.begin();
    const float *world_xz = object_set->world_xz.begin();
    const float *world_xw = object_set->world_xw.begin();
    const float *world_yx = object_set->world_yx.begin();
    const float *world_yy = object_set->world_yy.begin();
    const float *world_yz = object_set->world_yz.begin();
    const float *world_yw = object_set->world_yw.begin();
    const float *world_zx = object_set->world_zx.begin();
    const float *world_zy = object_set->world_zy.begin();
    const float *world_zz = object_set->world_zz.begin();
    const float *world_zw = object_set->world_zw.begin();
    const float *world_tx = object_set->world_tx.begin();
    const float *world_ty = object_set->world_ty.begin();
    const float *world_tz = object_set->world_tz.begin();
    const float *world_tw = object_set->world_tw.begin();

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    for (uint32_t i = 0; i < n_objects; i += 4) {
        SIMDMatrix world;

        // Load the world transform matrix for four objects via the indirection table.

        const uint32_t i0 = indirection[i];
        const uint32_t i1 = indirection[i + 1];
        const uint32_t i2 = indirection[i + 2];
        const uint32_t i3 = indirection[i + 3];

        world.x.x = float4(world_xx[i0], world_xx[i1], world_xx[i2], world_xx[i3]);
        world.x.y = float4(world_xy[i0], world_xy[i1], world_xy[i2], world_xy[i3]);
        world.x.z = float4(world_xz[i0], world_xz[i1], world_xz[i2], world_xz[i3]);
        world.x.w = float4(world_xw[i0], world_xw[i1], world_xw[i2], world_xw[i3]);

        world.y.x = float4(world_yx[i0], world_yx[i1], world_yx[i2], world_yx[i3]);
        world.y.y = float4(world_yy[i0], world_yy[i1], world_yy[i2], world_yy[i3]);
        world.y.z = float4(world_yz[i0], world_yz[i1], world_yz[i2], world_yz[i3]);
        world.y.w = float4(world_yw[i0], world_yw[i1], world_yw[i2], world_yw[i3]);

        world.z.x = float4(world_zx[i0], world_zx[i1], world_zx[i2], world_zx[i3]);
        world.z.y = float4(world_zy[i0], world_zy[i1], world_zy[i2], world_zy[i3]);
        world.z.z = float4(world_zz[i0], world_zz[i1], world_zz[i2], world_zz[i3]);
        world.z.w = float4(world_zw[i0], world_zw[i1], world_zw[i2], world_zw[i3]);

        world.w.x = float4(world_tx[i0], world_tx[i1], world_tx[i2], world_tx[i3]);
        world.w.y = float4(world_ty[i0], world_ty[i1], world_ty[i2], world_ty[i3]);
        world.w.z = float4(world_tz[i0], world_tz[i1], world_tz[i2], world_tz[i3]);
        world.w.w = float4(world_tw[i0], world_tw[i1], world_tw[i2], world_tw[i3]);

        // Create the matrix to go from object->world->view->clip space.
        const auto clip = simd_multiply(world, view_proj);

        SIMDVector min_pos;
        SIMDVector max_pos;

        // Load the mininum and maximum corner positions of the bounding box in object space.
        min_pos.x = float4(min_x[i0], min_x[i1], min_x[i2], min_x[i3]);
        min_pos.y = float4(min_y[i0], min_y[i1], min_y[i2], min_y[i3]);
        min_pos.z = float4(min_z[i0], min_z[i1], min_z[i2], min_z[i3]);
        min_pos.w = float4_splat(1.0f);

        max_pos.x = float4(max_x[i0], max_x[i1], max_x[i2], max_x[i3]);
        max_pos.y = float4(max_y[i0], max_y[i1], max_y[i2], max_y[i3]);
        max_pos.z = float4(max_z[i0], max_z[i1], max_z[i2], max_z[i3]);
        max_pos.w = float4_splat(1.0f);

        SIMDVector clip_pos[8];

        // Transform each bounding box corner from object to clip space by sharing calculations.
        simd_min_max_transform(clip, min_pos, max_pos, clip_pos);

        const auto zero = float4_zero();
        const auto all_true = bool4_all_true();

        // Initialize test conditions.
        auto all_x_less = all_true;
        auto all_x_greater = all_true;
        auto all_y_less = all_true;
        auto all_y_greater = all_true;
        auto all_z_less = all_true;
        auto any_z_less = bool4_all_false();
        auto all_z_greater = all_true;

        // Test each corner of the oobb and if any corner intersects the frustum that object
        // is visible.
        for (unsigned cs = 0; cs < 8; ++cs) {
            const auto neg_cs_w = negate(clip_pos[cs].w);

            auto x_le = clip_pos[cs].x <= neg_cs_w;
            auto x_ge = clip_pos[cs].x >= clip_pos[cs].w;
            all_x_less = vector_and(x_le, all_x_less);
            all_x_greater = vector_and(x_ge, all_x_greater);

            auto y_le = clip_pos[cs].y <= neg_cs_w;
            auto y_ge = clip_pos[cs].y >= clip_pos[cs].w;
            all_y_less = vector_and(y_le, all_y_less);
            all_y_greater = vector_and(y_ge, all_y_greater);

            auto z_le = clip_pos[cs].z <= zero;
            auto z_ge = clip_pos[cs].z >= clip_pos[cs].w;
            all_z_less = vector_and(z_le, all_z_less);
            all_z_greater = vector_and(z_ge, all_z_greater);
            any_z_less = vector_or(z_le, any_z_less);
        }

        const auto any_x_outside = vector_or(all_x_less, all_x_greater);
        const auto any_y_outside = vector_or(all_y_less, all_y_greater);
        const auto any_z_outside = vector_or(all_z_less, all_z_greater);
        auto outside = vector_or(any_x_outside, any_y_outside);
        outside = vector_or(outside, any_z_outside);

        auto inside = vector_xor(outside, all_true);

        // Store the result in the `visibility_flag` array in a compacted way.
        store_aligned(inside, &visibility_flag[i]);
    }
}

The function simd_min_max_transforms used above is the function to transform each OOBB corner from object space to clip space by sharing some of the calculations, for completeness the function is:

void simd_min_max_transform(const SIMDMatrix &m, const SIMDVector &min, const SIMDVector &max, SIMDVector result[])
{
    auto m_xx_x = m.x.x * min.x;    m_xx_x = m_xx_x + m.w.x;
    auto m_xy_x = m.x.y * min.x;    m_xy_x = m_xy_x + m.w.y;
    auto m_xz_x = m.x.z * min.x;    m_xz_x = m_xz_x + m.w.z;
    auto m_xw_x = m.x.w * min.x;    m_xw_x = m_xw_x + m.w.w;

    auto m_xx_X = m.x.x * max.x;    m_xx_X = m_xx_X + m.w.x;
    auto m_xy_X = m.x.y * max.x;    m_xy_X = m_xy_X + m.w.y;
    auto m_xz_X = m.x.z * max.x;    m_xz_X = m_xz_X + m.w.z;
    auto m_xw_X = m.x.w * max.x;    m_xw_X = m_xw_X + m.w.w;

    auto m_yx_y = m.y.x * min.y;
    auto m_yy_y = m.y.y * min.y;
    auto m_yz_y = m.y.z * min.y;
    auto m_yw_y = m.y.w * min.y;

    auto m_yx_Y = m.y.x * max.y;
    auto m_yy_Y = m.y.y * max.y;
    auto m_yz_Y = m.y.z * max.y;
    auto m_yw_Y = m.y.w * max.y;

    auto m_zx_z = m.z.x * min.z;
    auto m_zy_z = m.z.y * min.z;
    auto m_zz_z = m.z.z * min.z;
    auto m_zw_z = m.z.w * min.z;

    auto m_zx_Z = m.z.x * max.z;
    auto m_zy_Z = m.z.y * max.z;
    auto m_zz_Z = m.z.z * max.z;
    auto m_zw_Z = m.z.w * max.z;

    {
        auto xyz_x = m_xx_x + m_yx_y;   xyz_x = xyz_x + m_zx_z;
        auto xyz_y = m_xy_x + m_yy_y;   xyz_y = xyz_y + m_zy_z;
        auto xyz_z = m_xz_x + m_yz_y;   xyz_z = xyz_z + m_zz_z;
        auto xyz_w = m_xw_x + m_yw_y;   xyz_w = xyz_w + m_zw_z;
        result[0].x = xyz_x;
        result[0].y = xyz_y;
        result[0].z = xyz_z;
        result[0].w = xyz_w;
    }

    {
        auto Xyz_x = m_xx_X + m_yx_y;   Xyz_x = Xyz_x + m_zx_z;
        auto Xyz_y = m_xy_X + m_yy_y;   Xyz_y = Xyz_y + m_zy_z;
        auto Xyz_z = m_xz_X + m_yz_y;   Xyz_z = Xyz_z + m_zz_z;
        auto Xyz_w = m_xw_X + m_yw_y;   Xyz_w = Xyz_w + m_zw_z;
        result[1].x = Xyz_x;
        result[1].y = Xyz_y;
        result[1].z = Xyz_z;
        result[1].w = Xyz_w;
    }

    {
        auto xYz_x = m_xx_x + m_yx_Y;   xYz_x = xYz_x + m_zx_z;
        auto xYz_y = m_xy_x + m_yy_Y;   xYz_y = xYz_y + m_zy_z;
        auto xYz_z = m_xz_x + m_yz_Y;   xYz_z = xYz_z + m_zz_z;
        auto xYz_w = m_xw_x + m_yw_Y;   xYz_w = xYz_w + m_zw_z;
        result[2].x = xYz_x;
        result[2].y = xYz_y;
        result[2].z = xYz_z;
        result[2].w = xYz_w;
    }

    {
        auto XYz_x = m_xx_X + m_yx_Y;   XYz_x = XYz_x + m_zx_z;
        auto XYz_y = m_xy_X + m_yy_Y;   XYz_y = XYz_y + m_zy_z;
        auto XYz_z = m_xz_X + m_yz_Y;   XYz_z = XYz_z + m_zz_z;
        auto XYz_w = m_xw_X + m_yw_Y;   XYz_w = XYz_w + m_zw_z;
        result[3].x = XYz_x;
        result[3].y = XYz_y;
        result[3].z = XYz_z;
        result[3].w = XYz_w;
    }

    {
        auto xyZ_x = m_xx_x + m_yx_y;   xyZ_x = xyZ_x + m_zx_Z;
        auto xyZ_y = m_xy_x + m_yy_y;   xyZ_y = xyZ_y + m_zy_Z;
        auto xyZ_z = m_xz_x + m_yz_y;   xyZ_z = xyZ_z + m_zz_Z;
        auto xyZ_w = m_xw_x + m_yw_y;   xyZ_w = xyZ_w + m_zw_Z;
        result[4].x = xyZ_x;
        result[4].y = xyZ_y;
        result[4].z = xyZ_z;
        result[4].w = xyZ_w;
    }

    {
        auto XyZ_x = m_xx_X + m_yx_y;   XyZ_x = XyZ_x + m_zx_Z;
        auto XyZ_y = m_xy_X + m_yy_y;   XyZ_y = XyZ_y + m_zy_Z;
        auto XyZ_z = m_xz_X + m_yz_y;   XyZ_z = XyZ_z + m_zz_Z;
        auto XyZ_w = m_xw_X + m_yw_y;   XyZ_w = XyZ_w + m_zw_Z;
        result[5].x = XyZ_x;
        result[5].y = XyZ_y;
        result[5].z = XyZ_z;
        result[5].w = XyZ_w;
    }

    {
        auto xYZ_x = m_xx_x + m_yx_Y;   xYZ_x = xYZ_x + m_zx_Z;
        auto xYZ_y = m_xy_x + m_yy_Y;   xYZ_y = xYZ_y + m_zy_Z;
        auto xYZ_z = m_xz_x + m_yz_Y;   xYZ_z = xYZ_z + m_zz_Z;
        auto xYZ_w = m_xw_x + m_yw_Y;   xYZ_w = xYZ_w + m_zw_Z;
        result[6].x = xYZ_x;
        result[6].y = xYZ_y;
        result[6].z = xYZ_z;
        result[6].w = xYZ_w;
    }

    {
        auto XYZ_x = m_xx_X + m_yx_Y;   XYZ_x = XYZ_x + m_zx_Z;
        auto XYZ_y = m_xy_X + m_yy_Y;   XYZ_y = XYZ_y + m_zy_Z;
        auto XYZ_z = m_xz_X + m_yz_Y;   XYZ_z = XYZ_z + m_zz_Z;
        auto XYZ_w = m_xw_X + m_yw_Y;   XYZ_w = XYZ_w + m_zw_Z;
        result[7].x = XYZ_x;
        result[7].y = XYZ_y;
        result[7].z = XYZ_z;
        result[7].w = XYZ_w;
    }
}

To get a compact indirection array of all the objects that passed the frustum-OOBB culling, the remove_not_visible function needs to be slightly modified:

uint32_t remove_not_visible(const ObjectSet &object_set, uint32_t count, uint32_t *output_indirection, const uint32_t *input_indirection/*new argument*/)
{
    const uint32_t *visibility_flag = object_set.visibility_flag.begin();
    uint32_t n_visible = 0U;
    for (uint32_t i = 0; i < count; ++i) {

        // Each element of `input_indirection` represents an object that has either been culled
        // or not culled. If it's not null then do a lookup to get the actual object index else
        // use `i` directly.
        const uint32_t index = input_indirection? input_indirection[i] : i;

        // `visibility_flag` is already compacted, so use `i` directly.
        if (visibility_flag[i]) {
            output_indirection[n_visible] = i;
            ++n_visible;
        }
    }

    const uint32_t n_aligned_visible = align_to_simd_lane_count(n_visible);
    const uint32_t last_visible = n_visible? output_indirection[n_visible- 1] : 0;

    // Pad out to the simd alignment.
    for (unsigned i = n_visible; i < n_aligned_visible; ++i)
        output_indirection[i] = last_visible;

    return n_visible;
}

Bringing the frustum-sphere and frustum-OOBB code together we get:

{
    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        float4_splat(frustum.planes[0].n.x),
        float4_splat(frustum.planes[0].n.y),
        float4_splat(frustum.planes[0].n.z),
        float4_splat(frustum.planes[0].d),

        float4_splat(frustum.planes[1].n.x),
        float4_splat(frustum.planes[1].n.y),
        float4_splat(frustum.planes[1].n.z),
        float4_splat(frustum.planes[1].d),

        float4_splat(frustum.planes[2].n.x),
        float4_splat(frustum.planes[2].n.y),
        float4_splat(frustum.planes[2].n.z),
        float4_splat(frustum.planes[2].d),

        float4_splat(frustum.planes[3].n.x),
        float4_splat(frustum.planes[3].n.y),
        float4_splat(frustum.planes[3].n.z),
        float4_splat(frustum.planes[3].d),

        float4_splat(frustum.planes[4].n.x),
        float4_splat(frustum.planes[4].n.y),
        float4_splat(frustum.planes[4].n.z),
        float4_splat(frustum.planes[4].d),

        float4_splat(frustum.planes[5].n.x),
        float4_splat(frustum.planes[5].n.y),
        float4_splat(frustum.planes[5].n.z),
        float4_splat(frustum.planes[5].d),
    };

    // Do frustum-sphere culling.
    simd_sphere_culling(planes, object_set);

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    // Store the indices of the objects that passed the frustum-sphere culling in the `indirection` array.
    Array<uint32_t> indirection(n_aligned_objects);

    const uint32_t n_visible = remove_not_visible(object_set, object_set.n_objects, indirection.begin(), nullptr);

    const auto &view_proj = camera.view() * camera.proj();

    // Construct the SIMDMatrix `simd_view_proj`.
    const SIMDMatrix simd_view_proj = {
        float4_splat(view_proj.v[xx]),
        float4_splat(view_proj.v[xy]),
        float4_splat(view_proj.v[xz]),
        float4_splat(view_proj.v[xw]),

        float4_splat(view_proj.v[yx]),
        float4_splat(view_proj.v[yy]),
        float4_splat(view_proj.v[yz]),
        float4_splat(view_proj.v[yw]),

        float4_splat(view_proj.v[zx]),
        float4_splat(view_proj.v[zy]),
        float4_splat(view_proj.v[zz]),
        float4_splat(view_proj.v[zw]),

        float4_splat(view_proj.v[tx]),
        float4_splat(view_proj.v[ty]),
        float4_splat(view_proj.v[tz]),
        float4_splat(view_proj.v[tw]),
    };

    // Cull objects via frustum-oobb tests.
    simd_oobb_culling(simd_view_proj, object_set, n_visible, indirection.begin());

    // Build up the indirection array that represents the objects that survived the frustum-oobb culling.
    const uint32_t n_oobb_visible = remove_not_visible(object_set, n_visible, indirection.begin(), indirection.begin());
}

The final call to remove_not_visible populates the indirection array with the objects that passed both the frustum-sphere and the frustum-OOBB culling. indirection together with n_oobb_visible is all that is needed to know what objects should be rendered.

Distributing the work over several threads

In Stingray, work is distributed by submitting jobs to a pool of worker threads -- conveniently called the ThreadPool. Submitted jobs are put in a thread safe work queue from which the worker threads pop jobs to work on. A task is defined as:

typedef void (*TaskCallback)(void *user_data);

struct TaskDefinition
{
    TaskCallback callback;
    void *user_data;
};

For the purpose of this article, the interesting methods of the ThreadPool are:

class ThreadPool
{
    // Adds `count` tasks to the work queue.
    void add_tasks(const TaskDefinition *tasks, uint32_t count);

    // Tries to pop one task from the queue and do that work. Returns true if any work was done.
    bool do_work();

    // Will call `do_work` while `signal` == value.
    void wait_atomic(std::atomic<uint32_t> *signal, uint32_t value);
};

The ThreadPool doesn't dictate how to synchronize when a job is fully processed, but usually a std::atomic<uint32_t> signal is used for that purpose. The value is 0 while the job is being processed and set to 1 when it's done. wait_atomic() is a convenience method that can be used to wait for such values:

void ThreadPool::wait_atomic(std::atomic<uint32_t> *signal, uint32_t value)
{
    while (signal->load(std::memory_order_acquire) == value) {
        if (!do_work())
            YieldProcessor();
    }
}

do_work:

bool ThreadPool::do_work()
{
    TaskDefinition task;
    if (pop_task(task)) {
        task.callback(task.user_data);
        return true;
    }
    return false;
}

Multi-threading the culling only requires a few changes to the code. For the simd_sphere_culling() method we need to add offset and count parameters to specify the range of objects we are processing:

void simd_sphere_culling(const SIMDPlane planes[6], culling::ObjectSet &object_set, uint32_t offset, uint32_t count)
{
    const auto all_true = bool4_all_true();
    const uint32_t n_objects = offset + count;

    uint32_t *visibility_flag = object_set.visibility_flag.begin();

    // Test each plane of the frustum against each sphere.
    for (uint32_t i = offset; i < n_objects; i += 4)
    {
        const auto ws_pos_x = float4_load_aligned(&object_set->ws_pos_x[i]);
        const auto ws_pos_y = float4_load_aligned(&object_set->ws_pos_y[i]);
        const auto ws_pos_z = float4_load_aligned(&object_set->ws_pos_z[i]);
        const auto radius = float4_load_aligned(&object_set->radius[i]);

        auto inside = all_true;
        for (unsigned p = 0; p < 6; ++p) {
            auto &n_x = planes[p].normal_x;
            auto &n_y = planes[p].normal_y;
            auto &n_z = planes[p].normal_z;
            auto n_dot_pos = dot_product(ws_pos_x, ws_pos_y, ws_pos_z, n_x, n_y, n_z);
            auto plane_test_point = n_dot_pos + radius;
            auto plane_test = plane_test_point >= planes[p].d;
            inside = vector_and(plane_test, inside);
        }

        // Store 0 for spheres that didn't intersect or ended up on the positive side of the
        // frustum planes. Store 0xffffffff for spheres that are visible.
        store_aligned(inside, &visibility_flag[i]);
    }
}

Bringing the previous code snippet together with multi-threaded culling:

{
    // Calculate the number of work items based on that each work will process `work_size` elements.
    const uint32_t work_size = 512;

    // `div_ceil(a, b)` calculates `(a + b - 1) / b`.
    const uint32_t n_work_items = math::div_ceil(n_objects, work_size);

    Array<CullingWorkItem> culling_work_items(n_work_items);
    Array<TaskDefinition> tasks(n_work_items);

    // Splat out the planes to be able to do plane-sphere test with SIMD.
    const auto &frustum = camera.frustum();

    const SIMDPlane planes[6] = {
        same code as previously shown...
    };

    // Make sure to align the size to the simd lane count.
    const uint32_t n_aligned_objects = align_to_simd_lane_count(object_set.n_objects);

    for (unsigned i = 0; i < n_work_items; ++i) {

        // The `offset` and `count` for the work item.
        const uint32_t offset = math::min(work_size * i, n_objects);
        const uint32_t count = math::min(work_size, n_objects - offset);

        auto &culling_item = culling_work_items[i];
        memcpy(culling_data.planes, planes, sizeof(planes));
        culling_item.object_set = &object_set;
        culling_item.offset = offset;
        culling_item.count = count;
        culling_item.signal = 0;

        auto &task = tasks[i];
        task.callback = simd_sphere_culling_task;
        task.user_data = &culling_item;
    }

    // Add the tasks to the `ThreadPool`.
    thread_pool.add_tasks(n_work_items, tasks.begin());
    // Wait for each `item` and if it's not done, help out with the culling work.
    for (auto &item : culling_work_items)
        thread_pool.wait_atomic(&item.signal, 0);
}

CullingWorkItem and simd_sphere_culling_task are defined as:

struct CullingWorkItem
{
    SIMDPlane planes[6];
    const culling::ObjectSet *object_set;
    uint32_t offset;
    uint32_t count;
    std::atomic<uint32_t> signal;
};

void simd_sphere_culling_task(void *user_data)
{
    auto culling_item = (CullingWorkItem*)(user_data);

    // Call the frustum-sphere culling function.
    simd_sphere_culling(culling_item->planes, *culling_item->object_set, culling_item->offset, culling_item->count);

    // Signal that the work is done.
    culling_item->store(1, std::memory_order_release);
}

The same pattern is used to multi-thread the frustum-OOBB culling. That is "left as an exercise for the reader" ;)

Conclusion

This type of culling is done for all of the objects that can be rendered, i.e. meshes, particle systems, terrain, etc. We also use it to cull light sources. It is used both when rendering the main scene and for rendering shadows.

I've left out a few details of our solution. One thing we also do is something called contribution culling. In the frustum-OOBB culling step, the extents of the OOBB corners are projected to the near plane and from that the screen space extents are derived. If the object is smaller than a certain threshold in any axis the object is considered as culled. Special care needs to be considered if any of the corners intersect or is behind the near plane so we don't have to deal with "external line segments" caused by the projection. If you don't know what that is see: http://www.gamasutra.com/view/news/168577/Indepth_Software_rasterizer_and_triangle_clipping.php. In our case the contribution culling is disabled by expanding the extents to span the entire screen when any corner intersects or is behind the near plane.

For our cascaded shadow maps, the extents are also used to detect if an object is fully enclosed by a cascade. If that is the case, then that object is culled from the later cascades. Let me illustrate with some ASCII:

+-----------+-----------+
|           |           |
|     /\    |           |
|    /--\   |           |
+-----------+-----------+
|           |           |
|           |           |
|           |           |
+-----------+-----------+

The squares are the different cascades. The top left square is the first cascades, the top right is the second cascade, bottom left the third and the bottom right is the fourth cascade. In this case the weird triangle shaped object is fully enclosed by the first cascade. What that means is that the object doesn't need to be rendered to any of the later cascades, since the shadow contribution from that object will be fully taken care of from the first cascade.