I’ve been looking at a lot of resources on current rendering algorithms for nice-looking real-time graphics and decided it’s time I actually implement some of them. This is the first in a series of three projects I worked on to improve my understanding of some graphics algorithms. I’m also using physically based shading in some of these screenshots.
Deferred Shading
Deferred shading is an alternative method of doing lighting calculations for a 3D scene. The traditional method, forward shading, is to render each object to the back buffer and do the lighting calculations for that object at the same time. Forward shading has long been the primary method used by most rasterizer-based renderers. Standard forward rendering has the drawback that it quickly becomes hard to manage when you want more than a few dynamic lights affecting an object. The most common solutions are to either pass an array of lights into each shader and let the shader evaluate the shading for every light in the list on each object, or to render the same object multiple times with additive blending, once for each light that affects it.
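As a rough illustration, here’s what the first approach might look like in HLSL; the `PointLight` layout, register assignments, and the simple attenuation are placeholder choices of mine rather than anything from a specific engine:

```hlsl
// Sketch of forward shading with a light array in a constant buffer.
#define MAX_LIGHTS 8

struct PointLight
{
    float3 position;
    float  radius;
    float3 color;
    float  pad; // keep 16-byte alignment in the constant buffer
};

cbuffer LightBuffer : register(b0)
{
    PointLight g_Lights[MAX_LIGHTS];
    uint       g_NumLights;
};

float3 ShadeForward(float3 worldPos, float3 normal, float3 albedo)
{
    float3 result = 0;
    for (uint i = 0; i < g_NumLights; ++i)
    {
        float3 toLight = g_Lights[i].position - worldPos;
        float  dist    = length(toLight);
        float  atten   = saturate(1.0 - dist / g_Lights[i].radius);
        // Simple Lambert term; a real shader would evaluate the full BRDF.
        result += albedo * g_Lights[i].color *
                  saturate(dot(normal, toLight / dist)) * atten;
    }
    return result;
}
```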
Deferred shading does what its name says: it defers the lighting calculations until all objects have been rendered, then shades the whole scene in one pass. This is done by rendering information about each object into a set of render targets that contain data about the object’s surface; this set of render targets is normally called the G-buffer.
For instance, one of the render targets stores the normals of each object encoded into the 0 to 1 range:
This is also done with diffuse albedo, specular color and roughness, depth, and emissive; each gets packed into its own texture.
At the moment I’m using textures packed with the following formats (a sketch of the pixel shader that fills them follows the list):
- Diffuse: RGBA8 texture
- Specular Color & Roughness: RGBA8 texture where the alpha is the roughness
- World Space Normal: RGBA16 texture where alpha is currently unused
- Emissive Color: RGBA16 texture
- Depth: R32 texture
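Here’s a sketch of what the G-buffer pixel shader might look like; the input layout, texture bindings, and names are illustrative:

```hlsl
// Sketch of a G-buffer pixel shader writing to multiple render targets.
// The output structure mirrors the formats listed above.
Texture2D    g_AlbedoMap   : register(t0);
Texture2D    g_SpecularMap : register(t1); // roughness lives in alpha
Texture2D    g_EmissiveMap : register(t2);
SamplerState g_Sampler     : register(s0);

struct PSInput
{
    float4 positionCS : SV_Position;
    float3 normalWS   : NORMAL;
    float2 uv         : TEXCOORD0;
};

struct GBufferOutput
{
    float4 diffuse  : SV_Target0; // RGBA8:  diffuse albedo
    float4 specular : SV_Target1; // RGBA8:  specular color, roughness in alpha
    float4 normal   : SV_Target2; // RGBA16: world-space normal (alpha unused)
    float4 emissive : SV_Target3; // RGBA16: emissive color
};

GBufferOutput PSGBuffer(PSInput input)
{
    GBufferOutput o;
    o.diffuse  = g_AlbedoMap.Sample(g_Sampler, input.uv);
    o.specular = g_SpecularMap.Sample(g_Sampler, input.uv);
    // Encode the [-1, 1] world-space normal into [0, 1] for storage.
    o.normal   = float4(normalize(input.normalWS) * 0.5 + 0.5, 0.0);
    o.emissive = g_EmissiveMap.Sample(g_Sampler, input.uv);
    return o;
}
// Depth (R32) comes from the bound depth-stencil buffer, not a color target.
```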
A while back I made a very basic implementation of deferred shading that would just render a scene to a G-buffer and then draw a fullscreen quad that evaluated lighting from hundreds of point lights at each fragment. This ran pretty poorly since it was brute-force shading every fragment with every light. I ended up going back to forward shading for a while and using a single directional light for many of my projects afterwards.
I looked around and found a number of culling techniques that could significantly improve deferred rendering performance. A few involved drawing proxy geometry that approximates the bounds of each type of light and evaluating lighting by sampling from the G-buffer for each fragment that the geometry touches. This can be implemented with varying complexity of proxy geometry. Some implementations just use billboarded quads with enough width and height in world space to approximate the bounds of the area the light influences; for instance, a point light would just get a quad with a width and height matching the light’s radius of influence. Other implementations draw actual 3D proxy geometry, like spheres for point lights and cones for spotlights.
These implementations have the issue that they require many additional samples of the G-buffer. Each light still needs to sample every texture in the G-buffer (five textures in my case), so each fragment of the G-buffer gets sampled 5 × the number of lights affecting that fragment. Additionally, these techniques incur a lot of overdraw, since many of the proxy geometry objects will overlap and cannot be culled most of the time.
Tiled Deferred Shading
Tiled deferred shading avoids the overdraw and only needs to sample each G-buffer texture once, so it’s generally capable of performing much better than proxy geometry. The main resource I found on tiled deferred shading was this presentation and implementation.
Tiled deferred shading takes a different approach: it culls lights in the same pass as the shading calculations, instead of using proxy geometry that executes the shading multiple times per fragment.
Tiled deferred shading splits the view frustum into many smaller frustums, one per tile of a screen-space grid, each extended along z in view space, and does the culling against each of those frustums.
This is what a grid of tiles looks like when colored by the number of lights intersecting each frustum:
In my implementation I split the screen into 16×16-pixel tiles using a compute shader with 16×16×1 thread groups. Each tile contains a list of indices into a global array of lights. The tiled deferred shader starts by constructing a frustum for each tile, capped by the minimum and maximum depth found within that tile’s 16×16 region of the depth texture. This is why edges are highlighted in the previous picture: the frustums on those tiles have a much larger spread between minimum and maximum depth, so they can potentially intersect more point lights.
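Here’s a rough sketch of how the depth-bounds part of that compute shader can look. The atomics operate on the raw float bits via `asuint`, which preserves ordering for non-negative depth values; the shader and resource names are mine:

```hlsl
// Sketch of the per-tile depth bounds step of the culling compute shader,
// with one 16x16 thread group per tile.
Texture2D<float> g_Depth : register(t0);

groupshared uint gs_MinDepth;
groupshared uint gs_MaxDepth;

[numthreads(16, 16, 1)]
void CSLightCulling(uint3 dtid : SV_DispatchThreadID,
                    uint  gidx : SV_GroupIndex)
{
    if (gidx == 0)
    {
        gs_MinDepth = 0x7F7FFFFF; // raw bits of FLT_MAX
        gs_MaxDepth = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    // Atomics on the raw bits: for non-negative floats, uint ordering
    // matches float ordering.
    float depth = g_Depth.Load(int3(dtid.xy, 0));
    InterlockedMin(gs_MinDepth, asuint(depth));
    InterlockedMax(gs_MaxDepth, asuint(depth));
    GroupMemoryBarrierWithGroupSync();

    float minDepth = asfloat(gs_MinDepth);
    float maxDepth = asfloat(gs_MaxDepth);
    // The four side planes of the tile frustum are then built from the
    // tile's corners unprojected into view space, with minDepth and
    // maxDepth (converted to view space) capping the near and far ends.
}
```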
Once the frustums are constructed, the threads in each tile’s group check the lights against the tile’s frustum. When a light overlaps the frustum, its index in the global light list is appended to the tile’s index list, which is stored in group shared memory.
Finally, once the index list for the tile is constructed, each thread loops through it and accumulates the lighting from all lights in the list.
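Continuing the sketch inside the same compute shader: each thread tests a strided subset of the global light list, appends hits to a groupshared index list, then shades its own pixel from that list. `SphereInsideFrustum`, `EvaluatePointLight`, `frustum`, `surfaceData`, and `g_NumLights` are illustrative names, and `gs_LightCount` is assumed to be zeroed by thread 0 alongside the depth bounds:

```hlsl
// Continuation of the culling shader sketch.
#define MAX_LIGHTS_PER_TILE 256

StructuredBuffer<PointLight> g_Lights : register(t1);

groupshared uint gs_LightCount; // zeroed by thread 0 before the barrier
groupshared uint gs_LightIndices[MAX_LIGHTS_PER_TILE];

// ... inside CSLightCulling, after the tile frustum has been built:
// each of the 256 threads tests a strided subset of the global light list.
for (uint i = gidx; i < g_NumLights; i += 16 * 16)
{
    if (SphereInsideFrustum(g_Lights[i].position, g_Lights[i].radius, frustum))
    {
        uint slot;
        InterlockedAdd(gs_LightCount, 1, slot);
        if (slot < MAX_LIGHTS_PER_TILE)
            gs_LightIndices[slot] = i;
    }
}
GroupMemoryBarrierWithGroupSync();

// Each thread then shades its own pixel from the shared list.
float3 color = 0;
for (uint j = 0; j < min(gs_LightCount, MAX_LIGHTS_PER_TILE); ++j)
{
    color += EvaluatePointLight(g_Lights[gs_LightIndices[j]],
                                surfaceData); // this pixel's G-buffer data
}
```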
Here’s what the Sponza scene looks like with 512 lights (I’m also using a physically based BRDF but I won’t talk about that in this post):
Performance
Performance depends on several factors, so it’s difficult to gauge for all scenarios. My computer has a GTX 980, so all measurements were made on that card. With my current settings I can render the Sponza scene at 1080p with 512 lights in 16 ms a frame. 256 lights generally takes 4.5 ms and 1024 takes 30 ms.
In larger scenes performance would be much better, since far fewer lights would generally intersect any given tile.
Drawbacks
Tiled deferred shading is not a magic bullet, and has some major problems associated with it.
The first is that it can take a lot of memory, especially at higher output resolutions. With my current buffer layout I’m being pretty greedy, since I want extra precision on normals and a large range on emissive without encoding an intensity; that works out to 28 bytes per pixel, so at 1080p my G-buffer takes up 55 MB on the GPU. This is less of a problem on newer GPUs and consoles with 2, 4, or 8 GB of GPU memory. However, it still grows with resolution and can eat away at the texture budget: render at 4K and the same G-buffer suddenly takes roughly four times as much, around 220 MB.
The second and largest issue is that there’s no way to handle transparency unless you store a deep G-buffer with data for multiple fragments per pixel, which makes the memory situation far worse. Most engines just do a forward pass after the deferred pass to render all of the transparent geometry, but it’s harder to shade that geometry efficiently.
Another issue is that it’s difficult to vary the shading model between pieces of geometry. If you want to shade most of the level with a standard microfacet BRDF, then render some characters with a subsurface scattering effect on their skin, then render their eyes, then render some cars with clearcoat surfaces, it’s not going to work with just basic deferred shading. In these situations I’ve seen engines do one of two things: some build every shading model into the tiled deferred shader and dynamically branch between them based on a value from the G-buffer; others do a forward rendering pass with the alternate shading model.
The final significant issue is that it’s difficult to do good-quality anti-aliasing with decent performance. Most engines that use deferred shading pipelines use FXAA or some other screen-space anti-aliasing solution. The problem with FXAA is that it tends to blur details that shouldn’t be blurred. It is possible to do MSAA with deferred shading by rendering to a buffer with storage for multiple samples, then finding the edges of objects in the G-buffer and executing shading for all samples on the edges. However, this has a more significant performance hit with deferred shading than with forward shading because of the additional texture sampling.
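For illustration, the edge-finding step might look something like this, classifying a pixel as an edge when its per-sample G-buffer normals disagree; the texture name and the 0.99 threshold are my own placeholders:

```hlsl
// Sketch of classifying "edge" pixels for deferred MSAA by comparing
// per-sample G-buffer normals.
Texture2DMS<float4> g_NormalMS : register(t0);

bool IsEdgePixel(int2 pixel, uint sampleCount)
{
    // Decode the stored [0, 1] normal back to [-1, 1].
    float3 n0 = g_NormalMS.Load(pixel, 0).xyz * 2.0 - 1.0;
    for (uint s = 1; s < sampleCount; ++s)
    {
        float3 n = g_NormalMS.Load(pixel, s).xyz * 2.0 - 1.0;
        // Large variation between samples marks a geometric edge;
        // those pixels get shaded per sample, the rest only once.
        if (dot(n0, n) < 0.99)
            return true;
    }
    return false;
}
```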
Improvements to Tiled Deferred Shading
It’s possible to improve the light culling in various ways to avoid as many false-positive overlap tests as possible. I have yet to try any of these in my implementation.
The first way is to change how the intersection tests are performed. The standard algorithm uses simple sphere-frustum intersection tests, but it’s possible, and can be more effective, to use alternative tests. Constructing a bounding box around the frustum and additionally checking against that removes overlaps that the frustum-sphere test wrongly detects; the bounding-box test also does better on its own. Iñigo Quilez has a good article on the false positives of frustum tests here.
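A minimal sketch of the sphere-vs-AABB test, assuming a box has already been fitted around the tile frustum:

```hlsl
// Sphere-vs-AABB: find the closest point on the box to the sphere center
// and compare the squared distance against the squared radius.
bool SphereIntersectsAABB(float3 center, float radius,
                          float3 aabbMin, float3 aabbMax)
{
    float3 closest = clamp(center, aabbMin, aabbMax);
    float3 d = center - closest;
    return dot(d, d) <= radius * radius;
}
```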
The second way is to partition the frustum more effectively, and there are several ways people have done this. One is to split the frustum in half along depth, midway between its minimum and maximum depth, then find the minimum and maximum depths of each half and build two separate frustums to test against. This is what Unreal Engine 4 does, and it works decently to resolve depth discontinuities. It’s also possible to take a simpler approach and just split the light list in two, one list for the near half and one for the far half; the correct list is then chosen in the shading stage based on the depth of the fragment being shaded.
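A rough sketch of that simpler two-list variant; the counts, lists, and depth variables below are illustrative fragments rather than a complete shader:

```hlsl
#define MAX_LIGHTS_PER_TILE 256

groupshared uint gs_NearCount; // both counts zeroed by thread 0
groupshared uint gs_FarCount;
groupshared uint gs_NearList[MAX_LIGHTS_PER_TILE];
groupshared uint gs_FarList[MAX_LIGHTS_PER_TILE];

// During culling, after light i passes the tile frustum test, with midZ
// halfway between the tile's min and max view-space depth and
// lightMinZ/lightMaxZ the light's view-space z extents:
if (lightMinZ <= midZ)
{
    uint slot;
    InterlockedAdd(gs_NearCount, 1, slot);
    if (slot < MAX_LIGHTS_PER_TILE) gs_NearList[slot] = i;
}
if (lightMaxZ >= midZ)
{
    uint slot;
    InterlockedAdd(gs_FarCount, 1, slot);
    if (slot < MAX_LIGHTS_PER_TILE) gs_FarList[slot] = i;
}

// During shading, each fragment picks the list for its own depth:
bool nearHalf = (fragmentViewZ <= midZ);
uint count = nearHalf ? gs_NearCount : gs_FarCount;
for (uint j = 0; j < count; ++j)
{
    uint lightIndex = nearHalf ? gs_NearList[j] : gs_FarList[j];
    // ... accumulate lighting for lightIndex ...
}
```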
Determining the minimum and maximum depth values for each tile can also be improved fairly simply. Most implementations of tiled deferred shading just brute-force the minimum and maximum using InterlockedMin and InterlockedMax. It’s possible to get a small performance boost by doing a parallel reduction over each tile instead, though it would require a separate shader.
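A sketch of what that reduction could look like for a 256-thread tile; `depth` is assumed to already hold each thread’s sample from the depth texture:

```hlsl
// Sketch of a groupshared min/max parallel reduction over one tile.
groupshared float gs_Min[256];
groupshared float gs_Max[256];

// Each thread seeds the arrays with its own sample:
gs_Min[gidx] = depth;
gs_Max[gidx] = depth;
GroupMemoryBarrierWithGroupSync();

// Each step folds the top half of the array into the bottom half:
// 128 active threads, then 64, 32, ... down to 1.
[unroll]
for (uint stride = 128; stride > 0; stride >>= 1)
{
    if (gidx < stride)
    {
        gs_Min[gidx] = min(gs_Min[gidx], gs_Min[gidx + stride]);
        gs_Max[gidx] = max(gs_Max[gidx], gs_Max[gidx + stride]);
    }
    GroupMemoryBarrierWithGroupSync();
}
// gs_Min[0] and gs_Max[0] now hold the tile's depth bounds.
```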
Similar Algorithms
Tiled deferred shading can actually be extended fairly easily to resolve many of the issues with the original implementation; in exchange, it requires a pre-pass over the scene to render depth. This technique is normally referred to as Forward+.
Forward+ renders the scene from the camera’s point of view to a depth buffer, runs the same tiled culling algorithm that tiled deferred shading uses on that depth texture, and builds a set of index lists in a global buffer. All geometry in the scene is then rendered a second time and shaded; during this forward pass, the shader picks which index list to loop through based on the screen-space position of the fragment.
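Here’s a sketch of the shading side of Forward+, reusing the `PointLight` struct from the earlier sketch. It assumes a flat layout of one fixed-stride index list per tile, which is one possible buffer layout among several:

```hlsl
// Sketch of a Forward+ pixel shader looking up its tile's light list.
#define TILE_SIZE 16
#define MAX_LIGHTS_PER_TILE 256

StructuredBuffer<PointLight> g_Lights      : register(t0);
StructuredBuffer<uint>       g_TileCounts  : register(t1);
StructuredBuffer<uint>       g_TileIndices : register(t2); // fixed stride

cbuffer FrameConstants : register(b0)
{
    uint g_TilesX; // number of tiles across the screen
};

float3 ShadeForwardPlus(float4 svPos, float3 worldPos,
                        float3 normal, float3 albedo)
{
    // Derive the tile index from the fragment's screen position.
    uint2 tile     = uint2(svPos.xy) / TILE_SIZE;
    uint  tileIdx  = tile.y * g_TilesX + tile.x;
    uint  count    = g_TileCounts[tileIdx];
    uint  baseSlot = tileIdx * MAX_LIGHTS_PER_TILE;

    float3 color = 0;
    for (uint j = 0; j < count; ++j)
    {
        PointLight light = g_Lights[g_TileIndices[baseSlot + j]];
        float3 toLight = light.position - worldPos;
        float  dist    = length(toLight);
        float  atten   = saturate(1.0 - dist / light.radius);
        // Simple Lambert term again; swap in the full BRDF as needed.
        color += albedo * light.color *
                 saturate(dot(normal, toLight / dist)) * atten;
    }
    return color;
}
```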
This algorithm generally performs better than tiled deferred shading at fewer than 2048 lights, and it sees performance gains across the board when comparing MSAA between the two algorithms. It also solves the problems with transparent geometry and varying shading models.
Another more recent extension to tiled deferred shading is an algorithm called clustered deferred shading. Clustered deferred shading does frustum culling in three dimensions instead of two by splitting the tile frustums along the Z axis with exponentially spaced slices. It can also be extended similarly to work for forward shading.
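A sketch of how a fragment might pick its cluster with exponentially spaced slices, in the spirit of the clustered shading paper; the slice count and tile size here are illustrative:

```hlsl
// Sketch of cluster lookup with exponential depth slicing.
#define NUM_SLICES 16

uint DepthSlice(float viewZ, float nearZ, float farZ)
{
    // Slice boundaries lie at nearZ * (farZ/nearZ)^(k/N), so the slice
    // index is the normalized log of the fragment's view-space depth.
    float t = log(viewZ / nearZ) / log(farZ / nearZ);
    return min(uint(t * NUM_SLICES), NUM_SLICES - 1);
}

uint3 ClusterCoord(float4 svPos, float viewZ, float nearZ, float farZ)
{
    return uint3(uint2(svPos.xy) / 64,            // e.g. 64x64-pixel tiles
                 DepthSlice(viewZ, nearZ, farZ)); // exponential z slice
}
```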
Instead of doing a brute-force check of every light against every cell of the frustums, it does hierarchical culling by building an octree over the cells, then merges cells with identical index lists and stores them in a page table. This allows for much faster and more effective culling: in the paper’s implementation they are able to cull one million lights in under 6 ms.
Where to Go From Here
I want to improve the culling at some point to be tighter about what it can cull, probably using a combination of frustum and bounding-box culling. Next would be implementing Forward+, but I like the abstraction that deferred shading provides, so I need to figure out a good way to author shaders for Forward+ that isn’t crazy complicated.
I also want to try implementing clustered forward shading, since it seems to allow crazy large numbers of lights in real time. However, the original implementation uses CUDA for the culling, and I’m not completely sure whether that means it’s impossible or just inefficient to implement with DirectX compute shaders. Some other implementations have done the culling on the CPU. I need to look more at the implementation to see whether it’s plausible in compute shaders; otherwise I’ll probably just do the CUDA implementation and interop with DirectX, since I’ve looked at CUDA before and it doesn’t seem that complex.
Useful Resources
Andrew Lauritzen - Deferred Rendering for Current and Future Rendering Pipelines
Gareth Thomas - Advancements In Tiled Rendering
Ola Olsson, Markus Billeter, and Ulf Assarsson - Clustered Deferred and Forward Shading