<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Leif Node</title>
	<atom:link href="https://leifnode.com/category/shading/feed/" rel="self" type="application/rss+xml" />
	<link>https://leifnode.com</link>
	<description>Leif Erkenbrach&#039;s programming blog</description>
	<lastBuildDate>Wed, 22 Jul 2015 23:00:42 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Voxel Cone Traced Global Illumination</title>
		<link>https://leifnode.com/2015/05/voxel-cone-traced-global-illumination/</link>
		<comments>https://leifnode.com/2015/05/voxel-cone-traced-global-illumination/#comments</comments>
		<pubDate>Mon, 04 May 2015 07:06:11 +0000</pubDate>
		<dc:creator><![CDATA[Leif Erkenbrach]]></dc:creator>
				<category><![CDATA[DirectX]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Shading]]></category>

		<guid isPermaLink="false">http://leifnode.com/?p=500</guid>
		<description><![CDATA[I&#8217;ve been looking at voxel cone traced global illumination for a while as something that I want to implement since it gives a decent approximation of global illumination in real time for dynamic scenes. In the past month I&#8217;ve finally given myself a chance to look at the algorithm more in-depth and try at implementing it. Voxel Cone Traced Global …<p> <a class="continue-reading-link" href="https://leifnode.com/2015/05/voxel-cone-traced-global-illumination/">Continue reading<i class="icon-right-dir"></i></a></p>]]></description>
				<content:encoded><![CDATA[<p><a href="http://leifnode.com/wp-content/uploads/2015/05/VXGI1.png"><img class="alignnone wp-image-501 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/VXGI1-1024x576.png" alt="VXGI1" width="920" height="517" /></a></p>
<p>I&#8217;ve been looking at voxel cone traced global illumination for a while as something that I want to implement since it gives a decent approximation of global illumination in real time for dynamic scenes. In the past month I&#8217;ve finally given myself a chance to look at the algorithm more in-depth and try at implementing it.</p>
<h3>Voxel Cone Traced Global Illumination</h3>
<p>Voxel cone traced global illumination allows real-time evaluation of indirect lighting. It works by voxelizing a scene into a structure on the GPU that stores outgoing radiance and occlusion. Then the scene is rendered as normal, but cones are cast through the volume from each fragment to approximate indirect diffuse and specular lighting.</p>
<h3>Voxelization</h3>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-14-10-27-55.png"><img class="alignnone size-large wp-image-510" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-14-10-27-55-1024x576.png" alt="Novus-VXGITest" width="920" height="517" /></a></p>
<p>The first step of the algorithm is to voxelize the scene. The original implementation builds a sparse octree structure on the GPU. The octree significantly reduces memory usage on the GPU, making it possible to voxelize the scene at higher resolutions, but traversing the structure is not especially fast.</p>
<p>Instead of using a sparse voxel octree I just used a 3D texture in my implementation to simplify the cone tracing and mip mapping steps.</p>
<p>The <a href="http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf">OpenGL Insights chapter</a> about voxelization using the hardware rasterizer was helpful for getting an idea of how to do voxelization. Its method for averaging colors on voxels using interlocked operations was especially useful, though I ran into some problems adapting it to DirectX.</p>
<p>When I voxelize geometry, each triangle is passed through a geometry shader that computes the face normal and uses it to find the dominant axis to project the triangle onto. Once the triangle is projected onto its dominant axis, it is passed to a pixel shader that writes into the target 3D texture.</p>
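<p>As a rough illustration (a hedged CPU-side sketch, not the actual HLSL geometry shader; the function names are made up), the axis selection and projection look like this:</p>

```python
# Pick the projection axis that maximizes the triangle's rasterized area,
# then drop that coordinate to project the triangle onto the other two axes.

def dominant_axis(normal):
    """Return 0, 1, or 2 for the axis (X, Y, Z) whose absolute component
    of the face normal is largest."""
    ax = [abs(c) for c in normal]
    return ax.index(max(ax))

def project_triangle(tri, axis):
    """Drop the dominant-axis coordinate of each vertex, leaving the 2D
    triangle that gets rasterized into the voxel volume."""
    kept = [i for i in range(3) if i != axis]
    return [(v[kept[0]], v[kept[1]]) for v in tri]
```

<p>A triangle facing mostly along Z gets projected onto the XY plane, which keeps the rasterizer from skipping voxels it would only graze under a glancing projection.</p>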
<p>I store the diffuse albedo, normal, and emissive color in three 3D textures, all with the RGBA8 format. Together with the radiance texture, these end up taking about 350MB of GPU memory with 256x256x256 volumes; since they are not sparse, there is a lot of unused, wasted space. A 512x512x512 volume takes about 2.5 GB of GPU memory, so I normally stick with the 256 volume.</p>
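<p>The arithmetic behind those numbers is straightforward (a back-of-the-envelope sketch; the exact totals also depend on the anisotropic mip chains):</p>

```python
# Dense volume texture memory: resolution^3 voxels at 4 bytes each (RGBA8).

def dense_volume_mb(resolution, bytes_per_voxel=4, textures=4):
    """Approximate size in MB of `textures` dense volumes (e.g. albedo,
    normal, emissive, radiance) at a cubic resolution."""
    return textures * resolution ** 3 * bytes_per_voxel / (1024 ** 2)

# One 256^3 RGBA8 volume is 64 MB, so four of them are 256 MB; the
# anisotropic mip chains push the total toward the ~350MB figure above.
```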
<h3>Injecting Radiance</h3>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-14-12-02-52.png"><img class="alignnone wp-image-511 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-14-12-02-52-1024x576.png" alt="RadianceInjection" width="920" height="517" /></a></p>
<p>Next I render a shadow map from the light&#8217;s perspective. I then run a compute shader on the resulting depth map that unprojects each pixel back into world space and then into voxel volume coordinates. It then gets the diffuse color and normal at the position of the pixel, calculates the diffuse lighting on the voxel, and stores the result in a radiance texture.</p>
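<p>The coordinate mapping in that compute shader can be sketched like this (hedged: it assumes an orthographic directional light and an axis-aligned volume, and the helper names are made up):</p>

```python
# Map a shadow-map texel back to world space, then into voxel coordinates.

def unproject_ortho(u, v, depth, light_min, light_max):
    """Map normalized shadow-map coords (u, v, depth, all in [0,1]) back
    to a world-space position inside the light's orthographic bounds."""
    return tuple(lo + t * (hi - lo)
                 for t, lo, hi in zip((u, v, depth), light_min, light_max))

def world_to_voxel(p, vol_min, vol_max, resolution):
    """Map a world-space position to integer voxel coordinates."""
    return tuple(int((c - lo) / (hi - lo) * resolution)
                 for c, lo, hi in zip(p, vol_min, vol_max))
```

<p>The diffuse lighting computed at that pixel is then written into the radiance texture at the resulting voxel coordinate.</p>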
<h3>Mip Mapping</h3>
<div id="attachment_514" style="width: 930px" class="wp-caption alignnone"><a href="http://leifnode.com/wp-content/uploads/2015/05/VoxelMipMap.png"><img class="size-large wp-image-514" src="http://leifnode.com/wp-content/uploads/2015/05/VoxelMipMap-1024x576.png" alt="Anisotropic voxels mip for the -Z direction" width="920" height="517" /></a><p class="wp-caption-text">Anisotropic voxels mip for the -Z direction</p></div>
<p>In order to do cone tracing, the volume needs to be mip mapped. I mip map the volume into anisotropic voxels in the same way that the original implementation does. Anisotropic voxels store a color that varies based on the direction the voxel is sampled from.</p>
<p>I store a color for positive and negative X, Y, and Z. These values are calculated by taking the 8 voxels that go into the upper mip and doing volume integration on them along the direction that the anisotropic voxel represents.</p>
<p>Since DirectX has no 3D texture array structure, I store the colors for the different axes of the anisotropic voxels in one larger 3D texture whose X dimension is extended to 6x the size of the mipmap&#8217;s X dimension. This is not that terrible for memory usage since it only needs to be done on the upper mip levels, not the base volume. Since I normally use a 256x256x256 volume, the first mip level ends up being 768x128x128.</p>
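<p>Addressing into that widened texture is just an offset along X (a sketch of the layout described above; the helper name is made up):</p>

```python
# The atlas packs six directional sub-volumes side by side along X.

def aniso_texel(direction, x, y, z, mip_size):
    """direction: 0..5 for +X, -X, +Y, -Y, +Z, -Z. Each direction's
    sub-volume is mip_size wide, so the atlas X axis is 6 * mip_size."""
    assert 0 <= direction < 6 and 0 <= x < mip_size
    return (direction * mip_size + x, y, z)
```

<p>With a 128-wide first mip, the sixth direction&#8217;s sub-volume occupies atlas X coordinates 640 through 767.</p>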
<p>Because of this, the anisotropic voxel volume is stored as a different texture than the base radiance texture.</p>
<h3>Cone Tracing</h3>
<p>Cone tracing is a way to approximate the result of casting many rays into a scene with a distribution on a lobe. This is done by taking samples along a ray, but as the samples get further from the ray origin the sampled mip map level increases.</p>
<div id="attachment_528" style="width: 930px" class="wp-caption alignnone"><a href="http://leifnode.com/wp-content/uploads/2015/05/ConeSamples.jpg"><img class="size-large wp-image-528" src="http://leifnode.com/wp-content/uploads/2015/05/ConeSamples-1024x576.jpg" alt="Each circle is a sample along a line, the size of the circles corresponds to the mip map sampling level" width="920" height="517" /></a><p class="wp-caption-text">Each circle is a sample along a ray; the size of the circles corresponds to the mip map sampling level</p></div>
<p>It&#8217;s not possible to get perfect spheres during sampling, but quadrilinearly interpolating the sampled colors works well enough to approximate a sphere.</p>
<p>Each sample is accumulated based on the sample&#8217;s occlusion value and color as the cone traces outwards. Once the accumulated occlusion is close to 1.0 the cone tracing stops.</p>
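<p>The accumulation loop can be sketched as follows (a hedged CPU model, not my actual shader; <code>sample_volume</code> stands in for the anisotropic texture fetch and returns a premultiplied color and alpha):</p>

```python
from math import log2, tan

def trace_cone(sample_volume, origin, direction, aperture, voxel_size,
               max_dist, step_scale=0.5):
    """Front-to-back accumulation along a cone: the sample diameter grows
    linearly with distance, which selects the mip level, and tracing stops
    once occlusion saturates."""
    color, occlusion = 0.0, 0.0
    t = voxel_size  # start one voxel out to avoid self-sampling
    while t < max_dist and occlusion < 0.999:
        diameter = max(voxel_size, 2.0 * t * tan(aperture * 0.5))
        mip = log2(diameter / voxel_size)
        pos = tuple(o + t * d for o, d in zip(origin, direction))
        c, a = sample_volume(pos, mip)
        color += (1.0 - occlusion) * c
        occlusion += (1.0 - occlusion) * a
        t += diameter * step_scale  # step size grows with the cone
    return color, occlusion
```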
<p>The effect of the growing sample radius is easiest to see when cone tracing the specular reflection cone. As the samples get further from the ray origin, their sample radius increases, so reflections of objects close to a surface appear sharp while objects further from the surface appear blurrier.</p>
<h3>Specular Cone Tracing</h3>
<div id="attachment_519" style="width: 930px" class="wp-caption alignnone"><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-TestApp-2015-04-15-08-56-35-32.png"><img class="size-large wp-image-519" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-TestApp-2015-04-15-08-56-35-32-1024x576.png" alt="Tracing cones with wide apertures along the view vector reflected along surface normals to approximate specular reflections" width="920" height="517" /></a><p class="wp-caption-text">Tracing cones with wide apertures along the view vector reflected along surface normals to approximate specular reflections</p></div>
<p>I worked out a function that maps the roughness value from my physically based shading implementation to a cone aperture, using the GGX importance sampling function from image based lighting. This makes it simple to take a roughness value and get a cone that approximates roughness similarly to the image based lighting.</p>
<div id="attachment_532" style="width: 930px" class="wp-caption alignnone"><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-TestApp-2015-04-16-02-34-40-48.png"><img class="wp-image-532 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-TestApp-2015-04-16-02-34-40-48-1024x576.png" alt="RoughnessVXGI" width="920" height="517" /></a><p class="wp-caption-text">Only the indirect specular component with roughness to vary cone aperture</p></div>
<h3>Diffuse Cone Tracing</h3>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-04-18-20-11-01-63.png"><img class="alignnone wp-image-535 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-04-18-20-11-01-63-1024x576.png" alt="DiffuseVXGI" width="920" height="517" /></a>Using cone tracing to compute indirect diffuse lighting works the same way, but you trace multiple cones in different directions to approximate a hemisphere around the surface&#8217;s normal. I use 6 cones with 60 degree apertures: one points in the direction of the surface normal and 5 others circle around it.</p>
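<p>In tangent space (with Z along the surface normal) the six directions can be generated like this (a sketch; the exact 60 degree tilt of the side cones is an assumption, and implementations vary):</p>

```python
from math import cos, sin, pi

def diffuse_cone_dirs(tilt_deg=60.0):
    """One cone along the normal plus five tilted cones evenly spaced
    around it, all unit length."""
    tilt = tilt_deg * pi / 180.0
    dirs = [(0.0, 0.0, 1.0)]  # straight up the normal
    for i in range(5):
        phi = 2.0 * pi * i / 5.0
        dirs.append((sin(tilt) * cos(phi), sin(tilt) * sin(phi), cos(tilt)))
    return dirs
```

<p>Each direction is then rotated into the surface&#8217;s tangent frame before tracing, and the traced results are weighted and summed into the indirect diffuse term.</p>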
<p>In this stage it&#8217;s also possible to approximate ambient occlusion by using the average occlusion of all of the diffuse cones.</p>
<h3>Final Composition</h3>
<p>Now it&#8217;s pretty simple to add the diffuse and specular components into the direct lighting. I have yet to integrate the rest of the physically based rendering work, though, so more specular surfaces can appear metallic. <a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-04-18-20-10-57-58.png"><img class="alignnone size-large wp-image-538" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-04-18-20-10-57-58-1024x576.png" alt="VXGIFinal" width="920" height="517" /></a></p>
<h3>Emissive Materials</h3>
<p>Voxel cone tracing also makes it pretty simple to add direct lighting from emissive materials. This is done simply by adding the emissive color of a material to the radiance volume.</p>
<p>This is useful because it supports arbitrarily shaped lights with emissive colors that can vary across the surfaces of objects. This makes it possible to approximate the illumination from area lights easily.</p>
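<p>The injection itself is a plain additive write (a minimal sketch, with the radiance texture modeled as a dictionary for brevity):</p>

```python
# Sum a material's emissive color into the radiance volume at its voxel.

def inject_emissive(radiance, voxel, emissive):
    r, g, b = radiance.get(voxel, (0.0, 0.0, 0.0))
    er, eg, eb = emissive
    radiance[voxel] = (r + er, g + eg, b + eb)
```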
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-18-28.png"><img class="alignnone size-large wp-image-540" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-18-28-1024x576.png" alt="EmissiveCubes" width="920" height="517" /></a></p>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-04-63.png"><img class="alignnone wp-image-541 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-04-63-1024x576.png" alt="EmissiveBlocks2" width="920" height="517" /></a></p>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-34-81.png"><img class="alignnone size-large wp-image-543" src="http://leifnode.com/wp-content/uploads/2015/05/Novus-VXGITest-2015-05-03-18-57-34-81-1024x576.png" alt="EmissiveBlocks3" width="920" height="517" /></a></p>
<h3>Performance</h3>
<p>There&#8217;s still a lot that I want to optimize. At the moment everything runs with decent frame rates, but there&#8217;s not much room left for other GPU work. I&#8217;m running these tests on my GTX 980.</p>
<p>When the only thing that needs to be done is cone tracing, each frame takes about 4 ms @ 720p and 10 ms @ 1080p.</p>
<p>The whole injection and mip mapping process takes about 4 ms on top of that, so if the directional light is moving then that gets added to the total.</p>
<p>Re-voxelizing the entire Sponza scene also takes about 4 ms and is necessary at the moment if there are dynamic objects since I&#8217;m not flagging static geometry or anything.</p>
<h3>Drawbacks</h3>
<p>The most significant drawback of my implementation is that it takes a large chunk of GPU memory to store the scene. A volume with a resolution of 256x256x256 takes ~350MB, and that&#8217;s for a comparatively small scene. Because of this it&#8217;s not practical to make the volume any higher resolution than it currently is; a volume with 512 resolution takes ~2.5GB. This makes it hard to scale the technique to larger scenes while maintaining decent quality. Even in a small scene like Sponza, each voxel of a 256x256x256 volume is about 10 cm wide.</p>
<p>Performance is also a major concern. While my implementation is not yet well optimized, the cone tracing step performs better than sparse octree tracing since there&#8217;s a lot less cache thrashing.</p>
<h3>Problems at the Moment</h3>
<p>At the moment I&#8217;m not mip mapping the radiance volume with a Gaussian kernel. I did this initially, but when I switched over to anisotropic voxels I never got around to implementing it again. This causes a lot of banding when sampling specular and diffuse. For now I just continue each cone until its occlusion reaches 0.999, but this ruins the occlusion on diffuse and specular, so you can see color bleed through occluding geometry. Proper filtering would probably also alleviate the apparent flickering in indirect diffuse and specular illumination from dynamic objects.</p>
<p>I converted the interlocked average used during voxelization to HLSL syntax using InterlockedCompareExchange, but when I try to use it to average values across multiple output textures the shader appears to deadlock because of scheduling.</p>
<h3>Other Storage Methods</h3>
<p>I did the most basic implementation by just using a plain 3D texture. There are several other ways that the voxel structure can be stored.</p>
<h5>Sparse Voxel Octrees</h5>
<p>The first alternative is to store voxels in a sparse octree structure. This is what the original implementation uses, and it allows much more effective use of memory since nothing is stored for empty areas of the volume, though there is a performance tradeoff from needing to traverse the tree.</p>
<h5>Sparse Voxel DAGs</h5>
<p>There&#8217;s an extension to the SVO technique called <a href="http://www.cse.chalmers.se/~uffe/HighResolutionSparseVoxelDAGs.pdf">Sparse Voxel Directed Acyclic Graphs</a>. These work by taking the basic sparse voxel octree, merging identical nodes, and redirecting parent pointers to the shared nodes. This can decrease the memory footprint further. However, it seems like it would not work well unless you only need to store occlusion values, as in the paper&#8217;s implementation. If you store more data such as diffuse albedo, identical child nodes become much less likely, to the point where it would probably not be worth the extra time to build the tree.</p>
<h5>Cascaded Voxel Volumes</h5>
<p>Another method is to extend the concept of cascaded shadow maps to VXGI. This uses multiple volumes with identical resolution but different scale, centered around the player, that voxelize the scene at varying levels of coverage and spatial resolution. The Tomorrow Children by Q-Games <a href="http://fumufumu.q-games.com/archives/Cascaded_Voxel_Cone_Tracing_final.pdf">does this</a> to get efficient coverage of large scenes on the PS4. They also stagger the updates of each cascade across frames and prioritize the one closest to the player. It seems like NVIDIA&#8217;s <a href="https://developer.nvidia.com/vxgi">recent implementation</a> of VXGI in Unreal Engine 4 also does this, based on the observation that specular reflections lose quality at greater distances from the player.</p>
<h5>Tiled Volume Textures</h5>
<p>Finally, DirectX 11.3 and 12 bring in <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/dn914605%28v=vs.85%29.aspx">volume tiled resources</a> as a feature that primarily targets the memory issues while maintaining high performance. Tiled resources were first added to DirectX in 11.2, which is included with Windows 8.1, but there was no support for 3D textures. Tiled resources expose some of the virtual addressing capability of graphics cards, letting you use high resolution textures that would otherwise not fit on the GPU by keeping only part of the texture resident at any given time. id Tech 5 used virtual texturing for Rage, which is close to the same thing as tiled resources, but it had to spend a large portion of frame time on texturing since the engine had to manage all of the pages itself.</p>
<p>Having tiled resources as a hardware feature makes the approach both easier to implement and more efficient. It allows the performance of a plain 3D texture volume while letting you mark blocks of the texture as unused to keep some of the sparseness of SVOs and DAGs. This will probably be a good enough compromise for memory, though the mapping of tiles in the volume texture needs to be done by the CPU. Dynamic parts of the scene would need to flag which bricks should become active, write that list back to the CPU, and have the CPU mark the bricks as active or inactive. Since reading back to the CPU takes a few frames, dynamic objects may not get fully voxelized across brick boundaries until the CPU can mark the brick as used.</p>
<h3>Where to Go From Here</h3>
<p>First I am going to implement the correct filtering on the radiance volume so that I can get more correct-looking occlusion.</p>
<p>I also want to implement the sparse octree structure and tracing just to use it for comparing to other implementations on performance and memory usage. I am inclined to try a DAG with this, but I don&#8217;t think it will be that worthwhile for this application so I&#8217;m not sure if I&#8217;ll get around to that.</p>
<p>I really want to implement the sparse 3D texture when I can, but at the moment Microsoft has not released the public SDK for DirectX 11.3 and 12. I&#8217;m waiting for a response to my application to the early access program, but I have not gotten one in the past month, so I&#8217;m not confident about that. Along with this I want to do more to manage voxelization of scene geometry so that I can mark static and dynamic geometry.</p>
<h3>Resources</h3>
<p>-<a href="https://research.nvidia.com/sites/default/files/publications/GIVoxels-pg2011-authors.pdf">Interactive Indirect Illumination Using Voxel Cone Tracing</a></p>
<p>-<a href="http://simonstechblog.blogspot.com/2013/01/implementing-voxel-cone-tracing.html">Implementing Voxel Cone Tracing</a></p>
<p>-<a href="http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseVoxelization.pdf">Octree-Based Sparse Voxelization Using the GPU Hardware Rasterizer</a></p>
<p>-<a href="http://m.gpucomputing.net/sites/default/files/papers/1923/CNLE09.pdf">GigaVoxels: Ray-Guided Streaming for Efficient and Detailed Voxel Rendering</a></p>
<p>-<a href="http://fumufumu.q-games.com/archives/Cascaded_Voxel_Cone_Tracing_final.pdf">Cascaded Voxel Cone Tracing in The Tomorrow Children</a></p>
<p>-<a href="http://www.cse.chalmers.se/~kampe/highResolutionSparseVoxelDAGs.pdf">High Resolution Sparse Voxel DAGs</a></p>
]]></content:encoded>
			<wfw:commentRss>https://leifnode.com/2015/05/voxel-cone-traced-global-illumination/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Tiled Deferred Shading</title>
		<link>https://leifnode.com/2015/05/tiled-deferred-shading/</link>
		<comments>https://leifnode.com/2015/05/tiled-deferred-shading/#comments</comments>
		<pubDate>Sat, 02 May 2015 04:55:28 +0000</pubDate>
		<dc:creator><![CDATA[Leif Erkenbrach]]></dc:creator>
				<category><![CDATA[DirectX]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Shading]]></category>

		<guid isPermaLink="false">http://leifnode.com/?p=463</guid>
		<description><![CDATA[I&#8217;ve been looking at a lot of resources on current rendering algorithms to get nice looking real-time graphics and thought that it&#8217;s time that I actually go and implement some of them. This is the first project that I worked on in a series of three that I used to improve my understanding of some graphics algorithms. I&#8217;m also using …<p> <a class="continue-reading-link" href="https://leifnode.com/2015/05/tiled-deferred-shading/">Continue reading<i class="icon-right-dir"></i></a></p>]]></description>
				<content:encoded><![CDATA[<h3><a href="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferredLights.png"><img class="alignnone wp-image-477 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferredLights-1024x576.png" alt="TiledDeferredLights" width="920" height="517" /></a></h3>
<p>I&#8217;ve been looking at a lot of resources on current rendering algorithms to get nice looking real-time graphics and thought that it&#8217;s time that I actually go and implement some of them. This is the first project that I worked on in a series of three that I used to improve my understanding of some graphics algorithms. I&#8217;m also using physically based shading in some of these screen shots.</p>
<h3>Deferred Shading</h3>
<p>Deferred shading is an alternative method of doing lighting calculations for a 3D scene. The traditional method, forward shading, renders each object to the back buffer and does the lighting calculations for that object at the same time. Forward shading has long been the primary method used by most rasterizer-based renderers. Standard forward rendering has the drawback that it quickly becomes hard to manage when you want more than a few dynamic lights affecting an object. The most common solutions are to either pass an array of lights into each shader and let the shader evaluate all lights in the list for each object, or to render the same object multiple times with additive blending, once for each light that affects it.</p>
<p>Deferred shading does what its name says: it defers the lighting calculations until all objects have been rendered, then shades the whole scene in one pass. This is done by rendering information about each object to a set of render targets that contain data about the object&#8217;s surface; this set of render targets is normally called the G-buffer.</p>
<p>For instance, here are the normals of each object encoded into the 0 to 1 range as one of the render targets:</p>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/DeferredNormalRT.png"><img class="alignnone wp-image-464 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/DeferredNormalRT-1024x576.png" alt="DeferredNormalRT" width="920" height="517" /></a></p>
<p>This is also done with diffuse albedo, specular color and power, depth, and emissive; each gets packed into its own texture.</p>
<p>At the moment I&#8217;m using textures packed with the following formats:</p>
<ul>
<li>Diffuse: RGBA8 texture</li>
<li>Specular Color &amp; Roughness: RGBA8 texture where the alpha is the roughness</li>
<li>World Space Normal: RGBA16 texture where alpha is currently unused</li>
<li>Emissive Color: RGBA16 texture</li>
<li>Depth: R32 texture</li>
</ul>
<p>A while back I made a very basic implementation of deferred shading that rendered a scene to a G buffer and then drew a quad on the screen that evaluated lighting from hundreds of point lights at each fragment. This ran pretty poorly since it was shading by brute force, so I ended up going back to forward shading with a single directional light for many of my projects afterwards.</p>
<p>I looked around some and found a number of culling techniques that can significantly improve deferred rendering performance. A few involve drawing proxy geometry that approximates the bounds of each type of light and evaluating lighting by sampling from the G buffer for each fragment that the geometry touches. This can be implemented with proxy geometry of varying complexity. Some implementations just use billboarded quads sized in world space to approximate the bounds of the area that the light influences; for instance, a point light would have a quad whose width and height match the light&#8217;s radius of influence. Other implementations draw actual 3D proxy geometry like spheres for point lights and cones for spotlights.</p>
<p>These implementations have the issue that they require many additional samples of the G buffer. Each light still needs to sample every texture in the G buffer; in my case 5 textures, so each fragment of the G buffer gets sampled 5 times the number of lights affecting that fragment. Additionally, these techniques incur a lot of overdraw since many of the proxy geometry objects overlap and cannot be culled most of the time.</p>
<h3>Tiled Deferred Shading</h3>
<p>Tiled Deferred Shading allows you to avoid the overdraw and only needs to sample each G buffer texture once so it&#8217;s generally capable of performing much better than using proxy geometry. The main resource I found on tiled deferred shading was this <a href="https://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines">presentation and implementation</a>.</p>
<p>Tiled deferred shading takes a different approach and does all of the culling of lights in the same pass as the shading calculations as opposed to using proxy geometry to do all the shading and executing multiple times per fragment.</p>
<p>Tiled deferred shading splits the viewport frustum into many smaller frustums in a grid of tiles that are extended along z in view space and does the culling on each of those frustums.</p>
<p>This is what a grid of tiles looks like when colored by the number of lights intersecting each frustum:</p>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferredTiles.png"><img class="alignnone wp-image-473 size-large" src="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferredTiles-1024x576.png" alt="TiledDeferredTiles" width="920" height="517" /></a></p>
<p>In my implementation I split the screen into 16&#215;16 tiles using a compute shader with 16x16x1 sized thread groups. Each tile contains a list of indices into a global array of lights. The tiled deferred shader starts by constructing frustums for each tile that are capped by the minimum and maximum depth stored in the 16&#215;16 tile on the depth texture. This is why edges are highlighted in the previous picture: the frustums on those tiles have much larger variation between minimum and maximum depth, so they have the potential to intersect more point lights.</p>
<p>Once the frustums are constructed, each tile checks if a light overlaps the frustum. If a light overlaps the frustum, then the light&#8217;s index in the global light list is appended to the tile&#8217;s index list which is stored in group shared memory.</p>
<p>Finally once the index list for the tile is constructed each thread loops through the index list and accumulates the lighting from all lights in the list.</p>
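<p>The per-tile culling loop amounts to this (a CPU-side model of the compute shader; the real version runs many threads per tile and appends to a groupshared index list):</p>

```python
def sphere_frustum_overlap(center, radius, planes):
    """planes: (normal, d) pairs with normals pointing into the frustum;
    the sphere is rejected if it lies entirely behind any plane."""
    for n, d in planes:
        if sum(nc * cc for nc, cc in zip(n, center)) + d < -radius:
            return False
    return True

def build_tile_light_list(planes, lights):
    """Indices into the global light list for lights touching this tile."""
    return [i for i, (c, r) in enumerate(lights)
            if sphere_frustum_overlap(c, r, planes)]
```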
<p>Here&#8217;s what the Sponza scene looks like with 512 lights (I&#8217;m also using a physically based BRDF but I won&#8217;t talk about that in this post):</p>
<p><a href="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferred512Lights.png"><img class="alignnone size-large wp-image-480" src="http://leifnode.com/wp-content/uploads/2015/05/TiledDeferred512Lights-1024x576.png" alt="TiledDeferred512Lights" width="920" height="517" /></a></p>
<h3>Performance</h3>
<p>Performance really depends on several factors, so it&#8217;s difficult to gauge for all scenarios. My computer has a GTX 980, so all measurements are made on that. With my current settings I can render the Sponza scene at 1080p with 512 lights at 16 ms a frame. 256 lights generally takes 4.5 ms and 1024 takes 30 ms.</p>
<p>In larger scenes the performance would be much better since there will generally be far fewer lights intersecting any given tile.</p>
<h3>Drawbacks</h3>
<p>Tiled deferred shading is not a magic bullet, and has some major problems associated with it.</p>
<p>The first is that it can take a lot of memory, especially with higher resolution outputs. With my current buffer layout I&#8217;m being pretty greedy since I want extra precision on normals and a large range on emissive without encoding an intensity. At 1080p my G buffer takes up 55 MB on the GPU. This is less of a problem on newer GPUs and consoles with 2, 4, and 8GB of GPU memory. However, it still grows with resolution and can eat away at the texture budget. If you&#8217;re rendering at 4K then the G buffer will suddenly be 236MB.</p>
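<p>The arithmetic for those figures follows from the per-pixel byte count of the layout listed earlier (a sketch: two RGBA8 targets, two RGBA16, one R32):</p>

```python
def gbuffer_mb(width, height):
    """G-buffer size in MB: diffuse (4B) + specular/roughness (4B) +
    normal (8B) + emissive (8B) + depth (4B) = 28 bytes per pixel."""
    return width * height * (4 + 4 + 8 + 8 + 4) / (1024 ** 2)

# gbuffer_mb(1920, 1080) comes out around 55 MB, and a 3840x2160 target
# is four times the 1080p figure.
```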
<p>The second and largest issue is that there&#8217;s no way to do transparency unless you store a deep G buffer with data for multiple fragments per pixel which makes the memory situation far worse. Most engines just do a forward pass after the deferred pass to render all of the transparent geometry, but it&#8217;s harder to shade the transparent geometry efficiently.</p>
<p>Another issue is that it&#8217;s difficult to change the shading model on geometry. If you want to shade most of the level with a standard microfacet BRDF, then render some characters with a subsurface scattering effect on their skin, then render their eyes, then render some cars with clearcoat surfaces, it&#8217;s not going to work using just basic deferred shading. In these situations I&#8217;ve seen engines do one of two things: some build the shading models into the tiled deferred shader and dynamically branch between them based on a value from the G buffer; others do a forward rendering pass with the alternate shading model.</p>
<p>The final significant issue is that it&#8217;s difficult to do good quality anti-aliasing with decent performance. Most engines that use deferred shading pipelines use FXAA or some other screen space anti-aliasing solution. The problem with FXAA is that it can tend to blur details that shouldn&#8217;t be blurred. It is possible to do MSAA with deferred shading by rendering to a buffer with storage for multiple samples, and then finding the edges of objects in the G buffer and executing shading for all samples on the edges. However, this has a more significant performance hit with deferred shading than it does with forward shading because of the additional texture sampling.</p>
<h3>Improvements to Tiled Deferred Shading</h3>
<p>It&#8217;s possible to improve the culling of lights in various ways to avoid as many false-positive overlap tests as possible. I have yet to try any of these in my implementation.</p>
<p>The first way is to change the way the intersection tests are performed. The standard algorithm uses simple sphere-frustum intersection tests, but alternative tests are possible and can be more effective. Constructing a bounding box around the frustum and additionally checking against it removes overlaps that the frustum-sphere test wrongly reports; the bounding box test also does better on its own. Iñigo Quilez has a good article on false positives with frustum tests <a href="http://www.iquilezles.org/www/articles/frustumcorrect/frustumcorrect.htm">here</a>.</p>
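<p>Here&#8217;s a CPU-side Python sketch (not my HLSL) of why the combined test helps. A sphere near a frustum corner can sit inside every plane&#8217;s half-space while still missing the volume; the sphere-vs-AABB test catches it. For illustration the "frustum" is just a unit cube so the planes are easy to write down:</p>

```python
# Plane = (nx, ny, nz, d) with normals pointing inward:
# a point p is inside the half-space when dot(n, p) + d >= 0.

def sphere_vs_planes(center, radius, planes):
    # Conservative plane-by-plane test: only rejects if the sphere is
    # fully outside some single plane.
    for nx, ny, nz, d in planes:
        dist = nx * center[0] + ny * center[1] + nz * center[2] + d
        if dist < -radius:
            return False
    return True

def sphere_vs_aabb(center, radius, box_min, box_max):
    # Exact sphere-vs-box: squared distance from the center to the
    # closest point on the box.
    sq = 0.0
    for c, lo, hi in zip(center, box_min, box_max):
        v = max(lo - c, 0.0) + max(c - hi, 0.0)
        sq += v * v
    return sq <= radius * radius

# Axis-aligned unit-cube "frustum" around the origin, for illustration.
planes = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
          (0, -1, 0, 1), (0, 0, 1, 1), (0, 0, -1, 1)]
box_min, box_max = (-1, -1, -1), (1, 1, 1)

# A sphere near the (+,+,+) corner: passes every plane, misses the box.
center, radius = (1.7, 1.7, 1.7), 1.0
print(sphere_vs_planes(center, radius, planes))          # True  (false positive)
print(sphere_vs_aabb(center, radius, box_min, box_max))  # False (correctly culled)
```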
<p>The second way is to partition the frustum more effectively, and there are several ways people have done this. One is to split the tile frustum in half along its depth, at the midpoint between its minimum and maximum depth, then compute the minimum and maximum depth of each half and cull against the two resulting frustums separately. This is what Unreal Engine 4 does, and it resolves depth discontinuities decently. It&#8217;s also possible to take a simpler approach and just split the lights into two lists, one for the near half and one for the far half; the shading stage then chooses the correct list depending on the depth of the fragment being shaded.</p>
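<p>The simpler two-list variant can be sketched on the CPU like this (names are mine, and the lights are reduced to a view-space depth and radius for brevity):</p>

```python
# Split a tile's depth range at the midpoint and bin each overlapping
# light into a near and/or far index list; the shader then walks only
# the list matching the fragment's depth.

def split_lights_by_depth(lights, z_min, z_max):
    """lights: (center_z, radius) pairs already known to overlap the tile."""
    z_mid = 0.5 * (z_min + z_max)
    near_list, far_list = [], []
    for i, (cz, r) in enumerate(lights):
        if cz - r <= z_mid:      # overlaps the near half
            near_list.append(i)
        if cz + r >= z_mid:      # overlaps the far half
            far_list.append(i)
    return near_list, far_list

# A tile with a depth discontinuity: geometry at z~2 and z~99.
lights = [(2.0, 1.0), (50.0, 0.5), (99.0, 1.0)]
near, far = split_lights_by_depth(lights, 1.0, 100.0)
print(near)  # [0, 1] -> a fragment at z=2 never touches the far light
print(far)   # [1, 2]
```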
<p>Determining the minimum and maximum depth values for each tile can also be improved fairly simply. Most implementations of tiled deferred shading just brute force it by having every thread call InterlockedMin and InterlockedMax. It is possible to get a small performance boost by doing a parallel reduction on each tile, though it would require a separate shader.</p>
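<p>The reduction pattern itself looks like this Python sketch, which mimics what the compute shader would do in group shared memory. It assumes a power-of-two sample count, which holds in practice for something like a 16&#215;16 tile:</p>

```python
# Pairwise min/max reduction over a tile's depth samples. On a GPU,
# each iteration of the inner loop runs as parallel threads with a
# barrier between stride halvings; here it's serialized for clarity.

def reduce_min_max(depths):
    lo, hi = list(depths), list(depths)
    stride = len(depths) // 2
    while stride > 0:              # halve the active "lanes" each step
        for i in range(stride):    # these iterations are parallel on a GPU
            lo[i] = min(lo[i], lo[i + stride])
            hi[i] = max(hi[i], hi[i + stride])
        stride //= 2
    return lo[0], hi[0]

print(reduce_min_max([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]))  # (0.1, 0.9)
```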
<h3>Similar Algorithms</h3>
<p>Tiled deferred shading can actually be fairly easily extended to resolve many of the issues with the original implementation; in exchange it requires a pre-pass over the scene to render depth. This technique is normally referred to as Forward+.</p>
<p>Forward+ rendering renders the scene from the viewport&#8217;s perspective to a depth buffer, then runs the same tiled culling algorithm that tiled deferred shading uses on the depth texture and builds a set of index lists in a global buffer. Then all geometry in the scene is rendered a second time and shaded. During the forward shading of the geometry, the shader determines what index list to loop through based on the screen space position of the fragment.</p>
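<p>The per-fragment lookup amounts to indexing a flattened global buffer by tile. This Python sketch uses my own made-up names and a tiny hand-written buffer, but the offset/slice scheme is the shape of what the compute pass would produce:</p>

```python
# Forward+ lookup sketch: offsets[t]..offsets[t+1] delimit tile t's
# slice of the concatenated light index list.

TILE_SIZE = 16

def tile_index(x, y, screen_width):
    tiles_x = (screen_width + TILE_SIZE - 1) // TILE_SIZE
    return (y // TILE_SIZE) * tiles_x + (x // TILE_SIZE)

offsets    = [0, 2, 3, 5]        # 3 tiles, e.g. a 48-pixel-wide strip
index_list = [4, 7, 7, 1, 9]     # light indices, concatenated per tile

def lights_for_fragment(x, y, screen_width):
    t = tile_index(x, y, screen_width)
    return index_list[offsets[t]:offsets[t + 1]]

print(lights_for_fragment(20, 5, 48))  # [7] -> only tile 1's light is shaded
```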
<p>This algorithm generally performs better than tiled deferred shading for fewer than 2048 lights, and sees performance gains across the board when comparing MSAA performance between the two algorithms. It also solves the problems with transparent geometry and varying shading models.</p>
<p>Another more recent extension to tiled deferred shading is an algorithm called Clustered Deferred Shading. Clustered deferred shading does frustum culling in three dimensions instead of two by splitting the frustums along the Z axis with exponential spacing. It can also be extended similarly to work for forward shading.</p>
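<p>The exponential Z slicing comes down to one logarithm per fragment. A sketch, with near/far/slice-count values picked purely for illustration: slice k covers view-space depths from near&#183;(far/near)^(k/N) to near&#183;(far/near)^((k+1)/N):</p>

```python
import math

# Map a view-space depth to its exponential cluster slice.
def depth_slice(view_z, near, far, num_slices):
    k = int(num_slices * math.log(view_z / near) / math.log(far / near))
    return min(max(k, 0), num_slices - 1)  # clamp the boundary cases

NEAR, FAR, SLICES = 1.0, 10000.0, 16       # illustrative values

print(depth_slice(1.0, NEAR, FAR, SLICES))      # 0  (at the near plane)
print(depth_slice(150.0, NEAR, FAR, SLICES))    # 8  (mid-range depth)
print(depth_slice(10000.0, NEAR, FAR, SLICES))  # 15 (clamped at the far plane)
```

Exponential spacing keeps the clusters roughly cube-shaped in view space, so nearby depth ranges get fine slices while distant ones share coarse slices.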
<p>Instead of doing a brute force check of every light against every cell of the frustums, it does hierarchical culling by building an octree out of the cells. It then merges cells with identical index lists and stores them in a page table. This allows for much faster and effective culling. In the <a href="http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf">paper&#8217;s</a> implementation they are able to cull 1 million lights in under 6 ms.</p>
<h3>Where to Go From Here</h3>
<p>I want to improve the culling at some point so that it culls more tightly, probably using a combination of frustum culling and bounding box culling. Next would be implementing Forward+, but I like the abstraction that deferred shading provides, so I need to figure out a good way to author shaders for Forward+ that&#8217;s not crazy complicated.</p>
<p>I also want to try implementing clustered forward shading, since it seems to allow crazy large numbers of lights in <a href="https://youtu.be/6DyTk7917ZI">real</a> <a href="https://youtu.be/q3V-vHltuAY">time</a>. However, the original implementation uses CUDA for the culling, and I&#8217;m not completely sure whether that means it&#8217;s impossible or just inefficient to implement in DirectX compute shaders. Some other implementations have done the culling on the CPU. I need to look more at the implementation to see if it&#8217;s plausible in compute shaders; otherwise I&#8217;ll probably just do the CUDA implementation and interop with DirectX, since I&#8217;ve looked at CUDA before and it does not seem that complex.</p>
<h3>Useful Resources</h3>
<p><em>Andrew Lauritzen </em>- <a href="https://software.intel.com/en-us/articles/deferred-rendering-for-current-and-future-rendering-pipelines">Deferred Rendering for Current and Future Rendering Pipelines</a></p>
<p><em>Gareth Thomas - </em><a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/Advancements-In-Tiled-Rendering.ppsx">Advancements In Tiled Rendering</a></p>
<p><em>Ola Olsson, Markus Billeter, and Ulf Assarsson - <a href="http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf">Clustered Deferred and Forward Shading</a></em></p>
]]></content:encoded>
			<wfw:commentRss>https://leifnode.com/2015/05/tiled-deferred-shading/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
