23 Mar 2012

# Space partitioning and ray traversal with `cukd`

A first working version of my parallelized kd-tree implementation, `cukd`

, is available on github. A good moment to analyze how it performs so far.

## What it does

Given a triangle soup, `cukd`

constructs a kd-tree on NVidia GPUs utilizing the CUDA framework, using empty space cutting, median splitting and surface area heuristics (SAH). Once the tree is constructed, it takes a list of rays — their origins and directions - and traces them through the scene, thereby searching for triangle – ray intersections, returning indices to the foremost intersected triangles and traversal costs.

Here’s a visualisation of the ray traversal cost for the Stanford Dragon with roughly 200k triangles.

Red and green pixels represent triangle hits/no hits, resepectively, and the brightness encodes the traversal cost.

## How well it works

Quite to my surprise, the first version of `cukd`

shows nice performance. So far, the main focus of development was on correctness in implementing the algorithm detailed in Real time KD-Tree Construction on Graphics Hardware by Zhou et al., choosing the optimal interplay of algorithms like parallel reductions and scans, reducing memory allocations on and transfers to the GPU and using structure of arrays wherever possible. Only secondary (though very important) were hardware specific considerations, like memory alignment, using of texture/shared/constant memory and so on, though these changes can be incorporated quite easily in the future. Still, the performance is quite impressive, topping almost 90 million rays/sec for a small scene:

Building the kd-tree for the Stanford dragon consisting of 300k triangles takes around 350ms on my NVidia GTX 760, not really in real time territory. 40 million rays/sec for traversal of this scene however pretty much is. Of course, there is more to raytracing than just ray-triangles intersections, but it’s a nice start. Here the rays are cast line by line in the scene. Switching to Morton order should improve cache coherence and therefore ray traversal time even more.

Looking closer at the graph, it’s obvious that there is a lot of overhead one can potentially get rid of. While the tree creation time rises linearly and rather slowly – going from ~6k to 300k triangles, a factor of 50, only requires 8 times the computation time –, the large ~60ms offset around ~6k triangles indidcates room for improvement.

The distribution of the traversal cost per ray (i.e. the number of steps through the tree plus the number of triangle intersection tests) looks the same for every size of the model:

The peak around ~40 steps for rays intersecting some triangle is exactly where one would expect it, somewhere between the average leaf depth of the tree (18 in this case) and the average number of triangles in each leaf (26 in this case). In the no-hit case, however, there is a large spread to fairly high values. I suppose this has to do with some kind of problem in empty space splitting. Indeed, the above picture looks very much different when we allow more empty space in each node.

Apparently there are huge nodes containing only few triangles at their boundaries, something that shouldn’t really happen.