Channel Description:

Quasi-random, more or less unbiased blog about real-time photorealistic GPU rendering

    Today, the GPU Pro blog posted a very interesting article about a novel technique that seamlessly unifies rasterization and ray tracing based rendering for fully dynamic scenes. The technique, entitled "Object-order Ray Tracing for Fully Dynamic Scenes", will be described in the upcoming GPU Pro 5 book (to be released on March 25, 2014 during the GDC conference) and was developed by Tobias Zirr, Hauke Rehfeld and Carsten Dachsbacher.

    Abstract (taken from
    This article presents a method for tracing incoherent secondary rays that integrates well with existing rasterization-based real-time rendering engines. In particular, it requires only linear scene access and supports fully dynamic scene geometry. All parts of the method that work with scene geometry are implemented in the standard graphics pipeline. Thus, the ability to generate, transform and animate geometry via shaders is fully retained. Our method does not distinguish between static and dynamic geometry. Moreover, shading can share the same material system that is used in a deferred shading rasterizer. Consequently, our method allows for a unified rendering architecture that supports both rasterization and ray tracing. The more expensive ray tracing can easily be restricted to complex phenomena that require it, such as reflections and refractions on arbitrarily shaped scene geometry. Steps in rendering that do not require the tracing of incoherent rays with arbitrary origins can be dealt with using rasterization as usual.

    To my knowledge, this is the first practical implementation of the so-called hybrid rendering technique, which mixes ray tracing and rasterization by plugging a ray tracer into an existing rasterization based rendering framework and sharing the traditional graphics pipeline. Since no game developer in his right mind will switch to pure ray tracing overnight, this seems to be the most sensible and commercially viable approach to introduce real ray traced high quality reflections of dynamic objects into game engines in the short term, without having to resort to complicated hacks like screen space raytracing for reflections (as seen in e.g. Killzone Shadow Fall, UE4 tech demos and CryEngine) or cubemap arrays, which never really look right and come with a lot of limitations and artifacts. For example, in this screenshot of the new technique you can see the reflection of the sky, which would simply be impossible with screen space reflections from this camera angle:

    Probably the best thing about this technique is that it works with fully dynamic geometry (accelerating ray intersections by coarsely voxelizing the scene) and - judging from the abstract - with dynamically tessellated geometry as well, which is a huge advantage for DX11 based game engines. It's very likely that the PS4 is capable of real-time raytraced reflections using this technique and, when optimized, it could be used not only for rendering reflections and refractions, but for very high quality soft shadows and ambient occlusion as well.

    The ultimate next step would be global illumination with path tracing for dynamic scenes, which is a definite possibility on very high end hardware, especially when combined with another technique from a very freshly released paper (by Ulbrich, Novak, Rehfeld and Dachsbacher) entitled Progressive Visibility Caching for Fast Indirect Illumination which promises a 5x speedup for real-time progressively path traced GI by cleverly caching diffuse and glossy interreflections  (a video can be found here). Incredibly exciting if true!

  • 10/19/14--17:10: Scratch-a-pixel and more
    Having left Otoy some time ago and after enjoying a sweet as holiday, it's time for things new and exciting. Lots of interesting rendering related stuff happened in the past months; below are some of the most fascinating developments in my opinion:

    - starting off, there's an excellent online tutorial series on computer graphics (mostly ray tracing) for both beginners and experts called Scratch-a-Pixel. The authors are veterans from the VFX, animation and game industry and have years of experience writing production rendering code like Renderman. The tutorials deal with all the features that are expected from a production renderer and contain a lot of background and insights into the science of light, plus tips and tricks on how to write performant and well optimized ray tracing code. Rendering concepts like CIE xyY colorspace and esoteric mathematical subjects like discrete Fourier transforms, harmonics and integration of orthonormal polynomials are explained in an easy-to-digest manner. Most tutorials also come with C++ source code. At the moment some sections are missing or incomplete, but the author told me there's a revamp of the website coming very soon...

    - hybrid rendering (rasterization mixed with ray tracing) for games has finally arrived with the recent release of Unreal Engine 4.5 which supports ray traced soft shadows and ambient occlusion via signed distance fields (which can be faster to compute than traditional shadow mapping, but works only for static geometry): 

    A nice video of the technique in action:
    Like voxels or triangles, distance fields are another way to represent scene geometry. Just like voxels, distance fields approximate the scene geometry and are more efficient to trace than triangles to create low frequency effects like soft shadows, ambient occlusion and global illumination that don't require 100% geometric accuracy (and because they have inherent multiresolution characteristics by approximating the scene geometry). Inigo Quilez wrote a few interesting articles on rendering with distance fields (in 2008):

    Free penumbra shadows for raymarching distance fields

    More on distance fields:
    Distance fields in Unreal Engine
    Alex Evans from Media Molecule invented a neat trick to approximate AO and GI with distance fields in "Fast Approximations for Global Illumination for Dynamic Scenes"

    There's also  a very recent paper about speeding up sphere tracing for rendering of signed distance fields or path tracing: Enhanced sphere tracing 
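
To give a feel for why distance fields are cheap to trace, here's a minimal C++ sketch of sphere tracing with a penumbra estimate in the spirit of the "free penumbra shadows" trick mentioned above. Everything specific in it is an assumption made for the example: the scene is a single unit sphere at the origin, and the names (`sceneDistance`, `sphereTrace`, `softShadow`), the step limit and the sharpness factor `k` are made up here. It is not code from Unreal Engine or from the linked articles.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float length(Vec3 a) { return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z); }

// Hypothetical scene: signed distance to a unit sphere centred at the origin.
static float sceneDistance(Vec3 p) { return length(p) - 1.0f; }

// Sphere tracing: advance along the ray by the distance to the nearest surface.
// 'dir' is expected to be normalised. Returns the hit distance, or -1 on a miss.
float sphereTrace(Vec3 orig, Vec3 dir, float maxDist = 100.0f) {
    float t = 0.0f;
    for (int i = 0; i < 128 && t < maxDist; ++i) {
        float d = sceneDistance(add(orig, mul(dir, t)));
        if (d < 1e-4f) return t; // close enough to the surface: count it as a hit
        t += d;                  // safe step: no surface is nearer than d
    }
    return -1.0f;
}

// Soft shadow estimate towards a light: 1 = fully lit, 0 = fully occluded.
// The closer the shadow ray passes to an occluder (small d relative to t),
// the darker the result; k controls how sharp the penumbra is.
float softShadow(Vec3 orig, Vec3 lightDir, float k = 8.0f, float maxDist = 100.0f) {
    float res = 1.0f;
    float t = 0.02f; // start slightly away from the shaded point
    for (int i = 0; i < 128 && t < maxDist; ++i) {
        float d = sceneDistance(add(orig, mul(lightDir, t)));
        if (d < 1e-4f) return 0.0f; // the shadow ray hit an occluder
        res = std::min(res, k * d / t);
        t += d;
    }
    return res;
}
```

The shadow comes almost for free because the distance values needed for the penumbra are already computed while marching the shadow ray.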

    - one of the most interesting Siggraph 2014 surprises must be the announcement from Weta (the New Zealand based visual effects studio that created the CG effects for blockbusters like the Lord of the Rings, King Kong, Avatar, Tintin and the Hobbit movies) that they are developing their own production path tracer called Manuka (the Maori name for New Zealand's healing tea tree) in conjunction with Gazebo, a physically plausible realtime GPU renderer. While Manuka has been used to render just a couple of shots in "The Hobbit: the Desolation of Smaug", it will be the main renderer for the next Hobbit film. More details are provided in this extensive fxguide article:

    - another surprise was the unveiling by Solid Angle (creators of Arnold) of an OpenCL accelerated renderer prototype running on the GPU. There's not much info to be found apart from a comment by Solid Angle's Mike Farnsworth ("This is a prototype written by another Solid Angle employee (not Brecht), and it is not Arnold core itself. It's pretty obvious we're experimenting, though. We've been keeping a close eye on GPUs and have active communication with both AMD and Nvidia (and also obviously Intel). I wouldn't speculate on what the prototype is, or what Brecht is up to, because you're almost certainly going to be wrong.")

    - Alex St John, ex-Microsoft and one of the creators of the DirectX API, has recently moved to New Zealand and aims to create the next standard API for real-time graphics rendering using CUDA GPGPU technology. More details on his blog. His post on his visit to Weta contains some great insights into the CUDA accelerated CG effects created for The Desolation of Smaug.

    - Magic Leap, an augmented reality company founded by a biomedical engineer, recently got an enormous investment from Google and is working with a team at Weta in New Zealand to create imaginative experiences. Info available on the net suggests they are developing a wearable device that directly projects 3d images onto the viewer's retina that seamlessly integrate with the real-life scene via projecting multiple images with a depth offset. Combined with Google Glass it could create games that are grounded in the real world like this: (augmented reality objects are rendered with Octane Render).

    - the Lab for Animate Technologies at the University of Auckland in New Zealand is doing cutting edge research into the first real-time autonomously animated AI avatar: 
    The facial animation is driven in real-time by artificial intelligence using concepts from computational neuroscience and is based on a physiological simulation of the human brain which is incredibly deep and complex (I was lucky to get a behind the scenes look): it includes the information exchange pathways between the retina, the thalamic nuclei and the visual cortex including all the feedback loops and also mimics low level single neuron phenomena such as the release of neurotransmitters and hormones like dopamine, epinephrine and cortisol. All of these neurobiological processes together drive the avatar's thoughts, reactions and facial animation through a very detailed facial muscle system, which is probably the best in the industry (Mark Sagar, the person behind this project, was one of the original creators of the USC Lightstage and pioneered facial capturing and rendering for Weta in King Kong and Avatar). More info on and One of the most impressive things I've ever seen and it's something that is actually happening now. 


    With last month's unveiling of Microsoft's augmented reality glasses project dubbed "HoloLens", today's announcement of Sony's plans to release a similar AR device called the "SmartEyeGlass", and with more details surfacing on Magic Leap's retina projecting fiber optic AR glasses (cleverly reconstructed from publicly available patents by a Gizmodo journalist), the hype around augmented reality seems to be reaching a peak at the moment. Unfortunately, most of the use cases for these technologies that have been demonstrated so far, for example husbands assisting their wives with screwing a new syphon on a sink, projecting the weather forecast for Maui on the kitchen wall or casually investigating a suspicious rock on the surface of Mars, look either gimmicky, far-fetched or both. 

    The area where I see a real and immediate use for these high tech AR devices is in the operating room. In my previous life as a medical student, I've spent quite some time in the operating theatre watching surgeons frantically checking if they were cutting the right part of the brain by placing a sharp needle-like pointer (with motion capture dots) on or inside the brain of the patient. The position of the pointer was picked up by 3 infrared cameras and a monitor showed the position of the needle tip in real-time on three 2D views (front, top and side) of the brain reconstructed from CT or MRI scans. This 3D navigation technique is called stereotactic neurosurgery and is an invaluable tool to guide neurosurgical interventions.    

    Instruments for stereotactic surgery (from here)

    While I was amazed at the accuracy and usefulness of this high tech procedure, I was also imagining ways to improve it, because every time the surgeon checks the position of the pointer on the monitor, he or she loses visual contact with the operating field and "blindly" guiding instruments inside the body is not recommended. A real-time three-dimensional augmented reality overlay that can be viewed from any angle, showing the relative position of the organs of interest (which might be partially or fully covered by other organs and tissues like skin, muscle, fat or bone) would be tremendously helpful provided that the AR display device minimally interferes with the surgical intervention and the augmented 3D images are of such a quality that they seamlessly blend with the real world. The recently announced wearable AR glasses by MS, Sony and Magic Leap seem to take care of the former, but for the latter there is no readily available solution yet. This is where I think real-time ray tracing will play a major role: ray tracing is the highest quality method to visualise medical volumetric data reconstructed from CT and MRI scans. It's actually possible to extend a volume ray caster with physically accurate lighting (soft shadows, ambient occlusion and indirect lighting) to add visual depth cues and have it running in real-time on a machine with multiple high end GPUs. The results are frighteningly realistic. I for one can't wait to test it with one of these magical glasses.
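
To make the volume ray casting idea a bit more concrete, here's a minimal C++ sketch of front-to-back compositing along a single ray through a scalar volume such as a CT scan. The transfer function, its 0.3 density threshold and the 0.99 early-termination cutoff are made-up placeholders for illustration; a real medical renderer would fetch the samples from a 3D texture and add the lighting effects described above.

```cpp
#include <algorithm>
#include <vector>

struct RGBA { float r, g, b, a; };

// Hypothetical transfer function: maps a density in [0,1] to colour and opacity.
// Low densities (air, soft tissue) stay transparent, high densities (bone) turn opaque.
RGBA transferFunction(float density) {
    float alpha = std::min(std::max((density - 0.3f) / 0.7f, 0.0f), 1.0f);
    return {density, density, density, alpha};
}

// Composite the density samples taken along one ray, ordered front to back.
RGBA compositeRay(const std::vector<float>& samples) {
    RGBA out = {0.0f, 0.0f, 0.0f, 0.0f};
    for (float density : samples) {
        RGBA s = transferFunction(density);
        float w = (1.0f - out.a) * s.a; // share not yet hidden by earlier samples
        out.r += w * s.r;
        out.g += w * s.g;
        out.b += w * s.b;
        out.a += w;
        if (out.a > 0.99f) break;       // early ray termination: nothing behind is visible
    }
    return out;
}
```

Marching front to back is what makes early termination possible: once the accumulated opacity is close to 1, the remaining samples along the ray can be skipped.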

    As an update to my previous post, the people behind Scratch-a-Pixel have launched a v2.0 website, featuring improved and better organised content (still work in progress, but the old website can still be accessed). It's by far the best resource to learn ray tracing programming for both novices (non engineers) and experts. Once you've conquered all the content on Scratch-a-Pixel, I recommend taking a look at the following ray tracing tutorials that come with source code:

    - smallpt from Kevin Beason: an impressively tiny path tracer in 100 lines of C++ code. Make sure to read David Cline's slides which explain the background details of this marvel. 

    - Rayito, by Mike Farnsworth from Renderspud (currently at Solid Angle working on Arnold Render): a very neatly coded ray/path tracer in C++, featuring path tracing, stratified sampling, lens aperture (depth of field), a simple BVH (with median split), Qt GUI, triangle meshes with obj parser, diffuse/glossy materials, motion blur and a transformation system. Not superfast because it favours code clarity, but a great way to get familiar with the architecture of a ray tracer.

    -  Renderer 2.x: a CUDA and C++ ray tracer, featuring a SAH BVH (built with the surface area heuristic for better performance), triangle meshes, a simple GUI and ambient occlusion

    - Peter and Karl's GPU path tracer: a simple, but very fast open source GPU path tracer which supports sphere primitives, raytraced depth of field and subsurface scattering (SSS)
    If you're still not satisfied after that and want a deeper understanding, consider the following books:
    - "Realistic ray tracing" by Peter Shirley, 
    - "Ray tracing from the ground up" by Kevin Suffern, 
    - "Principles of Digital Image Synthesis" by Andrew Glassner, a fantastic and huge resource, freely available here, which also covers signal processing techniques like Fourier transforms and wavelets (if your calculus is a bit rusty, check out Khan academy, a great open online platform for engineering level mathematics)
    - "Advanced global illumination" by Philip Dutré, Kavita Bala and Philippe Bekaert, another superb resource, covering finite element radiosity and Monte Carlo rendering techniques (path tracing, bidirectional path tracing, Metropolis light transport, importance sampling, ...)


    A very interesting paper called "Gradient domain path tracing" was just published by Nvidia researchers (coming from the same incredibly talented Helsinki university research group as Timo Aila, Samuli Laine and Tero Karras who developed highly optimized open source CUDA ray tracing kernels for Tesla, Fermi and Kepler GPUs), describing a new technique derived from the ideas in the paper Gradient domain Metropolis Light Transport, which drastically reduces noise without blurring details. 
    We introduce gradient-domain rendering for Monte Carlo image synthesis. While previous gradient-domain Metropolis Light Transport sought to distribute more samples in areas of high gradients, we show, in contrast, that estimating image gradients is also possible using standard (non-Metropolis) Monte Carlo algorithms, and furthermore, that even without changing the sample distribution, this often leads to significant error reduction. This broadens the applicability of gradient rendering considerably. To gain insight into the conditions under which gradient-domain sampling is beneficial, we present a frequency analysis that compares Monte Carlo sampling of gradients followed by Poisson reconstruction to traditional Monte Carlo sampling. Finally, we describe Gradient-Domain Path Tracing (G-PT), a relatively simple modification of the standard path tracing algorithm that can yield far superior results. 
    This picture shows a noise comparison between gradient domain path tracing (GPT) and regular path tracing (PT). Computing a sample with the new technique is about 2.5x slower, but path tracing noise seems to clear up much faster, far outweighing the computational overhead: 

    More images and details of the technique can be found in
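
A back-of-the-envelope way to read the 2.5x figure, assuming both estimators are unbiased and that their per-pixel variance falls off as 1/N with the number of samples N (the usual Monte Carlo behaviour). The symbols are introduced here for the example and are not taken from the paper, and the reconstruction step of G-PT is ignored:

```latex
% Equal-time comparison: in the time PT takes N samples, G-PT takes N / 2.5.
\[
\mathrm{Var}_{\mathrm{PT}}(N) = \frac{\sigma_{\mathrm{PT}}^{2}}{N},
\qquad
\mathrm{Var}_{\mathrm{GPT}}\!\left(\frac{N}{2.5}\right) = \frac{2.5\,\sigma_{\mathrm{GPT}}^{2}}{N}
\]
% G-PT wins at equal render time when
\[
2.5\,\sigma_{\mathrm{GPT}}^{2} < \sigma_{\mathrm{PT}}^{2}
\quad\Longleftrightarrow\quad
\frac{\sigma_{\mathrm{PT}}^{2}}{\sigma_{\mathrm{GPT}}^{2}} > 2.5
\]
```

In other words, under these assumptions the per-sample variance has to drop by more than 2.5x (the RMS noise by more than roughly 1.58x) before the extra cost per sample pays off.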

    Related to the previous post about using real-time ray tracing for augmented reality, a brand new Nvidia paper titled "Filtering Environment Illumination for Interactive Physically-Based Rendering in Mixed Reality" demonstrates the feasibility of real-time Monte Carlo path tracing for augmented or mixed reality: 
    Physically correct rendering of environment illumination has been a long-standing challenge in interactive graphics, since Monte-Carlo ray-tracing requires thousands of rays per pixel. We propose accurate filtering of a noisy Monte-Carlo image using Fourier analysis. Our novel analysis extends previous works by showing that the shape of illumination spectra is not always a line or wedge, as in previous approximations, but rather an ellipsoid. Our primary contribution is an axis-aligned filtering scheme that preserves the frequency content of the illumination. We also propose a novel application of our technique to mixed reality scenes, in which virtual objects are inserted into a real video stream so as to become indistinguishable from the real objects. The virtual objects must be shaded with the real lighting conditions, and the mutual illumination between real and virtual objects must also be determined. For this, we demonstrate a novel two-mode path tracing approach that allows ray-tracing a scene with image-based real geometry and mesh-based virtual geometry. Finally, we are able to de-noise a sparsely sampled image and render physically correct mixed reality scenes at over 5 fps on the GPU.

    While Nvidia is certainly at the forefront of GPU path tracing research (with CUDA), AMD has recently begun venturing into GPU rendering as well and has previewed its own OpenCL based path tracer at the Siggraph 2014 conference. The path tracer is developed by Takahiro Harada, who is a bit of an OpenCL rendering genius. He recently published an article in GPU Pro 6 about rendering on-the-fly vector displacement mapping with OpenCL based GPU path tracing. Vector displacement mapping differs from regular displacement mapping in that it allows the extrusion of overlapping geometry (e.g. a mushroom), which is not possible with the heightfield-like displacement provided by traditional displacement (the Renderman vector displacement documentation explains this nicely with pictures).
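
The difference between the two kinds of displacement fits in a few lines of C++. This is only a sketch: the function names are made up, and the convention that the displacement vector is stored in the local tangent frame (x along the tangent, y along the bitangent, z along the normal) is an assumption for the example, not something taken from the GPU Pro article.

```cpp
struct Vec3 { float x, y, z; };

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 mul(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Regular (heightfield-like) displacement: a surface point can only move
// along its normal, so the result can never fold over itself.
Vec3 displaceScalar(Vec3 p, Vec3 normal, float height) {
    return add(p, mul(normal, height));
}

// Vector displacement: the offset has a component along each axis of the
// local frame, so points can also move sideways and form overhangs
// (the cap of the mushroom).
Vec3 displaceVector(Vec3 p, Vec3 tangent, Vec3 bitangent, Vec3 normal, Vec3 d) {
    Vec3 offset = add(add(mul(tangent, d.x), mul(bitangent, d.y)), mul(normal, d.z));
    return add(p, offset);
}
```

With a flat ground plane (normal pointing up), `displaceScalar` can only raise or lower a point, while `displaceVector` can also push it sideways past its neighbours.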

    Slides from

    This video shows off the new technique, rendering in near-realtime on the GPU:

    There's more info on Takahiro's personal page, along with some really interesting slideshow presentations about OpenCL based ray tracing. This guy also developed a new technique called "Foveated real-time ray tracing for virtual reality devices" (paper), progressively focusing more samples on the parts in the image where the eyes are looking (determined by eye/pupil tracking), "reducing the number of pixels to shade by 1/20, achieving 75 fps while preserving the same visual quality". Foveated rendering takes advantage of the fact that the human retina is most sensitive in its center (the "fovea", which contains densely packed colour sensitive cones) where objects' contours and colours are sharply observed, while the periphery of the retina consists mostly of sparsely distributed, colour insensitive rods, which cause objects in the periphery of the visual field to be represented by the brain as blurry blobs (although we do not consciously perceive it like that, thinking that our entire visual field is sharply defined and has colour).
    This graph shows that the resolution of the retina is highest at the fovea and drops off quickly with increasing distance from the center. This is due to the fact that the fovea contains only cones which each send individual inputs over the optic fibre (maximizing resolution), while the inputs from several rods in the periphery of the retina are merged by the retinal nerve cells before reaching the optic nerve (image from

    Foveated rendering has the potential to make high quality real-time raytraced imagery feasible on VR headsets that support eye tracking like the recently Kickstarted FOVE VR headset. Using ray tracing for foveated rendering is also much more efficient than using rasterisation: ray tracing allows for sparse loading and sampling of the scene geometry in the periphery of the visual field, while rasterisation needs to load and project all geometry in the viewplane, whether it's sampled or not.
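
As a rough illustration of how a ray tracer could spend its sample budget, here's a small C++ sketch that picks a per-pixel sample count from the angular distance to the gaze point. The falloff curve, the 5 degree foveal radius, the 16 sample maximum and the function names are all assumptions made for the example; the paper's actual sampling scheme may look quite different.

```cpp
#include <algorithm>
#include <cmath>

// Angular distance (in degrees) between a pixel and the gaze point, using a
// small-angle approximation with a constant pixels-per-degree factor.
float eccentricityDeg(float px, float py, float gazeX, float gazeY, float pixelsPerDegree) {
    return std::hypot(px - gazeX, py - gazeY) / pixelsPerDegree;
}

// Hypothetical falloff: full sample count inside the foveal region, then
// inversely proportional to eccentricity, but never fewer than one sample.
int samplesPerPixel(float eccDeg, int maxSamples = 16, float fovealRadiusDeg = 5.0f) {
    if (eccDeg <= fovealRadiusDeg) return maxSamples;
    float scaled = maxSamples * fovealRadiusDeg / eccDeg;
    return std::max(1, static_cast<int>(scaled));
}
```

A renderer would call `eccentricityDeg` with the gaze position reported by the eye tracker every frame and trace `samplesPerPixel` rays for that pixel, which is exactly the kind of sparse sampling that is awkward to do with rasterisation.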

    Slides from

    This video shows a working prototype of the FOVE VR headset with a tracking beam to control which parts of the scene are in focus, so this type of real-time ray traced (or path traced) foveated rendering should be possible right now (which is pretty exciting):

    It's good to finally see AMD stepping up its OpenCL game with its own GPU path tracer. Another example of this greater engagement is that AMD recently released a large patch to fix the OpenCL performance of Blender's Cycles renderer on AMD cards. Hopefully it will put some pressure on Nvidia and make GPU rendering as exciting as in 2010 with the release of the Fermi GPU, a GPGPU computing monster which effectively doubled the CUDA ray tracing performance compared to the previous generation. 

    Rendering stuff aside, today is a very important day: for the first time in their 115 year long existence, the Buffalo's from AA Gent, my hometown's football team, have won the title in the Belgian Premier League, giving them a direct ticket to the Champions League. This calls for a proper celebration!


    Pretty big news for GPU rendering: about 6 years after Nvidia released the source code of their high performance GPU ray tracing kernels and 4 years after Intel released Embree (high performance CPU ray tracing kernels), last week at Siggraph AMD finally released their own GPU rendering framework in the form of FireRays, an OpenCL based ray tracing SDK, first shown in prototype form at Siggraph 2014 by Takahiro Harada (who also conducted research into foveated ray tracing for VR):

    The FireRays SDK can be downloaded from the AMD Developer site:

    More details  can be found at The acceleration structure is a BVH with spatial splits and the option to build the BVH with or without the surface area heuristic (SAH). For instances and motion blur, a two level BVH is used, which enables very efficient object transformations (translation, rotation, scaling) at virtually no cost. 
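
The reason a two-level BVH makes object transformations so cheap is that the geometry never moves: the ray is transformed into each instance's object space instead. Here's a minimal C++ sketch of that idea. It is not the FireRays API; the `Instance` struct, the function names and the restriction to translation plus uniform scale (no rotation) are assumptions made to keep the example short, and the "bottom level" is just a unit sphere instead of a real BVH.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 orig, dir; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 scale(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical instance: shared geometry placed in the world with a uniform
// scale and a translation.
struct Instance { Vec3 translation; float uniformScale; };

// Bottom level stand-in: intersect a ray with the unit sphere in object space.
// The direction need not be normalised. Returns the nearest t > 0, or -1 on a miss.
float intersectUnitSphere(const Ray& r) {
    float a = dot(r.dir, r.dir);
    float b = dot(r.orig, r.dir);          // half of the usual 'b' coefficient
    float c = dot(r.orig, r.orig) - 1.0f;
    float disc = b * b - a * c;
    if (disc < 0.0f) return -1.0f;
    float s = std::sqrt(disc);
    float t = (-b - s) / a;
    if (t > 0.0f) return t;
    t = (-b + s) / a;
    return t > 0.0f ? t : -1.0f;
}

// Top level: apply the inverse instance transform to the ray. The direction is
// scaled rather than re-normalised, so the returned t is still valid along the
// original world-space ray.
float intersectInstance(const Ray& worldRay, const Instance& inst) {
    float inv = 1.0f / inst.uniformScale;
    Ray local = { scale(sub(worldRay.orig, inst.translation), inv),
                  scale(worldRay.dir, inv) };
    return intersectUnitSphere(local);
}
```

Moving, scaling or animating an instance then only changes a handful of numbers in the top level; the bottom-level acceleration structure of the shared geometry stays untouched.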

    AMD's own graphs show that their OpenCL renderer is roughly 10x faster running on 2 D700 FirePro GPUs than Embree running on the CPU:

    There are already a few OpenCL based path tracers available today such as Blender's Cycles engine and LuxRays (even V-Ray RT GPU was OpenCL based at some point), but none of them have been able to challenge their CUDA based GPU rendering brethren. AMD's OpenCL dev tools have historically been lagging behind Nvidia's CUDA SDK tools, which made compiling large and complex OpenCL kernels a nightmare (splitting the megakernel in smaller parts was the only option). Hopefully the OpenCL developer tools have gotten a makeover as well with the release of this SDK, but at least I'm happy to see AMD taking GPU ray tracing seriously. This move could truly bring superfast GPU rendering to the masses and with the two big GPU vendors in the ray tracing race, there will hopefully be more ray tracing specific hardware improvements in future GPU architectures.

    (thanks heaps to CPFUUU for pointing me to this)

    UPDATE: Alex Evans from Media Molecule had a great talk at Siggraph 2015 about his research into raymarching signed distance fields for Dreams. Alex Evans is currently probably the biggest innovator in real-time game rendering since John Carmack (especially since Carmack spends all his time on VR now, which is a real shame). Alex's presentation can be downloaded from and is well worth reading. It sums up a bunch of approaches to rendering voxels, signed distance fields and global illumination in real-time that ultimately were not as successful as hoped, but they came very close to real-time on the PS4 (and research is still ongoing).

    For people interested in the real-world physics of light bouncing, there was also this very impressive video from Karoly Zsolnai about ultra high speed femto-photography cameras able to shoot images at the speed of light, demonstrating how light propagates and is transported as an electromagnetic wave through a scene, illuminating objects a fraction of a nanosecond before their mirror image becomes visible:


    In early 2011 I developed a simple real-time path traced Pong game together with Kerrash on top of an open source GPU path tracer called tokaspt (developed by Thierry Berger-Perrin) which could only render spheres, but was bloody fast at it. The physics were bodged, but the game proved that path tracing of very simple scenes at 30 fps was feasible, although a bit noisy. You can still download it from Since that time I've always wanted to write a short and simple tutorial about GPU path tracing to show how to make your GPU draw an image with high quality ray traced colour bleeding with a minimum of code and now is a good time to do exactly that.

    This tutorial is not meant as an introduction to ray tracing or path tracing as there are plenty of excellent ray tracing tutorials for beginners online such as Scratch-a-Pixel (also check out the old version which contains more articles) and Minilight (more links at the bottom of this article). The goal of this tutorial is simply to show how incredibly easy it is to turn a simple CPU path tracer into a CUDA accelerated version. Being a fan of the KISS principle from design and engineering (Keep It Simple Stupid) and aiming to avoid unnecessary complexity, I've chosen to cudafy Kevin Beason's smallpt, the most basic but still fully functional CPU path tracer around. It's a very short piece of code that doesn't require the user to install any tedious libraries to compile the code (apart from Nvidia's CUDA Toolkit).

    The full CPU version of smallpt can be found at Due to its compactness the code is not very easy to read, but fortunately David Cline made a great Powerpoint presentation explaining what each line in smallpt is doing with references to Peter Shirley's "Realistic Ray Tracing" book. 

    To keep things simple and free of needless clutter, I've stripped out the code for the tent filter, supersampling, Russian Roulette and the material BRDFs for reflective and refractive materials, leaving only the diffuse BRDF. The 3D vector class from smallpt is replaced by CUDA's own built-in float3 type (built-in CUDA types are more efficient due to automatic memory alignment). Together with the operators and helper functions from the cutil_math.h header included in the code, float3 offers the same linear algebra math functions as a vector class, such as addition, subtraction, multiplication, normalize, length, dot product and cross product. For reasons of code clarity, there is no error checking when initialising CUDA. To compile the code, save the code in a file with ".cu" file extension and follow these CUDA installation guides to install Nvidia's GPU Computing Toolkit and configure the programming tools to work with CUDA.
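
If you want to experiment with the math on the CPU before touching CUDA, the handful of vector operations the path tracer relies on is easy to mimic in plain C++. The sketch below is a stand-in written for this purpose (the name `Float3` is made up to avoid clashing with the real CUDA type); it is not the contents of cutil_math.h, just the same kind of operations.

```cpp
#include <cmath>

// CPU-side stand-in for CUDA's float3 (hypothetical name, for experimenting only).
struct Float3 { float x, y, z; };

inline Float3 operator+(Float3 a, Float3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Float3 operator-(Float3 a, Float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
// Component-wise product, handy for multiplying colours together.
inline Float3 operator*(Float3 a, Float3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
inline Float3 operator*(Float3 a, float s)  { return {a.x * s, a.y * s, a.z * s}; }

inline float dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Float3 cross(Float3 a, Float3 b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

inline float length(Float3 v) { return std::sqrt(dot(v, v)); }

// Note: no guard against zero-length vectors, to keep the sketch short.
inline Float3 normalize(Float3 v) { return v * (1.0f / length(v)); }
```

With these in place, functions like the ray/sphere intersection can be tried out and unit tested on the host first.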

    After reading the slides from David Cline, the commented code below should speak for itself, but feel free to drop me a comment below if some things are still not clear.

    So without further ado, here's the full CUDA code:

    // smallptCUDA by Sam Lapere, 2015
    // based on smallpt, a path tracer by Kevin Beason, 2008

    #include <iostream>
    #include <cuda_runtime.h>
    #include <vector_types.h>
    #include "device_launch_parameters.h"
    #include <cutil_math.h> // from

    #define M_PI 3.14159265359f // pi
    #define width 512 // screenwidth
    #define height 384 // screenheight
    #define samps 1024 // samples

    // __device__ : executed on the device (GPU) and callable only from the device

    struct Ray {
        float3 orig; // ray origin
        float3 dir;  // ray direction
        __device__ Ray(float3 o_, float3 d_) : orig(o_), dir(d_) {}
    };

    enum Refl_t { DIFF, SPEC, REFR }; // material types, used in radiance(), only DIFF used here

    struct Sphere {

        float rad; // radius
        float3 pos, emi, col; // position, emission, colour
        Refl_t refl; // reflection type (e.g. diffuse)

        __device__ float intersect_sphere(const Ray &r) const {

            // ray/sphere intersection
            // returns distance t to intersection point, 0 if no hit
            // ray equation: p(x,y,z) = ray.orig + t*ray.dir
            // general sphere equation: x^2 + y^2 + z^2 = rad^2
            // classic quadratic equation of form ax^2 + bx + c = 0
            // solution x = (-b +- sqrt(b*b - 4ac)) / 2a
            // solve t^2*ray.dir*ray.dir + 2*t*(orig-p)*ray.dir + (orig-p)*(orig-p) - rad*rad = 0
            // (ray.dir is assumed to be normalised, so the t^2 coefficient is 1)
            // more details in "Realistic Ray Tracing" book by P. Shirley or

            float3 op = pos - r.orig; // vector from ray.orig to sphere centre
            float t, epsilon = 0.0001f; // epsilon required to prevent floating point precision artefacts
            float b = dot(op, r.dir); // b in quadratic equation
            float disc = b*b - dot(op, op) + rad*rad; // discriminant quadratic equation
            if (disc<0) return 0; // if disc < 0, no real solution (we're not interested in complex roots)
            else disc = sqrtf(disc); // if disc >= 0, check both roots (minus and plus the square root of the discriminant)
            return (t = b - disc)>epsilon ? t : ((t = b + disc)>epsilon ? t : 0); // pick closest point in front of ray origin
        }
    };

    // SCENE
    // 9 spheres forming a Cornell box
    // small enough to be in constant GPU memory
    // { float radius, { float3 position }, { float3 emission }, { float3 colour }, refl_type }
    __constant__ Sphere spheres[] = {
        { 1e5f, { 1e5f + 1.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { 0.75f, 0.25f, 0.25f }, DIFF }, //Left
        { 1e5f, { -1e5f + 99.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .25f, .25f, .75f }, DIFF }, //Rght
        { 1e5f, { 50.0f, 40.8f, 1e5f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Back
        { 1e5f, { 50.0f, 40.8f, -1e5f + 600.0f }, { 0.0f, 0.0f, 0.0f }, { 1.00f, 1.00f, 1.00f }, DIFF }, //Frnt
        { 1e5f, { 50.0f, 1e5f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Botm
        { 1e5f, { 50.0f, -1e5f + 81.6f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Top
        { 16.5f, { 27.0f, 16.5f, 47.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 1
        { 16.5f, { 73.0f, 16.5f, 78.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 2
        { 600.0f, { 50.0f, 681.6f - .77f, 81.6f }, { 2.0f, 1.8f, 1.6f }, { 0.0f, 0.0f, 0.0f }, DIFF } // Light
    };

    __device__ inline bool intersect_scene(const Ray &r, float &t, int &id){

        float n = sizeof(spheres) / sizeof(Sphere), d, inf = t = 1e20; // t is distance to closest intersection, initialise t to a huge number outside scene
        for (int i = int(n); i--;) // test all scene objects for intersection
            if ((d = spheres[i].intersect_sphere(r)) && d<t){ // if newly computed intersection distance d is smaller than current closest intersection distance
                t = d; // keep track of distance along ray to closest intersection point
                id = i; // and closest intersected object
            }
        return t<inf; // returns true if an intersection with the scene occurred, false when no hit
    }

    // random number generator from

    __device__ static float getrandom(unsigned int *seed0, unsigned int *seed1) {
    *seed0 = 36969 * ((*seed0) & 65535) + ((*seed0) >> 16); // hash the seeds using bitwise AND and bitshifts
    *seed1 = 18000 * ((*seed1) & 65535) + ((*seed1) >> 16);

    unsigned int ires = ((*seed0) << 16) + (*seed1);

    // Convert to float
    union {
    float f;
    unsigned int ui;
    } res;

    res.ui = (ires & 0x007fffff) | 0x40000000; // bitwise AND, bitwise OR

    return (res.f - 2.f) / 2.f;
    }

    // radiance function, the meat of path tracing
    // solves the rendering equation:
    // outgoing radiance (at a point) = emitted radiance + reflected radiance
    // reflected radiance is sum (integral) of incoming radiance from all directions in hemisphere above point,
    // multiplied by reflectance function of material (BRDF) and cosine incident angle
    __device__ float3 radiance(Ray &r, unsigned int *s1, unsigned int *s2){ // returns ray color

    float3 accucolor = make_float3(0.0f, 0.0f, 0.0f); // accumulates ray colour with each iteration through bounce loop
    float3 mask = make_float3(1.0f, 1.0f, 1.0f);

    // ray bounce loop (no Russian Roulette used)
    for (int bounces = 0; bounces < 4; bounces++){ // iteration up to 4 bounces (replaces recursion in CPU code)

    float t; // distance to closest intersection
    int id = 0; // index of closest intersected sphere

    // test ray for intersection with scene
    if (!intersect_scene(r, t, id))
    return make_float3(0.0f, 0.0f, 0.0f); // if miss, return black

    // else, we've got a hit!
    // compute hitpoint and normal
    const Sphere &obj = spheres[id]; // hitobject
    float3 x = r.orig + r.dir*t; // hitpoint
    float3 n = normalize(x - obj.pos); // normal
    float3 nl = dot(n, r.dir) < 0 ? n : n * -1; // front facing normal

    // add emission of current sphere to accumulated colour
    // (first term in rendering equation sum)
    accucolor += mask * obj.emi;

    // all spheres in the scene are diffuse
    // diffuse material reflects light uniformly in all directions
    // generate new diffuse ray:
    // origin = hitpoint of previous ray in path
    // random direction in hemisphere above hitpoint (see "Realistic Ray Tracing", P. Shirley)

    // create 2 random numbers
    float r1 = 2 * M_PI * getrandom(s1, s2); // pick random number on unit circle (radius = 1, circumference = 2*Pi) for azimuth
    float r2 = getrandom(s1, s2); // pick random number for elevation
    float r2s = sqrtf(r2);

    // compute local orthonormal basis uvw at hitpoint to use for calculation random ray direction
    // first vector = normal at hitpoint, second vector is orthogonal to first, third vector is orthogonal to first two vectors
    float3 w = nl;
    float3 u = normalize(cross((fabs(w.x) > .1 ? make_float3(0, 1, 0) : make_float3(1, 0, 0)), w));
    float3 v = cross(w,u);

    // compute random ray direction on hemisphere using polar coordinates
    // cosine weighted importance sampling (favours ray directions closer to normal direction)
    float3 d = normalize(u*cos(r1)*r2s + v*sin(r1)*r2s + w*sqrtf(1 - r2));

    // new ray origin is intersection point of previous ray with scene
    r.orig = x + nl*0.05f; // offset ray origin slightly to prevent self intersection
    r.dir = d;

    mask *= obj.col; // multiply with colour of object
    mask *= dot(d,nl); // weigh light contribution using cosine of angle between incident light and normal
    mask *= 2; // fudge factor
    } // end of bounce loop

    return accucolor;
    }

    // __global__ : executed on the device (GPU) and callable only from host (CPU)
    // this kernel runs in parallel on all the CUDA threads

    __global__ void render_kernel(float3 *output){

    // assign a CUDA thread to every pixel (x,y)
    // blockIdx, blockDim and threadIdx are CUDA specific keywords
    // replaces nested outer loops in CPU code looping over image rows and image columns
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    unsigned int i = (height - y - 1)*width + x; // index of current pixel (calculated using thread index)

    unsigned int s1 = x; // seeds for random number generator
    unsigned int s2 = y;

    // generate ray directed at lower left corner of the screen
    // compute directions for all other rays by adding cx and cy increments in x and y direction
    Ray cam(make_float3(50, 52, 295.6), normalize(make_float3(0, -0.042612, -1))); // first hardcoded camera ray(origin, direction)
    float3 cx = make_float3(width * .5135 / height, 0.0f, 0.0f); // ray direction offset in x direction
    float3 cy = normalize(cross(cx, cam.dir)) * .5135; // ray direction offset in y direction (.5135 is field of view angle)
    float3 r; // r is final pixel color

    r = make_float3(0.0f); // reset r to zero for every pixel

    for (int s = 0; s < samps; s++){ // samples per pixel

    // compute primary ray direction
    float3 d = cam.dir + cx*((.25 + x) / width - .5) + cy*((.25 + y) / height - .5);

    // create primary ray, add incoming radiance to pixelcolor
    r = r + radiance(Ray(cam.orig + d * 40, normalize(d)), &s1, &s2)*(1. / samps);
    } // Camera rays are pushed ^^^^^ forward to start in interior

    // write rgb value of pixel to image buffer on the GPU, clamp value to [0.0f, 1.0f] range
    output[i] = make_float3(clamp(r.x, 0.0f, 1.0f), clamp(r.y, 0.0f, 1.0f), clamp(r.z, 0.0f, 1.0f));
    }

    inline float clamp(float x){ return x < 0.0f ? 0.0f : x > 1.0f ? 1.0f : x; }

    inline int toInt(float x){ return int(pow(clamp(x), 1 / 2.2) * 255 + .5); } // convert RGB float in range [0,1] to int in range [0, 255] and perform gamma correction

    int main(){

    float3* output_h = new float3[width*height]; // pointer to memory for image on the host (system RAM)
    float3* output_d; // pointer to memory for image on the device (GPU VRAM)

    // allocate memory on the CUDA device (GPU VRAM)
    cudaMalloc(&output_d, width * height * sizeof(float3));

    // dim3 is CUDA specific type, block and grid are required to schedule CUDA threads over streaming multiprocessors
    dim3 block(8, 8, 1);
    dim3 grid(width / block.x, height / block.y, 1);

    printf("CUDA initialised.\nStart rendering...\n");

    // schedule threads on device and launch CUDA kernel from host
    render_kernel <<< grid, block >>>(output_d);

    // copy results of computation from device back to host
    cudaMemcpy(output_h, output_d, width * height *sizeof(float3), cudaMemcpyDeviceToHost);

    // free CUDA memory
    cudaFree(output_d);

    // Write image to PPM file, a very simple image file format
    FILE *f = fopen("smallptcuda.ppm", "w");
    fprintf(f, "P3\n%d %d\n%d\n", width, height, 255);
    for (int i = 0; i < width*height; i++) // loop over pixels, write RGB values
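    fprintf(f, "%d %d %d ", toInt(output_h[i].x), toInt(output_h[i].y), toInt(output_h[i].z));
    fclose(f); // runs once after the loop: flush and close the image file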

    printf("Saved image to 'smallptcuda.ppm'\n");

    delete[] output_h;
    }

    Optionally, the following 3D vector algebra functions can be inserted at the top of the file instead of #including "cutil_math.h". Rather than creating a Vector3D class (with 3 floats), the code uses CUDA's built-in float3 type, as built-in types have automatic memory alignment and provide higher performance. The "__host__ __device__" keywords in front of the functions allow them to run on both the CPU and GPU.

    // 3D vector algebra from cutil_math.h
    struct float3 {float x, y, z;};
    typedef struct float3 float3;
    // add
    inline __host__ __device__ float3 operator+(float3 a, float3 b){return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);}
    inline __host__ __device__ void operator+=(float3 &a, float3 b){a.x += b.x; a.y += b.y; a.z += b.z;}
    inline __host__ __device__ float3 operator+(float3 a, float b){return make_float3(a.x + b, a.y + b, a.z + b);}
    inline __host__ __device__ float3 operator+(float b, float3 a){return make_float3(b + a.x, b + a.y, b + a.z);}
    inline __host__ __device__ void operator+=(float3 &a, float b){a.x += b; a.y += b; a.z += b;}
    // subtract
    inline __host__ __device__ float3 operator-(float3 a, float3 b){return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);}
    inline __host__ __device__ void operator-=(float3 &a, float3 b){a.x -= b.x; a.y -= b.y; a.z -= b.z;}
    inline __host__ __device__ float3 operator-(float3 a, float b){return make_float3(a.x - b, a.y - b, a.z - b);}
    inline __host__ __device__ float3 operator-(float b, float3 a){return make_float3(b - a.x, b - a.y, b - a.z);}
    inline __host__ __device__ void operator-=(float3 &a, float b){a.x -= b; a.y -= b; a.z -= b;}
    // multiply
    inline __host__ __device__ float3 operator*(float3 a, float3 b){return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);}
    inline __host__ __device__ void operator*=(float3 &a, float3 b){a.x *= b.x; a.y *= b.y; a.z *= b.z;}
    inline __host__ __device__ float3 operator*(float3 a, float b){return make_float3(a.x * b, a.y * b, a.z * b);}
    inline __host__ __device__ float3 operator*(float b, float3 a){return make_float3(b * a.x, b * a.y, b * a.z);}
    inline __host__ __device__ void operator*=(float3 &a, float b){a.x *= b; a.y *= b; a.z *= b;}
    // divide
    inline __host__ __device__ float3 operator/(float3 a, float3 b){return make_float3(a.x / b.x, a.y / b.y, a.z / b.z);}
    inline __host__ __device__ void operator/=(float3 &a, float3 b){a.x /= b.x; a.y /= b.y; a.z /= b.z;}
    inline __host__ __device__ float3 operator/(float3 a, float b){return make_float3(a.x / b, a.y / b, a.z / b);}
    inline __host__ __device__ void operator/=(float3 &a, float b){a.x /= b; a.y /= b; a.z /= b;}
    inline __host__ __device__ float3 operator/(float b, float3 a){return make_float3(b / a.x, b / a.y, b / a.z);}
    // min
    inline __host__ __device__ float3 fminf(float3 a, float3 b){return make_float3(fminf(a.x, b.x), fminf(a.y, b.y), fminf(a.z, b.z));}
    // max
    inline __host__ __device__ float3 fmaxf(float3 a, float3 b){return make_float3(fmaxf(a.x, b.x), fmaxf(a.y, b.y), fmaxf(a.z, b.z));}
    // lerp
    inline __device__ __host__ float3 lerp(float3 a, float3 b, float t){return a + t*(b - a);}
    // clamp value v between a and b
    inline __device__ __host__ float clamp(float f, float a, float b){return fmaxf(a, fminf(f, b));}
    inline __device__ __host__ float3 clamp(float3 v, float a, float b){return make_float3(clamp(v.x, a, b), clamp(v.y, a, b), clamp(v.z, a, b));}
    inline __device__ __host__ float3 clamp(float3 v, float3 a, float3 b){return make_float3(clamp(v.x, a.x, b.x), clamp(v.y, a.y, b.y), clamp(v.z, a.z, b.z));}
    // dot product
    inline __host__ __device__ float dot(float3 a, float3 b){return a.x * b.x + a.y * b.y + a.z * b.z;}
    // length
    inline __host__ __device__ float length(float3 v){return sqrtf(dot(v, v));}
    // normalize
    inline __host__ __device__ float3 normalize(float3 v){float invLen = rsqrtf(dot(v, v));return v * invLen;}
    // floor
    inline __host__ __device__ float3 floorf(float3 v){return make_float3(floorf(v.x), floorf(v.y), floorf(v.z));}
    // frac
    inline __host__ __device__ float fracf(float v){return v - floorf(v);}
    inline __host__ __device__ float3 fracf(float3 v){return make_float3(fracf(v.x), fracf(v.y), fracf(v.z));}
    // fmod
    inline __host__ __device__ float3 fmodf(float3 a, float3 b){return make_float3(fmodf(a.x, b.x), fmodf(a.y, b.y), fmodf(a.z, b.z));}
    // absolute value
    inline __host__ __device__ float3 fabs(float3 v){return make_float3(fabs(v.x), fabs(v.y), fabs(v.z));}
    // reflect
    //returns reflection of incident ray I around surface normal N
    // N should be normalized, reflected vector's length is equal to length of I
    inline __host__ __device__ float3 reflect(float3 i, float3 n){return i - 2.0f * n * dot(n, i);}
    // cross product
    inline __host__ __device__ float3 cross(float3 a, float3 b){return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);}

    In this example, it's pretty easy to turn C/C++ code into CUDA code (CUDA C/C++ is an extension of the C/C++ language). The differences with the CPU version of smallpt are as follows:

    • smallpt's 3D Vector struct is replaced by CUDA's built-in float3 type (linear algebra vector functions for float3 are defined in cutil_math.h)
    • CUDA specific keyword __device__ before functions that should run on the GPU and are only callable from the GPU
    • CUDA specific keyword __global__ in front of the kernel that is called from the host (CPU) and which runs in parallel on all CUDA threads
    • a custom random number generator that runs on the GPU
    • as GPUs don't handle recursion well, the radiance function needs to be converted from a recursive function to an iterative function (see Richie Sam's blogpost or Karl Li's slides for more details) with a fixed number of bounces (Russian roulette could be implemented here to terminate paths with a certain probability, but I took it out for simplicity)
    • in a CPU raytracer, you loop over each pixel of the image with two nested loops (one for image rows and one for image columns). On the GPU the loops are replaced by a kernel which runs for each pixel in parallel. A global thread index is computed instead from the grid dimensions, block dimensions and local thread index. See for more details
    • the main() function calls CUDA specific functions to allocate memory on the CUDA device (cudaMalloc()), launch the CUDA kernel using the "<<< grid, block >>>" syntax and copy the results (in this case the rendered image) from the GPU back to the CPU, where the image is saved in PPM format (a supersimple image format)
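
    The recursive-to-iterative conversion from the list above can be illustrated with a toy example. This is only a sketch in plain host-side C++: it uses a single float instead of float3, and the names Bounce, radiance_recursive and radiance_iterative are made up for illustration. The iterative version mirrors the accucolor/mask bookkeeping of the radiance() kernel code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-bounce record: what the path hit at each bounce.
struct Bounce {
    float emission;     // light emitted by the surface that was hit
    float reflectance;  // fraction of incoming light the surface reflects
};

// Recursive form (CPU style): L_i = emission_i + reflectance_i * L_(i+1)
float radiance_recursive(const std::vector<Bounce>& path, std::size_t i = 0) {
    if (i == path.size()) return 0.0f;  // path terminated: no more light
    return path[i].emission + path[i].reflectance * radiance_recursive(path, i + 1);
}

// Iterative form (GPU style): carry a running "mask" (the product of all
// reflectances so far) and add each emission weighted by that mask.
float radiance_iterative(const std::vector<Bounce>& path) {
    float accucolor = 0.0f;
    float mask = 1.0f;
    for (const Bounce& b : path) {
        accucolor += mask * b.emission;
        mask *= b.reflectance;
    }
    return accucolor;
}
```

    Both functions compute the same sum; the loop simply expands the nested products of the recursion, which is why a fixed bounce count can replace the recursion depth.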

    When running the code above, we get the following image (1024 samples per pixel, brute force path tracing):

    Path traced color bleeding rendered entirely on the GPU! On my laptop's GPU (GeForce 840M) it renders about 24x faster than the multithreaded CPU version (laptop Core i7 clocked at 2.00 GHz). The neat thing here is that it only took about 100 lines (if you take out the comments) to get path tracing working on the GPU. The beauty lies in its simplicity.

    Even though the path tracing code already works well, it is actually very unoptimized and there are many techniques to speed it up:

    • explicit light sampling (or next event estimation): sample the light source directly instead of using brute force path tracing. This makes an enormous difference in reducing noise.
    • jittered sampling (also called stratified sampling): instead of sampling a pixel randomly, divide the pixel up into a number of layers (strata) in which random sampling is performed. According to Peter Shirley's book this way of sampling (which is partly structured and partly random) is one of the most important noise reduction methods
    • better random number generators
    • various importance sampling strategies: this code already performs cosine weighted importance sampling for diffuse rays, favouring rays with directions that are closer to the normal (as they contribute more to the final image). See  
    • ray tracing acceleration structures: kd-trees, octrees, grids, bounding volume hierarchies provide massive speedups
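
    As a rough illustration of the jittered (stratified) sampling item above, the following host-side C++ sketch generates one random sub-pixel offset per cell of an n x n grid. The names Sample2D and jittered_samples are hypothetical, and std::mt19937 merely stands in for whatever random number generator the renderer uses.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct Sample2D { float x, y; };  // sub-pixel offset inside the unit square

// Generate n*n jittered samples: the pixel is divided into an n x n grid of
// strata and one uniformly random point is picked inside every stratum.
std::vector<Sample2D> jittered_samples(int n, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    std::vector<Sample2D> samples;
    samples.reserve(static_cast<std::size_t>(n) * n);
    for (int j = 0; j < n; ++j) {        // stratum row
        for (int i = 0; i < n; ++i) {    // stratum column
            samples.push_back({ (i + uniform(rng)) / n, (j + uniform(rng)) / n });
        }
    }
    return samples;
}
```

    In the kernel, such offsets would replace the fixed .25 offset used when computing the primary ray direction, so that the samples of a pixel are spread evenly instead of clumping.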

    GPU specific optimisations (see and Karl Li's course slides linked below):
    • using shared memory and registers whenever possible is many times faster than using global/local memory
    • memory alignment for coalesced reads from GPU memory
    • thread compaction: since CUDA executes the threads of a kernel in groups of 32 ("warps"), threads within a warp that take different code paths give rise to thread divergence, which reduces the GPU's efficiency because the divergent paths are executed one after another. Thread compaction aims to mitigate the effects of thread divergence by bundling threads following similar code paths
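
    The idea behind thread compaction can be sketched on the CPU as follows. This is a simplified illustration, not code from the tutorial: PathState and compact_paths are made-up names, and std::stable_partition stands in for the parallel stream-compaction primitive a GPU implementation would use.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-path state kept between bounces.
struct PathState {
    int  pixel;  // pixel this path contributes to
    bool alive;  // false once the path has terminated (e.g. missed the scene)
};

// Move all live paths to the front of the array and return how many there are.
// The next bounce then only needs threads for indices [0, live_count), so the
// warps working on that range contain no idle, already terminated paths.
std::size_t compact_paths(std::vector<PathState>& paths) {
    auto first_dead = std::stable_partition(
        paths.begin(), paths.end(),
        [](const PathState& p) { return p.alive; });
    return static_cast<std::size_t>(first_dead - paths.begin());
}
```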

    I plan to cover the following topics (with CUDA implementations) in upcoming tutorials whenever I find some time:
    • an interactive viewport camera with progressive rendering, 
    • textures (and bump mapping), 
    • environment lighting, 
    • acceleration structures,  
    • triangles and triangle meshes
    • building more advanced features on top of Aila and Laine's GPU ray tracing framework which is also used by Blender's Cycles GPU renderer
    • dissecting some code snippets from Cycles' GPU renderer or SmallLuxGPU 

    References used:


    While the tutorial from the previous post was about path tracing simple scenes made of spheres, this tutorial will focus on how to build a very simple path tracer with support for loading and rendering triangle meshes. Instead of rendering the entire image in the background and saving it to a file as was done in the last tutorial, this path tracer displays an interactive viewport which shows progressively rendered updates. This way we can see the rendered image from the first pass and watch it converge to a noise free result (which can take some time in the case of path tracing triangle meshes without using acceleration structures).

    For this tutorial I decided to modify the code of a real-time CUDA ray tracer developed by Peter Trier from the Alexandra Institute in 2009 (described in this blog post), because it's very compact, does not use any external libraries (except for CUDA-OpenGL interoperability) and provides a simple obj loader for triangle meshes. I modified the ray tracing kernel to handle path tracing (without recursion) using the path tracing code from the previous tutorial, and added support for perfectly reflective and refractive materials (like glass) based on the code of smallpt. The random number generator from the previous post has been replaced with curand() from cuRAND, CUDA's own random number generation library, which is less prone to patterns at low sample rates and has more uniform distribution properties. The seed calculation is based on a trick described in a post on RichieSam's blog.

    Features of this path tracer

    - primitive types: supports spheres, boxes and triangles/triangle meshes
    - material types: support for perfectly diffuse, perfectly reflective and perfectly refractive materials
    - progressive rendering
    - interactive viewport displaying intermediate rendering results

    Scratch-a-Pixel has some excellent lessons on ray tracing triangles and triangle meshes, which discuss barycentric coordinates, backface culling and the fast Möller-Trumbore ray/triangle intersection algorithm that is also used in the code for this tutorial:

    - ray tracing triangles:

    - ray tracing polygon meshes:

    The code is one big CUDA file with lots of comments and can be found on my Github repository.

    Github repository link: 

    Some screenshots

    Performance optimisations

    - triangle edges are precomputed to speed up ray intersection computation and triangles are stored as (first vertex, edge1, edge2)
    - ray/triangle intersection uses the fast Möller-Trumbore technique
    - triangle data is stored in the GPU's texture memory which is cached and is a bit faster than global memory because fetching data from textures is accelerated in hardware. The texture cache is also optimized for 2D spatial locality, so threads that access addresses in texture memory that are close together in 2D will achieve best performance. 
    - triangle data is aligned in float4s (128 bits) for coalesced memory access, maximising memory throughput,  (see and
    - for expensive functions (such as sin() and sqrt()), compute fast approximations using single precision intrinsic math functions such as __sinf(), __powf(), __fdividef(): these functions are performed in hardware by the special function units (SFU) on the GPU and are much faster than the standard divide and sin/cos functions at the cost of precision and robustness in corner cases (see
    - to speed up the ray tracing an axis aligned bounding box is created around the triangle mesh. Only rays hitting this box are intersected with the mesh. Without this box,  all rays would have to be tested against every triangle for intersection, which is unbearably slow.
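
    To make the first two optimisations above more concrete, here is a minimal host-side C++ sketch of the Möller-Trumbore test on a triangle stored as (first vertex, edge1, edge2). It is written from the published algorithm rather than copied from the tutorial's kernel, so the names Vec3, Triangle, make_triangle and intersect_triangle, the epsilon value and the "-1 means miss" convention are all assumptions.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Triangle stored as (first vertex, edge1, edge2); the two edges are computed
// once when the mesh is loaded instead of once per ray.
struct Triangle { Vec3 v0, edge1, edge2; };

inline Triangle make_triangle(Vec3 a, Vec3 b, Vec3 c) {
    return { a, sub(b, a), sub(c, a) };
}

// Returns the distance t along the ray to the hit point, or -1 on a miss.
inline float intersect_triangle(Vec3 orig, Vec3 dir, const Triangle& tri) {
    const float eps = 1e-7f;
    Vec3 pvec = cross(dir, tri.edge2);
    float det = dot(tri.edge1, pvec);
    if (std::fabs(det) < eps) return -1.0f;   // ray is parallel to the triangle plane
    float inv_det = 1.0f / det;
    Vec3 tvec = sub(orig, tri.v0);
    float u = dot(tvec, pvec) * inv_det;      // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return -1.0f;
    Vec3 qvec = cross(tvec, tri.edge1);
    float v = dot(dir, qvec) * inv_det;       // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return -1.0f;
    float t = dot(tri.edge2, qvec) * inv_det; // distance along the ray
    return t > eps ? t : -1.0f;               // accept only hits in front of the origin
}
```

    Because this version does not reject negative determinants, it hits both sides of a triangle; a backface-culling variant would return early when det is below eps.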

    In the next tutorial, we'll have a look at implementing an acceleration structure, which speeds up the rendering by several orders of magnitude. This blog post provides a good overview of the most recent research in ray tracing acceleration structures for the GPU. There will also be an interactive camera to allow real-time navigation through the scene with depth of field and supersampled anti-aliasing (and there are still lots of optimisations to be made). 



    (In case you were wondering, my pun-loving girlfriend came up with the title for this post). This tutorial is the longest, but most crucial one so far and deals with the implementation of a ray tracing acceleration structure that can be traversed on the GPU. The code from the previous tutorial works okay for simple triangle meshes with fewer than 10,000 triangles, but since render times grow linearly, or O(n), with the complexity of the scene (each ray needs to test every primitive in the scene for intersection), anything above that number becomes unfeasible. To address this issue, ray tracing researchers came up with several acceleration structures such as grids, octrees, binary space partitioning trees (BSP trees), kd-trees and BVHs (bounding volume hierarchies), allowing render times to scale logarithmically, or O(log n), instead of linearly with scene complexity, a huge improvement in speed and efficiency. Acceleration structures are by far the most important ingredient to building a fast ray tracer and an enormous amount of research has gone into improving and refining the algorithms to build and traverse them, both on the CPU and on the GPU (the latter since 2006, around the same time unified shader architecture was introduced on GPUs). 

    Scratch-a-Pixel (again) has a great introduction to acceleration structures for ray tracing (grids and bounding volume hierarchies) that includes example code: Peter Shirley's "Realistic Ray Tracing" book also contains a good description and implementation of a BVH with C++ code.

    An overview of the latest state-of-the-art research in acceleration structures for GPUs can be found in this blogpost on Robbin Marcus' blog:

    This tutorial focuses on the implementation of a BVH acceleration structure on the GPU, and comes with complete annotated source code for BVH construction (on the CPU) and BVH traversal (on the GPU using CUDA). The reason for choosing a BVH over a grid or kd-tree is because BVHs map better to modern GPU architectures and have also been shown to be the acceleration structure which allows the fastest build and render times (see for example Another reason for choosing BVHs is that they are conceptually simple and easy to implement. The Nvidia research paper "Understanding the efficiency of ray traversal on GPUs" by Aila and Laine comes with open source code that contains a highly optimised BVH for CUDA path tracers which was used in Cycles, Blender's GPU path tracing renderer (

    The code in this tutorial is based on a real-time CUDA ray tracer developed by Thanassis Tsiodras, which can be found on and which I converted to support path tracing instead. The BVH from this renderer is already quite fast and relatively easy to understand.

    For the purpose of clarity and to keep the code concise (as there's quite a lot of code required for BVH construction), I removed quite a few nice features from Thanassis' code which are not essential for this tutorial, such as multithreaded BVH building on the CPU (using SSE intrinsics), various render modes (like point rendering), backface culling, a scheme to divide the image in rendertiles in Morton order (along a space filling Z-curve) and some clever workarounds to deal with CUDA's limitations such as separate templated kernels for shadow rays and ambient occlusion. 

    One of the more tricky parts of implementing a BVH for ray tracing on the GPU is how to store the BVH structure and BVH node data in a GPU friendly format. CPU ray tracers store a BVH as a hierarchical structure starting with the root node, which contains pointers to its child nodes (in case of an inner node) or pointers to triangles (in case of a leaf node). Since a BVH is built recursively, the child nodes in turn contain pointers to their own child nodes and this keeps on going until the leaf nodes are reached. This process involves lots of pointers which might point to scattered locations in memory, a scenario which is not ideal for the GPU. GPUs like coherent, memory aligned data structures such as indexable arrays that avoid the use of too many pointers. In this tutorial, the BVH data (such as nodes, triangle data, triangle indices, precomputed intersection data) are therefore stored in flat one-dimensional arrays (storing elements in depth first order by recursively traversing the BVH), which can be easily digested by CUDA and are stored on the GPU in either global memory or texture memory in the form of CUDA textures (hardware cached). The BVH in this tutorial is using CUDA texture memory, since global memory on older GPUs is not cached (as opposed to texture memory). Since the introduction of Fermi however, global memory is also cached and the performance difference when using one or the other is hardly noticeable.  
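
    The step from a pointer-based tree to a flat, index-based array can be sketched in a few lines of host-side C++. This is a simplified illustration rather than the tutorial's actual code: TreeNode, FlatNode and flatten are hypothetical names, and a single int stands in for the bounding box and triangle data of a real node.

```cpp
#include <memory>
#include <vector>

// Pointer-based node as a CPU builder might produce it (hypothetical layout).
struct TreeNode {
    int id;                                 // stand-in for bounding box / triangle data
    std::unique_ptr<TreeNode> left, right;  // both null for a leaf node
};

// Flat node: children are referenced by array index instead of by pointer,
// so the whole tree can be copied to the GPU as one contiguous array.
struct FlatNode {
    int id;
    int left;   // index of the left child in the flat array, -1 for a leaf
    int right;  // index of the right child in the flat array, -1 for a leaf
};

// Append `node` and its whole subtree to `out` in depth-first order and
// return the index at which `node` itself was stored.
int flatten(const TreeNode& node, std::vector<FlatNode>& out) {
    int index = static_cast<int>(out.size());
    out.push_back({ node.id, -1, -1 });
    if (node.left && node.right) {
        int l = flatten(*node.left, out);   // left subtree directly follows its parent
        int r = flatten(*node.right, out);
        out[index].left = l;                // written after recursion: out may have reallocated
        out[index].right = r;
    }
    return index;
}
```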

    In order to avoid wasting time by rebuilding the BVH every time the program is run, the BVH is built only once and stored in a file. For this to work, the BVH data is converted to a cache-friendly format which takes up as little memory space as possible (but the compactness of the data makes it also harder to read). A clever scheme is used to store BVH leaf nodes and inner nodes using the same data structure: instead of using a separate struct for leaf nodes and inner nodes, both types of nodes occupy the same memory space (using a union), which stores either two child indices to the left and right child when dealing with an inner node or a start index into the list of triangles and a triangle count in case of a leaf node. To distinguish between a leaf node and an inner node, the highest bit of the triangle count variable is set to 1 for a leaf node. The renderer can then determine at runtime if it has intersected an inner node or a leaf node by checking the highest bit (with a bitwise AND operation).  
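
    A node layout along these lines could look as follows. This is a sketch under assumptions, not the tutorial's CacheFriendlyBVHNode: the struct and field names are invented, and the only properties taken from the description above are the shared storage for both node kinds, the leaf flag in the highest bit of the triangle count, and the bitwise AND used to test it.

```cpp
#include <cstdint>

// Hypothetical compact node: bounding box plus 8 bytes shared by both node kinds.
struct CompactBVHNode {
    float bbox_min[3];  // 12 bytes
    float bbox_max[3];  // 12 bytes
    union {
        struct { std::uint32_t left, right; } inner;           // child node indices
        struct { std::uint32_t count, first_triangle; } leaf;  // highest bit of count = leaf flag
    } u;
};
static_assert(sizeof(CompactBVHNode) == 32, "node should occupy exactly 32 bytes");

const std::uint32_t kLeafFlag = 0x80000000u;  // highest bit of a 32-bit value

inline void make_inner(CompactBVHNode& n, std::uint32_t left, std::uint32_t right) {
    n.u.inner.left = left;    // must stay below 2^31 so the flag bit remains 0
    n.u.inner.right = right;
}

inline void make_leaf(CompactBVHNode& n, std::uint32_t first_triangle, std::uint32_t count) {
    n.u.leaf.count = count | kLeafFlag;  // set the highest bit to mark a leaf
    n.u.leaf.first_triangle = first_triangle;
}

// The first 32-bit word is shared by both layouts, so testing its highest bit
// with a bitwise AND tells the two node kinds apart.
inline bool is_leaf(const CompactBVHNode& n) {
    return (n.u.leaf.count & kLeafFlag) != 0;
}

inline std::uint32_t triangle_count(const CompactBVHNode& n) {
    return n.u.leaf.count & ~kLeafFlag;  // strip the flag to get the real count
}
```

    The price of this packing is that child indices are limited to 31 bits, which is far more than any scene in this tutorial needs.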

    A lot of the triangle intersection data (such as triangle edges, barycentric coordinates, dot products between vertices and edge planes) is precomputed at the scene initialisation stage and stored. Since modern GPUs have much more raw compute power than memory bandwidth, it would be interesting to know whether fetching the precomputed data from memory is faster or slower compared to computing that data directly on the GPU. 

    The following is a high level explanation of the algorithm for top-down BVH construction (on the CPU) and traversal (on the GPU). The BVH in this code is built according to the surface area heuristic and uses binning to find the best splitting plane. The details of the BVH algorithm can be found in the following papers:

    "On fast construction of SAH based Bounding Volume Hierarchies" by Ingo Wald, 2007. This paper is a must read in order to understand what the code is doing.

    - "Ray tracing deformable scenes using dynamic Bounding Volume Hierarchies" by Wald, Boulos and Shirley, 2007

    - "On building fast kd-trees for ray tracing, and on doing that in O(N log N)" by Wald and Havran, 2006

    Overview of algorithm for building the BVH on the CPU

    - the main() function (in main.cpp) calls prepCUDAscene(), which in turn calls UpdateBoundingVolumeHierarchy()

    - UpdateBoundingVolumeHierarchy() checks if there is already a BVH for the scene stored (cached) in a file and loads that one or builds a new BVH by calling CreateBVH()

    - CreateBVH():
    1. computes a bbox (bounding box) for every triangle and calculates its bounds (top and bottom)
    2. initialises a "working list" bbox to contain all the triangle bboxes
    3. expands the bounds of the working list bbox so it encompasses all triangles in the scene by looping over all the triangle bboxes
    4. computes each triangle bbox centre and adds the triangle bbox to the working list
    5. passes the working list to Recurse(), which builds the BVH tree structure
    6. returns the BVH root node
    Recurse() recursively builds the BVH tree from top (root node) to bottom using binning, finding optimal split planes for each depth. It divides the work bounding box into a number of equally sized "bins" along each axis and chooses the axis and splitting plane resulting in the least cost (determined by the surface area heuristic or SAH: the larger the surface area of a bounding box, the costlier it is to raytrace), in order to find the bboxes with the minimum surface area:
    1. Check if the working list contains fewer than 4 elements (triangle bboxes), in which case create a leaf node and push each triangle to a triangle list
    2. Create an inner node if the working list contains 4 or more elements
    3. Divide node further into smaller nodes
    4. Start by finding the working list bounds (top and bottom)
    5. Loop over all bboxes in current working list, expanding/growing the working list bbox
    6. Find the surface area of the working list's bounding box from its dimensions (side lengths)
    7. The current bbox has a cost C of N (number of triangles) * SA (Surface Area) or C = N * SA
    8. Loop over all three axes (X, Y, Z) to find best splitting plane using "binning"
    9. Binning: try splitting the current axis at a uniform distance (equidistantly spaced planes) in "bins" of size "step" that gets smaller the deeper we go: size of "sampling grid": 1024 (depth 0), 512 (depth 1), etc
    10. For each bin (equally spaced bins of size "step"), initialise a left and right bounding box 
    11. For each test split (or bin), allocate all triangles in the current work list based on their bbox centers (this is a fast O(N) pass, no triangle sorting needed): if the center of the triangle bbox is smaller than the test split value, put the triangle in the left bbox, otherwise put the triangle in the right bbox. Count the number of triangles in the left and right bboxes.
    12. Now use the Surface Area Heuristic to see if this split has a better "cost": calculate the surface area of the left and right bbox and calculate the total cost by multiplying the surface area of the left and right bbox by the number of triangles in each. Keep track of cheapest split found so far.
    13. At the end of this loop (which runs for every "bin" or "sample location"), we should have the best splitting plane, best splitting axis and bboxes with minimal traversal cost
    14. If we found no split to improve the cost, create a BVH leaf, otherwise create a BVH inner node with L and R child nodes. Split with the optimal value we found above.
    15. After selection of the best split plane, distribute each of the triangles into the left or right child nodes based on their bbox center
    16. Recursively build the left and right child nodes (repeat steps 1 to 16)
    17. When all recursive function calls have finished, the end result of Recurse() is the root node of the BVH
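
    The core of steps 8 to 13 (trying equally spaced split planes and scoring them with the surface area heuristic) can be sketched in host-side C++ as follows. This is a simplified sketch rather than the tutorial's Recurse(): the names AABB, Split and find_best_split are hypothetical, the number of bins is a fixed parameter instead of shrinking with depth, and half the surface area is used since a constant factor does not change which split is cheapest.

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

struct AABB { float lo[3], hi[3]; };  // axis aligned bounding box (bottom, top)

inline AABB empty_box() {
    return { { FLT_MAX, FLT_MAX, FLT_MAX }, { -FLT_MAX, -FLT_MAX, -FLT_MAX } };
}

inline void grow(AABB& box, const AABB& other) {  // expand box to enclose other
    for (int a = 0; a < 3; ++a) {
        box.lo[a] = std::min(box.lo[a], other.lo[a]);
        box.hi[a] = std::max(box.hi[a], other.hi[a]);
    }
}

inline float half_area(const AABB& b) {  // half the surface area of the box
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return dx * dy + dy * dz + dz * dx;
}

inline float centre(const AABB& b, int axis) { return 0.5f * (b.lo[axis] + b.hi[axis]); }

struct Split { int axis; float position; float cost; };  // axis == -1: keep as leaf

// Try (bins - 1) equally spaced planes on every axis and return the cheapest
// split according to cost = N_left * SA_left + N_right * SA_right.
Split find_best_split(const std::vector<AABB>& boxes, int bins) {
    AABB bounds = empty_box();
    for (const AABB& b : boxes) grow(bounds, b);
    Split best = { -1, 0.0f, boxes.size() * half_area(bounds) };  // cost of not splitting
    for (int axis = 0; axis < 3; ++axis) {
        float step = (bounds.hi[axis] - bounds.lo[axis]) / bins;
        if (step <= 0.0f) continue;  // bounds are flat along this axis
        for (int k = 1; k < bins; ++k) {
            float plane = bounds.lo[axis] + k * step;
            AABB left = empty_box(), right = empty_box();
            int n_left = 0, n_right = 0;
            for (const AABB& b : boxes) {  // O(N) pass, no sorting needed
                if (centre(b, axis) < plane) { grow(left, b); ++n_left; }
                else                         { grow(right, b); ++n_right; }
            }
            if (n_left == 0 || n_right == 0) continue;  // plane separates nothing
            float cost = n_left * half_area(left) + n_right * half_area(right);
            if (cost < best.cost) best = { axis, plane, cost };
        }
    }
    return best;
}
```

    For example, four unit boxes forming two clusters far apart along X are best split along the X axis between the clusters, because both child boxes then become much smaller than the parent.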
    Once the BVH has been created, we can copy its data into a memory saving, cache-friendly format (CacheFriendlyBVHNode occupies exactly 32 bytes, i.e. a cache-line) by calling CreateCFBVH(), which recursively counts the triangles and bounding boxes and stores them in depth first order in one-dimensional arrays by calling PopulateCacheFriendlyBVH().

    The data of the cache friendly BVH is copied to the GPU in CUDA global memory by prepCUDAscene() (using the cudaMalloc() and cudaMemcpy() functions). Once the data is in global memory it's ready to be used by the renderer, but the code takes it one step further and binds the BVH data to CUDA textures for performance reasons (texture memory is cached, although global memory is also cached since Fermi). The texture binding is done by cudarender(), which calls cudaBindTexture(). After this stage, all scene data is ready to be rendered (rays traversing the BVH and intersecting triangles).

    Overview of algorithm for traversing the BVH on the GPU

    - after cudarenderer() has bound the data to CUDA textures with cudaBindTexture() the first time it's being called, it launches the CoreLoopPathTracingKernel() which runs in parallel over all pixels to render a frame.
    - CoreLoopPathTracingKernel() computes a primary ray starting from the interactive camera view (which can differ each frame) and calls path_trace() to calculate the ray bounces 
    - path_trace() first tests all spheres in the scene for intersection and then tests if the ray intersects any triangles by calling BVH_IntersectTriangles() which traverses the BVH.
    - BVH_IntersectTriangles():
    1. initialise a stack to keep track of all the nodes the ray has traversed
    2. while the stack is not empty, pop a BVH node from the stack and decrement the stack index
    3. fetch the data associated with this node (indices to left and right child nodes for inner nodes or start index in triangle list + triangle count for leaf nodes)
    4. determine if the node is an inner node or a leaf node by examining the highest bit of the count variable
    5. if inner node, test ray for intersection with AABB (axis aligned bounding box) of node --> if intersection, push left and right child node indices on the stack, and go back to step 2 (pop next node from the stack)
    6. if leaf node, loop over all the triangles in the node (determined by the start index in the list of triangle indices and the triangle count), 
    7. for each triangle in the node, fetch the index, center, normal and precomputed intersection data and check for intersection with the ray
    8. if ray intersects triangle, keep track of the closest hit
    9. recursively traverse the left and right child nodes, if any (do step 2 - 9)
    10. after all recursive calls have finished, the end result returned by the function is a bool based on the index of the closest hit triangle (true if index is not -1)
    - after the ray has been tested for intersection with the scene, compute the colour of the ray by multiplying with the colour of the intersected object, calculate the direction of the next ray in the path according to the material BRDF and accumulate the colours of the subsequent path segments (see GPU path tracing tutorial 1).
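
    The traversal loop described above can be sketched on the CPU as follows. This is a simplified illustration under several assumptions: the Node, Tri and Ray types and closestHit() are hypothetical names, the bounding box is tested for every popped node (not only inner nodes), and a plain Möller-Trumbore test stands in for the tutorial's precomputed triangle intersection data.

```cpp
#include <cassert>
#include <cfloat>
#include <vector>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static V3 cross(V3 a, V3 b) { return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x }; }
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Tri { V3 a, b, c; };
struct Ray { V3 o, d; };
struct Node {
    V3 lo, hi;          // axis aligned bounding box
    bool leaf;
    int left, right;    // inner node: indices of the child nodes
    int first, count;   // leaf node: range in the triangle list
};

// Slab test of the ray against the node's AABB, limited to [0, tMax].
static bool hitBox(const Ray& r, const Node& n, float tMax) {
    const float o[3] = { r.o.x, r.o.y, r.o.z }, d[3] = { r.d.x, r.d.y, r.d.z };
    const float lo[3] = { n.lo.x, n.lo.y, n.lo.z }, hi[3] = { n.hi.x, n.hi.y, n.hi.z };
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        if (d[a] == 0.0f) {                       // ray parallel to this slab
            if (o[a] < lo[a] || o[a] > hi[a]) return false;
            continue;
        }
        float inv = 1.0f / d[a];
        float tn = (lo[a] - o[a]) * inv, tf = (hi[a] - o[a]) * inv;
        if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
        if (tn > t0) t0 = tn;
        if (tf < t1) t1 = tf;
        if (t0 > t1) return false;
    }
    return true;
}

// Möller-Trumbore ray/triangle test; writes the hit distance to tHit.
static bool hitTri(const Ray& r, const Tri& t, float& tHit) {
    V3 e1 = sub(t.b, t.a), e2 = sub(t.c, t.a);
    V3 p = cross(r.d, e2);
    float det = dot(e1, p);
    if (det > -1e-8f && det < 1e-8f) return false; // ray parallel to triangle
    float inv = 1.0f / det;
    V3 s = sub(r.o, t.a);
    float u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    V3 q = cross(s, e1);
    float v = dot(r.d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    tHit = dot(e2, q) * inv;
    return tHit > 1e-4f;
}

// Returns the index of the closest triangle hit, or -1 if nothing was hit.
int closestHit(const std::vector<Node>& nodes, const std::vector<Tri>& tris,
               const Ray& ray, float& tBest) {
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;                               // start at the root node
    int best = -1;
    tBest = FLT_MAX;
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];        // pop and decrement the stack index
        if (!hitBox(ray, n, tBest)) continue;      // also culls boxes behind the closest hit
        if (!n.leaf) {
            assert(sp + 2 <= 64);
            stack[sp++] = n.left;
            stack[sp++] = n.right;
            continue;
        }
        for (int i = n.first; i < n.first + n.count; ++i) {
            float t;
            if (hitTri(ray, tris[i], t) && t < tBest) { tBest = t; best = i; }
        }
    }
    return best;
}
```

    The function returns an index, so the bool result described in step 10 corresponds to checking whether that index is not -1.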

    In addition to the BVH, I added an interactive camera based on the interactive CUDA path tracer code from Yining Karl Li and Peter Kutz. The camera's view direction and position can be changed interactively with mouse and keyboard (a new orthonormal basis for the camera is computed each frame). The camera produces an antialiased image by jittering the primary ray directions. By allowing primary rays to start randomly on a simulated disk shaped lens instead of from a point, a camera aperture (the opening in the diaphragm) with focal plane can be simulated, providing a cool, photographic depth-of-field effect. The focal distance can also be adjusted interactively.
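
    The depth-of-field idea can be sketched as below. This is an illustrative thin lens model, not the tutorial's camera code: thinLensRay() and its parameters are hypothetical, the camera basis vectors are assumed to be orthonormal, and u1 and u2 are assumed to be uniform random numbers in [0, 1) supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
static Vec3 add(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3 mul(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(Vec3 a) { return mul(a, 1.0f / std::sqrt(dot(a, a))); }

struct CameraRay { Vec3 origin, dir; };

// pinholeDir is the (already jittered) direction a pinhole camera would use
// for this pixel. The returned ray starts on a disk shaped lens and passes
// through the point where the pinhole ray meets the focal plane.
CameraRay thinLensRay(Vec3 camPos, Vec3 forward, Vec3 right, Vec3 up,
                      Vec3 pinholeDir, float apertureRadius, float focalDistance,
                      float u1, float u2) {
    assert(focalDistance > 0.0f);
    // Intersection of the pinhole ray with the plane at focalDistance along forward
    float t = focalDistance / dot(pinholeDir, forward);
    Vec3 focus = add(camPos, mul(pinholeDir, t));
    // Uniformly distributed point on the lens disk
    float r = apertureRadius * std::sqrt(u1);
    float phi = 6.2831853f * u2;
    Vec3 offset = add(mul(right, r * std::cos(phi)), mul(up, r * std::sin(phi)));
    Vec3 origin = add(camPos, offset);
    return { origin, normalize(sub(focus, origin)) };
}
```

    With an aperture radius of zero this reduces to the pinhole camera; larger radii blur everything that does not lie on the focal plane.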

    The material system for this tutorial allows five basic materials: ideal diffuse, ideal specular, ideal refractive, Phong metal (based on code from Peter Shirley's "Realistic Ray Tracing" book) with a hardcoded exponent and a coat (acrylic) material (based on Karl Li and Peter Kutz' CUDA path tracer).

    CUDA/C++ code (on GitHub)

    The code for this tutorial can be found on GitHub. I've added plenty of comments throughout the code, but if some steps aren't clear, let me know. Detailed compilation instructions for Windows and Visual Studio are in the readme file.

    I'll provide a downloadable executable once I find a good website to upload it to (Google code is no longer an option). 

    Screenshots produced with the code from this tutorial (Stanford Dragon and Happy Buddha .ply models from the Stanford 3D scanning repository)

    Glossy Stanford dragon model (871,000 triangles)

    Happy Buddha model (1,088,000 triangles) with Phong metal material

    The next tutorial will add even more speed: I'll dive deeper into the highly optimised BVH acceleration structure for GPU traversal from Aila and Laine, which uses spatial splitting to build higher quality (and faster) trees. It's also the framework that the GPU part of Blender Cycles is using.

    Other features for upcoming tutorials are support for textures, sun and sky lighting, environment lighting, more general and accurate materials using Fresnel, area light support, direct light sampling and multiple importance sampling.


    - CUDA based sphere path tracer by Peter Kutz and Yining Karl Li
    - Overview of state-of-the-art acceleration structures for GPU ray tracing by Robbin Marcus
    - "Realistic Ray Tracing" by P. Shirley


    Peter Shirley has just released "Ray Tracing: The Next Week", a free book on Amazon for anyone who wants to learn how to code a basic path tracer in C:

    "Ray Tracing: The Next Week" is a follow-up to another mini-book by Shirley, "Ray Tracing in One Weekend", which was released only last month and covers the very basics of a ray tracer, including ray-sphere intersection, path tracing of diffuse, metal and dielectric materials, anti-aliasing, a positionable camera and depth-of-field. The Kindle edition is available for free when downloaded within the next five days (until 11 March). The book is excellent for people who quickly want to dive into coding a path tracer from scratch without being overwhelmed by theoretical details. It covers more advanced features such as solid textures, image textures, participating media, motion blur, instancing and BVH acceleration structures, and comes with source code snippets (using C plus classes and operator overloading, easily portable to CUDA). The code even contains some simple but clever optimisation tricks which are not published in any other ray tracing books.


    AMD has just released the full source code of FireRays, their OpenCL based GPU renderer, which has been available as an SDK library since August 2015. This is an outstanding move by AMD which significantly lowers the threshold for developers to enter the GPU rendering arena and create an efficient OpenCL based path tracing engine that is able to run on hardware from AMD, Intel and Nvidia without extra effort. 

    Here's an ugly sample render of FireRays provided by AMD:

    And an old video from one of the developers:

    Nvidia open sourced their high performance CUDA based ray tracing framework in 2009, but hasn't updated it since 2012 (presumably due to the lack of any real competition from AMD in this area) and has since focused more on developing OptiX, a CUDA based closed source ray tracing library. Intel open sourced Embree in 2011, which is being actively developed and updated with new features and performance improvements. They even released another open source high performance ray tracer for scientific visualisation called OSPRay.

    FireRays seems to have some advanced features such as ray filtering, geometry and ray masking (to make certain objects invisible to the camera or selectively ignore effects like shadows and reflections) and support for volumetrics. Hopefully AMD will also release some in-depth documentation and getting started tutorials in order to maximise adoption of this new technology among developers who are new to GPU ray tracing.


    Last week, Edd Biddulph released the code and some videos of a very impressive project he's working on: a real-time path traced version of Quake 2 running on OpenGL 3.3.

    Project link with videos:
    Full source code on Github:

    Quake 2, now with real-time indirect lighting and soft shadows
    The path tracing engine behind this project is quite astonishing when you consider the number of lightsources in the level and the number of dynamic characters (each with a unique pose) that are updated every single frame. I had a very interesting talk with Edd about some of the features of his engine, which revealed that he used a lot of clever optimisations (some of which take advantage of the specific properties of the Quake 2 engine). 

    Copying Edd's answers here:
    Why Quake 2 instead of Quake 3
    I chose Quake 2 because it has area lightsources and the maps were designed with multiple-bounce lighting in mind. As far as I know, Quake 3 was not designed this way and didn't even have area lightsources for the baked lighting. Plus Quake 2's static geometry was still almost entirely defined by a binary space partitioning tree (BSP) and I found that traversing a BSP is pretty easy in GLSL and seems to perform quite well, although I haven't made any comparisons to other approaches. Quake 3 has a lot more freeform geometry such as tessellated Bezier surfaces so it doesn't lend itself so well to special optimisations. I'm a big fan of both games of course :)

    How the engine updates dynamic objects
    All dynamic geometry is inserted into a single structure which is re-built from scratch on every frame. Each node is an axis-aligned bounding box and has a 'skip pointer' to skip over the children. I make a node for each triangle and build the structure bottom-up after sorting the leaf nodes by morton code for spatial coherence. I chose this approach because the implementation is simple both for building and traversing, the node hierarchy is quite flexible, and building is fast although the whole CPU side is single-threaded for now (mostly because Quake 2 is single-threaded of course). I'm aware that the lack of ordered traversal results in many more ray-triangle intersection tests than are necessary, but there is little divergence and low register usage since the traversal is stackless.
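
    As an aside to Edd's answer: sorting leaf nodes by Morton code is commonly done by quantising each leaf's center to a grid and interleaving the bits of the three coordinates. The sketch below shows the widely used 30-bit variant (10 bits per axis). It is not taken from Edd's code; the function names are hypothetical and the coordinates are assumed to be normalised to [0, 1] within the scene bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Spread the lower 10 bits of v so that two zero bits follow each original bit.
std::uint32_t expandBits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0, 1].
std::uint32_t morton3D(float x, float y, float z) {
    auto quantise = [](float v) {
        float s = std::min(std::max(v * 1024.0f, 0.0f), 1023.0f);
        return static_cast<std::uint32_t>(s);   // 10-bit grid coordinate
    };
    return expandBits(quantise(x)) * 4u + expandBits(quantise(y)) * 2u + expandBits(quantise(z));
}
```

    Sorting leaves by this key places triangles that are close in space close together in the array, which is what makes a simple bottom-up build produce reasonably tight nodes.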

    How to keep noise to a minimum when dealing with so many lights
    The light selection is a bit more tricky. I divided lightsources into two categories - regular and 'skyportals'. A skyportal is just a light-emitting surface from the original map data which has a special texture applied, which indicates to the game that the skybox should be drawn there. Each leaf in the BSP has two lists of references to lightsources. The first list references regular lightsources which are potentially visible from the leaf according to the PVS (potentially visible set) tables. The second list references skyportals which are contained within the leaf. At an intersection point the first list is used to trace shadow rays and make explicit samples of lightsources, and the second list is used to check if the intersection point is within a skyportal surface. If it's within a skyportal then there is a contribution of light from the sky. This way I can perform a kind of offline multiple importance sampling (MIS) because skyportals are generally much larger than regular lights. For regular lights of course I use importance sampling, but I believe the weight I use is more approximate than usual because it's calculated always from the center of the lightsource rather than from the real sample position on the light.

    One big point about the lights right now is that the pointlights that the original game used are being added as 4 triangular lightsources arranged in a tetrahedron so they tend to make quite a performance hit. I'd like to try adding a whole new type of lightsource such as a spherical light to see if that works out better.

    Ray tracing specific optimisations
    I'm making explicit light samples by tracing shadow rays directly towards points on the lightsources. MIS isn't being performed in the shader, but I'm deciding offline whether a lightsource should be sampled explicitly or implicitly.

    Which parts of the rendering process use rasterisation
    I use hardware rasterisation only for the primary rays and perform the raytracing in the same pass for the following reasons:
    • Translucent surfaces can be lit and can receive shadows identically to all other surfaces.
    • Hardware anti-aliasing can be used, of course.
    • Quake 2 sorts translucent BSP surfaces and draws them in a second pass, but it doesn't do this for entities (the animated objects) so I would need to change that design and I consider this too intrusive and likely to break something. One of my main goals was to preserve the behaviour of Q2's own renderer.
    • I'm able to eliminate overdraw by making a depth-only pre-pass which even uses the same GL buffers that the raytracer uses so it has little overhead except for a trick that I had to make since I packed the three 16-bit triangle indices for the raytracer into two 32-bit elements (this was necessary due to OpenGL limitations on texture buffer objects).
    • It's nice that I don't need to manage framebuffer stuff and design a good g-buffer format.
    The important project files containing the path tracing code
    If you want to take a look at the main parts that I wrote, stick to src/client/refresh/r_pathtracing.c and src/client/refresh/pathtracer.glsl. The rest of my changes were mostly about adding various GL extensions and hooking in my stuff to the old refresh subsystem (Quake 2's name for the renderer). I apologise that r_pathtracing.c is such a huge file, but I did try to comment it nicely and refactoring is already on my huge TODO list. The GLSL file is converted into a C header at build time by which is at the root of the codebase.

    More interesting tidbits
    - This whole project is only made practical by the fact that the BSP files still contain surface emission data despite the game itself making no use of it at all. This is clearly a by-product of keeping the map-building process simple, and it's a very fortunate one!
    - The designers of the original maps sometimes placed pointlights in front of surface lights to give the appearance that they are glowing or emitting light at their sides like a fluorescent tube diffuser. This looks totally weird in my pathtracer so I have static pointlights disabled by default. They also happen to go unused by the original game, so it's also fortunate that they still exist among the map data. 
    - The weapon that is viewed in first-person is drawn with a 'depth hack' (it's literally called RF_DEPTHHACK), in which the range of depth values is reduced to prevent the weapon poking in to walls. Unfortunately the pathtracer's representation would still poke in to walls because it needs the triangles in worldspace, and this would cause the tip of the weapon to turn black (completely in shadow). I worked around this by 'virtually' scaling down the weapon for the pathtracer. This is one of the many ways in which raytracing turns out to be tricky for videogames, but I'm sure there can always be elegant solutions.
    If you want to mess around with the path traced version of Quake 2 yourself (both AMD and Nvidia cards are supported as the path tracer uses OpenGL), simply follow these steps:
    • on Windows, follow the steps under section 2.3 in the readme file. Lots of websites still offer the Quake 2 demo for download.
    • download and unzip the Yamagi Quake 2 source code with path tracing from GitHub
    • following the steps under section 2.6 of the readme file, download and extract the premade MinGW build environment, run MSYS32, navigate to the source directory with the makefile, "make" the release build and replace the files "q2ded.exe", "quake2.exe" and "baseq2\game.dll" in the Quake 2 game installation with the freshly built ones
    • start the game by double clicking "quake2", open the Quake2 console with the ~ key (under the ESC key), type "gl_pt_enable 1", hit Enter and the ~ key to close the console
    • the game should now run with path tracing

    Edd also said he's planning to add new special path tracing effects (such as light emitting particles from the railgun) and to implement more optimisations to reduce the path tracing noise.


    For this tutorial, I've implemented a couple of improvements based on the high performance GPU ray tracing framework of Timo Aila, Samuli Laine and Tero Karras (Nvidia research) which is described in their 2009 paper "Understanding the efficiency of ray traversal on GPUs" and the 2012 addendum to the original paper which contains specifically hand tuned kernels for Fermi and Kepler GPUs (which also work on Maxwell). The code for this framework is open source and can be found at the Google code repository (which is about to be phased out) or on GitHub. The ray tracing kernels are thoroughly optimised and deliver state-of-the-art performance (the code from this tutorial is 2-3 times faster than the previous one).  For that reason, they are also used in the production grade CUDA path tracer Cycles:




    The major improvements from this framework are:

    - Spatial split BVH: this BVH building method is based on Nvidia's "Spatial splits in bounding volume hierarchies" paper by Martin Stich. It aims to reduce BVH node overlap (a high amount of node overlap lowers ray tracing performance) by combining the object splitting strategy of regular BVH building (according to a surface area heuristic or SAH) with the space splitting method of kd-tree building. The algorithm determines for each triangle whether "splitting" it (by creating duplicate references to the triangle and storing them in its overlapping nodes) lowers the cost of ray/node intersections compared to the "unsplit" case. The result is a very high quality acceleration structure with ray traversal performance which on average is significantly higher than (or in the worst case equal to) a regular SAH BVH.

    - Woop ray/triangle intersection: this algorithm is explained in "Real-time ray tracing of dynamic scenes on an FPGA chip". It basically transforms each triangle in the mesh to a unit triangle with vertices (0, 0, 0), (1, 0, 0) and (0, 1, 0). During rendering, a ray is transformed into "unit triangle space" using a triangle specific affine triangle transformation and intersected with the unit triangle, which is a much simpler computation.

    - Hand optimised GPU ray traversal and intersection kernels:  these kernels use a number of specific tricks to minimise thread divergence within a warp (a warp is a group of 32 SIMD threads which operate in lockstep, i.e. all threads within a warp must execute the same instructions). Thread divergence occurs when one or more threads within a warp follow a different code execution branch, which (in the absolute worst case) could lead to a scenario where only one thread is active while the other 31 threads in the warp are idling, waiting for it to finish. Using "persistent threads" aims to mitigate this problem: when a predefined number of CUDA threads within a warp is idling, the GPU will dynamically fetch new work for these threads in order to increase compute occupancy. The persistent threads feature is used in the original framework. To keep things simple for this tutorial, it has not been implemented as it requires generating and buffering batches of rays, but it is relatively easy to add. Another optimisation to increase SIMD efficiency in a warp is postponing ray/triangle intersection tests until all threads in the same warp have found a leaf node. Robbin Marcus wrote a very informative blogpost about these specific optimisations. In addition to these tricks, the Kepler kernel also uses the GPU's video instructions to perform min/max operations (see "" at the top).
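
    The unit triangle idea behind Woop's intersection test can be sketched as follows. This is a minimal CPU illustration under stated assumptions, not the framework's kernel code: WoopTri, precomputeWoop() and intersectWoop() are hypothetical names, and the triangle is assumed to be non-degenerate. The precomputed affine transform maps vertex A to (0, 0, 0), B to (1, 0, 0) and C to (0, 1, 0), so the triangle lies in the plane z = 0 of "unit triangle space".

```cpp
#include <cassert>
#include <cmath>

struct P3 { float x, y, z; };
static P3 sub(P3 a, P3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static P3 cross(P3 a, P3 b) { return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x }; }
static float dot(P3 a, P3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One row of the 3x4 affine world-to-unit-triangle transform.
struct Row { float x, y, z, w; };
struct WoopTri { Row r0, r1, r2; };

static Row makeRow(P3 v, float inv, P3 a) {
    P3 s = { v.x * inv, v.y * inv, v.z * inv };
    return { s.x, s.y, s.z, -dot(s, a) };         // w holds the translation part
}

// Precompute the transform: invert the matrix with columns (B-A, C-A, N).
WoopTri precomputeWoop(P3 a, P3 b, P3 c) {
    P3 e1 = sub(b, a), e2 = sub(c, a), n = cross(e1, e2);
    float det = dot(e1, cross(e2, n));
    assert(det != 0.0f);                          // degenerate triangles are not handled
    float inv = 1.0f / det;
    return { makeRow(cross(e2, n), inv, a),
             makeRow(cross(n, e1), inv, a),
             makeRow(cross(e1, e2), inv, a) };
}

// Intersect a ray (origin o, direction d) with the precomputed triangle.
// On a hit, t is the distance along the ray and (u, v) are barycentric coordinates.
bool intersectWoop(const WoopTri& w, P3 o, P3 d, float& t, float& u, float& v) {
    float oz = w.r2.x * o.x + w.r2.y * o.y + w.r2.z * o.z + w.r2.w;
    float dz = w.r2.x * d.x + w.r2.y * d.y + w.r2.z * d.z;
    if (dz == 0.0f) return false;                 // ray parallel to the triangle plane
    t = -oz / dz;                                 // where the ray crosses z = 0
    if (t <= 0.0f) return false;
    float ox = w.r0.x * o.x + w.r0.y * o.y + w.r0.z * o.z + w.r0.w;
    float dx = w.r0.x * d.x + w.r0.y * d.y + w.r0.z * d.z;
    u = ox + t * dx;
    if (u < 0.0f) return false;
    float oy = w.r1.x * o.x + w.r1.y * o.y + w.r1.z * o.z + w.r1.w;
    float dy = w.r1.x * d.x + w.r1.y * d.y + w.r1.z * d.z;
    v = oy + t * dy;
    return v >= 0.0f && u + v <= 1.0f;
}
```

    The per-ray work is a handful of dot products and three comparisons, at the price of storing twelve floats per triangle instead of nine.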

    Other new features:
    - a basic OBJ loader which triangulates n-sided faces (n-gons, triangle fans)
    - simple HDR environment map lighting, which for simplicity does not use any filtering (hence the blockiness) or importance sampling yet. The code is based on

    Some renders with the code from this tutorial (the "Roman Settlement" city scene was created by LordGood and converted from a SketchUp model, also used by Mitsuba Render. The HDR maps are available at the HDR Labs website):


    Source code
    The tutorial's source code can be found at

    For clarity, I've tried to simplify the code where possible, keeping the essential improvements provided by the framework and cutting out the unnecessary parts. I have also added clarifying comments to the most difficult code parts where appropriate. There is quite a lot of new code, but the most important and interesting files are:

    - SplitBVHBuilder.cpp contains the algorithm for building BVH with spatial splits
    - CudaBVH.cpp shows the particular layout in which the BVH nodes are stored and Woop's triangle transformation method
    - renderkernel.cu demonstrates two methods of ray/triangle intersection: a regular ray/triangle intersection algorithm similar to the one in GPU path tracing tutorial 3, denoted as DEBUGintersectBVHandTriangles() and a method using Woop's ray/triangle intersection named intersectBVHandTriangles()  

    A downloadable demo (which requires an Nvidia GPU) is available from

    Working with and learning this ray tracing framework was a lot of fun, head scratching and cursing (mostly the latter). It has given me a deeper appreciation for both the intricacies and strengths of GPUs and taught me a multitude of ways to optimise CUDA code to maximise performance (even to the level of assembly/PTX). I recommend anyone who wants to build a GPU renderer to sink their teeth into it (the source code in this tutorial should make it easier to digest the complexities). It keeps astounding me what GPUs are capable of and how much they have evolved in the last decade. 

    The next tutorial(s) will cover direct lighting, physical sky, area lights, textures and instancing.  I've also had a few requests from people who are new to ray tracing for a more thorough explanation of the code from previous tutorials. At some point (when time permits), I hope to create tutorials with illustrations and pseudocode of all the concepts covered.


    This is the first tutorial in a new series of GPU path tracing tutorials which will focus on OpenCL based rendering. The first few tutorials will cover the very basics of getting started with OpenCL and OpenCL based ray tracing and path tracing of simple scenes. Follow-up tutorials will use a cut-down version of AMD's RadeonRays framework (the framework formerly known as FireRays) as a basis to which new features are added in a modular manner. The goal is to incrementally work up to include all the features of RadeonRays, a full-featured GPU path tracer. The RadeonRays source also forms the basis of AMD's Radeon ProRender Technology (which will also be integrated as a native GPU renderer in an upcoming version of Maxon's Cinema4D).  In the end, developers that are new to rendering should be able to code up their own GPU renderer and integrate it into their application. 

    Why OpenCL?

    The major benefit of OpenCL is its platform independence, meaning that the same code can run on CPUs and GPUs made by AMD, Nvidia and Intel (in theory at least, in practice there are quite a few implementation differences between the various platforms). The tutorials in this series should thus run on any PC, regardless of GPU vendor (moreover a GPU is not even required to run the program). 

    Another advantage of OpenCL is that it can use all the available CPUs and GPUs in a system simultaneously to accelerate parallel workloads (such as rendering or physics simulations).

    In order to achieve this flexibility, some boilerplate code is required to select an OpenCL platform (e.g. AMD or Nvidia) and one or more OpenCL devices (CPUs or GPUs). In addition, the OpenCL source must be compiled at runtime (unless the platform and device are known in advance), which adds some initialisation time when the program is first run.

    OpenCL execution model quick overview

    This is a superquick overview of the OpenCL execution model, just enough to get started (there are plenty of more exhaustive sources on OpenCL available on the web). 

    In order to run an OpenCL program, the following structures are required (and are provided by the OpenCL API):
    • Platform: which vendor (AMD/Nvidia/Intel)
    • Device: CPU, GPU, APU or integrated GPU
    • Context: the runtime interface between the host (CPU) and device (GPU or CPU) which manages all the OpenCL resources (programs, kernels, command queue, buffers). It receives and distributes kernels and transfers data.
    • Program: the entire OpenCL program (one or more kernels and device functions)
    • Kernel: the starting point into the OpenCL program, analogous to the main() function in a CPU program. Kernels are called from the host (CPU). They represent the basic units of executable code that run on an OpenCL device and are preceded by the keyword "__kernel"
    • Command queue: the command queue allows kernel execution commands to be sent to the device (execution can be in-order or out-of-order)
    • Memory objects: buffers and images
    These structures are summarised in the diagram below (slide from AMD's Introduction to OpenCL programming):

    OpenCL execution model

    OpenCL memory model quick overview

    The full details of the memory model are beyond the scope of this first tutorial, but we'll cover the basics here to get some understanding on how a kernel is executed on the device. 

    There are four levels of memory on an OpenCL device, forming a memory hierarchy (from large and slow to tiny and fast memory):
    • Global memory (similar to RAM): the largest but also slowest form of memory, can be read and written to by all work items (threads) and all work groups on the device and can also be read/written by the host (CPU).
    • Constant memory: a small chunk of global memory on the device, can be read by all work items on the device (but not written to) and can be read/written by the host. Constant memory is slightly faster than global memory.
    • Local memory (similar to cache memory on the CPU): memory shared among work items in the same work group (work items executing together on the same compute unit are grouped into work groups). Local memory allows work items belonging to the same work group to share results. Local memory is much faster than global memory (up to 100x).
    • Private memory (similar to registers on the CPU): the fastest type of memory. Each work item (thread) has a tiny amount of private memory to store intermediate results that can only be used  by that work item

    First OpenCL program

    With the obligatory theory out of the way, it's time to dive into the code. To get used to the OpenCL syntax, this first program will be very simple (nothing earth shattering yet): the code will just add the corresponding elements of two floating point number arrays together in parallel (all at once).

    In a nutshell, what happens is the following:
    1. Initialise the OpenCL computing environment: create a platform, device, context, command queue, program and kernel and set up the kernel arguments
    2. Create two floating point number arrays on the host side and copy them to the OpenCL device
    3. Make OpenCL perform the computation in parallel (by determining global and local worksizes and launching the kernel)
    4. Copy the results of the computation from the device to the host
    5. Print the results to the console
    To keep the code simple and readable, there is minimal error checking, the "cl" namespace is used for the OpenCL structures and the OpenCL kernel source is provided as a string in the CPU code. 

    The code contains plenty of comments to clarify the new syntax:

    // Getting started with OpenCL tutorial 
    // by Sam Lapere, 2016,
    // Code based on

    #include <iostream>
    #include <vector>
    #include <CL/cl.hpp> // main OpenCL include file

    using namespace cl;
    using namespace std;

    int main()
    {
    // Find all available OpenCL platforms (e.g. AMD, Nvidia, Intel)
    vector<Platform> platforms;
    Platform::get(&platforms);

    // Show the names of all available OpenCL platforms
    cout << "Available OpenCL platforms: \n\n";
    for (unsigned int i = 0; i < platforms.size(); i++)
        cout << "\t" << i + 1 << ": " << platforms[i].getInfo<CL_PLATFORM_NAME>() << endl;

    // Choose and create an OpenCL platform
    cout << endl << "Enter the number of the OpenCL platform you want to use: ";
    unsigned int input = 0;
    cin >> input;
    // Handle incorrect user input
    while (input < 1 || input > platforms.size()){
        cin.clear(); //clear errors/bad flags on cin
        cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
        cout << "No such platform." << endl << "Enter the number of the OpenCL platform you want to use: ";
        cin >> input;
    }

    Platform platform = platforms[input - 1];

    // Print the name of chosen OpenCL platform
    cout << "Using OpenCL platform: \t"<< platform.getInfo<CL_PLATFORM_NAME>() << endl;

    // Find all available OpenCL devices (e.g. CPU, GPU or integrated GPU)
    vector<Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

    // Print the names of all available OpenCL devices on the chosen platform
    cout << "Available OpenCL devices on this platform: "<< endl << endl;
    for (unsigned int i = 0; i < devices.size(); i++)
        cout << "\t" << i + 1 << ": " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;

    // Choose an OpenCL device
    cout << endl << "Enter the number of the OpenCL device you want to use: ";
    input = 0;
    cin >> input;
    // Handle incorrect user input
    while (input < 1 || input > devices.size()){
        cin.clear(); //clear errors/bad flags on cin
        cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
        cout << "No such device. Enter the number of the OpenCL device you want to use: ";
        cin >> input;
    }

    Device device = devices[input - 1];

    // Print the name of the chosen OpenCL device
    cout << endl << "Using OpenCL device: \t"<< device.getInfo<CL_DEVICE_NAME>() << endl << endl;

    // Create an OpenCL context on that device.
    // the context manages all the OpenCL resources
    Context context = Context(device);


    // the OpenCL kernel in this tutorial is a simple program that adds two float arrays in parallel
    // the source code of the OpenCL kernel is passed as a string to the host
    // the "__global" keyword denotes that "global" device memory is used, which can be read and written
    // to by all work items (threads) and all work groups on the device and can also be read/written by the host (CPU)

    const char* source_string =
        " __kernel void parallel_add(__global float* x, __global float* y, __global float* z){ "
        " const int i = get_global_id(0); " // get a unique number identifying the work item in the global pool
        " z[i] = y[i] + x[i]; " // add two arrays
        "}";

    // Create an OpenCL program by performing runtime source compilation
    Program program = Program(context, source_string);

    // Build the program and check for compilation errors
    cl_int result = program.build({ device }, "");
    if (result) cout << "Error during compilation! ("<< result << ")"<< endl;

    // Create a kernel (entry point in the OpenCL source program)
    // kernels are the basic units of executable code that run on the OpenCL device
    // the kernel forms the starting point into the OpenCL program, analogous to main() in CPU code
    // kernels can be called from the host (CPU)
    Kernel kernel = Kernel(program, "parallel_add");

    // Create input data arrays on the host (= CPU)
    const int numElements = 10;
    float cpuArrayA[numElements] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f };
    float cpuArrayB[numElements] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f };
    float cpuOutput[numElements] = {}; // empty array for storing the results of the OpenCL program

    // Create buffers (memory objects) on the OpenCL device, allocate memory and copy input data to device.
    // Flags indicate how the buffer should be used e.g. read-only, write-only, read-write
    Buffer clBufferA = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, numElements * sizeof(cl_float), cpuArrayA);
    Buffer clBufferB = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, numElements * sizeof(cl_float), cpuArrayB);
    Buffer clOutput = Buffer(context, CL_MEM_WRITE_ONLY, numElements * sizeof(cl_float), NULL);

    // Specify the arguments for the OpenCL kernel
    // (the arguments are __global float* x, __global float* y and __global float* z)
    kernel.setArg(0, clBufferA); // first argument
    kernel.setArg(1, clBufferB); // second argument
    kernel.setArg(2, clOutput); // third argument

    // Create a command queue for the OpenCL device
    // the command queue allows kernel execution commands to be sent to the device
    CommandQueue queue = CommandQueue(context, device);

    // Determine the global and local number of "work items"
    // The global work size is the total number of work items (threads) that execute in parallel
    // Work items executing together on the same compute unit are grouped into "work groups"
    // The local work size defines the number of work items in each work group
    // Important: global_work_size must be an integer multiple of local_work_size
    std::size_t global_work_size = numElements;
    std::size_t local_work_size = 10; // could also be 1, 2 or 5 in this example
    // when local_work_size equals 10, all ten number pairs from both arrays will be added together in one go

    // Launch the kernel and specify the global and local number of work items (threads)
    queue.enqueueNDRangeKernel(kernel, NULL, global_work_size, local_work_size);

    // Read and copy OpenCL output to CPU
    // the "CL_TRUE" flag blocks the read operation until all work items have finished their computation
    queue.enqueueReadBuffer(clOutput, CL_TRUE, 0, numElements * sizeof(cl_float), cpuOutput);

    // Print results to console
    for (int i = 0; i < numElements; i++)
    cout << cpuArrayA[i] << " + "<< cpuArrayB[i] << " = "<< cpuOutput[i] << endl;
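
The rule that the global work size must be an integer multiple of the local work size is trivially met here (10 elements, work group of 10), but it matters as soon as the element count is arbitrary. A minimal host-side sketch, assuming a hypothetical helper called round_up_to_multiple (not part of the OpenCL API), pads the global size up to the next multiple of the work-group size. The kernel would then need a bounds check such as "if (i < numElements)" so that the padding work items do not touch the buffers:

```cpp
#include <cstddef>

// Hypothetical helper (not an OpenCL function): round the requested number of
// work items up to the next multiple of the work-group size.
std::size_t round_up_to_multiple(std::size_t count, std::size_t group_size) {
    if (group_size == 0) return count;   // nothing sensible to round to
    std::size_t remainder = count % group_size;
    return remainder == 0 ? count : count + (group_size - remainder);
}
```

For example, 1000 elements with a work group of 64 would be launched with a global work size of 1024.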


    Compiling instructions (for Visual Studio on Windows)

    To compile this code, it's recommended to download and install the AMD App SDK (this works for systems with GPUs or CPUs from AMD, Nvidia and Intel, even if your system doesn't have an AMD CPU or GPU installed) since Nvidia's OpenCL implementation is no longer up-to-date.
      1. Start an empty Console project in Visual Studio (any recent version should work, including Express and Community) and set to Release mode 
      2. Add the SDK include path to the "Additional Include Directories" (e.g. "C:\Program Files (x86)\AMD APP SDK\2.9-1\include") 
      3. In Linker > Input, add "opencl.lib" to "Additional Dependencies" and add the OpenCL lib path to "Additional Library Directories"  (e.g. "C:\Program Files (x86)\AMD APP SDK\2.9-1\lib\x86")
      4. Add the main.cpp file (or create a new file and paste the code) and build the project

        Download binaries

        The executable (Windows only) for this tutorial is available at

        It runs on CPUs and/or GPUs from AMD, Nvidia and Intel.

        Useful References

        - "A gentle introduction to OpenCL": 

        - "Simple start with OpenCL": 

        - Anteru's blogpost, Getting started with OpenCL (uses old OpenCL API)
        - AMD introduction to OpenCL programming:

        Up next

        In the next tutorial we'll start rendering an image with OpenCL.


        This tutorial consists of two parts: the first part will describe how to ray trace one sphere using OpenCL, while the second part covers path tracing of a scene made of spheres. The tutorial will be light on ray tracing/path tracing theory (there are plenty of excellent resources available online such as Scratch-a-Pixel) and will focus instead on the practical implementation of rendering algorithms in OpenCL. The end result will be a rendered image featuring realistic light effects such as indirect lighting, diffuse colour bleeding and soft shadows, all achieved with just a few lines of code:

        Part 1: Ray tracing a sphere

        Computing a test image on the OpenCL device

        The host (CPU) sets up the OpenCL environment and launches the OpenCL kernel which will be executed on the OpenCL device (GPU or CPU) in parallel. Each work item (or thread) on the device will calculate one pixel of the image. There will thus be as many work items in the global pool as there are pixels in the image. Each work item has a unique ID which distinguishes it from all other work items in the global pool of threads and which is obtained with get_global_id(0).

        The x- and y-coordinates of each pixel can be computed from that pixel's unique work item ID:
        • x-coordinate: divide the ID by the image width and take the remainder
        • y-coordinate: divide the ID by the image width (integer division)
        Remapping the x- and y-coordinates from the ranges [0, width] and [0, height] to the range [0, 1], and plugging those values into the red and green channels respectively, yields the following gradient image (the image is saved in PPM format, which can be opened with e.g. IrfanView or GIMP):

        The OpenCL code to generate this image:

        __kernel void render_kernel(__global float3* output, int width, int height)
        {
            const int work_item_id = get_global_id(0); /* the unique global id of the work item for the current pixel */
            int x = work_item_id % width; /* x-coordinate of the pixel */
            int y = work_item_id / width; /* y-coordinate of the pixel */
            float fx = (float)x / (float)width; /* convert int to float in range [0-1] */
            float fy = (float)y / (float)height; /* convert int to float in range [0-1] */
            output[work_item_id] = (float3)(fx, fy, 0); /* simple interpolated colour gradient based on pixel coordinates */
        }

        Now let's use the OpenCL device for some ray tracing.

        Ray tracing a sphere with OpenCL

        We first define a Ray and a Sphere struct in the OpenCL code:

        A Ray has 
        • an origin in 3D space (3 floats for x, y, z coordinates) 
        • a direction in 3D space (3 floats for the x, y, z coordinates of the 3D vector)
        A Sphere has 
        • a radius
        • a position in 3D space (3 floats for x, y, z coordinates), 
        • an object colour (3 floats for the Red, Green and Blue channel) 
        • an emission colour (again 3 floats for each of the RGB channels)

        struct Ray{
            float3 origin;
            float3 dir;
        };

        struct Sphere{
            float radius;
            float3 pos;
            float3 emi;
            float3 color;
        };

        Camera ray generation

        Rays are shot from the camera (which is in a fixed position for this tutorial) through an imaginary grid of pixels into the scene, where they intersect with 3D objects (in this case spheres). For each pixel in the image, we will generate one camera ray (also called primary rays, view rays or eye rays) and follow or trace it into the scene. For camera rays, the ray origin is the camera position and the ray direction is the vector connecting the camera and the pixel on the screen.

        Source: Wikipedia

        The OpenCL code for generating a camera ray:

        struct Ray createCamRay(const int x_coord, const int y_coord, const int width, const int height){

            float fx = (float)x_coord / (float)width; /* convert int in range [0 - width] to float in range [0-1] */
            float fy = (float)y_coord / (float)height; /* convert int in range [0 - height] to float in range [0-1] */

            /* calculate aspect ratio */
            float aspect_ratio = (float)(width) / (float)(height);
            float fx2 = (fx - 0.5f) * aspect_ratio;
            float fy2 = fy - 0.5f;

            /* determine position of pixel on screen */
            float3 pixel_pos = (float3)(fx2, -fy2, 0.0f);

            /* create camera ray*/
            struct Ray ray;
            ray.origin = (float3)(0.0f, 0.0f, 40.0f); /* fixed camera position */
            ray.dir = normalize(pixel_pos - ray.origin);

            return ray;
        }

        Ray-sphere intersection

        To find the intersection of a ray with a sphere, we need the parametric equation of a line, in which the parameter "t" denotes the distance from the ray origin O to a point P along the (normalised) ray direction D:

        $P = O + tD$

        The equation of a sphere follows from the Pythagorean theorem in 3D (all points P on the surface of a sphere are located at a distance of radius r from its center C):

        $(P - C) \cdot (P - C) = r^2$

        Substituting the ray equation into the sphere equation gives

        $(O + tD - C) \cdot (O + tD - C) = r^2$

        and expanding this yields a quadratic equation of the form $at^2 + bt + c = 0$, where

        • $a = D \cdot D$
        • $b = 2 \, D \cdot (O - C)$
        • $c = (O - C) \cdot (O - C) - r^2$

        The solutions for t (the distance to the point where the ray intersects the sphere) are given by the quadratic formula $t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$, where $b^2 - 4ac$ is called the discriminant.

        There can be zero (ray misses sphere), one (ray grazes sphere at one point) or two solutions (ray fully intersects sphere at two points). The distance t can be positive (intersection in front of ray origin) or negative (intersection behind ray origin). The details of the mathematical derivation are explained in this Scratch-a-Pixel article.

        The ray-sphere intersection algorithm is optimised by omitting the "a" coefficient in the quadratic formula, because its value is the dot product of the normalised ray direction with itself, which equals 1. The code below also works with the vector from the ray origin to the sphere center (C - O), so its variable b holds D . (C - O), which is -b/2 in terms of the formula above. The discriminant then simplifies to b*b - c and the two solutions to b - sqrt(disc) and b + sqrt(disc). The square root (an expensive function) is only taken when the discriminant is non-negative; a negative discriminant means the ray misses the sphere.

        bool intersect_sphere(const struct Sphere* sphere, const struct Ray* ray, float* t)
        {
            float3 rayToCenter = sphere->pos - ray->origin;

            /* calculate coefficients a, b, c from quadratic equation */

            /* float a = dot(ray->dir, ray->dir); // ray direction is normalised, dotproduct simplifies to 1 */
            float b = dot(rayToCenter, ray->dir);
            float c = dot(rayToCenter, rayToCenter) - sphere->radius*sphere->radius;
            float disc = b * b - c; /* discriminant of quadratic formula */

            /* solve for t (distance to hitpoint along ray) */

            if (disc < 0.0f) return false;
            else *t = b - sqrt(disc);

            /* nearest solution lies behind the ray origin: try the far one */
            if (*t < 0.0f){
                *t = b + sqrt(disc);
                if (*t < 0.0f) return false;
            }

            return true;
        }
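
The intersection routine above can be checked on the CPU with a small worked example. The sketch below mirrors it in plain C++, using a hypothetical Vec3 struct and helper functions in place of OpenCL's built-in float3, dot and vector subtraction (these names are assumptions for illustration). For the tutorial's camera at (0, 0, 40) looking straight down the negative z-axis at a sphere of radius 0.4 centred at (0, 0, 3), b = 37 and c = 37*37 - 0.4*0.4, so the discriminant is 0.16 and the nearest hit lies at t = 37 - 0.4 = 36.6:

```cpp
#include <cmath>

// Hypothetical stand-in for OpenCL's float3.
struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// CPU mirror of the kernel-side routine; dir is assumed to be normalised.
bool intersect_sphere_cpu(Vec3 center, float radius, Vec3 origin, Vec3 dir, float* t) {
    Vec3 rayToCenter = sub(center, origin);
    float b = dot(rayToCenter, dir);
    float c = dot(rayToCenter, rayToCenter) - radius * radius;
    float disc = b * b - c;              // simplified discriminant (a = 1)
    if (disc < 0.0f) return false;       // ray misses the sphere
    float root = std::sqrt(disc);
    *t = b - root;                       // nearest intersection
    if (*t < 0.0f) *t = b + root;        // origin inside the sphere: use far one
    return *t >= 0.0f;                   // both behind the origin counts as a miss
}
```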

        Scene initialisation

        For simplicity, in this first part of the tutorial the scene will be initialised on the device in the kernel function (in the second part the scene will be initialised on the host and passed to OpenCL which is more flexible and memory efficient, but also requires to be more careful with regards to memory alignment and the use of memory address spaces). Every work item will thus have a local copy of the scene (in this case one sphere).

        __kernel void render_kernel(__global float3* output, int width, int height)
        {
            const int work_item_id = get_global_id(0); /* the unique global id of the work item for the current pixel */
            int x_coord = work_item_id % width; /* x-coordinate of the pixel */
            int y_coord = work_item_id / width; /* y-coordinate of the pixel */

            float fy = (float)y_coord / (float)height; /* convert int to float in range [0-1], used for the background gradient */

            /* create a camera ray */
            struct Ray camray = createCamRay(x_coord, y_coord, width, height);

            /* create and initialise a sphere */
            struct Sphere sphere1;
            sphere1.radius = 0.4f;
            sphere1.pos = (float3)(0.0f, 0.0f, 3.0f);
            sphere1.color = (float3)(0.9f, 0.3f, 0.0f);

            /* intersect ray with sphere */
            float t = 1e20;
            intersect_sphere(&sphere1, &camray, &t);

            /* if ray misses sphere, return background colour
            background colour is a blue-ish gradient dependent on image height */
            if (t > 1e19){
                output[work_item_id] = (float3)(fy * 0.1f, fy * 0.3f, 0.3f);
                return;
            }

            /* if ray hits the sphere, it will return the sphere colour*/
            output[work_item_id] = sphere1.color;
        }

        Running the ray tracer 

        Now we've got everything we need to start ray tracing! Let's begin with a plain colour sphere. When the ray misses the sphere, the background colour is returned:

        A more interesting sphere with cosine-weighted colours, giving the impression of front lighting.

        To achieve this effect we need to calculate the angle between the ray hitting the sphere surface and the normal at that point. The sphere normal at a specific intersection point on the surface is just the normalised vector (with unit length) going from the sphere center to that intersection point.

        float3 hitpoint = camray.origin + camray.dir * t;
        float3 normal = normalize(hitpoint - sphere1.pos);
        float cosine_factor = dot(normal, camray.dir) * -1.0f;

        output[work_item_id] = sphere1.color * cosine_factor;

        Adding some stripe pattern by multiplying the colour with the sine of the height:

        Screen-door effect using sine functions for both x and y-directions

        Showing the surface normals (calculated in the code snippet above) as colours:
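
The post does not list the code for this last image. One common way to display a normal as a colour, shown here as a sketch and not necessarily the mapping used for the picture, is to remap each component of the unit normal from [-1, 1] to [0, 1]. Vec3 is a hypothetical stand-in for OpenCL's float3 and normal_to_color is a name chosen for illustration:

```cpp
// Hypothetical stand-in for OpenCL's float3.
struct Vec3 { float x, y, z; };

// Map a unit normal with components in [-1, 1] to an RGB colour in [0, 1].
Vec3 normal_to_color(Vec3 n) {
    return { n.x * 0.5f + 0.5f, n.y * 0.5f + 0.5f, n.z * 0.5f + 0.5f };
}
```

With this mapping a normal pointing straight up (0, 1, 0) becomes (0.5, 1.0, 0.5), a light green.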

        Source code

        Download demo (works on AMD, Nvidia and Intel)

        The executable demo will render the above images.

        Part 2: Path tracing spheres

        Very quick overview of ray tracing and path tracing

        The following section covers the background of the ray tracing process in a very simplified way, but should be sufficient to understand the code in this tutorial. Scratch-a-Pixel provides a much more detailed explanation of ray tracing.  

        Ray tracing is a general term that encompasses ray casting, Whitted ray tracing, distribution ray tracing and path tracing. So far, we have only traced rays from the camera (so called "camera rays", "eye rays" or "primary rays") into the scene, a process called ray casting, resulting in plainly coloured images with no lighting. In order to achieve effects like shadows and reflections, new rays must be generated at the points where the camera rays intersect with the scene. These secondary rays can be shadow rays, reflection rays, transmission rays (for refractions), ambient occlusion rays or diffuse interreflection rays (for indirect lighting/global illumination). For example, shadow rays used for direct lighting are generated to point directly towards a light source while reflection rays are pointed in (or near) the direction of the reflection vector. For now we will skip direct lighting to generate shadows and go straight to path tracing, which is strangely enough easier to code, creates more realistic and prettier pictures and is just more fun.

        In (plain) path tracing, rays are shot from the camera and bounce off the surface of scene objects in a random direction (like a high-energy bouncing ball), forming a chain of random rays connected together into a path. If the path hits a light emitting object such as a light source, it will return a colour which depends on the surface colours of all the objects encountered so far along the path, the colour of the light emitters, the angles at which the path hit a surface and the angles at which the path bounced off a surface. These ideas form the essence of the "rendering equation", proposed in a paper with the same name by Jim Kajiya in 1986.

        Since the directions of the rays in a path are generated randomly, some paths will hit a light source while others won't, resulting in noise ("variance" in statistics due to random sampling). The noise can be reduced by shooting many random paths per pixel (= taking many samples) and averaging the results.
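
The effect of averaging can be illustrated without any rendering code. The sketch below is a toy model and not part of the tutorial's path tracer: it assumes that each path reaches a light with a fixed probability and then returns a brightness of 1, and returns 0 otherwise. The function name estimate_pixel and its parameters are made up for this illustration:

```cpp
#include <random>

// Toy model (an assumption for illustration): a path returns 1.0 if it happens
// to reach a light, which occurs with probability hit_probability, else 0.0.
// The pixel value is the average over samples_per_pixel such paths.
float estimate_pixel(int samples_per_pixel, float hit_probability, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution reaches_light(hit_probability);
    float sum = 0.0f;
    for (int s = 0; s < samples_per_pixel; ++s)
        sum += reaches_light(rng) ? 1.0f : 0.0f;
    return sum / static_cast<float>(samples_per_pixel);
}
```

With a single sample the pixel is either fully black or fully white, which is exactly the speckled noise seen in low-sample renders. The statistical error of the average shrinks with the square root of the number of samples, so halving the noise takes four times as many paths.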

        Implementation of (plain) path tracing in OpenCL       

        The code for the path tracer is based on smallpt from Kevin Beason and is largely the same as the ray tracer code from part 1 of this tutorial, with some important differences on the host side:

        - the scene is initialised on the host (CPU) side, which requires a host version of the Sphere struct. Correct memory alignment in the host struct is very important to avoid shifting of values and wrongly initialised variables in the OpenCL struct, especially when  using OpenCL's built-in data types such as float3 and float4. If necessary, the struct should be padded with dummy variables to ensure memory alignment (the total size of the struct must be a multiple of the size of float4).

        struct Sphere
        {
            cl_float radius;
            cl_float dummy1;
            cl_float dummy2;
            cl_float dummy3;
            cl_float3 position;
            cl_float3 color;
            cl_float3 emission;
        };
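
The layout can be verified at compile time. The sketch below avoids the OpenCL headers by using a hypothetical Float3 type that imitates the host-side cl_float3, which occupies 16 bytes and is 16-byte aligned (the same storage as a cl_float4). The static_assert checks are an addition for illustration, not part of the tutorial's code:

```cpp
#include <cstddef>

// Stand-in for cl_float3 (an assumption so the sketch compiles without the
// OpenCL headers): 16 bytes in size, 16-byte aligned.
struct alignas(16) Float3 { float x, y, z, w; };

struct Sphere {
    float radius;
    float dummy1;   // three padding floats push 'position' to a 16-byte boundary
    float dummy2;
    float dummy3;
    Float3 position;
    Float3 color;
    Float3 emission;
};

// Fail the build if the layout ever drifts from what the device expects.
static_assert(sizeof(Float3) == 16, "Float3 must be 16 bytes");
static_assert(offsetof(Sphere, position) == 16, "position must start at byte 16");
static_assert(sizeof(Sphere) % sizeof(Float3) == 0, "struct size must be a multiple of 16 bytes");
```

In the real host code the same checks can be written against cl_float3 directly.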

        - the scene (an array of spheres) is copied from the host to the OpenCL device into global memory (using CL_MEM_READ_WRITE) or constant memory (using CL_MEM_READ_ONLY):

        // initialise scene
        const int sphere_count = 9;
        Sphere cpu_spheres[sphere_count];

        // Create buffers on the OpenCL device for the image and the scene
        cl_output = Buffer(context, CL_MEM_WRITE_ONLY, image_width * image_height * sizeof(cl_float3));
        cl_spheres = Buffer(context, CL_MEM_READ_ONLY, sphere_count * sizeof(Sphere));
        queue.enqueueWriteBuffer(cl_spheres, CL_TRUE, 0, sphere_count * sizeof(Sphere), cpu_spheres);

        - explicit memory management: once the scene is on the device, its pointer can be passed on to other device functions preceded by the keyword "__global" or "__constant".

        - the host code automatically determines the local size of the kernel work group (the number of work items or "threads" per work group) by calling the OpenCL function kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device)

        The actual path tracing function

        - iterative path tracing function: since OpenCL does not support recursion, the trace() function traces paths iteratively (instead of recursively) using a loop with a fixed number of bounces (iterations), representing path depth.

        - each path starts off with an "accumulated colour" initialised to black and a "mask colour" initialised to pure white. The mask colour "collects" surface colours along its path by multiplication. The accumulated colour accumulates light from emitters along its path by adding emitted colours multiplied by the mask colour.

        - generating random ray directions: new rays start at the hitpoint and get shot in a random direction by sampling a random point on the hemisphere above the surface hitpoint. For each new ray, a local orthogonal uvw-coordinate system and two random numbers are generated: one to pick a random value on the horizon for the azimuth, the other for the altitude (with the zenith being the highest point)

        - diffuse materials: the code for this tutorial only supports diffuse materials, which reflect incident light almost uniformly in all directions (in the hemisphere above the hitpoint)

        - cosine-weighted importance sampling: because diffuse light reflection is not truly uniform, the light contribution from rays that are pointing away from the surface plane and closer to the surface normal is greater. Cosine-weighted importance sampling favours rays that are pointing away from the surface plane by multiplying their colour with the cosine of the angle between the surface normal and the ray direction.

        - while ray tracing can get away with tracing only one ray per pixel to render a good image (more are needed for anti-aliasing and blurry effects like depth-of-field and glossy reflections), the inherently noisy nature of path tracing requires tracing of many paths per pixel (samples per pixel) and averaging the results to reduce noise to an acceptable level.

        float3 trace(__constant Sphere* spheres, const Ray* camray, const int sphere_count, const int* seed0, const int* seed1){

            Ray ray = *camray;

            float3 accum_color = (float3)(0.0f, 0.0f, 0.0f);
            float3 mask = (float3)(1.0f, 1.0f, 1.0f);

            for (int bounces = 0; bounces < 8; bounces++){

                float t; /* distance to intersection */
                int hitsphere_id = 0; /* index of intersected sphere */

                /* if ray misses scene, return background colour */
                if (!intersect_scene(spheres, &ray, &t, &hitsphere_id, sphere_count))
                    return accum_color += mask * (float3)(0.15f, 0.15f, 0.25f);

                /* else, we've got a hit! Fetch the closest hit sphere */
                Sphere hitsphere = spheres[hitsphere_id]; /* version with local copy of sphere */

                /* compute the hitpoint using the ray equation */
                float3 hitpoint = ray.origin + ray.dir * t;

                /* compute the surface normal and flip it if necessary to face the incoming ray */
                float3 normal = normalize(hitpoint - hitsphere.pos);
                float3 normal_facing = dot(normal, ray.dir) < 0.0f ? normal : normal * (-1.0f);

                /* compute two random numbers to pick a random point on the hemisphere above the hitpoint*/
                float rand1 = 2.0f * PI * get_random(seed0, seed1);
                float rand2 = get_random(seed0, seed1);
                float rand2s = sqrt(rand2);

                /* create a local orthogonal coordinate frame centered at the hitpoint */
                float3 w = normal_facing;
                float3 axis = fabs(w.x) > 0.1f ? (float3)(0.0f, 1.0f, 0.0f) : (float3)(1.0f, 0.0f, 0.0f);
                float3 u = normalize(cross(axis, w));
                float3 v = cross(w, u);

                /* use the coordinate frame and random numbers to compute the next ray direction */
                float3 newdir = normalize(u * cos(rand1)*rand2s + v*sin(rand1)*rand2s + w*sqrt(1.0f - rand2));

                /* add a very small offset to the hitpoint to prevent self intersection */
                ray.origin = hitpoint + normal_facing * EPSILON;
                ray.dir = newdir;

                /* add the colour and light contributions to the accumulated colour */
                accum_color += mask * hitsphere.emission;

                /* the mask colour picks up surface colours at each bounce */
                mask *= hitsphere.color;

                /* perform cosine-weighted importance sampling for diffuse surfaces*/
                mask *= dot(newdir, normal_facing);
            }

            return accum_color;
        }
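
The direction-sampling step in the middle of the loop can be isolated and tested on the CPU. The sketch below reproduces it in plain C++ with the two random numbers passed in as arguments; Vec3 and the helper functions are hypothetical stand-ins for OpenCL's float3 and its built-ins, and sample_hemisphere is a name chosen for illustration. Because the second random number enters through a square root, the generated directions cluster around the normal in proportion to the cosine of the angle to it:

```cpp
#include <cmath>

// Hypothetical stand-ins for OpenCL's float3 and its built-in functions.
struct Vec3 { float x, y, z; };

Vec3 add(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
Vec3 scale(Vec3 a, float s) { return { a.x * s, a.y * s, a.z * s }; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
Vec3 normalize(Vec3 a) { return scale(a, 1.0f / std::sqrt(dot(a, a))); }

// n: unit normal facing the incoming ray; r1, r2: uniform random numbers in [0, 1).
Vec3 sample_hemisphere(Vec3 n, float r1, float r2) {
    const float PI = 3.14159265358979f;
    float phi = 2.0f * PI * r1;   // azimuth around the normal
    float r2s = std::sqrt(r2);    // distance from the normal axis

    // local orthonormal frame (u, v, w) with w along the normal
    Vec3 w = n;
    Vec3 axis = std::fabs(w.x) > 0.1f ? Vec3{ 0.0f, 1.0f, 0.0f } : Vec3{ 1.0f, 0.0f, 0.0f };
    Vec3 u = normalize(cross(axis, w));
    Vec3 v = cross(w, u);

    Vec3 d = add(add(scale(u, std::cos(phi) * r2s), scale(v, std::sin(phi) * r2s)),
                 scale(w, std::sqrt(1.0f - r2)));
    return normalize(d);
}
```

The cosine of the angle between the new direction and the normal equals sqrt(1 - r2), so r2 = 0 returns the normal itself and values of r2 close to 1 give directions that graze the surface.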

        A screenshot made with the code above (also see the screenshot at the top of this post). Notice the colour bleeding (bounced colour reflected from the floor onto the spheres), soft shadows and lighting coming from the background.

        Source code

        Downloadable demo (for AMD, Nvidia and Intel platforms, Windows only)

        Useful resources

        - Scratch-a-pixel is an excellent free online resource to learn about the theory behind ray tracing and path tracing. Many code samples (in C++) are also provided. This article gives a great introduction to global illumination and path tracing.

        - smallpt by Kevin Beason is a great little CPU path tracer in 100 lines of code. It formed the inspiration for the Cornell box scene and for many parts of the OpenCL code.

        Up next

        The next tutorial will cover the implementation of an interactive OpenGL viewport with a progressively refining image and an interactive camera with anti-aliasing and depth-of-field.


        I'm working for an international company with very large (<Trump voice>"YUUUUUGE"<\Trump voice>) industry partners.

        We are currently looking for excellent developers with experience in GPU rendering (path tracing) for a new project.

        Our ideal candidates have either a:
        • Bachelor in Computer Science, Computer/Software Engineering or Physics with a minimum of 2 years of work experience in a relevant field, or
        • Master in Computer Science, Computer/Software Engineering or Physics, or
        • PhD in a relevant field
        and a strong interest in physically based rendering and ray tracing.

        Self-taught programmers are encouraged to apply if they meet the following requirements:
        • you breathe rendering and have Monte Carlo simulations running through your blood
        • you have a copy of PBRT (, version 3 was released just last week) on your bedside table
        • provable experience working with open source rendering frameworks such as PBRT, LuxRender, Cycles, AMD RadeonRays or with a commercial renderer will earn you extra brownie points
        • 5+ years of experience with C++
        • experience with CUDA or OpenCL
        • experience with version control systems and working on large projects
        • proven rendering track record (publications, Github projects, blog)

        Other requirements:
        • insatiable hunger to innovate
        • a "can do" attitude
        • strong work ethic and focus on results
        • continuous self-learner
        • work well in a team
        • work independently and able to take direction
        • ability to communicate effectively
        • comfortable speaking English
        • own initiatives and original ideas are highly encouraged
        • willing to relocate to New Zealand

        What we offer:
        • unique location in one of the most beautiful and greenest countries in the world
        • be part of a small, high-performance team 
        • competitive salary
        • jandals, marmite and hokey pokey ice cream

        For more information, contact me at

        If you are interested, send your CV and cover letter to Applications will close on 16 December or when we find the right people. (update: spots are filling up quickly so we advanced the closing date with five days)


        Just a link to the source code on Github for now, I'll update this post with a more detailed description when I find a bit more time:

        Part 1: Setting up an OpenGL window

        Part 2: Adding an interactive camera, depth of field and progressive rendering

        Thanks to Erich Loftis and Brandon Miles for useful tips on improving the generation of random numbers in OpenCL to avoid the distracting artefacts (showing up as a sawtooth pattern) when using defocus blur (still not perfect but much better than before).

        The next tutorial will cover rendering of triangles and triangle meshes.

      • 03/20/17--23:39: Virtual reality


        This week Google announced "Seurat", a novel surface lightfield rendering technology which would enable "real-time cinema-quality, photorealistic graphics" on mobile VR devices, developed in collaboration with ILMxLab:

        The technology captures all light rays in a scene by pre-rendering it from many different viewpoints. During runtime, entirely new viewpoints are created by interpolating those viewpoints on-the-fly resulting in photoreal reflections and lighting in real-time (

        At almost the same time, Disney released a paper called "Real-time rendering with compressed animated light fields", demonstrating the feasibility of rendering a Pixar quality 3D movie in real-time where the viewer can actually be part of the scene and walk in between scene elements or characters (according to a predetermined camera path):

        Light field rendering in itself is not a new technique and has actually been around for more than 20 years, but has only recently become a viable rendering technique. The first paper was released at Siggraph 1996 ("Light field rendering" by Marc Levoy and Pat Hanrahan) and the method has since been incrementally improved by others. Stanford University compiled an entire archive of light fields to accompany the Siggraph paper from 1996 which can be found at A more up-to-date archive of photography-based light fields can be found at

        One of the first movies that showed a practical use for light fields is The Matrix from 1999, where an array of cameras firing at the same time (or in rapid succession) made it possible to pan around an actor to create a super slow motion effect ("bullet time"):

        Bullet time in The Matrix (1999)

        Rendering the light field

        Instead of attempting to explain the theory behind light fields (for which there are plenty of excellent online sources), the main focus of this post is to show how to quickly get started with rendering a synthetic light field using Blender Cycles and some open-source plug-ins. If you're interested in a crash course on light fields, check out Joan Charmant's video tutorial below, which explains the basics of implementing a light field renderer:

        The following video demonstrates light fields rendered with Cycles:

        Rendering a light field is actually surprisingly easy with Blender's Cycles and doesn't require much technical expertise (besides knowing how to build the plugins). For this tutorial, we'll use a couple of open source plug-ins:

        1) The first one is the light field camera grid add-on for Blender made by Katrin Honauer and Ole Johannsen from Heidelberg University in Germany: 

        This plug-in sets up a camera grid in Blender and renders the scene from each camera using the Cycles path tracing engine. Good results can be obtained with a grid of 17 by 17 cameras with a distance of 10 cm between neighbouring cameras. For high quality, a 33-by-33 camera grid with an inter-camera distance of 5 cm is recommended.

        3-by-3 camera grid with their overlapping frustums

        2) The second tool is the light field encoder and WebGL based light field viewer, created by Michal Polko, found at (build instructions are included in the readme file).

        This plugin takes in all the images generated by the first plug-in and compresses them by keeping some keyframes and encoding the delta in the remaining intermediary frames. The viewer is WebGL based and makes use of virtual texturing (similar to Carmack's mega-textures) for fast, on-the-fly reconstruction of new viewpoints from pre-rendered viewpoints (via hardware accelerated bilinear interpolation on the GPU).
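
The core of the viewpoint reconstruction is a bilinear blend between the four pre-rendered cameras that surround the requested viewpoint on the grid. The sketch below shows that blend for a single colour channel of a single pixel on the CPU. It is a simplified illustration, not the viewer's actual implementation (which runs on the GPU through texture filtering), and the function name and parameters are assumptions:

```cpp
// Bilinear blend of one pixel value as seen from the four grid cameras that
// surround the requested viewpoint. c00, c10, c01 and c11 are the values from
// the top-left, top-right, bottom-left and bottom-right camera; (fx, fy) in
// [0, 1] is the fractional position of the viewpoint inside that grid cell.
float bilinear_blend(float c00, float c10, float c01, float c11, float fx, float fy) {
    float top = c00 * (1.0f - fx) + c10 * fx;       // blend along x, upper pair
    float bottom = c01 * (1.0f - fx) + c11 * fx;    // blend along x, lower pair
    return top * (1.0f - fy) + bottom * fy;         // blend the two results along y
}
```

A viewpoint that coincides with a grid camera returns that camera's value unchanged, while a viewpoint in the centre of a cell receives the average of all four.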

        Results and Live Demo

        A live online demo of the light field with the dragon can be seen here: 

        You can change the viewpoint (within the limits of the original camera grid) and refocus the image in real-time by clicking on the image.  

        I rendered the Stanford dragon using a 17 by 17 camera grid and distance of 5 cm between adjacent cameras. The light field was created by rendering the scene from 289 (17x17) different camera viewpoints, which took about 6 minutes in total (about 1 to 2 seconds rendertime per 512x512 image on a good GPU). The 289 renders are then highly compressed (for this scene, the 107 MB large batch of 289 images was compressed down to only 3 MB!). 

        A depth map is also created at the same time and enables on-the-fly refocusing of the image by interpolating information from several images.

        A later tutorial will add a bit more freedom to the camera, allowing for rotation and zooming.


        July is a great month for rendering enthusiasts: there's of course Siggraph, but the most exciting conference is High Performance Graphics, which focuses on (real-time) ray tracing. One of the more interesting sounding papers is titled: "Towards real-time path tracing: An Efficient Denoising Algorithm for Global Illumination" by Mara, McGuire, Bitterli and Jarosz, which was released a couple of days ago. The paper, video and source code can be found at

        We propose a hybrid ray-tracing/rasterization strategy for realtime rendering enabled by a fast new denoising method. We factor global illumination into direct light at rasterized primary surfaces and two indirect lighting terms, each estimated with one pathtraced sample per pixel. Our factorization enables efficient (biased) reconstruction by denoising light without blurring materials. We demonstrate denoising in under 10 ms per 1280×720 frame, compare results against the leading offline denoising methods, and include a supplement with source code, video, and data.

        While the premise of the paper sounds incredibly exciting, the results are disappointing. The denoising filter does a great job filtering almost all the noise (apart from some noise which is still visible in reflections), but at the same time it kills pretty much all the realism that path tracing is famous for, producing flat and lifeless images. Even the first Crysis from 10 years ago (the first game with SSAO) looks distinctly better. I don't think applying such aggressive filtering algorithms to a path tracer will convince game developers to make the switch to path traced rendering anytime soon. A comparison with ground truth reference images (rendered to 5000 samples or more) is also lacking for some reason.

        At the same conference, a very similar paper will be presented titled "Spatiotemporal Variance-Guided Filtering: Real-Time Reconstruction for Path-Traced Global Illumination". 

        We introduce a reconstruction algorithm that generates a temporally stable sequence of images from one path-per-pixel global illumination. To handle such noisy input, we use temporal accumulation to increase the effective sample count and spatiotemporal luminance variance estimates to drive a hierarchical, image-space wavelet filter. This hierarchy allows us to distinguish between noise and detail at multiple scales using luminance variance.  
        Physically-based light transport is a longstanding goal for real-time computer graphics. While modern games use limited forms of ray tracing, physically-based Monte Carlo global illumination does not meet their 30 Hz minimal performance requirement. Looking ahead to fully dynamic, real-time path tracing, we expect this to only be feasible using a small number of paths per pixel. As such, image reconstruction using low sample counts is key to bringing path tracing to real-time. When compared to prior interactive reconstruction filters, our work gives approximately 10x more temporally stable results, matches reference images 5-47% better (according to SSIM), and runs in just 10 ms (+/- 15%) on modern graphics hardware at 1920x1080 resolution.
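
        The two ingredients named in the abstract, temporal accumulation and a luminance variance estimate, can be sketched for a single pixel as follows. This is only an illustration of the general idea, not the paper's implementation: the class name and the blend factor are assumptions, and the sketch leaves out any reprojection between frames as well as the wavelet filter that the variance is meant to drive.

```python
class PixelHistory:
    """Running averages of luminance and squared luminance for one pixel."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # blend factor for new samples (assumed value)
        self.mean = 0.0      # first moment of the luminance
        self.mean_sq = 0.0   # second moment of the luminance
        self.count = 0

    def add_sample(self, luminance):
        """Blend one new 1-sample-per-pixel estimate into the history."""
        self.count += 1
        # Plain running average for the first few frames, exponential afterwards.
        a = max(self.alpha, 1.0 / self.count)
        self.mean += a * (luminance - self.mean)
        self.mean_sq += a * (luminance * luminance - self.mean_sq)

    def variance(self):
        """Variance estimate E[x^2] - E[x]^2, clamped against rounding error."""
        return max(0.0, self.mean_sq - self.mean * self.mean)
```

        A pixel whose history shows a high variance is still noisy and can be filtered more aggressively, while a low variance suggests that what remains is real detail that should be preserved.
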
        It's going to be interesting to see if the method in this paper produces more convincing results than the other paper. Either way, HPG has a bunch more interesting papers which are worth keeping an eye on.

        0 0

        2018 will be bookmarked as a turning point for Monte Carlo rendering due to the wide availability of fast, high quality denoising algorithms, which can be attributed in large part to Nvidia Research: Nvidia just released OptiX 5.0 to developers, which contains a new GPU accelerated post-processing denoising filter.

        The new denoiser was trained with machine learning on a database of thousands of rendered images and works pretty much in real-time. The OptiX 5.0 SDK contains a sample program of a simple path tracer with the denoiser running on top (as a post-process). The results are nothing short of stunning: noise disappears completely, even difficult indirectly lit surfaces like refractive (glass) objects and shadowy areas clear up remarkably fast, and the image progressively gets closer to the ground truth.

        The OptiX denoiser works great for glass and dark, indirectly lit areas

        While in general the denoiser does a fantastic job, it's not yet optimised to deal with areas that converge fast, and in some instances overblurs and fails to preserve texture detail as shown in the screen grab below (perhaps this can be solved with more training for the machine learning):

        Overblurring of textures
        The denoiser is provided free for commercial use (royalty-free), but requires an Nvidia GPU. It works with both CPU and GPU rendering engines and is already implemented in Iray (Nvidia's own GPU renderer), V-Ray (by Chaos Group), Redshift Render and Clarisse (a CPU based renderer for VFX by Isotropix).

        Some videos of the denoiser in action in Optix, V-Ray, Redshift and Clarisse:

        Optix 5.0:


        This video provides a high level explanation of the deep learning algorithm behind the OptiX/Iray denoiser based on the Nvidia research paper "Interactive Reconstruction of Monte Carlo Image Sequences using a Recurrent Denoising Autoencoder"

        V-Ray 4.0:

        Redshift: (and a post from Redshift's Panos explaining the implementation in Redshift)


        Other renderers like Cycles and Corona already have their own built-in denoisers, but will probably benefit from the OptiX denoiser as well (especially Corona which was acquired by Chaos Group in September 2017).

        The OptiX team has indicated that they are researching an optimised version of this filter for use in interactive to real-time photorealistic rendering, which might find its way into game engines. Real-time noise-free photorealistic rendering is tantalisingly close.

        0 0

        The Blue Brain Project is a Switzerland based computational neuroscience project which aims to demystify how the brain works by simulating a biologically accurate brain using a state-of-the-art supercomputer. The simulation runs at multiple scales and goes from the whole brain level down to the tiny molecules which transport signals from one cell to another (neurotransmitters). The knowledge gathered from such an ultra-detailed simulation can be applied to simulating drug therapies for neurological diseases (computational medicine) and developing self-thinking machines (computational intelligence).

        To visualize these detailed brain simulations, we have been working on a high performance rendering engine, aptly named "Brayns". Brayns uses raytracing to render massively complex scenes composed of trillions of molecules interacting in real-time on a supercomputer. The core ray tracing intersection kernels in Brayns are based on Intel's Embree and Ospray high performance ray tracing libraries, which are optimised to render on recent Intel CPUs (such as the Skylake architecture). These CPUs basically are a GPU in CPU disguise (as they are based on Intel's defunct Larrabee GPU project), but can render massive scientific scenes in real-time as they can address over a terabyte of RAM. What makes these CPUs ultrafast at ray tracing is a neat feature called AVX-512: 512-bit vector instructions which operate on 16 single-precision values at once and can therefore run several ray tracing calculations in parallel (in combination with ispc), resulting in blazingly fast CPU ray tracing performance which rivals that of a GPU and even beats it when the scene becomes very complex.

        Besides using Intel's superfast ray tracing kernels, Brayns has lots of custom code optimisations which allow it to render a fully path traced scene in real-time. These are some of the features of Brayns:
        • hand optimised BVH traversal and geometry intersection kernels
        • real-time path traced diffuse global illumination
        • Optix real-time AI accelerated denoising
        • HDR environment map lighting
        • explicit direct lighting (next event estimation)
        • quasi-Monte Carlo sampling
        • volume rendering
        • procedural geometry
        • signed distance field raymarching
        • instancing, which allows billions of dynamic molecules to be visualized in real-time
        • stereoscopic omnidirectional 3D rendering
        • efficient loading and rendering of multi-terabyte datasets
        • linear scaling across many nodes
        • optimised for real-time distributed rendering on a cluster with high speed network interconnection
        • ultra-low latency streaming to high resolution display walls and VR caves
        • modular architecture which makes it ideal for experimenting with new rendering techniques
        • optional noise and gluten free rendering
        Below is a screenshot of an early real-time path tracing test on a 40 megapixel curved screen powered by seven 4K projectors: 

        Real-time path traced scene on a 8 m by 3 m (25 by 10 ft) semi-cylindrical display,
        powered by seven 4K projectors (40 megapixels in total)

        Seeing this scene projected lifesize in photorealistic detail on a 180 degree stereoscopic 3D screen and interacting with it in real-time is quite a breathtaking experience. Having 3D molecules zooming past the observer will be the next milestone. I haven't felt this thrilled about path tracing in quite some time.

        Technical/Medical/Scientific 3D artists wanted 

        We are currently looking for technical 3D artists to join our team to produce immersive neuroscientific 3D content. If this sounds interesting to you, get in touch by emailing me at

        0 0

        Before continuing the tutorial series, let's have a look at a simple but effective way to speed up path tracing. The idea is quite simple: like an octree, a bounding volume hierarchy (BVH) can double as both a ray tracing acceleration structure and a way to represent the scene geometry at multiple levels of detail (multiresolution geometry representation). Specifically the axis-aligned bounding boxes of the BVH nodes at different depths in the tree serve as a more or less crude approximation of the geometry. 

        Low detail geometry enables much faster ray intersections and can be useful when light effects don't require full geometric accuracy, for example in the case of motion blur, glossy (blurry) reflections, soft shadows, ambient occlusion and indirect illumination. Especially when geometry is not directly visible in the view frustum or in specular (mirror-like) reflections, using geometry proxies can provide a significant speedup (depending on the fault tolerance) at an almost imperceptible and negligible loss in quality.

        - Triangle intersection tests are skipped entirely
        - Only ray/box intersections are needed, which reduces thread divergence on the GPU
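
        For reference, the ray/box test that replaces the triangle test can be written as the standard slab test. The sketch below is a generic version in plain Python, not the code from the renderer discussed here; the function name and the tuple-based vectors are assumptions.

```python
import math


def intersect_aabb(origin, direction, box_min, box_max):
    """Slab test: return the entry distance along the ray, or None on a miss."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval in which the ray is inside all slabs so far.
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near
```

        Every ray runs the same short sequence of operations per box, whereas leaf nodes with varying triangle counts make neighbouring GPU threads take different amounts of work.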

        The renderer determines the appropriate level of detail based on the distance from the camera (for primary rays) or on the distance from the ray origin (for secondary rays). The following screenshots show the bounding boxes of the BVH nodes from depth 1 (depth 0 is the root node) up to depth 12:

        The screenshot below shows only the bounding boxes of the leaf nodes:
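
        One possible way to turn a distance into a BVH depth is sketched below. The rule used here (full detail up to a reference distance, then one level coarser for every doubling of the distance) is purely an assumption for illustration, as are the function name, the reference distance and the use of depth 12 as the finest level; the renderer described in this post may use a different mapping.

```python
import math

MAX_DEPTH = 12  # assumed finest level, taken from the deepest screenshot


def lod_depth(distance, full_detail_distance=1.0, max_depth=MAX_DEPTH):
    """Pick the BVH depth whose bounding boxes stand in for the geometry."""
    if distance <= full_detail_distance:
        return max_depth
    # One level coarser for every doubling of the distance (hypothetical rule).
    levels_coarser = int(math.log2(distance / full_detail_distance))
    return max(1, max_depth - levels_coarser)
```

        For primary rays the distance would be measured from the camera, and for secondary rays from the ray origin, so that distant or indirectly seen geometry is traversed only down to a shallow depth.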

        Normals are computed according to 

        TODO link to github code, propose fixes to fill holes, present benchmark results (8x speedup), get more timtams 

        0 0

        In the last two months, Nvidia roped in several high profile, world class ray tracing experts (with mostly a CPU ray tracing background):

        Matt Pharr

        One of the authors of the Physically Based Rendering books (some say it's the bible for Monte Carlo ray tracing). Before joining Nvidia, he was working at Google with Paul Debevec on Daydream VR, light fields and Seurat, none of which took off in a big way for some reason.

        Before Google, he worked at Intel on Larrabee (Intel's failed attempt at making a GPGPU for real-time ray tracing and rasterisation which could compete with Nvidia GPUs) and ISPC, a specialised compiler intended to extract maximum parallelism from the new Intel chips with AVX extensions. He described his time at Intel in great detail on his blog (sounds like an awful company to work for).

        Intel also bought Neoptica, Matt's startup, which was supposed to research new and interesting rendering techniques for hybrid CPU/GPU chip architectures like the PS3's Cell.

        Ingo Wald

        Pioneering researcher in the field of real-time ray tracing from the Saarbrücken computer graphics group in Germany, who later moved to Intel and the University of Utah to work on very high performance CPU based ray tracing frameworks such as Embree (used in Corona Render and Cycles) and Ospray.

        His PhD thesis "Real-time ray tracing and interactive global illumination" from 2004, describes a real-time GI renderer running on a cluster of commodity PCs and hardware accelerated ray tracing (OpenRT) on a custom fixed function ray tracing chip (SaarCOR).

        Ingo contributed a lot to the development of high quality ray tracing acceleration structures (built with the surface area heuristic).

        Eric Haines

        Main author of the famous Real-time Rendering blog, who worked until recently for Autodesk. He also used to maintain the Real-time Raytracing Realm and Ray Tracing News.

        What connects these people is that they all have a passion for real-time ray tracing running in their blood, so having them all united under one roof is bound to give fireworks.

        With these recent hires and initiatives such as RTX (Nvidia's ray tracing API), it seems that Nvidia will be pushing real-time ray tracing into the mainstream really soon. I'm really excited to finally see it all come together. I'm pretty sure that ray tracing will very soon be everywhere and its quality and ease-of-use will soon displace rasterisation based technologies (it's also the reason why I started this blog exactly ten years ago).

        0 0

        The Chaos Group blog features quite an interesting article about the speed increase which can be expected by using Nvidia's recently announced RTX cards: 

        "Specialized hardware for ray casting has been attempted in the past, but has been largely unsuccessful — partly because the shading and ray casting calculations are usually closely related and having them run on completely different hardware devices is not efficient. Having both processes running inside the same GPU is what makes the RTX architecture interesting. We expect that in the coming years the RTX series of GPUs will have a large impact on rendering and will firmly establish GPU ray tracing as a technique for producing computer generated images both for off-line and real-time rendering."

        The article features a new research project, called Lavina, which is essentially doing real-time ray tracing and path tracing (with reflections, refractions and one GI bounce). The video below gets seriously impressive towards the end: 

        Chaos Group have always been a frontrunner in real-time photorealistic ray tracing research on GPUs, even as far back as Siggraph 2009 where they showed off the first version of V-Ray RT GPU rendering on CUDA.

        I have to admit that I'm both stoked and a bit jealous when I see what Chaos Group has achieved with project Lavina, as it is exactly what I hoped Brigade would turn into one day (Brigade was a premature real-time path tracing engine developed by Jacco Bikker in 2010, which I experimented with and blogged about quite extensively).

        Then again, thanks to noob-friendly ray tracing APIs like Nvidia's RTX and OptiX, soon everyone's grandmother and their dog will be able to write a real-time path tracer, so all is well in the end.

        0 0

        The Blue Brain Project is a Swiss research project, based in Geneva, which started in 2005 and aims to faithfully simulate a detailed digital version of the mouse brain (as close to biology as is possible with today's supercomputers).

        Visualising this simulated brain and its components is a massive challenge. Our goal is to build state-of-the-art visualisation tools to interactively explore extremely large and detailed scientific datasets (over 3 TB). The real-time visualisation is rendered remotely on a supercomputing cluster and can be interacted with on any client device (laptop, tablet or phone) via the web.

        To achieve interactive frame rates and high resolution, we are building our tools on top of the industry's highest performance ray tracing libraries (the Ospray library from Intel, which itself is based on Embree, and the OptiX framework for interactive GPU ray tracing from Nvidia). These libraries take advantage of the embarrassingly parallel nature of ray tracing and scale extremely efficiently across multiple cores, devices and nodes in a cluster.

        We are currently looking for software engineers to help accelerate the development of these tools, both in the frontend and backend. Our offices are located at the Campus Biotech in the international district in Geneva, Switzerland.

        Frontend/fullstack web developer

        Your profile

        • 3+ years experience in full stack/frontend engineering
        • 3+ years designing, developing, and scaling modern web applications
        • 3+ years experience with JavaScript, HTML5, CSS3, and other modern web technologies

        Main duties and responsibilities

        Your responsibility will be to develop new features for our web based interactive 3D viewer "Brayns" (on the frontend) and maintain existing ones, and to drive the development of our new hub application where the scientists can manage their data visualisations.

        Required skills and experience

        • TypeScript, JavaScript (ES6)
        • React JavaScript framework
        • REST, WebSockets and Remote Procedure Calls
        • RxJS, NodeJS
        • Deep understanding of asynchronous code and the observable pattern in JavaScript
        • Experience using the browser dev tools for debugging, profiling, performance evaluation, etc.
        • Understanding of both the object oriented and functional programming paradigms
        • Knowledge of code chunking strategies
        • Experience writing unit tests using Jest and component tests using Enzyme (or similar technologies)
        • Experience with source versioning systems (Git, Github, etc.)
        • Knowledge of common UI/UX design patterns and ability to implement/use them accordingly
        • Knowledge of the Material Design spec
        • Fluent English in speech and writing
        • Self-motivated and able to work independently
        • Team oriented

        Nice to have

        • Interest in science (in particular neuroscience)
        • Experience with ThreeJS, WebGL, WebAssembly
        • Basic understanding of C++, Python and Docker
        • UI graphics design skills


        For more info, email

        C++ interactive graphics developer 

        Main duties and responsibilities

        Your responsibility will be to develop and research new features for "Brayns", our interactive raytracer for scientific visualisation, and to maintain existing ones.

        Required skills and experience
        • 3+ years of experience in C++/Python software development, testing, release, compilation, debugging, and documentation
        • 2+ years of experience with computer graphics (OpenGL, CUDA)
        • Strong knowledge of object-oriented, parallel, and distributed programming
        • Deep understanding of ray tracing and physically based rendering
        • Experience in software quality control and testing
        • Experience using UNIX/Linux operating systems
        • Experience in Linux-based system administration
        • Experience with Continuous Integration systems such as Jenkins
        • Great team player
        • Fluent English in speech and writing

        Nice to have

        • Interest in science (in particular neurobiology)
        • Experience in software development on supercomputers and distributed systems.


        For more info, email