written by
Ruben Verhack

AI-Driven Breakthroughs in Image-Based Rendering: Light Fields, SMoE, Gaussian Splatting, NeRFs and beyond

Tech Posts · Computer Graphics · 17 min read · February 13, 2024

In a previous life, I was working at TU Berlin and Ghent University in the exciting field of video coding and image-based rendering. I had the opportunity to meet with many brilliant minds at conferences, MPEG and JPEG meetings as well as industry meetups at Google, Netflix and Disney Research.

Author with the original inventors of JPEG
Me (left) - starstruck - meeting the original JPEG inventors at the JPEG meeting celebrating 25 years of JPEG (Turin, Italy). Source: Image by the author.

Although image-based rendering has been around for decades, it wasn’t until recently that a big revival happened. Image-based rendering was considered impractical for many reasons, until deep learning and diffusion models provided solutions to some of the longstanding issues.

In this article, I will provide an overview of recent advances in the field of AI-driven image-based rendering, based on my personal experience and background. Additionally, I will discuss my own contributions to the field. This is definitely more of a technical longread. I hope you enjoy it.

What is image-based rendering?

Image-based rendering is the field of rendering in which new images are distilled from a set of captured images. It typically involves interpolating and tracing light rays through a space. The captured images serve as snapshots of the light rays in that space.

 Illustration of synthesized cameras vs reference cameras
Illustration in which the yellow camera viewpoint is synthesized based on the pixel data entering the blue real cameras. Source: Image by the author.

Why is this useful? Image-based rendering allows creating novel views of a scene without knowing anything about its geometry. Reverse-engineering the geometry (meshes, textures, and reflective properties) of a scene is notoriously difficult and an underdetermined problem. Think of smoke, reflections, and translucent objects. It easily gives rise to the uncanny valley problem.

On the other hand, image-based rendering faces practical problems of its own when generating new images without knowing the geometry and its properties:

  1. Many images need to be captured before new images can be generated. This leads to practical data acquisition problems, as well as storage and streaming issues.
  2. Occlusions: one can never see behind objects.

Where does AI come into play?

The naive way to implement the rendering is to use ray tracing to trace each pixel back to the reference views. Over the years, this has been extended by incorporating more and more intelligent view interpolation techniques. However, the modern approach is to use an intermediate AI model constructed from the reference viewpoints. Such a model can then be queried at the desired viewing angles.

The primary area of innovation lies in designing these view models so that they grasp the essence of a scene and enable reconstructing views at high fidelity. Additionally, these models will need application-driven properties, such as the ability to relight a scene or to efficiently stream parts of a scene over a network. The intended application will thus be important in deciding what type of model is preferred.

Applications

Image-based rendering has many applications, all of which share the goal of taking a scene or an object from the real world and visualizing it again in a virtual setting.

1. Immersive applications

The most natural application is a VR/AR experience in which image-based rendering techniques give the user a strong sense of immersion thanks to highly photorealistic light effects, as well as all possible viewing angles at their disposal, giving them six degrees of freedom.

It effectively fulfills the potential of virtual reality using camera-captured content. This represents a significant improvement over the restricted range of motion found in current 360-degree videos. There are many possible applications, e.g. entertainment (remote events), telemedicine (remote robot surgery), remote visits to real-estate, cultural heritage, and others.

The model will need to be efficiently streamable across networks. Ideally, it should have all the features found in typical MPEG data streams, such as random access or different layers of detail.

2. 3D assets for virtual productions

A less obvious, but extremely important new application is the inclusion of scanned objects and spaces for reuse in virtual productions, VFX, video production, and gaming. This is where I truly believe the need for a compact and versatile data model arises.

This is the area that has seen the most activity over the last few years. Engines such as Unreal and Unity currently have plugins that allow you to import view-generating models like NeRFs and Gaussian Splats right into your 3D world. You can then integrate other 3D objects like meshes with your camera-captured content. Watch the video below and you will instantly understand why this is a game changer for productions.

Video showing the production process of incorporating a mountain top camera-captured by a drone into a virtual production. Source: Bad Decisions Studio

For these applications, other requirements may be imposed on viewpoint-generating models, e.g. the ability to edit or relight scenes and objects.

Mathematical models of light

Let’s first get back to the (mathematical) basics. Several mathematical models of light have been introduced over the last century, and they all relate in one way or another to the plenoptic function. The plenoptic function is a mathematical model that describes all the light rays in a space at a certain time. It captures all possible visual information about a scene at every point in time and encompasses attributes like position, direction, color, intensity, etc. Light fields, radiance fields, and ray-space representations are all different parametrizations or simplifications of the plenoptic function.

The full plenoptic function. The Polarization, Bounce and Phase arguments are typically left out for simplification. Time is only relevant for non-stationary scenes (Source: History of Neural Radiance Fields).
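Dropping those arguments leaves the commonly cited seven-dimensional form: the radiance observed at a 3D position (x, y, z), in a direction (θ, φ), at a wavelength λ and a time t. Written out in LaTeX for concreteness:

```latex
P = P(x, y, z, \theta, \phi, \lambda, t)
```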

For example, the simplified 4D light field contains information about both the intensity and direction of light rays at every point in space. By capturing and processing this vast amount of data, we can recreate realistic images with accurate lighting effects and depth perception. Light fields are used in photography and imaging technologies, like light field cameras (e.g., Lytro), which capture information about the light direction as well as its intensity. This allows for post-capture refocusing, changing the perspective, and precise depth-based filtering.

It is called the 4D light field because it reduces the plenoptic function to four main parameters. Different parametrizations are possible (see below), of which the first (a), the parallel two-plane representation, is the most common. Each ray of light passes through an image plane (s,t) and travels through a camera plane (u,v), assuming that the cameras are placed on a single plane.

Source: Changyin Zhou, & Nayar, S. K. (2011). Computational Cameras: Convergence of Optics and Processing. IEEE Transactions on Image Processing, 20(12), 3322–3340.
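In this two-plane notation, a single ray is indexed by its intersections with the camera plane (u, v) and the image plane (s, t); a conventional pinhole image taken from camera position (u₀, v₀) is then simply a 2D slice of the 4D function:

```latex
L = L(u, v, s, t), \qquad I_{u_0, v_0}(s, t) = L(u_0, v_0, s, t)
```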

The 4D light field is a convenient mathematical model, but does not always transfer easily to practical and cost-efficient camera rigs, as I will discuss in the next section.

Alternatively, free-viewpoint camera setups are possible (or even just moving a single camera through time), which requires the additional complexity of structure-from-motion (SfM) / photogrammetry techniques to situate the pixel values in 3D space. For example, Gaussian Splatting relies heavily on SfM for the initialization of its kernels.

Structure-from-motion: Finding correspondences between camera viewpoints to locate the pixel values in the physical 3D world. A more flexible approach compared to the constrained 4-D light field. Source: https://towardsdatascience.com/a-comprehensive-overview-of-gaussian-splatting-e7d570081362

Capturing light fields

I will briefly discuss camera setups for light field capturing, mainly to give insight into why light fields have not been a practical solution for a long time. Below you can see a prototype of one of Lytro’s production-level light field camera arrays. Needless to say, this is extremely costly to rent, run, and store data for. Each camera array requires a server rack to capture video.

Adapted from RoadToVR - Exclusive: Lytro Reveals Immerge 2.0 Light-field Camera with Improved Quality, Faster Captures.

In contrast, you can find my poor man’s version, which I built at IDLab-MEDIA - UGent, below. Each panel consists of 9 Raspberry Pi minicomputers with RPi v2 cameras. Each panel was laser cut and assembled using standard, easy-to-find tools. The cost of one panel was under 1,000 EUR, which was my assigned budget.

DIY Raspberry Pi-based light field camera array. The German word “Kabelsalat” is very apt here. Source: Image by the author.

Each panel would thus produce 9 photos/videos that were spatially displaced. The extrinsic parameters (camera position/orientation) and intrinsic parameters (lens correction) were co-optimized using multi-camera optimization.
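For readers who want to try this at home, here is a minimal per-camera calibration sketch using OpenCV and a printed checkerboard. The checkerboard size and folder layout are hypothetical, and the actual rig used a joint multi-camera optimization rather than this independent per-camera estimate.

```python
import glob
import cv2
import numpy as np

# Checkerboard with 9x6 inner corners (hypothetical pattern).
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/cam0/*.jpg"):  # hypothetical folder per camera
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimates the camera matrix and lens distortion (intrinsics),
# plus a rotation/translation (extrinsics) per calibration image.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
```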

Result of one panel: 9 viewpoints from a single camera plane. Source: Image by the author.

If you apply naive light field rendering by ray tracing through these nine images, you get a result like the one shown below. The camera position can be changed slightly, and some refocusing is possible by adjusting a virtual focal plane. The level of quality remains low when using only 9 cameras with relatively large gaps in between, but it does give you an impression of where we want to go.

Naive light field rendering applied to my poor man’s light field camera rig.
Source: video by the author. Full video here
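To make this concrete, here is a minimal, naive blending sketch under simplifying assumptions: the references lie on an integer grid of a single camera plane, and refocusing is approximated by shifting each reference image proportionally to its offset from the virtual camera. It is a toy stand-in for per-pixel ray tracing, and all function and parameter names are mine.

```python
import numpy as np

def render_view(images, cam_pos, disparity=0.0):
    """Blend the four nearest reference cameras into a virtual view.

    images: dict mapping integer grid coordinates (u, v) -> HxWx3 float arrays.
    cam_pos: fractional (u, v) position of the virtual camera on the camera plane.
    disparity: pixel shift per camera-plane unit, emulating a virtual focal plane.
    """
    max_u = max(k[0] for k in images)
    max_v = max(k[1] for k in images)
    u, v = cam_pos
    u0 = min(int(np.floor(u)), max_u - 1)
    v0 = min(int(np.floor(v)), max_v - 1)
    fu, fv = u - u0, v - v0

    out = None
    for du, dv, w in [(0, 0, (1 - fu) * (1 - fv)), (1, 0, fu * (1 - fv)),
                      (0, 1, (1 - fu) * fv), (1, 1, fu * fv)]:
        img = images[(u0 + du, v0 + dv)]
        # Shift each reference image according to its offset from the virtual camera.
        sx = int(round(disparity * (u0 + du - u)))
        sy = int(round(disparity * (v0 + dv - v)))
        shifted = np.roll(img, (sy, sx), axis=(0, 1))
        out = w * shifted if out is None else out + w * shifted
    return out

# Hypothetical usage with a 3x3 panel:
# view = render_view(images, cam_pos=(0.7, 1.3), disparity=2.5)
```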

Novel AI-driven image-based rendering techniques

I will now discuss four recent advancements in image-based rendering that have been made possible by recent AI breakthroughs. For more detailed information on the history of this field prior to 2017, please refer to my PhD thesis.

Non-exhaustive overview of continuous representations of camera-captured scenes, separated into Gaussian-based methods and whole-scene methods. Source: Image by the author.

SMoE (2017)

Between 2014 and 2020, I published a number of papers that introduced Steered Mixture-of-Experts (SMoE), and eventually published a book on the matter. I had the privilege and pleasure to work for an MPEG grandmaster, Prof. Thomas Sikora, who was thrilled about the idea of abandoning the concept of pixels altogether. Initially, the method was meant as a continuous representation for images and videos. Everything was modelled by one large Gaussian Mixture Model (GMM). There would simply be more Gaussians where more detail was required.

The Gaussians - called ‘kernels’ - would span the spatial directions and time, thus replacing the pixel as the building block of imagery. A single Gaussian thus represented a single blob of color with a spatial and a temporal extent. The reconstruction is performed by taking the expected pixel luminance at a given location. This is given by the expectation of the posterior distribution at that location, which leads the Gaussians to work together in a Mixture-of-Experts (MoE) fashion. MoEs approximate a continuous function by combining a set of experts, each responsible for a part of the total function. In this case, each kernel is responsible for modeling a single gradient of a region in an image.
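A minimal numpy sketch of this reconstruction rule for a grayscale image, assuming a joint GMM over (x, y, luminance) has already been fitted; the variable names are mine, not from the SMoE papers.

```python
import numpy as np

def smoe_reconstruct(pos, weights, means, covs):
    """Expected luminance E[Y | position] for a joint GMM over (x, y, Y).

    pos: (N, 2) query positions, weights: (K,), means: (K, 3), covs: (K, 3, 3).
    """
    K = len(weights)
    gates = np.zeros((pos.shape[0], K))
    experts = np.zeros((pos.shape[0], K))
    for k in range(K):
        mu_s, mu_y = means[k][:2], means[k][2]        # spatial / luminance parts
        S_ss, s_ys = covs[k][:2, :2], covs[k][2, :2]  # covariance blocks
        S_inv = np.linalg.inv(S_ss)
        d = pos - mu_s
        # Gating: mixing weight times the spatial Gaussian density (constants cancel below).
        maha = np.einsum("ni,ij,nj->n", d, S_inv, d)
        gates[:, k] = weights[k] * np.exp(-0.5 * maha) / np.sqrt(np.linalg.det(S_ss))
        # Expert: conditional mean of luminance given position, i.e. a local linear gradient.
        experts[:, k] = mu_y + d @ (S_inv @ s_ys)
    gates /= gates.sum(axis=1, keepdims=True)          # normalize the gates per position
    return (gates * experts).sum(axis=1)               # expectation over the experts
```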

A detailed example on a 32x32 pixel image patch. These 1024 pixels were represented by 10 Gaussian kernels, each representing one localized gradient. Comparison to JPEG is at same bitrate. Source: Image by the author.
An example of mean estimated reconstructions of a 128x128 image from the dataset. Original (left) followed by models with 25, 100, 250, 750, and 2000 components, i.e. ranging from 1 kernel covering ±655 to ±8 pixels on average. Source: Image by the author.

It was soon realized that this method could be extended to imagery of any dimensionality, and thus to any light model. This was a huge breakthrough: it meant that a single methodology could natively model, code, decode, and render images, videos, light fields, light field video, and even 360-degree content. It has the major advantage that complicated redundancy-reducing methods, typically present in MPEG technologies (e.g. motion compensation using motion vectors), do not have to be introduced.

Basically, one Gaussian kernel represents one bundle of light in space. It has an orientation and an extent through space and time.

A SMoE reconstruction is exemplified below using 9,000 kernels instead of the original 41 million pixels. This means that there is approximately 1 kernel for every 5,000 pixels. In general, the development of this method was mainly focused on image compression rather than computer graphics, so bit-efficiency has always been a priority.

The light field is shown below as a video that traverses all the camera viewpoints (captured with a Lytro lenslet camera), moving from the top left to the bottom right of the camera plane.

Left: original, Right: SMoE reconstruction. Source: Video by the author.

Below is an illustration of the 4D kernels for a crop of the scene above. Sadly, 4D is inherently difficult for us humans to wrap our heads around. The key intuition is that if you move your head left and right in a scene, a patch of color will shift left or right by an amount that depends on its distance in the scene. The same goes for moving your head up and down. This relative movement is what is visualized in the a1 and a2 dimensions below. The main observation is that kernels have an extent in all four dimensions. A kernel is thus responsible for a patch of color that can move left-right or up-down depending on the camera viewpoint.

Visualization of the 4 dimensions of the light field, i.e. the image dimension and the epipolar planes. Source: Image by the author.

The rendering of such SMoE models was heavily improved by the work of Martijn Courteaux (UGent - IDLab-MEDIA). It allows for real-time rendering of light fields from any angle. Here is one example based on only 9 images from one of my DIY light field camera panels. Since the kernels capture the correlation between the different viewpoints, they provide a smooth transition between, and even outside of, the original camera plane.

Light Field rendering of a SMoE model. Source: Video by the author.
Visualization of the kernels of the model. Source: Video by the author.

The issue with SMoE was that, although theoretically sound, there always remained a struggle to obtain near-lossless quality and to deal with fine texture details. In my work, I mainly trained the GMMs using the Expectation-Maximization (EM) algorithm. I even developed a method to scale this algorithm to hundreds of thousands of kernels on billions of pixels (I might write about that in a later post). Follow-up methods have been published on how to better initialize and train the GMM using MSE optimization with gradient descent. This involved making the model differentiable, similar to Gaussian Splatting. It greatly improves image quality but breaks the theoretical soundness, since the model is no longer a pure Bayesian GMM (for what it's worth).
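As a toy stand-in for that EM fitting step (not the scalable variant from my thesis, nor the gradient-descent follow-ups), scikit-learn's GaussianMixture can fit a joint GMM over (x, y, luminance) samples of an image; the result plugs directly into the reconstruction sketch above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_smoe(img, n_kernels=250, seed=0):
    """Fit a joint GMM over (x, y, luminance) samples of a grayscale image via EM."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Every pixel becomes one 3D sample: normalized position plus luminance in [0, 1].
    samples = np.column_stack([xs.ravel() / w, ys.ravel() / h, img.ravel()])
    gmm = GaussianMixture(n_components=n_kernels, covariance_type="full",
                          random_state=seed).fit(samples)
    return gmm.weights_, gmm.means_, gmm.covariances_

# weights, means, covs = fit_smoe(img)   # img: HxW array with values in [0, 1]
# recon = smoe_reconstruct(query_positions, weights, means, covs)
# query_positions: (N, 2) normalized (x, y) coordinates, matching the fit above.
```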

The same Martijn Courteaux (sitting in the back in the video) has been working on bringing the modeling and rendering of light fields using SMoE to a whole new level over the last few years. A sneak peek is included below:

Previewing some recent advances in SMoE-based rendering. Source: Martijn Courteaux's YouTube channel.

I will discuss the current state of SMoE vs Gaussian Splatting in our wrapping-up section. But first, I will continue chronologically through the major breakthroughs.

NeRF (2020)

Neural Radiance Fields (NeRFs), and the subsequent research based on NeRFs, are whole-scene methods in which a single neural network captures the entire scene. The neural network maps a 3D position and viewing direction onto a color and density output, providing a continuous representation of the entire scene.

Source: History of Neural Radiance Fields
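The color of a pixel is obtained by sampling the network along the corresponding camera ray and compositing the results with the standard volume-rendering quadrature. Below is a minimal numpy sketch, where radiance_field is a hypothetical stand-in for the trained MLP.

```python
import numpy as np

def render_ray(radiance_field, origin, direction, near=2.0, far=6.0, n_samples=64):
    """Composite per-sample colors and densities along one ray into a pixel color."""
    t = np.linspace(near, far, n_samples)                  # sample depths along the ray
    pts = origin + t[:, None] * direction                  # 3D sample positions
    dirs = np.broadcast_to(direction, pts.shape)
    rgb, sigma = radiance_field(pts, dirs)                 # MLP: (position, direction) -> (color, density)

    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))     # distances between consecutive samples
    alpha = 1.0 - np.exp(-sigma * delta)                   # opacity contributed by each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))   # transmittance up to each sample
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)            # expected color of the pixel

# Hypothetical usage: color = render_ray(my_nerf, np.zeros(3), np.array([0.0, 0.0, -1.0]))
```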

The data that needs to be saved consists solely of the weights (and the architecture) of the trained neural network. The downside is that the whole set of weights is required to reconstruct even portions of the scene. This leads us to the main disadvantages of whole-scene NeRF methods: encoding and decoding complexity and memory requirements. There are no "building blocks", as each scene corresponds to training an entire neural network, and reconstruction requires running inference through that entire network.

Source: https://www.matthewtancik.com/nerf

The biggest advantage of NeRFs is that they have complete knowledge of the scene, which can lead to much better image quality and fewer required camera viewpoints, thanks to better generalization between viewpoints.

NeRF reconstruction example. More examples on https://www.matthewtancik.com/nerf (source)

Gaussian Splatting (2023)

Recently, Gaussian Splatting has received much attention, and rightfully so. It is a Gaussian-based method similar to SMoE, but it addresses some of the persistent issues present with SMoE. The optimization of Gaussian Splatting parameters is MSE-based, similar to extensions of SMoE.

The main differences to SMoE are as follows:

  1. The Gaussian kernels exist in the physical 3D coordinate space, whereas SMoE kernels exist in the camera-image-plane coordinate system. As such, they have a more explicit connection to the real geometry. Furthermore, this allows the Gaussians to be initialized using structure-from-motion, which greatly helps the optimization process reach a good optimum quickly and efficiently.
  2. Spherical harmonics are used as the view-dependent color function, providing more expressive local expert functions than the color gradients in SMoE. This has been key to achieving photo-realistic results (see the sketch below the comparison figure).
  3. Gaussian splatting benefits from a plethora of optimization possibilities, as it can be implemented as a rasterization method. As such, it benefits from decades of computer graphics advancements.
High-level comparison between SMoE and Gaussian Splatting. Source: image by the author.
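To make point 2 concrete, here is a minimal sketch of evaluating a view-dependent color from degree-1 spherical-harmonic coefficients stored per Gaussian. The constants follow the common real-SH convention; the exact coefficient layout and signs can differ between splatting implementations.

```python
import numpy as np

SH_C0 = 0.28209479177387814   # degree-0 real SH basis constant
SH_C1 = 0.4886025119029199    # degree-1 real SH basis constant

def sh_color(sh, view_dir):
    """Evaluate an RGB color from degree-1 SH coefficients of one Gaussian.

    sh: (4, 3) array, one RGB coefficient triple per SH basis function.
    view_dir: unit vector pointing from the camera toward the Gaussian center.
    """
    x, y, z = view_dir
    rgb = SH_C0 * sh[0]
    rgb = rgb - SH_C1 * y * sh[1] + SH_C1 * z * sh[2] - SH_C1 * x * sh[3]
    return np.clip(rgb + 0.5, 0.0, 1.0)   # offset and clamp, as commonly done

# Hypothetical usage:
# color = sh_color(np.random.randn(4, 3) * 0.1, np.array([0.0, 0.0, 1.0]))
```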

Bikes seemingly make for great test cases in the field of view synthesis, since they contain many small structures such as spokes and brake cables. Below is an illustration of a different bike scene modelled with Gaussian Splatting.

Source: the Hugging Face blog.

Similar to the SMoE example above, the example here illustrates the quality of Gaussian Splatting on a comparable scene, albeit recorded with a higher-end camera rig.

Source: 3D Gaussian Splatting at Plain Concepts.

It is great to see a method reaching maturity that will definitely make a huge impact on virtual video productions and a variety of game production tools, especially since it is not a black-box model: there are building blocks that can be segmented, compressed, streamed... Many of the paradigms of video coding are applicable to splats, which is exciting.

For those who want more details, I would highly recommend reading the excellent Comprehensive Overview of Gaussian Splatting by Kate Yurkova.

ReconFusion (2023)

I included ReconFusion as it clearly demonstrates the benefit of a technique that is purely deep-learning based. ReconFusion is essentially the combination of NeRFs with diffusion models, a relatively novel class of deep learning models that is especially good at generalizing to unseen data. This generalization translates into needing fewer original camera viewpoints, as it "inpaints" the missing views based on prior image knowledge gathered by training on a large dataset of viewpoints.

Below you can see the comparison between regular NeRFs and ReconFusion with its diffusion priors. It is clear that ReconFusion requires far fewer initial camera viewpoints.

Source: https://reconfusion.github.io/

Wrapping up

There are currently two new main paradigms in image-based rendering: the Gaussian-based methods and the whole scene deep-learning methods. Both have their pros and cons listed below.

Comparison between Gaussian-based methods (left) and whole scene NeRF methods (right). Source: image by the author.

I strongly believe that when it comes to creating efficient streamable camera-captured VR content, Gaussian kernel-based methods are the clear choice. While NeRFs can be utilized as 3D assets in virtual productions, Gaussian splats can serve the same purpose just as effectively. Additionally, Gaussian splats offer the advantage of being loosely connected to the underlying geometry. This opens up possibilities for editing these 3D assets in various ways.

NeRFs can still be employed in scenarios with restricted viewpoints and a greater need for image priors. The methods are not mutually exclusive either. It may be logical to initially generate a NeRF using limited camera viewpoints, harness the capabilities of diffusion, and subsequently create a Gaussian Splat for a more practical model.

Sadly, for some reason, the latest improvements to SMoE have been extremely difficult to get published. The novel methods have even been rejected three times at ACM Transactions on Graphics, the SIGGRAPH journal in which Gaussian Splatting was introduced. Anyway, I'll spare you a massive rant on the current state of journal review processes. Nevertheless, I hope to see the work published soon, which would also benefit the Gaussian Splatting community, since many improvements are transferable between the two techniques.

One thing is for sure, the field is more alive and kicking than I have ever experienced in my career. At Datameister.ai, we're following up on all the developments in the field and are currently exploring how we can contribute to it. More on that later!

Feel free to discuss the article on reddit: https://www.reddit.com/r/GaussianSplatting/comments/1ax1102/a_higherlevel_view_on_ai_models_in_radiance/