3D Generative AI: Image-based 3D reconstruction

22 min read, January 27, 2025

In 2024, there was a Cambrian explosion of academic work on 3D generation and several commercial tools made their appearance. Last month in particular, there was the exciting release of Trellis by Microsoft, taking a leap forward in the open source/science community. Its performance is great, but of equal importance is the fact that it was largely open sourced and people are already building with it.

The goal of this post is twofold:

  1. Serve as a tutorial for builders and other interested minds, offering a concise overview of the historical developments that led to Trellis and a detailed look at how it works, explaining why its pipeline is designed the way it is.
  2. Compare Trellis to other state-of-the-art commercial tools, such as Rodin by Hyper3D, Tripo by TripoAI, SPAR3D by StabilityAI and Hunyuan3D-2 by Tencent, by examining the theoretical differences in their pipelines and by presenting visual results for a subjective evaluation of each tool.
Jarne Van den Herrewegen - AI Engineer (author)

A small word about myself first. My name is Jarne, I am a first class AI nerd and ML engineer at Datameister. I am about to defend my PhD in self-supervised 3D deep learning. The best thing for me is rabbit-holing in research and busting out algorithms in disruptive products. Having worked with the Datameister founders Axel and Ruben for 3 years at Oqton, I am more than happy to join them and the other talented meisters!

1. From NeRF to Trellis3D: key concepts of 3D generation

Given the success of image generation models, an extensive line of research on 3D reconstruction from images has formed in the past two years. The goal of this section is to introduce the key concepts behind 3D generative models that build on image generation, without losing ourselves in the details that fill the papers. These foundations will clarify how the field evolved and set the stage for understanding Trellis and why its design choices push 3D generation further. In this section I will cover:

  1. NeRFs and Gaussian Splatting: the bridges between 2D and 3D
  2. DreamFusion: pioneering general 3D generation
  3. Large Reconstruction Model: making DreamFusion efficient
  4. InstantMesh: introducing 3D feedback

1.1 Image-based 3D reconstruction

We begin with NeRFs and Gaussian Splatting because they set the foundation for modern 3D generation—both methods demonstrate how limited 2D observations can be leveraged to generate new “3D” views. By understanding the strengths (and limitations) of these approaches, we can see why Trellis adopts certain strategies (like voxelization and latent-space modeling) and how it ultimately distinguishes itself from earlier techniques.

Novel view synthesis. Key to the reconstruction from 2D to 3D are Neural Radiance Fields (NeRF) and Gaussian Splatting (GS). Given a few images and their corresponding camera perspectives, these methods can synthesize unseen camera views with high quality. To predict views that are truthful to the 3D world, both methods build a form of geometric grounding. This geometric information will be the bridge between 2D images and 3D generation. NeRF and GS are briefly introduced here. For more applications, definitely check out our previous blog about image-based rendering with AI by Ruben, who used to publish papers in the early days of this field!

Source: Datameister blog

Neural Radiance Fields, Mildenhall et al. ECCV 2020, capture an environment in a neural network based on images and their corresponding camera positions. In essence, a NeRF is a neural network f fitted to infer RGB values and a volume density σ (0 = empty space, higher values = denser material) for a point x and viewing direction d:
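In the paper's notation, the mapping is roughly:

$$ f_\theta(\mathbf{x}, \mathbf{d}) \;\rightarrow\; (\mathbf{c}, \sigma), \qquad \mathbf{c} = (r, g, b), \quad \sigma \geq 0. $$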

To render an image from a Neural Radiance Field, a ray is cast through space for each pixel in the virtual image plane. The NeRF is then evaluated at points sampled along each ray:

Source: NeRF paper, Mildenhall et al.
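As a minimal sketch of this procedure (not the actual NeRF implementation), the snippet below renders one pixel by sampling points along a ray and compositing their colors weighted by the predicted densities; toy_nerf is a hand-written stand-in for the trained network:

```python
import numpy as np

def toy_nerf(points, view_dir):
    """Stand-in for the NeRF MLP: a soft sphere of radius 0.5 around the origin.
    Returns per-point RGB in [0, 1] and a non-negative density sigma."""
    dist = np.linalg.norm(points, axis=-1)
    sigma = 10.0 * np.clip(0.5 - dist, 0.0, None)        # dense inside the sphere
    rgb = np.stack([0.8 * np.ones_like(dist),
                    0.3 * np.ones_like(dist),
                    0.3 * np.ones_like(dist)], axis=-1)
    return rgb, sigma

def render_ray(origin, direction, near=0.0, far=2.0, n_samples=64):
    """Discrete volume rendering along one ray, as in the NeRF paper."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction              # sample points on the ray
    rgb, sigma = toy_nerf(points, direction)
    delta = np.diff(t, append=far)                        # distance between samples
    alpha = 1.0 - np.exp(-sigma * delta)                  # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)           # accumulated pixel color

pixel = render_ray(origin=np.array([0.0, 0.0, -1.5]),
                   direction=np.array([0.0, 0.0, 1.0]))
print(pixel)
```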

The volume density contains information on the composition of the scene and guides the color aggregation process along the rays. This geometric understanding will be the basis for obtaining 3D assets with NeRFs. Extracting a mesh from a density field is possible with a Marching Cubes algorithm. This works by voxelizing the volume and using the NeRF density to determine whether each vertex is inside or outside of a surface. Marching Cubes then checks each voxel (cube) individually for how the surface intersects its edges, and selects a polygonal pattern (from a small lookup table) that best approximates the shape within that region, as shown below.

Marching Cubes lookup table for mesh faces. Red vertices are inside the surface. Source: Isovox
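As a small illustration of this extraction step, assuming scikit-image is installed and using a synthetic density field in place of a trained NeRF:

```python
import numpy as np
from skimage import measure

# Toy density field standing in for a NeRF's sigma values: a sphere in a 64^3 grid.
n = 64
coords = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
density = np.clip(0.6 - np.sqrt(x**2 + y**2 + z**2), 0.0, None)

# Marching Cubes: classify voxel corners against an iso-level and emit triangles
# from the lookup table for each cube configuration.
verts, faces, normals, _ = measure.marching_cubes(density, level=0.1)
print(verts.shape, faces.shape)   # (V, 3) vertex positions, (F, 3) triangle indices
```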

Gaussian Splatting (GS) is a more recent method for novel view synthesis, aimed at real-time applications. GS represents 3D scenes as a cloud of colored Gaussians, without any neural networks involved. In essence, each Gaussian has a position μ, covariance Σ, maximum density σ and a color c that together describe local structure and appearance. Similar to rendering with Neural Radiance Fields, rays are cast from a virtual image plane through the scene:

Each ray aggregates color from the Gaussian kernels in the scene. The color contribution of each kernel is weighted by its maximum density and its proximity to the ray. The following formula shows the contributed density of kernel i at position x on the ray:
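In the usual Gaussian form (notation varies slightly between implementations), the contribution decays with the Mahalanobis distance to the kernel center:

$$ \sigma_i(\mathbf{x}) = \sigma_i^{\max} \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right). $$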

In reality the Gaussians are projected onto the image plane and rendering is done through simple α-compositing of kernels sorted by distance. Most kernels can be efficiently pruned, reducing the inference time even more. These optimizations, combined with the absence of neural networks, make Gaussian Splatting roughly 100x faster than Neural Radiance Fields in rendering speed.
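A minimal sketch of that compositing step for a single pixel (illustrative only; the real renderer rasterizes projected 2D Gaussians in screen-space tiles):

```python
import numpy as np

def composite_pixel(colors, alphas, depths):
    """Front-to-back alpha compositing of the splats covering one pixel.
    colors: (K, 3), alphas: (K,) opacity after 2D projection, depths: (K,)."""
    order = np.argsort(depths)                 # sort kernels front to back
    out, transmittance = np.zeros(3), 1.0
    for i in order:
        out += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
        if transmittance < 1e-3:               # early termination: pixel is saturated
            break
    return out

# Two splats: a near semi-transparent red one and a far opaque blue one.
print(composite_pixel(np.array([[1, 0, 0], [0, 0, 1]], float),
                      np.array([0.6, 0.9]),
                      np.array([1.0, 2.0])))
```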

While Gaussian Splatting represents the scene’s geometry implicitly through the positions and covariances of splats, extracting a clean, explicit mesh is not straightforward—especially for translucent or mirror-like surfaces where splats do not encode a single, solid boundary. Unlike density fields (where Marching Cubes can be directly applied), the soft nature of Gaussian splats leads to ambiguities about whether a particular region is “inside” or “outside” a surface. Some recent adaptations of GS explicitly link the Gaussians to underlying geometry, but this is still an active research area. In Trellis, Gaussian Splatting serves as a medium for image reconstruction losses.

1.2 DreamFusion: Pioneering general 3D generation

DreamFusion is the first academic work that successfully generated 3D assets in a generalizable way, published by Poole et al. at ICLR 2023. Prior to its development, 3D generation models showed good quality but were constrained to specific object categories, like chairs, limiting their versatility. DreamFusion, on the other hand, builds upon 2D image generation methods that can generalize across millions of objects.

3D model of “a frog wearing a sweater” Source: DreamFusion

The main trick in DreamFusion’s approach lies in its use of a pre-trained 2D diffusion model, Imagen in this case, to optimize a Neural Radiance Field representation. Given a text prompt describing the object the user wants to generate, DreamFusion fits a single Neural Radiance Field initialized from scratch.

Overview for DreamFusion. Adapted from Poole et al.

Optimization at inference. The pre-trained diffusion model is used as a critic to guide the NeRF towards a plausible generation. The optimization process works by sampling a camera perspective and rendering the corresponding view from the NeRF. Then, the novel view is combined with sampled noise and passed through the diffusion model together with the text prompt. The pre-trained diffusion model predicts the noise that needs to be subtracted from its input to obtain a high quality image conditioned on the text prompt. DreamFusion looks at the difference between the sampled noise and the predicted noise:

  • If the predicted noise is close to the added noise, the noise difference is small, implying that the rendered view was of high quality.
  • Conversely, if the predicted noise deviates significantly from the added noise, this indicates that the diffusion model corrected substantial issues in the image, suggesting that the NeRF's performance was poor.

This clever idea is referred to as Score Distillation Sampling (SDS) in the literature and forms the loss function for fitting the Neural Radiance Field. By iteratively refining the NeRF using this feedback loop, DreamFusion can generate a NeRF of any object that is well-known to the pre-trained image generation model.
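Below is a minimal PyTorch sketch of the SDS trick under strong simplifications: a single image tensor stands in for the NeRF being optimized and a dummy critic stands in for Imagen. The loss is constructed so that its gradient with respect to the rendered view equals the weighted noise difference, which means the diffusion model itself is never back-propagated through:

```python
import torch

def sds_step(rendered, noise_predictor, t, alphas_cumprod, weight=1.0):
    """One Score Distillation Sampling step (simplified sketch).
    rendered: (1, 3, H, W) view rendered from the model being optimized.
    noise_predictor: frozen diffusion critic epsilon(x_t, t); a stand-in here."""
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t]
    # forward diffusion: mix the rendered view with sampled noise
    x_t = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():                       # the diffusion critic is never updated
        pred = noise_predictor(x_t, t)
    grad = weight * (pred - noise)              # SDS gradient w.r.t. the rendered image
    # trick: build a loss whose gradient w.r.t. `rendered` equals `grad`
    return (grad.detach() * rendered).sum()

# Toy usage: optimize a single image tensor standing in for "NeRF parameters".
image = torch.nn.Parameter(torch.rand(1, 3, 64, 64))
opt = torch.optim.Adam([image], lr=1e-2)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)   # placeholder noise schedule
dummy_critic = lambda x, t: torch.zeros_like(x)      # stands in for Imagen
loss = sds_step(image, dummy_critic, t=500, alphas_cumprod=alphas_cumprod)
opt.zero_grad(); loss.backward(); opt.step()
```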

DreamFusion remains just a strong piece of research, however. Its practical applicability is limited due to its modest quality and long inference time (1.5h). Considering that the authors only used 64x64 images, the overall results are still impressive. Follow-up work, such as the Large Reconstruction Model, focused on improving the inference time and quality.

1.3 Large Reconstruction Model

To address slow inference in DreamFusion, the Large Reconstruction Model (LRM) was proposed by Hong et al. (ICLR 2024). LRM takes a different approach to representing Neural Radiance Fields, specifically leveraging the triplane NeRF to improve inference time.

A triplane NeRF is an adapted version of the original Neural Radiance Field, introduced by Chan et al. 2022. While classic NeRFs are entirely implicit functions, just one neural network for predicting everything, triplane NeRFs use intermediate feature maps. These are predicted with a pre-trained backbone that infers features from input images. The pre-trained backbone adds more prior information to the procedure, improving the final RGBσ regression and reducing the inference time. When a triplane NeRF is queried with an XYZ coordinate, a feature vector is aggregated by averaging the values from the closest pixels in the three orthogonal feature maps, as seen below. This aggregated feature vector is then used to regress the eventual RGBσ output with only a few fully-connected layers.

Triplane NeRF representation. Source: Chan et al. (CVPR 2022)
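A minimal PyTorch sketch of such a triplane query (dimensions are illustrative, and implementations aggregate the three planes by summing or averaging):

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, points):
    """Sample features for 3D points from three orthogonal feature planes.
    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, R, R)
    points: (P, 3) coordinates normalized to [-1, 1]."""
    feats = []
    for name, dims in [("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])]:
        # project the 3D point onto the plane and bilinearly interpolate
        grid = points[:, dims].view(1, -1, 1, 2)                  # (1, P, 1, 2)
        sampled = F.grid_sample(planes[name], grid, align_corners=True)
        feats.append(sampled.view(planes[name].shape[1], -1).T)   # (P, C)
    return torch.stack(feats).mean(dim=0)                         # aggregate the planes

C, R, P = 32, 128, 1024
planes = {k: torch.randn(1, C, R, R) for k in ("xy", "xz", "yz")}
mlp = torch.nn.Sequential(torch.nn.Linear(C, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 4))                 # -> (R, G, B, sigma)
points = torch.rand(P, 3) * 2 - 1
rgb_sigma = mlp(query_triplane(planes, points))
print(rgb_sigma.shape)                                            # torch.Size([1024, 4])
```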

Avoiding cubic scaling. Ideally, there would be an NxNxN grid of feature vectors instead of 3 feature maps with NxN grids, for even better locality in the NeRF, but the cubic scaling would be too inefficient to reach high resolutions. The triplane representation proves to be a good tradeoff between resolution and scalability. We will later see how Trellis puts a spin on this and uses voxelized 3D surfaces instead of flat planes to improve locality while remaining efficient.

The Large Reconstruction Model is an image-to-3D model that manages to reduce the inference time to less than 10s thanks to the triplane representation. The authors shifted the computational efforts from test time to train time. A training procedure is introduced where a pre-trained DINO encoder is used together with a custom decoder for inferring triplane maps. The model is trained on multi-view renders from the Objaverse dataset, containing 800k textured 3D objects. After training, text-to-image models can be chained before the LRM to create a text-to-3D model.

Overview of the Large Reconstruction Model work. Figure adapted from InstantMesh

The improvement of LRM over DreamFusion comes from its practical applicability. Where DreamFusion showed that a 3D generative model can generalize over millions of categories, LRM showed that 3D generative models can also be practical. For both works, however, the output quality is still below the level where a creative would start working on the object. DreamFusion's overall resolution was too low, as it only used 64x64 images. LRM was trained with 512x512 images, resulting in improved resolution, but it suffered from inconsistent quality between different views: objects only look good from the perspective of the input image.

1.4 InstantMesh

The Large Reconstruction Model (LRM) set the stage for many follow-up works that extended its pipeline to enhance geometric consistency and overall quality. InstantMesh, Xu et al. 2024, is a notable follow-up that introduced multi-view input and direct feedback from 3D meshes to improve the consistency between different perspectives. Furthermore, InstantMesh brings the possibility of PBR material outputs such as normal maps, which are important for creative workflows.

InstantMesh pipeline. Source: InstantMesh

Multi-view input. InstantMesh uses 6 input images from different perspectives to have richer geometric information. For a user, it is non-trivial to provide 6 images without having the 3D object already. Therefore, Xu et al. use a multi-view diffusion model, Zero123++, to generate 6 views from one input image. Next, the same encoder as in LRM is applied six times separately and the LRM decoder is retrained to work with 6x image tokens.

Geometric feedback. Another significant improvement involves incorporating explicit 3D data into the training process. Until this point, both DreamFusion and LRM had relied exclusively on image-based losses. From an inferred triplane NeRF, the authors extract a 3D mesh in a differentiable manner, allowing mesh reconstruction losses to be propagated back into the LRM. InstantMesh relies on FlexiCubes, a parameterized Marching Cubes algorithm, for this.

Quality. InstantMesh brought 3D generation to a point where it is practically useful. It can generate a wide variety of objects, within a minute, with a level of quality that is sufficient to start working on in some cases. In particular for background props that are not animated, InstantMesh can be useful for inspiration or as a starting point for a sculpt. For important characters that require detail and animation, the geometry, textures and mesh topology are not good enough yet.

Now the question arises: can 3D generation be pushed further to achieve more faithful geometry, higher-quality textures, and greater editability? Considering that the amount of open source 3D objects is limited, this is a tough challenge. This is precisely where Trellis makes its mark, starting from this pipeline and introducing innovations such as voxelized surfaces and latent-space Flow Matching to take 3D generation to the next level.

2. Trellis

Last month Microsoft released Trellis, Xiang et al. 2024, which inherits many ideas from InstantMesh and brings each step in the pipeline to another level.

Two improvements stand out: (1) triplanes are replaced with voxelized 3D surfaces and (2) the generative process is moved from image space to a latent 3D space. The result is high quality geometry, improved textures and a framework that allows more control and editability in the 3D generation process.

Converting textured meshes to surface voxels with DINO feature vectors.

Evolving triplanes to surface voxels. Instead of using triplane feature maps, Trellis introduces sparse voxels with features. In a 64^3 grid, voxels are activated at the surface of the 3D object, as shown in the figure above. 150 renders are made for each object and are encoded into features with a DINOv2 backbone. Each voxel receives features that are projected from the visible DINOv2 output. Compared to a triplane feature representation, the sparse voxel features are closer to the location where the information is needed. At the same time, the dimensionality does not explode: in the Trellis training set, each object reportedly has 20K active voxels on average (7.63% of the 64^3 grid), whereas a triplane representation would have 12K feature pixels (4.69% grid occupation).
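A minimal numpy sketch of that voxelization step, where a synthetic sphere stands in for the mesh surface and random vectors stand in for the projected DINOv2 features:

```python
import numpy as np

GRID = 64

# Points sampled on the object surface; in Trellis these come from the mesh,
# here a synthetic sphere stands in for them.
theta, phi = np.random.rand(2, 20000) * np.array([[np.pi], [2 * np.pi]])
pts = 0.45 * np.stack([np.sin(theta) * np.cos(phi),
                       np.sin(theta) * np.sin(phi),
                       np.cos(theta)], axis=-1) + 0.5      # in [0, 1]^3

# Per-point features standing in for DINOv2 features projected from the renders.
feats = np.random.randn(len(pts), 1024).astype(np.float32)

# Quantize points to voxel indices; a voxel is "active" if any surface point lands in it.
idx = np.clip((pts * GRID).astype(int), 0, GRID - 1)
flat = idx[:, 0] * GRID * GRID + idx[:, 1] * GRID + idx[:, 2]
active, inverse = np.unique(flat, return_inverse=True)

# Average the features of all points that fall into the same voxel.
voxel_feats = np.zeros((len(active), 1024), dtype=np.float32)
np.add.at(voxel_feats, inverse, feats)
voxel_feats /= np.bincount(inverse)[:, None]

print(len(active), "active voxels out of", GRID**3)        # a few % of the grid
```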

[Technical] Decoding features to 3D Gaussian Splats, NeRFs and meshes. From the surface voxels with features, Trellis offers three decoders: one for Gaussian Splats, one for NeRFs and one for triangle meshes. Each of these modalities is non-trivial to predict with deep learning models and requires specific differentiable loss functions. For Gaussian Splats, 32 Gaussian kernels are created for each voxel by predicting the location, scale, orientation, density and color as described in Section 1.1. Views are sampled from the splats and compared to renders from the actual scene with image reconstruction losses (D-SSIM and LPIPS). The Neural Radiance Field decoder predicts 4 feature vectors per voxel that represent a CP-decomposition as in Tensorial Radiance Fields, a supercharged version of triplane NeRFs. The NeRF decoder is trained with the same reconstruction losses as the Gaussian Splats. Meshes are predicted by upsampling the voxel grid from 64^3 to 256^3 and regressing SDF values on the 8 voxel vertices together with all required parameters for FlexiCubes. The reconstruction loss for the mesh decoder is defined on the depth and normal maps rendered from the original 3D object and the reconstructed mesh.

Generative model overview. On the algorithmic side, Trellis differs the most from DreamFusion, LRM and InstantMesh by not performing the generative process in image space, but rather on the sparse feature voxels, staying closer to 3D space. There are two generative models: one for generating surface voxels in 3D and one for generating features in those voxels. Trellis uses Flow Matching in latent space as the generative process. Additionally, the authors curated a higher-quality dataset. It is not clear from their report, however, which improvement was more impactful: the dataset or the algorithm. An overview of all components:

  1. Dataset: a high quality 3D dataset with 500k samples is curated from 10 million objects in Objaverse-XL. From each object, >100 images, depth maps and normal maps are rendered. Surface voxels are generated for all objects. Rendered images are captioned and summarized per object with GPT-4o.
  2. Variational AutoEncoder (VAE): a VAE is trained to create a downsampled latent space in which the generative process will take place, as in Latent Diffusion. This VAE is only used for the model that generates features in the voxels, not for the model that generates the voxel structure.
  3. Flow model training for structure generation: a Flow Matching model is trained to generate voxel structures from a random binary 64x64x64 grid. The model can be conditioned on text or image input. A more detailed description is given below.
  4. Flow model training for latent feature generation: a second Flow Matching model is trained to generate latent features in the surface voxels. This model generates latent features that match the VAE encoder output, conditioned on the voxel structure and text or image input.
Training a Variational AutoEncoder (VAE) in Trellis. Source: Trellis authors

Variational AutoEncoder (VAE) for latent generation. An important trick in generative modeling is to perform the generation process in a downscaled latent space. This speeds up the generation process and improves generalization capabilities significantly as the input dimensionality is lower. Trellis uses this trick as well and trains a VAE encoder to reduce the dimension of the features in each voxel from 1024 to 8. The VAE decoder learns to infer the parameters for the 3 output modalities from these small feature vectors. In practice, the authors first trained an encoder-decoder only with Gaussian Splatting. Afterwards, the final layer in the decoder was replaced with new layers that were trained separately to predict NeRF and FlexiCubes parameters.

VAE architecture. The encoder and decoder architectures are 3D variants of Shifted Window Transformers. Attention is only applied to tokens in a local 8x8x8 window to limit the attention matrix. Each token corresponds to one active voxel; empty voxels do not get a token, reducing the attention matrix by another order of magnitude. Tokens are created by projecting the features of an active voxel with a linear layer and adding a sinusoidal positional encoding based on its location in the grid.
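A rough sketch of such sparse windowed attention (projections, positional encodings and window shifting are omitted; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def windowed_self_attention(tokens, coords, window=8, heads=4):
    """Self-attention restricted to 8x8x8 windows over *active* voxels only.
    tokens: (N, C) one token per active voxel; coords: (N, 3) integer grid positions."""
    N, C = tokens.shape
    win_id = coords // window                              # which window each voxel is in
    flat = (win_id * torch.tensor([64, 8, 1])).sum(dim=1)  # unique id per 8^3 window
    out = torch.empty_like(tokens)
    for w in flat.unique():                                # attend within each window
        sel = (flat == w).nonzero(as_tuple=True)[0]
        x = tokens[sel].view(1, -1, heads, C // heads).transpose(1, 2)  # (1, H, n, d)
        y = F.scaled_dot_product_attention(x, x, x)
        out[sel] = y.transpose(1, 2).reshape(-1, C)
    return out

# Toy input: 5000 active voxels in a 64^3 grid with 64-dim tokens.
coords = torch.randint(0, 64, (5000, 3))
tokens = torch.randn(5000, 64)
print(windowed_self_attention(tokens, coords).shape)       # torch.Size([5000, 64])
```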

Flow Matching for generative modeling. The Trellis authors use Flow Matching in latent space to generate the voxel structure and the voxel features. Flow Matching is similar to Diffusion and has become popular thanks to its faster sampling process. Where Diffusion relies on a Markov Process (MP) to (de)noise samples, Flow Matching defines a more general theory for transforming one distribution into another using vector fields. The theory is more abstract, but in practice the MP is just replaced with linear interpolation. In the formulas below, given a ground-truth sample x_gt and noise ε at timestep t, a noisy sample x_noised is interpolated. Next, the Flow Matching (FM) objective is defined for the denoising model v_θ(x, t) with parameters θ, which learns to predict the direction of the noise added to x_noised.
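One common way to write this linear-interpolation variant (conventions differ slightly between papers):

$$ \mathbf{x}_{\text{noised}} = (1 - t)\,\mathbf{x}_{\text{gt}} + t\,\boldsymbol{\epsilon}, \qquad t \in [0, 1] $$

$$ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,\boldsymbol{\epsilon}} \left\| v_\theta(\mathbf{x}_{\text{noised}}, t) - (\boldsymbol{\epsilon} - \mathbf{x}_{\text{gt}}) \right\|^2 $$

The regression target ε − x_gt is the constant velocity of the interpolation; following it in the reverse direction at sampling time moves a noise sample back towards the data distribution.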

Generation process for sparse voxels representing 3D surfaces. Inputs: random binary grid and text/image conditioning. Adapted from Xiang et al.

Structure generation. The first Flow Transformer learns to generate sparse structures, where empty/active voxels are indicated by 0/1. Since this Transformer is not windowed and works on a dense 64x64x64 grid, there would be 262k tokens to attend to at once. As generating directly in the 64x64x64 space is computationally expensive, the authors train additional downsampling and upsampling layers that compress the binary grid into a 16x16x16 grid with latent vectors of 8 dimensions. Given text or image conditioning (cross-attention with CLIP or DINOv2 tokens) and downsampled random grids, the model is trained to generate latent vectors that are upsampled back to a binary grid.
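To make this concrete, here is a heavily simplified PyTorch sketch of one Flow Matching training step for the structure model; the convolutional downsampler and the small network standing in for the Flow Transformer are illustrative stand-ins, and the timestep and text/image conditioning of the real model are omitted:

```python
import torch
import torch.nn as nn

# Illustrative only: toy modules stand in for the Trellis downsampler and Flow
# Transformer; shapes follow the text (64^3 binary grid -> 16^3 latent, 8 channels).
downsample = nn.Conv3d(1, 8, kernel_size=4, stride=4)        # 64^3 -> 16^3, 8 channels
flow_model = nn.Sequential(nn.Conv3d(8, 64, 3, padding=1), nn.GELU(),
                           nn.Conv3d(64, 8, 3, padding=1))   # stand-in for v_theta
opt = torch.optim.Adam(flow_model.parameters(), lr=1e-4)

binary_grid = (torch.rand(2, 1, 64, 64, 64) > 0.9).float()   # fake training batch
with torch.no_grad():
    x_gt = downsample(binary_grid)                            # latent structure (2, 8, 16^3)

# One Flow Matching training step: interpolate towards noise, regress the velocity.
t = torch.rand(x_gt.shape[0], 1, 1, 1, 1)                     # one timestep per sample
eps = torch.randn_like(x_gt)
x_noised = (1 - t) * x_gt + t * eps
loss = ((flow_model(x_noised) - (eps - x_gt)) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```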

Structured latent feature generation for each active voxel, initialized with random 8-dim vectors. Adapted from Xiang et al.

Latent feature generation. For the last step, a Flow Transformer is trained to generate features for the surface voxels in the VAE latent space. Similar to the structure generation, the authors introduce additional downsampling and upsampling layers to improve the efficiency. After the latent feature generation, the VAE decoders can infer Gaussian Splats, a NeRF or a 3D mesh.

Image-to-3D comparison between InstantMesh and Trellis. Source: Trellis

Output quality and usability. Overall, Trellis takes a big step forward in texture quality and geometric detail. For simple background props, it is almost possible to generate game-ready geometry. Textures often need some reworking to become render-ready. The paper reports a user study where Trellis is compared to InstantMesh and 5 other scientific works (not including commercial tools), finding that Trellis is preferred by users in 67.1% of the text-to-3D generations and in 94.5% of the image-to-3D generations.

Practical advantages in Trellis. With these core ideas in place, Trellis offers several practical advantages that set it apart from prior work.

  1. Creatives typically work in iterations to improve their work. While NeRFs and meshes from the previous 3D generative models are typically non-trivial to edit, the sparse voxels in Trellis are much easier to touch up. In addition, the latent generative process can be masked to only change a specific region.
  2. Trellis was released under an MIT license, which means you can build with it however you want, even commercially. However, there are a couple of Nvidia dependencies that are not available for commercial use. For the open source community, Trellis is an amazing starting point, for which we will hopefully see tooling similar to image generation models, e.g. infill models, upscalers, ...

3. Trellis and friends

While we’ve already discussed Trellis, other commercial tools released in the past 12 months also deliver outstanding results. Notably, Tripo by Tripo AI, Rodin by Hyper3D, SPAR3D by StabilityAI and the just-released Hunyuan3D-2 by Tencent have pushed the boundaries of what’s possible, all focusing on image-to-3D. To showcase the differences and similarities among these tools, we provide a visual comparison of their outputs below.

Visual comparison discussion. These conclusions are based on more than 2 examples, but for brevity I have only included 2. As seen in the examples above, Trellis, Tripo and Hunyuan3D are closest to the intended geometry in the image, although Trellis and Hunyuan3D fail to handle the ladder well in the adventurer case. Rodin also shows high quality geometry, but seems to deviate from the input image. For texture quality, Tripo and Rodin seem to show the highest quality, where Tripo is definitely closest to the original image. Trellis and Hunyuan3D also remain close to the original image, but have a dark tone in all of their generations. SPAR3D is clearly subpar in quality, but is much faster. In the table below, there is a detailed comparison considering quality and practical aspects.

Detailed comparison table considering quality, inference time, cost per use and availability.

Behind the scenes. The commercial tools have their grounding in research and all companies published a technical paper at some point. Given the background in Section 1 and Section 2, it is not hard to have a high-level understanding of the other models:

  • Tripo, report, was originally based on LRM, with improved data curation and small changes to reduce GPU memory usage. Their report is the oldest among all compared models and it is unclear how close their current model is to the one in the report. Given the high output quality, their dataset and/or architecture have presumably improved significantly.
  • SPAR3D, report, was only released this month and shows a generative process working on sparse point clouds (instead of the surface voxels in Trellis) that allows very flexible editing. The point cloud is decoded into a triplane NeRF with CP-decomposition as in Trellis, from which geometry is extracted with differentiable Marching Cubes, as well as PBR materials with the RENI++ illumination model. Its inference speed and control through point cloud editing are remarkable; let's hope that the output quality improves in the coming year. For the moment, SPAR3D remains a strong piece of research, but not a practical tool.
  • Rodin, report, differs the most from the previously covered works. Just like Trellis, it breaks down the generation process into one generative model for geometry and one for materials. The geometric generative model works on downsampled point clouds (based on 3DShape2VecSet), something in between SPAR3D's point cloud diffusion and Trellis' downsampled voxel generation, followed by a decoder that predicts a volumetric function to which the Marching Cubes algorithm is applied. For generating textures, Rodin has a very interesting approach that combines existing image generation models and UV mapping.
  • Hunyuan3D-2, report, is similar to Rodin's pipeline. Its VAE is also inspired by the 3DShape2VecSet architecture and there are separate models for geometry generation and texture generation. The most notable differences introduced by Hunyuan3D-2 are in the texture generation, e.g. a de-lighting model for obtaining albedo colors from input images, and the use of Flow Matching Transformers as in Flux.

Conclusion

In conclusion, there have been major leaps forward in the past 2 years for 3D generation. The resulting models are never just one neural network, but rather a pipeline of existing image models glued together with small custom networks. The results are great for generating simple assets as inspiration or as a starting point for 3D modeling. For important characters and objects requiring animations, there are still significant barriers before they become truly production-ready. Most notably, topologies are often not suited for rigging or animation, and the outputs can require extensive cleanup or manual retopology. Even high-quality generation models miss crucial surface attributes like UV layouts or PBR-friendly materials, making advanced texturing or lighting workflows a challenge. For many industries, ranging from game development to robotics, consistency and precise scale are essential, yet these generative models lack the tools to control the generative process within tight tolerances.

It’s these types of pain points that Datameister tackles head-on, merging research with engineering rigor to produce tools that are not just concept demonstrations but fully integrated systems for commercial and creative use. In case you are looking for a partner to build 3D generative applications, do not hesitate to contact us.

For 2025, Datameister is looking to hire several ML engineers and interns working on computer vision & graphics in the creative world, robotics simulation and sport analytics. If you consider applying for internship or a full-time position, send us an email at hello@datameister.ai!

A big thank you to Ruben Verhack and Liam Wezenbeek for proofreading and providing feedback!