2025 was the year in which 3D generation finally took off and matured from research demos to early adoption in several industries. 3D generative AI enables fast ideation in design workflows, allowing designers to explore bold ideas faster and explicitly in 3D. Datameister has created precise control and editing capabilities such as masked generation for 3D generative models. These capabilities already serve several design studios in the automotive, fashion and medical industries.
One of the biggest open-source releases in this field was Trellis by Microsoft, released in December 2024. It was one of the first models worthy of the name “foundation model” for 3D assets. Trellis incorporates knowledge of 3D geometry and texture directly into its architecture. For a full introduction to Trellis and 3D generative modeling in general, you can read this blog post by our colleague Jarne.

Introducing Trellis 2
Last week, the Trellis authors released follow-up work with the unsurprising name Trellis 2. The release further scales 3D generation and improves the underlying latent representations. It is a 4B-parameter model, double the size of the previous release, and represents a significant step forward in 3D generative modeling for the following reasons:
- Direct integration of PBR material representations in the pipeline
- Flexible trade-off between the granularity of the 3D voxel grid and the generation speed
- Ability to upsample geometries and their corresponding materials
- Improved overall quality in both reconstruction and visual fidelity
Trellis 2 can handle structures of up to 1536 x 1536 x 1536 voxels, with generation taking about 60 seconds on a single H100 GPU. The biggest architectural changes behind these results lie in the way 3D assets are encoded in Structured Latents (SLats):
- a novel sparse 3D voxel structure called the Omni-Voxel representation (O-Voxels), which encodes precise geometry and complex textures simultaneously
- a new compression architecture, the Sparse Compression VAE, which replaces the original flow transformer VAE and converts the O-Voxels into SLats with impressive downsampling efficiency

The sparse voxel structure of Trellis is retained, ensuring editing and masked generation are still possible. This is one of the most important reasons we started using Trellis in the first place. As we will point out below, O-Voxels actually make Trellis 2 even better suited to these types of tasks. Let’s go through the two main innovations in detail, and see how they affect the use of Trellis 2 in practice!
Omni-Voxels: the building blocks of Trellis 2
In the original Trellis architecture, 3D feature maps of the asset were obtained by taking renders of the original mesh. From these renders, DINOv2 features were extracted and projected onto a voxelized structure. This came with several disadvantages: details and sharp edges were easily lost, open parts of the mesh were hard to handle, and lighting effects were baked into the textures. Moreover, feature extraction with DINOv2 slowed down the pipeline.

The Trellis 2 authors introduce the Omni-Voxel representation instead, a native type of mapping that can be derived directly from the original asset. The O-Voxel representation is essentially a collection of parameter tuples (fshape(i), fmat(i), p(i)), one per active voxel i:
- fshape(i) contains geometric parameters for creating a Flexible Dual Grid, a representation that takes into account edge intersection information of the mesh with the voxel grid
- fmat(i) holds material properties, including the base color, metallic ratio, roughness, and opacity following standard physically-based rendering (PBR) conventions. These properties are pooled per voxel in the dual grid
- p(i) simply denotes the coordinate of the i-th voxel
As in the original Trellis architecture, features only exist on the surface of the mesh, creating a sparse voxel structure. Geometry and appearance now live in the same latent space and will be spatially aligned during generation.
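To make the tuple structure above more concrete, here is a minimal sketch of how such a sparse container could look in Python. This is not the Trellis 2 implementation: the class name `OVoxels`, the helper `pool_materials`, the six-channel material layout and the use of a plain average for pooling are all our own assumptions for illustration, and the shape features are left as an empty placeholder.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OVoxels:
    """Hypothetical container for a sparse O-Voxel grid (illustrative names only)."""
    coords: np.ndarray   # (N, 3) integer coordinates p of the active voxels
    f_shape: np.ndarray  # (N, D_shape) geometric parameters (layout not modeled here)
    f_mat: np.ndarray    # (N, 6) assumed layout: base color RGB, metallic, roughness, opacity


def pool_materials(sample_coords, sample_mats):
    """Average material samples over the voxel each sample falls in (assumed pooling rule)."""
    # Unique active voxels, plus an index mapping every sample to its voxel.
    voxels, inverse = np.unique(sample_coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(voxels), sample_mats.shape[1]))
    np.add.at(sums, inverse, sample_mats)  # unbuffered scatter-add per voxel
    counts = np.bincount(inverse, minlength=len(voxels))[:, None]
    return voxels, sums / counts


# Three surface samples: two land in voxel (0, 0, 0), one in voxel (1, 2, 3).
sample_coords = np.array([[0, 0, 0], [0, 0, 0], [1, 2, 3]])
sample_mats = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.5, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.5, 1.0],
    [0.2, 0.4, 0.6, 0.0, 0.9, 1.0],
])
voxels, f_mat = pool_materials(sample_coords, sample_mats)
ovox = OVoxels(coords=voxels, f_shape=np.zeros((len(voxels), 0)), f_mat=f_mat)
```

Only voxels that actually receive samples are stored, which is what keeps the structure sparse: memory scales with the surface area of the asset rather than with the full grid volume.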
Bi-directional translation
O-Voxels are essentially voxelized information about the mesh grounded in physical reality. This makes the process of O-Voxelization optimization- and rendering-free. Creating a voxelized structure from the mesh of a 3D asset therefore only takes a few seconds on a CPU. Moreover, because of their algorithmic nature, O-Voxels can be easily converted back to a mesh.
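To illustrate what "optimization- and rendering-free" means in practice, the sketch below finds the active surface voxels of a triangle mesh with nothing but NumPy arithmetic. It is a deliberately simplified stand-in: it only computes occupancy by random surface sampling, it does not build the Flexible Dual Grid, and the function name `voxelize_triangles` and its parameters are hypothetical. The mesh is assumed to be normalized to the unit cube.

```python
import numpy as np


def voxelize_triangles(vertices, faces, resolution, samples_per_face=256, seed=0):
    """Return the unique voxel coordinates hit by points sampled on the mesh surface.

    Assumes the mesh is normalized to the unit cube [0, 1]^3.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]  # (F, 3, 3): three corner positions per face
    # Uniform barycentric sampling: reflect samples that fall outside the triangle.
    u = rng.random((len(faces), samples_per_face, 1))
    v = rng.random((len(faces), samples_per_face, 1))
    flip = (u + v) > 1.0
    u = np.where(flip, 1.0 - u, u)
    v = np.where(flip, 1.0 - v, v)
    a, b, c = tri[:, None, 0], tri[:, None, 1], tri[:, None, 2]  # each (F, 1, 3)
    points = a + u * (b - a) + v * (c - a)  # (F, S, 3) points on the surface
    # Quantize to integer voxel indices and keep each active voxel once.
    idx = np.floor(points.reshape(-1, 3) * resolution).astype(np.int64)
    idx = np.clip(idx, 0, resolution - 1)
    return np.unique(idx, axis=0)


# A flat square at z = 0.5, built from two triangles, voxelized at resolution 8.
vertices = np.array([[0, 0, 0.5], [1, 0, 0.5], [1, 1, 0.5], [0, 1, 0.5]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
coords = voxelize_triangles(vertices, faces, resolution=8)
```

Every returned coordinate lies in the single voxel layer that contains the square, so at most 64 of the 512 grid cells are active. Random sampling can miss voxels that a triangle only grazes; an exact method would test triangle-voxel intersections instead.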
The bidirectional conversion between O-Voxels and meshes implies that the output of Trellis 2 can only be a mesh. The original Trellis architecture also had output decoders for Gaussian Splats and Neural Radiance Fields. Users of Trellis 2 interested in these 3D representations will have to rely on other methods to convert the resulting meshes into Gaussian Splats or radiance fields.
Sparse Compression VAE
The second important change in Trellis 2 is the Sparse Compression VAE (SC-VAE). This VAE is no longer flow-transformer based, but is a fully convolutional network. It is specifically designed to downsample the voxel grid at a high ratio. This results in a compact latent space, even for high-resolution voxel structures (up to 1536 x 1536 x 1536), while remaining computationally efficient.

The dedicated design of the SC-VAE allows for an impressive 16x downsampling ratio. A fully textured 1024 x 1024 x 1024 asset can then be encoded in only around 9.6k latent sparse surface voxels on average. The authors furthermore wrote Triton kernels to create a custom high-performance backend for the SC-VAE called FlexGEMM. This backend is compatible with NVIDIA GPUs and, in theory, AMD GPUs, and further speeds up inference and training.
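A quick back-of-the-envelope calculation shows how compact this is. The snippet assumes that the 16x ratio applies along each spatial axis, which is our reading of the numbers above; the variable names are ours.

```python
voxel_resolution = 1024      # O-Voxel grid resolution per axis
downsampling = 16            # assumed to apply along each spatial axis
avg_latent_voxels = 9_600    # "around 9.6k" active latent voxels on average

latent_resolution = voxel_resolution // downsampling  # latent grid size per axis
dense_latent_cells = latent_resolution ** 3           # cells in a dense latent grid
occupancy = avg_latent_voxels / dense_latent_cells    # fraction that is actually stored

print(latent_resolution, dense_latent_cells, f"{occupancy:.1%}")
# prints: 64 262144 3.7%
```

Under this assumption, the latent grid is 64 x 64 x 64, and only a few percent of its cells carry a latent vector. The generative model therefore works on roughly ten thousand tokens rather than a quarter of a million.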
Generative modeling
O-Voxelization and the SC-VAE are essential in the first training stage of Trellis 2. In this stage, the SC-VAE learns effective Structured Latent representations (SLats) from the training data. Generative modeling is the second stage, turning a text or image prompt into SLats that align with the previously learned representations. These SLats are subsequently decoded by the SC-VAE into an O-Voxel grid, which can then be converted back to the final output mesh.
The generation process in Trellis 2 consists of three steps instead of the two in the original Trellis architecture:
- Sparse structure generation: predicting the sparse voxel grid, i.e., which voxels are active. This first step is unchanged.
- Geometry generation: in this second step, geometry latents are predicted independently of material latents. For this, a first SC-VAE is trained to model only shape latents. Decoding these latents yields the fshape parameters of the final O-Voxels.
- Material generation: a novel material generation stage models PBR materials directly in native 3D space, jointly conditioned on the input image and the predicted geometry latents. A second SC-VAE is trained for this, similarly conditioned on the shape latents of the first SC-VAE. Decoding these latents yields the fmat parameters of the final O-Voxels.

Trellis 2 thus splits the SLat generation step into two parts, making use of the fact that an O-Voxel is described by a geometrical feature and a material feature independently of one another. This ensures materials remain consistent under arbitrary topology.
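The data flow between the three steps can be sketched as follows. To be clear, this is not the Trellis 2 API: every function name is hypothetical, the latent dimension and voxel count are arbitrary, and each learned generative model is replaced by a random-number placeholder. The only thing the sketch is meant to show is which stage consumes which output.

```python
import numpy as np

rng = np.random.default_rng(0)


def generate_sparse_structure(image, latent_res=64, n_active=100):
    """Stage 1 placeholder: decide which latent voxels are active."""
    flat = rng.choice(latent_res ** 3, size=n_active, replace=False)
    return np.stack(np.unravel_index(flat, (latent_res,) * 3), axis=1)  # (N, 3)


def generate_geometry_latents(image, coords, dim=8):
    """Stage 2 placeholder: one shape latent per active voxel."""
    return rng.standard_normal((len(coords), dim))


def generate_material_latents(image, coords, shape_latents, dim=8):
    """Stage 3 placeholder: material latents, given the image AND the shape latents."""
    return rng.standard_normal((len(coords), dim))


def generate_asset_latents(image):
    coords = generate_sparse_structure(image)                  # which voxels exist
    z_shape = generate_geometry_latents(image, coords)         # decoded later to fshape
    z_mat = generate_material_latents(image, coords, z_shape)  # decoded later to fmat
    return coords, z_shape, z_mat


image = np.zeros((512, 512, 3))  # dummy image prompt
coords, z_shape, z_mat = generate_asset_latents(image)
print(coords.shape, z_shape.shape, z_mat.shape)
```

Both latent sets are indexed by the same `coords`, which is how geometry and materials stay spatially aligned even though they are generated in separate steps.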
Conclusion
Trellis 2 sets a new benchmark for foundational 3D models by combining native 3D representations with efficient compression and more robust generative stages. The shift to Omni-Voxels and geometry–material decoupling results in higher fidelity assets, faster processing, and better alignment with real-world 3D workflows, especially where control and physical correctness matter most.
We have already started integrating Trellis 2 in our pipelines, and the results look promising. If you are looking to bring state-of-the-art 3D generative models into reliable, production-grade workflows, Datameister is ready to help you turn cutting-edge research into real-world impact.