Trellis 2: Scaling 3D Generation with Improved Efficiency and Control

In a year marked by rapid advances in 3D generative modeling, Trellis 2 stands out as one of the most exciting architectural updates. It introduces Omni-Voxels, a native 3D representation that encodes geometry and PBR materials directly in aligned 3D space. Combined with the new Sparse Compression VAE, this enables more efficient compression of very high-resolution assets and faster inference.

Because the sparse, surface-based structure is preserved, Trellis 2 maintains strong support for masked generation and localized edits, now operating on a more compact and better-aligned latent representation. Geometry and materials are handled separately, making edits more predictable and ensuring material consistency under topology changes.

Result: a scalable 3D generation pipeline with higher fidelity, improved computational efficiency, and significantly better editability and control for real-world design workflows.

2025 was the year in which 3D generation finally took off, maturing from research demos to early adoption in several industries. 3D generative AI enables fast ideation in design workflows, allowing designers to explore bold ideas faster and directly in 3D. Datameister has built precise control and editing capabilities for 3D generative models, such as masked generation. These capabilities already serve several design studios in the automotive, fashion and medical industries.

One of the biggest open-source releases in this field was Trellis by Microsoft, released in December 2024. It was one of the first models worthy of the name “foundation model” for 3D assets. Trellis incorporates knowledge of 3D geometry and texture directly into its architecture. For a full introduction to Trellis and 3D generative modeling in general, you can read this blog post by our colleague Jarne.

Trellis 2 sets the new standard for 3D generative modeling

Introducing Trellis 2

Last week, the Trellis authors released follow-up work with the unsurprising name Trellis 2. The release further scales 3D generation and improves the underlying latent representations. At 4B parameters, the model is twice the size of the previous release. It represents a significant step forward for 3D generative modeling for the following reasons:

Direct integration of PBR material representations in the pipeline
Flexible trade-off between the granularity of the 3D voxel grid and the generation speed
Ability to upsample geometries and their corresponding materials
Improved overall quality in both reconstruction and visual fidelity

Trellis 2 can handle structures of up to 1536 x 1536 x 1536 voxels, generating an asset in ~60 seconds on a single H100 GPU. The biggest architectural changes behind these results lie in the way 3D assets are encoded in Structured Latents (SLat’s):

a novel sparse 3D voxel structure called Omni-Voxel representation (O-Voxels) encodes precise geometry and complex textures simultaneously
a new compression architecture, the Sparse Compression VAE, replaces the original flow transformer VAE. It converts the O-Voxels into SLat’s with a high downsampling ratio

The most important innovations in the Trellis 2 encoder (source: Trellis 2 project page)

The sparse voxel structure of Trellis is retained, ensuring editing and masked generation are still possible. This is one of the most important reasons we started using Trellis in the first place. As we will point out below, O-Voxels actually make Trellis 2 even better suited to these types of tasks. Let’s go through the two main innovations in detail, and see how they affect the use of Trellis 2 in practice!

Omni-voxels: the building blocks of Trellis 2

In the original Trellis architecture, 3D feature maps of the asset were obtained by taking renders of the original mesh. From these renders, DINOv2 features were extracted and projected onto a voxelized structure. This came with several disadvantages: details and sharp edges were easily lost, open parts of the mesh were hard to handle, and lighting effects were baked into the textures. Moreover, feature extraction with DINOv2 slowed down the pipeline.

O-Voxels replace the voxelized structure obtained through projection of 2D DINOv2 features in the original Trellis architecture. This allows for near-instant conversion between 3D assets and voxelized structure.

The Trellis 2 authors instead introduce the Omni-Voxel representation, a native mapping that can be derived almost instantly from the original asset. An O-Voxel representation is essentially a collection of parameter tuples (f_shape(i), f_mat(i), p(i)), one per active voxel i:

f_shape(i) contains geometric parameters for creating a Flexible Dual Grid, a representation that takes into account where the mesh intersects the edges of the voxel grid
f_mat(i) holds material properties: base color, metallic ratio, roughness, and opacity, following standard physically-based rendering (PBR) conventions. These properties are pooled per voxel in the dual grid
p(i) simply denotes the coordinate of the i-th voxel

As in the original Trellis architecture, features only exist on the surface of the mesh, creating a sparse voxel structure. Geometry and appearance now live in the same latent space and will be spatially aligned during generation.
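To make the tuple structure concrete, here is a minimal Python sketch of a sparse O-Voxel container. It is an illustration under stated assumptions, not the authors' implementation: the names `OVoxel` and `voxelize_surface` are hypothetical, the input is a set of pre-sampled surface points rather than a mesh, and the per-voxel centroid offset is only a stand-in for the real Flexible Dual Grid parameters. What it does show is the sparsity (only voxels touched by the surface exist) and the per-voxel pooling of PBR attributes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class OVoxel:
    coord: tuple    # integer (x, y, z) coordinate of the voxel
    f_shape: tuple  # stand-in geometry feature (NOT the real dual-grid parameters)
    f_mat: tuple    # pooled (r, g, b, metallic, roughness, opacity)


def voxelize_surface(samples, resolution):
    """Group surface samples by voxel; samples are (position, material) pairs
    with positions in the unit cube [0, 1)^3."""
    buckets = defaultdict(list)
    for pos, mat in samples:
        coord = tuple(min(int(c * resolution), resolution - 1) for c in pos)
        buckets[coord].append((pos, mat))

    voxels = {}
    for coord, items in buckets.items():
        n = len(items)
        # Mean sample position, expressed as an offset inside the voxel.
        offset = tuple(
            sum(p[k] for p, _ in items) / n * resolution - coord[k] for k in range(3)
        )
        # Average each material channel over the samples that fall in this voxel.
        pooled = tuple(sum(m[k] for _, m in items) / n for k in range(len(items[0][1])))
        voxels[coord] = OVoxel(coord, offset, pooled)
    return voxels


samples = [
    ((0.10, 0.10, 0.10), (1.0, 0.0, 0.0, 0.0, 0.5, 1.0)),
    ((0.20, 0.20, 0.20), (0.0, 0.0, 1.0, 1.0, 0.5, 1.0)),
    ((0.90, 0.90, 0.90), (0.0, 1.0, 0.0, 0.0, 0.25, 1.0)),
]
voxels = voxelize_surface(samples, resolution=4)
print(len(voxels), "active voxels out of", 4 ** 3)  # 2 active voxels out of 64
```

The first two samples land in the same voxel, so their materials are averaged into a single feature, while the empty interior of the grid is never stored.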

Bi-directional translation

O-Voxels are essentially voxelized information about the mesh grounded in physical reality. This makes the process of O-Voxelization optimization- and rendering-free. Creating a voxelized structure from the mesh of a 3D asset therefore only takes a few seconds on CPU. Moreover, because of their algorithmic nature, O-Voxels can be easily converted back to a mesh.
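The reverse direction can be illustrated with a simplified, dual-contouring-style sketch. It assumes that each active voxel stores one dual vertex (as an offset inside the voxel) plus three flags, where flag `a` means the surface crosses the grid edge that starts at the voxel's minimum corner and runs along axis `a`. This layout and the function name `extract_quads` are assumptions for illustration; the exact parameterization used by Trellis 2 may differ. For every flagged edge, the four voxels sharing that edge are connected into one quad.

```python
def extract_quads(voxels):
    """voxels: {(x, y, z): (dual_vertex_offset, (flag_x, flag_y, flag_z))}.
    Returns (vertices, quads); quads hold indices into vertices."""
    order = sorted(voxels)
    index = {coord: i for i, coord in enumerate(order)}
    # One mesh vertex per active voxel: voxel corner plus stored offset.
    vertices = [tuple(c[k] + voxels[c][0][k] for k in range(3)) for c in order]

    quads = []
    for coord, (_, flags) in voxels.items():
        for axis in range(3):
            if not flags[axis]:
                continue
            b, c = (axis + 1) % 3, (axis + 2) % 3
            ring = []
            # The four voxels that share this edge, in cyclic order.
            for db, dc in ((0, 0), (-1, 0), (-1, -1), (0, -1)):
                neighbour = list(coord)
                neighbour[b] += db
                neighbour[c] += dc
                ring.append(index.get(tuple(neighbour)))
            if None not in ring:  # skip edges with a missing neighbour
                quads.append(tuple(ring))
    return vertices, quads


center = (0.5, 0.5, 0.5)
demo = {
    (0, 0, 0): (center, (False, False, False)),
    (0, 1, 0): (center, (False, False, False)),
    (1, 0, 0): (center, (False, False, False)),
    (1, 1, 0): (center, (False, False, True)),  # surface crosses its z-aligned edge
}
vertices, quads = extract_quads(demo)
print(quads)  # [(3, 1, 0, 2)]
```

The sketch ignores face orientation and quad triangulation, but it shows why the conversion is purely algorithmic: connectivity follows directly from the stored flags, with no optimization or rendering involved.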

The bidirectional conversion between O-Voxels and meshes implies that the output of Trellis 2 can only be a mesh. The original Trellis architecture also had output decoders for Gaussian Splats and Neural Radiance Fields. Users of Trellis 2 interested in these 3D representations will have to rely on other methods to convert the resulting meshes into Gaussian Splats or radiance fields.

Sparse Compression VAE

The second important change in Trellis 2 is the Sparse Compression VAE (SC-VAE). This VAE is no longer flow-transformer based but a fully convolutional network, specifically designed for high-ratio spatial downsampling of the voxel grid. This results in a compact latent space, even for high-resolution voxel structures (up to 1536 x 1536 x 1536), while remaining computationally efficient.

The dedicated design of SC-VAE focuses on efficiency for high-resolution feature maps
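As a rough intuition for spatial downsampling on a sparse grid, the sketch below merges all active voxels that fall into the same coarse block and averages their features. This is a deliberately naive stand-in: the real SC-VAE uses learned sparse convolutions, and the function name `downsample_sparse` is hypothetical.

```python
from collections import defaultdict


def downsample_sparse(features, factor):
    """features: {(x, y, z): [channel values]}. Returns a coarser sparse grid
    in which each block of factor^3 fine voxels maps to one coarse voxel."""
    buckets = defaultdict(list)
    for (x, y, z), feat in features.items():
        buckets[(x // factor, y // factor, z // factor)].append(feat)
    # Average pooling per channel; a learned network would replace this step.
    return {
        coord: [sum(channel) / len(channel) for channel in zip(*feats)]
        for coord, feats in buckets.items()
    }


fine = {(0, 0, 0): [1.0], (15, 15, 15): [3.0], (16, 0, 0): [5.0]}
coarse = downsample_sparse(fine, factor=16)
print(coarse)  # {(0, 0, 0): [2.0], (1, 0, 0): [5.0]}
```

Coarse voxels only exist where at least one fine voxel was active, so the sparsity of the surface carries over to the latent grid.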

The dedicated design of the SC-VAE allows for a 16x downsampling ratio. A fully textured 1024 x 1024 x 1024 asset can then be encoded in only around 9.6k sparse latent surface voxels on average. The authors furthermore wrote Triton kernels to create a custom high-performance backend for the SC-VAE called FlexGEMM, which further speeds up inference and training. This backend is, in theory, compatible with both NVIDIA and AMD GPUs.
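A quick back-of-the-envelope calculation puts these numbers in perspective. It assumes the 16x ratio applies per axis, which is our reading of the figures above; the helper `latent_grid_stats` is just for illustration.

```python
def latent_grid_stats(resolution, downsample, active_latents=None):
    """Side length and dense cell count of the latent grid, plus the share of
    cells that are occupied when a number of active latents is given."""
    side = resolution // downsample
    dense = side ** 3
    occupancy = None if active_latents is None else active_latents / dense
    return side, dense, occupancy


side, dense, occupancy = latent_grid_stats(1024, 16, active_latents=9600)
print(side, dense, f"{occupancy:.1%}")  # 64 262144 3.7%

side_max, dense_max, _ = latent_grid_stats(1536, 16)
print(side_max, dense_max)  # 96 884736
```

Under this assumption, a 1024-resolution asset maps to a 64 x 64 x 64 latent grid of which only about 3.7% is occupied on average, which is why storing surface voxels alone keeps the latent representation so compact.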

Generative modeling

O-Voxelization and the SC-VAE are essential in the first training stage of Trellis 2. In this stage, the SC-VAE learns effective Structured Latent representations (SLat’s) from the training data. Generative modeling is the second step, turning a text or image prompt into SLat’s and aligning them with the previously learned representations. They are subsequently decoded by the SC-VAE to an O-Voxel grid, which can then be converted back to the final output mesh.

The generation process in Trellis 2 consists of three steps, instead of the two in the original Trellis architecture:

Sparse structure generation: predicting the sparse voxel grid, i.e., which voxels are active. This first step remains identical to the original.
Geometry generation: in this second step, geometry latents are predicted independently of material latents. For this, a first SC-VAE is trained to model only shape latents. Decoding these latents yields the f_shape parameters of the final O-Voxels.
Material generation: a novel material generation stage models PBR materials directly in native 3D space, jointly conditioned on the input image and the predicted geometry latents. A second SC-VAE is trained for this, similarly conditioned on the shape latents of the first SC-VAE. Decoding these latents yields the f_mat parameters of the final O-Voxels.
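The data flow of the three steps can be sketched as follows. All function names are hypothetical and the "models" are random stubs that ignore the prompt, so this is not the Trellis 2 API. The point is only the dependency structure: materials are sampled on the same voxels as the geometry and are conditioned on the geometry latents, so they can be re-sampled while the shape stays fixed.

```python
import random


def generate_sparse_structure(prompt, rng):
    """Step 1 stand-in: decide which voxels of an 8^3 grid are active."""
    return sorted({tuple(rng.randrange(8) for _ in range(3)) for _ in range(20)})


def generate_geometry_latents(prompt, coords, rng):
    """Step 2 stand-in: one shape latent per active voxel."""
    return {c: [rng.gauss(0, 1) for _ in range(4)] for c in coords}


def generate_material_latents(prompt, shape_latents, rng):
    """Step 3 stand-in: material latents conditioned on the shape latents."""
    return {
        c: [0.5 * z + rng.gauss(0, 1) for z in latent]
        for c, latent in shape_latents.items()
    }


def generate_asset(prompt, seed=0, material_seed=None):
    rng = random.Random(seed)
    coords = generate_sparse_structure(prompt, rng)
    shape = generate_geometry_latents(prompt, coords, rng)
    # A separate seed lets us vary materials without touching the geometry.
    mat_rng = rng if material_seed is None else random.Random(material_seed)
    material = generate_material_latents(prompt, shape, mat_rng)
    # In the real pipeline, each latent set would now be decoded by its SC-VAE.
    return {c: (shape[c], material[c]) for c in coords}


a = generate_asset("chair.png", seed=0, material_seed=1)
b = generate_asset("chair.png", seed=0, material_seed=2)
print(all(a[c][0] == b[c][0] for c in a))  # True: identical geometry latents
```

Both runs share the same active voxels and shape latents and differ only in their material latents, which mirrors how the decoupled stages make material edits predictable.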

End-to-end generation pipeline of Trellis 2

Trellis 2 thus splits the SLat generation step in two parts, making use of the fact that an O-Voxel is described by a geometrical feature and a material feature independently from one another. This ensures materials remain consistent under arbitrary topology.

Conclusion

Trellis 2 sets a new benchmark for foundational 3D models by combining native 3D representations with efficient compression and more robust generative stages. The shift to Omni-Voxels and geometry–material decoupling results in higher fidelity assets, faster processing, and better alignment with real-world 3D workflows, especially where control and physical correctness matter most.

We have already started integrating Trellis 2 in our pipelines, and the results look promising. If you are looking to bring state-of-the-art 3D generative models into reliable, production-grade workflows, Datameister is ready to help you turn cutting-edge research into real-world impact.