If you’ve tried finetuning a large 3D generative model like Trellis, you’ve likely discovered that it can be deceptively hard. At Datameister, we regularly tackle this challenge when adapting generative models to domain-specific use cases. If you’re new to Trellis or want a quick refresher on its pipeline, this blog by Jarne is a good starting point.
The out-of-the-box performance of Trellis is impressive. The base model was trained on a general dataset of 500,000 objects and generates reasonable meshes for a wide range of object categories. In some use cases, however, “reasonable” is not good enough and details matter. In such scenarios, you could try to finetune the Trellis model on a custom dataset built from object meshes of interest.
The Trellis authors provide both training code and a toolkit for data preprocessing to get you started. In practice, however, finetuning is less plug-and-play than it might appear and introduces several challenges. In this post, we share insights gathered across multiple Datameister projects, using the mechanical components benchmark (MCB) as a guiding dataset, with a focus on image-conditioned mesh generation. We will cover:
- Which Trellis components are worth finetuning?
- The three main bottlenecks: data, memory and overfitting
- What changes with Trellis 2?
1. Which component of Trellis should you finetune?
The Trellis pipeline consists of eight models that operate in sequence. The first question to consider, then, is which of these is worth finetuning.
There are two major categories to choose from:
- VAEs:
Used to encode a mesh, input image or feature map into latent space, or to decode it back the other way.
- Diffusion flow models:
Responsible for generating object structure, geometric detail and texturing in latent space.

In reality, we found that the VAEs already work really well for most use cases. Well-known models built on top of Trellis, such as Hi3DGen, do not touch them at all and use them as-is.
The biggest gains come from finetuning the diffusion flow models, which do the actual generation. Trellis has two such flow models, applied in successive stages.
- Sparse structure flow: Generates a binary voxel grid from noise based on the input image conditioning.
- Structured latent flow: Generates a feature map for all voxels that contains geometric detail and texture information.
Which of these two you should finetune depends on your needs. Consider the “screws and bolts” subset of the MCB dataset. Every so often, we would generate objects that suffered from one of the following:
- nonsensical global geometry
- incomplete generations at tricky camera angles
- strange appendages
We found that finetuning the structured latent flow mostly sharpened geometric detail and did little for these global problems. Finetuning the sparse structure flow, on the other hand, dramatically increased robustness and consistency, even at tricky angles.



2. Challenges of finetuning Trellis’ sparse structure flow
In our experience, finetuning the sparse structure flow consistently ran into three bottlenecks:
- Data availability and consistency
- GPU memory constraints
- Rapid overfitting
In the next three sections we will go over each of these, starting with the input data.
2.1 Data: quantity, quality and preprocessing cost
As with most finetuning workflows, training data is a critical ingredient of success. There are a few aspects to keep in mind here:
How much data is enough?
Part of the answer to that question is how varied your input meshes are. For example:
- A subset of ~2000 motors and rotors in MCB proved difficult to finetune due to high diversity in general shape.
- A subset of ~1000 screws with similar global geometry converged much more reliably.
As a rule of thumb:
- High variation across geometric shapes hurts convergence more than limited data volume.
- A few thousand relatively homogeneous objects are often sufficient.
- For niche applications, smaller datasets can still work if the geometry is consistent.
Data quality matters
Trellis was trained on data that passes a filter on an “aesthetic score”, meaning the objects are clean and well-modeled. The cleaner your input data, the easier a time you will have. Avoid open meshes and noisy appendages.
Processing: the hidden cost
If you have a big, high-quality dataset, one of the largest hurdles comes from the preprocessing steps. Trellis provides a preprocessing toolkit that
- downloads the input meshes
- renders ~150 views per object to create feature maps
- voxelizes and normalizes the meshes
- encodes meshes to the relevant latent spaces
- creates renders for image conditioning
- bundles all information into a metadata file used by the training dataloaders
This all works smoothly out of the box, but depending on the resources you have available, it can take a long time.
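For reference, here is a rough sketch of driving those toolkit steps from Python. The script names and flags mirror the dataset_toolkits layout of the Trellis repository at the time of writing, but treat them as illustrative and verify them against your checkout before running anything.

```python
# Sketch of a driver for the Trellis preprocessing toolkit.
# Script names and flags are illustrative and may differ in your version of the repo.
import subprocess

OUTPUT_DIR = "datasets/mcb_screws"  # hypothetical output location

steps = [
    ["python", "dataset_toolkits/build_metadata.py", "--output_dir", OUTPUT_DIR],
    ["python", "dataset_toolkits/render.py", "--output_dir", OUTPUT_DIR],          # multi-view renders (~80% of runtime)
    ["python", "dataset_toolkits/voxelize.py", "--output_dir", OUTPUT_DIR],        # binary occupancy grids
    ["python", "dataset_toolkits/extract_feature.py", "--output_dir", OUTPUT_DIR],
    ["python", "dataset_toolkits/encode_ss_latent.py", "--output_dir", OUTPUT_DIR],
    ["python", "dataset_toolkits/encode_latent.py", "--output_dir", OUTPUT_DIR],
    ["python", "dataset_toolkits/render_cond.py", "--output_dir", OUTPUT_DIR],     # conditioning images
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # fail fast if any stage errors out
```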
For a set of about 1000 objects, the whole pipeline took a little over 24 hours to run on a single RTX 4090. The largest bottleneck was the rendering needed to create the feature maps and conditioning, which took up 80% of that time. In the stock toolkit, this rendering step is also responsible for normalizing the meshes used as ground truth during training, making it a required part of the pipeline.
Storage is a second constraint. This same subset of 1000 objects resulted in ~60GB of intermediate files; scaling up to a larger dataset of, say, 16,000 objects would require close to 1TB of disk space. About 85% of these files are feature renders and feature maps.
If your focus is on the sparse structure flow for geometry, you can bypass these issues with a few minor changes to the toolkit, so the initial rendering never has to run. This means you can get your dataset ready in a couple of hours and need much less disk space.
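To make that concrete: the core of the shortcut is simply normalizing each mesh and rasterizing it into a binary occupancy grid, without ever touching the renderer. Below is a minimal sketch using trimesh; the 64³ resolution and unit-cube normalization are assumptions on our side and should be matched to your Trellis training config.

```python
# Sketch: normalize a mesh and voxelize it into a binary occupancy grid,
# skipping the expensive multi-view rendering entirely.
# Assumes a 64^3 grid over the cube [-0.5, 0.5]^3; verify against your config.
import numpy as np
import trimesh

def mesh_to_occupancy(path: str, resolution: int = 64, samples: int = 500_000) -> np.ndarray:
    mesh = trimesh.load(path, force="mesh")

    # Center the mesh and scale its longest side to fit the unit cube.
    lo, hi = mesh.bounds
    mesh.apply_translation(-(lo + hi) / 2.0)
    mesh.apply_scale(1.0 / (hi - lo).max())

    # Approximate surface voxelization: sample points on the surface and bin them.
    points, _ = trimesh.sample.sample_surface(mesh, samples)
    idx = np.clip(((points + 0.5) * resolution).astype(int), 0, resolution - 1)

    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

occupancy = mesh_to_occupancy("screw_0001.obj")  # hypothetical file from the dataset
print(occupancy.sum(), "occupied voxels")
```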
With the data pipeline reviewed, the next bottleneck becomes GPU memory.
2.2 Memory
A second challenge becomes apparent as soon as you launch a finetuning job: GPU memory. With the default settings, a batch size of 1 already pushed GPU memory usage to ~20GB, making the larger batch sizes needed for more stable training infeasible.
Several practical adjustments can be made to overcome this bottleneck:
- Mixed precision training:
Switching from full to mixed precision freed up substantial memory, allowing batch sizes of up to 8.
- Gradient accumulation:
Trellis effectively supports gradient accumulation through its batch-splitting settings. During training, batches are split into smaller micro-batches that are processed sequentially, with gradients aggregated before a single optimizer step. This allows for larger effective batch sizes without loading all samples into GPU memory at once.
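Both tricks are plain PyTorch. Below is a generic sketch (with stand-in model, data and loss, not the actual Trellis trainer) of how automatic mixed precision combines with micro-batch accumulation.

```python
import torch

# Stand-ins so the sketch is self-contained; swap in the sparse structure flow model,
# its flow-matching loss and your dataloader in practice.
model = torch.nn.Linear(512, 512).cuda()
dataloader = [(torch.randn(2, 512), None) for _ in range(64)]
def loss_fn(pred): return pred.pow(2).mean()

accum_steps = 8                                     # effective batch = micro-batch size * accum_steps
scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

optimizer.zero_grad(set_to_none=True)
for step, (x, cond) in enumerate(dataloader):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = loss_fn(model(x.cuda())) / accum_steps   # average over micro-batches

    scaler.scale(loss).backward()                   # gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                      # one optimizer update per effective batch
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```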
Once memory constraints are addressed, overfitting becomes the dominant failure mode.
2.3 The overfitting problem
Early experiments consistently showed steadily decreasing training loss, while validation loss began increasing after roughly 3,000 iterations. Visual inspection of generated samples revealed that while the model had learned the general structure of the input meshes, it emphasized a small set of features from the most dominant substructure in the data and applied them across all test samples.
A key lesson was that decreasing training loss alone is a poor indicator of generalization quality. The default Trellis training setup lacks validation loops, and once these were added, the divergence between training and validation behavior became obvious. Although small training sets played a role, further adjustments were needed to achieve stable and robust finetuning.
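The validation pass itself does not need to be fancy. Below is a minimal sketch of the kind of loop we bolt onto the trainer, with names that are ours rather than part of the Trellis codebase.

```python
import torch

@torch.no_grad()
def validation_loss(model, val_loader, loss_fn) -> float:
    """Average loss on a held-out split, using the same objective as training."""
    model.eval()
    total, batches = 0.0, 0
    for x, _cond in val_loader:
        total += loss_fn(model(x.cuda())).item()
        batches += 1
    model.train()
    return total / max(batches, 1)

# Called every N iterations inside the training loop, for example:
# if iteration % 500 == 0:
#     print(f"iter {iteration}: val loss {validation_loss(model, val_loader, loss_fn):.4f}")
```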
Freezing layers
The image-conditioned Trellis model is quite large, with 550 million trainable parameters. It is fair to assume that overfitting could be mitigated by finetuning only a fraction of those parameters. You could, for example, freeze all blocks of the flow model and leave only the final block and the output layer unfrozen, which keeps just 8% of the original parameters trainable.
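In PyTorch terms, that experiment boils down to a few lines. Attribute names such as `blocks` and `out_layer` are illustrative; the real module names depend on the Trellis flow model definition.

```python
# Freeze everything, then unfreeze only the last transformer block and the output layer.
# `flow_model.blocks` / `flow_model.out_layer` are illustrative attribute names.
for param in flow_model.parameters():
    param.requires_grad = False
for param in flow_model.blocks[-1].parameters():
    param.requires_grad = True
for param in flow_model.out_layer.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in flow_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in flow_model.parameters())
print(f"trainable: {trainable / total:.1%} of {total:,} parameters")
```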
In practice, this approach did not work and the model did not learn anything: both training and validation loss remained stagnant over the course of a few hundred thousand iterations. It seems the last few layers on their own cannot capture the new patterns.
A more flexible alternative is low-rank adaptation (LoRA), which enables parameter-efficient finetuning across the full model. While LoRA is well established in language and diffusion models, its use in 3D generative models remains limited and it is not supported out-of-the-box in Trellis. As strong results were already obtained through data curation and hyperparameter tuning, we did not explore LoRA here, but we are excited to give it a try.
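For completeness, here is a self-contained sketch of the LoRA idea applied to a single linear layer; it illustrates the generic technique rather than an existing Trellis integration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: adapt a single projection; in practice you would wrap the attention and MLP
# projections throughout the flow model.
layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(4, 1024))
```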
3. How about Trellis 2?
Microsoft recently released the second version of their Trellis model, appropriately named Trellis 2. There are a few big changes, which we discussed in a previous blog post, but what does it mean for finetuning?
At first sight, it does not seem to change much about the aspects discussed above. The first part of the pipeline - where the sparse structure is created - seems largely unaffected. However, if you are more interested in finetuning the detailed geometry and texture generation models, you will find some updates that will impact your finetuning experience.
- The first big change is in the data preprocessing. As Trellis 2 uses a native representation of geometric detail and appearance features, the bulk of the rendering is no longer needed. In its place, you will have to preprocess the o-voxel conversion, which the paper says takes only a few seconds on CPU. That is a big speedup compared to the rendering-based approach.
- Secondly, it should be able to deal with open and non-manifold meshes much better. This means that non-standard meshes in your dataset should have a less negative impact on your finetuning process.
- Finally, Trellis 2 introduces three flow models instead of two, splitting the second stage into separate geometric detail and texture flows, which could enable more targeted and parameter-efficient finetuning.
As the full training code and dataset toolkit have not been released at the time of writing, we have not had a chance to play around with finetuning Trellis 2 yet. However, we are excited to do so as soon as possible!
Closing Remarks
Finetuning Trellis is less about finding the right configuration and more about understanding where the model is brittle. Meaningful improvements came from choosing the right component to finetune, curating consistent training data, and closely monitoring generalization rather than relying on training loss alone.
At Datameister, we continue to explore and refine finetuning strategies for Trellis and other 3D generative models, with a focus on robustness, consistency, and real-world deployment. If you have an application that requires accurate and dependable 3D generation, do not hesitate to reach out!