Avanti!
From September 29 to October 4, MiCo Milano was the setting for the 18th European Conference on Computer Vision (ECCV), the biennial conference showcasing the best the field of computer vision has to offer. As true AI builders, we at Datameister are always eager to stay updated on the latest trends and advancements in the field. Naturally, our presence in Milan was a no-brainer.
In this two-part blogpost, we aim to give a short overview of general research and upcoming topics we found interesting, useful for us, or simply really exciting. Access to advanced foundation models - think of DINO, Segment Anything (SAM), or Stable Diffusion - as well as the birth of novel techniques such as Gaussian Splatting has given researchers the opportunity to explore a wide range of new ideas, and we would like to share the most interesting ones with you.
Before I do that, allow me to introduce myself briefly. I am Larsen, an electrical engineer who graduated from Ghent University with a knack for AI, signal processing, sports, and any intersection between those. Ruben and Axel carefully guided me through my master's thesis back in 2023, which made me decide to join Datameister in June this year. Between my graduation in June last year and joining, I spent 7 months in Paris working at the start-up Emobot at Station F, one of the largest start-up incubators in Europe (some of you will probably know Station F as the place where Hugging Face first started in 2017).
Now that we're acquainted, let's dive into the exciting stuff! We saw three major topics coming back during workshops and poster sessions: 3D Gaussian Splatting, diffusion for image (and short video) generation, and 3D object generation and representation. This blogpost covers the first two. I chose to go deeper into the details of some of my personal favorites rather than turning this into a mere listing of papers. This one is for the tech enthusiasts who are not afraid of some digging - let's go!
3D Gaussian Splatting
We kick things off with one of the favorite topics of our co-founder Ruben - if these things spark your interest, go check out his earlier blogpost on image-based rendering. The introduction of 3D Gaussian Splatting (3DGS) in 2023 marked an interesting exception to the reign of neural networks in recent years. The method allows for high-quality, real-time novel-view synthesis by representing a scene as a collection of 3D Gaussians.
The properties of the 3D Gaussians are optimized end-to-end using gradient descent on a per-pixel loss. The resulting assembly of Gaussians accurately captures the scene and produces photo-realistic 3D views. The main drawbacks of the initially proposed algorithm were high memory usage and artifacts caused by occlusions, insufficient training views, or lighting conditions that make an object's appearance view-dependent.
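To give a feel for how simple the core optimization loop is, here is a minimal PyTorch sketch. The `render` function is only a placeholder for the differentiable Gaussian rasterizer, and the real method additionally uses spherical harmonics for view-dependent color, a D-SSIM term in the loss, and adaptive densification and pruning of Gaussians.

```python
import torch

# Hypothetical scene: N Gaussians, each with learnable position, scale,
# rotation (quaternion), opacity and color.
N = 10_000
params = {
    "means":      torch.randn(N, 3, requires_grad=True),   # 3D positions
    "log_scales": torch.zeros(N, 3, requires_grad=True),   # anisotropic scales
    "quats":      torch.randn(N, 4, requires_grad=True),   # rotations
    "opacities":  torch.zeros(N, 1, requires_grad=True),
    "colors":     torch.rand(N, 3, requires_grad=True),
}
optimizer = torch.optim.Adam(params.values(), lr=1e-3)

def render(params, camera):
    # Placeholder for the differentiable rasterizer that "splats" the
    # Gaussians onto the image plane and returns an (H, W, 3) image.
    raise NotImplementedError

def training_step(camera, gt_image):
    pred = render(params, camera)
    loss = (pred - gt_image).abs().mean()   # per-pixel L1 photometric loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```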
Foundation models to the rescue
Many researchers at ECCV aimed to tackle one or more of the shortcomings listed above. Foundation models such as DINOv2 and SAM more often than not played a significant role in that. Let's briefly go over some of the interesting work on 3DGS we've seen.
- WildGaussians handles occlusions and appearance changes while maintaining the real-time rendering speed of 3DGS. Most notably, it leverages the difference between pre-trained DINOv2 features of the ground-truth and rendered image to model regions of uncertainty. It does so by using a cosine similarity measure to train an uncertainty predictor, which is then used to mask uncertain pixels in the per-pixel rendering loss (a rough sketch of this masking idea follows after this list). The resulting uncertainty modeling reduces the influence of occluders such as transient objects or pedestrians in training images.
- Gaussian Grouping augments each Gaussian with an identity encoding, allowing Gaussians to be grouped by their semantic meaning in the scene. The encodings are supervised using automatically generated SAM masks for each view in the training collection. Masks from different views are associated using video-like object tracking (a call for the use of SAM-2 here?). This "identity-aware Gaussian Splatting" elegantly unlocks possibilities for 3D object removal, inpainting, style transfer, and more. Go take a look at the authors' project page, there is some cool stuff on there!
- SAGS or Structure-Aware 3D Gaussian Splatting leverages Graph Neural Networks (GNNs) operating on a k-Nearest Neighbor graph that links points within a local region. These points are expected to share common structural features. Allowing these points to interact with each other through the use of GNNs enhances scene understanding and reduces artifacts.
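To make the WildGaussians idea a bit more concrete, here is a rough sketch of an uncertainty-weighted rendering loss. It assumes per-pixel DINOv2 feature maps already upsampled to image resolution and a small hypothetical `uncertainty_head` (e.g. a 1x1 convolution); the exact formulation in the paper differs, so treat this as an illustration of the principle rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(gt_img, rendered_img, feat_gt, feat_render,
                              uncertainty_head):
    # feat_gt / feat_render: (C, H, W) DINOv2 feature maps of the ground-truth
    # and rendered image; gt_img / rendered_img: (3, H, W) RGB images.
    # Low feature similarity hints at occluders or appearance changes that
    # the scene model cannot (and should not) explain.
    sim = F.cosine_similarity(feat_gt, feat_render, dim=0)          # (H, W)

    # A small head predicts per-pixel uncertainty from the ground-truth
    # features and is supervised to match the feature dissimilarity.
    uncertainty = uncertainty_head(feat_gt.unsqueeze(0)).squeeze()  # (H, W)
    uncertainty_loss = F.mse_loss(uncertainty, 1.0 - sim.detach())

    # Down-weight uncertain pixels in the photometric rendering loss.
    weights = (1.0 - uncertainty).clamp(min=0.0).detach()
    photometric = (gt_img - rendered_img).abs().mean(dim=0)         # (H, W)
    return (weights * photometric).mean() + uncertainty_loss
```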
And the award for coolest paper name goes to... Gaussian Frosting!
Clearly, Gaussian Splatting is very well suited for fancy paper names. One more such example - and our personal favorite on this topic - is Gaussian Frosting, a particularly fresh and creative idea fusing 3D mesh representations with Gaussian scene modeling. The key idea of Gaussian Frosting is to augment a mesh with a layer of Gaussians in order to better capture fine surface details. This allows the representation to be edited just like a mesh, while maintaining the high rendering quality of 3DGS.
Gaussian Frosting starts from the observation that the 3D Gaussians obtained by standard 3DGS - so-called "unconstrained Gaussians" - are not regularized to align well with the surface of the scene. Algorithms such as Marching Cubes therefore fail to construct a mesh representation from an unconstrained 3DGS scene. An earlier paper by the same authors, SuGaR, already proposed a solution: regularizing Gaussians for surface alignment - constructing "regularized" Gaussians - and subsequently extracting a mesh (using Poisson reconstruction rather than Marching Cubes):
Gaussian Frosting builds further on this by constructing a "frosting layer" of additional Gaussians on the extracted mesh, whose thickness depends on the "fuzziness" of the material. This "fuzziness" can be formally defined by considering the thickness of both the regularized Gaussians and the unconstrained Gaussians, which allows the required thickness of the frosting layer to be estimated automatically:
Thickening the frosting layer around these fuzzy regions simply allows for more Gaussians to capture fine-grained details:
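As a toy illustration of how such a thickness could be estimated (this is not the paper's exact definition, which is based on the spread of both the regularized and the unconstrained Gaussians around the surface), one could measure how far the unconstrained Gaussians near each surface point stray along the local normal:

```python
import torch

def estimate_frosting_thickness(surface_points, vertex_normals,
                                unconstrained_means, k=16, min_thickness=1e-3):
    # surface_points, vertex_normals: (V, 3); unconstrained_means: (N, 3).
    # For each surface point, look at its k nearest unconstrained Gaussians
    # and measure their signed offset along the surface normal: fuzzy
    # materials (hair, grass) spread further out and get a thicker layer.
    dists = torch.cdist(surface_points, unconstrained_means)        # (V, N)
    knn_idx = dists.topk(k, largest=False).indices                  # (V, k)
    neighbours = unconstrained_means[knn_idx]                       # (V, k, 3)

    offsets = neighbours - surface_points.unsqueeze(1)              # (V, k, 3)
    along_normal = (offsets * vertex_normals.unsqueeze(1)).sum(-1)  # (V, k)

    inner = along_normal.min(dim=1).values   # furthest inside the surface
    outer = along_normal.max(dim=1).values   # furthest outside the surface
    return (outer - inner).clamp(min=min_thickness)                 # (V,)
```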
Furthermore, the authors are able to keep the frosted Gaussians within the frosting layer during optimization as well as when deforming the base mesh. This makes it possible to use traditional mesh editing tools such as Blender to edit or composite scenes with great rendering quality, even for complex volumes or fuzzy materials. One such example can be seen below. Who would have thought Buzz Lightyear would get to ride a giant kitten in the classic 3DGS bike scene?
For those familiar with Blender, the Gaussian Frosting authors provide a Blender add-on to play around with! You can find it here.
Diffusion
Latent diffusion proved to be a game-changer for conditional 2D image generation back in 2021. Moving the diffusion process to the latent space made it possible to exploit the potential of image denoising far more efficiently. I would like to highlight three fairly different papers that use diffusion directly for generation. They have one thing in common: the results are an impressive tribute to the capabilities of today's latent diffusion models.
CosHand: Controlling the world by the sleight of hand
Diffusion models most likely have some understanding of how objects interact with the world, simply because of the vast amount of data models such as Stable Diffusion have seen during training. CosHand is an example of a "world model". World models predict the future conditioned on past observations and an action. This setup is quite different from regular text- or image-conditioned diffusion models: world models should ensure consistency between the input and predicted image, and accurately model the physical behavior of objects in their surrounding environment.
CosHand models hand-environment interaction and aims to predict the change in position, appearance or geometry of an object caused by hand motions, seen from a single view. Given an input image X_t, the hand mask in that image h_t and a future hand mask - the "hand query" - h_{t+1}, it predicts how the image would change as a result of the hand motion.
Starting from a pre-trained Stable Diffusion model to leverage its strong priors, CosHand conditions on encoded versions of X_t, h_t and h_{t+1}, as well as on a CLIP embedding of X_t, during finetuning. The latter ensures semantic consistency between the input and output image. Hand masks are generated with - you'll never guess - Segment Anything, and model inputs and outputs are sampled from the Something-Something V2 video dataset.
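Here is a rough sketch of what that conditioning could look like in a diffusers-style setup. The callables (`vae_encode`, `clip_image_encoder`, `unet`) are placeholders standing in for Stable Diffusion components, not the authors' actual code, and details such as how the latents are combined are simplified.

```python
import torch

def coshand_denoising_step(x_t, hand_mask_t, hand_mask_t1, noisy_latent,
                           timestep, vae_encode, clip_image_encoder, unet):
    # x_t: input frame X_t; hand_mask_t / hand_mask_t1: hand masks h_t, h_{t+1}.
    # The conditioning images are encoded into the latent space and frozen.
    with torch.no_grad():
        z_img  = vae_encode(x_t)            # latent of the input frame
        z_h_t  = vae_encode(hand_mask_t)    # latent of the current hand mask
        z_h_t1 = vae_encode(hand_mask_t1)   # latent of the "hand query"
        clip_emb = clip_image_encoder(x_t)  # semantic tokens for cross-attention

    # Concatenate the conditioning latents to the noisy latent of the frame
    # we want to predict; the CLIP tokens keep input and output semantically
    # consistent via the UNet's cross-attention layers.
    unet_input = torch.cat([noisy_latent, z_img, z_h_t, z_h_t1], dim=1)
    return unet(unet_input, timestep, encoder_hidden_states=clip_emb)
```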
The results are - to say the least - fascinating. CosHand is able to predict the impact of a hand motion on a scene. It also has a sense of depth, as illustrated on the left of the figure above. Even more fascinating is the fact that it can do the same thing for robot arms; it clearly has some understanding of physical concepts at an abstract level beyond its dataset! CosHand is an excellent example of unlocking strong physical priors from diffusion models.
Generative Camera Dolly: Itâs all about perspective
At the end of November 2023, Stability AI brought Stable Video Diffusion (SVD) for image-to-video synthesis into the world. Generative Camera Dolly (GCD) operates on the same principle as CosHand: finetuning a diffusion model that was pre-trained on large-scale data - in Dolly's case, video data. Given a static, single-view video of a scene, GCD is able to imagine what the scene would look like from different perspectives. The result is a video-to-4D transformation: it generates a video as if the camera were moving through the scene - just like a camera dolly.
The authors condition a pre-trained SVD model on the relative camera viewpoint difference, captured by a series of rotation matrices R_t and translation vectors T_t. This allows for explicit control over the camera movement during video generation.
A desired transformation can be specified as, for example, up 15°, right 60° and back 10 m. While generating this camera movement, the model is still capable of recovering the full scene layout and reconstructing objects that are temporarily hidden by occlusions. It also correctly imagines the continued motion of objects in the scene as the camera moves.
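As a rough illustration of how such an instruction could be turned into pose conditioning, here is a small sketch that builds a relative rotation and translation from an elevation change, an azimuth change and a backwards move. The parameterisation is my own simplification, not GCD's actual one.

```python
import numpy as np

def relative_pose(elev_deg=15.0, azim_deg=60.0, back_m=10.0):
    # "Up 15 degrees, right 60 degrees, back 10 m" as a relative camera pose.
    e, a = np.radians(elev_deg), np.radians(azim_deg)
    # Rotation about the camera's x-axis (tilt up) ...
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(e), -np.sin(e)],
                   [0, np.sin(e),  np.cos(e)]])
    # ... and about its y-axis (pan right).
    Ry = np.array([[ np.cos(a), 0, np.sin(a)],
                   [ 0,         1, 0        ],
                   [-np.sin(a), 0, np.cos(a)]])
    R = Ry @ Rx
    T = np.array([0.0, 0.0, -back_m])   # move backwards along the view axis
    return R, T

# Per output frame t, the target pose can be interpolated and the pair
# (R_t, T_t) fed to the video model, e.g. flattened into a 12-dim vector:
R, T = relative_pose()
pose_conditioning = np.concatenate([R.reshape(-1), T])
```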
Cool video examples can be found on the GCD Project Page - be sure to give it a look!
FMBoost: Go with the flow
Diffusion models generally still suffer from slow inference and high computational resource usage, especially for high-resolution image generation. Apart from (latent) diffusion processes, another paradigm for image generation has gained some traction lately: flow matching. Flow matching is a theoretically rather heavy concept based on Continuous Normalizing Flows. It aims to construct a flow φ that maps an initial distribution p_0 (for image generation, a standard Gaussian distribution) to another, possibly more complex distribution p_1 (the distribution of the image space):
φ is the solution of the differential equation dx = u_t(x) dt, with u_t(x) a time-dependent vector field. Flow matching essentially provides a regression objective to estimate the vector field u_t(x) from samples of the target distribution p_1. The flow φ can then be computed using highly optimized ODE solvers, making flow matching very efficient for image generation, as it does not require the many stochastic denoising steps that diffusion models do. However, this comes at the cost of flow matching models being less expressive and diverse.
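For the curious, here is a minimal, self-contained flow matching sketch on 2D toy data. It uses the simple linear-interpolation (rectified-flow style) form of the regression objective and plain Euler integration instead of a fancy ODE solver; real image models do this in a latent space with much larger networks.

```python
import torch

# A small MLP stands in for the vector field network u_t(x): input is (x, t).
v_theta = torch.nn.Sequential(
    torch.nn.Linear(2 + 1, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(v_theta.parameters(), lr=1e-3)

def fm_training_step(x1):
    """x1: a batch of samples from the target distribution p_1 (2D points)."""
    x0 = torch.randn_like(x1)              # samples from p_0 = N(0, I)
    t = torch.rand(x1.shape[0], 1)         # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # point on the straight path x0 -> x1
    target = x1 - x0                       # time derivative of that path
    pred = v_theta(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()   # regression onto the vector field
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample(n=128, steps=20):
    """Integrate dx = u_t(x) dt with a plain Euler solver, from p_0 to p_1."""
    x = torch.randn(n, 2)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + v_theta(torch.cat([x, t], dim=1)) / steps
    return x
```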
FMBoost combines the best of both worlds: it leverages a latent diffusion model in a small latent space to generate diverse samples at low resolution. Flow matching is then used to efficiently upsample the latent code to a higher-dimensional latent space, from which a high-resolution image can be decoded with a pre-trained VAE decoder:
The Coupling Flow Matching (CFM) module is trained to efficiently transport the low-resolution latent code to a high-resolution one. Diffusion in the low-dimensional latent space ensures sufficient diversity in the generated images. The result is a plug-and-play method for boosting the resolution of latent diffusion models in an efficient manner. It is really neat to see the difference between the low-resolution image decoded from the LDM's latent code and the high-resolution result after flow matching:
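Wired together, a heavily simplified sampling pipeline could look like the sketch below. All callables (`ldm_sample`, `cfm_velocity`, `vae_decode`) are placeholders rather than the authors' API, and initializing the flow by bilinearly upsampling the low-resolution latent is my own simplification.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fmboost_sample(prompt, ldm_sample, cfm_velocity, vae_decode, steps=20):
    # 1) A latent diffusion model provides a diverse low-resolution latent.
    z_lo = ldm_sample(prompt)                                  # (B, C, h, w)

    # 2) A Coupling Flow Matching module transports it towards the
    #    high-resolution latent space with a few Euler ODE steps.
    z = F.interpolate(z_lo, scale_factor=4, mode="bilinear")   # naive init
    for i in range(steps):
        t = torch.full((z.shape[0],), i / steps)
        z = z + cfm_velocity(z, t, z_lo) / steps

    # 3) A pre-trained VAE decoder turns the latent into the final image.
    return vae_decode(z)
```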
That's it for this first ECCV blogpost! We hope the work above fascinated you as much as it did us. Every corner of the 3D AI scene is moving at an unprecedented speed, that much is certain. In the next blogpost, a brand new colleague of ours - and an expert in the field - will take you through more fascinating ECCV work on 3D object generation and representation. Stay tuned!