written by
Jirne Meurisse

Indoor Semantic Segmentation of Point Clouds: From LiDAR Capture to Real-World Use

Lab · 12 min read · April 16, 2026

Raw LiDAR data is not useful on its own: it is just a set of thousands to millions of 3D coordinates describing the geometry of your environment. To make this data actionable, it needs to be transformed into a structured representation that systems can actually reason about.

This is exactly where the Datameister Semantic Segmentation Pipeline comes in: it transforms raw, sparse point clouds into meaningful scene understanding that enables downstream tasks such as digital twins, real-to-sim, and asset management.

From a sparse raw point cloud to a semantically labeled point cloud, enabling downstream tasks such as scene reconstruction.

If you are building robotics systems, digital twins, or indoor mapping pipelines, LiDAR is likely a core input. Many teams assume they can reuse outdoor segmentation models, but quickly run into the reality that these models fail to transfer across domains, sensors, and platforms. So how do we achieve reliable indoor scene understanding?

We’ve structured this post as follows. First, we’ll define semantic segmentation on point clouds and why it is the foundation of downstream tasks. Second, we look at how these point clouds are captured and examine the current gold standard of the field: outdoor scenes. Third, we explore why these approaches break down indoors, and what it takes to make them work in practice. Finally, we wrap up with the high-impact downstream applications this unlocks.

1. Pixel-Level to Point-Level Semantic Segmentation

Semantic segmentation assigns domain-specific labels to pixels (2D) or points (3D), and is a first step in robotic perception of an environment. In the case of 2D images, the task is a pixel-level classification problem: instead of classifying an entire image as “car”, we assign labels to each pixel, e.g. wheels, body, and windows. Analogously, the task extends to point clouds (3D), resulting in a point-level classification problem.

Toy example illustrating pixel-level classification (2D) vs. point-level classification (3D)

Regardless of the input dimensionality, the goal is to produce a dense, spatially consistent map where the boundaries are smooth and logical. In the context of indoor scene understanding, semantic segmentation adds the fundamental structure that allows a system to distinguish points belonging to walls, floors, or furniture.

Note: Unlike instance segmentation, semantic segmentation treats all objects of the same category as a single entity. As a rule of thumb: semantic segmentation identifies "what," while instance segmentation identifies "which." Instance segmentation can be done by a separate model, or as a post-processing step of semantic segmentation.
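As a toy illustration of that post-processing step, spatially separated groups of points sharing a semantic label can be split into instances with a naive Euclidean flood fill. The function and distance threshold below are purely illustrative, not part of any production pipeline (real systems use spatial indexing and more robust clustering):

```python
from collections import deque

def cluster_instances(points, labels, radius=0.5):
    """Toy instance extraction: group same-class points whose pairwise
    distance is below `radius` (naive O(n^2) flood fill)."""
    n = len(points)
    instance = [-1] * n
    next_id = 0
    for seed in range(n):
        if instance[seed] != -1:
            continue
        instance[seed] = next_id
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            xi, yi, zi = points[i]
            for j in range(n):
                if instance[j] == -1 and labels[j] == labels[i]:
                    xj, yj, zj = points[j]
                    if (xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2 <= radius ** 2:
                        instance[j] = next_id
                        queue.append(j)
        next_id += 1
    return instance

# Two "chair" clusters far apart share one semantic label but get two instances.
pts = [(0, 0, 0), (0.1, 0, 0), (5, 0, 0), (5.1, 0, 0)]
print(cluster_instances(pts, ["chair"] * 4))  # → [0, 0, 1, 1]
```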

1.1 How 3D Point Clouds Are Captured

The tradeoff in LiDAR sensors for our application is accessibility (e.g., price, form factor) vs. industry adoption, rather than accuracy. Modern sensors like the Livox Mid-360 deliver sufficient accuracy at a tenth of the price of traditional automotive-grade systems, fundamentally changing deployment feasibility.

At a high level, a LiDAR sensor emits laser pulses, measures their return time, and converts these measurements into 3D points. Repeating this across many directions produces a geometric sampling of the scene, often together with extra attributes such as intensity, timestamp, and sometimes RGB after sensor fusion.
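The core measurement can be sketched in a few lines. The function below is a simplified illustration of the time-of-flight principle, assuming a single return and known beam angles:

```python
import math

C = 299_792_458.0  # speed of light, m/s

def pulse_to_point(t_return_s, azimuth_rad, elevation_rad):
    """One LiDAR pulse: round-trip time gives range, beam angles give direction."""
    r = C * t_return_s / 2.0  # halve the round trip to get one-way range
    x = r * math.cos(elevation_rad) * math.cos(azimuth_rad)
    y = r * math.cos(elevation_rad) * math.sin(azimuth_rad)
    z = r * math.sin(elevation_rad)
    return (x, y, z)

# A pulse returning after ~66.7 ns corresponds to a surface ~10 m away.
x, y, z = pulse_to_point(2 * 10.0 / C, azimuth_rad=0.0, elevation_rad=0.0)
print(round(x, 3), round(y, 3), round(z, 3))  # → 10.0 0.0 0.0
```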

Three categories of LiDAR sensors, illustrating the tradeoff between industry adoption and accessibility (e.g., price, form factor)

The available LiDAR sensors span a wide spectrum: industry-grade sensors like the Ouster ($10K+) that have been the standard for years, versus newer, compact alternatives like the Livox (<$1K). This rapid evolution has dramatically increased accessibility, bringing LiDAR to everyday devices, including some of the smartphones we carry.


In real-world scenarios, cheaper sensors unlock dramatically cheaper deployments, but shift the complexity to the software, which is the part we take care of.

2. Why Outdoor Segmentation Models Don’t Transfer to Indoor Environments

Most teams assume that segmentation models trained on outdoor datasets can be reused indoors. In practice, this assumption breaks down and often leads to unreliable predictions and failed deployments.

These models are built on datasets such as KITTI, nuScenes, and the Waymo Open Dataset, captured using automotive-grade LiDAR sensors in structured outdoor environments. Years of investment in autonomous driving have made these models highly performant in that specific setting, as demonstrated by impressive real-time segmentation results on benchmarks like SemanticKITTI.

Real-Time Semantic Segmentation on the SemanticKITTI dataset (slowed down and looped).
The following classes are shown: Road (pink), Building (yellow), Cars (blue), Pedestrians (red).

Indoor environments, however, introduce fundamentally different conditions. Viewpoints change, sensor characteristics differ, and scenes become less structured. As a result, models trained on outdoor data rely on assumptions about geometry and point cloud patterns that do not hold indoors.

Without adaptation, even state-of-the-art outdoor models become unreliable in indoor settings, making them unsuitable for real-world indoor deployment.

2.1 Impact of Viewpoint on Point Cloud Density

First, viewpoint differences fundamentally change the structure of the point cloud. Models trained on car-mounted LiDAR expect a specific distribution of points that does not match indoor or robot-mounted setups.

A different viewpoint of the LiDAR relative to the object has an impact on the incident angle and thus captured point density.

For example, a car-mounted LiDAR observes a person from above, resulting in higher point density near the head and lower density toward the legs. A quadruped robot captures the same person from a lower angle, producing the opposite pattern. Since many models (e.g. PointPillars available in OpenPCSeg) implicitly learn these density distributions, this mismatch directly impacts their predictions.

Less intuitively, even if the object is identical, a change in viewpoint can cause models to misclassify or completely miss it, making outdoor-trained models unreliable on indoor robotic platforms.
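The density-distribution argument can be made concrete with a toy sketch: histogram points over height and compare the vertical profiles two viewpoints would produce. The point sets below are hypothetical stand-ins for a person seen from above vs. from a low-mounted sensor:

```python
def height_density(points, z_min=0.0, z_max=2.0, n_bins=4):
    """Count points per height bin - a crude proxy for the vertical
    density profile a model implicitly learns from its training data."""
    counts = [0] * n_bins
    bin_h = (z_max - z_min) / n_bins
    for _x, _y, z in points:
        if z_min <= z < z_max:
            counts[int((z - z_min) / bin_h)] += 1
    return counts

# Hypothetical person seen from a car-mounted LiDAR: dense near the head.
top_down = [(0, 0, 1.9), (0, 0, 1.8), (0, 0, 1.7), (0, 0, 1.6), (0, 0, 0.3)]
# Same person seen from a quadruped's low viewpoint: dense near the legs.
low_angle = [(0, 0, 0.1), (0, 0, 0.2), (0, 0, 0.3), (0, 0, 0.4), (0, 0, 1.8)]
print(height_density(top_down))   # → [1, 0, 0, 4]
print(height_density(low_angle))  # → [4, 0, 0, 1]
```

A model that learned to expect the first profile sees the second as out of distribution, even though the underlying object is identical.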

2.2 Repetitive vs. Non-Repetitive LiDAR Sensors

Second, LiDAR sensors differ not just in cost or form factor, but in how they sample the environment. These differences directly shape the point cloud and influence how models interpret it.

Spinning sensors such as Ouster use repetitive scanning patterns, capturing the environment in fixed, predictable rings. This provides high temporal consistency, which is critical for detecting dynamic changes like moving objects. However, these sensors cannot sample the gaps between rings, limiting spatial coverage.

In contrast, sensors like the Livox Mid-360 use non-repetitive scanning patterns, where each pass samples different parts of the scene. Individual scans are sparse, but by accumulating multiple frames over time, the sensor achieves much higher spatial density than traditional spinning LiDARs.

Comparison of repetitive (e.g. Ouster) vs. non-repetitive (e.g. Livox) scanning patterns

Unfortunately, models trained on repetitive scan patterns implicitly expect structured ring-like inputs. When deployed on non-repetitive sensors, this mismatch degrades performance, making cross-sensor deployment unreliable without adaptation.

Note: Recently, efforts have been made to train models that transfer across sensors and domains, such as Utonia, but more in-the-field results are needed to validate this approach in real-world applications.

2.3 Unstructured, Cluttered, and Highly Variable Scenes

Finally, indoor environments are far less structured than outdoor scenes. Models trained on outdoor data rely on predictable patterns such as roads, sidewalks, and traffic infrastructure.

Indoor spaces, in contrast, are highly variable and cluttered. Layouts differ significantly, objects are more diverse, and the number of relevant classes increases. Instead of focusing on a few categories like cars and pedestrians, indoor models must distinguish between furniture, fixtures, and structural elements at a much finer level of detail.

A key observation is that indoor segmentation is not just a harder version of the same problem. It requires different assumptions, higher label granularity, and more robust modeling to work reliably in practice.

3. Making Indoor Segmentation Work on Smaller, Lower-Cost Platforms

Making indoor segmentation work on accessible hardware in real-world scenarios is not a matter of fine-tuning a model. It requires rethinking the entire pipeline, from how data is captured to how it is processed and interpreted.

To make this work in practice, the goal is not just accuracy on benchmarks, but a system that is portable, affordable, and robust enough for real-world deployment.

Our setup reflects this philosophy. We use a compact platform built around the Unitree GO2 quadruped, combined with an NVIDIA compute module and a Livox Mid-360 LiDAR. This configuration provides a practical balance between cost, mobility, and sensing capability for indoor environments.

Render of our development & validation setup: Unitree GO2 with Livox Mid-360 LiDAR and NVIDIA compute module

The following sections detail how we turn this setup into a reliable semantic segmentation pipeline, enabling further downstream spatial AI.

3.1 From Sparse Scans to Segmentation-Ready Data

A single LiDAR scan from a sensor with a non-repetitive scan pattern is too sparse for reliable segmentation. As discussed in Section 2.2, non-repetitive sensors like the Livox Mid-360 distribute points across different locations in each scan, meaning individual frames lack sufficient spatial coverage.

The animation below shows how this changes over time. Each frame adds new information, gradually filling in the scene as scans are accumulated.

Accumulating Livox Mid-360 frames over time and robot motion creates a dense point cloud for semantic segmentation.

On a moving platform, this only works if motion is properly compensated. By synchronizing frame accumulation with the on-device SLAM system, we reconstruct a spatially consistent point cloud of the entire floor that is ready for segmentation.

Without scan accumulation and motion compensation, low-cost LiDAR data is too sparse to be useful. With it, we unlock dense, high-quality inputs from lightweight hardware.
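The accumulation step itself is conceptually simple. Below is a minimal sketch, not our actual SLAM integration: each sparse scan is mapped into a common map frame using its estimated 4x4 pose, then concatenated.

```python
def transform(points, pose):
    """Apply a 4x4 rigid-body pose (row-major nested lists) to 3D points."""
    return [
        tuple(pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
              for r in range(3))
        for x, y, z in points
    ]

def accumulate(scans, poses):
    """Map each sparse scan into the map frame with its SLAM pose and
    concatenate: the accumulated cloud densifies over time."""
    cloud = []
    for scan, pose in zip(scans, poses):
        cloud.extend(transform(scan, pose))
    return cloud

IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
MOVED_2M = [[1, 0, 0, 2.0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # robot moved 2 m in x
cloud = accumulate([[(1.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]], [IDENTITY, MOVED_2M])
print(cloud)  # → [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
```

Without the pose (i.e. without motion compensation), both scans would land on the same coordinates and new frames would smear the map instead of densifying it.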

3.2 Datameister’s End-to-End Segmentation Pipeline

A pre-trained model alone is not enough to achieve reliable indoor segmentation. While large models like Sonata provide state-of-the-art semantic segmentation, they must be adapted to the specific characteristics of indoor environments and sensor setups to achieve a robust output.

Datameister’s semantic segmentation pipeline from capture on our platform to a semantically segmented point cloud.

Our pipeline combines domain-aware pre- and post-processing with a state-of-the-art segmentation model. We build on top of Sonata, a pre-trained Point Transformer V3 encoder, and use a decoder head trained on ScanNet-20, an indoor dataset for semantic segmentation. This allows the model to capture the structure and variability of indoor scenes while classifying points into relevant categories such as floors, walls, and furniture. Depending on the domain, we can train the decoder for the desired classes while preserving the broad geometric understanding learned during large-scale pre-training.

Crucially, this model operates on the accumulated and motion-compensated point clouds described in Section 3.1. This integration with the on-device SLAM system ensures that the input is dense and spatially consistent, which is essential for reliable predictions in real-world environments.
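One generic pre-processing step worth illustrating is voxel downsampling, commonly applied before point-transformer-style models to bound compute and even out density variations. This is a sketch of the general technique, not a claim about the exact steps in our pipeline:

```python
from collections import defaultdict

def voxel_downsample(points, voxel=0.05):
    """Grid-average downsampling: bucket points into voxel-sized cells
    and keep one centroid per occupied cell."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(c // voxel) for c in p)].append(p)
    return [
        tuple(sum(p[i] for p in cell) / len(cell) for i in range(3))
        for cell in cells.values()
    ]

dense = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(len(voxel_downsample(dense, voxel=0.05)))  # → 2 (near-duplicates merged)
```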

The pipeline diagram shows how these components come together, from data capture on our customized Unitree GO2 to a semantically segmented point cloud. The accompanying animation demonstrates the pipeline in action on our office environment, highlighting how unlabeled scans are transformed into structured scene understanding.

Raw scan to semantic segmentation: Floor (green), Walls (cyan), Chairs (yellow), Tables (orange), Doors (purple), Windows (red).

This proves our approach works beyond a lab setup or benchmark: it is a practical system that delivers high-quality indoor segmentation on compact, low-cost hardware in real-world conditions.

4. Downstream Applications

Semantic segmentation is not the end goal. It is the foundation for turning raw spatial data into usable environments, moving from a research problem to a business enabler. Once a point cloud is structured and labeled, it becomes a building block for a wide range of practical applications.

4.1 Semantic-to-Parametric Reconstruction

A semantically labeled point cloud provides a structured understanding of the scene, enabling reliable detection of key surfaces and objects such as floors, walls, ceilings, doors, and large fixtures.

Building on these labeled points, the environment can be transformed into a parametric representation. In this representation, the scene is described using geometric primitives for layout (such as walls and floors), while more complex objects are approximated with 3D assets. This creates a clean and editable reconstruction of the environment.
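As a small worked example of the semantic-to-parametric idea, floor points (already extracted via their semantic label) can be reduced to a single plane z = a·x + b·y + c by least squares. The `solve3` and `fit_plane` helpers below are illustrative, not part of the production pipeline:

```python
def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 linear system."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c via the normal equations."""
    n = len(points)
    sx = sum(p[0] for p in points); sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points); syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points); syz = sum(p[1] * p[2] for p in points)
    return solve3([[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]], [sxz, syz, sz])

# Toy "floor" points lying exactly on the plane z = 0.01*x + 0.0*y + 0.02.
floor = [(float(x), float(y), 0.01 * x + 0.02) for x in range(3) for y in range(3)]
a, b, c = fit_plane(floor)  # recovers a ≈ 0.01, b ≈ 0.0, c ≈ 0.02
```

Walls can be fit the same way with a vertical parameterization; once every labeled surface has parameters, the scene becomes editable geometry rather than raw points.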

We’ve moved from raw sensor data into a structured digital environment that can be directly used by downstream systems.

A reconstruction of our office using geometric primitives (walls and floor) and 3D assets (tables and chairs)

4.2 From Reconstruction to Real-World Applications

Once the environment is reconstructed, it can be used across multiple domains:

  • Real-to-sim pipelines: converting scanned environments into simulation-ready spaces for robotics testing, navigation, validation, and continuous improvement.
    Example: A robotics integrator scans a customer’s production hall and generates a simulation environment to validate navigation, obstacle avoidance, and task execution before deployment.
  • CAD and digital twins: creating a structured base layer that designers and engineers can design, analyze, and build on in their existing modeling tools, e.g., AutoCAD, Blender.
    Example: A renovation contractor digitizes a century-old building and obtains a clean CAD model that can be edited in tools like AutoCAD for planning structural changes.
  • Asset inventory and facility management: identifying and localizing structural elements and large objects for tracking and maintenance.
    Example: A facility manager scans a warehouse to automatically catalog equipment, enabling faster audits and maintenance planning.

In other words, semantic segmentation does not just help a robot "understand" a room. It helps teams turn raw captures into environments they can analyze, simulate, and build on.

5. Conclusion

When a human looks at a raw point cloud, they can quickly make sense of it. We intuitively separate floor from wall, ignore noise, and recognize objects even from incomplete geometry. For a robot, that same scene is just a large collection of coordinates until additional structure is added.

That is where Datameister comes in. We turn raw LiDAR captures into structured scene understanding through semantic segmentation pipelines designed for real-world indoor environments. The goal is not just better benchmark performance, but reliable inputs for the systems that depend on them, from robots that need to navigate to simulations and digital twins that require accurate spatial context.

So what? This enables teams to move from raw spatial data to environments they can simulate, analyze, and build on, without relying on expensive, automotive-grade hardware.

If you are building systems that scan, map, or interact with the physical world, raw data and hardware constraints should not be your bottleneck. We help teams turn LiDAR data into structured, actionable spatial intelligence, ready for deployment.

Ready to bring indoor spatial AI into your products? Let’s build it together.

Tags: semantic segmentation, point clouds, LiDAR, indoor mapping, spatial AI, robotics, digital twins, real-to-sim, 3D computer vision, SLAM, point cloud processing, autonomous systems, spatial intelligence, 3D reconstruction, AI for robotics