Datameister Blog

Datameister #1 in Intrinsic's robotic-manipulation challenge qualifier

Ruben Verhack — Thu, 11 Jun 2026 07:30:00 GMT

TL;DR Datameister is first on the Intrinsic AI for Industry Challenge qualifying leaderboard with a score of 293.38, out of roughly 160 teams. The challenge is a cable-insertion benchmark in simulation, run by Intrinsic with Open Robotics, Google DeepMind, and NVIDIA on the evaluation committee. The qualifying window was two and a half months. We landed first place in three weeks.

The skill on test is precision cable insertion for electronics assembly: a Universal Robots UR5e cobot with three wrist-mounted cameras and a force-torque sensor picks flexible cables and inserts connectors into a server-tray workcell. Our entry composes a sequence of step-specific policies on top of production-grade perception and 3D-AI primitives from the DM Library. Each step uses the tool that performs best on that specific sub-problem, evaluated empirically.

Result: first on the qualifying leaderboard of roughly 160 teams, ahead of a field that includes well-funded competitors with millions in venture funding.

Applications: cable assembly and connector insertion for electronics manufacturing, deformable-object manipulation, server-tray wiring, manipulation skill catalog for industrial robotics integrators.

Our entry to the AIC: a Universal Robots UR5e inserting cables into a server-tray fixture, scored on success, precision, safety, and cycle time. (source: Datameister)

Electronics manufacturing has been one of the last big holdouts against full robotic assembly. Most steps on a modern line are automated. Cables and connectors are not. Server trays, network gear, and consumer electronics still end up with people handling the wiring step by step, because the physics of flexible cables and the tolerances of connector insertion are genuinely hard for a robot to get right. Intrinsic ran the AI for Industry Challenge to attack exactly that bottleneck, with a $180,000 prize pool and a benchmark that anyone could enter. Datameister is first on the qualifying leaderboard. This post walks through what the challenge is, the skill on test, how we approached it, and what comes next.

What the AI for Industry Challenge is

The AI for Industry Challenge is an open competition run by Intrinsic (an Alphabet company) together with Open Robotics. The evaluation committee includes Francesco Nori (Director of Robotics, Google DeepMind), Amit Goel (Director of Product Management, NVIDIA), Geoffrey Biggs (CTO, Open Robotics), Susanne Nördinger (Universal Robots), Zhe Shi (Foxconn), and Wendy Tan White (CEO, Intrinsic). $180,000 in prizes, three phases:

Qualification (March 2 to mid-May 2026): train and submit a cable-insertion model evaluated in simulation. Roughly 2.5 months. Around 160 teams entered. We just passed this phase.
Phase 1 (mid-May to August 2026): the top 30 teams get access to Intrinsic Flowstate and the Intrinsic Vision Model to build out a full cable-handling solution.
Phase 2 (August to September 2026): the top 10 teams deploy their solutions remotely to a physical workcell at Intrinsic's HQ in California for real-world evaluation.

We were one of the 30 teams that advanced to Phase 1, with the highest score on the qualifying leaderboard.

The skill on test: precision cable insertion

The skill the AIC measures is what Datameister calls precision cable insertion for electronics assembly. The participant toolkit specifies the hardware: a Universal Robots UR5e cobot, a Robotiq Hand-E gripper, an Axia80 force-torque sensor on the wrist, and three wrist-mounted Basler cameras streaming uncompressed RGB at 20 fps. The task is to pick flexible cables and insert their connectors into a server-tray fixture, generalising across plug types and port configurations.

Our entry running the precision cable-insertion task end to end. The arm relies on the wrist-mounted cameras and the F/T sensor; ground-truth poses are off during scored evaluation. (source: Datameister)

Submissions are scored by an automated pipeline on four criteria, per the challenge rules: task success (a binary per insertion), precision (how close the connector lands to its target), safety (penalties for collisions, excessive forces, and excessive jerk on cables or connectors), and efficiency (cycle time for the full set of inserts). The single qualifying score is the weighted combination.
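To make "weighted combination" concrete, here is a minimal sketch of how four criteria can fold into one number. The weights, the 0-to-1 normalisation, and the field names are all assumptions for illustration; the actual weights and scale used by the challenge are not stated in this post.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"success": 0.5, "precision": 0.2, "safety": 0.2, "efficiency": 0.1}


@dataclass
class TrialMetrics:
    success: bool      # binary per insertion
    precision: float   # assumed 0..1, 1 = connector exactly on target
    safety: float      # assumed 0..1, 1 = no collision, force, or jerk penalties
    efficiency: float  # assumed 0..1, 1 = best cycle time


def weighted_score(m: TrialMetrics, weights=WEIGHTS) -> float:
    """Combine the four criteria into a single score using fixed weights."""
    parts = {
        "success": 1.0 if m.success else 0.0,
        "precision": m.precision,
        "safety": m.safety,
        "efficiency": m.efficiency,
    }
    return sum(weights[name] * parts[name] for name in weights)


trial = TrialMetrics(success=True, precision=0.8, safety=1.0, efficiency=0.5)
print(round(weighted_score(trial), 3))
```

With weights like these, a failed insertion caps the score at half of the maximum no matter how fast or gentle the attempt was, which is the kind of trade-off a weighted score encodes.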

How we approached precision cable insertion

The harder a robotics problem looks end-to-end, the more it rewards being decomposed. Cable insertion is genuinely a collection of distinct sub-problems, each with its own properties and its own active state of the art.

Our entry composed a sequence of step-specific policies on top of the DM Library, with each step picking the right tool for that step. What runs underneath each step is whatever performs best on that specific sub-problem, evaluated empirically. The composition is what makes the skill, not any one policy choice underneath it.

Four things made three weeks possible:

We started from production-grade primitives. The perception and 3D-AI building blocks in our DM Library have been live and battle-tested across customer projects for years. We did not have to build a perception stack from scratch under tournament time pressure. We composed on top of one.
We picked the right tool per sub-problem rather than committing to one stack. Each sub-problem gets evaluated on its own merits, and the policy that wins on that sub-problem is what runs there.
We had the internal process to move research into a working primitive on a short timeline. Fresh-from-arXiv research moves fast and is rough around the edges. The way we work in our monorepo (see The Monorepo as AI Factory) lets us bring new ideas into the library quickly while keeping the existing primitives stable. New techniques get evaluated per sub-problem and only land in the library when they earn their place.
We evaluated step by step. Each sub-problem had its own metric and its own bar. That is the discipline that beats throwing a single monolithic model at the whole task and hoping.
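The pattern in the list above (evaluate candidates per sub-problem, keep the winner, then compose the winners into one skill) can be sketched in a few lines. This is not our production code: the step names, the candidate policies, and the metric below are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

State = dict
Policy = Callable[[State], State]


def select_best(candidates: Dict[str, Policy],
                metric: Callable[[Policy], float],
                bar: float) -> str:
    """Return the name of the best-scoring candidate, if it clears the bar."""
    scores = {name: metric(policy) for name, policy in candidates.items()}
    best = max(scores, key=scores.get)
    if scores[best] < bar:
        raise ValueError(f"no candidate clears the bar {bar}: {scores}")
    return best


def compose(steps: List[Policy]) -> Policy:
    """Chain step-specific policies into a single skill."""
    def skill(state: State) -> State:
        for step in steps:
            state = step(state)
        return state
    return skill


# Hypothetical candidates for one sub-problem: locating the port.
def coarse_locator(state: State) -> State:
    return {**state, "offset_mm": 2.0}


def fine_locator(state: State) -> State:
    return {**state, "offset_mm": 0.5}


def insert(state: State) -> State:
    return {**state, "inserted": state["offset_mm"] < 1.0}


def locator_metric(policy: Policy) -> float:
    # Higher is better: negative residual offset on a dummy scene.
    return -policy({})["offset_mm"]


locators = {"coarse": coarse_locator, "fine": fine_locator}
best = select_best(locators, locator_metric, bar=-1.0)
skill = compose([locators[best], insert])
print(best, skill({}))
```

Each sub-problem gets its own metric and its own bar, so swapping in a new candidate for one step never requires re-validating the others.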

The underlying point: skills are what the buyer talks about, primitives are implementation, and the right move is to compose skills out of primitives where each primitive is the best available tool for its job. The AIC qualifier is the first public test of that approach, and the result supports it.

The result: first on the qualifying leaderboard

Qualifying-phase leaderboard, top ten of roughly 160 teams. (leaderboard data: Intrinsic AI for Industry Challenge; chart: Datameister)

First place at 293.38 on the qualifying leaderboard. Roughly 160 teams entered qualifying; the top 30 advance to Phase 1, and we go in with the highest score. Three weeks of focused work against a two-and-a-half-month window, in a field that includes well-funded competitors with millions of euros in venture funding.

The score margin to second is small (about one point), which we read as a signal that the other top teams are doing serious work too. The standing matters mostly for the next phase; what matters more is how this compose-on-top-of-primitives approach extends to the physical workcell.

What's next: Phase 1 and the physical workcell

Phase 1 starts immediately. Top 30 teams gain access to Intrinsic Flowstate and the Intrinsic Vision Model to build out a complete cable-handling solution on top of their qualifying models. Phase 1 winners are announced by August 4, 2026. The top 10 from there move to Phase 2, where solutions are remotely deployed to a real UR5e workcell at Intrinsic's HQ in California, with the final results announced by September 8, 2026.

We will write up the full technical story once the challenge concludes.

Working with us

Precision cable insertion is one skill in a growing catalog we build for European robotics integrators, on standardized off-the-shelf hardware (the UR5e here, Franka and AGIBOT elsewhere). We take on the hard part between a fresh model and a manipulation skill that runs reliably on a real line, so an integrator can put cognitive robots to work without standing up an in-house AI research team. The integrator keeps the IP for whatever is specific to their use case and data.

If you are an integrator, OEM, or manufacturer trying to get AI-driven manipulation running on real hardware, and the path from research to a reliable deployed skill is your bottleneck, we are glad to talk.

Autonomous from Scratch: Building Datameister's Physical AI Foundation

Thibaud Despriet — Wed, 03 Jun 2026 08:00:00 GMT

TL;DR We built and operate an autonomous navigation and data-capture skill running end-to-end on a Unitree GO2, an NVIDIA Jetson, and a Livox MID-360. Unified under a single ROS 2 framework, the robot navigates a site autonomously, builds a persistent 3D map of what it sees, captures the run as a replayable bag, and lets you watch it live from a browser, all without a tethered laptop.

This is the on-robot side of Datameister's Physical AI work, paired with the Datameister Platform on the training side. The quadruped is the mobile base, the legs, running in parallel with our stationary cobot manipulation work. The two converge into mobile manipulators and eventually humanoid platforms. Every run also generates the real-world data our training pipeline needs.

Result: a Physical AI stack we own end-to-end, ready to drive autonomous inspection runs today and to feed the manipulation skills coming next.

Applications: autonomous inspection patrol, indoor and outdoor 3D mapping, digital twin capture, real-to-sim data generation, repeatable site walkthroughs, mobile-manipulator roadmap, foundation for humanoid platforms.

Robots are stepping out of the lab. Quadrupeds are scanning construction sites, humanoids are picking parts on factory floors, mobile manipulators are running restocking shifts in retail. The hardware is moving faster than the software underneath it, and most teams trying to deploy these systems into real environments are stuck on the same gap: the autonomy stack between perception and reliable, deployable action does not exist as a product you can buy.

That stack is what makes the rest of the Physical AI work possible: the manipulation skills, the deployment feedback loops, the cognitive work above. It is also the kind of stack a hardware-agnostic robotics integrator would build if they had an in-house AI research team. We built ours from scratch on a Unitree GO2, an NVIDIA Jetson, and a Livox MID-360, unified under a single ROS 2 framework. This post walks through why we built it, what it lets us do today, and what it now enables on top.

Why we built our own robotics stack

What the existing options leave you with

Anyone who has tried to take a vendor demo from the lab to a real site knows the gap. Consumer-tier firmware is built around scripted demos. The platform will follow you around a room or run those scripts, but underneath it functions as a black box: no persistent map, no goal-driven navigation, no observability worth the name. Push past that, asking the platform to reliably execute a route while logging the full state of its perception, planning, and control pipelines, and you hit a wall.

A second wave of robotics companies tries to bridge that gap with a Robot-as-a-Service wrapper: source the hardware from a major OEM, plug in a foundation model, and sell warehouse operations or machine tending as a managed service. It looks like the answer on the surface. In practice the wrapper hides the layers that need to be hardened (SLAM, planners, transforms, sensor sync, control gating); they show up the moment you try to take the system from a controlled proof-of-concept to a deployable fleet, and a vendor whose differentiator is the wrapper cannot solve them on a project timeline.

For a European robotics integrator, neither is really the alternative anyway. The actual alternative is hiring and retaining an in-house AI research team of five to ten engineers across perception, planning, control, MLOps, and simulation, to build the same stack themselves. Most integrators of the 20-to-80-person shape cannot justify that headcount per customer engagement. Closing the gap between the OEM hardware they already buy and the skill their customer needs is systems engineering work, and no vendor sells it as a product.

The Unitree GO2 (left) and the Unitree G1 (right). Sources: New Atlas and Unitree.

What we wanted instead: Autonomous from Scratch

We realized that to truly enter the realm of Physical AI, we had to build the system autonomous from scratch. The brief was simple to state but required intense systems engineering to deliver. We needed an on-robot stack that pairs with the Datameister Platform on the training side: a setup where the robot acts as a mobile, intelligent data-harvesting node, and where the skills the Platform produces have a body to run on.

Instead of a black box, we wanted a transparent architecture we own end-to-end, delivering:

Persistent 3D Mapping & SLAM: Real-time, LiDAR-based 3D SLAM capable of generating high-resolution point clouds. We needed the system to build highly accurate, persistent maps on the fly, with the ability to refine them using loop closures.
Goal-Driven Autonomous Navigation: Point-to-point routing and continuous loop patrolling powered by Nav2, utilizing an adaptive gait controller to handle complex terrain.
Always-On Data Harvesting: Continuous capture of the platform's complete operational state. We needed every sensor feed, planning decision, and control output to be seamlessly logged, with built-in conversion pipelines making that data ready for machine learning ingestion.
Headless, Untethered Operations: A custom boot-and-provisioning sequence that allows the robot to power on, serve its own network or join the site's WiFi, and start working immediately with zero laptop tethers or manual script executions.

That is the system we shipped on the Unitree GO2: a hardware-agnostic platform for cognitive robots, designed to extend to mobile manipulators and humanoid platforms as we work toward them.

How this fits alongside the Datameister Platform

A note on terminology before going further. The Datameister Platform is our training and deployment infrastructure for visual AI and 3D AI. It lives off the robot, in the cloud or on-prem at a customer site, and handles capture, real-to-sim reconstruction, simulation orchestration, training pipelines, and evaluation. The autonomy stack this post is about lives on the robot. It hosts the skills the Platform produces, captures the real-world data the next training round runs on, and handles the perception, navigation, and control underneath. Two distinct pieces, designed to feed each other.

The hardware: Unitree GO2 + NVIDIA Jetson + Livox MID-360

The hardware footprint is deliberately lean and self-contained. The Unitree GO2 provides a capable quadruped chassis with a front-mounted RGB camera and an SDK that lets us override its default vendor behaviors to inject our own Nav2 velocity commands. The NVIDIA Jetson J4012 sits on the robot's back and runs everything: the drivers, SLAM, Nav2, elevation mapping, the Foxglove bridge, and the heavy data-recording pipelines. No tethered laptop, no companion PC. The Livox MID-360 is a 360° non-repetitive scanning LiDAR. While sparse on a single scan, it builds incredibly dense maps over time, making it the perfect, lightweight match for a mobile robot that can't carry a heavy, costly spinning sensor.

This is the same standardized off-the-shelf hardware tier our broader work targets; integrators deploying mobile manipulators or humanoids end up on the same compute footprint, which is part of why owning the stack at this level compounds across platforms.

The full hardware stack. Compute, perception, and locomotion are all on-board, with no off-robot dependency. (source: Datameister)

The live system today: sensing, navigating, recording

The first complete skill running end-to-end on this stack is autonomous navigation and data capture: give the robot a site and a route, and it patrols autonomously, builds a persistent 3D map of what it sees, captures the run as a replayable bag, and lets you watch it live from a remote browser. The rest of this section unpacks the subsystems underneath.

The same on-board compute we already proved out for advanced indoor 3D perception now also runs that skill, so perception and action share one platform with no off-board computer in the loop. The stack is organized around the perception-to-action loop: it anchors observations in a shared map, turns them into navigation context, acts through Nav2, and records the run for replay and improvement. Each subsystem matters on its own, but the real value is how they connect inside one deployable ROS 2 workflow.

Multi-modal perception

The robot captures the world and itself simultaneously, across multiple streams. GLIM fuses the Livox LiDAR and IMU into real-time SLAM, producing odometry and geometry anchored in a shared spatial frame. The left image below shows the result: a complete city block mapped consistently from thousands of individual scans.

The GO2 front camera is synchronized with that frame and projected onto the geometry, producing a colorized point cloud shown on the right. The color layer adds object and scene context that geometry alone does not carry.
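The core of colorizing a point cloud is a camera projection: each 3D point is mapped to a pixel, and the pixel's color is attached to the point. Below is a minimal sketch using a pinhole model. It assumes the points are already expressed in the camera frame and ignores lens distortion; the intrinsics (fx, fy, cx, cy) and the tiny image are made-up values, not the GO2 camera's calibration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def project(point: Point, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a camera-frame point to pixel (u, v)."""
    x, y, z = point
    if z <= 0:  # behind the camera, cannot be seen
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def colorize(points: List[Point], image: List[List[Color]],
             fx: float, fy: float, cx: float, cy: float):
    """Attach an RGB color to every point that lands inside the image."""
    height, width = len(image), len(image[0])
    colored = []
    for p in points:
        pixel = project(p, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            colored.append((p, image[v][u]))
    return colored


# A 4x4 dummy image whose color encodes the pixel position.
image = [[(u * 10, v * 10, 0) for u in range(4)] for v in range(4)]
points = [(0.0, 0.0, 1.0), (0.5, -0.5, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
cloud = colorize(points, image, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(cloud)
```

Points that project outside the image or sit behind the camera simply stay uncolored, which is why a moving robot with a single forward camera colors the map progressively as it walks.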

Beyond the spatial picture, the stack continuously logs every joint state, motor command, battery level, and control decision: a complete operational record of what the robot perceived, decided, and did. The result is a time-synced stream of everything the robot sees, does, and decides, built in real-time.

Real-time spatial mapping using GLIM (left) alongside the synchronized, camera-colorized point cloud (right). (source: Datameister)

Autonomous navigation and terrain adaptation

Once the robot has a shared spatial frame, it can act inside it. Nav2 handles goal-driven navigation on top of GLIM odometry, publishing updated plans and costmaps as it moves. Through Foxglove, shown in the left image below, the robot can be remotely directed for simple point-to-point movement or set on repeatable waypoint routes, allowing the robot to revisit the exact same paths for patrols, inspections, and repeatable data capture.

Reaching those waypoints reliably means accounting for what's underfoot. To handle uneven environments, elevation_mapping_cupy adds a local 2.5D height map and traversability layer on top of the SLAM output, visible on the right. This signal actively feeds both planning and locomotion: Nav2 switches to a terrain-aware parameter set, while an adaptive gait controller adjusts the robot's steps when the ground becomes harder to cross. The result transforms a static map into a truly navigable environment, telling the robot where to step, where to slow, and where to stop.
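To illustrate the idea of a traversability layer feeding a parameter switch, here is a small pure-Python sketch. It is not how elevation_mapping_cupy computes traversability; it uses a simple step-height heuristic, and the thresholds and profile names are assumptions for illustration.

```python
from typing import List

Grid = List[List[float]]


def step_heights(height_map: Grid) -> Grid:
    """Largest height difference between each cell and its 4-neighbours."""
    rows, cols = len(height_map), len(height_map[0])
    steps = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    diff = abs(height_map[r][c] - height_map[nr][nc])
                    steps[r][c] = max(steps[r][c], diff)
    return steps


def traversability(height_map: Grid, max_step_m: float = 0.15) -> Grid:
    """Map step height to a score: 1.0 is flat, 0.0 is not crossable."""
    return [[max(0.0, 1.0 - s / max_step_m) for s in row]
            for row in step_heights(height_map)]


def pick_profile(trav: Grid, rough_below: float = 0.7) -> str:
    """Choose a (hypothetical) navigation profile from the worst cell."""
    worst = min(min(row) for row in trav)
    if worst <= 0.0:
        return "stop"
    return "terrain_aware" if worst < rough_below else "default"


flat = [[0.0, 0.0], [0.0, 0.0]]
curb = [[0.0, 0.06], [0.0, 0.06]]
wall = [[0.0, 0.30], [0.0, 0.30]]
for name, grid in (("flat", flat), ("curb", curb), ("wall", wall)):
    print(name, pick_profile(traversability(grid)))
```

The three outcomes mirror the three behaviours described above: keep walking, slow down and adapt the gait, or stop.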

Foxglove route planning (left) and the traversability-aware elevation map (right). (source: Datameister)

Remote ops and data harvesting

Every run leaves a trail. The stack records live runs to ROS 2 bags by default, ensuring each session can be inspected or replayed later instead of disappearing when the robot stops. The recording captures the live point-cloud and state streams, so each bag can be replayed through the launch flow or exported into ML-ready datasets for training, evaluation, and map review. Every walkthrough becomes both an autonomy run and a reusable data capture session.
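One step in turning a recorded run into an ML-ready dataset is pairing messages from streams that tick at different rates. The sketch below shows nearest-timestamp matching on plain Python lists. It does not read an actual ROS 2 bag, and the stream names, payloads, and the 50 ms tolerance are hypothetical.

```python
from bisect import bisect_left
from typing import Any, Dict, List, Tuple

Stream = List[Tuple[float, Any]]  # (timestamp in seconds, payload), sorted


def nearest_index(timestamps: List[float], t: float) -> int:
    """Index of the timestamp closest to t in a sorted, non-empty list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if (after - t) < (t - before) else i - 1


def pair_streams(frames: Stream, states: Stream,
                 max_gap_s: float = 0.05) -> List[Dict[str, Any]]:
    """Pair each frame with the closest state; drop frames with no close match."""
    if not states:
        return []
    state_times = [t for t, _ in states]
    samples = []
    for t, frame in frames:
        j = nearest_index(state_times, t)
        if abs(state_times[j] - t) <= max_gap_s:
            samples.append({"t": t, "frame": frame, "state": states[j][1]})
    return samples


frames = [(0.00, "f0"), (0.10, "f1"), (0.50, "f2")]
states = [(0.01, "s0"), (0.09, "s1"), (0.20, "s2")]
for sample in pair_streams(frames, states):
    print(sample)
```

Dropping frames without a close match is a deliberate choice in this sketch: a training sample with a stale state is usually worse than no sample at all.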

Simultaneously, the entire stack is built for untethered, remote observability. The robot exposes its live ROS graph and URDF description through Foxglove, rendering a live digital twin in the same scene as the map, point cloud, and planned motion directly from a remote browser. The field workflow is completely self-contained. On boot, the Jetson starts the stack through a systemd service and connects over Tailscale for remote viewing, development, and debugging. If it cannot join a known network, it falls back to Wi-Fi provisioning mode. The bags those runs produce are also the input the next section is built on.
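The network fallback described above boils down to a small decision: try the known networks, and if none can be joined, switch to provisioning mode. Here is a minimal sketch of that decision only. The function names and return values are hypothetical, and the real join logic is passed in as a callable rather than implemented here.

```python
from typing import Callable, Iterable, Set


def choose_network_mode(known_ssids: Iterable[str],
                        visible_ssids: Set[str],
                        try_join: Callable[[str], bool]) -> str:
    """Join the first known network in range, else fall back to provisioning."""
    for ssid in known_ssids:
        if ssid in visible_ssids and try_join(ssid):
            return f"joined:{ssid}"
    return "provisioning"


# Dummy values for illustration; try_join would wrap the real Wi-Fi tooling.
known = ["site-wifi", "office"]
print(choose_network_mode(known, {"office", "cafe"}, lambda ssid: True))
print(choose_network_mode(known, {"cafe"}, lambda ssid: True))
```

Keeping the decision separate from the side effects makes it easy to test on a laptop before it ever runs on the robot at boot.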

From autonomy to manipulation

The autonomy stack is the foundation. The next round of work sits on top of it, and it runs in two directions at once.

The data flywheel

Every run leaves more than a successful patrol or a clean map. The same ROS 2 bags that record live runs are exactly the input the Datameister Platform needs upstream: real-to-sim reconstruction, scenario augmentation, simulation-grounded policy training, evaluation against held-out scenes. This is the same analytics loop we already run on the Platform for visual AI, now fed by a body that walks around. Every walkthrough produces a complete digital twin of the space; that twin is exactly what our indoor semantic segmentation pipeline was built to consume, with semantic labels for every object and space.

Trained skills come back down to the robot, where the autonomy stack hosts them at runtime. That is the flywheel: capture on the robot, train upstream on the Platform, deploy back to the robot. The autonomy stack and the Platform are designed to feed each other rather than to overlap.

Advanced 3D perception running on our autonomy stack. (source: Datameister, from "Indoor Semantic Segmentation of Point Clouds")

Legs first, then arms

The quadruped is the mobile base, the legs. In parallel, we have a stationary cobot manipulation track running on standardized off-the-shelf arms (Franka, UR, AGIBOT). The two converge into mobile manipulators, and beyond that, into humanoid platforms. Manipulation is where most of the current industrial pain concentrates: precision insertion, bin picking, machine tending, SKU-variant grasping, and depalletization, all on hardware that integrators are already buying. As of mid-May 2026 we are #1 on the Intrinsic AI for Industry Challenge leaderboard, and the challenge is pushing our precision-insertion track toward production readiness.

A foundation that ports across robots

None of this is GO2-specific by design. Our autonomy work (perception, navigation, and data capture) sits above the low-level locomotion and kinematics the platform vendor provides, so the investment is set up to extend to mobile manipulators and humanoids as we work toward those deployments. The on-board compute footprint (NVIDIA Jetson J4012 + Livox MID-360) is the same tier that runs production-grade perception, navigation, and the policy classes downstream of them: classical control where it works, learned policies where they earn their cost. That portability matters for any team running multiple hardware platforms across customer deployments.

Where this fits, and working with us

The autonomy stack is the on-robot side of Datameister's Physical AI work. The off-robot side has three parts: the Datameister Platform (training infrastructure that handles capture, real-to-sim, simulation, training, and evaluation), a perception and 3D-AI library whose building blocks have been live in production for years, and a focused catalog of reference manipulation skills being built compositionally on top of those primitives. As of mid-May 2026, Datameister is #1 on the Intrinsic AI for Industry Challenge leaderboard.

For European robotics integrators building skills on your customers' hardware (inspection today, mobile manipulation and humanoid next), three buyer paths sit on the same stack:

Use what we ship today: the autonomy stack as a reference, plus production-grade primitives in perception, 3D deep learning, and generation. Reference manipulation skills land progressively over the course of 2026.
Fine-tune anything in the Library on our infrastructure, against your customer's hardware and environment. You own the resulting Client-Specific Code.
Build your own skills on our infrastructure, with Datameister upstream of the skill behavior. You own the IP that differentiates you.

Our MSA framework (Platform, Library, Client-Specific Code, Buy-Out License) sits up front so the IP questions are settled before engineering starts. On-prem-default for industrial deployments, EU-domiciled, ISO27001-tracked.

If you are working on robotics in a different shape (a foundation-model lab with a deep cognitive subproblem you would rather not own end-to-end, an OEM looking at training infrastructure for your own platform, or a team trying to bring a specific robot to life in the real world) we are also glad to talk. We have handled the foundational systems engineering so that, together, we can focus on what your customer actually needs the robot to do.

The Monorepo as AI Factory

Ruben Verhack — Tue, 05 May 2026 07:51:44 GMT

TL;DR Datameister runs on a single Monorepo. The decision to centralize our codebase early on now shapes how we ship: knowledge-sharing across teams happens by default, our tooling investment boosts developer velocity across every project, and a consistent quality bar holds as we scale.

This foundation powers our product today: a core technology platform built on shared libraries and deep, collective research; an application layer that turns those capabilities into client-specific applications (which remains the client’s); and the cloud infrastructure that delivers the reliability, security, and observability needed to run it all at scale.

Result: we ship production-grade AI applications in a fraction of the typical timeline. Documentation, security, monitoring, and easy integration are all built in from day one, not bolted on later.

At Datameister, our codebase and intellectual property have been built around a Monorepo, and we are now seeing the long-term benefits of that architecture. If you’re curious about what those benefits are and how they shaped both our technology and our company, this article is for you.

First, what is a Monorepo? It is shorthand for monolithic repository. The word monolithic comes from the Ancient Greek μόνος (monos, “single”) and λίθος (lithos, “stone”), conveying the idea of a single, solid block. Applied to a repository, it means treating the codebase as one unified whole rather than dividing it across multiple repositories. Although the term itself only gained popularity relatively recently (we can trace the original Wikipedia entry back to 2018), the underlying idea is far from new: it closely aligns with the concept of a shared codebase.

Today, many companies have demonstrated that a Monorepo isn’t just a promising idea, but a very viable choice even for extremely large codebases.

Build systems

It is important to note that a Monorepo is merely an architectural decision, and its success largely depends on the build system (the tooling) used to construct and maintain it. Don’t worry, we’ll discuss this in more detail later in the article. Here we mention two build systems in particular: Bazel and Pants, as shown in the image below.

NOTE: Users are based on the Bazel Community page and the Pants Community page

Bazel

Originally Blaze: Google’s internal build tool
Purpose: tackle scaling issues with extremely large codebases
Widely adopted in the industry
Large community: Airbnb, Dropbox, LinkedIn, ...

Pants

Created by Twitter, Foursquare and Square
Similar purpose: replace fragmented build tooling with a single project
Growing community: IBM, Salesforce, and... Datameister!

A brief history

Before diving into the technical details, I’d like to introduce myself and share the origin story of our Monorepo. My name is Pieter Cooreman. I studied Electrical Engineering at Ghent University, but I was always drawn to Computer Science as well. After my engineering studies, I did a Postgraduate in Weather and Climate Modeling driven by personal interest. During that year, I was also looking for a student job and that’s when Larsen introduced me to Datameister. The rest is, well, history.

When I joined, the codebase was still in its early stages. That moment gave us the opportunity to pause and ask ourselves a fundamental question: how could we design our codebase so that it would act as a catalyst rather than a burden in the long run? If we didn’t take the time to really think it through, we would sooner or later bump against a wall. After extensive research, we concluded that the Monorepo was the perfect fit for our company. Looking back, I’m amazed by the progress: how the codebase matured, how we began to reap the benefits from this design choice, and how it continues to evolve through constant team efforts.

Why choose a Monorepo?

A quick Google search can give you an extensive list of pros and cons, but that doesn’t really tell you much without any context. That’s because your design choices only pay off when they align with your use-case. In other words, you lose every advantage you don’t make proper use of.

The figure below illustrates the difference between a Monorepo and a decentralized, Polyrepo architecture. In a Monorepo, integrations are built on top of a shared foundation, which allows for reusability across projects. On the other hand, a Polyrepo typically consists of separate vertical stacks with little to no interaction between them.

We’ve identified three core benefits that played a big role in our decision, as they closely align with our mentality.

Knowledge-sharing: With a centralized codebase, knowledge flows naturally across the whole team. By creating shared, easily accessible libraries, we don’t have to reinvent the wheel for every project. As such, core algorithms, models, and frameworks are readily available. Moreover, everyone can contribute to these libraries, which builds a strong and continuously evolving foundation.

High developer velocity: A Monorepo also enables shared tooling, which ensures everyone is familiar with the development workflow, even when switching between projects. Consequently, slow or repetitive steps can be eliminated, which substantially boosts developer velocity. Furthermore, it’s easier to maintain and improve one unified tooling framework than many different tooling sets.

Consistent quality: The shared infrastructure is definitely a strength, but it also brings with it much greater responsibility. For example, a bug in a shared library can impact all dependent code, which raises the bar for quality. This encourages consistency across the entire codebase, spanning from a coherent coding style to reliable design patterns in more complex tasks.

Now you might wonder, how do we make full use of these advantages?

From idea to product

As shown in the figure, the key advantage of our Monorepo setup is how much it compresses the time from idea to production. The reason is simple: the foundation is already in place.

Instead of going through repeated setup and integration steps for every project, developers can immediately focus on building the client-specific application logic. There’s no need to reassemble the surrounding platform (monitoring, infrastructure, tooling, etc.), because it already exists as a cohesive foundation. Rather than copying and patching together setups that were never designed to be universal, we build on a system that was designed from the start to work across all projects.

The result is a fast, iterative workflow:

ideas quickly become prototypes
prototypes can be refined in short cycles
validated solutions can be deployed immediately

Where Polyrepo setups introduce delays through coordination and repeated setup, the Monorepo enables a continuous flow, allowing us to move from idea to production in a fraction of the time.

Build-A-Monorepo

This still leaves one important question: how do you actually set up and maintain a Monorepo in practice? In our experience, this is far from trivial and very much an ongoing process rather than a one-time task.

As mentioned earlier, the success of a Monorepo depends heavily on the build system. Some organizations ended up building this tooling from the ground up (Google being the canonical example with Blaze), but we were fortunate not to have to start from scratch. We initially considered using Bazel as our build system, but in our experience its support for Python fell a bit short. For an AI-focused company like ours, with Python as the backbone, that would have created a lot of unwanted friction.

Pants, by contrast, is designed with Python as a core use case. It provides strong support for dependency inference, reproducible builds, and fast, incremental workflows, while remaining approachable for developers who are not build-system specialists. This balance has been key for us: the build system is powerful enough to scale with the Monorepo, but accessible enough that developers can reason about it and extend it themselves.

Deploying an endpoint in an afternoon

To make this tangible, let's walk through a toy example. Say we want to deploy a new endpoint that accepts an image, runs an AI model on it, and tells you which objects can be seen in the image. In a traditional setup, this means days of scaffolding before we even get to the interesting part. In our Monorepo, we write one thing: the image processing logic. Everything around it is already in place.

When we spin up a new endpoint:

API documentation is auto-generated and stays in sync with the code, alongside developer-friendly API collections for quick manual testing.

Authentication is wired in by default, consistent across all our services.

Async job orchestration is available when we need it. For heavier workloads, a built-in job system handles processing with status tracking and retries, so the endpoint can accept a request, hand off the work, and let the caller poll for results.

Structured logging and monitoring come for free. Metrics are collected automatically, and spinning up a monitoring dashboard is merely a few lines of Python thanks to a dashboards-as-code library that plugs straight into our infrastructure.

MCP integration is a one-liner on top of the existing API, making the endpoint callable as a tool by AI agents without any extra plumbing.

Crucially, none of this is copy-pasted boilerplate. Every endpoint shares the same libraries, the same dependency tree, and the same CI/CD pipeline. When a security patch lands or a dependency gets updated, every service benefits at once rather than quietly drifting apart. The developer's entire focus stays on the thing that actually matters: the model and the business logic. Getting it to production is just a pull request away.
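As a rough illustration of the accept, hand off, and poll pattern from the job-orchestration point above, here is a minimal in-memory sketch using only the Python standard library. The names `JobStore`, `submit`, `status`, and `wait` are hypothetical and do not reflect the actual Datameister job system; retries are modeled as a plain loop.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobStore:
    """Hypothetical in-memory job system: submit work, poll for status."""

    def __init__(self, max_retries=2):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._futures = {}
        self._max_retries = max_retries

    def _run_with_retries(self, fn, *args):
        # Re-run the task until it succeeds or the retries are exhausted.
        for attempt in range(self._max_retries + 1):
            try:
                return fn(*args)
            except Exception:
                if attempt == self._max_retries:
                    raise

    def submit(self, fn, *args):
        # Accept the request and hand the work to a background worker.
        job_id = str(uuid.uuid4())
        self._futures[job_id] = self._pool.submit(self._run_with_retries, fn, *args)
        return job_id

    def status(self, job_id):
        # What a polling caller would see.
        future = self._futures[job_id]
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id):
        # Convenience for this example: block until the job finishes.
        try:
            self._futures[job_id].result()
        except Exception:
            pass
        return self.status(job_id)


def detect_objects(image_name):
    # Stand-in for the model call; a real endpoint would run inference here.
    return ["object-in-" + image_name]


store = JobStore()
job_id = store.submit(detect_objects, "cat.png")
final = store.wait(job_id)
```

In a real service the caller would poll `status` over HTTP instead of calling `wait`, and the job state would live in a persistent store rather than in process memory.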

The benefits of our Monorepo extend even beyond these “supporting” components. It’s also how we share and scale research. Instead of letting experiments live in isolated repos, we turn them into reusable building blocks. Ranging from object detection (like our DETR work) to 3D asset generation and real-to-sim scene reconstruction (see Trellis), new ideas quickly become tools that anyone can use in production!

A Double-Edged Sword

It is important to remember that a Monorepo is not a magical, one-size-fits-all solution. It offers strong benefits, but also introduces its own set of challenges.

Scaling

Tools like Pants help, but growth still adds complexity
More developers and projects require clear structure and boundaries
Continuous feedback and disciplined repo management remain essential

Isolation

Everything is interconnected, which is powerful but risky
Changes can have wide impact across projects
Balance is key: isolate where needed, reuse where it helps

Edge cases

Not all code fits naturally into a Monorepo
Forcing everything in can be counterproductive
The Monorepo can still serve as a reference for quality and tooling

In short:
A Monorepo doesn’t remove complexity, but it makes it manageable. With the right tools and engineering culture, it’s a strong foundation, but it will always remain a work in progress.

Conclusion

Looking back, the real benefits of the Monorepo didn’t come from the architecture alone, but from how it shaped the way we work at Datameister. It encourages strong collaboration, pushes us toward higher standards, and lets us move faster with more confidence. Over time, those small advantages compounded into a development environment where research ideas quickly turn into production-ready capabilities, and every new project starts on top of everything we've built before.

That compounding is what our clients end up feeling directly. When we take on a new project, the foundation is already there. What's left to build is the model and the application logic the client actually hired us for. The path from first conversation to a robust deployment keeps getting shorter, because each model we ship sharpens the components underneath it.

After enough cycles, the codebase starts behaving more like an AI factory than a codebase: a system where mature sub-components let us take on harder problems by composing what's already there instead of starting over. It's why we can credibly move into physical AI today, where the work depends on a stack of capabilities that have to be solid individually before they're worth anything together. The Monorepo isn't a plug-and-play decision. It's the ongoing investment that turns every project we ship into raw material for the next, harder one.

Indoor Semantic Segmentation of Point Clouds: From LiDAR Capture to Real-World Use

Jirne Meurisse — Thu, 16 Apr 2026 08:58:38 GMT

TL;DR Indoor semantic segmentation promises structured scene understanding from LiDAR point clouds, but models trained on outdoor datasets fail to generalize indoors due to differences in viewpoint, environment and sensor characteristics.

At Datameister, we leverage our 3D AI expertise to bridge this gap using domain-aware processing with a state-of-the-art model and domain adaptation through training on indoor-specific datasets. We validate our pipeline end-to-end on our own quadruped platform and perform semantic segmentation and downstream tasks on real-world scans.

Result: industry-grade indoor spatial intelligence using hardware that is up to 10x cheaper than traditional automotive setups.

Applications: autonomous navigation and manipulation, spatial virtual twins, real-to-sim pipelines, scan-to-3D automation.

Raw LiDAR data is not useful on its own: it is just a set of thousands to millions of 3D coordinates describing the geometry of your environment. To make this data actionable, it needs to be transformed into a structured representation that systems can actually reason about.

This is exactly where the Datameister Semantic Segmentation Pipeline comes in, transforming raw (sparse) point clouds into meaningful scene understanding that enables downstream tasks such as digital twins, real-to-sim, asset management, etc.

Going from a sparse raw point cloud to a semantically labeled point cloud enables downstream tasks such as scene reconstruction.

If you are building robotics systems, digital twins, or indoor mapping pipelines, LiDAR is likely a core input. Many teams assume they can reuse outdoor segmentation models, but quickly run into the reality that these models fail to transfer across domains, sensors, and platforms. So how do we achieve reliable indoor scene understanding?

We’ve structured this post as follows. First, we’ll define semantic segmentation on point clouds and why it is the foundation of downstream tasks. Second, we look at how these point clouds are captured and examine the current gold standard of the field: outdoor scenes. Third, we explore why these approaches break down indoors, and what it takes to make them work in practice. Finally, we wrap up with the high-impact downstream applications this unlocks.

About the author: Hi, I’m Jirne, an AI Engineer at Datameister. I started out tinkering with electronics, 3D printing, and photography long before pursuing a degree in Computer Science Engineering. During my studies at Ghent University, I became particularly interested in Computer Vision and Computer Graphics. Today, I focus on building practical AI systems that bring spatial intelligence into real-world applications.

1. Pixel-Level to Point-Level Semantic Segmentation

Semantic segmentation assigns domain-specific labels to pixels (2D) or points (3D), which is a first step in robotic perception of an environment. In the example case of 2D images, the task is a pixel-level classification problem, where instead of classifying an entire image as “car”, we assign labels to each pixel, e.g. wheels, body and windows. Analogously, this task can be extended to point clouds (3D), resulting in a point-classification problem.

Toy example illustrating pixel-level classification (2D) vs. point-level classification (3D)

Regardless of the input dimensionality, the goal is to produce a dense, spatially consistent map where the boundaries are smooth and logical. In the context of indoor scene understanding, semantic segmentation adds the fundamental structure that allows the model to distinguish points belonging to the wall, the floor or furniture.

Note: Unlike instance segmentation, semantic segmentation treats all objects of the same category as a single entity. As a rule of thumb: semantic segmentation identifies "what," while instance segmentation identifies "which." Instance segmentation can be done by a separate model, or as a post-processing step of semantic segmentation.
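To make the "what" versus "which" distinction concrete, here is a small sketch in plain Python. It assumes per-point class scores are already available, takes the highest-scoring class per point, and then groups nearby points of one class into instances with a simple distance-based clustering. The function names, the toy data, and the clustering radius are illustrative assumptions, not part of the pipeline described in this post.

```python
from collections import deque

CLASSES = ["floor", "chair"]  # toy label set for illustration


def semantic_labels(scores):
    # Point-level classification: pick the highest-scoring class per point.
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def instance_ids(points, labels, target, radius):
    # Group points of one semantic class into instances: points closer than
    # `radius` end up in the same connected component.
    idx = [i for i, lab in enumerate(labels) if lab == target]
    ids = {}
    next_id = 0
    for seed in idx:
        if seed in ids:
            continue
        ids[seed] = next_id
        queue = deque([seed])
        while queue:
            cur = queue.popleft()
            for other in idx:
                if other in ids:
                    continue
                dist2 = sum((a - b) ** 2 for a, b in zip(points[cur], points[other]))
                if dist2 <= radius ** 2:
                    ids[other] = next_id
                    queue.append(other)
        next_id += 1
    return ids


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.5), (0.2, 0.0, 0.5), (3.0, 0.0, 0.5), (3.1, 0.0, 0.5)]
scores = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9], [0.4, 0.6]]

labels = semantic_labels(scores)  # "what": every chair point gets the same label
chairs = instance_ids(points, labels, CLASSES.index("chair"), radius=0.3)  # "which"
```

Here the four chair points share one semantic label, while the clustering step splits them into two separate chairs. The pairwise search is quadratic in the number of points, so a real point cloud would need a spatial index.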

1.1 How 3D Point Clouds Are Captured

The tradeoff in LiDAR sensor selection for our application is accessibility (e.g., price, form factor) vs. industry adoption, rather than accuracy. Modern sensors like the Livox Mid-360 deliver sufficient accuracy at 1/10th of the price of traditional automotive-grade systems, fundamentally changing deployment feasibility.

At a high level, LiDAR sensors emit laser pulses, measure their return time, and convert these measurements into 3D points. Repeating this across many directions produces a geometric sampling of the scene, often together with extra attributes such as intensity, timestamp, and sometimes RGB after sensor fusion.
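The conversion from return time to a 3D point can be sketched in a few lines. This is a simplified time-of-flight model: the range is half the round-trip distance, and the beam direction is given as azimuth and elevation angles. The axis convention (x forward, z up) and the function names are assumptions for illustration; real sensors apply additional calibration.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def tof_to_range(return_time_s):
    # The pulse travels to the surface and back, hence the division by two.
    return SPEED_OF_LIGHT * return_time_s / 2.0


def to_cartesian(rng, azimuth_rad, elevation_rad):
    # Spherical (range, azimuth, elevation) to sensor-frame x, y, z.
    horizontal = rng * math.cos(elevation_rad)
    return (
        horizontal * math.cos(azimuth_rad),
        horizontal * math.sin(azimuth_rad),
        rng * math.sin(elevation_rad),
    )


# A return after about 66.7 ns corresponds to a surface roughly 10 m away.
rng = tof_to_range(66.7e-9)
point = to_cartesian(rng, math.radians(90.0), 0.0)
```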

Three categories of LiDAR sensors, illustrating the tradeoff between industry adoption and accessibility (e.g., price, form factor)

The available LiDAR sensors span a wide spectrum: industry-grade sensors like the Ouster ($10K+) that have been the standard for years, versus newer, compact alternatives like the Livox (<$1K). This rapid evolution has dramatically increased accessibility, bringing LiDAR to everyday devices, including some of the smartphones we carry.

In real-world scenarios, cheaper sensors unlock dramatically cheaper deployments but shift the complexity to the software, which we’ll take care of.

2. Why Outdoor Segmentation Models Don’t Transfer to Indoor Environments

Most teams assume that segmentation models trained on outdoor datasets can be reused indoors. In practice, this assumption breaks down and often leads to unreliable predictions and failed deployments.

These models are built on datasets such as KITTI, NuScenes, and Waymo Open Dataset, captured using automotive-grade LiDAR sensors in structured outdoor environments. Years of investment in autonomous driving have made these models highly performant in that specific setting, as demonstrated by impressive real-time segmentation results on benchmarks like SemanticKITTI.

Real-Time Semantic Segmentation on the SemanticKITTI dataset (slowed down and looped).
The following classes are shown: Road (pink), Building (yellow), Cars (blue), Pedestrians (red).

Indoor environments, however, introduce fundamentally different conditions. Viewpoints change, sensor characteristics differ, and scenes become less structured. As a result, models trained on outdoor data rely on assumptions about geometry and point cloud patterns that do not hold indoors.

Without adaptation, even state-of-the-art outdoor models become unreliable in indoor settings, making them unsuitable for real-world indoor deployment.

2.1 Impact of Viewpoint on Point Cloud Density

First, viewpoint differences fundamentally change the structure of the point cloud. Models trained on car-mounted LiDAR expect a specific distribution of points that does not match indoor or robot-mounted setups.

A different viewpoint of the LiDAR relative to the object has an impact on the incident angle and thus captured point density.

For example, a car-mounted LiDAR observes a person from above, resulting in higher point density near the head and lower density toward the legs. A quadruped robot captures the same person from a lower angle, producing the opposite pattern. Since many models (e.g. PointPillars available in OpenPCSeg) implicitly learn these density distributions, this mismatch directly impacts their predictions.
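A back-of-the-envelope model helps explain the effect. Assuming a sensor that samples directions uniformly, the number of points landing on a unit of surface area scales with the cosine of the incident angle (measured from the surface normal) divided by the squared distance. The sketch below uses this idealized relation; it ignores beam divergence, reflectivity, and the actual scan pattern.

```python
import math


def relative_density(distance_m, incident_angle_deg):
    # Points per unit surface area, up to a sensor-specific constant:
    # falls off with the square of the distance and with the cosine of the
    # angle between the beam and the surface normal.
    return math.cos(math.radians(incident_angle_deg)) / distance_m ** 2


head_on = relative_density(5.0, 0.0)
grazing = relative_density(5.0, 75.0)
ratio = grazing / head_on
```

At the same distance, a surface seen at a 75 degree incident angle receives only about a quarter of the points of a surface seen head-on, which is why the same object can look very different from a car roof and from a low robot.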

Less intuitively, even if the object is identical, a change in viewpoint can cause models to misclassify or completely miss it, making outdoor-trained models unreliable on indoor robotic platforms.

2.2 Repetitive vs. Non-Repetitive LiDAR Sensors

Second, LiDAR sensors differ not just in cost or form factor, but in how they sample the environment. These differences directly shape the point cloud and influence how models interpret it.

Spinning sensors such as Ouster use repetitive scanning patterns, capturing the environment in fixed, predictable rings. This provides high temporal consistency, which is critical for detecting dynamic changes like moving objects. However, these sensors cannot sample the gaps between rings, limiting spatial coverage.

In contrast, sensors like the Livox Mid-360 use non-repetitive scanning patterns, where each pass samples different parts of the scene. Individual scans are sparse, but by accumulating multiple frames over time, the sensor achieves much higher spatial density than traditional spinning LiDARs.

Comparison of repetitive (e.g. Ouster) vs. non-repetitive (e.g. Livox) scanning patterns
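The difference in coverage can be illustrated with a toy simulation. The field of view is discretized into cells, and each frame samples a fixed number of them. The repetitive sensor hits the same cells every frame, while the non-repetitive one is approximated here by seeded random sampling. That random model is a stand-in chosen for simplicity; it is not the actual Livox scan pattern.

```python
import random

CELLS = 1000            # discretized field of view
POINTS_PER_FRAME = 100
FRAMES = 20


def coverage(non_repetitive, seed=0):
    rng = random.Random(seed)
    # Repetitive pattern: the same evenly spaced cells ("rings") every frame.
    fixed = list(range(0, CELLS, CELLS // POINTS_PER_FRAME))
    seen = set()
    history = []
    for _ in range(FRAMES):
        if non_repetitive:
            frame = [rng.randrange(CELLS) for _ in range(POINTS_PER_FRAME)]
        else:
            frame = fixed
        seen.update(frame)
        history.append(len(seen) / CELLS)  # fraction of cells observed so far
    return history


repetitive = coverage(non_repetitive=False)
accumulated = coverage(non_repetitive=True)
```

The repetitive history stays flat at 10% coverage no matter how many frames are accumulated, while the non-repetitive history keeps growing with every frame.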

Unfortunately, models trained on repetitive scan patterns implicitly expect structured ring-like inputs. When deployed on non-repetitive sensors, this mismatch degrades performance, making cross-sensor deployment unreliable without adaptation.

Note: Recently, efforts have been made to train models that transfer across sensors and domains, such as Utonia, but more in-the-field results are needed to validate this approach in real-world applications.

2.3 Unstructured, Cluttered, and Highly Variable Scenes

Finally, indoor environments are far less structured than outdoor scenes. Models trained on outdoor data rely on predictable patterns such as roads, sidewalks, and traffic infrastructure.

Indoor spaces, in contrast, are highly variable and cluttered. Layouts differ significantly, objects are more diverse, and the number of relevant classes increases. Instead of focusing on a few categories like cars and pedestrians, indoor models must distinguish between furniture, fixtures, and structural elements at a much finer level of detail.

A key observation is that indoor segmentation is not just a harder version of the same problem. It requires different assumptions, higher label granularity, and more robust modeling to work reliably in practice.

3. Making Indoor Segmentation Work on Smaller, Lower-Cost Platforms

Making indoor segmentation work on accessible hardware in real-world scenarios is not a matter of fine-tuning a model. It requires rethinking the entire pipeline, from how data is captured to how it is processed and interpreted.

To make this work in practice, the goal is not just accuracy on benchmarks, but a system that is portable, affordable, and robust enough for real-world deployment.

Our setup reflects this philosophy. We use a compact platform built around the Unitree GO2 quadruped, combined with an NVIDIA compute module and a Livox Mid-360 LiDAR. This configuration provides a practical balance between cost, mobility, and sensing capability for indoor environments.

Render of our development & validation setup: Unitree GO2 with Livox Mid-360 LiDAR and NVIDIA compute module

The following sections detail how we turn this setup into a reliable semantic segmentation pipeline, enabling further downstream spatial AI.

3.1 From Sparse Scans to Segmentation-Ready Data

A single LiDAR scan from a sensor with a non-repetitive scan pattern is too sparse for reliable segmentation. As discussed in Section 2.2, non-repetitive sensors like the Livox Mid-360 distribute points across different locations in each scan, meaning individual frames lack sufficient spatial coverage.

The animation below shows how this changes over time. Each frame adds new information, gradually filling in the scene as scans are accumulated.

Frame accumulation over time and robot motion on the Livox Mid-360 creates a dense point cloud for semantic segmentation.

On a moving platform, this only works if motion is properly compensated. By synchronizing frame accumulation with the on-device SLAM system, we reconstruct a spatially consistent point cloud of the entire floor that is ready for segmentation.
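The core of pose-based accumulation can be sketched as follows: each scan is expressed in the sensor frame, and the pose reported for that scan moves it into a shared world frame before merging. For brevity this sketch assumes planar poses `(x, y, yaw)` and one pose per scan; a real system would use full 6-DoF poses and would also correct for motion within a single scan. All names are illustrative.

```python
import math


def transform(points, pose):
    # pose = (x, y, yaw): planar robot pose in the world frame, as a SLAM
    # system might report it (simplified; real poses are full 6-DoF).
    px, py, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + px, s * x + c * y + py, z) for x, y, z in points]


def accumulate(scans, poses):
    # Express every scan in the shared world frame, then merge.
    cloud = []
    for points, pose in zip(scans, poses):
        cloud.extend(transform(points, pose))
    return cloud


# The same wall point, seen from two robot poses, in sensor coordinates.
scans = [[(2.0, 0.0, 1.0)], [(0.0, -1.0, 1.0)]]
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)]
cloud = accumulate(scans, poses)
```

Both observations land on the same world coordinate, (2, 0, 1). Without the pose correction the second observation would be placed at (0, -1, 1) and smear the wall.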

Without scan accumulation and motion compensation, low-cost LiDAR data is too sparse to be useful. With it, we unlock dense, high-quality inputs from lightweight hardware.

3.2 Datameister’s End-to-End Segmentation Pipeline

A pre-trained model alone is not enough to achieve reliable indoor segmentation. While large models like Sonata provide state-of-the-art semantic segmentation, they must be adapted to the specific characteristics of indoor environments and sensor setups to achieve a robust output.

Datameister’s semantic segmentation pipeline from capture on our platform to a semantically segmented point cloud.

Our pipeline combines domain-aware pre- and post-processing with a state-of-the-art segmentation model. We build on top of Sonata, a pre-trained Point Transformer V3 encoder, and use a decoder head trained on ScanNet-20, an indoor dataset for semantic segmentation. This allows the model to learn the structure and variability of indoor scenes while classifying points into relevant categories such as floors, walls, and furniture. Depending on the domain, we can train the decoder for the desired classes while preserving the broad geometric understanding learned during large-scale pre-training.
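The division of labor between a frozen encoder and a trainable head can be shown with a deliberately tiny example. The `frozen_encoder` below is a hypothetical stand-in that only exposes point height, and the head is a two-class logistic regression trained with plain gradient steps. Neither resembles Sonata or the actual decoder; the point is only that the encoder is never updated while the head learns the target classes.

```python
import math


def frozen_encoder(point):
    # Stand-in for a pre-trained encoder: its parameters are never updated.
    # Here it just exposes height plus a bias feature.
    x, y, z = point
    return [z, 1.0]


def train_head(points, labels, epochs=500, lr=0.5):
    # Only the weights of this small linear head are learned.
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for point, label in zip(points, labels):
            feats = frozen_encoder(point)
            logit = sum(w * f for w, f in zip(weights, feats))
            prob = 1.0 / (1.0 + math.exp(-logit))
            for i, f in enumerate(feats):
                weights[i] -= lr * (prob - label) * f
    return weights


def predict(weights, point):
    logit = sum(w * f for w, f in zip(weights, frozen_encoder(point)))
    return 1 if logit > 0 else 0


# Toy data: label 0 = floor (low points), label 1 = wall (higher points).
train_points = [(0, 0, 0.0), (1, 2, 0.05), (3, 1, 0.02), (0, 1, 1.2), (2, 0, 1.8), (1, 1, 0.9)]
train_labels = [0, 0, 0, 1, 1, 1]
head = train_head(train_points, train_labels)
predictions = [predict(head, (5, 5, 0.01)), predict(head, (5, 5, 1.5))]
```

Retraining for a different label set would only mean calling `train_head` again with new labels, leaving `frozen_encoder` untouched.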

Crucially, this model operates on the accumulated and motion-compensated point clouds described in Section 3.1. This integration with the on-device SLAM system ensures that the input is dense and spatially consistent, which is essential for reliable predictions in real-world environments.

The pipeline diagram shows how these components come together, from data capture on our customized Unitree GO2 to a semantically segmented point cloud. The accompanying animation demonstrates the pipeline in action on our office environment, highlighting how unlabeled scans are transformed into structured scene understanding.

Raw scan to semantic segmentation: Floor (green), Walls (cyan), Chairs (yellow), Tables (orange), Doors (purple), Windows (red).

This shows that our approach works beyond a lab setup or benchmark. It is a practical system that delivers high-quality indoor segmentation on compact, low-cost hardware in real-world conditions.

4. Downstream Applications

Semantic segmentation is not the end goal. It is the foundation for turning raw spatial data into usable environments, moving from a research problem to a business enabler. Once a point cloud is structured and labeled, it becomes a building block for a wide range of practical applications.

4.1 Semantic-to-Parametric Reconstruction

A semantically labeled point cloud provides a structured understanding of the scene, enabling reliable detection of key surfaces and objects such as floors, walls, ceilings, doors, and large fixtures.

Building on these labeled points, the environment can be transformed into a parametric representation. In this representation, the scene is described using geometric primitives for layout (such as walls and floors), while more complex objects are approximated with 3D assets. This creates a clean and editable reconstruction of the environment.
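As a minimal sketch of what "geometric primitives for layout" can mean, the example below derives a floor height and a rectangular footprint from labeled points. It assumes a horizontal floor and an axis-aligned room, which are strong simplifications made for illustration; the function names and data are hypothetical and this is not the reconstruction method used in our pipeline.

```python
from statistics import median


def fit_floor_height(points):
    # A horizontal plane z = h; the median is robust to a few stray points.
    return median(z for _, _, z in points)


def fit_footprint(points):
    # Axis-aligned rectangle enclosing the wall points (a strong
    # simplification: real rooms are rarely axis-aligned boxes).
    xs = [x for x, _, _ in points]
    ys = [y for _, y, _ in points]
    return {"x_min": min(xs), "x_max": max(xs), "y_min": min(ys), "y_max": max(ys)}


labeled = [
    ((0.5, 0.5, 0.01), "floor"), ((2.0, 1.0, -0.01), "floor"), ((3.5, 2.5, 0.00), "floor"),
    ((0.0, 1.0, 1.2), "wall"), ((4.0, 2.0, 0.8), "wall"),
    ((2.0, 0.0, 1.5), "wall"), ((1.0, 3.0, 2.0), "wall"),
]

floor_pts = [p for p, label in labeled if label == "floor"]
wall_pts = [p for p, label in labeled if label == "wall"]
room = {"floor_z": fit_floor_height(floor_pts), **fit_footprint(wall_pts)}
```

The resulting `room` dictionary is a handful of editable parameters instead of thousands of points, which is the property that makes a parametric representation convenient downstream.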

We’ve moved from raw sensor data into a structured digital environment that can be directly used by downstream systems.

A reconstruction of our office using geometric primitives (walls and floor) and 3D assets (tables and chairs)

4.2 From Reconstruction to Real-World Applications

Once the environment is reconstructed, it can be used across multiple domains:

Real-to-sim pipelines: converting scanned environments into simulation-ready spaces for robotics testing, navigation, validation, and continuous improvement.
Example: A robotics integrator scans a customer’s production hall and generates a simulation environment to validate navigation, obstacle avoidance, and task execution before deployment.

CAD and digital twins: creating a structured base layer that designers and engineers can design with, analyze, and build on in their existing modeling tools, e.g., AutoCAD, Blender.
Example: A renovation contractor digitizes a century-old building and obtains a clean CAD model that can be edited in tools like AutoCAD for planning structural changes.

Asset inventory and facility management: identifying and localizing structural elements and large objects for tracking and maintenance.
Example: A facility manager scans a warehouse to automatically catalog equipment, enabling faster audits and maintenance planning.

In other words, semantic segmentation does not just help a robot "understand" a room. It helps teams turn raw captures into environments they can analyze, simulate, and build on.

5. Conclusion

When a human looks at a raw point cloud, they can quickly make sense of it. We intuitively separate floor from wall, ignore noise, and recognize objects even from incomplete geometry. For a robot, that same scene is just a large collection of coordinates until additional structure is added.

That is where Datameister comes in. We turn raw LiDAR captures into structured scene understanding through semantic segmentation pipelines designed for real-world indoor environments. The goal is not just better benchmark performance, but reliable inputs for the systems that depend on them, from robots that need to navigate to simulations and digital twins that require accurate spatial context.

So what? This enables teams to move from raw spatial data to environments they can simulate, analyze, and build on, without relying on expensive, automotive-grade hardware.

If you are building systems that scan, map, or interact with the physical world, raw data and hardware constraints should not be your bottleneck. We help teams turn LiDAR data into structured, actionable spatial intelligence, ready for deployment.

Ready to bring indoor spatial AI into your products? Let’s build it together.

From Studio to Robot: Well-Integrated 3D Generation

Jarne Van den Herrewegen — Tue, 24 Mar 2026 10:45:03 GMT

TL;DR 3D workflows in manufacturing & creative industries struggle to benefit from generative AI because the outputs lack style consistency, control is limited, and new tools don't integrate with existing workflows. This post shows a Blender add-on that embeds controlled 3D generation directly into the modeling environment, letting users define where generation can and can't happen. By combining controlled 3D generation and code generation with predefined actions, e.g. to search existing assets or to generate new models, AI-based workflows are leading to a new paradigm of ideation and creation. Crucially, the same control mechanisms make it useful for generating synthetic training and evaluation data for robotics and computer vision.

Result: Controlled 3D generation embedded directly in the modeling environment. Users steer what gets generated and what stays locked, with asset library integration to stay grounded.

Applications: Design ideation and rapid prototyping across industrial design, game development, VFX, virtual production, synthetic data generation for robotics and computer vision.

1. Friction for 3D Generation Tools

A pattern that often occurs across 3D workflows: a team tries a generative AI tool, gets some interesting outputs, and then hits a wall when they try to efficiently use those outputs in production. The results don't match their style, the control mechanisms are lacking, and the tool lives outside the environment where all their other work happens. The generative model was impressive. The integration wasn't there. This problem shows up in 3D modeling, concept art, industrial design, mechanical engineering, game development, VFX, virtual production, synthetic data generation, digital twins, ... anywhere teams work with complex assets and established pipelines.

The same friction shows up in a different form for robotics teams building perception models or manipulation policies. A robotics team needs thousands of realistic, annotated 3D scenes to train and evaluate navigation and grasping policies. A computer vision team needs diverse variations of a warehouse layout to train an object detector. They could model each scene by hand, which doesn't scale. An alternative is unconstrained generation, which produces broad data that looks plausible but goes too far and overcomplicates the learning problem. The control problem is fundamentally the same: you need generation that respects the structure you already know to be correct.

Overview of a custom Blender add-on for generative editing. Green is the generation zone, red is the no-go zone and yellow is the base object. Three generations in front.

Ideally, generation would live inside your existing tools, respect the geometry you've already committed to, and even pull from your asset library to stay grounded in your style and design language. This post walks through one concrete example showing an integrated workflow: a custom Blender add-on for constrained 3D generation and editing.

The Control Problem in 3D Ideation

Early-stage ideation in 3D is time-intensive by nature. Sketching an idea in 2D is fast; translating that into actual 3D geometry is not. Generative AI changes this equation: it can produce 3D assets in a fraction of the time, making it genuinely useful for rapid prototyping and exploring design variations early in the process.

The catch is control. Most 3D generation models give you limited ways to communicate your intent. You get an output, it fills in the blanks in ways you may not want, and iterating is tedious. On top of that, 3D modeling is easiest in tools you already know inside-out, and there's no reason to reinvent that. The goal is to bring generation into that existing environment, not replace it.

A small word about the author. My name is Jarne, and I am a first-class AI nerd and ML engineer at Datameister. With a PhD in self-supervised 3D deep learning, i.e., teaching models to learn meaningful representations from large custom 3D data collections, I bring deep understanding to the table. The best thing for me is rabbit-holing in research and busting out algorithms in disruptive products.

2. API-Based 3D Generation in Blender

The example showcases a Blender add-on that integrates constrained 3D generation directly into the modeling environment. The starting point is an existing model in the scene, not a blank canvas. The user defines where generation should happen and where it shouldn't using geometry defined in Blender, sends that to our server, and gets back a set of variations to evaluate.

Connected through an API

The generation runs through an API call on our Datameister GPU platform using an adapted version of Trellis, which means we handle latency, queue management, and can adapt the model to a client's specific data and style. Queue status and active users are surfaced directly in the UI; mesh normalization and other pre/post-processing steps are handled automatically.

Blender is the environment we used here, but the same approach applies to any software that supports plugins: CAD tools, simulation environments, game engines, or pipelines like ComfyUI workflows. The integration layer is what matters for end users. This also means we're independent of any specific model and can choose whichever works best for a given use case.

Green is the generation zone, red is the no-go zone and yellow is the base object. Generation results at the bottom.

Steering with Go-Zones and No-Go Zones

The constraint system is a custom-built algorithm and the piece that makes this practical for real production work. The add-on lets users define go-zones, bounded volumes where the model is free to generate, and no-go zones, regions that must not be touched. These are set directly in the Blender viewport and passed to the model at inference time.
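To give a feel for the zone semantics only, here is a hypothetical post-hoc check using axis-aligned boxes: a candidate location is allowed if it lies inside at least one go-zone and outside every no-go zone. The actual constraint algorithm passes the zones to the model at inference time and is not described in this post, so treat the names and the filtering approach below as illustrative assumptions.

```python
def inside(box, point):
    # box = (min_corner, max_corner), both (x, y, z) tuples.
    lo, hi = box
    return all(l <= c <= h for l, c, h in zip(lo, point, hi))


def allowed(point, go_zones, no_go_zones):
    # Generation is permitted only inside a go-zone and outside every no-go zone.
    in_go = any(inside(box, point) for box in go_zones)
    in_no_go = any(inside(box, point) for box in no_go_zones)
    return in_go and not in_no_go


go_zones = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))]
no_go_zones = [((0.5, 0.0, 0.0), (1.0, 1.0, 1.0))]

candidates = [(0.2, 0.5, 0.5), (0.8, 0.5, 0.5), (1.5, 0.5, 0.5)]
kept = [p for p in candidates if allowed(p, go_zones, no_go_zones)]
```

Only the first candidate survives: the second falls in the no-go zone even though it is also inside the go-zone, and the third is outside every go-zone.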

Instead of generating into a vacuum and manually cleaning up afterward, the creator is steering from the start. Take an existing asset, mark the part you want to explore variations on, lock the rest, and get back options that respect the structure you've already committed to. That's a different experience from prompt-and-hope. You can read more about the approach in our post on constraint-aware 3D generative design.

For synthetic data, this constraint system maps directly to domain randomization with guardrails. You have a validated base scene (a robot workcell, a warehouse aisle, a surgical tray) and you want to generate diverse variations for training and evaluation while keeping the structural layout physically plausible. Go-zones let you define where variation should happen while no-go zones lock what must remain fixed.

3. Staying Close to Your Tools and Your Assets

When AI lives inside your modeling tool, it can see your scene, query your asset library, and match your established style and design language. Say you have a catalog of 500 parts or environment props built over several years of production: the add-on can search that library, surface the closest matches to what you're generating, and use them as style and geometric references. 3D generation stays grounded in what you've already approved rather than producing something generically plausible that needs reworking. The same applies to synthetic data: a robotics team doesn't want hallucinated objects, they want controlled variation on validated CAD models from their own part catalog. This is an underrated aspect of well-integrated tooling.
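One common way to "surface the closest matches" is nearest-neighbor search over embedding vectors. The sketch below ranks library entries by cosine similarity to a query embedding. The three-dimensional vectors and asset names are made up for illustration, and the post does not state which retrieval method the add-on uses; a real system would obtain embeddings from a learned shape or image encoder.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_assets(query, library, k=2):
    # Rank library entries by similarity to the query embedding.
    ranked = sorted(library, key=lambda name: cosine(query, library[name]), reverse=True)
    return ranked[:k]


# Toy 3-D embeddings, invented for this example.
library = {
    "bracket_a": [0.9, 0.1, 0.0],
    "bracket_b": [0.8, 0.2, 0.1],
    "knob": [0.0, 0.1, 0.9],
}
matches = closest_assets([1.0, 0.1, 0.0], library)
```

The two brackets rank ahead of the knob, so they would be the ones offered as style and geometric references.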

This also opens up a different way to build full scenes: combine existing assets for the parts that are already right, and use generation selectively for the gaps: a unique prop, a surface variation, a structural element you don't have yet. Generation where it matters, reuse everywhere else.

Staying close to familiar tools accelerates adoption. No new platform, no file format changes, no disruption to review. This applies whether you're a 3D studio, a game developer, an engineering firm, a manufacturer, or any team with an established 3D pipeline. Integration that respects existing conventions is what makes the difference between something teams try once and something that sticks.

4. Where This Is Going: Blender MCP & friends

Blender MCP is another example of AI integration, albeit more open-ended. We also have our own version of it at Datameister, called DD3M. It operates with fewer constraints and allows chatbots to execute virtually any action in the scene, which produces results that range from genuinely impressive to frustrating, sometimes within the same session.

Ask it to build a beach scene from a reference image and it'll pull HDRIs, place assets, and set up lighting in minutes. Ask it to position an object at precise coordinates and it may place it incorrectly three times in a row. The capability is clearly there; the reliability isn't.

What bridges that gap is guardrailing and purpose-built tooling. Rather than exposing the full action space to a language model and hoping for the best, the more productive path is defining a set of fixed operations the model can invoke with confidence and letting the model choose between the operations and the free-form code generation. This separates what the model decides from what the model executes. Two concrete directions stand out:

Automating repetitive work. Scene cleanup, naming conventions, LOD generation, UV unwrapping, file format conversions, ... tasks that are well-defined but tedious are a natural fit. A language model doesn't need creative latitude here; it needs reliable tools to call. For synthetic data pipelines, this extends naturally to automating scene randomization: camera placement sweeps, lighting rig variations, material permutations, physics-based object drops.

Accelerating asset reuse. Most studios are sitting on years of approved assets. With the right tooling, a model can search that library semantically, surface relevant matches, and drop them into the scene.
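To make the split between fixed operations and free-form generation concrete, here is a minimal sketch of a dispatcher. It is not DD3M's implementation: the tool names (`rename_objects`, `place_object`), the request format, and the `allow_free_form` flag are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Registry of fixed, validated operations (all names here are hypothetical).
TOOLS: Dict[str, Callable[..., str]] = {}

def tool(fn):
    """Register a function as a fixed operation the model may invoke."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def rename_objects(prefix: str, names: list) -> str:
    """Apply a naming convention to a list of object names."""
    return ", ".join(f"{prefix}_{i:03d}_{n}" for i, n in enumerate(names))

@tool
def place_object(name: str, x: float, y: float, z: float) -> str:
    """Place an object at exact coordinates; the tool validates the input."""
    for value in (x, y, z):
        if not isinstance(value, (int, float)):
            raise TypeError("coordinates must be numbers")
    return f"{name} -> ({x:.3f}, {y:.3f}, {z:.3f})"

def dispatch(request: dict, allow_free_form: bool = False) -> str:
    """Route a model request to a fixed tool, or to gated free-form code."""
    if request.get("tool") in TOOLS:
        return TOOLS[request["tool"]](**request.get("args", {}))
    if allow_free_form and "code" in request:
        # A real system would sandbox and review this path before running it.
        return f"free-form script queued ({len(request['code'])} chars)"
    raise ValueError("unknown tool and free-form generation is disabled")
```

The model only decides which operation to call and with which arguments; the operation itself is ordinary, tested code. Free-form generation stays available, but behind an explicit switch.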

5. Conclusion: the Advantage of Independent Development

We're not a foundation model lab. We take the best models that are both available and legally usable, whether open-source, fine-tuned, or third-party, and build the layer that makes them usable in production: deployment infrastructure, custom tooling, fine-tuning on client data, and in some cases algorithms built from scratch. The go-zone/no-go zone system for 3D generation is an example of the latter, not something you get by calling an off-the-shelf API. The same depth applies across domains: strong generalist models exist for 3D, vision, and language, but making them work well for a specific team in a specific context is a different problem entirely.

The thread running through this post is that well-integrated controlled generation is a shared primitive. The same mechanisms that let a designer explore variations on a product housing while locking the mounting points also let a robotics team generate thousands of training scenes while keeping the physical layout plausible. The integration work is there to reduce the friction, facilitate creative flow or automate synthetic data pipelines.

If your team is working with 3D assets and wondering how generative AI could fit into your existing pipeline without disrupting it, we'd love to talk. Whether you're exploring early-stage ideation tools, looking to make better use of an existing asset library, or want to automate the tedious parts of your workflow, we can help figure out where AI adds real value and build the integration to facilitate fast adoption. Reach out to start the conversation.

Why the Future of 3D Generative AI is Programmatic

Ruben Verhack — Thu, 29 Jan 2026 09:47:50 GMT

TL;DR Current 3D generative AI tools are black boxes: they produce meshes that look appealing but are static, fragile, and impractical to edit meaningfully. DD3M is an early programmatic generative AI framework integrated directly in Blender. Unlike tools that produce opaque geometry, it generates editable Blender Python construction logic.

By pairing Large Language Models (LLMs) with Vision Language Models (VLMs), DD3M creates a transparent workflow. It combines API knowledge with an iterative visual feedback loop to build assets step-by-step.

Result: Clean, programmatic geometry and materials that behave like native Blender assets: fully editable, pipeline-native, and production-compatible.

Applications: Rapid concept prototyping, 3D pipeline automation, asset generation, design automation, automated procedural scene orchestration, and scene randomization

The promise of AI-generated 3D content is undeniable, but current generative AI tools powered by diffusion models face a fundamental problem: their outputs are visually impressive but production-brittle. Because the geometry is baked into a static mesh, you cannot meaningfully edit parameters or structure. If the proportions are wrong, you are forced to regenerate from scratch.

DD3M serves as a proof-of-concept for a new paradigm in 3D generation. It is a Blender-native agentic framework, enabling a code-first workflow directly within the viewport. Instead of producing static geometry, DD3M generates and executes Blender Python construction logic. The result is an editable blueprint, an asset defined by code that artists can refine via image or natural language prompts or by adjusting the output directly using standard Blender tools. The resulting assets are fully exportable to universal formats like OpenUSD.

Creation workflow in DD3M’s Blender add-on

DD3M moves away from "one-shot" generation toward a non-destructive, iterative cycle. The system functions through three distinct stages of evolution:

Direct Generation: The system synthesizes an initial script directly from your prompt.

Automated Refinement: A built-in Vision-Language Model (VLM) "sees" the output and applies automatic code fixes to correct geometry or materials.

User-Directed Edits: The user requests specific changes (e.g., "Make the base wider"). Rather than rebuilding the mesh, DD3M updates only the relevant Python code blocks, keeping the rest of the asset intact.

DD3M is a powerful alternative to other programmatic 3D generation approaches, such as using Blender MCP with Claude Opus 4.5.

1. The Black Box Bottleneck

Recent 3D generation tools operate as black boxes. You feed them a prompt or image, and they return a finished 3D model, often via diffusion, without revealing the construction process or accessible parameters.

While visually impressive, this opacity effectively hides how the object was made. For hobbyists, this feels like magic; for professional technical artists requiring transparency and control, it represents a dead end.

1.1 Why is Static Geometry a Bottleneck for 3D production?

Current generative tools produce "frozen" assets. The resulting mesh is a snapshot: vertex positions, topology, and materials are baked into the output, leaving no accessible parameters for adjustment.

Geometry and materials of a Meshy-generated asset, shown alongside its underlying topology (source: from meshy.ai)

This brings us to the core limitations. Changing details like dimensions or materials requires a complete regeneration, a new forward pass through the model. This forces users to "roll the dice" on a fresh output rather than tweaking specific elements. For professional modeling workflows, which rely on precise, iterative refinement, this inability to edit structure without resetting the asset is often a showstopper.

1.2 The Programmatic Solution to Editable 3D Generation

Unlike black box systems, DD3M generates Blender Python scripts. This creates an editable blueprint where construction logic, parameters, and materials remain intact as code, rather than a frozen mesh.
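As a rough illustration of what "construction logic as code" can look like, here is a hand-written sketch of a parametric table. It is not actual DD3M output: `TableParams`, `part_specs`, and `build` are hypothetical names. The `bpy` import is guarded so the parameter logic also runs outside Blender.

```python
from dataclasses import dataclass

try:
    import bpy  # only available inside Blender's Python environment
except ImportError:
    bpy = None

@dataclass
class TableParams:
    """Editable parameters of the asset (values are illustrative)."""
    top_radius: float = 0.6
    top_thickness: float = 0.05
    leg_height: float = 0.7
    leg_radius: float = 0.04

def part_specs(p: TableParams) -> list:
    """Derive each part's dimensions and height from the parameters."""
    return [
        {"name": "leg", "radius": p.leg_radius, "depth": p.leg_height,
         "z": p.leg_height / 2},
        {"name": "top", "radius": p.top_radius, "depth": p.top_thickness,
         "z": p.leg_height + p.top_thickness / 2},
    ]

def build(p: TableParams) -> list:
    """Create the parts in Blender when bpy is available; always return specs."""
    specs = part_specs(p)
    if bpy is not None:
        for s in specs:
            bpy.ops.mesh.primitive_cylinder_add(
                radius=s["radius"], depth=s["depth"], location=(0, 0, s["z"]))
            bpy.context.active_object.name = s["name"]
    return specs
```

Changing `leg_height` and calling `build` again moves the table top with it, because the relationship between parts lives in the code rather than in baked vertex positions.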

This foundation enables a controlled editing workflow. Users primarily adjust assets via refinement prompts, which DD3M translates into targeted modifications of specific, semantically organized code blocks. This ensures changes are localized and stable, allowing for quick iteration without regenerating from scratch.
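A localized edit can be sketched as follows. This is a toy model under our own assumptions: the asset is a dictionary of named code blocks, the block names are made up, and `apply_edit` stands in for whatever the refinement agent produces.

```python
import hashlib

# An asset as semantically named code blocks (contents are illustrative).
asset = {
    "base": "base_radius = 0.30\nbase_height = 0.05",
    "stem": "stem_height = 0.80",
    "shade": "shade_radius = 0.25",
}

def fingerprint(blocks: dict) -> dict:
    """Hash every block so we can tell which ones an edit touched."""
    return {k: hashlib.sha256(v.encode()).hexdigest() for k, v in blocks.items()}

def apply_edit(blocks: dict, target: str, new_source: str) -> dict:
    """Return a copy of the asset with only the target block replaced."""
    if target not in blocks:
        raise KeyError(f"no block named {target!r}")
    updated = dict(blocks)
    updated[target] = new_source
    return updated

def assemble(blocks: dict) -> str:
    """Concatenate the blocks into one script, in insertion order."""
    return "\n\n".join(f"# --- {name} ---\n{src}" for name, src in blocks.items())

# "Make the base wider" touches one block and leaves the others untouched.
before = fingerprint(asset)
edited = apply_edit(asset, "base", "base_radius = 0.45\nbase_height = 0.05")
after = fingerprint(edited)
changed = [name for name in asset if before[name] != after[name]]
```

Here `changed` contains only `base`, which is the property that keeps iterations stable: everything the user did not ask about is byte-for-byte the same as before.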

Examples of DD3M outputs and their subsequent refinements

Crucially, the resulting assets are structured assemblies of distinct parts, not monolithic meshes. They remain fully compatible with Blender’s native interface, allowing artists to tweak geometry and materials using standard tools. These manual UI edits integrate smoothly with subsequent AI refinements. While direct code editing is available for deeper control, the system is designed so that the asset evolves through prompts and interaction rather than remaining a static output.

1.3 DD3M: A Modular System That Scales

DD3M utilizes a modular agentic architecture combining LLMs, VLMs, and retrieval, rather than a single monolithic model. Specialized agents handle distinct tasks: planning, coding, critique, and refinement, allowing the system to scale naturally. As foundation models improve, DD3M’s coding reasoning and visual analysis capabilities upgrade automatically without architectural changes.

A retrieval backbone, containing Blender API documentation and verified prompt-script pairs, anchors this workflow. By mapping user intent to verified code patterns, this layer ensures stability and robustness even as Blender’s API evolves.
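The idea of mapping intent to verified patterns can be shown with a deliberately simple retriever. This sketch uses token overlap (Jaccard similarity) as a stand-in for whatever retrieval method DD3M actually uses, which the post does not specify. The example pairs are hypothetical and the scripts are abbreviated to a single call.

```python
import re

# Verified prompt-script pairs (hypothetical entries, scripts abbreviated).
EXAMPLES = [
    ("a round wooden table",
     "bpy.ops.mesh.primitive_cylinder_add(radius=0.6, depth=0.05)"),
    ("a metal sphere",
     "bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5)"),
    ("a stone cube pedestal",
     "bpy.ops.mesh.primitive_cube_add(size=1.0)"),
]

def tokens(text: str) -> set:
    """Lowercase word tokens of a prompt."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(prompt: str, k: int = 1) -> list:
    """Return the k stored pairs whose prompt best overlaps the query."""
    query = tokens(prompt)
    ranked = sorted(EXAMPLES,
                    key=lambda pair: jaccard(query, tokens(pair[0])),
                    reverse=True)
    return ranked[:k]
```

A query such as "a small round table" retrieves the cylinder-based table example, which the coding agent can then adapt instead of writing API calls from memory.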

This design is highly extensible. New tools, custom addons, and libraries can be integrated via function calls or Model Context Protocol (MCP) endpoints. Consequently, DD3M acts less like a static product and more like an evolving technical artist, adapting to new AI models and production requirements without locking users into a fixed stack.

2. Where Current Programmatic Approaches Fail

Generating Blender code from text seems straightforward, but the gap between valid syntax and usable 3D content is significant. Simpler baselines consistently fail to bridge this gap, highlighting the necessity of DD3M’s architecture.

While multi-agent coordination is complex, DD3M demonstrates that a well-designed system effectively overcomes these limitations, achieving reliable programmatic generation where naive approaches fail.

2.1 Why Single-Model Generation Fails

The "naive" approach, giving a single LLM the Blender API documentation, fails due to contextual blindness. While modern LLMs are strong Python coders, they live entirely in the text domain. They blindly generate code without seeing the result, unable to detect if a mesh is misshapen, materials are missing, or objects are floating incorrectly.

Blender’s specialized API amplifies this issue; even small inaccuracies lead to unpredictable failures. Without a visual feedback loop, a single-model system cannot recover from errors or iterate on the design; users are forced to restart and hope for a better result.

Results for the prompt “a frog with a crown” generated by Gemini 3 Pro (left) and DD3M (right)

Even when syntactically correct, single-model outputs tend to be simplistic. As shown above, lacking the ability to evaluate and refine the render prevents the model from achieving high quality. This proves that visual feedback and iterative correction are not luxuries, but essential requirements for closing the loop between code and 3D creation.

2.2 LL3M: The Multi-Agent Pioneer

LL3M pioneered the multi-agent approach, proving that coordinating specialized agents, for planning, coding, and critique, grounded in RAG could effectively solve the "contextual blindness" of single-model systems.

However, as an academic prototype, it prioritized feasibility over production performance. Its limitations render it impractical for professional use:

Latency: Generation speeds are too slow for interactive, high-velocity creative workflows.

Reactive vs. Proactive: LL3M writes code "blind", guessing at visuals. DD3M solves this by generating a Visual Blueprint before coding begins.

Geometric Fidelity: Outputs often resemble collections of basic primitives rather than cohesive, organic assets.

Results for the prompt “a frog with a crown” generated by LL3M (left) and DD3M (right)

The comparison above illustrates the gap: LL3M produces a rudimentary result, while DD3M generates a more complex, stylistically distinct asset. While LL3M proved the architecture works, DD3M engineered it for industry by optimizing agent coordination, error loops, and visual planning.

2.3 The Tool-Calling Alternative: Blender MCP

The landscape of programmatic 3D generation also includes tool-calling frameworks like Blender MCP. Unlike DD3M’s approach of writing ground-up logic, Blender MCP operates by giving an LLM access to a set of predefined, validated "tools" or Python snippets. For example, instead of the AI struggling to write a complex shader from scratch, it can simply trigger a tool call for "search_polyhaven_assets". This high-level abstraction provides guardrails, ensuring that the AI works with sensible operations. The documentation for each tool call gives a rich semantic context to the model allowing it to reason on a higher level.

This stability comes with a trade-off in creative range and manual overhead, because Blender MCP is strictly limited to the tool calls that currently exist in its library. While Blender’s Python API is fully exposed and offers a wide range of options, expanding the MCP's capabilities requires significant human effort to create and validate every potential action.

Importantly, DD3M offers a robust feedback loop powered by a Vision-Language Model (VLM) that continuously critiques the work in progress, allowing it to adapt and refine assets autonomously. This capability is under-used in many LLM-based applications; Blender MCP only uses it at the end of the generation. Ultimately, the two systems can work together for the right use case. A combination of Blender MCP’s reliable tool-based operations and DD3M’s flexible, vision-corrected generation would create a powerful, complementary workflow for modern 3D pipelines.

Results for the prompt “a frog with a crown” generated by Blender MCP Claude Opus 4.5 (left) and DD3M (right)

3. DD3M in Practice

DD3M bridges natural language and downstream-ready assets by consistently producing clean, editable Python scripts, regardless of whether the input is text or an image. This reliability is built on a three-phase workflow:

Initial Creation: The system generates a structured plan and retrieves API docs and examples. It then writes a foundational script capturing the object's core geometry, layout, and materials.

Auto-Refinement: A self-correction loop executes the script and renders the output. A VLM critiques these renders against the prompt, triggering code updates until the asset meets fidelity standards.

User Refinement: Users can request adjustments (proportions, materials, style) via simple prompts. These trigger the same targeted correction loop, modifying specific code blocks without regenerating the asset from scratch.
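The control flow of the refinement phases can be sketched as a small loop. The components are passed in as plain functions, and the stubs below are placeholders for Blender execution, the VLM critic, and the coding agent. None of these names come from DD3M itself.

```python
from typing import Callable, List, Tuple

def refine(script: str,
           render: Callable[[str], str],
           critique: Callable[[str], List[str]],
           fix: Callable[[str, List[str]], str],
           max_rounds: int = 5) -> Tuple[str, int]:
    """Render, critique, patch; stop once the critic reports no issues."""
    for round_idx in range(max_rounds):
        issues = critique(render(script))
        if not issues:
            return script, round_idx
        script = fix(script, issues)
    return script, max_rounds

# Stub components (hypothetical stand-ins for the real agents).
def fake_render(script: str) -> str:
    return script  # here the "image" is just the script text

def fake_critique(image: str) -> List[str]:
    return [] if "crown" in image else ["crown missing"]

def fake_fix(script: str, issues: List[str]) -> str:
    return script + "\nadd_crown()"

final_script, rounds = refine("add_frog()", fake_render, fake_critique, fake_fix)
```

User refinement fits the same loop: swap `critique` for a function that returns the user's requested changes, and the rest of the pipeline stays as it is.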

3.1 Prompt-Based Generation

In text-only workflows, DD3M synthesizes an internal Visual Blueprint, a generated reference image capturing intended proportions, silhouette, and style. This acts as a stabilizing guide for subsequent geometric and material decisions.

Prompt-based generation using DD3M, showing the three different phases the framework goes through

This approach makes generation surprisingly stable. Rather than guessing or drifting with revisions, DD3M acts like a technical artist: sketching a concept first, then building a clean, programmatic asset that evolves iteratively without resetting.

3.2 Image-Based Generation

Users can provide direct visual references, photos, concept art, or sketches, to guide generation. In this workflow, the user-uploaded image replaces the synthesized blueprint, serving a dual role: it informs what to build and acts as the evaluation standard for the VLM critique phase.

Image-based generation using DD3M, showcasing the resemblance between the 3D model and the reference picture

This is critical for production pipelines. Artists can input actual concept art to ensure consistency, receiving a programmatic implementation that closely matches the source. The result remains fully editable, allowing for further textual refinements to meet exact requirements.

3.3 Iterative Refinement

User refinement operates through the same mechanism as the automated VLM critique, simply swapping the feedback source. DD3M interprets user guidance to locate relevant construction steps, adjusting only those specific code blocks before re-executing and verifying the output.

Because this reuses the targeted-update pipeline, iterations remain stable and predictable, avoiding full regeneration. Crucially, manual adjustments made in Blender or the script are respected; DD3M treats them as the current state and refines around them, ensuring a continuous evolutionary loop without overwriting user work.

3.4 Limitations

While DD3M offers significant advantages in editability, the programmatic approach is not a universal solution for every 3D task. Recognizing where code excels, and where it struggles, is key to integrating it effectively.

Structured vs. Chaotic Form: The system thrives on objects with clear logical structures, from architecture and machinery to stylized characters. However, it is less effective for highly irregular or chaotic forms, like entangled plants or undefined soft shapes. Describing these arbitrary, flowing curves via Python is often less efficient than traditional sculpting.

Inference Latency: Quality and stability come at the cost of speed. Due to the multi-agent feedback loop, generation is not real-time. While significantly faster than manual modeling, the process is slower than one-shot diffusion inference, prioritizing topological validity and editability over raw speed.

4. The Architecture Behind DD3M

5. Expanding upon DD3M

While DD3M excels at generating individual procedural assets, its architecture is designed for broader workflows. By treating the LLM as a logic engine rather than just a mesh generator, capabilities can expand from modeling single objects to orchestrating entire scenes and creating interactive tools. Think of DD3M as a higher-level reasoning engine that has access to a toolbox of other narrow AI or standard Blender tools.

5.1 Hybrid Workflows utilizing Existing or Diffusion-Based Assets

A purely programmatic approach is powerful, but production often relies on vast asset libraries. Future DD3M iterations could act as intelligent orchestrators, determining when to construct new geometry and when to retrieve, import, and place existing models, effectively serving as a semantic layout artist.

Example showcasing how various asset libraries can be used to construct scenes (source: SceneWeaver)

This expansion enables hybrid generation pipelines. Beyond static libraries, DD3M could delegate organic or sculpted assets to diffusion-based black box models. In this role, DD3M becomes the "glue" of the 3D pipeline, generating the logic that assembles, scales, and unifies diverse assets into a cohesive, editable scene.

5.2 Automated Tool Creation

Currently, editing relies on prompts or direct code interaction. The next step is automating User Interface creation and integration with chatbots just like Blender MCP. Future coding modules could identify critical variables, such as dimensions, density, or material attributes, and automatically expose them as native UI elements.

Instead of a one-off script, the system would expose a fully parameterized tool complete with custom panels and sliders. This transforms the AI into a developer of interactive tools, allowing non-technical users to manipulate the model in real-time and decoupling the asset from the generation process.
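One possible first step toward this, sketched under our own assumptions, is to scan a generated script for top-level numeric assignments and treat them as slider candidates. The helper name and the example script are hypothetical; a real module would also need value ranges and Blender panel code.

```python
import ast

def find_tunable_parameters(source: str) -> dict:
    """Collect top-level `name = <number>` assignments as slider candidates."""
    params = {}
    for node in ast.parse(source).body:
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, (int, float))
                and not isinstance(node.value.value, bool)):
            params[node.targets[0].id] = node.value.value
    return params

# Illustrative generated script: two literals, one string, one derived value.
script = """
width = 1.2
leg_count = 4
material = "oak"
height = width * 0.6
"""
sliders = find_tunable_parameters(script)
```

Only `width` and `leg_count` are picked up. The string and the derived `height` are skipped, so a slider never fights with a value the script computes itself.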

6. Closing Thoughts

Current generative 3D tools prioritize visual appeal over utility, often at the cost of production readiness. DD3M fundamentally shifts this approach, treating generation not as a one-shot inference but as an iterative, code-based process. By leveraging multi-agent architectures to write and refine scripts, it solves the critical consistency and editability issues plaguing existing black box solutions.

Building these advanced systems requires a deep integration of foundation models with complex technical pipelines. At Datameister, we specialize in bridging this gap. We do not just build models that dream; we build architectures that work. Book a technical discovery call with our engineers to see how we can modernize and automate your 3D stack, and stay ahead of the curve.

Three challenges in finetuning Trellis

Liam Wezenbeek — Fri, 09 Jan 2026 16:37:55 GMT

TL;DR
Finetuning Trellis can dramatically improve image-conditioned 3D mesh generation, but it is not exactly plug-and-play. Out-of-the-box settings quickly run into bottlenecks around preprocessing time, GPU memory and overfitting.

Drawing from experience gained across Datameister projects, we outline where finetuning efforts tend to succeed or fail in practice. We show how data coherence, preprocessing choices, memory-aware training, and careful regularization shape the outcome far more than aggressive hyperparameter tuning.

Result: With a focused and pragmatic approach, Trellis can be adapted into a robust, domain-specific 3D generator.

If you’ve tried finetuning a large 3D generative model like Trellis, you’ve likely discovered that it can be deceptively hard. At Datameister, we regularly tackle this challenge when adapting generative models to domain-specific use cases. If you’re new to Trellis or want a quick refresher on its pipeline, this blog by Jarne is a good starting point.

The out-of-the-box performance of Trellis is impressive. The base model was trained on a general dataset of 500,000 objects and generates reasonable meshes for a wide range of object categories. In some use cases, however, “reasonable” is not good enough and details matter. In such scenarios, you could try to finetune the Trellis model on a custom dataset built from object meshes of interest.

The Trellis authors provide both training code and a toolkit for data preprocessing to get you started. In practice, however, finetuning is less plug-and-play than it might appear and introduces several challenges. In this post, we share insights gathered across multiple Datameister projects, using the mechanical components benchmark (MCB) as a guiding dataset where we will focus on image-conditioned mesh generation. We will cover:

Which Trellis components are worth finetuning?

The three main bottlenecks: data, memory and overfitting

What changes with Trellis 2?

1. Which component of Trellis should you finetune?

The Trellis pipeline consists of eight models that operate in sequence. The first question to consider then is which one of these is worth finetuning.

There are 2 major categories to choose from:

VAEs
Used to encode a mesh, input image or feature map into latent space or decode them the other way around.

Diffusion flow models
Responsible for generating object structure, geometric detail and texturing in latent space.

Overview of the Trellis pipeline. Source: arXiv:2412.01506

In reality, we found that the VAEs already work really well for most use cases. Well-known models that started from the Trellis model, such as Hi3DGen, actually do not touch these and use them as-is.

The biggest gains are found in finetuning the diffusion flow models that generate object components. Trellis has two such flow models to be used in subsequent stages.

Sparse structure flow: Generates a binary voxel grid from noise based on the input image conditioning.

Structured latent flow: Generates a feature map for all voxels that contains geometric detail and texture information.

Which of these two you should finetune depends on your needs. Let us look at the case of the “screws and bolts” subset of the MCB dataset. Every so often, we would generate objects that had either:

nonsensical global geometry
incomplete generations at tricky camera angles
strange appendages

We found that finetuning the structured latent flow really focused on the details of the geometry and less on these general problems. Finetuning the sparse structure flow dramatically increased robustness and consistency, even at tricky angles.

Input images used as condition for TRELLIS mesh generation.

Generated meshes with the base TRELLIS model.

Generated meshes with a finetuned sparse structure flow model.

2. Challenges of finetuning Trellis’ sparse structure flow

In our experience, finetuning the sparse structure flow consistently ran into three bottlenecks:

Data availability and consistency

GPU memory constraints

Rapid overfitting

In the next three sections we will go over each of these, starting with the input data.

2.1 Data: quantity, quality and preprocessing cost

As with most finetuning workflows, training data is critical to a successful finetuning run. There are a few aspects to keep in mind here:

How much data is enough?

Part of the answer to that question is how varied your input meshes are. For example:

A subset of ~2000 motors and rotors in MCB proved difficult to finetune due to high diversity in general shape.

A subset of ~1000 screws with similar global geometry converged much more reliably.

As a rule of thumb:

High variation across geometric shapes hurts convergence more than limited data volume.

A few thousand relatively homogeneous objects are often sufficient.

For niche applications, smaller datasets can still work if the geometry is consistent.

Data quality matters

Trellis training was performed on data that passes a filter on an “aesthetic score”, meaning the object is clean and well-modeled. The cleaner your input data, the easier a time you will have. Avoid open meshes and noisy appendages.

Processing: the hidden cost

One of the largest hurdles if you have a big, high-quality dataset comes from the preprocessing steps. Trellis provides a preprocessing toolkit that:

downloads the input meshes

renders ~150 views per object to create feature maps

voxelizes and normalizes the meshes

encodes meshes to the relevant latent spaces

creates renders for image conditioning

bundles all information into a metadata file used by the training dataloaders

This all works smoothly out-of-the-box, but depending on the resources you have available, it can take a long time.

For a set of about 1000 objects, the whole pipeline took a little over 24 hours to run on a single RTX4090. The largest bottleneck here was the rendering needed to create the feature maps and conditioning, taking up 80% of that time. In the traditional toolkit, this rendering step is also responsible for normalizing meshes used as ground truth during the training process, making it a required part of the pipeline.

Storage is a second constraint here. This same subset of 1000 objects resulted in ~60GB of intermediate files. Scaling this up to larger datasets of say 16000 objects would require close to 1TB of disk space. About 85% of these files are feature renders and feature maps.
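For planning purposes, the numbers above can be extrapolated with a few lines of Python. The sketch assumes cost scales linearly with object count and uses 24 hours as a lower bound for the measured run. Both are simplifications, and the function name is ours.

```python
def extrapolate(n_objects, ref_objects=1000, ref_hours=24.0, ref_gb=60.0,
                render_time_share=0.80, render_storage_share=0.85):
    """Linearly scale the measured cost of the 1000-object run."""
    scale = n_objects / ref_objects
    hours = ref_hours * scale
    gigabytes = ref_gb * scale
    return {
        "hours_total": hours,
        "hours_rendering": hours * render_time_share,
        "gb_total": gigabytes,
        "gb_render_files": gigabytes * render_storage_share,
    }

estimate = extrapolate(16_000)
```

For 16000 objects this gives roughly 960 GB of intermediate files, about 816 GB of which are render-related, and 384 hours on a single GPU of the same class, about 307 of them spent rendering.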

If your focus is on sparse structure for geometry, you can bypass these issues with a few minor changes to the toolkit so there is no need to run the initial rendering. This means you can get your dataset ready in a couple of hours and need much less disk space.

With the data pipeline reviewed, the next bottleneck becomes GPU memory.

2.2 Memory

A second challenge that immediately becomes apparent when launching a finetuning job is GPU memory. For a batch size of 1, GPU memory usage immediately surged to ~20GB with the default settings. This makes running a larger batch size for more stable training unfeasible.

Several practical adjustments can be made to overcome this bottleneck:

Mixed precision training:
Switching from full to mixed precision freed up substantial memory, allowing batch sizes of up to 8.

Gradient accumulation:
Trellis effectively supports gradient accumulation through its batch splitting settings. During training, batches are split into smaller micro-batches that are processed sequentially, with gradients aggregated before a single optimizer step. This allows for larger effective batch sizes without loading all samples into GPU memory at once.
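Why splitting a batch does not change the update can be checked on a toy problem. The sketch below is framework-free and unrelated to the Trellis code: it fits a one-parameter linear model with a mean-squared-error loss and compares the full-batch gradient with the one aggregated over micro-batches. The data values are made up.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Process micro-batches one at a time and aggregate their gradients."""
    total, n = 0.0, len(xs)
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
full = grad_mse(0.3, xs, ys)
accum = accumulated_grad(0.3, xs, ys, micro_batch_size=2)
```

`full` and `accum` agree up to floating-point error, while only two samples are processed at a time. In a real training run the same reasoning applies per parameter, and peak memory follows the micro-batch size rather than the effective batch size.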

Once memory constraints are addressed, overfitting becomes the dominant failure mode.

2.3 The overfitting problem

Early experiments consistently showed steadily decreasing training loss, while validation loss began increasing after roughly 3,000 iterations. Visual inspection of generated samples revealed that while the model had learned the general structure of the input meshes, it emphasized a small set of features from the most dominant substructure in the data and applied them across all test samples.

A key lesson was that decreasing training loss alone is a poor indicator of generalization quality. The default Trellis training setup lacks validation loops, and once these were added, the divergence between training and validation behavior became obvious. Although small training sets played a role, further adjustments were needed to achieve stable and robust finetuning.
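A validation loop pays off once something acts on it. Below is a minimal patience-based check for picking the checkpoint to keep. The loss values are illustrative only, not our measurements, and the evaluation interval of 500 iterations is an assumption.

```python
def best_checkpoint(val_losses, patience=3):
    """Return (best_index, stop_index) using patience-based early stopping."""
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            # No improvement for `patience` evaluations: stop here.
            return best_idx, i
    return best_idx, len(val_losses) - 1

# Illustrative numbers: one validation pass every 500 iterations.
EVAL_EVERY = 500
val = [0.90, 0.71, 0.60, 0.55, 0.52, 0.50, 0.51, 0.53, 0.56, 0.60]
best_idx, stop_idx = best_checkpoint(val, patience=3)
best_iteration = (best_idx + 1) * EVAL_EVERY
```

With these numbers the best checkpoint is the one at iteration 3000, and training stops three evaluations later, even though the training loss would still be going down.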

Freezing layers

The image-conditioned Trellis model is quite large with its 550 million trainable parameters. It would be fair to assume that overfitting could be mitigated by only finetuning a fraction of those parameters. You could for example freeze all blocks of the flow model and only leave the final block and the output layer unfrozen. This leaves you with only 8% of the original parameters.

In practice, this approach did not work and the model did not learn anything. Both training and validation loss remained stagnant over the course of a few hundred thousand iterations. It seems that the last few layers cannot adapt to new patterns on their own.

A more flexible alternative is to use low-rank adaptation (LoRA), which enables parameter-efficient finetuning across the full model. While LoRA is well established in language and diffusion models, its use in 3D generative models remains limited and is not supported out-of-the-box in Trellis. As strong results were already obtained through data curation and hyperparameter tuning, we did not explore LoRA here, but we are excited to give it a try.
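To give a feel for the savings, recall that LoRA keeps a weight matrix W frozen and learns a low-rank update BA on top of it. The snippet below counts trainable parameters for a single layer. The layer dimensions and rank are hypothetical and do not correspond to the actual Trellis architecture.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for a full weight update vs. a rank-r LoRA update."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora, lora / full

# Hypothetical layer size, chosen only to make the ratio easy to read.
full, lora, ratio = lora_param_counts(d_in=1024, d_out=1024, rank=8)
```

For this example layer, LoRA trains 16384 parameters instead of 1048576, about 1.6% of the original. Unlike freezing all but the last block, the adaptation is spread across every layer it is attached to.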

3. How about Trellis 2?

Microsoft recently released the second version of their Trellis model, appropriately named Trellis 2. There are a few big changes, which we discussed in a previous blog post, but what does it mean for finetuning?

At first sight, it does not seem to impact many of the aspects we have discussed above. The first part of the pipeline - where sparse structure is created - seems largely unaffected. However, if you are more interested in finetuning the detailed geometry and texture generation models, you will find some updates that will affect your finetuning experience.

The first big change will be in the data preprocessing. As Trellis 2 uses a native representation of the geometry detail and appearance features, there is no more need for the bulk of the rendering. In its place, you will have to preprocess the o-voxel translation, which, according to the paper, takes only a few seconds on CPU. This is a big speedup compared to the rendering technique.

Secondly, it should be able to deal with open and non-manifold meshes much better. This means that non-standard meshes in your dataset should have a less negative impact on your finetuning process.

Finally, Trellis 2 introduces three flow models instead of two, splitting the second stage into separate geometric detail and texture flows, which could enable more targeted and parameter-efficient finetuning.

As the full training code and dataset toolkit have not been released at the time of writing, we have not had a chance to play around with finetuning Trellis 2. However, we are excited to do so as soon as possible!

Closing Remarks

Finetuning Trellis is less about finding the right configuration and more about understanding where the model is brittle. Meaningful improvements came from choosing the right component to finetune, curating consistent training data, and closely monitoring generalization rather than relying on training loss alone.

At Datameister, we continue to explore and refine finetuning strategies for Trellis and other 3D generative models, with a focus on robustness, consistency, and real-world deployment. If you have an application that requires accurate and dependable 3D generation, do not hesitate to reach out!

Trellis 2: Scaling 3D Generation with Improved Efficiency and Control

Larsen D'hiet — Mon, 22 Dec 2025 21:55:52 GMT

In a year marked by rapid advances in 3D generative modeling, Trellis 2 stands out as one of the most exciting architectural updates. It introduces Omni-Voxels, a native 3D representation that encodes geometry and PBR materials directly in aligned 3D space. Combined with the new Sparse Compression VAE, this enables more efficient compression of very high-resolution assets at improved inference speeds.

Because the sparse, surface-based structure is preserved, Trellis 2 maintains strong support for masked generation and localized edits, now operating on a more compact and better-aligned latent representation. Geometry and materials are handled separately, making edits more predictable and ensuring material consistency under topology changes.

Result: a scalable 3D generation pipeline with higher fidelity, improved computational efficiency, and significantly better editability and control for real-world design workflows.

2025 was the year where 3D generation finally took off and matured from research demos to early adoption in several industries. 3D generative AI enables fast ideation in design workflows, allowing teams to explore bold ideas faster and explicitly in 3D. Datameister has created precise control and editing capabilities such as masked generation for 3D generative models. These capabilities already serve several design studios in the automotive, fashion and medical industries.

One of the biggest open-source releases in this field was definitely Trellis by Microsoft, released in December 2024. It was one of the first models that was worth the name “foundation model” for 3D asset. Trellis incorporates knowledge on 3D geometry and texture directly into its architecture. For a full introduction on Trellis and 3D generative modeling in general, you can read this blog post by our colleague Jarne.

Trellis 2 sets the new standard for 3D generative modeling

Introducing Trellis 2

Last week, the Trellis authors released follow-up work with the unsurprising name Trellis 2. The release further scales 3D generation and improves the underlying latent representations. It is a 4B parameter model, doubling the size of the previous release, representing a significant step forward for 3D generative modeling for the following reasons:

Direct integration of PBR material representations in the pipeline

Flexible trade-off between the granularity of the 3D voxel grid and the generation speed

Ability to upsample geometries and their corresponding materials

Improved overall quality in both reconstruction and visual fidelity

Trellis 2 is able to handle structures of up to 1536 x 1536 x 1536 voxels with generation within ~60 seconds on a single H100 GPU. The biggest architectural changes leading to these results can be found in the way that 3D assets are encoded in Structured Latents (SLat’s):

a novel sparse 3D voxel structure called Omni-Voxel representation (O-Voxels) encodes precise geometry and complex textures simultaneously

a new compression architecture, the Sparse Compression VAE, replaces the original flow transformer VAE. It converts the O-Voxels into SLat’s with impressive downsampling efficiency

The most important innovations in the Trellis 2 encoder (source: from the Trellis 2 project page)

The sparse voxel structure of Trellis is retained, ensuring editing and masked generation are still possible. This is one of the most important reasons we started using Trellis in the first place. As we will point out below, O-Voxels actually make Trellis 2 even better suited to these types of tasks. Let’s go through the two main innovations in detail, and see how they affect the use of Trellis 2 in practice!

Omni-voxels: the building blocks of Trellis 2

In the original Trellis architecture, 3D feature maps of the asset were obtained by taking renders of the original mesh. From these renders, DinoV2 features were extracted and projected onto a voxelized structure. This came with several disadvantages: details and sharp edges were easily lost, open parts of the mesh were hard to handle, and lighting effects were baked into the textures. Moreover, feature extraction from DinoV2 slowed down the pipeline.

O-Voxels replace the voxelized structure obtained through projection of 2D DinoV2 features in the original Trellis architecture. This allows for near-instant conversion between 3D assets and voxelized structure.

The Trellis 2 authors introduce the Omni-Voxel representation instead, a native type of mapping that can be instantly derived from the original asset. An O-Voxel is essentially a collection of parameter tuples (fshape(i), fmat(i), p):

fshape(i) contains geometric parameters for creating a Flexible Dual Grid, a representation that takes into account edge intersection information of the mesh with the voxel grid

fmat(i) holds material properties, including the base color, metallic ratio, roughness, and opacity following standard physically-based rendering (PBR) conventions. These properties are pooled per voxel in the dual grid

p simply denotes the coordinate of the i-th voxel

As in the original Trellis architecture, features only exist on the surface of the mesh, creating a sparse voxel structure. Geometry and appearance now live in the same latent space and will be spatially aligned during generation.
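To make the tuple structure concrete, here is a minimal Python sketch of a sparse O-Voxel container. The class and field names (`OVoxel`, `SparseOVoxelGrid`, `f_shape`, `f_mat`) and the number of shape parameters are illustrative assumptions, not the Trellis 2 API. Only the split into shape features, PBR material features, and a coordinate is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


@dataclass(frozen=True)
class Material:
    """PBR material features pooled per voxel (hypothetical layout)."""
    base_color: Tuple[float, float, float]
    metallic: float
    roughness: float
    opacity: float


@dataclass(frozen=True)
class OVoxel:
    """One surface voxel: shape features, material features, coordinate."""
    f_shape: Tuple[float, ...]  # dual-grid geometry parameters (placeholder values)
    f_mat: Material
    p: Coord


@dataclass
class SparseOVoxelGrid:
    """Stores only the voxels that lie on the surface, keyed by coordinate."""
    resolution: int
    voxels: Dict[Coord, OVoxel] = field(default_factory=dict)

    def add(self, voxel: OVoxel) -> None:
        if not all(0 <= c < self.resolution for c in voxel.p):
            raise ValueError(f"coordinate {voxel.p} is outside the grid")
        self.voxels[voxel.p] = voxel

    def occupancy(self) -> float:
        """Fraction of the dense grid that is actually stored."""
        return len(self.voxels) / self.resolution ** 3


grid = SparseOVoxelGrid(resolution=4)
red = Material(base_color=(1.0, 0.0, 0.0), metallic=0.0, roughness=0.5, opacity=1.0)
grid.add(OVoxel(f_shape=(0.5, 0.5, 0.5), f_mat=red, p=(0, 1, 2)))
grid.add(OVoxel(f_shape=(0.2, 0.8, 0.5), f_mat=red, p=(3, 3, 3)))
print(f"{len(grid.voxels)} surface voxels, occupancy {grid.occupancy():.3f}")
```

Because shape and material features share the same coordinate key, they stay spatially aligned by construction, which is the property the paragraph above describes.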

Bi-directional translation

O-Voxels are essentially voxelized information about the mesh grounded in physical reality. This makes the process of O-Voxelization optimization- and rendering-free. Creating a voxelized structure from the mesh of a 3D asset therefore only takes a few seconds on CPU. Moreover, because of their algorithmic nature, O-Voxels can be easily converted back to a mesh.

The bidirectional conversion between O-Voxels and meshes implies that the output of Trellis 2 can only be a mesh. The original Trellis architecture also had output decoders for Gaussian Splats and Neural Radiance Fields (NeRFs). Users of Trellis 2 interested in these 3D representations will have to rely on other methods to convert the resulting meshes into Gaussian Splats or NeRFs.

Sparse Compression VAE

The second important change in Trellis 2 comprises the Sparse Compression VAE (SC-VAE). This VAE is no longer flow-transformer based, but is a fully convolutional network. It is specifically designed to achieve high-ratio voxel size downsampling. This results in a compact latent space, even for high-resolution voxel structures (up to 1536 x 1536 x 1536), while remaining computationally efficient.

The dedicated design of SC-VAE focuses on efficiency for high-resolution feature maps

The dedicated design of the SC-VAE allows for an impressive 16x downsampling ratio. A fully textured 1024 x 1024 x 1024 asset can then be encoded in only around 9.6k latent sparse surface voxels on average. The authors furthermore wrote Triton kernels to create a custom high-performance backend for the SC-VAE called FlexGEMM. This backend is compatible with both NVIDIA and (in theory) AMD GPUs, which further speeds up inference and training.
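A quick back-of-the-envelope calculation puts these numbers in perspective. The sketch below assumes that the 16x ratio applies per spatial axis; the variable names are our own.

```python
resolution = 1024               # voxels per axis of the input O-Voxel grid
downsample = 16                 # assumed to apply per spatial axis
latent_surface_voxels = 9_600   # average quoted in the text above

latent_resolution = resolution // downsample     # 64
dense_latent_cells = latent_resolution ** 3      # 262144
occupancy = latent_surface_voxels / dense_latent_cells

print(f"latent grid: {latent_resolution}^3 = {dense_latent_cells} cells")
print(f"occupied by surface latents: {occupancy:.1%}")
```

Under that assumption, fewer than 4% of the cells in the latent grid carry a latent, which is why the sparse, surface-only structure matters so much at high resolution.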

Generative modeling

O-Voxelization and the SC-VAE are essential in the first training stage of Trellis 2. In this stage, the SC-VAE learns effective Structured Latent representations (SLat’s) from the training data. Generative modeling is the second step, turning a text or image prompt into SLat’s and aligning them with the previously learned representations. They are subsequently decoded by the SC-VAE to an O-Voxel grid, which can then be converted back to the final output mesh.

The generation process in Trellis 2 consists of three steps instead of two in the original Trellis architecture:

Sparse structure generation: predicting the sparse voxel grid i.e., which voxels are active. This first step has remained identical.

Geometry generation: geometry latents are predicted independently from material latents in this second step. For this, a first SC-VAE is trained to only model shape latents. The decoded result of this is the fshape parameters of the final O-Voxels.

Material generation: a novel material generation stage models PBR materials directly in the native 3D space, jointly conditioned on the input image and predicted geometry latents. A second SC-VAE is trained for this, similarly conditioned on the first SC-VAE shape latents. The decoded result of this is the fmat parameters of the final O-Voxels.

End-to-end generation pipeline of Trellis 2

Trellis 2 thus splits the SLat generation step in two parts, making use of the fact that an O-Voxel is described by a geometrical feature and a material feature independently from one another. This ensures materials remain consistent under arbitrary topology.

Conclusion

Trellis 2 sets a new benchmark for foundational 3D models by combining native 3D representations with efficient compression and more robust generative stages. The shift to Omni-Voxels and geometry–material decoupling results in higher fidelity assets, faster processing, and better alignment with real-world 3D workflows, especially where control and physical correctness matter most.

We have already started integrating Trellis 2 in our pipelines, and the results look promising. If you are looking to bring state-of-the-art 3D generative models into reliable, production-grade workflows, Datameister is ready to help you turn cutting-edge research into real-world impact.

Why DETRs are replacing YOLOs for real-time object detection

Larsen D'hiet — Fri, 21 Nov 2025 10:26:11 GMT

TL;DR Detection Transformers (DETRs) have matured into real-time–capable detectors that now outperform YOLO models in accuracy at comparable inference speeds. Key innovations like deformable attention, denoising training, and top-k query selection paved the way for real-time DETRs, with D-Fine as the most notable recent advancement.

All DETR architectures are released under the permissive Apache 2.0 License, making them easy to adopt and modify in commercial settings. This removes the licensing hurdles common in other detector families and enables fully proprietary deployments without friction.

Result: At Datameister, we’ve integrated these state-of-the-art DETR models into our vision library to deliver accurate, efficient, production-ready detection systems for specialized real-world use cases.

Real-time object detection lies at the heart of any system that must interpret visual data efficiently, from video analytics pipelines to autonomous robotics. Detector architectures for such tasks need to deliver both high throughput and accuracy in order to excel.

In our own pipelines, we phased out older CNN-based detectors in favor of D-Fine, a more recent model that is part of the DEtection Transformer (DETR) family. Transformer-based detectors have matured quickly, and D-Fine in particular provides stronger accuracy while maintaining competitive inference speed.

Our office dog Nala sitting on a chair, as detected by our own D-Fine model in the DM vision library.

YOLO has long been the leading standard for real-time detection, but the latest DETR variants are now consistently proving to be the better alternative. Beyond the accuracy gains, an equally important advantage is the far more permissive license that comes with them.

YOLO’s licensing issue

Recent YOLO releases, including YOLO11, are developed and maintained by Ultralytics, which releases its code and weights under the AGPL-3.0 license. Long story short, this license only allows commercial usage under the strict condition that any code modifications or weights are made publicly available. In contrast, all DETR models to date were released under the Apache 2.0 License, allowing free use and modification for commercial and proprietary purposes.

Next to licensing, there are other reasons why we like working with DETRs:

DETRs treat object detection as a direct set-prediction problem. This eliminates hand-crafted components such as non-maximum suppression that introduce additional hyperparameters and slow down the detection pipeline.

Modern GPU architectures are heavily optimized for efficient attention operations such as flash attention, making transformers increasingly more suitable for real-time applications.

Transfer learning from vision foundation models such as the recent DINOv3 fundamentally augments the capabilities of DETRs.
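For reference, this is roughly what the non-maximum suppression step mentioned in the first point looks like in a classic detector. It is a generic greedy NMS sketch in plain Python, not code from any YOLO release. The `iou_threshold` argument is exactly the kind of hand-tuned hyperparameter that set prediction makes unnecessary.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

The first two boxes overlap with an IoU of about 0.68, so whether the second one survives depends entirely on the chosen threshold.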

We have had nothing but great experiences with DETRs so far. They adapt remarkably well to new datasets, even when trained from scratch. For the right use cases, pre-training the models on datasets such as COCO and Objects365 further boosts performance. About time for a post on this exciting topic!

About the author

I am Larsen, AI Engineer at Datameister and part of the journey since the early days of the company. I like finding the intersection between state-of-the-art research and practical system design, where real-life systems can be powered by frontier AI algorithms. For this blogpost, I dove deep into the details of DETR and its successors and distilled the most important insights in a short read. Go grab a coffee and enjoy this one!

A short overview of what you can expect of the remainder of this blogpost. We will:

dive in detail into the original DETR paper to understand its core concepts;
discuss the most important advancements leading to the real-time adoption of DETRs;
compare two leading DETR models to the latest YOLO 11 model to draw some important conclusions.
Let’s go!

DETR: transformer for NMS-free object detection

All Detection Transformer architectures have the same underlying structure. A (CNN) backbone is used to extract image features. These features are fed to a transformer encoder-decoder structure that is able to predict accurate bounding boxes for objects in the image. The resulting N decoder output embeddings are independently projected to bounding box coordinates and class labels.

High-level overview of the DETR architecture, adapted from the original paper (2020)

Why transformers?

Intuitively, the encoder in DETR transforms the dense backbone features into a semantically structured representation of the image that captures relationships between regions through global self-attention.

The transformer decoder takes a fixed set of N learned object queries, each representing a potential object slot. It then iteratively refines these to produce final bounding boxes and class predictions. It does this through two attention operations:

Self-attention among the queries, enabling them to model interactions and avoid duplicate detections (e.g., two queries focusing on the same object).

Cross-attention between the queries and the encoder’s output features, allowing each query to attend to the most relevant parts of the image and extract the corresponding visual evidence.

Attention layer in the DETR decoder. The output embedding of the cross-attention module serves as the content query for the next layer. The output features of the encoder are the key and value for cross-attention. The positional query is learnable and shared over self-attention and cross-attention in all layers

Through the clever use of attention in the decoder, DETR replaces traditional components like anchor boxes and non-maximum suppression with a fully end-to-end transformer-based detection process.
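The two attention operations can be sketched in a few lines of plain Python. This is a heavily simplified illustration: it leaves out the learned projections, multiple heads, layer normalization, the feed-forward block, and the spatial positional encoding on the encoder features. The function names are our own and do not come from any DETR codebase.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_layer(content, positional, memory):
    """One simplified decoder layer: query self-attention, then cross-attention."""
    q = add(content, positional)
    # Self-attention: the queries exchange information with each other.
    content = add(content, attention(q, q, content))
    # Cross-attention: each query reads from the encoder output (keys and values).
    q = add(content, positional)
    content = add(content, attention(q, memory, memory))
    return content


def random_vectors(rows, dim):
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


random.seed(0)
dim, num_queries, num_tokens = 4, 3, 6
content = [[0.0] * dim for _ in range(num_queries)]  # content queries start at zero
positional = random_vectors(num_queries, dim)        # stands in for learned positional queries
memory = random_vectors(num_tokens, dim)             # stands in for encoder output features
refined = decoder_layer(content, positional, memory)
```

Stacking several such layers, each reusing the same positional queries, gives the iterative refinement described above.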

Direct set prediction

DETR reframes object detection as a direct set-prediction problem. Given an image, it predicts a fixed set of N bounding boxes corresponding to the object queries. Because N typically exceeds the number of actual objects, many predictions correspond to a special “no-object” class and are discarded at inference. During training, the Hungarian algorithm performs bipartite matching between predicted and ground-truth boxes, ensuring each ground-truth box is paired with exactly one prediction in a permutation-invariant way. The loss is then computed on these matched pairs.

Overcoming DETR’s shortcomings

Despite its elegance and powerful prediction paradigm, slow training convergence and low performance on small objects limited adoption in practical systems early on. Over the years, several enhancements drastically improved the performance of Detection Transformers:
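The matching step can be illustrated with a toy example. The sketch below makes two simplifying assumptions. First, the cost is only the L1 distance between boxes, whereas the DETR matching cost also includes class probability and a generalized IoU term. Second, it finds the optimal assignment by exhaustive search over permutations rather than with the Hungarian algorithm. Both return the same optimum, but only the Hungarian algorithm scales to realistic N. In practice you would reach for something like SciPy's `linear_sum_assignment`.

```python
from itertools import permutations


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_boxes, gt_boxes):
    """Return {gt_index: pred_index} minimizing the total L1 cost (brute force)."""
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    assert n_pred >= n_gt, "DETR predicts at least as many boxes as there are objects"
    best_cost, best = float("inf"), None
    for perm in permutations(range(n_pred), n_gt):
        cost = sum(l1(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return {g: p for g, p in enumerate(best)}


# Boxes as (cx, cy, w, h), normalized to [0, 1]. Values are made up.
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1),
         (0.8, 0.8, 0.3, 0.3), (0.4, 0.6, 0.2, 0.2)]
gts = [(0.82, 0.79, 0.3, 0.3), (0.12, 0.1, 0.1, 0.1)]

assignment = match(preds, gts)
unmatched = sorted(set(range(len(preds))) - set(assignment.values()))
print(assignment, unmatched)
```

The unmatched predictions are the ones that receive the “no-object” target during training.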

Deformable DETR introduced deformable attention, an efficient multi-scale attention mechanism tailored to the task of object detection.

The authors of Efficient DETR were the first to use top-k query selection for better initialization of the object queries for the decoder.

DN-DETR drastically improved training convergence by adding an auxiliary task: denoising perturbed ground-truth bounding boxes.
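Of these three ideas, top-k query selection is the easiest to show in code. The sketch below is a simplified view: each encoder token gets a confidence equal to its highest class score, and the k most confident tokens are used to initialize the decoder's object queries. The function name and the scores are made up for illustration.

```python
def select_top_k(token_class_scores, k):
    """Return the indices of the k encoder tokens with the highest class score."""
    confidence = [max(scores) for scores in token_class_scores]
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return order[:k]


# Five encoder tokens with three class scores each (made-up values).
scores = [
    [0.10, 0.05, 0.02],
    [0.70, 0.10, 0.05],
    [0.20, 0.85, 0.01],
    [0.03, 0.02, 0.04],
    [0.30, 0.25, 0.60],
]
selected = select_top_k(scores, k=3)
print(selected)
```

Starting the decoder from tokens that already look like objects, instead of from queries with no image-specific information, is what gives it a better initialization.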

DETR evolution throughout time. Real-time variants arose from 2024 in two families: the RT-DETR family indicated in blue, and the LW-DETR indicated in purple.

Real-time transformer object detection

From 2024 onwards, DETRs really started to challenge YOLO in real-time detection, eventually surpassing it in accuracy while remaining competitive in speed and efficiency. There are two schools of thought that compete for the state of the art nowadays:

RT-DETR (real-time DETR) sticks to the original DETR architecture and focuses on optimizing the encoder and the initialization of the object queries. D-Fine currently leads this family with a heavily optimized training strategy centered on the decoder. Very recently, DEIMv2 extended it further by integrating DINOv3 features in its backbone.

LW-DETR (light-weight DETR) adopts a simpler idea: replace the traditional CNN backbone and encoder with a pure Vision Transformer (ViT). RF-DETR (Roboflow DETR) leverages this especially well by starting from a pretrained DINOv2 encoder.

Work on Detection Transformers is very much alive: DEIMv2 was released less than two months ago, while Roboflow put their paper on RF-DETR on arXiv just last week!

Object detection performance

How do these advancements reflect on performance benchmarks? The figure below summarizes the performance of YOLO11, D-Fine, and RF-DETR for relevant model sizes on the well-known COCO dataset.

Performance comparison between leading model architectures for their corresponding nano (N), small (S), medium (M), and large (L) variants. Indicative latency measures for each model size indicated between brackets.
*Not pretrained on Objects365 dataset **RF-DETR L is not released yet

Some important take-aways from these numbers:

Both D-Fine and RF-DETR clearly outperform YOLO 11 for all sizes.

RF-DETR’s smaller models stand out, with the nano variant outperforming the others by a wide margin. This is likely because RF-DETR-N already benefits from a strong DINOv2 backbone.

D-Fine’s performance scales the best with model size, with the large variant scoring a whopping 57.4 mAP.

Parameter count

So, RF-DETR for small, very fast models and D-Fine when things get more complex? There is another side to the story. To finish off this post, I’d like to highlight an important difference between D-Fine and RF-DETR. For that, let’s take a look at the following figure:

Model sizes in million parameters for YOLO11, D-Fine and RF-DETR for their corresponding nano (N), small (S), medium (M) and large (L) variants. YOLO11 shows the best downward trend for larger model sizes with D-Fine close.

One of the first things to stand out is that D-Fine and YOLO11 become significantly lighter as their model sizes shrink, while RF-DETR’s parameter count declines by only around 5 million. This somewhat surprising observation results from the fact that RF-DETR was trained with a technique called Neural Architecture Search (NAS). NAS automatically finds network architectures that are Pareto optimal for the accuracy-latency trade-off.

Interestingly, the “small” RF-DETR architectures found by NAS end up only slightly lighter than the “large” variants. RF-DETR model sizes thus reflect speed rather than parameter count. D-Fine’s model sizes, in contrast, are on par with YOLO 11, making it the more versatile DETR architecture that can be adapted to a wide range of scenarios, including resource-constrained edge environments.

Conclusion

Real-time Detection Transformers represent one of the most significant recent shifts in computer vision. Their rapid evolution shows how transformers have become not only viable but actually preferred in scenarios that demand both high speed and high accuracy, even in resource-constrained scenarios. Just as important, their Apache 2.0 License makes them easy to use, enabling practical adoption beyond academic benchmarks.

D-Fine and RF-DETR have set the new standard for real-time object detection moving forward. D-Fine shows the best scaling across speed, accuracy, and model size. The small RF-DETR variants are remarkably accurate and fast for their size, but the bigger models fall short of D-Fine when evaluated on the well-known COCO dataset. However, the field keeps on changing rapidly, so we’ll keep on tracking progress on both to make the best possible choices for every problem.

If you’re working on demanding detection problems where accuracy, robustness, and efficiency matter, we can help. We tailor DETR-based models to your specific application, integrate them in video processing pipelines on our DM platform, and set up continuous improvement loops to ensure performance keeps rising as new data comes in. Reach out; we’d be excited to turn cutting-edge Detection Transformer research into real, production-grade impact for your system.

Automated Retopology for 3D Assets

Thibaud Despriet — Thu, 23 Oct 2025 08:11:52 GMT

TL;DR

Clean, anatomy-aware retopology remains a critical bottleneck that current manual, semi-automatic, and fully automatic tools only partly solve. Retopomeister detects anatomical anchors, selects curated source topologies, and uses neural wrapping to transfer them onto target meshes - while keeping artists in control.

By transferring production-ready metadata like UVs, vertex groups, and seams, it streamlines the entire pipeline. The prototype hints at what’s next: richer keypoints, interactive loop sketching, and an iterative, artist-in-the-loop workflow.

Result: anatomy-aware, production-ready topology with intact UVs and seams - faster iteration, smoother deformation, less cleanup.

Retopology, the tedious craft of untangling chaotic 3D meshes into clean, production-ready models, is one of the least glamorous time sinks in animation and gaming, and exactly where AI can shine. In this blog post we will outline the complexity of the problem and introduce our prototype for AI-based retopology.

Dense, high-poly sculpt on the left with the retopologized clean, quad-only mesh on the right
(source: from the author of this Reddit comment)

1. The Retopology Bottleneck

Every 3D character artist knows that sculpting is only half the journey. The dense, high-poly sculpt still needs to be converted into a clean, primarily quad mesh. This step is essential for performance, animation, texture mapping and overall compatibility. Modern engines can handle high polygon counts, but clean retopology is still essential for characters that need to deform properly, be rigged effectively, or be edited reliably.

Horse in motion with well-structured quad retopology (source: Sketchfab Horse Walk)

Despite its importance, manual retopology continues to be a significant pain point for artists, often requiring hours of tedious, repetitive work.

1.1. The complexity of clean topology

Retopology is slow because it’s not just about placing edges and faces, it’s about carefully designing loops that follow the anatomy of the character and making sure these loops lie next to each other in a tidy manner. Artists manually paint edges and vertices so that loops flow naturally around eyes, mouths, shoulders, and joints, ensuring that deformation during animation looks smooth. Every vertex must connect correctly to form continuous edge loops, and even small mistakes can create pinching or stretching when the character moves.

Human head mesh with colored facial edge loops (source: CMU animation)
Human body mesh with colored joint and limb loops (source: Pinterest)

Semi-automatic tools can fill patches of quads or suggest loops, but due to the wide variety of body shapes and facial structures in animation and gaming, there’s no one-size-fits-all tool for retopology. Artists constantly have to adapt, tweak, and reconnect loops to achieve a clean, animation-ready mesh, which is why the process often takes hours for a single character.

1.2. The cost of early mistakes

Retopology becomes especially tedious because the loops and connections you create early on dictate the remainder of the mesh. If an edge loop is misplaced or a vertex is misconnected, it can throw off the flow of nearby loops, causing pinching, stretching, or irregular quads that are difficult to fix later. Correcting these mistakes often requires redrawing entire sections of the mesh or even starting over from scratch. The repetitive nature of carefully connecting vertices, edges, and loops, combined with the high stakes of early errors, turns what should be a precise and structured process into a painstakingly slow and frustrating task.

1.3. Why new 3D generation tools don’t help

Recently, a number of tools such as Meshy, Cube, and Alpha3D have emerged that generate meshes directly from prompts or sketches. While impressive, these one-step pipelines tend to produce dense and unstructured topology that artists can’t easily refine. More importantly, they strip away creative control: instead of guiding the design, artists are left to accept or reject a black-box result. For professional character work, that’s rarely acceptable. Datameister focuses on powerful generative tools that give control to artists to leverage their skill and experience.

Some of the meshes generated using Meshy (source: Meshy AI)
Topology of a generated mesh using Meshy (source: robot model Meshy AI)

Our approach flips this around. Instead of automating the parts where artists add value - their vision, design choices, and sculpting - we target the parts they least want to do: the repetitive, structural, and technical labor of converting a sculpt into clean, animation-ready topology. By automating the boring, low-value steps, we leave the artist’s creative freedom untouched while still delivering a major productivity boost.

In the remainder of this post, I’ll first walk through the tools artists currently use for retopology and highlight where they fall short. Then I’ll introduce Retopomeister, our approach to automating character retopology with AI. Finally, I’ll share where we think this technology can go next and how it could transform the character creation pipeline.

About the author: I’m Thibaud, an AI Engineer at Datameister passionate about machine learning, AI, and computer vision. What drives me is the transformative power of these technologies to solve real-world problems in novel ways. When I discovered Datameister at a career event, I was immediately drawn to its cutting-edge AI projects and collaborative, driven culture. The combination of technical innovation and team energy made it clear this was the place to grow and contribute meaningfully.

2. What’s already out there

First, I’d like to take a moment to reflect on the current state of retopology tools. The options available to artists today can be grouped into three main categories of assistance: manual retopo, semi-automatic retopo, and automatic retopo. Each of these categories brings its own strengths and weaknesses to the table, offering different balances between control, speed, and usability. Understanding where these approaches excel - and where they fall short - provides important context for why we see room for something new.

Manual retopology is still often considered the benchmark for production quality since it gives artists a high degree of control. Turning a dense sculpt or scan into a clean, low-poly mesh isn’t just about reducing triangle count; it’s about placing quads so loops follow anatomy, so joints bend without pinching, and so details can be added or removed locally using loop cuts without wrecking the rest of the mesh. That’s why artists still sketch, stitch, and hand-place loops around eyes, mouths, shoulders and wrists: those concentric and radial loops act as deformation buffers and make rigging predictable. While manual work is slower, for production-quality characters it’s often the only way to guarantee animation-friendly results.

Manual retopology of a finger from a human mesh (source: YouTube video)

Most artists, however, don’t rely solely on manual retopology. They use tools and addons for semi-automatic retopology that shorten the manual workflow, like Blender’s Bsurface or Maya’s Quad Draw. These mostly provide quicker, sketch-based retopology: the artist sketches contours, strokes or individual polygons directly, and the tool fills them in with patches of quads in real time. These tools speed up the artist but still rely heavily on human guidance through sketching.

Blender’s Bsurface using sketches for retopology (source: YouTube video)

Lastly, we have the ideal case of automatic retopology tools like ZBrush’s ZRemesher, Blender’s Exoside Quad Remesher, and Instant Meshes. Most of these are not yet fully automatic, one-click systems: they often still require some sketches or parameter changes before they work optimally. While they can generate a decent starting point in minutes, their results vary: one model might come out near-perfect, while another needs heavy cleanup. Because of this lack of reliability, they are often skipped entirely by artists who would rather do the retopology manually with a guaranteed result.


In short, manual retopology is slow but highly reliable, giving artists precise control over loop placement and mesh flow. Semi-automatic tools speed up parts of the process, yet still demand careful input and adjustments. Fully automatic methods are fast, but often produce edge flows that don’t follow the anatomy, leaving messy areas that need cleanup.

3. Retopomeister - Automating character retopology

Other tools we’ve discussed fall short in areas where Retopomeister can step in. For automatic retopology, the biggest weakness lies in loop placement. These methods may reduce polygon counts and generate meshes that look superficially correct, but their edge loops often fail to follow the underlying anatomy. Another issue is the wrong termination of loops, which is tedious to fix. This misalignment becomes especially problematic for animation, where clean deformation depends on loops flowing around joints and facial features.

That’s why the idea of reusing an existing source topology is so compelling. If a source mesh already has anatomically sound loops, why not adapt that proven structure to a new sculpt of the same type? By overlaying a clean, pre-designed topology onto the target geometry, we can preserve the advantages of manual artistry while automating the repetitive transfer work.

Retopomeister works through two main mechanisms. The first is AI-driven keypoint detection, which identifies important anchor points on the target geometry - like the hands, feet, chest, and elbows. The model can detect these points reliably across different meshes and poses.

The second mechanism is mesh wrapping, where the source topology is deformed to fit the target geometry as closely as possible following the keypoints. This even allows T-pose topologies to be fitted on A-pose geometries and vice versa. Earlier tools, like Blender’s Softwrap plugin, required manual adjustments, moving the mesh over the target like a digital “skin.” Retopomeister automates this process using a neural network, fitting the source mesh over the target efficiently.

Together, these systems create an automated pipeline that lets you retopologize a new mesh using an existing retopology.

High-level overview of the key components of Retopomeister, converting an input triangle geometry to a clean, quad mesh

3.1. Keypoint detection

The keypoint detection model was trained on a large dataset of humanoid characters, where it learned to reliably identify 12 anatomical anchor points across different meshes and poses. Because the training process is unsupervised, the approach is not limited to humanoids - given the right dataset, the same method could be adapted to creatures, props, or any other mesh type, without the need for manual labeling. For humanoids, the detected keypoints already provide a strong foundation for retopology: the AI consistently finds landmarks like hands, feet, elbows, and chest, which are exactly the areas artists use to guide loop placement.

Importantly, these keypoints are not locked in. After generation, artists can still add, remove, or reposition existing anchors, ensuring they remain in full control of the process. This balance between AI-driven automation and artist-driven fine-tuning reflects feedback we received directly from professionals. As Thijs from studio TOVENAAR put it: “You want a tool that helps as much as possible, but still gives you full freedom to easily make adjustments and doesn’t constrain you.” Retopomeister’s keypoint detection was built with that philosophy in mind.

3.2. Mesh wrapping

Mesh wrapping is where the system brings everything together. It takes the source topology, the target geometry, and the set of anchor points, then runs an optimization procedure to deform the source mesh so it matches the target as closely as possible. Earlier solutions, like Blender’s Softwrap, already offered this kind of fitting but required extensive manual tweaking.

Footage of using Blender’s Softwrap to manually wrap a retopology to a new head mesh (source: Blender Softwrap)

Retopomeister automates this step through a neural model, aligning the source topology to the target in a way that respects both the geometry and the detected anchors. Two variants are available. The asymmetric mode works well for characters or objects that are naturally non-symmetric. But based on feedback from artists themselves, we found that in practice, most models are designed symmetrically first, and asymmetry is added later. To support this workflow, we implemented a symmetric mode that enforces mirrored deformation. This uses a combination of a soft symmetry loss and explicit averaging of mirror-point pairs, guaranteeing perfect bilateral symmetry while still fitting the overall form. The result is a workflow that adapts to both artistic styles: symmetric by default, asymmetric when needed.
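To make the symmetric mode more concrete, here is a minimal sketch of the two ingredients mentioned above: a soft symmetry loss and explicit averaging of mirror-point pairs. It is an illustration under assumptions, not Retopomeister's actual code: the function names, the `mirror_index` array, and the choice of a symmetry plane at x = 0 are all hypothetical.

```python
import numpy as np

def reflection(axis):
    """Per-coordinate sign flip for a mirror plane through the origin (assumed)."""
    r = np.ones(3)
    r[axis] = -1.0
    return r

def symmetry_loss(vertices, mirror_index, axis=0):
    """Soft penalty: mean squared distance between each vertex and the
    reflection of its mirror partner. Zero for a perfectly symmetric mesh."""
    diff = vertices - vertices[mirror_index] * reflection(axis)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def average_mirror_pairs(vertices, mirror_index, axis=0):
    """Hard projection: replace each vertex by the average of itself and its
    reflected partner. Vertices paired with themselves land on the plane."""
    mirrored = vertices[mirror_index] * reflection(axis)
    return 0.5 * (vertices + mirrored)

# Toy mesh: vertices 0 and 1 are a left/right pair, vertex 2 sits near the plane.
verts = np.array([[1.0, 0.0, 0.0], [-1.2, 0.1, 0.0], [0.05, 1.0, 0.0]])
mirror = np.array([1, 0, 2])  # hypothetical pairing, mirror[mirror[i]] == i
sym = average_mirror_pairs(verts, mirror)
print(symmetry_loss(verts, mirror), symmetry_loss(sym, mirror))
```

The soft loss can guide an optimizer toward symmetry during fitting, while the averaging step removes whatever asymmetry is left, which is one way to read the "guaranteeing perfect bilateral symmetry" claim.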

Thanks to GPU acceleration, the wrapping process runs significantly faster, with fitting typically completing in under two minutes. This short turnaround lets artists adjust anchors or tweak settings and quickly see the results, transforming what was once a slow, CPU-bound process into an efficient, iterative workflow. The added responsiveness makes Retopomeister not only automated, but also fluid and well-suited for everyday production.

3.3. Source topologies with AI search

The source topology can be chosen from a database of predefined topologies. Since the starting topology plays a crucial role in the entire process, beginning with a poor or only moderately suitable one will inevitably limit the final result. That’s why we aim to select the source topology that best matches the target model. The larger and more diverse the source topology dataset, the more powerful Retopomeister becomes. However, finding the most suitable model can become increasingly challenging as the database grows. To address this, we use an AI-driven search algorithm to quickly identify the most relevant source topology for the target geometry.

The source topology and target geometry searching process before anchors and wrapping
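The post does not say how the search works internally. One common way to build this kind of lookup is to embed every mesh as a vector and rank library entries by cosine similarity to the target. The sketch below assumes exactly that; the embeddings and the function `rank_source_topologies` are hypothetical stand-ins.

```python
import numpy as np

def rank_source_topologies(target_embedding, library_embeddings, top_k=3):
    """Return (indices, scores) of the top_k library entries by cosine similarity.

    Both inputs are assumed to come from some shape encoder that is not shown here.
    """
    lib = np.asarray(library_embeddings, dtype=float)
    tgt = np.asarray(target_embedding, dtype=float)
    lib_unit = lib / np.linalg.norm(lib, axis=1, keepdims=True)
    tgt_unit = tgt / np.linalg.norm(tgt)
    scores = lib_unit @ tgt_unit          # cosine similarity per library entry
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

# Toy 3-D "embeddings" for a library of three source topologies.
library = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
best, best_scores = rank_source_topologies([1.0, 0.0, 0.0], library)
print(best, best_scores)
```

For a large library, the brute-force matrix product would typically be swapped for an approximate nearest-neighbor index, but the ranking idea stays the same.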

An important practical upside is that many curated source topologies already carry production-ready metadata - UV unwraps, vertex groups and material seams. When a good source topology is fitted to a target, those asset-level mappings can often be transformed along with the mesh, which can meaningfully reduce downstream work.

3.4. Evaluating the retopology results

We developed Retopomeister in collaboration with artists, gathering feedback from teams like studio TOVENAAR and others. One recurring theme was clear: they didn’t want a tool that dictates creative choices, but one that quietly takes away the repetitive, low-value labor. Retopomeister was built with that philosophy.

To measure how well the system actually performs, we also built an in-house evaluation suite called RetopoCheck. This tool compares meshes across multiple dimensions: statistics, renders, wireframes, zebra stripes, pixel-wise differences, and geometric metrics like Hausdorff distance and curvature. With this, it becomes easy to spot errors, highlight areas needing improvement, and iteratively refine the system. It’s the same tool we used to tune Retopomeister itself, providing constant feedback loops for quality.

The evaluation generated by RetopoCheck to compare retopology quality to the original mesh
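Of the geometric metrics listed above, Hausdorff distance is the easiest to show in a few lines. The sketch below computes the symmetric Hausdorff distance between two point sets, for example vertices sampled from the original and the retopologized mesh. It is a brute-force illustration, not RetopoCheck's implementation, and the function name is our own.

```python
import numpy as np

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two (N, 3) and (M, 3) point sets.

    Brute force: builds the full N x M distance matrix, so it only suits
    small samples. Larger meshes would need a spatial index.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    a_to_b = dists.min(axis=1).max()  # worst-case distance from a point in A to B
    b_to_a = dists.min(axis=0).max()  # worst-case distance from a point in B to A
    return max(a_to_b, b_to_a)

original = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
retopo = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
print(hausdorff_distance(original, retopo))  # 2.0
```

Because the metric reports the single worst deviation, it is useful for catching local fitting errors that an average distance would hide.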

The results so far have been promising. With strong anchor points and a good source topology, Retopomeister consistently produces meshes with clean edge loops and animation-friendly flow. Compared to existing tools, the added step of transferring from a curated source topology with AI-generated anchors gives it a clear edge. The extra inputs don’t limit the method - they actually empower it. We can provide a library of source topologies, while the anchors already come pre-generated and will continue to require less and less artist correction over time.

Demo of Retopomeister generating anchors and fitting a source topology to a target geometry (in symmetric mode)

3.5. Transferability

Since the keypoint detector is learned without labels, adapting Retopomeister to other mesh categories is straightforward: you only need a modest dataset of meshes from the new category. Retraining the detector produces anchors that reflect the functional parts and geometry of creatures, props, or any other subject, and the rest of the pipeline - source-topology selection and neural wrapping - can be applied unchanged. We validated this by retraining the keypoint model on meshes of cats and observed the same flexible, anchor-driven fitting behavior. In short, extending Retopomeister beyond humanoids requires little more than the right example meshes.

Demo of Retopomeister also working on cat meshes by generating anchors and fitting (in asymmetric mode)

3.6. Future work

There are several clear directions for pushing Retopomeister further. One exciting direction is increasing the number and precision of keypoints. While the system currently detects 12 major anchors, we envision expanding this to finer landmarks - like fingers, facial features, or subtle details such as the wings of the nose. More precise anchors mean the retopology can follow complex structures more closely, giving artists even greater control over deformation and loop placement.

We’re also looking at ways to make the wrapping process even faster. Today it completes in under two minutes, but with further optimizations, we could aim for near-instantaneous feedback. Artists could adjust an anchor or tweak a setting and immediately see the results, making Retopomeister feel like a truly interactive assistant.

Finally, we see potential for a more hands-on interface, where artists can sketch edge loops directly onto the target mesh. These loops would act as hard constraints during the fitting, combining the speed of automation with the precision of manual control. Together, these improvements could turn Retopomeister into a fully collaborative retopology tool - taking care of the repetitive work while leaving the creative choices entirely in the artist’s hands.

4. Conclusion

Retopology has always been that quiet but stubborn bottleneck in 3D character production. It is vital for animation and performance, yet it is slow, repetitive, and unforgiving. Retopomeister shows that AI does not need to replace artistry to make a real difference. By understanding anatomy and reusing clean source topologies, automation can take on the heavy lifting while artists stay focused on what matters most: shaping characters that move, feel, and perform.

As we keep refining the system, the goal is not full automation for its own sake. It is to remove the friction that blocks creative flow. Fewer hours lost to cleanup, fewer guesses in topology, and a smoother handoff from sculpt to rig. When AI handles that work, artists move faster, experiment more, and bring better characters to life.

If you’re a studio or team with recurring retopology needs - humanoids, creatures, props, or product assets - we’d love to collaborate. We can tune Retopomeister on your specific object classes, integrate with your existing tools, and benchmark results using our RetopoCheck evaluation. Reach out, and let’s explore how much time your team could save when clean topology starts as a given, not a goal.

Datameister at SIGGRAPH 2025: Insights and Trends

Jarne Van den Herrewegen — Mon, 22 Sep 2025 18:35:28 GMT

At SIGGRAPH 2025, AI continued to gain traction across computer graphics, with clear signs of both progress and resistance. While interest was high and more research papers than ever involved deep learning, practical adoption remains fragmented and often constrained by infrastructure and workflow realities. This post offers a grounded perspective beyond the hype, covering industry economics and AI, 3D generation, and simulation. Rather than a deep technical dive, it shares observations and takeaways from a week of demos, papers, and conversations on the show floor.

Datameister’s Axel (CTO) and Jarne (AI Engineer) geeked out for a week in the Vancouver convention center.

Industry Pressure & AI for Efficiency

Social media and streaming are reshaping attention and revenue in the entertainment industry. Compared with pre-COVID times, people spend more time on social media and streaming platforms and less time going to movie theaters or playing high-end games, as Natalya Tatarchuk (CTO Activision) clearly put it during the Advances in Real-Time Rendering in Games sessions. The shift puts pressure on major film/animation studios and AAA gaming studios to do more with less while keeping quality high. AI could play a role in this evolution and has been a polarizing topic in the industry for the past few years. Layoffs in the industry have partially been attributed to (hyped?) expectations that AI will automate artists and devs away. On the SIGGRAPH floor, however, there was a careful yet broad interest in AI tooling.

Where is AI adoption the largest? The most popular applications lie in the ideation and discovery processes early on, where generative models bring small ideas to life quickly. Midjourney and ComfyUI were mentioned the most and seemed widely adopted, besides the obvious ChatGPT and Gemini as general assistants. For later stages in large projects, where the playbook is fixed, generative AI was deemed too low quality for making final renders. 3D-related AI tools are mostly in an experimental phase: many studios are looking into new tools, but adoption seems low.

What is coming in the next two years? In the paper section, the amount of research on AI for computer graphics was astounding, and breakthrough applications are expected to arrive in the coming years. Animation generation was covered widely, for example by AnyTop, which generates animations and skips mocap entirely. 3D generation quality is improving rapidly, although it is clearly not production-ready for everyone. NVIDIA is pushing heavily for neural shading and physics simulation for robotics and extreme realism in digital environments. Further in this post, we cover 3D Generative AI and Neural Shading & Simulation more broadly.

AnyTop generated animations.

What are the practical challenges for the new AI tools? On-premise solutions and walled gardens are often necessary to protect intellectual property. For many large studios, especially those outside North America and Europe, cloud-only platforms raise serious concerns around privacy, compliance, and long-term access. Yet most of the new tools on the market today are web-based or cloud-native, making them hard to adopt. Integration is another pain point: many tools are designed as one-size-fits-all platforms rather than plug-ins or extensions to existing software. Midjourney and ComfyUI are clear examples. Leaving your creative software environment breaks flow, adds friction, and makes it harder to stay focused. While tech artists are quick to explore these tools, many other creatives remain cynical and are much slower to adopt generative tools - especially because AI models are trained on their data and threaten their income. Automating the boring parts of creative workflows, e.g. retopo, finds more support among artists.

3D Generative AI

A hot space at SIGGRAPH this year was 3D generative AI. A wave of new startups like Deemos, Chat3D, Tripo, and Meshy are driving innovation in this area. These platforms make it faster than ever to go from an idea to a rough 3D model; however, they still lack the quality and features that AAA studios need.

Geometric quality is getting really good, and results are often visually convincing at first glance. However, under the hood, most generated assets fall short of production standards. Textures in particular are often low-resolution or poorly aligned, topology is usually too chaotic to animate or modify, and UV unwrapping is either missing or completely unusable. Getting high-quality PBR materials is clearly still a challenge. These are still critical bottlenecks if the goal is integration into a film, game, or simulation pipeline.

High-detail geometry comes with high density topology (Rodin 1.5 - Deemos)

Control is another big challenge. Most current tools offer minimal ways to guide or constrain the output beyond a text prompt or reference images. Iterating on a result to refine a shape, preserving certain features, or regenerating only part of a mesh is sometimes supported but not in a practical way. At Datameister, we’ve built a more structured approach in Trellis (see our presentation and whitepaper), where users can iteratively guide generation using 3D constraints and highly precise 2D edits, making the process more interactive and reliable.

On the infrastructure side, most 3D generative tools run fully in the cloud and are often built by teams outside of North America or the EU. That can be problematic for studios with strict data policies, especially when dealing with confidential assets or client data. Additionally, few of these tools integrate well into existing workflows. Even something as simple as naming conventions, material assignments, or unit scales is often missing-forcing teams to manually adapt outputs before use. These small gaps add up quickly and break creative flow. They’re part of the reason why many artists still prefer traditional modeling over trying to wrangle with generative results.

Neural Shading & Simulation

NVIDIA demonstrated an Unreal Engine integration for neural shading, where neural networks optimize rendering by compressing textures first and approximating materials during real-time rendering. This gave us a sense of inception, as neural networks approximate functions and a rendering pipeline approximates real-world physics. The benefits include reduced memory usage (3x - 10x) and higher fps (2x - 5x) at minimal visual loss, but the tech is still early. No major productions are using it yet. NVIDIA offered multiple courses at SIGGRAPH and theoretical sessions to introduce their frameworks, such as Slang, to support this new approach.

Neural BRDF from Zeltner et al. (NVIDIA)

Robotics and simulation teams are increasingly adopting real-time graphics tooling originally built for film and game production. Features like photorealistic sensor emulation, contact dynamics, and rigid body simulation are now being combined with rendering pipelines that support ray tracing, mesh-based collision, and procedural scene composition. At SIGGRAPH, several NVIDIA demos showed how procedural scene generation paired with high-fidelity physics can produce scalable virtual environments for robotics simulation such as the Disney droids. In parallel, AI models are being used to estimate physically based material properties (PBR) from real-world sensor data, like video footage captured by a self-driving car, allowing engineers to recreate complex scenes with realistic lighting and surface behavior. These materials can then be selectively altered-such as changing only the roughness of a road surface or the reflectivity of a wall-while keeping the geometry constant. This enables precise experimentation with visual variation, which is critical for stress-testing vision and control systems. As a result, the boundaries between creative rendering workflows and robotics simulation stacks are narrowing, resulting from shared needs for realism, control, and reproducibility.

Conclusion

AI is more present than ever at SIGGRAPH. The writing is on the wall: the volume of research papers involving neural networks indicates that this trend will definitely continue in the coming years. This industry is driven by technological advancements (SIGGRAPH was born for this exact reason), and AI is clearly becoming the next shift. As Datameister, we had a great time soaking up the energy, testing new tools, and discussing both the promises and limitations of current AI approaches. Many of the issues raised around integration, creative control, and production quality are challenges we actively help customers with.

In case you are looking for a partner in this space, reach out. We offer end-to-end support-from AI development and infrastructure to hosting and seamless integration with your existing creative pipelines.

Constraint-Aware 3D Generative Design: Editable, Iterable, Manufacturable

Ruben Verhack — Sun, 07 Sep 2025 11:14:32 GMT

TL;DR

Most 3D generative AI creates “fiction before physics”-beautiful shapes that break once real-world constraints are applied. Our approach flips that script. By defining which geometry is fixed and which can evolve, then iterating and editing directly from 2D views, we unlock a workflow where style and physics develop together.

Result: concepts that respect constraints from the start, faster iteration, and precise local control without endless retries.

In a recent post, our CEO Ruben shared why constraint-driven AI matters and how it shifts the role of designers and engineers into true co-creation. That piece, along with my CDFAM interview, explored the pain points that inspired our work: non-negotiable geometry in automotive, architectural design lock-in, and the costly late-stage changes that plague hardware teams.

This post builds on that foundation. Instead of focusing on the “why,” here we’ll go deeper into the “how”: the workflow that makes constraint-aware 3D generative design practical, and what new possibilities it unlocks for designers and engineers.

Watch the full length talk on AI-accelerated Automotive Design at CDFAM.

The problem

Industrial design always starts with hard points-clearances, battery packs, safety envelopes, structural elements-non-negotiable geometry that must survive every iteration. Real creativity means exploring every viable form within that rigid framework.

Yet most generative-AI tools treat these guardrails as afterthoughts. Two issues show up again and again:

Fiction before physics - Models produce striking shapes that collapse when confronted with packaging, safety, or structural data.

Ambiguity in 2D guidance - When designers provide input through images, sketches, or edited renders, models often confuse whether the edit should update texture (paint, logos, finishes) or geometry (indents, curves, structural features). A highlight on a car body might be misread as a paint effect, when in reality it should change the shape.

Our approach

Instead of patching those gaps after the fact, we bake constraints directly into the generative process. Using Trellis-based 3D diffusion, masked generation, and differentiable rendering, we let you sweep through shapes, textures, and styles while critical geometry stays frozen. Crucially, our pipeline can explicitly separate texture edits from geometry edits-or let the AI resolve the distinction automatically based on multiple edited views.

Example of texture-only update

From small-scale experiments-like a “bananacopter” wrapped around fixed screws and rotor blades that we’ll show later-to full automotive envelopes locked to battery packs and crash structures, the same workflow applies: critical geometry stays untouched while everything else evolves. Designers can add details with a few brushstrokes, lock down successful iterations, and push new variations-all without breaking the engineering rules in the background.

Example of geometry update

What this unlocks

Constraint-aware generation - Freeze non-negotiable geometry while exploring new shapes, textures, and styles around it.

Targeted iteration - Regenerate only in editable regions instead of starting over, keeping what already works.

2D-to-3D editing - Apply brushstrokes, sketches, or image edits to renders and propagate them back into the 3D model.

Texture vs. geometry control - Decide explicitly whether an edit changes surface appearance or underlying shape-or let the AI resolve it from multiple views.

Physics and style in sync - Bridge the “fiction-before-physics” gap so expressive designs remain viable against engineering constraints.

These capabilities matter because they address the biggest shortcomings in today’s generative 3D AI. To see why our approach is different, it helps to look at where the field stands today-and why standard models still fall short in real design workflows.

Generative 3D AI today

Over the past year, generative models for 3D objects have taken a big leap forward. The quality of single generations is now impressive-but for real design work, “one decent output” isn’t enough. In many cases, creatives spend as much time cleaning up assets as they would starting from scratch.

These limitations come from the models themselves. Because they are trained on large but biased datasets, they tend to favor certain shapes and styles. And while controllability has improved-mainly through image and text conditioning-these methods still leave designers with limited precision. They help nudge a model, but they don’t guarantee geometry that survives real-world constraints.

The baseline: image conditioning

Most 3D generative models, including Trellis, accept images as conditioning. In principle, you can draw edits on a render, feed that back into the model, and expect the change to appear in 3D. In practice, it rarely works reliably: surface details get lost, and it’s ambiguous whether an edit is meant to change texture (paint, logos) or geometry (indent, curve).

Image conditioning nudges a model, but it doesn’t guarantee precision. For real design work, that’s not control-it’s guesswork.

For example, if we sketch the letters DM on a render of a banana and use that as input, the resulting 3D generation often distorts or ignores the edit.

Image conditioning alone is unreliable: edits like “DM” drawn on a render (left) are often misinterpreted or lost in the resulting 3D output (right).

This lack of precision is one reason creatives hesitate to adopt 3D generative tools. It sets the stage for more robust methods-ones that don’t just nudge a model, but directly enforce constraints and interpret edits correctly.

Techniques that close the gap

The techniques we showcase here aim to bridge that gap: moving from nice-to-look-at outputs toward usable, constraint-respecting assets. In our demo, we show how to get precisely edited 3D assets that keep fixed geometry intact. Two key methods make this possible:

Masked generation - borrowed from image inpainting/outpainting, this lets us freeze non-editable regions and regenerate only where new geometry is allowed.

Differentiable rendering - a more advanced method where edits are made in 2D (on renders of the asset), then backpropagated through the renderer into the 3D latent space, updating either textures or geometry as intended.

We apply these ideas on top of the Trellis 3D generative model (see our Trellis tutorial blog post), though they can be used with other architectures as well. Trellis is especially useful because its structured latent space can decode into multiple modalities, including Gaussian splats and meshes-making it practical both for fast exploration and for downstream CAD or simulation. (It’s also widely adopted, with over 10K stars on GitHub.)

Trellis 3D generative model: structured latent space with decoders for Gaussian splats and meshes.

The bananacopter example

To illustrate, our demo walks through designing a toy helicopter around a fixed block with a screw and rotor blades. Using a 3D generative model with masked generation and differentiable rendering, we grow a whimsical “bananacopter” that respects the fixed geometry while still exploring creative variations. This simple example shows how the same workflow can scale to far more complex, constraint-heavy domains like automotive packaging.

Constraint-aware latent representation: the model combines designer intent (image) with fixed geometry (constraints) into a structured latent, where each voxel encodes both geometry and texture

Masked Generation

Masked generation is similar to image inpainting or outpainting. During denoising in the diffusion process, specified regions remain fixed. The idea was mentioned in the original Trellis paper under the name “Repaint,” although the official code was not released. There’s an unofficial implementation, but we found that a simpler approach works surprisingly well: initialize the generative voxel space with the constraints and regenerate only in masked regions outside of them. It’s straightforward to implement and pairs nicely with Trellis’ image conditioning.

Constraints define which geometry must remain fixed (left), while editable regions are free to regenerate (right).

By defining a mask to indicate which parts of the geometry the model is allowed to regenerate, and supplying guiding images, you can control both the 3D shape and texture. This gives designers creative flexibility while ensuring engineering-critical geometry stays untouched.

The example below shows constrained generation for the bananacopter: a toy helicopter wrapped around fixed screws and rotor blades. Only the free region is regenerated, guided by an image prompt, while the blades and gear block remain frozen.

Constraint-aware generation of the bananacopter: the fixed rotor + screw block remain untouched, while the editable region adapts freely.
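The simple masking strategy described in this section can be sketched in a few lines. The code below is a toy under stated assumptions: `denoise_step` is a hypothetical stand-in for one step of the real sampler, the voxel grid is a plain NumPy array rather than a Trellis latent, and re-imposing the frozen voxels after every step is our reading of "regenerate only in masked regions", not a confirmed detail of the pipeline.

```python
import numpy as np

def masked_generation(constraint, editable_mask, denoise_step, num_steps=10, seed=0):
    """Regenerate only the editable voxels while frozen voxels stay fixed.

    constraint:    array holding the fixed geometry (read where the mask is False)
    editable_mask: bool array, True where the model is allowed to regenerate
    denoise_step:  callable (x, t) -> x, placeholder for one sampler step
    """
    rng = np.random.default_rng(seed)
    # Initialize: constraints in the frozen region, noise in the editable region.
    x = np.where(editable_mask, rng.standard_normal(constraint.shape), constraint)
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        # Re-impose the frozen voxels so the sampler cannot drift away from them.
        x = np.where(editable_mask, x, constraint)
    return x

# Toy setup: a 4x4x4 grid whose first two slabs are engineering-critical.
constraint = np.zeros((4, 4, 4))
constraint[:2] = 3.0
mask = np.ones((4, 4, 4), dtype=bool)
mask[:2] = False

def toy_denoiser(x, t):
    return 0.5 * x + 0.5  # pulls every value toward 1.0

result = masked_generation(constraint, mask, toy_denoiser)
print(result[0, 0, 0], result[3, 0, 0])
```

A full RePaint-style sampler would also noise the frozen region to the current noise level before each step so both regions stay statistically consistent; that refinement is left out here for brevity.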

Precise texture editing with differentiable rendering

To overcome the limits of image conditioning, we developed algorithms that use differentiable rendering to feed edits directly back into the model. Instead of hoping the network interprets a sketch correctly, this approach maintains a computational link between the rendered pixels and the structured latent representation of Trellis.

The process works like this: starting from a latent (from a prompt or a masked generation), we decode it into Gaussian splats and render multiple viewpoints with a differentiable renderer. When a designer edits those renders-for example, painting the letters DM on the banana-those pixel edits are backpropagated into the latent. Iteratively, the latent updates until its decoded output matches the edited views.

Differentiable rendering: input renders with manual edits (left) are backpropagated into the latent, producing consistent texture edits across new 3D views (right).

This makes edits much more precise than image conditioning alone, because it avoids the information loss that happens when neural networks “interpret” guidance. With just 3–4 edited renders, you can enforce detailed changes that persist from any viewpoint in 3D space.
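The optimization loop behind this idea can be illustrated with a deliberately tiny model. In the sketch below, each "renderer" is just a fixed matrix that maps a latent vector to pixel values for one viewpoint, which lets us write the gradient by hand. The real pipeline decodes to Gaussian splats and relies on a differentiable renderer with automatic differentiation, so treat the linear maps, sizes, and the name `fit_latent_to_edits` as assumptions for illustration only.

```python
import numpy as np

def fit_latent_to_edits(latent, renderers, edited_views, steps=2000):
    """Gradient descent on sum_v ||R_v z - y_v||^2 over the latent z.

    renderers:    list of (P, D) matrices, one toy "differentiable renderer" per view
    edited_views: list of (P,) arrays, the designer-edited renders
    """
    stacked = np.vstack(renderers)           # all views stacked: (V*P, D)
    targets = np.concatenate(edited_views)   # (V*P,)
    # Step size 1/L, with L the Lipschitz constant of the gradient of this loss.
    lr = 1.0 / (2.0 * np.linalg.norm(stacked, 2) ** 2)
    z = latent.copy()
    for _ in range(steps):
        residual = stacked @ z - targets     # rendered pixels minus edited pixels
        z -= lr * (2.0 * stacked.T @ residual)  # backpropagate pixel error to latent
    return z

rng = np.random.default_rng(0)
renderers = [rng.standard_normal((6, 8)) for _ in range(4)]  # 4 viewpoints
z_start = rng.standard_normal(8)
z_edit = z_start.copy()
z_edit[0] += 1.0                              # stands in for the painted "DM"
edited_views = [R @ z_edit for R in renderers]
z_fit = fit_latent_to_edits(z_start, renderers, edited_views)
print(np.abs(z_fit - z_edit).max())
```

Even in this toy, the key property shows up: a handful of edited views is enough to pin down the latent, and any new viewpoint rendered from `z_fit` will reflect the edit.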

Editing geometry through 2D input

Differentiable rendering doesn’t just work for textures-it can also drive geometry edits. Combined with masked generation, this lets us update the underlying shape of an object using only a few edited 2D renders.

The process starts with masked generation: fixed geometry is preserved, while new voxels are generated in editable regions. Instead of guiding this only with images, we augment the pipeline with a differentiable voxel renderer. This ensures that edits made in 2D renders-such as an indent or curve-propagate directly into the 3D structure.

In practice, this means designers can sketch structural changes in a couple of 2D views, and the model updates its geometry to match-without breaking constraints.

Differentiable rendering for geometry: manual edits in 2D renders (left) are backpropagated into the voxel latent, producing consistent 3D geometry updates (right).
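As a rough illustration of geometry editing through 2D input, the sketch below optimizes voxel occupancies so that a soft silhouette matches an edited target image, while frozen voxels never change. Everything here is a simplifying assumption: a single orthographic view along one axis, a hand-derived gradient, and a plain occupancy grid instead of the Trellis voxel latent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_silhouette(logits):
    """Probability that a ray along the z axis hits at least one occupied voxel."""
    return 1.0 - np.prod(1.0 - sigmoid(logits), axis=2)

def edit_geometry(logits, editable_mask, target_silhouette, steps=200, lr=1.0):
    """Fit editable voxel logits to a target silhouette; frozen voxels are untouched."""
    logits = logits.copy()
    for _ in range(steps):
        occ = sigmoid(logits)
        sil = 1.0 - np.prod(1.0 - occ, axis=2)
        # Analytic gradient of sum((sil - target)^2) with respect to each logit:
        # 2 * (sil - target) * (1 - sil) * occ
        grad = 2.0 * ((sil - target_silhouette) * (1.0 - sil))[:, :, None] * occ
        logits -= lr * grad * editable_mask  # mask zeroes updates on frozen voxels
    return logits

# Rows 0-1 are fixed, solid geometry; rows 2-3 are editable and start undecided.
start = np.zeros((4, 4, 4))
start[:2] = 6.0
editable = np.ones((4, 4, 4), dtype=bool)
editable[:2] = False
target = np.ones((4, 4))
target[3] = 0.0                     # the 2D edit: carve away row 3
edited = edit_geometry(start, editable, target)
final_sil = soft_silhouette(edited)
print(final_sil.round(2))
```

In this toy, the carved row empties out and the kept row fills in, while the frozen block is bit-for-bit identical before and after, which mirrors the "update geometry without breaking constraints" behavior described above.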

Closing thoughts

Constraint-aware generation doesn’t just make 3D assets look good-it makes them usable. By freezing critical geometry and letting everything else stay fluid, the workflow you’ve seen here short-circuits the usual back-and-forth between design and engineering. The result is faster cycles, fewer dead ends, and assets that can move directly into CAD, simulation, or tooling.

But the bigger shift is what this unlocks: controllability. Designers can steer generation at the right level of detail-whether that’s broad form exploration or pixel-level edits-while staying aligned with engineering constraints. And because these methods are modular, they don’t just live in demos: we can integrate them into existing graphical pipelines, connect them to product workflows, and adapt them across industries.

If you’re working on design tools, simulation, or automation pipelines and this resonates, we’d love to connect. Reach out to explore how our constraint-driven generative pipeline can streamline your projects-or spark new ones.

Datameister Turns Two 🎂

Ruben Verhack — Mon, 30 Jun 2025 14:29:11 GMT

On June 12th, at Datameister’s two-year anniversary, we opened the doors to our brand-new Ghent HQ. It was a moment to show how a young, ambitious team can scale fast when curiosity meets craft. From our first proof-of-concepts to production-ready computer-vision and 3D-AI projects for sports analytics, automotive design, and entertainment pioneers, year two was our biggest leap yet.

Datameister’s two-year anniversary & new Ghent HQ

Year-2 Scorecard

We’ve been moving fast, building hard, and growing strong-these are just a few of the moments that defined our year.

Team growth: ↑ 7 to 14 brilliant minds mainly focused on in-depth AI engineering, DevOps and extra senior leadership.

Projects: More and more visual and 3D AI projects while diversifying our client base.

Platform upgrade: Upgraded Datameister Platform with faster visual data processing and robust in-house AI capabilities.

Conferences: Attended ECCV 2024, SuperNova, and UNWRAP (as speaker); planning to attend SIGGRAPH and speak at CDFAM in 2025.

New office: Finally moved into a larger Ghent HQ - plenty of room for our growing team and beautifully designed by our friends Dennis (from Boldhouse) and Tim Van Rensbergen.

One Year, Many New Faces: Our Team in Pictures

There was a time when we could write a blog post for every new hire-but with 7 amazing additions this year, that’s become a bit harder! A warm welcome to Lily, Niels, Jirne, Pierre, Ekaterina, Thijs, and Bernard (who’s missing in the photo).

What a difference twelve months make-here’s the bigger-than-ever Datameister crew! Check out our team a year ago!

Datameister two-year anniversary team photo outside new Ghent office, photos by Arthur Pieters

Thank You for Being Part of the Journey

None of this happens in isolation. To our clients who trust us with their toughest data problems, to our academic partners who stretch our thinking, and to the tech community that cheers us on-thank you. We wrapped up the year by hosting a well-deserved BBQ at our new office, bringing together clients, partners, suppliers, friends, and the team.

Datameister two-year anniversary celebration, photos by Arthur Pieters

A special thanks to Tim, our can-do-everything handyman, whose paintbrush, design sense and fix-it wizardry turned a bare space into a fully equipped HQ built for our next phase of growth.

Custom made bench by Tim Van Rensbergen.

What’s Next? 🚀

Year three is already in motion, with plenty ahead:

Further maturing Datameister and crystalizing our offering (stay tuned for our new leadership introductions)
Accelerating development of the Datameister Platform
More in-house development of AI capabilities
More technical demos and blog posts
Fresh roles on the careers page.

Follow the Journey

Follow us on LinkedIn & Instagram to see what year 3 brings.

Constraint-driven 3D Generative AI - Computational Design Symposium

Jarne Van den Herrewegen — Wed, 11 Jun 2025 14:07:31 GMT

I'm excited to be speaking at CDFAM Amsterdam, happening July 9–10, 2025, where I’ll share how we at Datameister are applying constraint-aware generative AI to real-world design challenges-starting in automotive, and now branching into architecture, consumer electronics, and beyond.

Generative design is often talked about as a creative revolution, but in practice most tools either ignore the constraints that engineers live with, or only validate feasibility after the fact. At Datameister, we’ve taken a different route. As I explained in my recent CDFAM interview:

“Instead of generating something and checking feasibility after the fact, the designer is co-creating with a system that already understands the constraints: structural, ergonomic, regulatory, whatever they may be.” - Ruben Verhack, CEO Datameister

Rather than pushing a generic model onto every workflow, we develop application-specific tools that embed domain constraints up front, enabling designers and engineers to collaborate with AI systems that understand the rules of the game before play even begins.

This is especially important in fields like automotive, where every bold design gesture must fit within tightly coupled systems: crash structure, visibility requirements, aerodynamics, and platform geometry. But it’s not just about cars.

In architecture, for example, early massing decisions often get locked in before regulations or budgets are fully known, leading to massive rework downstream. In consumer electronics, once layout and thermal constraints are defined, altering enclosures becomes prohibitively expensive late in the process. This kind of design lock-in is a structural issue across industries, and our tools aim to break that cycle by decoupling dependencies and enabling earlier, faster, lower-risk iteration.

By integrating real constraints directly into the design generation process, we allow teams to explore more and backtrack less.

The real shift, though, is in the role of AI: not as a black box, but as a co-creator. One that gives designers immediate, constraint-aware feedback and frees engineers to become enablers, not gatekeepers. As I describe in the interview, this turns design into a much more fluid, collaborative, and expressive process.

UPDATE (Aug 14th, 2025): Video is online!

Watch the full length talk recorded at CDFAM.

Read the full interview here to dive deeper into our approach, or catch my talk at CDFAM to see how this works in practice.

And if you're working with simulation, optimization, or engineering automation, especially in high-constraint environments, I’d love to connect in Amsterdam.

Datameister Platform: Accelerating AI Deployment for Visual Data

Ruben Verhack — Wed, 19 Feb 2025 08:23:58 GMT

TL;DR

Most MLOps platforms struggle to handle the challenges of visual AI, such as large-scale image, video, and 3D data processing. The Datameister Platform solves this by combining AI development and operations into one seamless, GPU-optimized environment.

Result: faster model deployment, easier iteration, and lower infrastructure costs. Clients get transparent pricing, real-time monitoring, and scalable performance without the need for in-house DevOps. It’s a simple, secure, and future-ready way to build, deploy, and manage visual AI solutions at production scale.

1. Introduction

At Datameister, we don’t just develop custom AI algorithms for visual data; we also offer a fully managed MLOps platform that handles deployment, monitoring, and maintenance. Our goal is simple yet powerful: dramatically reduce the time it takes to bring complex AI solutions to market, while keeping costs manageable and performance high.

Why does this matter? Because working with large-scale images, videos, and 3D objects demands more than a typical DevOps pipeline. GPU orchestration, specialized job scheduling, and real-time tracking are all crucial. By combining AI development with a dedicated MLOps platform, we ensure you can focus on what the algorithm does, not how to keep it running.

2. Why We Built the Datameister Platform

Our experience as an AI Research & Deployment Lab made one thing clear: quickly iterating on AI models and getting them production-ready requires much more than isolated data science and DevOps teams. Here’s how our platform addresses this:

Speed and Tight Integration
We unify AI development and infrastructure so new models can be deployed or updated fast. When something goes wrong, our engineers can debug in hours, not days, because they have full visibility into the logs, inputs, and outputs, without lengthy handovers.

MLOps for Visual Workloads
Rather than using generic cloud setups, we designed our platform for GPU-intensive tasks, such as image generation, video analysis, and 3D object processing. Our Kubernetes cluster and container minimalization strategies help keep inference times short and resource usage efficient.

Scalability With Flexibility
Whether you’re an SME taking first steps in AI or a startup racing to market, our platform adapts. You can start small, then seamlessly scale up to handle heavier loads or more advanced features, without rebuilding everything from scratch.

Maintainable, Agile Architecture
We shield you from complex DevOps chores: container orchestration, resource allocation, and performance tuning are handled behind the scenes. That enables us to rapidly iterate on algorithms, knowing that the underlying platform is stable and well-monitored.

3. Key Benefits: From Cost Efficiency to SLAs

3.1. Adaptive Scheduling with Multi-Tenant Efficiency

A major advantage of our platform is its multi-tenant design, which allows us to share baseline capacity across clients and reduce the constant spinning up and tearing down of machines when loads fluctuate. We spread jobs across EU-based data centers and major cloud vendors, automatically opting for cost-effective resources first (like spot instances) and shifting to on-demand if needed to maintain uptime.

For high-priority workloads, we offer a priority queue that can reserve dedicated or on-demand capacity to meet tight turnaround requirements. Meanwhile, our ongoing work in container minimalization, efficient scheduling, and GPU optimizations helps drive down startup times and overall latency, putting near real-time performance within reach for many visual AI use cases. By dynamically balancing workloads in a multi-tenant environment, we not only optimize resource usage but also deliver lower latencies and better cost efficiency than a one-size-fits-all cloud setup.
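As a rough illustration of the "cost-effective resources first, on-demand as fallback" idea, here is a minimal sketch. It is not the platform's actual scheduler: the `Pool` class, the `pick_pool` function, the prices, and the rule that high-priority jobs skip spot capacity are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical pool of GPU capacity (illustration only)."""
    name: str
    kind: str              # "spot" or "on_demand"
    price_per_hour: float
    free_gpus: int


def pick_pool(pools, gpus_needed, high_priority=False):
    """Choose the cheapest pool that has enough free GPUs.

    Spot capacity is tried first and on-demand is the fallback.
    Assumption: high-priority jobs skip spot pools to avoid preemption.
    """
    kinds = ["on_demand"] if high_priority else ["spot", "on_demand"]
    for kind in kinds:
        candidates = [
            p for p in pools
            if p.kind == kind and p.free_gpus >= gpus_needed
        ]
        if candidates:
            return min(candidates, key=lambda p: p.price_per_hour)
    return None  # no capacity right now: the job stays queued


pools = [
    Pool("spot-a", "spot", 0.50, 2),
    Pool("spot-b", "spot", 0.40, 8),
    Pool("on-demand-c", "on_demand", 1.20, 8),
]
print(pick_pool(pools, 4).name)                      # regular job
print(pick_pool(pools, 4, high_priority=True).name)  # priority job
```

A real scheduler would also have to deal with preemption, data locality, and queueing, which this sketch leaves out.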

3.2. Cost-Efficient Scaling and Transparent Pricing

Our pricing model aims to be straightforward, transparent and predictable, eliminating the hidden costs and inefficiencies that often come with managing AI infrastructure in-house.

Monthly Platform License: A fixed fee that covers platform maintenance, updates, and baseline support.

Credit-based Compute Cost: You’re billed for actual usage depending on job type (per GPU-hour or per job).

Flexible SLAs: A basic SLA covers core business hours, while higher tiers (with shorter response times or 24/7 coverage) come at an additional cost.

Beyond cost transparency, our platform removes the need for an in-house DevOps team, saving on hiring, training, and retention costs. With shared infrastructure and dynamic scheduling, clients benefit from higher efficiency and continuity, ensuring AI workloads run smoothly without the overhead of managing infrastructure, monitoring, and troubleshooting internally. Every optimization we make applies across all clients, meaning your AI runs faster and more cost-effectively over time.

3.3. Streamlined Monitoring and Debugging

Our real-time monitoring system allows us to detect, diagnose, and resolve issues instantly, eliminating delays from log retrieval or environment setup. With direct access to execution traces, inputs, and outputs, we quickly pinpoint the root cause of errors or slowdowns, ensuring minimal disruption.

This tight integration of MLOps and AI development not only accelerates debugging but also drives continuous optimization: adapting workloads, refining resource allocation, and improving model efficiency based on real-world performance. The result: faster iteration, lower overhead, and AI models that get better with every deployment.

3.4. Security and Compliance Mindset

Our multi-tenant architecture enhances security by isolating workloads while allowing us to apply continuous monitoring across multiple AI deployments. This means early detection of anomalies, shared security improvements, and efficient resource management, all without compromising data separation.

As an EU-based company, we ensure GDPR compliance and provide data processor agreements for clients handling personal data. Our platform is designed with strict access controls, ensuring only authorized users can modify or interact with deployed workloads.

While we follow many ISO27001 best practices, we prioritize practical security measures that keep AI workloads safe, scalable, and efficiently managed. We are aiming for ISO27001 certification by mid-2026.

3.5. Future-Proof Flexibility

We won’t lock your business into our platform. If managing AI infrastructure in-house becomes viable, our containerized deployment allows for a structured transition to your own cloud or on-prem setup.

However, self-hosting introduces higher overhead, requiring in-house expertise for infrastructure, monitoring, and cost management. The tight AI-DevOps integration that enables fast debugging and continuous optimization on our platform won’t carry over, leading to longer issue resolution times. Additionally, Datameister support won’t extend to externally hosted environments.

While transitioning will require some effort, we assist with the offboarding process, ensuring your workloads can be migrated with minimal disruption. For most clients, staying on the platform remains the most efficient and cost-effective choice, but when the time comes to move, we make sure you’re set up for success.

4. Who Benefits the Most?

SMEs Venturing into AI
Gain high-end MLOps capabilities without hiring or training a full DevOps team.

Startups Racing to Market
Iterate and deploy quickly, focusing resources on refining your AI rather than managing servers.

Companies Handling Complex Visual Data
If your solution depends on heavy image or video processing, our GPU-optimized platform helps you maintain both performance and cost control.

5. Conclusion

The Datameister Platform is designed to bring speed, efficiency, and simplicity to MLOps for visual data. By merging AI development expertise with a robust operational backbone, we empower you to roll out new features, debug issues swiftly, and scale to meet growing demands, all with a transparent cost structure.

Our approach helps you stay focused on innovation while we handle the mechanics of running your AI at scale.

3D Generative AI: Image-based 3D reconstruction

Jarne Van den Herrewegen — Mon, 27 Jan 2025 09:09:47 GMT

In 2024, there was a Cambrian explosion of academic work on 3D generation and several commercial tools made their appearance. Last month in particular, there was the exciting release of Trellis by Microsoft, taking a leap forward in the open source/science community. Its performance is great, but of equal importance is the fact that it was largely open sourced and people are already building with it.

The goal of this post is twofold:

This post will be a tutorial for builders and other interested minds, offering a concise overview of the historical developments that led to Trellis and a detailed look at how it works, explaining why its pipeline is designed the way it is.

We will compare Trellis to other state-of-the-art commercial tools, such as Rodin by Hyper3D, Tripo by TripoAI, SPAR3D by StabilityAI and Hunyuan3D-2 by Tencent, by both examining the theoretical differences in their pipelines and presenting visual results for a subjective evaluation of each tool.

Jarne Van den Herrewegen - AI Engineer (author)

A small word about myself first. My name is Jarne, I am a first class AI nerd and ML engineer at Datameister. I hold a soon-to-be-defended PhD in self-supervised 3D deep learning. The best thing for me is rabbit-holing in research and busting out algorithms in disruptive products. Having worked with the Datameister founders Axel and Ruben for 3 years at Oqton, I am more than happy to join them and the other talented meisters!

1. From NeRF to Trellis3D: key concepts of 3D generation

Given the success of image generation models, an extensive line of research on 3D reconstruction from images has formed in the past two years. The goal of this section is to introduce key concepts in 3D generative models using image generation without losing ourselves in paper-filling details. These foundations will clarify how the field evolved and set the stage for understanding Trellis and why its design choices push 3D generation further. In this section I will cover:

NeRFs and Gaussian Splatting: the bridges between 2D and 3D

DreamFusion: pioneering general 3D generation

Large Reconstruction Model: making DreamFusion efficient

InstantMesh: introducing 3D feedback

1.1 Image-based 3D reconstruction

We begin with NeRFs and Gaussian Splatting because they set the foundation for modern 3D generation: both methods demonstrate how limited 2D observations can be leveraged to generate new “3D” views. By understanding the strengths (and limitations) of these approaches, we can see why Trellis adopts certain strategies (like voxelization and latent-space modeling) and how it ultimately distinguishes itself from earlier techniques.

Novel view synthesis. Key to the reconstruction from 2D to 3D are Neural Radiance Fields (NeRF) and Gaussian Splatting (GS). Given a few images and the corresponding camera perspectives, these methods can synthesize unseen camera views with high quality. To predict views that are truthful to the 3D world, both methods build a form of geometric grounding. This geometric information will be the bridge between 2D images and 3D generation. NeRF and GS are briefly introduced here. For more applications, definitely check out our previous blog about image-based rendering with AI by Ruben, who used to publish papers in the early days of this field!

Source: Datameister blog

Neural Radiance Fields (Mildenhall et al., ECCV 2020) capture an environment in a neural network, based on images and the corresponding camera positions. In essence, a NeRF is a neural network f fitted to infer an RGB color c and a volume density σ (0 = empty space, higher values = more opaque) for a point x and viewing direction d: f(x, d) = (c, σ).

To render an image from a Neural Radiance Field, a ray is cast through space for each pixel in the virtual image plane. According to a sampling process, the ray is evaluated at specific points in the NeRF:

Source: NeRF paper, Mildenhall et al.
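The per-ray sampling and color aggregation can be sketched in a few lines of NumPy. This is a minimal illustration of the standard volume-rendering quadrature described in the NeRF paper, not the authors' implementation. The function name `render_ray` and the toy sample values are assumptions made for this example; in a real NeRF, `rgb` and `sigma` would come from querying the network f at each sample point.

```python
import numpy as np


def render_ray(rgb, sigma, t_vals):
    """Aggregate per-sample colors along one ray (NeRF-style quadrature).

    rgb:    (N, 3) colors predicted at the N sample points
    sigma:  (N,)   non-negative volume densities at those points
    t_vals: (N,)   increasing distances of the samples along the ray
    """
    # Distance between consecutive samples; the last interval is treated as very large.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity of each segment: 1 - exp(-sigma * delta).
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color, weights


# Toy ray: two samples in empty space, then a dense red surface hiding a blue one.
rgb = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
sigma = np.array([0.0, 0.0, 1e3, 1e3])
t_vals = np.array([1.0, 2.0, 3.0, 4.0])
color, weights = render_ray(rgb, sigma, t_vals)
print(color)
```

In the toy example the first dense sample receives practically all of the weight, so the pixel takes the color of the first surface the ray hits.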

The volume density contains information on the composition of the scene and guides the color aggregation process along the rays. This geometric understanding will be the basis for obtaining 3D assets with NeRFs. Extracting a mesh from a density field is possible with a Marching Cubes algorithm. This works by voxelizing the volume and using the NeRF density to determine whether each vertex is inside or outside of a surface. Marching Cubes then checks each voxel (cube) individually for how the surface intersects its edges, and selects a polygonal pattern (from a small lookup table) that best approximates the shape within that region, as shown below.

Marching Cubes lookup table for mesh faces. Red vertices are inside the surface. Source: Isovox
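The lookup step can be illustrated with a small sketch: the inside/outside state of the 8 cube corners is packed into an 8-bit number that indexes the table. The corner ordering, the `iso_level` threshold of 0.5 and the function name `cube_case_index` are assumptions for this example. The sketch only computes the table index, not the triangles themselves.

```python
import numpy as np


def cube_case_index(corner_densities, iso_level=0.5):
    """Return the 8-bit lookup-table index (0..255) for one voxel.

    corner_densities: 8 density values, one per cube corner, in a fixed order.
    Bit i is set when corner i lies inside the surface (density above iso_level).
    """
    inside = np.asarray(corner_densities) > iso_level
    return int(sum(1 << i for i, flag in enumerate(inside) if flag))


print(cube_case_index([0.0] * 8))          # fully outside: no faces
print(cube_case_index([1.0] * 8))          # fully inside: no faces
print(cube_case_index([0.9] + [0.1] * 7))  # one corner inside: a single triangle
```

There are 256 possible indices, which reduce by symmetry to the small set of patterns in the table. For a complete mesh extraction from a density grid, an existing implementation such as `skimage.measure.marching_cubes` is the practical choice.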

Gaussian Splatting (GS) is a more recent method for novel view synthesis and is aimed at real-time applications. GS represents 3D scenes as a cloud of colored Gaussians, without any neural networks involved. In essence, each Gaussian has a position μ, covariance Σ, maximum density σ and a color c that describe local structure and color. Similar to rendering with Neural Radiance Fields, rays are cast from a virtual image plane through the scene:

Each ray aggregates color from the Gaussian kernels in the scene. The color contribution of each kernel is weighted by its maximum density and proximity to the ray. The following formula shows the contributed density for kernel i for position x on the ray.
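Written out with the symbols introduced above (position μ_i, covariance Σ_i and maximum density σ_i of kernel i), the contribution takes the standard Gaussian form. This is a reconstruction from those definitions, so the exact notation is an assumption:

```latex
\sigma_i(x) = \sigma_i \, \exp\!\left( -\tfrac{1}{2} \, (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
```

The density equals the maximum σ_i at the kernel center x = μ_i and falls off with the distance to the center, measured relative to the kernel's covariance.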

In reality the Gaussians are projected onto the image plane and rendering is done through simple α-compositing of kernels sorted by distance. Most kernels can be efficiently pruned, reducing the inference time even more. These optimizations, combined with the absence of neural networks, make Gaussian Splatting about 100x more efficient than Neural Radiance Fields in rendering speed.
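The α-compositing step for a single pixel can be sketched as follows. This is a simplified illustration, not the tile-based GPU rasterizer used in practice; the function name `composite_front_to_back`, the early-exit threshold and the toy values are assumptions.

```python
import numpy as np


def composite_front_to_back(colors, alphas, depths):
    """Blend the projected kernels covering one pixel, nearest first.

    colors: (K, 3) RGB color of each kernel covering the pixel
    alphas: (K,)   opacity of each kernel at this pixel, in [0, 1]
    depths: (K,)   distance of each kernel from the camera
    """
    order = np.argsort(depths)        # sort kernels by distance
    pixel = np.zeros(3)
    transmittance = 1.0               # fraction of light not yet absorbed
    for k in order:
        pixel += transmittance * alphas[k] * colors[k]
        transmittance *= 1.0 - alphas[k]
        if transmittance < 1e-4:      # remaining kernels are effectively invisible
            break
    return pixel


# A half-transparent red kernel in front of an opaque blue one.
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
alphas = np.array([1.0, 0.5])
depths = np.array([2.0, 1.0])
print(composite_front_to_back(colors, alphas, depths))
```

The early exit is one simple form of the pruning mentioned above: once the pixel is saturated, kernels further back no longer need to be evaluated.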

While Gaussian Splatting represents the scene’s geometry implicitly through the positions and covariances of splats, extracting a clean, explicit mesh is not straightforward, especially for translucent or mirror-like surfaces where splats do not encode a single, solid boundary. Unlike density fields (where Marching Cubes can be directly applied), the soft nature of Gaussian splats leads to ambiguities about whether a particular region is “inside” or “outside” a surface. Some recent adaptations of GS explicitly link the Gaussians to underlying geometry, but this is still an active research area. In Trellis, Gaussian Splatting serves as a medium for image reconstruction losses.

1.2 DreamFusion: Pioneering general 3D generation

DreamFusion, published by Poole et al. at ICLR 2023, is the first academic work that successfully generated 3D assets in a generalizable way. Prior to its development, 3D generation models showed good quality, but were constrained to specific object categories like chairs, limiting their versatility. DreamFusion, on the other hand, builds upon 2D image generation methods that can generalize across millions of objects.

3D model of “a frog wearing a sweater” Source: DreamFusion

The main trick in DreamFusion’s approach lies in its use of a pre-trained 2D diffusion model, Imagen in this case, to optimize a Neural Radiance Field representation. Given a text prompt describing the object the user wants to generate, DreamFusion fits a single Neural Radiance Field initialized from scratch.

Overview for DreamFusion. Adapted from Poole et al.

Optimization at inference. The pre-trained diffusion model is used as a critic to guide the NeRF towards a plausible generation. The optimization process works by taking a camera perspective and rendering the corresponding view from the NeRF. Then, the novel view is combined with sampled noise and is passed through the diffusion model together with the text prompt. The pre-trained diffusion model predicts the noise that needs to be subtracted from its input to obtain a high quality image conditioned on the text prompt. DreamFusion looks at the difference between sampled noise and predicted noise:

If the predicted noise is close to the added noise, the noise difference is small, implying that the rendered view was of high quality.

Conversely, if the predicted noise deviates significantly from the added noise, this indicates that the diffusion model corrected substantial issues in the image, suggesting that the NeRF’s performance was poor.

This clever idea is referred to as Score Distillation Sampling (SDS) in literature and forms the loss function for fitting the Neural Radiance Field. By iteratively refining the NeRF using this feedback loop, DreamFusion generates NeRFs of any object that is well-known to the pre-trained image generation model.
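For readers who want the loss in symbols, the SDS gradient can be written in the notation of the DreamFusion paper. These symbols are taken from that paper and are not defined elsewhere in this post: x = g(θ) is the view rendered from the NeRF with parameters θ, x_t is its noised version at timestep t, y is the text prompt, ε is the sampled noise, ε̂_φ is the noise predicted by the frozen diffusion model and w(t) is a timestep-dependent weight.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ \, w(t) \, \bigl( \hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon \bigr) \, \frac{\partial x}{\partial \theta} \right]
```

The noise difference acts as a per-pixel correction that is pushed back through the renderer into the NeRF parameters; the diffusion model itself is kept frozen and is not differentiated.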

DreamFusion remains primarily a strong piece of research, however. Its practical applicability is limited by its moderate output quality and long inference time (1.5h). Considering that the authors only used 64x64 images, the overall results are still impressive. Follow-up work, such as the Large Reconstruction Model, focused on improving the inference time and quality.

1.3 Large Reconstruction Model

To address slow inference in DreamFusion, the Large Reconstruction Model (LRM) was proposed by Hong et al. (ICLR 2024). LRM takes a different approach to representing Neural Radiance Fields, specifically leveraging the triplane NeRF to improve inference time.

A triplane NeRF is an adapted version of the original Neural Radiance Field, introduced by Chan et al. 2022. While classic NeRFs are entirely implicit functions, just one neural network for predicting everything, triplane NeRFs use intermediate feature maps. These are predicted with a pre-trained backbone that infers features from input images. The pre-trained backbone adds more prior information to the procedure, improving the final RGBσ regression and reducing the inference time. When a triplane NeRF is queried with an XYZ coordinate, a feature vector is aggregated by averaging the values from the closest pixels in the three orthogonal feature maps, as seen below. This aggregated feature vector is then used to regress the eventual RGBσ output with only a few fully-connected layers.

Triplane NeRF representation. Source: Chan et al. (CVPR 2022)
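The query step can be sketched as follows, following the description above (nearest pixel per plane, then averaging). This is a simplified illustration: the function name `query_triplane`, the plane naming and the coordinate convention are assumptions, and published implementations typically interpolate between pixels rather than taking the nearest one.

```python
import numpy as np


def query_triplane(planes, xyz):
    """Look up a feature vector for one 3D point.

    planes: dict with keys "xy", "xz", "yz", each an (N, N, C) feature map
    xyz:    point with coordinates in [0, 1)
    """
    n = planes["xy"].shape[0]
    # Nearest-pixel index along each axis.
    i, j, k = np.minimum((np.asarray(xyz) * n).astype(int), n - 1)
    feats = [planes["xy"][i, j], planes["xz"][i, k], planes["yz"][j, k]]
    # Average the three plane features into one (C,) vector for the small MLP.
    return np.mean(feats, axis=0)


# Toy planes with N = 4 and C = 2 feature channels.
planes = {
    "xy": np.full((4, 4, 2), 1.0),
    "xz": np.full((4, 4, 2), 2.0),
    "yz": np.full((4, 4, 2), 3.0),
}
planes["xy"][1, 2] = 10.0
print(query_triplane(planes, (0.3, 0.6, 0.9)))
```

The aggregated vector would then be passed to the few fully-connected layers that regress the RGBσ output.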

Avoiding cubic scaling. Ideally, there would be an NxNxN grid with feature vectors instead of 3 feature maps with NxN grids to have even better locality in the NeRF, but the cubic scaling would be too inefficient to scale to high resolution. The triplane representation proves to be a good tradeoff between resolution and scalability. We will later see how Trellis puts a spin on this and uses voxelized 3D surfaces instead of straight planes to improve the locality but remain efficient.

The Large Reconstruction Model is an image-to-3D model that manages to reduce the inference time to less than 10s thanks to the triplane representation. The authors shifted the computational efforts from test time to train time. A training procedure is introduced where a pre-trained DINO encoder is used together with a custom decoder for inferring triplane maps. The model is trained on multi-view renders from the Objaverse dataset, containing 800k textured 3D objects. After training, text-to-image models can be chained before the LRM to create a text-to-3D model.

Overview of the Large Reconstruction Model work. Figure adapted from InstantMesh

The improvement of LRM over DreamFusion comes from its practical applicability. Where DreamFusion showed that a 3D generative model can generalize over millions of categories, LRM showed that 3D generative models can also be practical. For both works, however, the output quality is still below the level where a creative would start working on the object. DreamFusion’s overall resolution was too low, as it only used 64x64 images. LRM was trained with 512x512 images, resulting in improved resolution, but it suffered from inconsistent quality between different views. Objects only look good from the perspective of the input image, as shown below.

1.4 InstantMesh

The Large Reconstruction Model (LRM) set the stage for many follow-up works that extended its pipeline to enhance geometric consistency and overall quality. InstantMesh, Xu et al. 2024, is a notable follow-up that introduced multi-view input and direct feedback from 3D meshes to improve the consistency from different perspectives. Furthermore, InstantMesh brings the possibility for PBR materials such as normal maps, important for creative workflows.

InstantMesh pipeline. Source: InstantMesh

Multi-view input. InstantMesh uses 6 input images from different perspectives to have richer geometric information. For a user, it is non-trivial to provide 6 images without having the 3D object already. Therefore, Xu et al. use a multi-view diffusion model, Zero123++, to generate 6 views from one input image. Next, the same encoder as in LRM is applied six times separately and the LRM decoder is retrained to work with 6x image tokens.

Geometric feedback. Another significant improvement involves incorporating explicit 3D data into the training process. Until this point, both DreamFusion and LRM had relied exclusively on image-based losses. From an inferred triplane NeRF, the authors extract a 3D mesh in a differentiable manner, which allows mesh reconstruction losses to be propagated back into the LRM. InstantMesh relies on a parameterized Marching Cubes algorithm called FlexiCubes for this.

Quality. InstantMesh brought 3D generation to a point where it is practically useful. It can generate a wide variety of objects, within a minute, with a level of quality that is sufficient to start working on in some cases. In particular for background props that are not animated, InstantMesh can be useful for inspiration or as a starting point for a sculpt. For important characters that require detail and animation, the geometry, textures and mesh topology are not good enough yet.

Now the question arises: can 3D generation be pushed further to achieve more faithful geometry, higher-quality textures, and greater editability? Considering that the number of open source 3D objects is limited, this is a tough challenge. This is precisely where Trellis makes its mark: starting from this pipeline, it introduces innovations like voxelized surfaces and latent-space Flow Matching to take 3D generation to the next level.

2. Trellis

Last month Microsoft released Trellis, Xiang et al. 2024, which inherits many ideas from InstantMesh and brings each step in the pipeline to another level.

Two improvements stand out: (1) triplanes are replaced with voxelized 3D surfaces and (2) the generative process is moved from image space to the latent 3D space. The result is high quality geometry, improved textures and a framework that allows more control and editability in the 3D generation process.

Converting textured meshes to surface voxels with DINO feature vectors.

Evolving triplanes to surface voxels. Instead of using triplane feature maps, Trellis introduces sparse voxels with features. In a 64^3 grid, voxels are activated at the surface of the 3D object, as shown in the figure above. 150 renders are made for each object and are encoded into features with a DINOv2 backbone. Each voxel receives features that are projected from the visible DINOv2 output. Compared to a triplane feature representation, the sparse voxel features are closer to the location where the information is needed. At the same time, the dimensionality does not explode. In the Trellis training set, each object reportedly has 20K active voxels on average out of the 262K cells in the grid (7.63% grid occupation), where a triplane representation would have 12K feature pixels (4.69% grid occupation).
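The occupation percentages can be checked with a few lines of arithmetic. The only assumption here is that the triplane comparison uses three 64x64 planes, which is what the 12K figure suggests.

```python
grid = 64
dense_cells = grid ** 3          # 262,144 cells in a dense 64^3 grid
triplane_cells = 3 * grid ** 2   # 12,288 feature pixels (assumes three 64x64 planes)
active_voxels = 20_000           # average reported for the Trellis training set

voxel_share = 100 * active_voxels / dense_cells
triplane_share = 100 * triplane_cells / dense_cells
print(f"sparse voxels: {voxel_share:.2f}% of the dense grid")
print(f"triplane:      {triplane_share:.2f}% of the dense grid")
```

Both representations stay well below the cost of a dense grid, while the sparse voxels store their features directly at the surface.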

100 images, depth maps and normal maps are rendered. Surface voxels are generated for all objects. Rendered images are captioned and summarized per object with GPT-4o.

Variational AutoEncoder (VAE): a VAE is trained to create a downsampled latent space in which the generative process will take place, as in Latent Diffusion. This VAE is only used for the model that generates features in the voxels, not for the model that generates the voxel structure.

Flow model training for structure generation: a Flow Matching model is trained to generate voxel structures from a random binary 64x64x64 grid. The model can be conditioned on text or image input. A more detailed description is given below.

Flow model training for latent feature generation: a second Flow Matching model is trained to generate latent features in the surface voxels. This model generates latent features that match the VAE encoder output, conditioned on the voxel structure and text or image input.

Training a Variational AutoEncoder (VAE) in Trellis. Source: Trellis authors

Variational AutoEncoder (VAE) for latent generation. An important trick in generative modeling is to perform the generation process in a downscaled latent space. This speeds up the generation process and improves generalization capabilities significantly as the input dimensionality is lower. Trellis uses this trick as well and trains a VAE encoder to reduce the dimension of the features in each voxel from 1024 to 8. The VAE decoder learns to infer the parameters for the 3 output modalities from these small feature vectors. In practice, the authors first trained an encoder-decoder only with Gaussian Splatting. Afterwards, the final layer in the decoder was replaced with new layers that were trained separately to predict NeRF and FlexiCubes parameters.

VAE architecture. The encoder and decoder architectures are a 3D variant of Shifted Window Transformers.
Attention is only applied to tokens in a local 8x8x8 window to limit the attention matrix. Each token corresponds to one active voxel; empty voxels do not get a token, reducing the attention matrix by another order of magnitude. Tokens are created by projecting the features of an active voxel with a linear layer and adding a sinusoidal positional encoding based on its location in the grid.

Flow Matching for generative modeling. The Trellis authors use Flow Matching in latent space to generate the voxel structure and the voxel features. Flow Matching is similar to Diffusion and has become popular thanks to its faster sampling process. Where Diffusion relies on a Markov Process (MP) to (de)noise samples, Flow Matching defines a more general theory for transforming one distribution to another using vector fields. The theory is more abstract, but in practice the MP is just replaced with linear interpolation. Given a ground truth sample x_gt and noise ε at timestep t, a noisy sample is interpolated as x_noised = (1 - t) x_gt + t ε. Next, the Flow Matching (FM) objective is defined for the denoising model v_θ(x, t) with parameters θ, which learns to predict the direction from the clean sample to the added noise in x_noised: L_FM(θ) = E_{t, x_gt, ε} ‖v_θ(x_noised, t) - (ε - x_gt)‖².

Generation process for sparse voxels representing 3D surfaces. Inputs: random binary grid and text/image conditioning. Adapted from Xiang et al.

Structure generation. The first Flow Transformer learns to generate sparse structures, where empty/active voxels are indicated 0/1. Since this Transformer model is not a windowed Transformer and because it works on a dense 64x64x64 grid, there would be 262k tokens to attend to at once. As it is computationally expensive to generate directly in the 64x64x64 space, the authors train additional downsampling and upsampling layers that downsample the binary grid into a 16x16x16 grid with latent vectors of 8 dimensions.
Given text or image conditioning (cross-attention with CLIP or DINOv2 tokens) and downsampled random grids, the model is trained to generate latent vectors that are upsampled back to a binary grid.

Structured latent feature generation for each active voxel, initialized with random 8-dim vectors. Adapted from Xiang et al.

Latent feature generation. For the last step, a Flow Transformer is trained to generate features for the surface voxels in the VAE latent space. Similar to the structure generation, the authors introduce additional downsampling and upsampling layers to improve the efficiency. After the latent feature generation, the VAE decoders can infer Gaussian Splats, a NeRF or a 3D mesh.

Image-to-3D comparison between InstantMesh and Trellis. Source: Trellis

Output quality and usability. Overall, Trellis takes a big step forward in texture quality and geometric detail. For simple background props, it is almost possible to generate game-ready geometry. Textures often need some reworking to become rendering-ready. The paper reports a user study where Trellis is compared to InstantMesh and 5 other scientific works (not including commercial tools), finding that Trellis is preferred by users in 67.1% of the text-to-3D generations and in 94.5% of the image-to-3D generations.

Practical advantages in Trellis. With these core ideas in place, Trellis offers several practical advantages that set it apart from prior work. Creatives typically work in iterations to improve their work. While NeRFs and meshes from the previous 3D generative models are typically non-trivial to edit, the sparse voxels in Trellis are much easier to touch up. In addition, the latent generative process can be masked to only change a specific region. Trellis was released under an MIT license, which means you can build with it as you want, even commercially. However, there are a couple of Nvidia dependencies that are not available for commercial use.
For the open source community, Trellis is an amazing starting point for which we will hopefully see similar tooling as for image generation models, e.g. infill models, upscalers, ... 3. Trellis and friends While we’ve already discussed Trellis, other commercial tools released in the past 12 months also deliver outstanding results. Notably, Tripo by Tripo AI, Rodin by Hyper3D, Spar3D by StabilityAI and the just released Hunyuan3D-2 by Tencent have pushed the boundaries of what’s possible, all focusing on image-to-3D. To showcase the differences and similarities among these tools, we provide a visual comparison of their outputs below. Comparison for image-to-3D models, showing textures and geometry. Swipe for more examples. Source: author. Visual comparison discussion. These conclusions are based on more than 2 examples, but for brevity I have only included 2. As seen in the examples above, Trellis, Tripo and Hunyuan3D are closest to the intended geometry in the image, altough Trellis and Hunyuan3D fail at handling the ladder well in the adventurer case. Rodin also shows high quality geometry, but seems to deviate from the input image. For texture quality, Tripo and Rodin seem to show the highest quality, where Tripo is definitely closest to the original image. Trellis and Hunyuan3D also remain close to the original image, but have a dark tone in all of their generations SPAR3D is clearly subpar in quality, but is much faster. In the table below, there is a detailed comparison considering quality and practical aspects. Detailed comparison table considering quality, inference time, cost per use and availability. Behind the scenes. The commercial tools have their grounding in research and all companies published a technical paper at some point. 
Given the background in Section 1 and Section 2, it is not hard to have a high-level understanding of the other models: Tripo, report, was originally based on LRM, with improved data curation and small changes to reduce GPU memory usage. Their report is the oldest among all compared models and it is unclear how close their current model is to the one in the report. Given the high quality, it is to be expected that their dataset and/or architecture must have improved significantly. SPAR3D, report, was only released this month, and shows a generative process working on sparse point clouds (instead of surface voxels in Trellis) that allow very flexible editing. The point cloud is decoded into a triplane NeRF with CP-decomposition as in Trellis, from which geometry is extracted with differentiable Marching Cubes, as well as PBR materials with an illumination model RENI++. Its inference speed and control through point cloud editing are remarkable, let’s hope that the output quality is improved in the coming year. For the moment, SPAR3D remains an strong piece of research, but not a practical tool. Rodin, report,differs the most from the previously covered works. Just as Trellis, it breaks down the generation process into one generative model for geometry and one for materials. The geometric generative model works on downsampled point clouds (based on 3DShape2VecSet), something in between SPAR3D’s point cloud diffusion and Trellis’ downsampled voxel generation, followed by a decoder that predicts a volumetric function for applying the Marching Cubes algorithm to. For generating textures, Rodin has a very interesting approach which combines existing image generation models and UV mapping. Hunyuan3D-2, report, is similar to Rodin’s pipeline. Their VAE is also inspired by the 3DShape2VecSet architecture and there are separate models for geometry generation and texture generation. The most notable differences introduced by Hunyuan3D-2 are in the texture generation, e.g. 
there is a delighting model for obtaining albedo colors in input images, and the use of Flow Matching Transformers as used in Flux. Conclusion In conclusion, there have been major leaps forward in the past 2 years for 3D generation. The resulting models are never just one neural network but a pipeline of existing image models glued together with small custom networks and other algorithms. The results are great for generating simple assets as inspiration and starting point for 3D modeling. For important characters and objects requiring animations, there are still significant barriers before they become truly production-ready. Most notably, topologies are often not suited for rigging or animation, and the outputs can require extensive cleanup or manual retopology. Even high-quality generation models miss crucial surface attributes like UV layouts or PBR-friendly materials, making advanced texturing or lighting workflows a challenge. For many industries-ranging from game development to robotics-consistency and precise scale are essential, yet these generative models lack the tools to control the generative process with low tolerances. It’s these types of pain points that Datameister tackles head-on, merging research with engineering rigor to produce tools that are not just concept demonstrations but fully integrated systems for commercial and creative use. In case you are looking for a partner to build 3D generative applications, do not hesitate to contact us.">

[Technical] Decoding features to 3D Gaussian Splats, NeRFs and meshes. From the surface voxels with features, Trellis offers three decoders: one for Gaussian Splats, one for NeRFs and one for triangle meshes. Each of these modalities is non-trivial to predict with deep learning models and requires specific differentiable loss functions. For Gaussian Splats, 32 Gaussian kernels are created for each voxel by predicting the location, scale, orientation, density and color as described in Section 2.1. Views are sampled from the splats and compared to renders from the actual scene with image reconstruction losses (D-SSIM and LPIPS). The Neural Radiance Field decoder predicts 4 feature vectors per voxel that represent a CP-decomposition as in Tensorial Radiance Fields, a supercharged version of triplane NeRFs. The NeRF decoder is trained with the same reconstruction losses as Gaussian Splats. Meshes are predicted by upsampling the voxel grid from 64^3 to 256^3 and regressing SDF values on the 8 voxel vertices together with all required parameters for FlexiCubes. The reconstruction loss for the mesh decoder is defined on the depth and normal maps rendered from the original 3D object and the reconstructed mesh.

Generative model overview. On the algorithmic side, the biggest difference between Trellis and DreamFusion, LRM and InstantMesh is that Trellis does not run the generative process in image space, but on the sparse feature voxels, which stays closer to 3D space. There are two generative models: one for generating surface voxels in 3D and one for generating features in the voxels. Trellis uses Flow Matching in latent space as its generative process. Additionally, the authors curated a higher-quality dataset. It is not clear from their report, however, which improvement was more impactful: the dataset or the algorithm. An overview of all components:

Dataset: a high-quality 3D dataset with 500k samples is curated from 10 million objects in Objaverse-XL. For each object, more than 100 images, depth maps and normal maps are rendered. Surface voxels are generated for all objects. Rendered images are captioned and summarized per object with GPT-4o.

Variational AutoEncoder (VAE): a VAE is trained to create a downsampled latent space in which the generative process will take place, as in Latent Diffusion. This VAE is only used for the model that generates features in the voxels, not for the model that generates the voxel structure.

Flow model training for structure generation: a Flow Matching model is trained to generate voxel structures from a random binary 64x64x64 grid. The model can be conditioned on text or image input. A more detailed description is given below.

Flow model training for latent feature generation: a second Flow Matching model is trained to generate latent features in the surface voxels. This model generates latent features that match the VAE encoder output, conditioned on the voxel structure and text or image input.

Training a Variational AutoEncoder (VAE) in Trellis. Source: Trellis authors

Variational AutoEncoder (VAE) for latent generation. An important trick in generative modeling is to perform the generation process in a downscaled latent space. This speeds up the generation process and improves generalization capabilities significantly, as the input dimensionality is lower. Trellis uses this trick as well and trains a VAE encoder to reduce the dimension of the features in each voxel from 1024 to 8. The VAE decoder learns to infer the parameters for the 3 output modalities from these small feature vectors. In practice, the authors first trained an encoder-decoder only with Gaussian Splatting. Afterwards, the final layer in the decoder was replaced with new layers that were trained separately to predict NeRF and FlexiCubes parameters.

VAE architecture. The encoder and decoder architectures are a 3D variant of Shifted Window Transformers. Attention is only applied to tokens in a local 8x8x8 window to limit the size of the attention matrix. Each token corresponds to one active voxel. Empty voxels do not get a token, which reduces the attention matrix by another order of magnitude. Tokens are created by projecting the features of an active voxel with a linear layer and adding a sinusoidal positional encoding based on its location in the grid.
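To get a feel for how much the two tricks save, here is a minimal NumPy sketch that counts query-key pairs for a toy surface. The plane of active voxels, the function names and the omission of the window shift are all assumptions made for illustration. This is not the Trellis implementation.

```python
import numpy as np

def window_ids(coords, window=8, grid=64):
    """Flat index of the window that each integer voxel coordinate falls in."""
    per_axis = grid // window            # 8 windows along each axis
    w = coords // window                 # (N, 3) window coordinate per voxel
    return (w[:, 0] * per_axis + w[:, 1]) * per_axis + w[:, 2]

def attention_pairs(coords, window=8, grid=64):
    """Number of query-key pairs when attention stays inside each window."""
    _, counts = np.unique(window_ids(coords, window, grid), return_counts=True)
    return int((counts.astype(np.int64) ** 2).sum())

# Hypothetical sparse surface: a flat plane of active voxels at z = 32.
xs, ys = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
coords = np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, 32)], axis=1)

n_tokens = len(coords)                     # one token per active voxel
dense_global = (64 ** 3) ** 2              # every voxel attends to every voxel
sparse_global = n_tokens ** 2              # active voxels only, global attention
sparse_windowed = attention_pairs(coords)  # active voxels only, 8x8x8 windows
print(n_tokens, dense_global, sparse_global, sparse_windowed)
```

In this toy case, dropping empty voxels and restricting attention to windows each shrink the number of attention pairs by a large factor. Real surfaces are curved and fill the windows unevenly, so actual savings will differ.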

Flow Matching for generative modeling. The Trellis authors use Flow Matching in latent space to generate the voxel structure and the voxel features. Flow Matching is similar to Diffusion and has become popular thanks to its faster sampling process. Where Diffusion relies on a Markov Process (MP) to (de)noise samples, Flow Matching defines a more general theory for transforming one distribution into another using vector fields. The theory is more abstract, but in practice the MP is just replaced with linear interpolation. Given a ground truth sample x_gt and noise ε, the noisy sample at timestep t is interpolated as x_noised = (1 - t) x_gt + t ε. The Flow Matching (FM) objective then trains the model v_θ(x, t) with parameters θ to predict the vector field that produced x_noised, which for linear interpolation is the difference ε - x_gt between the added noise and the clean sample.
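As a rough illustration of the linear interpolation and the regression objective, the following NumPy sketch builds a noisy sample and evaluates a mean squared error against the target ε - x_gt. The convention that t = 0 is clean data and t = 1 is pure noise, and all function names, are assumptions for this sketch. No neural network is involved, and a "perfect" prediction stands in for the model output.

```python
import numpy as np

def interpolate(x_gt, eps, t):
    """Linear path between the clean sample (t = 0) and pure noise (t = 1)."""
    return (1.0 - t) * x_gt + t * eps

def fm_target(x_gt, eps):
    """Velocity of the linear path; constant in t."""
    return eps - x_gt

def fm_loss(v_pred, x_gt, eps):
    """Mean squared error between a predicted velocity and the target."""
    return float(np.mean((v_pred - fm_target(x_gt, eps)) ** 2))

rng = np.random.default_rng(0)
x_gt = rng.normal(size=(4, 8))    # e.g. four 8-dim latent vectors
eps = rng.normal(size=(4, 8))     # Gaussian noise of the same shape
t = rng.uniform(size=(4, 1))      # one timestep per sample

x_noised = interpolate(x_gt, eps, t)          # what the model would see
perfect_prediction = eps - x_gt               # stand-in for v_theta(x_noised, t)
print(fm_loss(perfect_prediction, x_gt, eps)) # zero for a perfect model
```

During training, the stand-in prediction would be replaced by the output of the network evaluated at x_noised and t.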

Generation process for sparse voxels representing 3D surfaces. Inputs: random binary grid and text/image conditioning. Adapted from Xiang et al.

Structure generation. The first Flow Transformer learns to generate sparse structures, where empty and active voxels are encoded as 0 and 1. Since this Transformer model is not a windowed Transformer and because it works on a dense 64x64x64 grid, there would be 262k tokens to attend to at once. As it is computationally expensive to generate directly in the 64x64x64 space, the authors train additional downsampling and upsampling layers that downsample the binary grid into a 16x16x16 grid with latent vectors of 8 dimensions. Given text or image conditioning (cross-attention with CLIP or DINOv2 tokens) and downsampled random grids, the model is trained to generate latent vectors that are upsampled back to a binary grid.
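The token counts quoted above can be checked with a few lines of arithmetic. The variable names are made up for this sketch. The numbers only follow from the grid sizes and the latent dimension stated in the text.

```python
# Token counts for global attention on the dense grid vs. the latent grid.
dense_tokens = 64 ** 3             # one token per voxel of the binary grid
latent_tokens = 16 ** 3            # one token per cell of the downsampled grid

token_ratio = dense_tokens // latent_tokens   # how many times fewer tokens
attention_ratio = token_ratio ** 2            # attention cost grows quadratically

latent_values = latent_tokens * 8  # 8-dim latent vector per downsampled cell
print(dense_tokens, latent_tokens, token_ratio, attention_ratio, latent_values)
```

The downsampled grid has 64 times fewer tokens, so a full attention matrix over it has 4096 times fewer entries than one over the dense grid.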

Structured latent feature generation for each active voxel, initialized with random 8-dim vectors. Adapted from Xiang et al.

Latent feature generation. For the last step, a Flow Transformer is trained to generate features for the surface voxels in the VAE latent space. Similar to the structure generation, the authors introduce additional downsampling and upsampling layers to improve the efficiency. After the latent feature generation, the VAE decoders can infer Gaussian Splats, a NeRF or a 3D mesh.

Image-to-3D comparison between InstantMesh and Trellis. Source: Trellis

Output quality and usability. Overall, Trellis takes a big step forward in texture quality and geometric detail. For simple background props, it is almost possible to generate game-ready geometry. Textures often still need some rework before they are render-ready. The paper reports a user study where Trellis is compared to InstantMesh and 5 other scientific works (not including commercial tools), finding that Trellis is preferred by users in 67.1% of the text-to-3D generations and in 94.5% of the image-to-3D generations.

Practical advantages in Trellis. With these core ideas in place, Trellis offers several practical advantages that set it apart from prior work.

Creatives typically work in iterations to improve their work. While NeRFs and meshes from the previous 3D generative models are typically non-trivial to edit, the sparse voxels in Trellis are much easier to touch up. In addition, the latent generative process can be masked to only change a specific region.

Trellis was released under an MIT license, which means you can build with it as you like, even commercially. However, there are a couple of NVIDIA dependencies that are not available for commercial use. For the open source community, Trellis is an amazing starting point for which we will hopefully see similar tooling as for image generation models, e.g. infill models, upscalers, ...

3. Trellis and friends

While we’ve already discussed Trellis, other commercial tools released in the past 12 months also deliver outstanding results. Notably, Tripo by Tripo AI, Rodin by Hyper3D, SPAR3D by Stability AI and the just released Hunyuan3D-2 by Tencent have pushed the boundaries of what’s possible, all focusing on image-to-3D. To showcase the differences and similarities among these tools, we provide a visual comparison of their outputs below.

Comparison for image-to-3D models, showing textures and geometry. Swipe for more examples. Source: author.

Visual comparison discussion. These conclusions are based on more than 2 examples, but for brevity I have only included 2. As seen in the examples above, Trellis, Tripo and Hunyuan3D are closest to the intended geometry in the image, although Trellis and Hunyuan3D fail to handle the ladder well in the adventurer case. Rodin also shows high-quality geometry, but seems to deviate from the input image. For texture quality, Tripo and Rodin seem to show the highest quality, where Tripo is definitely closest to the original image. Trellis and Hunyuan3D also remain close to the original image, but have a dark tone in all of their generations. SPAR3D is clearly subpar in quality, but is much faster. In the table below, there is a detailed comparison considering quality and practical aspects.

Detailed comparison table considering quality, inference time, cost per use and availability.

Behind the scenes. The commercial tools have their grounding in research and all companies published a technical paper at some point. Given the background in Section 1 and Section 2, it is not hard to have a high-level understanding of the other models:

Tripo, report, was originally based on LRM, with improved data curation and small changes to reduce GPU memory usage. Their report is the oldest among all compared models and it is unclear how close their current model is to the one in the report. Given the high quality, it is to be expected that their dataset and/or architecture must have improved significantly.

SPAR3D, report, was only released this month, and shows a generative process working on sparse point clouds (instead of surface voxels in Trellis) that allows very flexible editing. The point cloud is decoded into a triplane NeRF with CP-decomposition as in Trellis, from which geometry is extracted with differentiable Marching Cubes, as well as PBR materials with an illumination model RENI++. Its inference speed and control through point cloud editing are remarkable. Let’s hope that the output quality improves in the coming year. For the moment, SPAR3D remains a strong piece of research, but not a practical tool.

Rodin, report, differs the most from the previously covered works. Just like Trellis, it breaks down the generation process into one generative model for geometry and one for materials. The geometric generative model works on downsampled point clouds (based on 3DShape2VecSet), something in between SPAR3D’s point cloud diffusion and Trellis’ downsampled voxel generation, followed by a decoder that predicts a volumetric function to which the Marching Cubes algorithm is applied. For generating textures, Rodin has a very interesting approach which combines existing image generation models and UV mapping.

Hunyuan3D-2, report, is similar to Rodin’s pipeline. Their VAE is also inspired by the 3DShape2VecSet architecture and there are separate models for geometry generation and texture generation. The most notable differences introduced by Hunyuan3D-2 are in the texture generation, e.g. there is a delighting model for obtaining albedo colors in input images, and the use of Flow Matching Transformers as used in Flux.

Conclusion

In conclusion, there have been major leaps forward in the past 2 years for 3D generation. The resulting models are never just one neural network but a pipeline of existing image models glued together with small custom networks and other algorithms. The results are great for generating simple assets as inspiration and as a starting point for 3D modeling. For important characters and objects requiring animations, there are still significant barriers before they become truly production-ready. Most notably, topologies are often not suited for rigging or animation, and the outputs can require extensive cleanup or manual retopology. Even high-quality generation models miss crucial surface attributes like UV layouts or PBR-friendly materials, making advanced texturing or lighting workflows a challenge. For many industries, ranging from game development to robotics, consistency and precise scale are essential, yet these generative models lack the tools to control the generative process with low tolerances.

It’s these types of pain points that Datameister tackles head-on, merging research with engineering rigor to produce tools that are not just concept demonstrations but fully integrated systems for commercial and creative use. In case you are looking for a partner to build 3D generative applications, do not hesitate to contact us.

For 2025, Datameister is looking to hire several ML engineers and interns working on computer vision & graphics in the creative world, robotics simulation and sport analytics. If you are considering applying for an internship or a full-time position, send us an email at hello@datameister.ai!

A big thank you to Ruben Verhack and Liam Wezenbeek for proofreading and providing feedback!

Datameister @ECCV 2024: Building a foundation

Larsen D'hiet — Wed, 20 Nov 2024 10:48:03 GMT

Avanti! 🚀

From September 29 until October 4, MiCo Milano was the setting for the 18th European Conference on Computer Vision (ECCV), the biennial conference on the best the field of Computer Vision has to offer. As true AI builders, we at Datameister are always eager to stay updated on the latest trends and advancements in the field. Naturally, our presence in Milan was a no-brainer.

In this two-fold blogpost, we aim to give a short overview of general research and upcoming topics we found interesting, useful for us, or simply really exciting. The access to advanced foundation models - think of DINO, Segment Anything (SAM), or Stable Diffusion - as well as the birth of novel techniques such as Gaussian Splatting has given researchers the opportunity to explore a wide range of new ideas, of which we would like to share the most interesting ones with you.

MiCo Milano - Europe's largest convention centre - was the venue for ECCV 2024

Before I do that, allow me to introduce myself briefly. I am Larsen, an electrical engineer who graduated from Ghent University, with a knack for AI, signal processing, sports, and any intersection between those. Ruben and Axel guided me carefully through my master's thesis back in 2023, which made me decide to join Datameister in June this year. Between that and my graduation in June last year, I spent 7 months in Paris working at the start-up Emobot at Station F, one of the largest start-up incubators in Europe (some of you will probably know Station F as the place where Hugging Face first started in 2017).

Larsen D’hiet - AI Engineer (author)

Now that we are acquainted, let’s dive into the exciting stuff! We saw three major topics coming back during workshops and poster sessions: 3D Gaussian Splatting, Diffusion for image (and short video) generation, and 3D object generation and representation. In this blogpost, the first two topics will be covered. I chose to go deeper into the details of some of my personal favorites rather than turn this into a mere listing of papers. This one is for the tech-enthusiasts who are not afraid of some digging - let’s go!

3D Gaussian Splatting

We kick things off with one of the favorite topics of our co-founder Ruben - if these things spark your interest, go check out his earlier blogpost on image-based rendering. The introduction of 3D Gaussian Splatting (3DGS) in 2023 marked an interesting exception to the reign of neural networks in recent years. The method allows for high-quality, real-time novel-view synthesis by representing a scene with 3D Gaussians.

Gaussian Splatting is a fairly novel rasterization technique that aims to draw 3D Gaussians representing a physical scene on the screen (source: AI-Driven Breakthroughs in Image-Based Rendering)

The properties of the 3D Gaussians are optimized in an end-to-end manner using gradient descent on a pixel-per-pixel loss. The resulting assembly of Gaussians accurately captures the scene and produces photo-realistic 3D views. The main drawbacks of the initially proposed algorithm were high memory usage and artifacts due to occlusions, insufficient training views, or lighting conditions that cause view-dependent appearances of an object.

Foundation models to the rescue

Many researchers at ECCV aimed to tackle one or more of the shortcomings listed above. Foundation models such as DINOv2 and SAM more often than not played a significant role in that. Let’s briefly go over some interesting work on 3DGS we’ve seen.

WildGaussians handles occlusions and appearance changes while maintaining the real-time rendering speed of 3DGS. Most notably, it leverages the difference between pre-trained DINOv2 features of the ground-truth and rendered image to model regions of uncertainty. It does so by using a cosine similarity measure to train an uncertainty predictor. This predictor is leveraged to mask uncertain pixels in the pixel-per-pixel loss for rendering training. The resulting uncertainty modeling reduces the influence of occluders such as transient objects or pedestrians in training images.
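The masking idea can be sketched in a few lines of NumPy. This is a simplification under stated assumptions: random-free toy arrays stand in for DINOv2 features, the feature maps have the same resolution as the image, and a hypothetical fixed threshold on the cosine similarity replaces the learned uncertainty predictor described above.

```python
import numpy as np

def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-location cosine similarity between two (H, W, C) feature maps."""
    dot = (feat_a * feat_b).sum(axis=-1)
    norms = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1)
    return dot / (norms + eps)

def masked_l1(render, target, mask):
    """Mean absolute error over the pixels where mask == 1."""
    per_pixel = np.abs(render - target).mean(axis=-1)
    return float((per_pixel * mask).sum() / max(mask.sum(), 1.0))

# Toy 2x2 "images" with 4-dim features; one location is hit by an occluder.
feat_render = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 2, 1))
feat_gt = feat_render.copy()
feat_gt[0, 1] = [0.0, 1.0, 0.0, 0.0]      # occluder only in the training photo

similarity = cosine_similarity_map(feat_gt, feat_render)
mask = (similarity > 0.5).astype(float)   # hypothetical threshold, 1 = trusted

target = np.zeros((2, 2, 3))
target[0, 1] = 1.0                        # the occluder shows up as a bright pixel
render = np.zeros((2, 2, 3))              # the render does not contain the occluder

print(masked_l1(render, target, np.ones((2, 2))))  # loss without masking
print(masked_l1(render, target, mask))             # occluder pixel is ignored
```

Without the mask, the occluder pixel pulls the loss up and would push the Gaussians to reproduce a transient object. With the mask, that pixel no longer contributes.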

Uncertainty modeling using DINOv2 features to predict and mask uncertain regions in training images (source: WildGaussians paper)

Gaussian Grouping augments each Gaussian with an identity encoding, allowing the Gaussians to be grouped by their semantic meaning in the scene. The encodings are supervised using automatically generated SAM masks for each view in the training collection. Masks from different views are associated using video-like object tracking (a call for the use of SAM-2 here?). This “identity-aware Gaussian Splatting” elegantly unlocks possibilities for 3D object removal, inpainting, style transfer, and more. Go take a look at the authors’ project page, there is some cool stuff on there!

Gaussian grouping leverages identity encodings in the process to group Gaussians representing the same object together, in order to be able to perform tasks such as 3D Object Removal and Object Inpainting (source: Gaussian Grouping paper)

SAGS, or Structure-Aware 3D Gaussian Splatting, leverages Graph Neural Networks (GNNs) operating on a k-Nearest Neighbor graph that links points within a local region. These points are expected to share common structural features. Allowing these points to interact with each other through the use of GNNs enhances scene understanding and reduces artifacts.
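For readers who have not worked with graph-based methods, the sketch below builds a brute-force k-Nearest Neighbor graph over a handful of 3D points and runs one round of neighbor averaging. The averaging rule, the function names and the toy data are assumptions chosen to show the mechanism. SAGS uses learned GNN layers, not this fixed update.

```python
import numpy as np

def knn_graph(points, k):
    """Indices of the k nearest neighbors of every point, shape (N, k)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]

def aggregate(features, neighbors):
    """Blend each point's feature with the mean feature of its neighbors."""
    return 0.5 * features + 0.5 * features[neighbors].mean(axis=1)

# Four points on a line, each carrying a 1-dim feature.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [7.0, 0.0, 0.0]])
features = np.array([[0.0], [2.0], [4.0], [8.0]])

neighbors = knn_graph(points, k=1)
updated = aggregate(features, neighbors)
print(neighbors.ravel(), updated.ravel())
```

The brute-force distance matrix is quadratic in the number of points, so a spatial index would be needed for scenes with millions of Gaussians.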

And the award for coolest paper name goes to... Gaussian Frosting!

Clearly, Gaussian Splatting is very well suited for fancy paper names. One more such example - and our personal favorite on this topic - is Gaussian Frosting, a particularly fresh and creative idea fusing 3D mesh representations with Gaussian scene modeling. The key idea of Gaussian Frosting is to augment a mesh with a layer of Gaussians in order to better capture fine surface details. This makes the representation as editable as a mesh, while maintaining the high rendering quality of 3DGS.

Gaussian Frosting starts from the observation that 3D Gaussians obtained by standard 3DGS - so-called ‘unconstrained Gaussians’ - are not regularized to align well with the surface of the scene. Algorithms such as Marching Cubes therefore fail to construct a mesh representation from an unconstrained 3DGS scene. Another paper called SuGaR, by the same authors, already proposed a solution to regularizing Gaussians for surface alignment - constructing ‘regularized’ Gaussians - and subsequent mesh extraction (using Poisson reconstruction rather than Marching Cubes):

Regularized Gaussians align better to the surface of objects in the scene and therefore result in much better mesh representations (source: Gaussian frosting paper)

Gaussian Frosting builds further on this by constructing a ‘frosting layer’ of additional Gaussians on the extracted mesh, of which the thickness depends on the ‘fuzziness’ of the material. This ‘fuzziness’ can be formally defined by considering the thickness of both the regularized Gaussians and the unconstrained Gaussians. This allows for automatically estimating the required thickness of the frosting layer:

The required thickness of the frosting layer is determined by considering both the thickness of the regularized and unconstrained Gaussians (source: Gaussian frosting paper)

Thickening the frosting layer around these fuzzy regions simply allows for more Gaussians to capture fine-grained details:

In contrast to normal 3DGS, Gaussian Frosting renders fine details meticulously (source: Gaussian frosting paper)

Furthermore, the authors are able to keep the frosted Gaussians within the frosting layer during optimization as well as when deforming the base mesh. This makes it possible to use traditional mesh editing tools, e.g. in Blender, to edit or composite scenes with great rendering quality, even for complex volumes or fuzzy materials. One such example can be seen below. Who thought Buzz Lightyear would be able to ride a giant kitten in the classic 3DGS bike scene?

Scene composition with deformable meshes using Gaussian Frosting. The resulting composited scene is of high quality and can properly deal with fuzzy surfaces and occlusions (source: Gaussian frosting paper)

For those familiar with Blender, the Gaussian frosting authors provide a Blender add-on to play around with Gaussian Frosting! You can find it here.

Diffusion

Latent Diffusion proved to be a game-changer for conditioned 2D image generation back in 2021. Moving the diffusion process to the latent space made it possible to exploit the potential of image denoising more efficiently. I would like to highlight three fairly different papers that use diffusion directly for generation. They have one common factor: the results are an impressive tribute to the capabilities of today's latent diffusion models.

CosHand: Controlling the world by the sleight of hand

Diffusion models most likely have some understanding of the interaction of objects with the world, simply because of the vast amount of data models such as Stable Diffusion have seen during training. CosHand is an example of a “world model”. World models predict the future conditioned on past observations and an action. This setup is quite different from normal text- or image-conditioned diffusion models. World models should ensure consistency across the input and predicted image, and accurately model physical behavior of objects with their surrounding environment.

CosHand makes use of SAM to obtain a large dataset of combinations of hand masks and images, on which a latent diffusion model with strong physical priors can be finetuned (source: CosHand paper)

CosHand models hand-environment interaction and aims to predict the change in position, appearance or geometry of an object caused by hand motions, seen from a single view. Given an input image X_t, the hand mask in that image h_t and a future hand mask - the ‘hand query’ - h_(t+1), it predicts how the image would change because of the hand motion.

Starting from a pre-trained Stable Diffusion model to leverage its strong priors, CosHand conditions on encoded versions of X_t, h_t and h_(t+1), as well as on a CLIP embedding of X_t during finetuning. The latter conditioning ensures semantic consistency between the input and output image. Hand masks are generated with - you’ll never guess - Segment Anything, and model inputs and outputs are sampled from the SomethingSomethingv2 video dataset.

CosHand has a sense of depth, and is able to generalize to unseen but similar mechanisms such as robot arms (source: CosHand paper)

The results are - to say the least - fascinating. CosHand is able to predict the impact of a hand motion on a scene. It also has a sense of depth, as illustrated on the left of the figure above. Even more fascinating is the fact that it is also able to do the same thing for robot arms; it clearly has some understanding of physical concepts at an abstract level beyond its dataset! CosHand is an excellent example of unlocking strong physical priors from diffusion models.

Generative Camera Dolly: It’s all about perspective

At the end of November 2023, Stability AI brought Stable Video Diffusion (SVD) for image-to-video synthesis into the world. Generative Camera Dolly (GCD) operates by the same principle as CosHand: finetuning a diffusion model that was pre-trained on large-scale data, in Dolly’s case video data. Given a static, single-view video of a scene, GCD is able to imagine what the scene would look like from other perspectives. The output is a video-to-4D transformation, generating a video as if the camera were moving along the scene - just like a camera dolly.

Generative Camera Dolly is a Stable Video Diffusion model finetuned to incorporate explicit control over camera movement during video generation (source: GCD paper)

The authors condition a pre-trained SVD model on the relative camera viewpoint difference, captured by a series of rotation matrices R_t and translation matrices T_t. This allows for explicit control over camera movement during video generation.

A desired transformation can be specified as e.g. up 15°, right 60° and back 10m. Upon generating this camera movement, the model is still capable of recovering full scene layout and reconstructing temporally hidden objects despite occlusions. It also correctly imagines the continued motion of objects in a scene as the camera moves.

Cool video examples can be found on the GCD Project Page, be sure to go give it a look!

FMBoost: Go with the flow

Diffusion models generally still suffer from slow inference and high computational resource usage, especially for high resolution image generation. Apart from (latent) diffusion processes, another paradigm for image generation has gained some traction lately: flow matching. Flow matching is a theoretically quite heavy concept, based on Continuous Normalizing Flows. Flow matching aims to construct a flow ϕ that maps an initial distribution p0 (for image generation a standard Gaussian distribution) to another, possibly more complex distribution p1 (the distribution of the image space):

High-level overview of flow matching: a flow ϕ allows to construct a mapping from an initial distribution p0 to another, more complex distribution p1

ϕ is the solution of the differential equation dx = u_t(x)dt, with u_t(x) a time-dependent vector field. Flow matching essentially provides a regression objective for estimating the vector field u_t(x) from samples of the target distribution p_1. The flow ϕ can then be found using highly optimized ODE solvers, making flow matching really efficient for image generation, as it does not require many stochastic denoising steps, as is the case with diffusion models. However, this comes at the cost of flow matching models being less expressive and diverse.
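To make the ODE view concrete, here is a small NumPy sketch that integrates dx = u_t(x)dt with plain Euler steps. It assumes a toy 1D setting in which p_0 is a standard Gaussian and p_1 is a Gaussian with hypothetical mean 3 and standard deviation 0.5, so the vector field is known in closed form. In a real model, a trained network would take the place of `vector_field`.

```python
import numpy as np

def vector_field(x, t, mu, sigma):
    """Velocity that carries N(0, 1) to N(mu, sigma^2) along straight lines."""
    return (mu + (sigma - 1.0) * x) / (1.0 + t * (sigma - 1.0))

def euler_flow(x0, mu, sigma, n_steps=10):
    """Integrate dx = u_t(x) dt from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * vector_field(x, k * dt, mu, sigma)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=10_000)              # samples from p_0
x1 = euler_flow(x0, mu=3.0, sigma=0.5)    # transported samples, approx. p_1
print(x1.mean(), x1.std())
```

Because the trajectories in this toy case are straight lines with constant velocity, a handful of Euler steps is already accurate. This is the intuition behind the fast sampling of flow matching models, although learned vector fields are generally not this well behaved.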

FMBoost combines the best of both worlds: it leverages latent diffusion models in a small latent space to generate diverse samples at a small resolution. Thereafter, flow matching is used to efficiently upsample the latent code to a higher dimensional latent space, from which a high resolution image can be decoded with a pre-trained VAE decoder:

FMBoost combines the strength of both Latent Diffusion Models (LDM) and Flow Matching (the CFM module) for efficient high resolution image generation. The latent decoder is a pre-trained VAE decoder.

The Coupling Flow Matching module (CFM module) is trained to efficiently transport the low resolution latent code to a high resolution one. Diffusion in the low-dimensional latent space ensures sufficient diversity in the generated images. The result is a plug-and-play method for boosting the resolution of latent diffusion models in an efficient manner. It is really neat to see the difference between the low resolution image resulting from the latent code from the LDM, and the high resolution after flow matching:

Difference between the original image resulting from the low resolution LDM and the high resolution image after several upsampling steps using a chain of CFM modules.

That’s it for this first ECCV blogpost! Hope the work covered above fascinated you as much as it fascinated us. Every corner of the 3D AI scene is moving at an unprecedented speed, that is for sure. In the next blogpost, a brand new colleague of ours - and expert in the field - will take you through more fascinating ECCV work on 3D object generation and representation. Stay tuned 😎

Celebrating our first year

Ruben Verhack — Mon, 15 Jul 2024 14:39:52 GMT

We made it! First year of Datameister is already behind us. We founded Datameister in June 2023, fast-forward one year, and here we are! There are many things that I would like to share. For example, our team is expanding to 7 people in the coming months. We’ve been extremely lucky to gather a lot of top notch talent. Liam already introduced himself, we’ll be introducing others soon too. Additionally, we’ve been cooking up some exciting stuff behind the scenes.

However, I’m going to keep this one short, but I wanted to share our photos of our 1-year celebration at Zebrabeach, Ghent.

We invited our friends and colleagues from the tech scene around Ghent to join us for drinks, BBQ and tunes by DJ Mixmonster Menno. Thanks to Janne Kegels for the photos!

Finally, I want to give a shout-out to Baking with Astrid for the delicious Datameister-branded mini-cupcakes.

Reflecting on the first AI for Digital Arts/Entertainment/Game Dev meetup

Ruben Verhack — Thu, 11 Apr 2024 15:16:16 GMT

The event

Last night, Datameister, Howest Digital Arts & Entertainment and Flanders Game Hub organized the first event of the AI in digital arts game development and entertainment BE meetup community. It was hosted in the beautiful event space underneath our offices at the Boldhouse in Ghent, BE. It was great to see so much enthusiasm and attendance to the event. Full House!

Full house at the Boldhouse!

The goal is to connect people ranging from AI algorithm techies to AI-powered creatives. Both ends are really looking at each other for help since there is so much noise at the moment. It feels like the entire playing board has been completely shaken up. The techies wonder what to make, while the creatives wonder what is possible. We believe that those who take the lead in navigating this new future will be the ones who ultimately own it.

The people

I was honestly pleasantly surprised by the creative talent that was present. In general, there is always more talent in Belgium than you would expect. People came from far and beyond and were showcasing their experiments to each other on their laptops. That really brought a smile to my face.

It was a nice mix of people, and I have to give a shoutout to Howest DAE Research for being the ideal bridge between the creatives and techies. Researchers and teachers were all present to share their findings and look for collaborations.

Apart from the creatives, it was great to see other industries present. People who work on photogrammetry and 3D assets in industrial settings, work with 3D people developing their new AI products in stealth mode, or other people working on all kinds of simulations and training environments in AR/VR.

The talks

Glenn Van Waesberghe (DAE Research) gave a great overview of how text-to-image works, not shying away from technical details like the embedding spaces and diffusion, something I really appreciate because I don't want to be fluffy about AI in this community. Give me the cold, hard, technical details. He concluded his talk with a comprehensive analysis of all the available tools, evaluating them on various aspects. Glenn shared his slides here.

Glenn presenting text-to-image: deep dive

Vince Buyssens (Starhaven) showed us how he has been successfully applying AI for big international campaigns and discussed how it was received. He specializes in supplying the change management necessary for companies and all the nuances of how to apply new techniques to broaden your toolbox but still respect your audience and keep humanity relevant within the AI revolution. He was kind enough to share his presentation slides. Check out his videos on which he collaborated: “Welcome to the Latent Space”, “Under Armour - Anthony Joshua - Forever is made now”.

Me, Ruben, gave a talk on AI-driven breakthroughs in image-based rendering which I already wrote extensively about in this blog post.

My reflections

First of all, I found it extremely interesting how tools, which are still very limited in quality or creation length, can be put together to create something great. It’s a great new type of creativity that emerges, which often reminds me of the 80s and 90s aesthetics that also tried to be as creative with shitty graphics as possible but gave rise to 8-bit art etc. It itself becomes an art.

Another trend I am seeing is that the modalities on which creatives will work in the future will be volumetric; it won’t just be images and video. Some studios already treat video productions a lot more like 3D environment productions than plain 2D video productions. For this, NeRFs and Gaussian Splats will become great intermediate formats that bridge this camera-captured data with 3D worlds. We will work in media in a different modality akin to radiance fields in a more volumetric environment, and this will require radically new tools. This is exciting for a company like ours that specializes in spatial AI.

What I learned was how creatives now have access to open source, and how open source is so important to creatives. The idea that this technology should not be only dominated by Adobe and the likes is actually a stance on open source that I never thought of. For me, as a techie, I always assumed open source was something us techies only cared about. Especially ComfyUI caught my attention.

My last observation was that creatives seem to care a lot more about the impact on their industry than I had ever given thought to. And by a lot, I mean A LOT. The criticism and even boycotts companies face when experimenting with AI are mind-blowing. They need to deal with inertia a lot more than us techies do. It opened my eyes to the kind of techno-optimistic bubble I live in.

Next steps

This event placed a significant emphasis on text-to-image generation, image in- and out-painting, and generative video, along with our discussion on radiance fields, including Gaussian Splatting and NeRFs. I am really looking forward to discovering what other topics will emerge. I am particularly excited about those that involve 3D pose, animations, 3D world design, and visual effects.

We already set the date for the next event: June 5th. More details will follow, so keep an eye on our meetup community page.

AI-Driven Breakthroughs in Image-Based Rendering: Light Fields, SMoE, Gaussian Splatting, NeRFs and beyond

Ruben Verhack — Tue, 13 Feb 2024 10:41:27 GMT

In a previous life, I was working at TU Berlin and Ghent University in the exciting field of video coding and image-based rendering. I had the opportunity to meet with many brilliant minds at conferences, MPEG and JPEG meetings as well as industry meetups at Google, Netflix and Disney Research.

Me (left) - starstruck - meeting the original JPEG inventors at the JPEG meeting celebrating 25 years of JPEG (Turin, Italy). Source: Image by the author.

Although image-based rendering has been around for decades, it wasn’t until recently that a big revival happened. Image-based rendering was considered impractical for many reasons, until deep learning and diffusion models provided solutions to some of the longstanding issues.

In this article, I will provide an overview of the recent advances in the AI-driven image-based rendering field based on my personal experience and background. Additionally, I will discuss my contributions to the field. This is definitely more of a technical longread. Hope you enjoy.

What is image-based rendering?

Image-based rendering is the field of rendering in which new images are distilled from a set of captured images. It typically involves interpolating and tracing light rays in a space. The captured images serve as snapshots of the light rays in the space.

Illustration in which the yellow camera viewpoint is synthesized based on the pixel data entering the blue real cameras. Source: Image by the author.

Why is this useful? Image-based rendering allows creating novel views from a scene without knowing anything about the geometry of the scene. Reverse-engineering the geometry (meshes, texture, and reflective properties) from a scene is notoriously difficult and an underdetermined problem. Think about smoke, reflections, and translucent objects. It easily gives rise to the uncanny valley problem.

On the other hand, image-based rendering has practical problems of its own when new images must be generated without knowing the geometry and its properties:

Many images need to be captured in order for new images to be able to be generated. This leads to practical data acquisition problems, as well as storage and streaming issues.

Occlusions: one can never see behind objects.

Where does AI come into play?

The naive way to implement the rendering is by using ray-tracing to trace each pixel back to the reference views. Over the years, this has been extended by incorporating more and more intelligent view interpolation techniques. However, the modern way is using an intermediate AI model that is constructed based on the reference viewpoints. Such a model can then be queried at the desired viewing angles.

The primary area of innovation lies in designing these view models that can grasp the essence of a scene to enable reconstructing views at high fidelity. Additionally, these models will be required to have properties that are more concerning applications, such as the ability to relight a scene or efficiently stream parts of a scene over a network. The applications will thus be important in deciding what type of model will be preferred.

Applications

Image-based rendering has many applications which all share the desire to take a scene or an object from the real world and visualize it again in a virtual setting.

1. Immersive applications

The most natural application is in a VR/AR experience in which image-based rendering techniques allow a user to experience a strong immersive feeling due to the highly photorealistic light effects, as well as having all possible viewing angles at their disposal, giving them 6 degrees of freedom.

It effectively fulfills the potential of virtual reality using camera-captured content. This represents a significant improvement over the restricted range of motion found in current 360-degree videos. There are many possible applications, e.g. entertainment (remote events), telemedicine (remote robot surgery), remote visits to real-estate, cultural heritage, and others.

The model will be required to be streamable efficiently across networks. Ideally, it should have all the features found in typical MPEG data streams, such as random access or different layers of detail.

2. 3D assets for virtual productions

A less obvious, but extremely important new application is the inclusion of scanned objects and spaces to be reused in virtual productions, VFX, video production, and gaming. This is where I truly believe that the need for compact and versatile data models arises.

This is the area that has seen the most activity over the last few years. Engines such as Unreal and Unity currently have plugins that allow you to import view generating models like NeRFs and Gaussian Splats right into your 3D world. You can then integrate other 3D objects like meshes with your camera-captured content. Watch the video below and you will instantly understand why this is a game changer for productions.

Video showing the production process of incorporating a mountain top camera-captured by a drone into a virtual production. Source: Bad Decisions Studio

For these applications, other requirements may be imposed on view-point generating models, e.g. being able to edit or to relight scenes or objects.

Mathematical models of light

Let’s first get back to the (mathematical) basics. Several mathematical models have been introduced to model light over the last century. All relate one way or another to the plenoptic function. The plenoptic function is a mathematical model that describes all the light rays in a space at a certain time. It captures all possible visual information about a scene at every point in time and encompasses various attributes like position, direction, color, intensity, etc. Different models have been introduced that all relate to the plenoptic function: light fields, radiance fields, ray-space representation. All of the models are different parametrizations or simplifications of the plenoptic function.

The full plenoptic function. The Polarization, Bounce and Phase arguments are typically left out for simplification. Time is only relevant for non-stationary scenes (Source: History of Neural Radiance Fields).

For example, the simplified 4D light field contains information about both the intensity and direction of light rays at every point in space. By capturing and processing this vast amount of data, we can recreate realistic images with accurate lighting effects and depth perception. Light fields are used in photography and imaging technologies, like light field cameras (e.g., Lytro), which capture information about the light direction as well as its intensity. This allows for post-capture refocusing, changing the perspective, and precise depth-based filtering.

It is called the 4D light field because it reduces the plenoptic function to 4 main parameters. Different parametrizations are possible (see below), of which the first (a), the parallel two-plane representation, is the most common. Each ray of light goes through an image plane (s,t) and travels through a camera plane (u,v), assuming that the cameras are placed on a single plane.

Source: Changyin Zhou, & Nayar, S. K. (2011). Computational Cameras: Convergence of Optics and Processing. IEEE Transactions on Image Processing, 20(12), 3322–3340.
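
To illustrate the two-plane parametrization, the sketch below converts a 4D sample (u, v, s, t) into a 3D ray. The placement of the camera plane at z = 0 and the image plane at z = `plane_distance`, with both planes sharing one global coordinate frame, is an assumption of this example, as is the function name.

```python
import numpy as np

def ray_from_two_plane(u, v, s, t, plane_distance=1.0):
    """Turn a two-plane light field coordinate into a ray (origin, unit direction).

    Assumed geometry: camera plane (u, v) at z = 0, image plane (s, t) at
    z = plane_distance, both expressed in the same global coordinates.
    """
    origin = np.array([u, v, 0.0])
    target = np.array([s, t, plane_distance])
    direction = target - origin
    return origin, direction / np.linalg.norm(direction)

# A discretized 4D light field is then simply a 4D array indexed by (u, v, s, t),
# e.g. a 3x3 camera grid with 64x64 pixel images (grayscale for brevity).
light_field = np.zeros((3, 3, 64, 64))
origin, direction = ray_from_two_plane(u=0.0, v=0.0, s=0.5, t=0.0)
```

Every entry of `light_field` stores the radiance along exactly one such ray, which is why four numbers suffice once the cameras are constrained to a single plane.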

The 4D light field is a convenient mathematical model, but does not always transfer easily to practical and cost-efficient camera rigs, as I will discuss in the next section.

Alternatively, free-viewpoint camera setups are possible (or even just moving a single camera over time), which require the additional complexity of structure-from-motion (SfM) / photogrammetry techniques to situate the pixel values in 3D space. For example, Gaussian Splatting relies heavily on SfM for the initialization of its kernels.

Structure-from-motion: Finding correspondences between camera viewpoints to locate the pixel values in the physical 3D world. A more flexible approach compared to the constrained 4-D light field. Source: https://towardsdatascience.com/a-comprehensive-overview-of-gaussian-splatting-e7d570081362

Capturing light fields

I will briefly discuss camera setups for light field capturing, mainly to give insight into why light fields have not been a practical solution for a long time. Below you can see a prototype of one of Lytro’s production-level light field camera arrays. Needless to say, this is extremely costly to rent, run and to provide storage for. Each camera array requires a server rack to capture video.

Adapted from RoadToVR - Exclusive: Lytro Reveals Immerge 2.0 Light-field Camera with Improved Quality, Faster Captures.

In contrast, you can find my poor man’s version that I built at IDLab-MEDIA - UGhent below. Each panel consists of 9 Raspberry Pi minicomputers with the RPI v2 cameras. Each panel was laser cut and assembled using standard easy to find tools. The cost of one panel was sub-1000 EUR, which was my assigned budget.

DIY Raspberry Pi-based light field camera array. The German word “Kabelsalat” is very apt here. Source: Image by the author.

Each panel would thus produce 9 photos/videos that were spatially displaced. The extrinsic parameters (camera position/angle) and intrinsic parameters (lens correction) were co-optimized using multi-cam optimization.

Result of one panel: 9 viewpoints from a single camera plane. Source: Image by the author.

If you apply naive light field rendering by ray tracing through these nine images, you get a result as shown below. The camera position can be slightly changed, and some refocusing is possible by adjusting a virtual focal plane. The level of quality remains low when using 9 cameras with relatively large gaps in between, but it does give you an impression of where we want to go.

Naive light field rendering applied to my poor man’s light field camera rig. Source: video by the author. Full video here
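
The refocusing mentioned above can be sketched with the classic shift-and-add approach: every view is shifted in proportion to its camera offset and the results are averaged, so that objects at the chosen virtual focal plane line up and everything else blurs. This is a simplified stand-in for the ray-traced renderer used for the video; the integer shifts via `np.roll`, the synthetic 3x3 test scene, and the function name are assumptions for illustration.

```python
import numpy as np

def refocus(views, offsets, disparity):
    """Shift-and-add refocusing.

    views:     (N, H, W) array with one grayscale image per camera.
    offsets:   (N, 2) camera positions (row, col) in grid units.
    disparity: pixels of shift per grid unit; selects the virtual focal plane.
    """
    acc = np.zeros(views.shape[1:], dtype=float)
    for img, (dy, dx) in zip(views, offsets):
        shift = (int(round(dy * disparity)), int(round(dx * disparity)))
        acc += np.roll(img, shift, axis=(0, 1))  # wraps around at the borders
    return acc / len(views)

# Synthetic 3x3 rig looking at a single bright point with a disparity of 2 pixels.
offsets = np.array([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
views = np.zeros((9, 32, 32))
for i, (dy, dx) in enumerate(offsets):
    views[i, 16 - 2 * dy, 16 - 2 * dx] = 1.0

in_focus = refocus(views, offsets, disparity=2)      # point is sharp
out_of_focus = refocus(views, offsets, disparity=0)  # point is smeared over 9 pixels
```

With only 9 views, the out-of-focus regions show up as discrete ghost copies rather than a smooth blur, which matches the limited quality of the naive rendering.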

Novel AI-driven image-based rendering techniques

I will now discuss four recent advancements in image-based rendering that have been made possible by recent AI breakthroughs. For more detailed information on the history of this field prior to 2017, please refer to my PhD thesis.

Non-exhaustive overview of continuous representations of camera-captured scenes, separated in Gaussian-based methods and whole-scene methods. Source: Image by the author.

SMoE (2017)

Between 2014 and 2020, I published a number of papers that introduced Steered Mixture-of-Experts (SMoE), and eventually published a book on the matter. I had the privilege and pleasure to work for an MPEG grandmaster, Prof. Thomas Sikora, who was thrilled about the idea of abandoning the concept of pixels altogether. Initially, the method was meant as a continuous representation for images and videos. Everything was modelled by one large Gaussian Mixture Model (GMM). There would just be more Gaussians where more detail was required.

The Gaussians - called ‘kernels’ - would span over spatial directions and in time, thus replacing the pixel as the building block of imagery. A single Gaussian thus represented a single blob of color that had a spatial and a temporal extent. The reconstruction is performed by taking the expected pixel luminance at a certain location, i.e. the expectation of the posterior distribution given the location, which leads the Gaussians to work together in a Mixture-of-Experts (MoE) fashion. MoEs approximate a continuous function by combining a set of experts that are each responsible for a part of the total function. In this case, each kernel is responsible for modeling a single gradient of a region in an image.

A detailed example on a 32x32 pixel image patch. These 1024 pixels were represented by 10 Gaussian kernels, each representing one localized gradient. Comparison to JPEG is at same bitrate. Source: Image by the author.

An example of mean estimated reconstructions of a 128x128 image from the dataset. Original (left) followed by models with 25, 100, 250, 750, and 2000 components, i.e. ranging from 1 kernel covering ±655 to ±8 pixels on average. Source: Image by the author.
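
The reconstruction rule described above can be sketched as Gaussian mixture regression: each kernel is a joint Gaussian over (location, luminance), its conditional mean acts as a linear expert, and the kernels' responsibilities at a location act as soft gates. The code below is a minimal sketch under that reading; the function name, the array layout, and the two hand-picked kernels are assumptions for illustration and are not taken from the published SMoE code.

```python
import numpy as np

def smoe_reconstruct(x, weights, means, covs):
    """Expected luminance E[y | x] under a GMM over joint (location, luminance).

    x:       (N, D) query locations.
    weights: (K,) mixing weights.
    means:   (K, D + 1) joint means, last entry is the luminance mean.
    covs:    (K, D + 1, D + 1) joint covariance matrices.
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    gates, experts = [], []
    for pi_k, mu, cov in zip(weights, means, covs):
        mu_x, mu_y = mu[:d], mu[d]
        s_xx, s_yx = cov[:d, :d], cov[d, :d]
        s_xx_inv = np.linalg.inv(s_xx)
        diff = x - mu_x
        maha = np.einsum("ni,ij,nj->n", diff, s_xx_inv, diff)
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(s_xx))
        gates.append(pi_k * np.exp(-0.5 * maha) / norm)   # unnormalized gate
        experts.append(mu_y + diff @ s_xx_inv @ s_yx)     # linear gradient expert
    gates = np.array(gates)
    gates /= gates.sum(axis=0, keepdims=True)             # soft gating per location
    return (gates * np.array(experts)).sum(axis=0)

# Two hypothetical 2D kernels: a dark flat region and a bright region with a gradient.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.2], [10.0, 10.0, 0.8]])
covs = np.array([np.eye(3), [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]])
luminance = smoe_reconstruct(np.array([[0.0, 0.0], [10.0, 10.0], [11.0, 10.0]]),
                             weights, means, covs)
```

Near a kernel's center its gate dominates and the output follows that kernel's gradient; between kernels the gates blend the experts, which is what produces smooth or sharp edges depending on how the Gaussians are steered.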

It was soon realized that this method could be extended to imagery of any dimensionality, and thus to any light model. This was a huge breakthrough: it meant that this could give rise to a methodology that could model, code, decode, and render images, videos, light fields, light field video, and even 360-degree content natively. It has the major advantage that complicated redundancy-reducing methods, typically present in MPEG technologies, do not have to be introduced (e.g. motion compensation using motion vectors).

Basically, one Gaussian kernel represents one bundle of light in space. It has an orientation and an extent through space and time.

A SMoE reconstruction is exemplified below using 9,000 kernels instead of the original 41 million pixels. This means that there is approximately 1 kernel for every 5,000 pixels. In general, the development of this method was mainly focused on image compression rather than computer graphics, so bit-efficiency has always been a priority.

The light field is shown below as a video that goes through all the camera viewpoints (taken with a Lytro lenslet camera). The video traverses the viewpoints from the top left to the bottom right.

Left: original, Right: SMoE reconstruction. Source: Video by the author.

Below is an illustration of the 4D kernels of a crop of the above. Sadly, 4D is inherently difficult for us humans to wrap our heads around. However, the key takeaway is that if you move your head left and right in a scene, a patch of color will move left or right relative to the distance of that patch of color in the scene. The same goes for moving your head up and down. This relative movement is what is visualized on the a1 and a2 dimensions below. The main observation is that kernels have an extent in all four dimensions. A kernel is thus responsible for a patch of color that can move from left to right or top to bottom based on the camera viewpoint.

Visualization of the 4 dimensions of the light field, i.e. the image dimension and the epipolar planes. Source: Image by the author.

The rendering of such SMoE models was heavily improved by the work of Martijn Courteaux (UGent - IDLab-MEDIA). It allows for real-time rendering of light fields from any angle. Here’s one example based on only 9 images from one of my DIY light field camera arrays. Since the kernels find the correlation between the different viewpoints, they provide a smooth transition in between and even outside of the original camera plane.

Light Field rendering of a SMoE model. Source: Video by the author.

Visualization of the kernels of the model. Source: Video by the author.

The issue with SMoE was that, although theoretically sound, there always remained a struggle to obtain near-lossless quality or to deal with fine texture details. In my work, I mainly trained the GMMs by using the Expectation-Maximization (EM) algorithm. I even developed a method to scale this algorithm to accommodate hundreds of thousands of kernels on billions of pixels (I might write about that in a later post). Follow-up methods have been published on how to better initialize and train the GMM with MSE optimization using gradient descent. This involved making the model differentiable, similar to Gaussian Splatting. This greatly improves image quality but breaks the theoretical soundness, as the model is no longer a pure Bayesian GMM (for what it's worth).

The same Martijn Courteaux (sitting in the back in the video) has worked on bringing the modeling and rendering of light fields using SMoE to a whole new level over the last few years. A sneak peek is included below:

Previewing some recent advances in SMoE-based rendering. Source: Martijn Courteaux YouTube channel.

I will discuss the current state of SMoE vs Gaussian Splatting in our wrapping-up section. But first, I will continue chronologically through the major breakthroughs.

NeRF (2020)

Neural Radiance Fields (NeRFs), and the subsequent research based on NeRFs, are whole-scene methods in which a single neural network captures the entire scene. The neural network maps a 3D position and viewing direction onto a color and volume density, providing a continuous representation of the entire scene.

Source: History of Neural Radiance Fields

The data that needs to be saved consists solely of the weights (and the architecture) of the trained neural network. The downside is that the whole set of weights is required to reconstruct even portions of the scene. This leads us to the main disadvantages of whole-scene NeRF methods: encoding and decoding complexity and memory requirements. There are no "building blocks" as each scene corresponds with training an entire neural net. Reconstruction corresponds to running inference through the entire neural network.

Source: https://www.matthewtancik.com/nerf
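
To make the "whole scene in one set of weights" point tangible, here is a deliberately tiny sketch of a radiance field as a function from position and viewing direction to color and density. The two-layer network with random, untrained weights, the layer sizes, and the class name `TinyRadianceField` are assumptions for illustration; the actual NeRF architecture is deeper and injects the viewing direction later in the network.

```python
import numpy as np

def positional_encoding(p, n_freqs=4):
    """Map each input channel to sines and cosines of increasing frequency."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    angles = p[..., None] * freqs                           # (N, C, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(p.shape[0], -1)                      # (N, C * 2 * n_freqs)

class TinyRadianceField:
    """Toy stand-in for a NeRF: all scene content lives in w1 and w2."""

    def __init__(self, hidden=32, n_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 6 * 2 * n_freqs                            # 3 position + 3 direction channels
        self.n_freqs = n_freqs
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, 4))

    def __call__(self, positions, directions):
        inputs = np.concatenate([positions, directions], axis=1)
        feats = positional_encoding(inputs, self.n_freqs)
        hidden = np.maximum(feats @ self.w1, 0.0)           # ReLU
        out = hidden @ self.w2
        rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))             # colors in [0, 1]
        sigma = np.maximum(out[:, 3], 0.0)                  # non-negative density
        return rgb, sigma

field = TinyRadianceField()
positions = np.zeros((5, 3))
directions = np.tile([0.0, 0.0, 1.0], (5, 1))
rgb, sigma = field(positions, directions)
```

Note that even a single query touches every weight in `w1` and `w2`, which is the practical consequence of having no local building blocks.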

The biggest advantage of NeRFs is that they have a complete knowledge of the scene, which can lead to much better image quality and fewer camera viewpoints necessary due to better generalization between image viewpoints.

NeRF reconstruction example. More examples on https://www.matthewtancik.com/nerf (source)

Gaussian Splatting (2023)

Recently, Gaussian Splatting has received much attention, and rightfully so. It is a Gaussian-based method similar to SMoE, but it addresses some of the persistent issues present with SMoE. The optimization of Gaussian Splatting parameters is MSE-based, similar to extensions of SMoE.

The main differences to SMoE are as follows:

The Gaussian kernels exist in the physical 3D coordinate space, whereas SMoE kernels exist in the camera-image-plane coordinate system. As such, they have a more explicit connection to the real geometry. Furthermore, this allows the Gaussians to be initialized better by using structure-from-motion, which greatly improves the optimization process to achieve an optimum quickly and efficiently.

Spherical Harmonics are used as the view-dependent color function, which provides more expressive local expert functions compared to the color gradients in SMoE. This has been key to achieving photo-realistic results.

Gaussian splatting benefits from a plethora of optimization possibilities, as it can be implemented as a rasterization method. As such, it benefits from decades of computer graphics advancements.

High-level comparison between SMoE and Gaussian Splatting. Source: image by the author.
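
Two of the differences listed above can be sketched in a few lines: evaluating a view-dependent color from spherical harmonics coefficients, and blending depth-sorted splats front to back as a rasterizer would for one pixel. The sketch is limited to degree 1; the sign convention of the linear terms, the 0.5 offset, the clipping to [0, 1], and the function names are assumptions chosen for illustration rather than a statement about any particular implementation.

```python
import numpy as np

SH_C0 = 0.28209479177387814   # sqrt(1 / (4 * pi)), the constant basis function
SH_C1 = 0.4886025119029199    # sqrt(3 / (4 * pi)), scale of the linear basis functions

def sh_color(coeffs, direction):
    """View-dependent RGB from degree-0 and degree-1 spherical harmonics.

    coeffs:    (4, 3) array, one row per basis function, one column per channel.
    direction: viewing direction, normalized inside the function.
    """
    x, y, z = np.asarray(direction, dtype=float) / np.linalg.norm(direction)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])  # assumed sign convention
    return np.clip(basis @ coeffs + 0.5, 0.0, 1.0)                # assumed offset and clipping

def composite(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats covering one pixel."""
    pixel = np.zeros(3)
    transmittance = 1.0
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * np.asarray(color, dtype=float)
        transmittance *= 1.0 - alpha
    return pixel

# A splat whose red channel changes with the horizontal viewing direction.
coeffs = np.zeros((4, 3))
coeffs[3, 0] = 0.5
seen_from_left = sh_color(coeffs, [-1.0, 0.0, 0.0])
seen_from_right = sh_color(coeffs, [1.0, 0.0, 0.0])

# A half-transparent red splat in front of an opaque blue one.
pixel = composite([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.5, 1.0])
```

Higher SH degrees add more basis functions in the same way, trading more coefficients per splat for sharper view-dependent effects such as highlights.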

Bikes seemingly make for great test cases in the field of view synthesis, since they have many small structures such as spokes and brake cables. Below is an illustration from a different bike scene modelled by Gaussian splatting.

Source: the Hugging Face blog.

Similar to our SMoE example of a scene, an example is shown here which illustrates the quality of Gaussian Splatting in a similar scene, albeit recorded using a higher-end camera rig.

Source: 3D Gaussian Splatting at Plain Concepts.

It is great to see a method reaching maturity that will definitely make a huge impact on virtual video productions and a variety of game production tools. Especially since it is not a black-box model, but there are building blocks that can be segmented, compressed, streamed... A lot of the paradigms of video coding are applicable to splats which is exciting.

For those who want more details, I would highly recommend reading the excellent Comprehensive Overview of Gaussian Splatting by Kate Yurkova.

ReconFusion (2023)

I included ReconFusion as it clearly demonstrates the benefit of a technique that is purely deep-learning based. ReconFusion is basically the combination of NeRFs with diffusion, which is a rather novel architecture within deep learning that is especially good at generalizing to unseen data. This generalization translates to needing fewer original camera viewpoints as it "inpaints" the missing views based on prior image knowledge gathered by training on a large dataset of viewpoints.

Below you can see the comparison between regular NeRFs and ReconFusion using diffusion priors. It is clear that ReconFusion requires far fewer initial camera viewpoints.

Source: https://reconfusion.github.io/

Wrapping up

There are currently two new main paradigms in image-based rendering: the Gaussian-based methods and the whole scene deep-learning methods. Both have their pros and cons listed below.

Comparison between Gaussian-based methods (left) and whole scene NeRF methods (right). Source: image by the author.

I strongly believe that when it comes to creating efficient streamable camera-captured VR content, Gaussian kernel-based methods are the clear choice. While NeRFs can be utilized as 3D assets in virtual productions, Gaussian splats can serve the same purpose just as effectively. Additionally, Gaussian splats offer the advantage of being loosely connected to the underlying geometry. This opens up possibilities for editing these 3D assets in various ways.

NeRFs can still be employed in scenarios with restricted viewpoints and a greater need for image priors. The methods are not mutually exclusive either. It may be logical to initially generate a NeRF using limited camera viewpoints, harness the capabilities of diffusion, and subsequently create a Gaussian Splat for a more practical model.

Sadly, for some reason, currently unpublished improvements to SMoE have been extremely difficult to get published. The novel methods have even been rejected three times at ACM Transactions on Graphics, the SIGGRAPH journal in which Gaussian Splatting was introduced. Anyway, I'll spare you a massive rant on the current state of journal review processes. Nevertheless, I hope to see the publication soon, which would also benefit the Gaussian Splatting community since many improvements are transferable between the two techniques.

One thing is for sure, the field is more alive and kicking than I have ever experienced in my career. At Datameister.ai, we're following up on all the developments in the field and are currently exploring how we can contribute to it. More on that later!

Feel free to discuss the article on reddit: https://www.reddit.com/r/GaussianSplatting/comments/1ax1102/a_higherlevel_view_on_ai_models_in_radiance/

Making the case for custom LLMs and custom LLM deployments

Ruben Verhack — Wed, 20 Dec 2023 15:15:31 GMT

In this article, I will cover two of the major debates in the NLP community and when working with clients:

Do you use an API of a proprietary company (e.g. OpenAI’s GPT-4) or do you use a custom / off-the-shelf open-source LLM?

If using an open source LLM, do you deploy your own nodes or do you rely on services such as Amazon Bedrock or Sagemaker?

Choose your side of the debate: self-deployed open source LLMs or external proprietary APIs?

The answer to this question is not definitive and, naturally, it varies depending on the application. The fact that this field is constantly evolving makes it difficult to provide a straightforward response. While there are certainly advantages to utilizing external APIs (GPT-4’s capabilities are still the gold standard), there are also numerous pitfalls that must be taken into account. In many cases, opting for a custom and self-deployed LLM is the wisest and most cost-efficient choice.

Spoiler alert: I will obviously discuss how working with Datameister provides you with the best of both worlds.

Risk of API dependency

There are many possible applications for LLMs. Chat-like, creative applications have taken most of the spotlight recently, but in industry LLMs are mainly used in much more closed contexts.

Typically, clients want to generate text (reports, mailings, ...) based on structured data (scores, categorical data, ...).

Or the other way around, the LLM receives unstructured data (freeform text) and is asked to extract some specific information out of a text.

Or alternatively, they want to serve an end-user predefined knowledge from their resource center in a controlled manner.

Reproducibility is key in production environments. Basically, there are few cases where surprise outcomes are welcomed. The creative factor of generative AI has been the most attention-grabbing feature lately, but it is arguably not the most useful in most industries.

Picture the following. You’ve invested considerable engineering work in a specific OpenAI model. Your app runs fine, and the client is happy. You were able to be cost-efficient by using some of the older OpenAI models, since they were fine for your use case. Life is good. You shut down shop for the weekend, and then this email appears:

OpenAI’s announcement in which they deprecate language models that are only a few years old

I don't think your client was counting on having to update models every two years. Wouldn't it be nice if you had a specific version of a specific LLM running? A version that you knew would never suddenly change? An LLM for which you know what works and what doesn't?

LLMs are hard to validate. In theory, changing models is easy, but in practice, each model behaves differently. Changing models costs time and thus money.
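To make the validation cost concrete, here is a minimal sketch of the kind of regression check a model swap calls for: a fixed "golden set" of prompts is run through both the pinned model and its candidate replacement. Everything in it is an assumption for illustration: the golden set, the substring-based pass criterion, and the two stand-in model functions are hypothetical, and a real setup would call the deployed models and use task-specific checks.

```python
from typing import Callable, Dict, List

# Hypothetical golden set: each prompt is paired with a substring the answer must contain.
GOLDEN_SET: List[Dict[str, str]] = [
    {"prompt": "Extract the invoice total from: 'Total due: 120 EUR'", "must_contain": "120"},
    {"prompt": "Extract the customer name from: 'Dear Ms. Peeters,'", "must_contain": "Peeters"},
]


def regression_report(model: Callable[[str], str],
                      golden_set: List[Dict[str, str]]) -> Dict[str, object]:
    """Run every golden prompt through `model` and collect the prompts that fail."""
    failures = [
        case["prompt"]
        for case in golden_set
        if case["must_contain"] not in model(case["prompt"])
    ]
    return {
        "passed": len(golden_set) - len(failures),
        "total": len(golden_set),
        "failures": failures,
    }


# Stand-ins for two model versions (assumptions, not real API calls).
def pinned_model(prompt: str) -> str:
    return "120 EUR" if "invoice" in prompt else "Peeters"


def replacement_model(prompt: str) -> str:
    # Same meaning, different surface form: enough to break downstream parsing.
    if "invoice" in prompt:
        return "The total is one hundred and twenty euros."
    return "Peeters"


if __name__ == "__main__":
    print(regression_report(pinned_model, GOLDEN_SET))
    print(regression_report(replacement_model, GOLDEN_SET))
```

The replacement answers correctly in a human sense but fails one check because its output format differs. That is the kind of behavioural drift that turns an "easy" model change into engineering time.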

At Datameister, we ensure the deployment and continuous operation of your LLM for as long as you need.

Building your own IP

If I had a penny for every startup idea in 2023 that was along the following lines, then, well, I would have a lot of pennies:

My app will be a chatbot front-end built on ChatGPT that takes in resources from field X to provide users with faster, easier and more human-readable info about X.

Now, OpenAI has released GPTs in which people can do exactly this without any code. Can you spot what the problem was in the above business plan? The main issue is that the company did not hold the IP behind the core functionality of the product. The IP lies entirely in the data. In some cases, this will still be a valid plan, e.g. when you are the sole proprietor of uniquely copyrighted material, but in most cases it isn’t. OpenAI came in and destroyed 90% of the LLM startups in a single product release.

I haven’t come across many open-ended applications built on top of GPT-4 that delivered unique IP to a company. They mainly rely on the GPT-4 magic, which you do not own.

At Datameister, you own your LLM; we only deploy it for you.

One of the key advantages of deploying your own LLM is the freedom to customize the model according to your specific needs or preferences. Unlike external APIs, which may have limitations on customization options, having full control over the model allows you to tailor it precisely to fit your requirements. Of course, you can fine-tune models on AWS or OpenAI. However, these are very costly operations and still leave you with vendor lock-in. When you customize, you build intellectual property. Your IP is what sets you apart from your competition.

Managing your own LLM provides an opportunity for deeper understanding and learning within your team or organization. By taking ownership of the model's management and maintenance, you can gain valuable insights into how it works and potentially drive innovation in natural language processing within your company.

Cost Control

For large-scale or long-term use, deploying your own LLM can be more cost-effective than paying for API usage, or even than using platforms like AWS SageMaker. While there may be an initial setup cost involved, the savings can add up over time as you avoid recurring API fees. Note that, in general, the throughput and latency requirements are the two main cost drivers.

Inference

Most LLM applications do not have strict time constraints, allowing for various cost optimizations. At Datameister, we focus on two key drivers for reducing costs:

Scale-to-zero: Running an LLM continuously can be expensive. However, with our deployment platform, compute nodes are only active when necessary. While there may be a short delay when starting a node that was previously inactive, this is not an issue for non-time-critical applications. In fact, it can result in up to 95% cost savings.

Spot instances: On platforms like AWS, spot instances offer lower costs but less predictability than on-demand instances, which are more expensive but provide stability. By leveraging a robust job scheduling system like ours and not requiring real-time processing, you can take advantage of spot instances and save at least 50% on compute costs without worrying about their inherent unpredictability.

When you use SageMaker to deploy your own LLMs, you do not have the option of using spot instances for inference (only for training). Instead, only on-demand capacity is available, which leads to higher costs. SageMaker does allow for serverless inference and offline batch transforms, although pricing remains relatively high.
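The arithmetic behind these two drivers can be sketched in a few lines. This is a back-of-the-envelope model, not a pricing tool: the hourly rate, the 5% busy fraction, and the 50% spot discount below are assumed values chosen for illustration, and the sketch ignores cold-start time and storage costs.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(hourly_rate: float, busy_fraction: float,
                 scale_to_zero: bool, spot_discount: float = 0.0) -> float:
    """Estimate the monthly compute cost of a single node.

    busy_fraction: share of the month the node is actually processing work.
    spot_discount: fractional discount of spot versus on-demand pricing.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    # An always-on node is billed for the full month; a scale-to-zero node
    # is billed only for the hours it is busy.
    billed_hours = HOURS_PER_MONTH * (busy_fraction if scale_to_zero else 1.0)
    return billed_hours * hourly_rate * (1.0 - spot_discount)


if __name__ == "__main__":
    rate = 4.00   # assumed on-demand price per hour for a GPU node
    busy = 0.05   # assumed: the node has work 5% of the time

    always_on = monthly_cost(rate, busy, scale_to_zero=False)
    to_zero = monthly_cost(rate, busy, scale_to_zero=True)
    to_zero_spot = monthly_cost(rate, busy, scale_to_zero=True, spot_discount=0.5)

    for label, cost in [("always on", always_on),
                        ("scale-to-zero", to_zero),
                        ("scale-to-zero + spot", to_zero_spot)]:
        saving = 1.0 - cost / always_on
        print(f"{label:>22}: {cost:8.2f} per month ({saving:.1%} saved)")
```

Under these assumptions, a node that is busy 5% of the time costs 95% less with scale-to-zero than when left running, and the spot discount applies on top of that. The saving shrinks as utilization grows, which is why the throughput and latency requirements dominate the cost picture.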

Fine-tuning

Fine-tuning through external APIs such as OpenAI can be very expensive. From our experience, fine-tuning GPT-3.5 on a few thousand (large) samples can rack up hundreds of dollars. This does not allow you to do a lot of experimentation. Plus, even if you wanted to spend thousands of dollars on fine-tuning, you need to be in the right usage tier to be allowed to do so.
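As a rough worked example of how such a bill adds up, token-priced fine-tuning scales with samples × tokens per sample × epochs. The numbers below are illustrative assumptions (including the per-1K-token price, which is a placeholder and not a quoted rate); plug in your provider's current pricing.

```python
def fine_tuning_cost(num_samples: int, tokens_per_sample: int,
                     epochs: int, price_per_1k_tokens: float) -> float:
    """Cost of one fine-tuning run when billing is per trained token."""
    trained_tokens = num_samples * tokens_per_sample * epochs
    return trained_tokens / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    # Assumed scenario: 3,000 large samples of about 4,000 tokens each, 3 epochs,
    # at a placeholder price of 0.008 USD per 1K training tokens.
    one_run = fine_tuning_cost(3000, 4000, 3, 0.008)
    print(f"one run:  {one_run:.2f} USD")
    print(f"ten runs: {10 * one_run:.2f} USD")
```

A single run in this scenario lands in the hundreds of dollars, and an R&D phase with ten iterations multiplies that by ten, which is where per-token fine-tuning starts to limit experimentation.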

For the same cost-cutting reasons as in inferencing, it can be more cost efficient to deploy your own LLM if you need many iterations in your R&D phase. Furthermore, you do not have a lock-in to a specific vendor which could change pricing at will.

Deploying your own LLM at Datameister can offer significant cost savings and predictability compared to using external APIs or AWS SageMaker.

Full Control Over Data and Privacy

This is a harder topic to cover. OpenAI is now GDPR compliant and you can run the OpenAI API in Europe through Azure’s OpenAI. You can also run Claude or Jurassic-2 models using Amazon Bedrock in your preferred region. Oh, wait, no you can’t. At the time of writing, most models are not available in every region on AWS, but it probably won’t take much time before they are.

However, many clients still prefer to have the LLM running in either their own cluster or in the Datameister cluster. This provides them with greater control over their data, ensuring enhanced privacy and security. This is especially crucial when dealing with sensitive or proprietary information. By keeping the data in-house, clients can guarantee its protection and confidentiality. However, theoretically, this shouldn't be a problem if external APIs are used correctly within the designated regions and if all applicable regulations are followed.

Datameister: Tackling the Challenges Associated with Self-Deployed LLMs

While there are many benefits to deploying your own LLM, it's important to consider the challenges that come with it:

High initial setup costs and complexity: Setting up your own LLM requires a significant investment. You need the right infrastructure and resources to ensure smooth operation and optimal performance.

Scalability challenges: Managing scalability in-house can be challenging, especially if there are sudden spikes in demand for your LLM.

Limited knowledge, resources and support: Unlike established external APIs, deploying your own LLM may limit your access to support and resources. If you encounter unique challenges or bugs, you might have to rely on internal expertise or community forums for assistance.

And of course, there is the risk of obsolescence: NLP is rapidly evolving, with new advancements being made regularly. If you're deploying your own LLM, there's a risk that it might become outdated if you're unable to keep up with these advancements. However, as mentioned above, in many use cases it is actually preferable to have a system with known limitations over a solution that is constantly evolving and needs constant validation. Furthermore, most of your power lies in your data. In the case of fine-tuning, you can fine-tune newer models with the same dataset.

While deploying your own LLM comes with its fair share of challenges, most of these are addressed when relying on Datameister.

Most challenges are offloaded to Datameister:

Initial Setup Cost: Datameister has already covered the initial setup cost by building its own LLM deployment platform.

Complexity: Datameister manages the complexity associated with setting up an LLM.

Scalability: With Datameister's solution, scaling is taken care of by automatically spinning up instances as needed, reducing the challenges of managing scalability in-house. Scaling to zero machines is one of our unique offerings: there are no machines running when no workload is present. No machines = no costs.

Support: Datameister offers experienced support to help you overcome any unique challenges or bugs you may encounter.

Risk of Obsolescence: By relying on Datameister's expertise and continuous updates, you can mitigate the risk of your LLM becoming outdated.
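The scalability point in the list above comes down to a simple rule: the number of nodes follows the amount of queued work, and drops to zero when the queue is empty. A minimal sketch of such a rule follows. It is a generic illustration, not Datameister's actual scheduler, and the function name and parameters are hypothetical.

```python
import math


def desired_nodes(queued_jobs: int, jobs_per_node: int, max_nodes: int) -> int:
    """Return how many compute nodes should be running for the current queue.

    queued_jobs:   number of jobs waiting or in progress.
    jobs_per_node: how many jobs one node can handle concurrently.
    max_nodes:     upper bound that caps cost during demand spikes.
    """
    if jobs_per_node <= 0 or max_nodes < 0:
        raise ValueError("jobs_per_node must be positive and max_nodes non-negative")
    if queued_jobs <= 0:
        return 0  # scale to zero: no workload, no machines, no cost
    return min(max_nodes, math.ceil(queued_jobs / jobs_per_node))


if __name__ == "__main__":
    for queue in (0, 1, 20, 100):
        print(queue, "queued jobs ->", desired_nodes(queue, jobs_per_node=8, max_nodes=4), "nodes")
```

The cap keeps a sudden spike from turning into an unbounded bill; jobs beyond the cap simply wait in the queue, which is acceptable for workloads without strict latency requirements.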

Conclusion

In conclusion, here are the key points to consider when deciding between using an external proprietary API or deploying your own LLM with Datameister:

Opting for a custom and self-deployed LLM provides greater control, ownership, and customization options.

With Datameister, you can ensure the deployment and continuous operation of your LLM for as long as you need.

Building your own IP is crucial in most cases, as relying solely on external APIs may result in a lack of ownership over the core functionality of your product.

Deploying your own LLM allows for cost control and potential savings compared to recurring API fees or platforms like AWS SageMaker.

Datameister offers cost-cutting measures such as scale-to-zero instances and leveraging spot instances for inferencing.

You have full control over data privacy when running your LLM in either your own cluster or the Datameister cluster.

While deploying your own LLM comes with challenges, many of these are addressed by relying on Datameister's expertise and support.

Meet The Meisters

Ruben Verhack — Thu, 14 Dec 2023 09:44:04 GMT

The Meisters: Axel (left), Ruben (right)

Datameister is a deep-tech AI lab in Ghent, founded in 2023 by Ruben Verhack and Axel Vlaminck out of one conviction: the AI problems worth doing are the hard, deep ones, and they are worth going long on. Our work has run from cybersecurity to medical imaging to sports analytics, and has converged on spatial and visual intelligence and, now, Physical AI.

We are bootstrapped and built on real projects: every line of our code has shipped in a client's production system. What began as the two of us is now a team of more than a dozen engineers and researchers.

Result: the depth, the track record, and the in-house R&D to take on the AI problems most teams avoid.

Why we started Datameister

We started Datameister in June 2023, the two of us, Axel and Ruben, out of three frustrations.

The first was academic. Ruben had spent years in research and loved its technical depth, but the impact was missing: work that is excellent on paper does not always change anything in the real world.

The second was about where we live. Ghent punches far above its weight, a top-ten global density leader for tech, but it is better known for SaaS than for deep technical work. We wanted to do the harder, more research-heavy AI here.

The third was about time horizons. The investor climate rewards quick returns, which pushes companies toward thin products, the ChatGPT wrappers of the world. Those have their place, but it is not our game. We wanted room to go deep and long, where the hard problems actually get solved.

So we built a lab that could do exactly that: independent, technically deep, and patient enough to back its own R&D.

Bootstrapped, and built on real projects

What sets us apart is depth, and a throughline to where it now points. Early on we took on whatever hard AI modelling problems came our way, across cybersecurity, medical imaging, and sports analytics, and we have been building, fine-tuning, and hosting LLMs for clients since GPT-2. Over time we specialized in spatial and visual intelligence: computer vision, 3D data, and the understanding and generation of scenes and assets. That work has found its place in Physical AI, getting models off the screen and onto robots that perceive and act in 3D space.

The other half is how we are built. Datameister is bootstrapped. We fund the deep work by doing real work, which keeps our incentives pointed at the same thing our clients care about: getting models into production. Good AI should earn its keep quickly in the real world, and the returns we chase come from systems that ship and keep running under load. A product that demos well but falls over in production is not what we are after.

That discipline shows up in our code. Everything we build lives in one monorepo, and every line in it has shipped in a real client project. Out of it grew the DM Library, our collection of AI capabilities proven in production, and the DM Platform for deployment, MLOps, and delivery. That work runs at scale: clients from startups to the Fortune 100, tens of thousands of hours of video processed a month, and on the order of 100,000 open-source packages scanned a day for security. The monorepo is where our speed comes from, and our filter against the hype: if something does not survive contact with real data and real deployments, it does not stay. We take the bleeding edge and get it into production.

Who are the Meisters?

Datameister was founded by the two of us: Ruben Verhack (CEO) and Axel Vlaminck (CTO), engineers at heart with a strong passion for AI. We met at Oqton, the AI company later acquired by 3D Systems for a reported 180M USD in 2021. There we became an ad-hoc AI prototyping lab: management handed us an idea and a month to prove it out, technically and as a product. Axel led the technical side, Ruben owned the product. Killing most of those ideas gave us a sharp instinct for what AI can and cannot do, what scales, and what delivers value. It is also where we caught the itch to build something of our own.

Ruben

Ruben has been building software since 2007: first a company that took on all kinds of work, much of it image processing, then an AI radiology startup in 2019.

Alongside that, he built an academic career, earning a double PhD in computer science from Ghent University and TU Berlin. His doctoral work introduced Steered Mixture-of-Experts (SMoE), an image-based scene representation that predates today's Gaussian Splatting, and it earned a string of awards: the Google Faculty Award in 2015, a Best Paper Award at IEEE Transactions on Multimedia, and several Best Student Paper Awards. He was invited to speak at companies like Google (Mountain View and Zürich), Disney Research, and Netflix.

After the radiology startup, Ruben consulted across analytics, natural language processing, and computer vision. He then spent two years at Oqton, working on robotic welding and geometrical reasoning through AI.

Axel

Axel is a hands-on engineer who builds across the whole stack. He trained as an electronics engineer and moved into AI early, back when its courses were still considered exotic. His master's thesis was on training quadruped robots, years before Physical AI became a buzzword and the same ground Datameister works on today. He was one of the first employees at Oqton and one of the biggest drivers of its AI team.

Today he works across signal processing, computer vision, point clouds, and meshes. He is as comfortable with the physics as he is setting up MLOps environments or building his own transformer-based networks.

The team we built

We are just as proud of the people we have brought in since. Our Head of Engineering, Thijs Bernolet, joined from Autodesk and Oqton. Bernard Grymonpon, formerly of Showpad and Oqton, is our senior technical advisor. Jarne Van den Herrewegen holds a PhD focused on 3D AI representations. Behind them is a deeper bench of engineers and researchers whose work speaks for itself, and you will see many of them publishing here on the blog in the months ahead.

What's next

We are building out our in-house R&D and pushing further into spatial intelligence and Physical AI. If you are an engineer or researcher who wants to work on hard problems with real-world stakes, take a look at our open roles. And if you are a team with an AI problem that has to survive production, come say hi.