MSGField: A Unified Scene Representation Integrating Motion, Semantics, and Geometry for Robotic Manipulation

Yu Sheng, Runfeng Lin, Lidian Wang, Quecheng Qiu, Yanyong Zhang, Yu Zhang, Bei Hua, Jianmin Ji
University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China

The pipeline of MSGField. The geometry field is captured by surface reconstruction from 2D Gaussian Splatting. In the semantic field, each primitive is assigned a label that links to an object feature extracted from CLIP. For the motion field, we represent scene motion with Motion Bases, where each primitive's motion is a combination of these bases.

Abstract

Combining accurate geometry with rich semantics has proven highly effective for language-guided robotic manipulation. Existing methods for dynamic scenes either fail to update in real time or rely on additional depth sensors for simple scene editing, limiting their applicability in the real world. In this paper, we introduce MSGField, a representation that uses a collection of 2D Gaussians for high-quality reconstruction, further enhanced with attributes to encode semantic and motion information. Specifically, we represent the motion field compactly by decomposing each primitive's motion into a combination of a limited set of motion bases. Leveraging the differentiable real-time rendering of Gaussian splatting, we can quickly optimize object motion, even for complex non-rigid motions, with image supervision from only two camera views. Additionally, we design a pipeline that utilizes object priors to efficiently obtain well-defined semantics. On our challenging dataset, which includes flexible and extremely small objects, our method achieves a success rate of 79.2% in static and 63.3% in dynamic environments for language-guided manipulation. For specified object grasping, we achieve a success rate of 90%, on par with point cloud-based methods.
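
To make the motion-basis idea concrete, the following is a minimal sketch rather than the MSGField implementation. It assumes, purely for illustration, that each motion basis is a 3D translation and that each primitive carries one non-negative weight per basis, normalized by the weight sum. The abstract specifies neither the form of the bases nor the weighting scheme, and the names `blend_motion` and `move_primitives` are hypothetical.

```python
# Illustrative sketch only: bases are plain 3D translations (an assumption),
# and each primitive blends them with its own normalized weights.

def blend_motion(weights, bases):
    """Return the weighted average of the basis translations for one primitive."""
    total = sum(weights)
    dim = len(bases[0])
    return [
        sum(w * b[d] for w, b in zip(weights, bases)) / total
        for d in range(dim)
    ]


def move_primitives(centers, weights_per_primitive, bases):
    """Displace every primitive center by its blended basis motion."""
    moved = []
    for center, weights in zip(centers, weights_per_primitive):
        disp = blend_motion(weights, bases)
        moved.append([c + d for c, d in zip(center, disp)])
    return moved


# Two bases: a static background and an object sliding along x.
bases = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]

# Three primitives: background, rigid object, and one on a deforming boundary.
centers = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

moved = move_primitives(centers, weights, bases)
```

Under these assumptions, the motion of N primitives is described by K basis motions plus N×K weights instead of an independent trajectory per primitive, which is what keeps the representation compact when K is small. Mixed weights, as on the third primitive, give a simple picture of how non-rigid motion can arise from a few shared bases.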

Dynamic grasping with MSGField. MSGField can quickly and accurately optimize object motion for rigid or non-rigid objects.

Small Objects Grasping with MSGField.

Flexible Objects Grasping with MSGField.