Autonomous robotic assembly of interlocking bricks demands seamless integration of long-horizon task reasoning, spatial grounding, and fine-grained manipulation. This paper presents BrickCraft, a compositional framework designed for long-horizon and generalizable interlocking brick assembly. BrickCraft models the assembly process using a relative formulation, where each step is anchored to a reference brick within the partial structure, thereby decomposing complex tasks into a finite set of reusable primitive skills. BrickCraft bridges the gap between high-level assembly plans and physical execution through situated manuals, which provide explicit spatial guidance for learned visuomotor skills by projecting the assembly intent onto real-time robot observations. Finally, BrickCraft employs a compositional execution pipeline that chains these spatially grounded skills to accomplish long-horizon assembly tasks. Extensive experimental validations demonstrate that BrickCraft acquires proficient assembly skills from a limited set of demonstrations and exhibits strong compositional generalization to unseen structures.
BrickCraft transforms a digital design into a physical product through three phases: (i) Skill-Oriented Assembly Reasoning decomposes the long-horizon task into steps anchored to reference bricks and maps them to reusable primitive skills; (ii) Assembly Intent Grounding generates situated manuals to provide spatial guidance; and (iii) Compositional Visuomotor Execution chains visuomotor skills to complete the assembly.
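The relative formulation in phase (i) can be illustrated with a minimal sketch. It assumes bricks sit on an integer stud grid and that each new brick is anchored to the nearest brick already placed. The `RelativeStep` type, the nearest-brick rule, and the three skill labels are hypothetical placeholders for illustration, not the skills or reasoning procedure used by BrickCraft.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Cell = Tuple[int, int, int]  # (x, y, z) stud-grid coordinates; an assumed convention


@dataclass(frozen=True)
class RelativeStep:
    brick: str       # brick to place in this step
    reference: str   # already-placed brick the step is anchored to
    offset: Cell     # target position relative to the reference brick
    skill: str       # hypothetical primitive-skill label


def pick_skill(offset: Cell) -> str:
    """Hypothetical mapping from a relative offset to a skill label."""
    _, _, dz = offset
    if dz > 0:
        return "stack_on"
    if dz < 0:
        return "attach_under"
    return "place_beside"


def to_relative_steps(order: List[str], pose: Dict[str, Cell]) -> List[RelativeStep]:
    """Anchor every brick after the first to the nearest already-placed brick."""
    steps: List[RelativeStep] = []
    placed = [order[0]]  # the first brick seeds the structure and needs no reference
    for brick in order[1:]:
        bx, by, bz = pose[brick]
        # Nearest placed brick by Manhattan distance; ties go to the earliest placed.
        ref = min(
            placed,
            key=lambda r: sum(abs(a - b) for a, b in zip(pose[brick], pose[r])),
        )
        rx, ry, rz = pose[ref]
        offset = (bx - rx, by - ry, bz - rz)
        steps.append(RelativeStep(brick, ref, offset, pick_skill(offset)))
        placed.append(brick)
    return steps


design = {"base": (0, 0, 0), "b1": (1, 0, 1), "b2": (4, 0, 0), "b3": (4, 0, 1)}
steps = to_relative_steps(["base", "b1", "b2", "b3"], design)
for step in steps:
    print(step)
```

Because each step stores only a reference brick and an offset, the same skill label recurs across different structures, which is the property that lets a finite skill set cover many designs.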
(a) Assembly Intent Grounding: Symbolic assembly plans are rendered into visual references in simulation and aligned with real-world observations \(I_{ws}\) to extract task-relevant entity masks. These masks are tracked via SAM 2 and overlaid onto real-time observations to yield the situated manual. (b) Visuomotor Skill Execution: We formulate the assembly skill as a diffusion policy. The policy takes the situated manual as observation input and is conditioned on the task encoding \(\tau\) to generate diverse assembly behaviors.
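The overlay step in (a) can be sketched as a plain alpha blend of entity masks onto the current camera frame. This is a minimal illustration assuming boolean masks are already available (for example from a tracker); the function name `overlay_masks`, the entity names `reference` and `target`, the colors, and the blend weight are assumptions, not details taken from BrickCraft.

```python
import numpy as np


def overlay_masks(image, masks, colors, alpha=0.5):
    """Alpha-blend each boolean mask onto an RGB image in its own color."""
    out = image.astype(np.float32)  # astype returns a copy, so the input is untouched
    for name, mask in masks.items():
        color = np.asarray(colors[name], dtype=np.float32)
        # Boolean indexing selects the masked pixels as an (N, 3) array.
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color
    return out.round().astype(np.uint8)


# Toy 4x4 black frame with one pixel per entity (hypothetical entity names).
frame = np.zeros((4, 4, 3), dtype=np.uint8)
target = np.zeros((4, 4), dtype=bool)
target[0, 0] = True
reference = np.zeros((4, 4), dtype=bool)
reference[3, 3] = True

manual = overlay_masks(
    frame,
    {"target": target, "reference": reference},
    {"target": (0, 200, 0), "reference": (200, 0, 0)},
)
print(manual[0, 0], manual[3, 3])
```

Blending rather than replacing pixels keeps the underlying scene visible, so a policy that consumes the result still sees the real observation alongside the highlighted entities.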
We trained three visuomotor assembly skills on a broad range of demonstrations. The situated manual enables these skills to execute diverse assembly tasks with seamless adaptation to unseen structures.
By chaining spatially grounded skills into a composable execution pipeline, BrickCraft achieves fully autonomous, long-horizon assembly across various brick designs.
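Chaining skills into a pipeline can be sketched as a loop that dispatches each plan step to a skill callable. The step format, the stub skills, and the retry-then-abort behavior below are assumptions made for illustration; the source does not describe how BrickCraft handles failed steps.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]         # (skill name, brick id); a simplified, assumed format
Skill = Callable[[str], bool]  # returns True when the brick was placed successfully


def run_assembly(
    plan: List[Step], skills: Dict[str, Skill], max_retries: int = 1
) -> Tuple[List[str], bool]:
    """Execute steps in order; stop at the first step that keeps failing."""
    placed: List[str] = []
    for skill_name, brick in plan:
        skill = skills[skill_name]
        for _attempt in range(max_retries + 1):
            if skill(brick):
                placed.append(brick)
                break
        else:
            # Every attempt failed, so later steps would lack their reference brick.
            return placed, False
    return placed, True


attempts: Dict[str, int] = {}


def flaky_stack(brick: str) -> bool:
    """Stub skill that fails on the first attempt per brick, then succeeds."""
    attempts[brick] = attempts.get(brick, 0) + 1
    return attempts[brick] > 1


skills = {"stack_on": flaky_stack, "place_beside": lambda brick: True}
plan = [("place_beside", "b1"), ("stack_on", "b2")]
placed, ok = run_assembly(plan, skills)
print(placed, ok)
```

Aborting on a persistent failure reflects the dependency between steps: since each step is anchored to a brick already in the structure, a missing brick invalidates the steps that reference it.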
@misc{yu2026brickcraft,
  title={BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly},
  author={Jichuan Yu and Bowei Li and Zhenran Tang and Guanxing Lu and Chuxiong Hu and Ruixuan Liu and Changliu Liu},
  year={2026},
  eprint={2605.07605},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.07605}
}