ShapeWalk: Compositional Shape Editing through Language-Guided Chains

KAUST

ShapeWalk is a scalable synthetic dataset enabling the evaluation of fine-grained, language-guided compositional shape editing.

Abstract

We introduce ShapeWalk, a carefully curated dataset designed to advance the field of language-guided compositional shape editing. The dataset consists of 158K unique shapes connected through 26K edit chains, with an average length of 14 chained shapes. Each consecutive pair of shapes is associated with precise language instructions describing the applied edits. We synthesize edit chains by reconstructing and interpolating shapes sampled from a realistic CAD-designed 3D dataset in the parameter space of a shape program. We leverage rule-based methods and language models to generate natural language prompts corresponding to each edit. To illustrate the practicality of our contribution, we train neural editor modules in the latent space of shape autoencoders, and demonstrate the ability of our dataset to enable a variety of language-guided shape edits. Finally, we introduce multi-step editing metrics to benchmark the capacity of our models to perform recursive shape edits. We hope that our work will enable further study of compositional language-guided shape editing, and find applications in 3D CAD design and interactive modeling.

⚡ Pipeline

We introduce a synthetic dataset of chained shape edits, where each edge is paired with a language instruction and a ground-truth edited shape. We generate the dataset by reconstructing shapes sampled from a realistic CAD-designed 3D dataset and interpolating between them in the parameter space of a shape program.
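
As a minimal sketch of the chain-generation step, the snippet below linearly interpolates two recovered parameter vectors of a shape program. The flat dict-of-floats interface and parameter names like `leg_thickness` are illustrative assumptions, not the pipeline's actual API.

```python
import numpy as np

def interpolate_chain(params_a: dict, params_b: dict, n_steps: int = 14) -> list:
    """Linearly interpolate two shape-program parameter dicts into an edit chain.

    The dict-of-floats interface (keys like "leg_thickness") is a stand-in for
    the parameters recovered by shape-program reconstruction.
    """
    keys = sorted(params_a)
    a = np.array([params_a[k] for k in keys], dtype=float)
    b = np.array([params_b[k] for k in keys], dtype=float)
    # Each step of the chain is one intermediate shape; consecutive steps
    # differ by a small, well-defined parameter change.
    return [dict(zip(keys, (1.0 - t) * a + t * b))
            for t in np.linspace(0.0, 1.0, n_steps)]
```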

🎯 Targeted Edits

We compare the ShapeTalk dataset (top) with our work (bottom). For an equivalent edit instruction (in green), ShapeTalk provides pairs of shapes with many factors of variation, while our dataset provides pairs of shapes with a single, clearly varying factor, alongside an edit magnitude.
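
To make the single-factor property concrete, one chain edge could be recorded roughly as follows; every field name and value here is hypothetical, not the dataset's actual schema.

```python
# Hypothetical record for one chain edge.
edge = {
    "source_shape": "chair_00421_step03",
    "target_shape": "chair_00421_step04",
    "instruction": "make the legs slightly thinner",
    "varying_parameter": "leg_thickness",   # single factor of variation
    "edit_magnitude": -0.15,                # signed, in normalized parameter units
}
```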

🪑 Synthetic Dataset

We showcase generated synthetic chains using our method. For each chain edge, we show the corresponding language instruction describing the parameter changes necessary to transition from one shape to the next. Generated shape chains are realistic and diverse, and focus on fine-grained shape details.
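
A hedged sketch of the rule-based side of instruction generation: signed parameter deltas are bucketed by magnitude and filled into templates. The template strings and thresholds below are assumptions, not the dataset's exact rules. In the full pipeline, a language model can then rephrase such templated instructions into more natural prompts.

```python
TEMPLATES = {
    # Templates and magnitude buckets are illustrative assumptions.
    "leg_thickness": "make the legs {adverb} {direction}",
}

def describe_edit(param: str, delta: float) -> str:
    """Turn a signed parameter change into a templated instruction."""
    direction = "thicker" if delta > 0 else "thinner"
    adverb = "slightly" if abs(delta) < 0.2 else "significantly"
    return TEMPLATES[param].format(adverb=adverb, direction=direction)

print(describe_edit("leg_thickness", -0.1))  # make the legs slightly thinner
```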

♻️ Compositional Editing

Training a neural editor module in the latent space of a shape autoencoder on our synthetic dataset enables a variety of language-guided shape edits. Below, we show examples of edits learned from our synthetic data in the latent space of a simple point cloud autoencoder.
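
As a rough sketch of what such an editor module might look like, the PyTorch model below maps a shape latent and a text embedding to an edited latent. The dimensions, residual design, and MLP architecture are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class NeuralEditor(nn.Module):
    """Latent-space editor: (shape latent, text embedding) -> edited latent.

    Dimensions and architecture are assumptions; the paper's modules may differ.
    """
    def __init__(self, latent_dim: int = 256, text_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Predict a residual edit, so the identity edit is trivial to represent.
        return z + self.net(torch.cat([z, text_emb], dim=-1))

# Training pairs (z_src, instruction, z_tgt) come from consecutive chain shapes,
# e.g. loss = F.mse_loss(editor(z_src, embed(instruction)), z_tgt).
```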

We also experiment with a more complex transformer-based diffusion model operating in the latent space of 3DShape2VecSet, a global latent set representation. Below, we show generated meshes for a variety of language-guided edits, at every step of the diffusion process. The generated edits are of high quality, enabling fine-grained shape changes (e.g. decreasing leg thickness) as well as complex layout changes (e.g. adding a slatted backrest).
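
The multi-step editing metrics mentioned in the abstract can be sketched as follows: apply the editor recursively along an instruction chain, decode the final latent, and compare it against the ground-truth shape, e.g. with Chamfer distance. The helper names (`decoder`, `embed`, `chamfer`) are stand-ins, and the paper's metrics may also score intermediate steps.

```python
def multi_step_score(editor, decoder, embed, z0, instructions, gt_points, chamfer):
    """Recursively apply a latent-space editor along an instruction chain.

    `editor`, `decoder`, `embed`, and `chamfer` are stand-ins for the trained
    editor, a shape decoder, a text encoder, and a Chamfer-distance routine.
    """
    z = z0
    for instruction in instructions:
        z = editor(z, embed(instruction))   # feed each edited latent back in
    return chamfer(decoder(z), gt_points)   # score the final decoded shape
```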
