Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

¹Technion - Israel Institute of Technology    ²Tel-Aviv University

Sharp-It is a multi-view-to-multi-view model that enhances low-quality 3D shapes, correcting fine-grained geometric details and adding appearance features. Sharp-It enables various applications such as 3D object synthesis, editing, and controlled generation.

Sharp-It Teaser

A rainbow leather chair

A wooden tiki mask

Generation results produced with Shap-E and refined using Sharp-It. The prompts are written below each video.

Abstract

Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower-quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D-consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable way to create high-quality 3D content. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

Geometric Enhancement

Through its training process, Sharp-It learns to refine geometric details and fix "floating" artifacts in the Shap-E representation, turning it into a high-quality 3D object, as can be seen in the dragon example below.

Sharp-It leverages attention mechanisms to keep corresponding points consistent across the multiple views: self-attention layers relate key geometric features, such as aligning wheels across car views, while cross-attention layers integrate textual guidance for semantic accuracy. Together, these mechanisms deliver both fine-grained detail and cross-view geometric coherence.
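
To make the cross-view mechanism concrete, here is a minimal PyTorch sketch of multi-view self-attention. It is illustrative rather than the authors' implementation; the module name, dimensions, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

# Minimal multi-view self-attention sketch (illustrative, not the
# authors' code): tokens from all views are folded into one sequence so
# attention can relate corresponding points, e.g. a wheel, across views.
class MultiViewSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) feature tokens per view
        b, v, t, d = x.shape
        x = x.reshape(b, v * t, d)       # fold the view axis into the sequence
        out, _ = self.attn(x, x, x)      # every token attends to all views
        return out.reshape(b, v, t, d)

# Example: 6 views, 16x16 = 256 latent tokens each, 320-dim features
mv_attn = MultiViewSelfAttention(dim=320)
feats = torch.randn(1, 6, 256, 320)
print(mv_attn(feats).shape)  # torch.Size([1, 6, 256, 320])
```

Folding the view axis into the token sequence is a common way to let a pretrained 2D attention layer reason across views without changing its architecture.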

Appearance Editing

A golden jewelry box ➜ A suede leather jewelry box

A leather chair ➜ A leopard print leather chair

A glass city tower ➜ A wooden tower

Object Generation

Scroll for more results. All results shown on this page are videos and might load slowly.

Object Editing

Edit-Friendly DDPM Inversion

A blue beetle car ➜ A blue SUV

A blue beetle car ➜ A turquoise beetle car

Edit-Friendly DDPM Inversion enables both reconstruction and diverse editing operations by extracting noise maps from images and reusing them in the denoising process. This allows meaningful manipulations while preserving structure, enabling text-based editing without model fine-tuning.
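
As a rough illustration of the idea, the sketch below extracts edit-friendly noise maps from a clean latent. Here `predict_mu` and `sigma` are placeholders for a real denoiser and noise schedule, and the function signature is an assumption, not the published implementation.

```python
import torch

def edit_friendly_inversion(x0, alphas_bar, predict_mu, sigma):
    """x0: clean latent; alphas_bar: 1-D tensor of cumulative alphas;
    predict_mu(x_t, t) and sigma(t) stand in for the model's posterior
    mean and the scheduler's noise scale (placeholders)."""
    T = len(alphas_bar)
    # Build each x_t with independent noise (not one shared epsilon);
    # imprinting the input into every x_t is what makes the extracted
    # noise maps "edit friendly".
    xs = [x0]
    for t in range(1, T):
        eps = torch.randn_like(x0)
        xs.append(alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps)
    # Solve the DDPM step x_{t-1} = mu_t(x_t) + sigma_t * z_t for z_t, so
    # that replaying these z_t (e.g. with a new prompt) keeps structure.
    zs = {}
    for t in range(T - 1, 0, -1):
        zs[t] = (xs[t - 1] - predict_mu(xs[t], t)) / sigma(t)
    return xs[-1], zs  # most-noised latent and noise maps to reuse
```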

Shap-Editor

A table lamp ➜ A santa table lamp

A table lamp ➜ A golden table lamp

Shap-Editor is a feed-forward framework for editing 3D assets in Shap-E’s latent space. It provides fast, text-guided edits without optimization.
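
For intuition only, a toy feed-forward latent editor could look like the sketch below. This is a hypothetical illustration, not Shap-Editor's released architecture; the dimensions, layers, and residual formulation are all assumptions.

```python
import torch
import torch.nn as nn

# Toy feed-forward latent editor (hypothetical; not Shap-Editor's code):
# maps a 3D-asset latent plus a text embedding to an edited latent in a
# single forward pass, so no per-asset optimization is needed.
class LatentEditor(nn.Module):
    def __init__(self, latent_dim=1024, text_dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, latent, text_emb):
        # Predict a residual edit conditioned on the text instruction.
        delta = self.net(torch.cat([latent, text_emb], dim=-1))
        return latent + delta

editor = LatentEditor()
edited = editor(torch.randn(1, 1024), torch.randn(1, 768))
```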

How Sharp-It Works

  • First, a 3D object is generated with Shap-E, and six views of this low-quality object are rendered. Sharp-It, a diffusion model based on Stable Diffusion, then enhances these views under the guidance of a text prompt, refining geometry and adding detailed appearance. It employs cross-attention layers for text-based guidance and self-attention layers for cross-view consistency. Finally, a high-quality 3D object is reconstructed from the enriched multi-view set (see the pipeline sketch after this list).
  • Self-attention maps for a query point (red) on the car’s wheel, showing highest attention weights at corresponding wheel locations across different views.
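
Putting the steps together, the following sketch outlines the pipeline. All helper functions are hypothetical stand-ins for the components described above, passed in as callables; none of them are a published API.

```python
# Hypothetical pipeline sketch: each callable stands in for one stage of
# the system described above (Shap-E generation, rendering, multi-view
# refinement, reconstruction); none of these names are a published API.
def generate_high_quality_asset(prompt, shap_e_generate, render_views,
                                sharp_it_enhance, reconstruct_3d):
    coarse = shap_e_generate(prompt)             # low-quality 3D object
    views = render_views(coarse, n_views=6)      # six fixed camera views
    refined = sharp_it_enhance(views, prompt)    # multi-view diffusion refinement
    return reconstruct_3d(refined)               # rebuild a high-quality 3D model
```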