Know3D AI Reveals the Hidden Sides of 3D Objects for Editing
Know3D pairs an image generation model with Qwen2.5‑VL to edit the hidden sides of 3D objects
Editing the unseen facets of a 3D model has long been a stumbling block for creators who rely on natural‑language interfaces. Traditional pipelines let a language model describe a shape, then hand the description straight to a geometry engine, but they fall short when the prompt asks for changes on surfaces that never appear in the rendered view. That limitation forces users to manually sculpt the back side or settle for generic approximations, slowing iteration and raising the bar for technical skill.
Know3D’s latest experiment tackles the problem by inserting an intermediate visual step, effectively converting textual intent into a picture before the geometry stage. By doing so, the system can infer texture, shading and structure that were previously invisible to the language component. The approach promises a smoother workflow for designers who want to tweak hidden geometry with simple prompts, and it raises questions about how best to combine language, vision and 3D synthesis.
The details of the architecture are laid out in the following passage.
So Know3D takes a detour, slotting an image generation model between the language model and the 3D generator to act as a translator. The setup uses Qwen2.5-VL as the language model, Qwen-Image-Edit for image generation, and Microsoft's Trellis.2 as the 3D generator. The language model reads the text instruction and analyzes the input image.
The image generator then turns that understanding into spatial-structural information that steers the 3D generator. The trick is figuring out which information to pull from the image generator. The team tested three options: an internal image representation grabbed right before the final output, features extracted from that final image with Meta's DINOv3, and the model's internal intermediate states captured during generation.
The last option won by a clear margin: the intermediate states carry both semantic and spatial information without inheriting the pixel-level inaccuracies or outright mistakes that can appear in a finished image.
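The three-stage wiring described above can be sketched in miniature. Everything in this snippet is illustrative: the stand-in functions `language_model`, `image_generator`, and `three_d_generator` are assumptions for exposition, not the paper's actual API, and toy arithmetic replaces the real models. The point is the data flow, with the image stage's intermediate states, rather than its final image, conditioning the 3D stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def language_model(instruction: str, front_view: np.ndarray) -> np.ndarray:
    """Stand-in for Qwen2.5-VL: fuse the text instruction with the input
    view into one conditioning vector (toy math, not the real model)."""
    text_emb = rng.standard_normal(64)    # pretend text embedding
    return text_emb + front_view.mean()   # pretend visual grounding

def image_generator(cond: np.ndarray, n_steps: int = 4):
    """Stand-in for Qwen-Image-Edit: run a few 'generation' steps and
    record the intermediate state after each one, plus the final output."""
    trace, h = [], cond
    for _ in range(n_steps):
        h = np.tanh(h + 0.1 * rng.standard_normal(h.shape))
        trace.append(h.copy())
    return h, trace   # final image latent, intermediate states

def three_d_generator(front_view: np.ndarray, hidden_features: list) -> dict:
    """Stand-in for Trellis.2: condition on the visible view plus the
    intermediate features describing the unseen side."""
    conditioning = np.concatenate([front_view.ravel(), *hidden_features])
    return {"conditioning_dim": conditioning.shape[0]}

# Wire the stages together, passing the intermediate states forward
# instead of the final image (the variant the team found worked best).
front = rng.standard_normal((8, 8))
cond = language_model("make the back panel ribbed", front)
final_latent, intermediates = image_generator(cond)
mesh = three_d_generator(front, intermediates)
```

The other two options the team tested would swap what flows into `three_d_generator`: either `final_latent` alone (the representation just before output) or an external encoding of the rendered image (DINOv3 in the article).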
Know3D demonstrates a concrete step toward filling the blind spot that has long hampered single‑image 3D reconstruction. By inserting Qwen‑Image‑Edit between Qwen2.5‑VL and Microsoft’s Trellis.2, the system translates textual cues into visual suggestions for the unseen back side, then feeds those into the 3D generator. The approach is clever: the language model parses the prompt, the image model produces a plausible hidden view, and the 3D engine integrates it with the front‑facing geometry.
Early results suggest users can steer the hidden surface with simple text, something that previously required manual sculpting or multi‑view inputs. Yet the paper offers limited quantitative evaluation, so it remains unclear how consistently the method handles complex shapes or ambiguous prompts. Moreover, the reliance on a separate image‑generation step could introduce artifacts that propagate into the final mesh.
In practice, the utility of Know3D will depend on how robustly it generalizes beyond the examples shown. For now, the prototype underscores the potential of language‑guided image synthesis as a bridge in 3D generation pipelines, while leaving open questions about scalability and reliability.
Further Reading
- Know3D: Prompting 3D Generation with Knowledge from Vision Language Models - arXiv
- Use Qwen2.5-VL for Zero-Shot Object Detection - Roboflow Blog
- Qwen2.5-VL: A hands-on code walkthrough - Towards AI
- Qwen2.5 VL Official Release - Qwen Official Blog
- Qwen Image Edit Full Tutorial: 26 Different Demo Cases, Prompts - YouTube (secourses)
Common Questions Answered
How does Know3D solve the challenge of editing hidden sides of 3D objects?
Know3D introduces an innovative approach by inserting an image generation model between the language model and 3D generator. The system uses Qwen2.5-VL to understand text instructions, Qwen-Image-Edit to generate spatial-structural information about unseen surfaces, and Microsoft's Trellis.2 to integrate these suggestions into the 3D model.
What specific models are used in the Know3D pipeline for 3D object editing?
The Know3D system utilizes three key models: Qwen2.5-VL as the language model to interpret text instructions, Qwen-Image-Edit for generating image-based spatial suggestions, and Microsoft's Trellis.2 as the 3D generator to incorporate those suggestions into the final 3D object. This multi-model approach allows for more sophisticated editing of previously unseen object surfaces.
Why is editing hidden sides of 3D models traditionally difficult?
Traditional 3D modeling pipelines struggle to edit surfaces that are not visible in the initial rendering, forcing creators to manually sculpt back sides or accept generic approximations. This limitation significantly slows down the creative process and requires advanced technical skills to overcome.