Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, frequently work beyond language, drawing inspiration directly from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, and serve as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong yet underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
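To make the training objective concrete, the following is a minimal sketch of a single flow-matching step over the IP+ representation, assuming the embeddings are precomputed. The names `prior_net`, `parts`, and `full` are illustrative placeholders, not the actual IP-Prior architecture or conditioning scheme.

```python
# Minimal sketch of one flow-matching training step for a prior over IP+ embeddings.
# Assumptions (illustrative): `prior_net` is any velocity-prediction network over
# token sequences; `parts` / `full` are precomputed IP+ embeddings of the partial
# inputs and the complete concept, both shaped (batch, tokens, dim).
import torch
import torch.nn.functional as F

def flow_matching_step(prior_net, parts, full, optimizer):
    b = full.shape[0]
    t = torch.rand(b, 1, 1, device=full.device)      # per-sample time in [0, 1]
    x0 = torch.randn_like(full)                       # Gaussian source sample
    xt = (1.0 - t) * x0 + t * full                    # point on the straight path x0 -> full
    v_target = full - x0                              # constant velocity of that path
    v_pred = prior_net(xt, t.view(b), cond=parts)     # condition on the provided parts
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the learned velocity field would be integrated from Gaussian noise (for example with a few Euler steps) while the embeddings of the user-provided parts are kept fixed as conditioning.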
Sample images generated using FLUX-Schnell, which are used to train our IP-Prior model.
The CLIP space is well suited for semantic manipulations but limited in its ability to preserve complex concepts, resulting in a loss of detail. This intuitively stems from the fact that CLIP was never trained to reconstruct images but rather to learn a joint representation space for text and images. While this encourages a semantic representation, it does not require the representation to encode visual details that cannot be easily described through text.

To improve on this, we explore alternative spaces and ultimately converge on the internal representation of IP-Adapter+. Using this IP+ space not only results in improved reconstructions but also retains the ability to perform semantic manipulations, and thus it can serve as an effective representation for visual concepts.
We encode the input image (left) into two different embedding spaces, modify its latent representation by traversing each space, and render the edited image using SDXL. As shown, CLIP struggles to both reconstruct the concept and follow the desired edit, whereas in IP+ space, the rendered images are faithful both to the concept and the desired edit across the entire range.
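A traversal of this kind can be approximated with off-the-shelf components. The sketch below uses the IP-Adapter Plus (SDXL) checkpoint integrated in diffusers; the input file, the edit direction, and the traversal range are illustrative assumptions, and the exact `prepare_ip_adapter_image_embeds` signature varies across diffusers versions.

```python
# Sketch: encode an image into the IP+ space, shift it along a direction, and
# render with SDXL. The direction file and scales are illustrative assumptions.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

# IP-Adapter Plus uses a ViT-H image encoder that must be supplied explicitly.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder, torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)

image = load_image("concept.png")  # hypothetical input concept
embeds = pipe.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image, ip_adapter_image_embeds=None,
    device="cuda", num_images_per_prompt=1, do_classifier_free_guidance=True,
)
# Hypothetical precomputed edit direction with the same shape as the positive embeds.
direction = torch.load("edit_direction.pt").to("cuda", torch.float16)

for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0):
    shifted = []
    for e in embeds:                      # one tensor per loaded IP-Adapter
        neg, pos = e.chunk(2)             # CFG layout: [unconditional, conditional]
        shifted.append(torch.cat([neg, pos + alpha * direction]))
    out = pipe(prompt="", ip_adapter_image_embeds=shifted,
               num_inference_steps=30).images[0]
    out.save(f"traversal_{alpha:+.1f}.png")
```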
IP-Adapter+ enables rendering generated concepts via SDXL but often struggles with text adherence. To address this, we fine-tune a LoRA adapter over paired examples, where the conditioning image has a clean background and the target image places the object in a scene described using a text prompt. This lightweight training (using just 50 prompts) effectively restores text control while maintaining visual fidelity.
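A rough sketch of this tuning stage is given below: a LoRA adapter is attached to the SDXL UNet with peft and trained with the standard denoising objective on (clean-background image, in-scene image, prompt) triplets. The data loader, the `encode_prompt` and `encode_ip_condition` helpers, and the rank and learning rate are placeholders and assumptions, not the exact training recipe.

```python
# Sketch of LoRA fine-tuning over paired examples. `paired_loader`, `encode_prompt`,
# and `encode_ip_condition` are placeholders; rank and learning rate are illustrative.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionXLPipeline, DDPMScheduler
from peft import LoraConfig

base = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLPipeline.from_pretrained(base, torch_dtype=torch.float32)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
unet, vae = pipe.unet, pipe.vae
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Freeze the base UNet and train only the low-rank adapter on the attention projections.
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(r=16, lora_alpha=16,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

for clean_img, scene_img, prompt in paired_loader:            # ~50 prompt pairs
    with torch.no_grad():
        latents = vae.encode(scene_img).latent_dist.sample() * vae.config.scaling_factor
        text_embeds, pooled = encode_prompt(prompt)           # SDXL dual text encoders (placeholder)
        ip_cond = encode_ip_condition(clean_img)              # ViT features of the clean-background
                                                              # image, as expected by the IP+ projection
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, timesteps)

    time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]],
                            device=latents.device).repeat(latents.shape[0], 1)
    added = {"text_embeds": pooled, "time_ids": time_ids,
             "image_embeds": [ip_cond]}                       # routed to the IP-Adapter attention layers
    pred = unet(noisy, timesteps, encoder_hidden_states=text_embeds,
                added_cond_kwargs=added).sample
    loss = F.mse_loss(pred, noise)                            # epsilon-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```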
The same tuning mechanism can also be used to impose a specific style on the outputs of the SDXL model when it is conditioned on the same concept embeddings.