Piece it Together: Part-Based Concepting with IP-Priors

teaser image

Using a dedicated prior for the target domain, our method, Piece it Together (PiT), seamlessly integrates the given elements into a coherent composition while generating the missing pieces needed for the complete concept to reside in the prior domain.

Abstract

Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.

PiT Results

Result examples: sealrat, suncrest_lizard, fruit


The IP-Prior Architecture

Given an input image, we extract its semantic components, sample a subset, and encode each image patch into the IP+ space using a frozen IP-Adapter+. The resulting embeddings are passed together through our IP-Prior model, which outputs a cleaned image embedding that captures the intended concept; from this embedding we render the concept image with SDXL.
At inference time, users can provide a varying number of object-part images to generate a new concept that aligns with the learned distribution.
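
To make this flow concrete, here is a hedged sketch of the inference pipeline, assuming a simple Euler integration of the flow-matching prior. The helper names (encode_image, output_shape, sdxl_render) are placeholders for illustration, not the released PiT API.

```python
import torch

@torch.no_grad()
def piece_it_together(part_images, ip_adapter_plus, ip_prior, sdxl_render, steps=50):
    # 1. Encode each user-provided part into the IP+ space with the frozen IP-Adapter+.
    part_tokens = [ip_adapter_plus.encode_image(img) for img in part_images]  # each: (n_tokens, d)
    condition = torch.cat(part_tokens, dim=0)                                 # variable-length conditioning

    # 2. Sample a complete concept embedding with the flow-matching IP-Prior by
    #    integrating its velocity field from noise (t=0) to data (t=1).
    x = torch.randn(ip_prior.output_shape)                                    # noisy IP+ embedding (placeholder shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        x = x + dt * ip_prior(x, t, condition)                                # predicted velocity, Euler step

    # 3. Decode the cleaned concept embedding into an image with SDXL through
    #    the IP-Adapter+ conditioning pathway.
    return sdxl_render(x)
```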

Generated Data Samples


Sample images generated with FLUX-Schnell and used to train our IP-Prior model.
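
As a rough illustration of how such a dataset can be produced, the sketch below uses the public FLUX.1-schnell checkpoint through diffusers; the prompts and sampling settings shown are illustrative placeholders rather than the exact ones used to build our training data.

```python
import torch
from diffusers import FluxPipeline

# Load the public FLUX.1-schnell checkpoint (few-step, guidance-free sampling).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder domain-specific prompts.
prompts = [
    "a full-body photo of an imaginative fantasy creature on a plain background",
    # ... more prompts covering the target domain
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
    image.save(f"ip_prior_data_{i:05d}.png")
```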

The IP+ Space

The CLIP space is well suited for semantic manipulations but limited in its ability to preserve complex concepts, resulting in a loss of detail. This intuitively stems from the fact that CLIP was never trained to reconstruct images but rather to learn a joint representation space for text and images. While this encourages a semantic representation, it does not require the representation to encode visual details that cannot be easily described through text.
To improve on this, we explore alternative spaces and ultimately converge on the internal representation of IP-Adapter+. This IP+ space not only yields improved reconstructions but also retains the ability to perform semantic manipulations, and can thus serve as an effective representation for visual concepts.
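
For reference, the IP+ conditioning pathway is exposed by the public diffusers IP-Adapter integration. The minimal sketch below renders a concept from an input image with a frozen IP-Adapter+ on SDXL, using the commonly released checkpoints, which may differ from our exact setup.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

# ViT-H image encoder used by the IP-Adapter "plus" variants.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)
pipe.set_ip_adapter_scale(1.0)

concept = load_image("concept.png")  # placeholder input image
image = pipe(prompt="", ip_adapter_image=concept, num_inference_steps=30).images[0]
```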

Semantic Manipulations in the IP+ Space


We encode the input image (left) into two different embedding spaces, modify its latent representation by traversing each space, and render the edited image using SDXL. As shown, CLIP struggles to both reconstruct the concept and follow the desired edit, whereas in IP+ space, the rendered images are faithful both to the concept and the desired edit across the entire range.
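
The sketch below illustrates this kind of traversal: a direction is computed between two reference embeddings and added to the source embedding at several magnitudes before rendering. The encode and render helpers are placeholders for the IP-Adapter+ encoder and SDXL decoding, not a released API.

```python
import torch

def traverse(encode, render, source_img, ref_a, ref_b,
             alphas=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    source = encode(source_img)                 # (n_tokens, d) embedding of the input
    direction = encode(ref_b) - encode(ref_a)   # e.g. "old" minus "young" exemplars
    direction = direction / direction.norm()    # unit edit direction
    # Render the edited embedding at each traversal magnitude.
    return [render(source + a * direction) for a in alphas]
```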

Additional Examples


Recovering Text Adherence with IP-LoRA


IP-Adapter+ enables rendering generated concepts via SDXL but often struggles with text adherence. To address this, we fine-tune a LoRA adapter over paired examples, where the conditioning image has a clean background and the target image places the object in a scene described using a text prompt. This lightweight training (using just 50 prompts) effectively restores text control while maintaining visual fidelity.
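
A minimal sketch of this setup, assuming the standard peft/diffusers LoRA tooling and reusing `pipe` from the IP-Adapter+ sketch above, is shown below; the rank, target modules, and optimizer settings are assumptions rather than the exact training configuration.

```python
import torch
from peft import LoraConfig

# Attach a LoRA adapter to the SDXL UNet attention projections (assumed rank/targets).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.unet.add_adapter(lora_config)

# Fine-tuning then follows the standard denoising objective: the clean-background
# part image conditions the frozen IP-Adapter+ branch, the scene prompt conditions
# the text encoders, and only the LoRA parameters receive gradients.
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```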

Styled Generation


The same tuning mechanism can be used to impose a specific style on the outputs of the SDXL model when it is conditioned on the same concept embedding inputs.
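
Assuming the style adapter is exported as a standard LoRA checkpoint, it can be attached to the same SDXL pipeline at inference time with the stock diffusers API; the path below is a placeholder.

```python
# `pipe` and `concept` come from the IP-Adapter+ sketch above; the LoRA path is a placeholder.
pipe.load_lora_weights("path/to/style_lora", weight_name="pytorch_lora_weights.safetensors")
styled = pipe(prompt="", ip_adapter_image=concept, num_inference_steps=30).images[0]
```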

Additional Results


Multiple Priors

Given a single input part, we generate concepts across different learned IP-Prior models, highlighting how each model naturally interprets and adapts the part according to its learned distribution.

Single Inputs

Concepts generated by PiT using a single input part, showcasing the variation across the generated results.