Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, frequently work beyond language, drawing inspiration directly from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, and serve as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong yet underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
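To make the training objective concrete, the following is a minimal sketch of a single flow-matching step over the IP+ representation, assuming the embeddings are precomputed. The names `prior_net`, `parts`, and `full` are illustrative placeholders, not the actual IP-Prior architecture or conditioning scheme.

```python
# Minimal sketch of one flow-matching training step for a prior over IP+ embeddings.
# Assumptions (illustrative): `prior_net` is any velocity-prediction network over
# token sequences; `parts` / `full` are precomputed IP+ embeddings of the partial
# inputs and the complete concept, both shaped (batch, tokens, dim).
import torch
import torch.nn.functional as F

def flow_matching_step(prior_net, parts, full, optimizer):
    b = full.shape[0]
    t = torch.rand(b, 1, 1, device=full.device)      # per-sample time in [0, 1]
    x0 = torch.randn_like(full)                       # Gaussian source sample
    xt = (1.0 - t) * x0 + t * full                    # point on the straight path x0 -> full
    v_target = full - x0                              # constant velocity of that path
    v_pred = prior_net(xt, t.view(b), cond=parts)     # condition on the provided parts
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the learned velocity field would be integrated from Gaussian noise (for example with a few Euler steps) while the embeddings of the user-provided parts are kept fixed as conditioning.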
Sample images generated using FLUX-Schnell, which are used to train our IP-Prior model.
The CLIP space is well suited for semantic manipulations but limited in its ability to preserve complex concepts, resulting in a loss of detail. This intuitively stems from the fact that CLIP was never trained to reconstruct images but rather to learn a joint representation space for text and images. While this encourages a semantic representation, it does not require the representation to encode visual details that cannot be easily described through text.

To improve on this, we explore alternative spaces and ultimately converge on the internal representation of IP-Adapter+. Using this IP+ space not only results in improved reconstructions but also retains the ability to perform semantic manipulations, and thus it can serve as an effective representation for visual concepts.
We encode the input image (left) into two different embedding spaces, modify its latent representation by traversing each space, and render the edited image using SDXL. As shown, CLIP struggles to both reconstruct the concept and follow the desired edit, whereas in IP+ space, the rendered images are faithful both to the concept and the desired edit across the entire range.
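A traversal of this kind can be approximated with off-the-shelf components. The sketch below uses the IP-Adapter Plus (SDXL) checkpoint integrated in diffusers; the input file, the edit direction, and the traversal range are illustrative assumptions, and the exact `prepare_ip_adapter_image_embeds` signature varies across diffusers versions.

```python
# Sketch: encode an image into the IP+ space, shift it along a direction, and
# render with SDXL. The direction file and scales are illustrative assumptions.
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection

# IP-Adapter Plus uses a ViT-H image encoder that must be supplied explicitly.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder, torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models",
    weight_name="ip-adapter-plus_sdxl_vit-h.safetensors",
)

image = load_image("concept.png")  # hypothetical input concept
embeds = pipe.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image, ip_adapter_image_embeds=None,
    device="cuda", num_images_per_prompt=1, do_classifier_free_guidance=True,
)
# Hypothetical precomputed edit direction with the same shape as the positive embeds.
direction = torch.load("edit_direction.pt").to("cuda", torch.float16)

for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0):
    shifted = []
    for e in embeds:                      # one tensor per loaded IP-Adapter
        neg, pos = e.chunk(2)             # CFG layout: [unconditional, conditional]
        shifted.append(torch.cat([neg, pos + alpha * direction]))
    out = pipe(prompt="", ip_adapter_image_embeds=shifted,
               num_inference_steps=30).images[0]
    out.save(f"traversal_{alpha:+.1f}.png")
```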
IP-Adapter+ enables rendering generated concepts via SDXL but often struggles with text adherence. To address this, we fine-tune a LoRA adapter over paired examples, where the conditioning image has a clean background and the target image places the object in a scene described using a text prompt. This lightweight training (using just 50 prompts) effectively restores text control while maintaining visual fidelity.
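A rough sketch of this tuning stage is given below: a LoRA adapter is attached to the SDXL UNet with peft and trained with the standard denoising objective on (clean-background image, in-scene image, prompt) triplets. The data loader, the `encode_prompt` and `encode_ip_condition` helpers, and the rank and learning rate are placeholders and assumptions, not the exact training recipe.

```python
# Sketch of LoRA fine-tuning over paired examples. `paired_loader`, `encode_prompt`,
# and `encode_ip_condition` are placeholders; rank and learning rate are illustrative.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionXLPipeline, DDPMScheduler
from peft import LoraConfig

base = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLPipeline.from_pretrained(base, torch_dtype=torch.float32)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
unet, vae = pipe.unet, pipe.vae
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Freeze the base UNet and train only the low-rank adapter on the attention projections.
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(r=16, lora_alpha=16,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

for clean_img, scene_img, prompt in paired_loader:            # ~50 prompt pairs
    with torch.no_grad():
        latents = vae.encode(scene_img).latent_dist.sample() * vae.config.scaling_factor
        text_embeds, pooled = encode_prompt(prompt)           # SDXL dual text encoders (placeholder)
        ip_cond = encode_ip_condition(clean_img)              # ViT features of the clean-background
                                                              # image, as expected by the IP+ projection
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, timesteps)

    time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]],
                            device=latents.device).repeat(latents.shape[0], 1)
    added = {"text_embeds": pooled, "time_ids": time_ids,
             "image_embeds": [ip_cond]}                       # routed to the IP-Adapter attention layers
    pred = unet(noisy, timesteps, encoder_hidden_states=text_embeds,
                added_cond_kwargs=added).sample
    loss = F.mse_loss(pred, noise)                            # epsilon-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```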
The same tuning mechanism can also be used to impose a specific style on the outputs of the SDXL model when it is conditioned on the same concept embeddings.