How we built it
This feature is made possible thanks to a generative AI technology we created specifically for virtual try-on (VTO), which uses a technique based on diffusion. Diffusion lets us generate every pixel from scratch to produce high-quality, realistic images of tops and blouses on models. As we tested our diffusion technique for dresses, though, we learned there are two unique challenges: First, dresses are usually a more nuanced garment, and second, dresses tend to cover more of the human body.
Let’s start with the first problem: Dresses are often more detailed than a simple top in their draping, silhouette, length or shape, and they range from midi-length halters to mini shifts to maxi drop waists, plus everything in between. Imagine you’re trying to paint a detailed dress on a tiny canvas: it’d be hard to squeeze details like a floral print or ruffled collar into that small space. Enlarging the image won’t make details clearer, either, because they weren’t even visible in the first place. You can think of our VTO challenge in the same way: Our existing VTO AI model successfully diffused using low-resolution images, but in our testing with dresses, this approach often resulted in the loss of a dress’s critical details, and simply switching to high resolution didn’t help. So our research team came up with what’s called a “progressive training strategy” for VTO, where the diffusion model begins training on lower-resolution images and gradually trains on higher resolutions. With this approach, the finer details are reflected, so every pleat and print comes through crystal clear.
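To make the low-to-high idea concrete, here is a minimal sketch of a resolution schedule for training. It is an illustration only, not the actual VTO training code: the function name, the equal-length stages and the specific resolutions (128, 256, 512) are all assumptions chosen for the example.

```python
def progressive_resolution_schedule(total_steps, resolutions=(128, 256, 512)):
    """Return the image resolution to train on at each step, low to high.

    Hypothetical helper: the resolutions and the equal-length stages are
    assumptions for illustration, not values from the VTO model.
    """
    if total_steps < len(resolutions):
        raise ValueError("need at least one training step per resolution")

    # Split training into one stage per resolution.
    stage_length = total_steps // len(resolutions)

    schedule = []
    for step in range(total_steps):
        # Leftover steps at the end stay at the highest resolution.
        stage = min(step // stage_length, len(resolutions) - 1)
        schedule.append(resolutions[stage])
    return schedule


if __name__ == "__main__":
    # Six steps split evenly across three resolutions.
    print(progressive_resolution_schedule(6))
```

In a real training loop, each step would resize its batch of images to the scheduled resolution before the model sees it, so coarse structure is learned first and fine detail later.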
Next, since dresses cover more of a person’s body than tops, we found that “erasing” and “replacing” the dress on a person would smudge the person’s features or obscure important details of their body, much like it would if you were painting a portrait of someone and later tried to erase and replace their dress. To prevent this “identity loss” from happening, we came up with a new technique called the VTO-UNet Diffusion Transformer (VTO-UDiT for short), which isolates and preserves a person’s important features. So while we train the model with “identity loss” in place, VTO-UDiT also gives us a virtual “stencil,” allowing us to re-train the model on only the person, preserving the person’s face and body. This gives us a much more accurate portrayal of not only the dress but, just as important, the person wearing it.
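One way to picture the “stencil” is as a mask that tells training which pixels belong to the person. The sketch below scores an image only on the masked pixels, so errors outside the stencil are ignored. It is a loose illustration under our own assumptions, not how VTO-UDiT is implemented: the function name, the mean-squared-error measure and the tiny grayscale grids are all hypothetical.

```python
def masked_mean_squared_error(predicted, target, stencil):
    """Average squared error over pixels where the stencil is set.

    Hypothetical illustration: images are 2D lists of grayscale values,
    and the stencil is a 2D list of 0/1 flags marking "person" pixels.
    """
    total = 0.0
    count = 0
    for pred_row, target_row, stencil_row in zip(predicted, target, stencil):
        for pred, tgt, keep in zip(pred_row, target_row, stencil_row):
            if keep:  # only pixels inside the stencil contribute
                total += (pred - tgt) ** 2
                count += 1
    # An empty stencil contributes no error.
    return total / count if count else 0.0


if __name__ == "__main__":
    predicted = [[0.0, 1.0], [1.0, 0.0]]
    target = [[0.0, 0.0], [0.5, 0.0]]
    stencil = [[1, 0], [1, 0]]  # left column marks the person
    print(masked_mean_squared_error(predicted, target, stencil))
```

Here the large error in the top-right pixel is ignored because it falls outside the stencil, which mirrors the idea of focusing training on the person alone.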