DALL-E 3 doesn't have Stable Diffusion's killer feature, which is the ability to use an image as input and influence that image with the prompt.
(DALL-E pretends to do that, but it's actually just using GPT-4 Vision to create a description of the image and then prompting based on that.)
Live editing tools like https://drawfast.tldraw.com/ are increasingly being built on top of Stable Diffusion, and are far and away the most interesting way to interact with image generation models. You can't build that on DALL-E 3.
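For anyone who hasn't tried it, here's roughly what that img2img workflow looks like scripted with Hugging Face diffusers. This is a minimal sketch, not how drawfast itself is built; the checkpoint, file names, and strength value are placeholder choices:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # Load an SD 1.5 checkpoint (any SD img2img-capable checkpoint works)
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # The input image steers the composition; the prompt steers content/style
    init = Image.open("sketch.png").convert("RGB").resize((768, 512))

    # strength controls how far the model may drift from the input image
    out = pipe(
        prompt="a watercolor landscape, soft light",
        image=init,
        strength=0.6,
        guidance_scale=7.5,
    ).images[0]
    out.save("out.png")

Live tools presumably run something like this in a tight loop with a fast distilled model and low step counts as you draw.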
Saying SD is losing or not useful isn't my position.
But it clearly didn't win in many scenarios, especially those that require precise text, and that happens to matter more in commercial settings. Cleaning up the gibberish text that open-source Stable Diffusion generates is tiring by itself.
If you’re in charge of graphics in a “commercial setting”, you 100% couldn’t care less about text and likely do not want txt2img to include text at all. #1 it’s about the easiest thing to deal with in Photoshop, #2 you likely want to have complete control over text placement/fonts etc., #3 you actually have to have licenses for fonts, especially for commercial purposes. Using a random font from a txt2img generator can open you up to IP litigation.
I think because most people are used to the DALL-E and Midjourney user experience, they don't know what they're missing. In my experience SD is just as good in terms of "understanding" but offers way more features when used with something like AUTOMATIC1111.
If you're just generating something for fun, then DALL-E/MJ is probably sufficient, but if you're doing a project that requires specific details/style/consistency, you're going to need way more tools. With SD/A1111 you can use a specific model (one that generates images in an anime style, for instance), use a ControlNet model for a specific pose, generate hundreds of candidate images (without having to pay for each one), use other tools like img2img/inpaint to hone your vision with the images you like, and if you're looking for a specific effect (like a gif, for instance), you can use the many extensions created by the community to make it happen.
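A1111 exposes all of that through its UI, but to make the ControlNet part concrete, here's a rough equivalent scripted with Hugging Face diffusers. A sketch, assuming the OpenPose ControlNet checkpoint and a pose image you've already extracted; the prompt and seed range are placeholders:

    import torch
    from PIL import Image
    from diffusers import (
        StableDiffusionControlNetPipeline,
        ControlNetModel,
        UniPCMultistepScheduler,
    )

    # OpenPose-conditioned ControlNet: the pose image pins down the pose
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )

    # The base model can be swapped for any SD 1.5-family checkpoint,
    # e.g. an anime-style model, without touching the rest of the script
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

    pose = Image.open("pose.png")  # a pre-extracted OpenPose skeleton image

    # Generate a batch of candidates for free; vary the seed to explore
    for seed in range(8):
        g = torch.Generator("cuda").manual_seed(seed)
        img = pipe("portrait, anime style, detailed", image=pose,
                   generator=g).images[0]
        img.save(f"candidate_{seed}.png")

The same loop-and-pick workflow is what the "generate hundreds of images" point amounts to: since inference is local, iterating costs you GPU time, not per-image fees.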
> Dalle3 and this is miles ahead in understanding the scene and putting correct text in the right place.
I guess that turns out not to be as important for end users as you'd think.
Anyway, DeepFloyd/IF has great comprehension. It would be straightforward to improve that in Stable Diffusion; I can't tell you exactly why they haven't tried this.
This makes the image much more usable without editing.