This has been the dominant story going around, I guess because people want it to be true since they're pissed at OpenAI for not being so open. But in my experience Stable Diffusion's text-to-image is nowhere near as good as DALL-E 2's: DALL-E 2 is incredible at it, Stable Diffusion is not.
But maybe it doesn't matter, because many times more people are playing around with Stable Diffusion, so the absolute number of good images being shared around is much higher, even if the average result isn't great.
> I guess because people want it to be true since they're pissed at OpenAI for not being so open
This is honestly not my experience at all. When I first tried SD and MJ, I did so with a very clear and distinct feeling that they were "knock-off DALL-Es" and I strongly doubted that they would be able to produce anything on the level of DALL-E. Indeed, I believed this for my first couple hundred prompts, mostly because I didn't know how to properly prompt them.
After using them for around a month, I slowly realized that this was not the case; in fact, they were outperforming DALL-E for most of my normal usage. I have a bunch of prompts where SD and MJ produce absolutely beautiful and coherent artwork with extremely high consistency, and the same prompts, when sent to DALL-E, give significantly worse results.
It depends on what you're generating. Complex prompts in DALL-E ("a witch tossing Rapunzel's hair into a paper shredder at the bottom of the tower") blow Midjourney and Stable Diffusion out of the water.
But if all you're doing is the equivalent of visual Mad Libs ("Abraham Lincoln wearing a zoot suit on the moon"), then SD and MJ suffice.
With several thousand images on each, I agree with this -- to a degree.
Dall-E does seem more aware of relationships among things, but using parens and careful word order in some of the SD builds can beat it. By contrast, even most failed images from MidJourney could still be in an outsider art gallery. MJ's aesthetic works, while Dall-E seems like a 9-year-old was taken hostage, clipped Rapunzel and the paper shredder out of magazines, and pasted them onto a ransom note.
That said, I have not been able to get any of Dall-E, MJ, or SD to give me a coherent black Ford Excursion towing a silver camping trailer on the surface of the moon beneath an earthrise.
At the per-image cost, I could pay to get complex concepts such as this rendered via any number of art-for-hire sites at less expense and with guaranteed results.
Only some builds support it. This is the one I'm familiar with: [0]. () around a word causes the model to pay more attention to it; [] around a word causes it to pay less attention. There's an example at the link.
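To make the idea concrete, here is a rough illustration of that weighting syntax. The prompts and the exact weighting behavior are assumptions; builds that support it differ in how strongly each pair of parentheses or brackets scales attention.

```python
# Hypothetical prompts illustrating the ()/[] attention syntax described above.
# In builds that support it, parentheses up-weight the enclosed words and
# brackets down-weight them; the exact multipliers vary by build.

base_prompt = "a witch tossing Rapunzel's hair into a paper shredder at the bottom of the tower"

# Emphasize the shredder, de-emphasize the tower:
weighted_prompt = "a witch tossing Rapunzel's hair into a ((paper shredder)) at the bottom of the [tower]"
```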
It's not just many times more people; it's also the fact that Stable Diffusion can be used locally for ~free.
If I get a bad result from DALL-E 2, I used up one of my credits. If I get a bad result from Stable Diffusion running on my local computer, I try again until I get a good one. The result is that even if DALL-E 2 has a better success rate per attempt, Stable Diffusion has a better success rate per dollar spent.
This also affects the learning curve. I've gotten pretty good at crafting SD prompts because I could practice a lot without feeling guilty. I never attempted to get better with DALL-E 2, because I didn't really want to spend money on it.
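For context, running it locally really is only a few lines with Hugging Face's diffusers library. This is a minimal sketch; the model ID, fp16 setting, and CUDA device are assumptions about a typical consumer-GPU setup.

```python
# Minimal local Stable Diffusion run with Hugging Face diffusers (a sketch;
# model ID and fp16/cuda settings are assumptions about a typical setup).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# A failed generation costs only a little GPU time, so just try again.
image = pipe("a black Ford Excursion towing a silver camping trailer on the moon").images[0]
image.save("attempt.png")
```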
Yes, it's true: I've tried all the available models and DALL-E 2 outperforms Stable Diffusion. It understands prompts way better, and SD sometimes just plainly ignores parts of your prompt or misinterprets them completely. SD cannot generate hands at all, for example; they look more like appendage horrors from another dimension.
OTOH, the main limiting factor for DALL-E 2 from my point of view is the ultra-aggressive NSFW filter. It's so bad that many innocent prompts get stopped and you get the stern message that you'll be banned if you continue, even though sometimes you have no idea which part of the prompt even violated the rules.
It's not true that SD cannot generate hands. It's a bit tricky, but it's possible.
Sometimes hands will turn out just fine and sometimes they will suddenly become fine after some random other stuff is added to the prompt.
It's clearly still missing a bit in terms of accurately following prompts, but it's capable of generating a lot of things that may not have obvious prompts. This should improve a lot with larger models, and I believe the Stable Diffusion team is already working on that.
I genuinely think Stable Diffusion is better than DALL-E. There's a really obvious ugly artifact on almost all the DALL-E images I've seen that SD doesn't suffer from.
But anyway, SD is far superior even if you consider DALL-E better per image, since you can create 1,000 SD outputs and just pick the one you like best (the batch will almost certainly contain one that's better than the DALL-E output you got).
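In practice that "generate a pile and keep the best one" workflow is just a seed sweep. A rough sketch with diffusers follows; it assumes a `pipe` loaded as in the earlier snippet, and the seed count, step count, and filenames are arbitrary choices for illustration.

```python
# Sketch of the "generate many, keep the best" workflow: sweep over seeds
# and save every candidate for manual review. Assumes `pipe` is a loaded
# StableDiffusionPipeline as in the earlier snippet.
import torch

prompt = "Abraham Lincoln wearing a zoot suit on the moon"
for seed in range(32):  # crank this up as far as your patience allows
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    image.save(f"candidate_{seed:03d}.png")
# Then eyeball the folder and keep whichever candidate you like best.
```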
My own experience is that SD requires a lot more prompt engineering to get an appealing output; DALL-E and Midjourney spit out amazing results with even minimal inputs. But what I've found is that when SD subs in its own aesthetic, it's the same aesthetic every time. Almost like a style.
You're right. History has shown the best-quality product doesn't always win if there's a "just okay" solution lying around that's more accessible. VHS and Windows both come to mind.
From my experience there isn't a clear difference in quality between the output produced by Dalle2 and Stable Diffusion. They both suffer from their own unique idiosyncrasies, and the result is that they have differently shaped learning curves.
I do admit that I rate the creativity of Dalle2 higher than that of SD. It can occasionally create really unexpected and exciting compositions, whereas SD more often leans conventional.