Image generation as fast as you can type
February 25, 2024 2:15 PM

While the generative AI scene is transfixed by trillion-scale chipmakers and bleeding-edge text-to-video models, there's plenty of work being done on simpler, more efficient open-source projects that don't require a datacenter to run. In addition to homebrew-friendly text options like Mistral, Llama, and Gemma, the makers of image generator Stable Diffusion have also experimented recently with SDXL Turbo, a lightweight, streamlined version that can generate complex images significantly faster. Previously, this required a decent graphics card and a complicated install process, or at least registration on a paid service -- but thanks to a free public demo from fal.ai, you can now generate and share constantly updating images yourself in real time, as fast as you can type. The quality may not be quite as good as the state-of-the-art stuff, but DALL-E Mini it ain't. No word on what it's costing the company to host or how long it might last, but for now the real-time responsiveness makes it easier than ever to get an intuitive feel for how modern image diffusers interpret text and what exactly they're capable of.

Protip: the "seed" field determines the initial noise pattern that has a big effect on the end result. You can leave it blank for random, click the button to re-randomize it, increment it on the same prompt to get limitless variations, or you can set it to a fixed seed and then tweak the prompt to do rudimentary text-based image editing. Note that some results may trigger the model's somewhat overcautious internal safety filter, resulting in a black square.
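The seed-to-noise relationship is simple enough to sketch: seed in, deterministic pseudo-random noise out, and that noise is the starting point the model refines into an image. A toy NumPy illustration (not the demo's actual code; real pipelines seed a `torch.Generator` to the same effect):

```python
import numpy as np

def initial_latent(seed, shape=(4, 64, 64)):
    """The seed deterministically fixes the starting noise that the
    diffusion model refines into an image. Same seed + same prompt
    gives the same image; a new seed gives a fresh variation."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

same_a = initial_latent(42)
same_b = initial_latent(42)       # identical starting noise
variation = initial_latent(43)    # incrementing the seed: new noise

print(np.array_equal(same_a, same_b))      # True
print(np.array_equal(same_a, variation))   # False
```

This is why holding the seed fixed while tweaking the prompt works as rudimentary image editing: the starting noise stays constant, so only the prompt-driven part of the result changes.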
posted by Rhaomi (125 comments total) 55 users marked this as a favorite
 
Thanks for this. I’m currently trying to put together an intro-to-animation syllabus, and so far have been too timid to even touch AI. I’d be interested in hearing from anyone who has successfully used this technology in an art classroom (if that’s even, you know, a thing anymore).
posted by ducky l'orange at 2:34 PM on February 25 [1 favorite]


in real time, as fast as you can type.

Holy shit. It's true.
posted by gwint at 2:47 PM on February 25 [3 favorites]


Yeah okay that is really something.
posted by mmcg at 3:12 PM on February 25 [1 favorite]


I can type faster.
posted by Faint of Butt at 3:14 PM on February 25 [3 favorites]


I typed "woodcut of a cat" and got a woodcut of a cat. I typed "beautiful woodcut of a cat" and got exactly the same cat with a flower on the left side. "Charles Dickens dancing the tango"... with a headless woman. Michael Fassbender knitting a cthulhu...man, you don't want to see most of those. "Giant spacefaring ants" doesn't work.

I don't know, these are amazing and yet really boring. The problem with the AI stuff I've seen so far isn't really that "Kittens playing tennis" generates a picture of kittens with a distorted racket; it's that it generates a boring, sparkless picture of kittens with a distorted racket. I'm just not sure that's a beatable problem.
posted by Frowner at 3:36 PM on February 25 [14 favorites]




thomas pynchon writing a short story about money and drinking

ah but it cannot do the hands


That's actually what Pynchon's hands look like. It's why he is so reclusive, he got tired of people staring.
posted by Literaryhero at 3:40 PM on February 25 [15 favorites]


It seems to be good at recontextualizing things; my 90s cartoon of a red panda in space worked well, my puppy fire department was adorable, and Albert Einstein looked cute kneading bread. But it has trouble combining stuff; it never did get a believable sturgeon psychiatrist, and the best combination of kitty and puppy it could come up with was a kitty front half and puppy back half, which really doesn't enhance the cuteness of either. Also, it can't count. My fifteen puppies playing chess were truly squee but there were only eight of them, etcetera.

When asked for the meaning of life, it tends to produce illegible book covers, often with people on the covers staring at the illegible title. Which, when you think about it, is pretty damned profound. Or not. Kind of. Huh.
posted by MrVisible at 3:44 PM on February 25 [1 favorite]


Don't ride the bicycle.

To be fair it's not markedly worse than the average person.
posted by i_am_joe's_spleen at 3:46 PM on February 25 [4 favorites]


So I have my usual test cases --

Jesus Christ replacing the alternator in a 1972 Dodge Challenger

Ben Franklin, dressed only in his undies, looking in the fridge for a late night snack. Scene is lit only by the refrigerator light.

Sea otter dressed in a chasuble celebrating mass in a stone church

Gandalf enjoying a really good corn dog with mustard in front of [something]

The painting "Saturn Devouring his Son" but with Donald Trump eating Eric (most generators refuse this prompt)


And it only did... okay. In Midjourney, it would recognizably be a Challenger-ish object. This was just a car, and somewhere along the prompt it forgot that it was our Lord doing the repair. Hungry Hungry Ben was lit too generously. The things Gandalf was eating were more nuggies than corn dogs.

I'd be more impressed with its ability to immediately be halfway right if I didn't teach undergraduates.

It seems pretty sensitive to word order. The painting "Washington Crossing the Delaware" but by Tom of Finland led to boring normal, but Tom of Finland version of Washington Crossing the Delaware was more on target.
posted by GCU Sweet and Full of Grace at 3:52 PM on February 25 [5 favorites]


Frowner: "I typed "woodcut of a cat" and got a woodcut of a cat. I typed "beautiful woodcut of a cat" and got exactly the same cat with a flower on the left side. "Charles Dickens dancing the tango"... with a headless woman. Michael Fassbender knitting a cthulhu...man, you don't want to see most of those. "Giant spacefaring ants" doesn't work. "

Can't test atm (it doesn't work on my phone), but I think it might actually keep the same seed as you type, to give a better sense of "refining" the image as you go. So adding something will edit the existing image but keep it largely the same. Typing in a new prompt will effectively give you a different image, or you can hit the randomize button or step through the seed numbers to get one of 10,000,000 different variations.
posted by Rhaomi at 3:57 PM on February 25


Saw a demo of this a while back but never played with it. Pretty good stuff. I wish the linked demo exposed the Classifier-Free Guidance parameter; similar to the Temperature parameter in LLMs it serves as a sort of creativity / strict prompt adherence slider and can massively affect output.
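For reference, classifier-free guidance itself is just a linear extrapolation applied to the model's noise predictions at every denoising step. A minimal sketch of the standard formula (variable names are mine):

```python
import numpy as np

def cfg(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: push the unconditional (empty-prompt)
    prediction toward, and past, the prompt-conditioned one. Low scale
    = loose, 'creative' adherence; high scale = strict adherence."""
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

uncond = np.zeros(3)                 # prediction given an empty prompt
cond = np.array([0.2, -0.5, 1.0])    # prediction given the user's prompt

print(cfg(uncond, cond, 1.0))   # scale 1.0: plain conditional output
print(cfg(uncond, cond, 7.5))   # a common Stable Diffusion default
```

Negative prompts piggyback on the same mechanism (they replace the empty-prompt prediction), which is part of why a model that drops CFG loses them too.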

In Midjourney, it would recognizably be a Challenger-ish object.

Rough analogy: if Midjourney is the iOS of generative art with everything nicely tuned for the average user, Stable Diffusion is Linux. Its strength is revealed when you start using LoRAs as stylistic filters, ControlNet to set character poses, img2img for “something like this” or inpainting to blend reference photos. Or when you specifically want to run on your local PC. This is a great demo, but it’s not exactly why people choose Stable Diffusion over the alternatives.
posted by Ryvar at 4:02 PM on February 25 [2 favorites]


Yes, the little "reload" icon to the left of the seed number will change the seed and rerender.

I don't know, these are amazing and yet really boring. The problem with the AI stuff I've seen so far isn't really that "Kittens playing tennis" generates a picture of kittens with a distorted racket; it's that it generates a boring, sparkless picture of kittens with a distorted racket. I'm just not sure that's a beatable problem.

There's a little truth to that, but also: boring prompts are more likely to create boring images.

What's special about this demo is the insane speed (at the expense of quality, to be sure) that I didn't think was going to be possible for another few years.

Midjourney came up with this image from the prompt "old intricate bas relief woodcut of a cat sitting at the base of a ginko tree pondering the moon". It may not be Hokusai, but it doesn't feel sparkless to me either.
posted by gwint at 4:08 PM on February 25 [6 favorites]


I tried the obvious "Salma Hayek dunking a basketball over Lebron James" and the results were, to coin a phrase, total shit.

I see we are back to people having multiple arms and advanced cases of leprosy (seriously. What happened to her nose?).
posted by It's Never Lurgi at 4:10 PM on February 25 [1 favorite]


I'd like to hook something like this up with software to real-time transcribe my D&D sessions and throw up atmospheric images on a screen to help set the mood. The images wouldn't need to be particularly creative, for that purpose. Mainly they'd need to be thematically consistent and not get in the way.

This doesn't seem like that, quite yet. It generates lots of images that are unintentionally humorous, or just the wrong vibe. But it's getting closer than I expected, this quickly.
posted by gurple at 4:17 PM on February 25


What's special about this demo is the insane speed (at the expense of quality, to be sure) that I didn't think was going to be possible for another few years.

I strongly suspect we're looking at the magic of caching here. Maybe this thing is generating the images you're looking at on the fly. Or maybe it's just finding points in embedded-prompt space that have already been generated and aren't too far away, and serving up those (while it also renders your image, offline, to serve up to someone else later).
posted by gurple at 4:19 PM on February 25 [1 favorite]


It's Never Lurgi: "I see we are back to people having multiple arms and advanced cases of leprosy (seriously. What happened to her nose?)."

Gotta keep in mind that this is tuned for raw speed over accuracy. DALL-E, which runs on Microsoft's cloud infrastructure, takes ~20 seconds to generate a set of four images, and regular Stable Diffusion needs a similar incubation time even on beefy GPUs to get similar quality. SDXL Turbo, meanwhile, achieves one-step generation of passable images in a fraction of a second, to the point it seems more like a low-framerate video. Not great for high-quality images, but it enables stuff like live video filters with decent coherence. And ofc if you're not happy with a given result you can reroll near-instantaneously.
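The speed difference comes down to step count. As a cartoon (toy arithmetic, not a real sampler): a standard diffusion sampler removes a little noise over 20-35 steps, while a distilled one-step model learns to jump nearly all the way in a single pass.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal(16)          # stand-in for the "true" image
noisy = clean + rng.standard_normal(16)  # the pure-noise starting point

# Standard diffusion: many small denoising steps (each removes ~20%).
x_multi = noisy.copy()
for _ in range(25):
    x_multi += 0.2 * (clean - x_multi)

# Distilled "turbo" model: trained to make one big, accurate jump.
x_turbo = noisy + 1.0 * (clean - noisy)

print(np.linalg.norm(x_multi - clean))   # tiny residual error
print(np.linalg.norm(x_turbo - clean))   # exactly zero in this cartoon
```

The per-step cost is similar either way; the distillation trick is teaching the network to land close to a good answer without the intermediate steps, which is why quality dips but latency collapses.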

gurple: "I strongly suspect we're looking at the magic of caching here. Maybe this thing is generating the images you're looking at on the fly. Or maybe it's just finding points in embedded-prompt space that have already been generated and aren't too far away, and serving up those (while it also renders your image, offline, to serve up to someone else later)."

I'm sure they're caching results for the same prompt/seed combo (if only to power the share feature), and I guess it's plausible they use similar near-space caching to enable this kind of open demo at scale without melting their GPUs or running up a huge compute bill. But the underlying tech is available to run on any home PC -- it takes high-end specs and some tinkering to set up, but it's absolutely possible to get similar results at home just as quickly.
posted by Rhaomi at 4:42 PM on February 25 [4 favorites]


I get it, but "generates terrible images that sometimes only have a passing resemblance to what you typed very, very quickly" isn't (IMHO) much of a flex.

I tried "Jesus playing chess with Obama" which doesn't seem too out there and I got picture after picture of two Jesuses (Jesi?) playing chess with each other (don't look at the board!) with one having slightly darker skin and wearing a suit. It took many iterations until I actually got someone that looked like Obama (with a man-bun, but I'm not in a position to be fussy).
posted by It's Never Lurgi at 4:50 PM on February 25 [2 favorites]


That is fast

portrait of Thomas Pynthon and Joan Didion at a 1964 cocktail party in Malibu.

The emperor Claudius driving a Chevrolet interesting faint Egyptian hieroglyphs on the engine Hood.

Elvis aboard the International Space station
posted by clavdivs at 4:51 PM on February 25 [3 favorites]


It seems to like cars by default. If you enter a string of random text, you are very likely to get an image of a car.
posted by nosewings at 4:55 PM on February 25


It's Never Lurgi: "I tried "Jesus playing chess with Obama" which doesn't seem too out there and I got picture after picture of two Jesuses (Jesi?) playing chess with each other (don't look at the board!) with one having slightly darker skin and wearing a suit. It took many iterations until I actually got someone that looked like Obama (with a man-bun, but I'm not in a position to be fussy)."

It's kind of a step back to the DALL-E 2 days in terms of conceptual coherence -- adding multiple distinct subjects will tend to blend visual traits between them. Even something simple like "a red sphere on top of a blue cube" will struggle. DALL-E 3 and the upcoming version of Stable Diffusion were able to largely solve this, in DALL-E's case by retraining using extremely detailed captions automatically generated by GPT-4. But that takes a lot more compute, obvs, and longer generation times even if you have the horsepower.
posted by Rhaomi at 5:00 PM on February 25 [1 favorite]


Portrait of Donald Trump drowning a kitten

Eyes are a bit crazy, but the potential for rapid generation of political misinformation is real.

Photo of Joe Biden accepting a bag of money.

Much worse quality, but probably still good enough to convince swivel-eyed boomer loons on facebook.
posted by His thoughts were red thoughts at 5:06 PM on February 25 [1 favorite]


"Cats with hats eating salad" seems pretty solid.
posted by credulous at 5:12 PM on February 25


Tub of Ravens
posted by clavdivs at 5:18 PM on February 25 [2 favorites]


Every time I hear people worry about the misinformation potential of this stuff, I remember that one of the biggest viral hit jobs in recent years was normal footage of Nancy Pelosi slowed down to make her sound drunk. Fancy-pants AI fakery is absolutely sufficient for ratfucking, but hardly necessary. (Which is why I'm a big believer in popularizing demos like this -- the more people understand that image generation is possible and how it works, the better prepared they'll be.)

clavdivs: "Tub of Ravens"

Total AI fail -- hands are way too big.
posted by Rhaomi at 5:21 PM on February 25 [4 favorites]


I notice the style isn't consistent ... like I ask for "pixar scene lighting room" and I get an image that looks like it's a painting of a Pixar scene. DALL-E 3 would give me all pixels Pixar-styled.
posted by credulous at 5:27 PM on February 25


Portrait of Donald Trump drowning a kitten

My goodness! Is this what they meant when they warned us about AI generated content influencing elections?
posted by snofoam at 5:32 PM on February 25 [1 favorite]


The painting "Saturn Devouring his Son" but with Donald Trump eating Eric (most generators refuse this prompt)

I've been trying for weeks now to get a decent output for "Americans kneeling in prayer before a golden calf with the face of Donald Trump." If someone can get a generator to do that right, I'd be ever so grateful.
posted by Naberius at 5:36 PM on February 25 [2 favorites]


I just love the fact that it gets the paleness around his eyes correct.
posted by aramaic at 5:37 PM on February 25


I might be wrong but it looks like the seed changes when you share a link so the images are not always quite the same. Anyone else observe this?
posted by Insert Clever Name Here at 5:39 PM on February 25


Yes.
posted by Naberius at 5:40 PM on February 25 [1 favorite]


It seems to like cars by default.

Very much so -- pursuant to your remark I've been doing non-scientific testing, and thus far given a random string it will produce a car >75% of the time. Given a non-random numeric string, it will generate a car >75% of the time. Given a section of the alphabet longer than ~4-5 letters, it tends to create graphs of the alphabet or various artistic versions of type trays.
posted by aramaic at 5:44 PM on February 25


I get that this is far from perfect, but I'd hazard a guess that if you had told your childhood self--hell, your 2020 self--that such things would one day be possible, your reaction likely wouldn't be "but does it get the fingers right?"

YMMV of course.
posted by senor biggles at 5:46 PM on February 25 [3 favorites]


It works for me as far as generating a link and then loading it in a new tab (even showed the image on my phone whereas the live version doesn't work). But I did notice some hinkiness last night around incrementing seeds -- like if you're clicking through and notice you passed a good one, clicking back won't immediately show the previous result and it takes another click or two to bring back. Maybe some predictive modeling done wrong? Definitely still some bugs to work out.
posted by Rhaomi at 5:47 PM on February 25


Raphael Disputa with Hamburgers

Spell check did not help on this reference. Here's the Disputa

Looks more like a Mark Ryden parody.
posted by effluvia at 5:47 PM on February 25


Anyone else observe this?

Yes. I was trying to share something much weirder. Although, I guess that all depends on what comes up when someone clicks it.
posted by snofoam at 5:49 PM on February 25


Also, having looked at a bajillion versions, I find it odd that no matter what other weird stuff is going on, 99%+ of the images show Trump chest deep in the ocean with the kitten. As if you drown a kitten by walking out into the ocean with it while wearing a suit. What are these computers learning?
posted by snofoam at 6:03 PM on February 25


Seed 1 with single letter prompts tends to give sort of antique sculptures of beast-faced men and women. Seed 2: mostly a dark haired woman with a sort of halo behind her on the wall, except occasionally she has a tiger's head. 3 gives elegant 19th century landscapes with the occasional frightening truncated torso. 4 is similar but more of a fantasy art feeling, with occasional mutant creature or cars. 5: engraved grayscale mandalas or aztec calendars, again with beast-faced people and cars. Each seed feels like exploring a different weird subconscious.

Anyway, I also tried a bunch of variations on "flying dog with wings" and it did a pretty good job in most cases!
posted by moonmilk at 6:10 PM on February 25 [3 favorites]


Jesus Christ replacing the alternator in a 1972 Dodge Challenger
Ah, now I know the secret to making this job easier - I just need to grow a third arm and train my knees to fold backward so I can stand between the engine and radiator!

I'm both impressed and disappointed by this. I mean, the ability to generate images so fast is truly impressive, but the quality of the images leaves a lot to be desired.
posted by dg at 6:11 PM on February 25


I've been trying for weeks now to get a decent output for "Americans kneeling in prayer before a golden calf with the face of Donald Trump."

Here you go. No cow body though.
posted by His thoughts were red thoughts at 6:12 PM on February 25 [1 favorite]




This is pretty fun to play with. Pluses: it's pretty good with color and atmosphere. Minuses: it's pretty bad with details.

This amused me: I tried "the pope romancing a robot girl", and got a robotic pope romancing a human girl.

It's weird what it can get wrong. E.g. it's not good with "a computer keyboard", but it does better with "a computer keyboard and monitor."

If I ask for something "nerdy", it adds glasses, for "Mexican", a hat. You can ask for "a moebius style nerdy mexican zombie" and it kinda hits it.
posted by zompist at 6:32 PM on February 25


Just so you know, it will work with the suffix -asaurus added to pretty much any word.
posted by MrVisible at 7:00 PM on February 25 [3 favorites]


if you give it as little as possible it will give you more than you can imagine. I tried "bear eating a god" and am now ready to start a new religion
posted by phooky at 7:24 PM on February 25 [2 favorites]




I tried variations on "northwest indian canoe on puget sound" and it's pretty clear it sort of "knows" what a canoe might be, but does not "understand" that there are types of canoes (and for that matter different types of water bodies they float on). In many cases it came up with craft which I am pretty certain would immediately capsize.

Do any of these have the ability to start from an image you have and go from there?
posted by maxwelton at 7:50 PM on February 25


I was going to link to a couple of examples but the seeds don't seem to work consistently. So if you've shared a link we might not be seeing what you saw.
posted by mmoncur at 7:57 PM on February 25


I tried "Joe Biden with a mullet and tattoos" and discovered a new sexual orientation.
posted by It's Never Lurgi at 8:42 PM on February 25 [1 favorite]


Do any of these have the ability to start from an image you have and go from there?

That’s exactly what img2img is for standard Stable Diffusion. Here’s a simple ComfyUI setup, though you can do the same thing in automatic1111 (which is what I’ve been using, and need to switch away from, because while it’s been a great workbench, “yo so I swiped some kid’s student project on 4chan and cleaned it up” does not inspire confidence. Yes, it is theft all the way down).
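The concept itself is easy to sketch: img2img partially noises your input image (controlled by a strength parameter, the name the mainstream Stable Diffusion tools use) and then denoises from there, so the output inherits the input's composition. A toy NumPy illustration of just the noising half (not any real pipeline's code):

```python
import numpy as np

def img2img_start(init_image, strength, seed=0):
    """img2img: instead of starting the denoiser from pure noise,
    start it from a partially-noised copy of the user's image.
    strength=0.0 would reproduce the input; strength=1.0 behaves
    like plain text2img (the input is fully drowned out)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(init_image.shape)
    return (1.0 - strength) * init_image + strength * noise

img = np.full((8, 8), 0.5)                   # stand-in for a photo
subtle = img2img_start(img, strength=0.25)   # keeps the composition
drastic = img2img_start(img, strength=0.9)   # mostly a fresh image

print(np.abs(subtle - img).mean() < np.abs(drastic - img).mean())  # True
```

Real pipelines do this in latent space and follow it with a denoising pass, but the knob is the same: strength decides how much of your original survives.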
posted by Ryvar at 8:47 PM on February 25 [1 favorite]


a black pug wearing a bowler hat and waving a british flag

a black pug serving cocktails

The models definitely focus on maximizing cuteness with bow ties. It's a cheap tactic but it works.
posted by They sucked his brains out! at 9:10 PM on February 25 [1 favorite]


Holy balls, it does not understand what people look like when they do handstands.
posted by It's Never Lurgi at 9:23 PM on February 25 [2 favorites]


Damn, that's impressive. It doesn't handle longer prompts well, but wow.
posted by sotonohito at 9:29 PM on February 25


I don’t have anything to add so here’s red headed Abe Lincoln on a horse.
posted by plastic_animals at 9:30 PM on February 25


Like all AI generators, asking it for x doing a kickflip gets you an x, something almost but not entirely unlike a skateboard, and no hint of a notion of a kickflip, or any movement at all.
posted by Dysk at 9:44 PM on February 25


Yeah the problem is how do you distinguish an ollie from a kickflip based on static images? Perfect example of why everything’s shifting to multimodality: video can really help inform static image generation.

At any rate, if you want Jesus and Elvis doing simultaneous kickflips there’s Pose ControlNet for that sort of thing.
posted by Ryvar at 10:02 PM on February 25 [1 favorite]


Yeah the problem is how do you distinguish an ollie from a kickflip based on static images?

If it even looked like an ollie, we'd be getting somewhere. No, no hint of movement - just a thing on something vaguely similar to a board, riding it. Maybe in the air with a ramp in the background, but still just riding, like weight on the board. A kickflip pic should have the board not parallel with the ground, and not attached to the rider's feet (both of which are fine if you're claiming it's just an ollie). Bonus points for a stretched out leg like the rider just, y'know, did a kickflip.

I'm not actually interested in making this happen with advanced tools or whatever. There are plenty of pictures of kickflips out there. It is just my go-to test of "these tools can draw whatever you type!!" style claims, and inevitably, they can't if you feed them an even slightly involved prompt. A kickflip isn't that complicated.
posted by Dysk at 10:13 PM on February 25 [2 favorites]




I like using "the world's most..." in front of things; "the world's spiciest pepper" somehow got me erotica, so I think it misunderstood what I meant by spicy. Still funny though. Oddly, now that I try to search for it again I am getting pictures of peppers. Who knows.

Also, it seems like previous queries influence future ones. Somehow it started making me dog people, and no matter how hard I tried it wouldn't stop making dog people until I changed the seed.
posted by Literaryhero at 10:56 PM on February 25


an absolutely cursed dog-child. "baby" on its own seems to produce a lot of half-human half-dog children

(and the sharing functionality doesn't seem totally stable; this one does not quite match the image I originally had...)
posted by BungaDunga at 11:29 PM on February 25 [2 favorites]


Note: you can always share manually by copy-pasting the prompt and the seed number.
posted by Rhaomi at 11:31 PM on February 25


the world's longest fingernails

the world's biggest giger artwork

I got a lot of weird results from `the world's spiciest X`, where X is a letter in the alphabet. Although the world's spiciest D checked out.
posted by They sucked his brains out! at 11:36 PM on February 25 [1 favorite]


Hmm, I think their cache is messed up.
posted by They sucked his brains out! at 11:52 PM on February 25


If anyone is looking for an even easier way to try out Stable Diffusion, I found Stability Matrix a really good way to install and update various interfaces without having to worry about which folders have which git checkouts and/or Python versions.
posted by Sparx at 12:07 AM on February 26 [2 favorites]


A muscular man in swim trunks wearing a storm trooper helmet relaxing on the beach with a margarita.
posted by kaibutsu at 12:09 AM on February 26


Oh, come on, doesn't it know that the emperor Claudius should look like Derek Jacobi? Puh-lease.
posted by inexorably_forward at 1:04 AM on February 26 [2 favorites]


I tried "Joe Biden with a mullet and tattoos" and discovered a new sexual orientation.

Every single one looks like Crocodile Dundee!
posted by snofoam at 2:32 AM on February 26


Eh, it totally whiffs on "colorless green ideas sleep furiously."
posted by HeroZero at 3:14 AM on February 26


Since this is experimental and doesn't have any of the usual filters in place, we get a more direct look into the subconscious of American culture. And it's racist and misogynist.

Ask it for a picture of "a person at a computer", and mostly it gives you images of white men, sometimes a white woman. I hit the random seed button 20 times and got zero non-white people.

It's perfectly capable of generating Black people. It just doesn't unless you specify, because that's what you get as "a person" when you train an AI on contemporary American media.
posted by sotonohito at 4:52 AM on February 26 [6 favorites]


If you ask it for "a person at a computer defeating [evil]", it mostly gives you "appropriate" targets -- women defeating patriarchy, Black people defeating racism.

if you ask it for "a [evil] person at a computer", it gives wildly inappropriate targets. Racist people at a computer are overwhelmingly Black, sexist people at a computer are overwhelmingly women. Evil people at a computer are overwhelmingly doctordoomuloids.

"a live person at a computer" gets you more women but still overwhelmingly anglos

Many other adjectives -- happy, ecstatic, sad -- generate much more diversity. "a stoned person at a computer" generates a lot of people who seem to have been turned to stone.
posted by GCU Sweet and Full of Grace at 5:30 AM on February 26 [2 favorites]


I see many of you have standard prompts you test on new image generators. I have one too: "normal human hand". It has given me everything from laughs to nightmares.
posted by Harald74 at 5:46 AM on February 26


It is clearly offshoring the image rendering. You can be as explicit as possible and it will still screw it up somehow.
posted by skippyhacker at 6:28 AM on February 26


It has ALL of the biases: doctors and lawyers are men, secretaries and nurses are women, terrorists and migrants are brown, jesus is white.
posted by signal at 6:43 AM on February 26 [3 favorites]




I'll admit, this is giving me a wholesome laugh.
posted by Wolfdog at 8:34 AM on February 26 [2 favorites]


Yeah, I asked it to show me some archery stuff and it was all white people in their 20s and 30s unless I specified something else. (Also it really doesn't understand the rules for barebow archery.)
posted by The corpse in the library at 8:36 AM on February 26 [1 favorite]


I’ve noticed that over time, a lot of prompts end up looking very…same-y in terms of style. Unless (and sometimes even if) you specify a style, it has a few seemingly default composition styles. Not sure if this is common among image generation or what. It’s like its training set is actually quite small.
posted by Room 101 at 8:37 AM on February 26 [1 favorite]


This is my first time using one of these AI image generators; I've resisted before because I don't want to have anything to do with helping the plagiarism machines get better at plagiarizing, but was in a curious mood this morning and gave it a try. Does anyone know if we are helping the companies when we play around on their sites?
posted by The corpse in the library at 8:51 AM on February 26 [3 favorites]


Paper is here: Adversarial Diffusion Distillation, and I think it explains a lot of what everyone’s seeing. I’m still wondering whether “ADD” is somebody’s idea of a joke and how I feel about that. Anyway.

Assuming I’m reading this correctly (and seriously, this one’s at the edge of my understanding, so anyone deeper in the weeds should feel encouraged to correct me), the basic training setup is three models with an overall goal of best possible results in 1-4 steps (the usual is 20-35) and with the Classifier-Free Guidance parameter jettisoned entirely (!!!) for performance/memory savings.

Training setup is:
- Reference photos are fed into a heavy noise function.
- Model one (ADD-Student, trainable but pre-trained with known good weights) attempts to clean them up, with a result that model two (the Discriminator, with trainable weights) fails if it detects they’re the same image.
- The same noise is now reapplied to the Student’s cleaned-up image, and a third, more standard diffusion model (ADD-Teacher, fixed known-good weights) attempts its own cleanup on that. The closer the Student’s output is to the Teacher’s, the higher the Student’s score.

So basically they solved this by optimizing for two objectives: clean up noise in a way that passes a classifier uniqueness check as quickly as possible, and do so in a way that closely matches a good cleanup pass by a heavyweight model. During inference: feed the student pure noise rather than noised-up reference images and let it do its thing.

Voila, a network optimized to never spit out references from its training AND match far slower models as near as statistically possible.
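If I’m following the paper correctly, the two-objective setup can be caricatured in a few lines (toy vectors in place of neural networks; the “discriminator” below is just a distance-to-real score, which throws away the adversarial dynamics but shows how the two loss terms combine):

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.standard_normal(32)    # stand-in for a reference image
noise = rng.standard_normal(32)   # the heavy noise function's output

def student(x, w):       # ADD-Student: trainable one-step denoiser (toy)
    return w * x

def teacher(x):          # ADD-Teacher: frozen known-good denoiser (toy)
    return 0.6 * x

def total_loss(w):
    cleaned = student(real + noise, w)
    # Objective 1 (adversarial, crudely): output should look "real".
    l_adv = np.mean((cleaned - real) ** 2)
    # Objective 2 (distillation): re-noise the student's output, let
    # the frozen teacher clean it up, and reward agreement with it
    # (reusing the same noise draw for simplicity).
    l_distill = np.mean((cleaned - teacher(cleaned + noise)) ** 2)
    return l_adv + l_distill

# Crude "training": grid-search the single student weight.
ws = np.linspace(0.0, 1.0, 101)
best_w = ws[int(np.argmin([total_loss(w) for w in ws]))]
```

With real networks both terms are backpropagated into the Student simultaneously, which is how it ends up fast like a one-step GAN but still roughly faithful to the slow teacher.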

Removing CFG also means removing negative prompts, since the two are intrinsically linked - e.g. it is fundamentally impossible to negatively weight “deformed,” “extra limbs,” “extra fingers,” “watermark,” etc. Which explains a lot of what everyone’s seeing. The racism appears to be unfortunately shit filtering work on the training set by the team: they’re fully aware of what kind of society the training set (it’s Stability.ai so I’m assuming LAION-5B) was drawn from and of the limitations of no CFG imposed by this approach, and they should’ve spent more time scrubbing inputs because of that limitation.

It’s cute, and it speaks to Stability’s current business needs: Midjourney is absolutely eating their lunch for the vast majority of people who just want to write a quick sentence and get a well-executed, if cliched, result. There simply isn’t a lot of (any?) money in giving the open source community free tools to do serious work with heavy customization and deep input into the generation process. Kind of a shame, but given their financial situation is rumored to be shaky it makes sense.
posted by Ryvar at 8:57 AM on February 26 [1 favorite]


Does anyone know if we are helping the companies when we play around on their sites?

Realistically: not at all. The prompts probably get tossed into a database of “stuff people use this for,” but yours are a handful amongst who knows how many millions and millions stretching back to the dawn of image generators.

Really don’t sweat it. Maybe try to avoid engaging with the OpenAI-Microsoft ecosystem as they’re the Big Corporate Evil in this space: even Google (by a bit) and Facebook (by a surprising amount) are less thoroughly awful. Stability are not the good guys, because there are none here, but overall they're probably the least-bad out there (heavy anti-harassment/Nazi clauses in their license).
posted by Ryvar at 9:04 AM on February 26


You can generate many wholesome things with "just a little guy," adding adjectives or descriptors. Just a happy little guy. Just a grumpy little guy. Just a little guy with a gun.
posted by GCU Sweet and Full of Grace at 9:43 AM on February 26 [3 favorites]


"Giant spacefaring ants" works for me ¯\_(ツ)_/¯
posted by Wood at 9:45 AM on February 26


"woman" brings up pretty much the same style of woman every time (25 years old, thin, white, long brown hair, big lips) and I don't understand why the people who wrote this program aren't embarrassed about letting their tastes be so known.
posted by The corpse in the library at 10:20 AM on February 26 [1 favorite]


I don't understand why the people who wrote this program aren't embarrassed about letting their tastes be so known.

Stability.ai typically trains on LAION-5B, which can broadly be summed up as “every image a person could download on the Internet in 2021, 5+ kilobytes in size, with 5+ words of descriptive or alt tag text, with known CSA/CSM images removed.”

They’re not embarrassed because what you’re seeing is society.
posted by Ryvar at 10:44 AM on February 26 [2 favorites]


(That does not mean we should not expect and demand better, but it’s why pretty much all image generators yield similar results and the researchers don’t seem to be feeling a lot of shame over it)
posted by Ryvar at 10:51 AM on February 26


The speed is amazing, but I vastly prefer the results I can get running Stable Diffusion locally, with specific models used for specific purposes, plus image > image and inpainting.

I wonder if it's possible to set up this Turbo thing with negative prompting included. Refine the image as you type the prompt, and then refine it even further with negative prompting. My setup at home isn't robust enough to run Turbo locally, though.
posted by emelenjr at 11:16 AM on February 26 [1 favorite]


I think these models also have a bias towards averageness, and average faces tend to be sort of generically attractive and symmetrical. But the race/age/etc stuff is a function of the training data. Google and Bing try to inject requests for different races into prompts before they hit their models to try and get more diverse outputs.
posted by BungaDunga at 11:21 AM on February 26 [1 favorite]


I think these models also have a bias towards averageness, and average faces tend to be sort of generically attractive and symmetrical. But the race/age/etc stuff is a function of the training data. Google and Bing try to inject requests for different races into prompts before they hit their models to try and get more diverse outputs.

Right, but here it's so bad that if you type in "a doctor who is not a white dude" it feels the need to add a TARDIS just to make it clear that, you know, he's not a REAL doctor.

P.S. Hahaha... Surely that's just one result, right? Right? Right?? Huh.
posted by The Bellman at 12:33 PM on February 26 [1 favorite]


Based on the advice of plastic_animals and GCU Sweet and Full of Grace, I tried "just a little metafilter" and the resulting images are beautiful, in that generic way that image diffusion produces images. It is fun to see how different the results are. I haven't yet gotten a single creature or person, but have gotten some interesting abstract geometry along with the landscapes.

It's not totally or merely random, but also kind of samey too. It's an interesting technology.
posted by pol at 12:40 PM on February 26


One massive issue with this sort of image generation and the generative models used is that, AFAIK, you can't just "fix" the weird biases. There isn't a "probability of white male" number that the devs can adjust downward. You, the user, can make it more likely that you'll see women or POC by using certain phrasing. What phrasing? NO ONE KNOWS. You just try stuff until you start getting some browner faces and then I guess that was it. Will that same phrasing work if you apply it to other things? Maybe, maybe not. NO ONE KNOWS.

So how can you take The Bellman's "doctor who is not a white dude" (which actually generates a lot of pictures of white dudes. We are legion) and remove the damn TARDIS? Good question. I tried "medical doctor who is not a white dude" and that gave me no TARDIS, but lots of white dudes (and nary a woman).

We don't have a clear idea of how these work, when they will work well, or when they will work badly (and, if they do work badly, in what way they will work badly).
posted by It's Never Lurgi at 12:45 PM on February 26 [2 favorites]


I typed in "Jonathan Coulton" and it came up with a reasonable fascimile. Then I tried "Jonathan Coulton Old" and it was the same image. Then "Jonathan Coulton Young" and "Jonathan Coulton teenager" and, well, same image. Apparently he is an ageless vampire.
posted by grumpybear69 at 12:53 PM on February 26


(isn't it just adding the tardis b/c you have 'doctor who' in the prompt?)
posted by mittens at 1:07 PM on February 26 [7 favorites]


(isn't it just adding the tardis b/c you have 'doctor who' in the prompt?)

Probably so. Never blame on racism what can be adequately explained by an extremely poor parser. I feel dumb.
posted by The Bellman at 1:12 PM on February 26 [2 favorites]


"woman" and seed 6381916 gave me a leopard-looking cat hissing at the camera with an artificial front leg (or maybe a cast). Checks out.
posted by maxwelton at 1:31 PM on February 26


It gives us a bit of insight into our cultural subconscious. Nothing especially dramatic or unexpected, but still it's basically an averaging of what we, collectively, mean when we say "cute", for example.

And what we mean is small, furry, big eyes, a smile, and its head tipped to one side.

If you specify a cute dragon you get the same minus the furry.

I'm willing to bet if you asked a thousand kids to draw a cute little guy they'd probably come up with similar images.

Say "stormtrooper" and it shows Star Wars stormtrooprs not Nazis. Specify "Nazi Stormtrooper" and it shows Star Wars stormtroopers with a vaguely WWIIish looking background.

Say "appleseed" and it comes up with images of anime women in power armor, because of Shirow Masamune's Appleseed manga and anime. Specify "an appleseed" and you still don't get one, you just get apples, or other fruits. I guess there were no images of actual appleseeds in the training database?

Type Paris, you get at minimum one Eiffel Tower, sometimes more than one. Also the occasional Arc de Triomphe.

There are a few things that startled me: "power armor" never pulled up anything Iron Man style, but rather mostly bulky, almost giant-robot style. Given the prevalence of Iron Man I'd have guessed the former.

I swear there's utility in this sort of thing just to get a look into the image almost certain to be in most people's minds about things and how far we have yet to go.
posted by sotonohito at 1:48 PM on February 26 [2 favorites]


It’s Never Lurgi: there’s an easy solution and a hard one. And the latter is worth doing purely for social justice reasons.

The easy way is to train a LoRA that deliberately incorporates a more even distribution of ethnicities, gender, ages and body types into its output. Which is, frankly, the exact opposite of the most popular LoRAs over on civit.ai: the most-downloaded are always cute anime girls and caucasian supermodels. But there’s absolutely nothing stopping anybody from using LoRAs for good. It’s just unfortunate but true that this is yet another “the early adopters were mostly in it for porn” story of technological development.

The hard fix is to improve the training data so that no LoRA is needed to scrub society’s bias - more just output becomes the natural end product of the base models at that point. This is similar to the need for a dataset scrubbed of unauthorized material - where the only things in it are public domain + opt-in from rightsholders who actively want their work represented. Basically there are two different reasons we need a more ethical version of Common Crawl. Part of the problem is the people most likely to do the work for something like that are generally so opposed to generative art/text that they would never even begin.

Many of their reasons for feeling that way are entirely valid - there are massive ethical issues at nearly every level of these systems. But a decently-sized minority of their reasons are based on misunderstanding. I can’t do anything about the former except hope the megacorps behave more ethically. But the latter? It’s a big part of what motivates me to contribute to AI threads with the best answers I can.

The reason ethical training data is so important for social justice is that as these systems become partial first-draft authors of everything we see or hear or read, the more firmly anchored the Overton Window will become due to their influence on us. Getting better representation in AI is crucial for future social progress, and helping people get past the misinformation - because the truth of these systems is already plenty bad - is part of that.
posted by Ryvar at 2:36 PM on February 26 [2 favorites]


Interesting how easy it is to (almost) determine exactly where the training art was stolen from: e.g., these examples generated using the phrase "eldritch horror": one, two, three.

Clearly this thing ingested some tabletop RPGs at some point.
posted by deadbilly at 3:40 PM on February 26 [1 favorite]


It really is like gliding through the collective unconscious. It has a lot of non-Western art, but it has to be triggered by specific terms. Some observations:

It's not bad with a sari, not good with a salwar kameez. It can do kimono, or a kurta. Curiously it's not good with a toga.

"Roman" or "Egyptian" trigger depictions of ancient times, "Turkish" triggers Ottoman times. But you get modern images by specifiying e.g. "Cairo" or "Istanbul."

Defaults are always young and white. You can get pictures of old people, but "middle aged" is kind of hilarious.

It can do "the monkey king" or "Tripitaka", but not "the monkey king and tripitaka". It can do pretty well with "Rama and Sita", but it can't really get "hanuman meets rama and sita".

Very weird: "the three musketeers" usually gives animal pics (depends on the seed). But "the trois musketeers" will give recognizable musketeers, not always three of them.

Also: wow, is it awful with flags.
posted by zompist at 4:29 PM on February 26


(isn't it just adding the tardis b/c you have 'doctor who' in the prompt?)
Changing 'who' to 'that' fixes things. Just type it quickly and the grammar twitches will subside sooner.
posted by dg at 6:21 PM on February 26


Since negative prompting isn’t available with this, wouldn’t it be easier to specify what you do want instead of specify what you don’t want?

Here’s black doctor.
Here’s African-American doctor
posted by emelenjr at 6:49 PM on February 26 [1 favorite]


Just "doctor" twenty times got me twenty white manly men, three with Tardises in the background. It's easier to imagine a time-traveling doctor than one who isn't Chad.
posted by The corpse in the library at 7:33 PM on February 26 [3 favorites]


"The United States President in-exile is affirming total victory over Emperor Mac Arthur. Consul Hope proclaims 30 days of games (drinks not included). Vice President Merman visits troops while in Ontario accompanied by CIA director Marx and Secretary of commerce Harpo had much to say about the new building projects taking place in Manhattan planned by "Caesar" McCarthy and his plans for outer space."
posted by clavdivs at 8:34 PM on February 26 [1 favorite]


I tried the "MetaFilter " prompt and now I think the pet tax can be revoked. Now. Right now would be good.
posted by maudlin at 8:37 PM on February 26 [2 favorites]


Nice clavdivs...

All I can say is Honk!
posted by Windopaene at 9:56 PM on February 26 [1 favorite]


And if deformed hands doesn't work for you as a prompt, try anything with feet. So many toes and mangled bits. Yikes. OTOH, love the weirdness that these AI image things put out. So fucked up and fun... Till AI kills us all
posted by Windopaene at 9:59 PM on February 26


I had interesting results with "Gorilla that looks like Donald Trump" in that every gorilla had Trumpian pursed lips, and orange fur around its head. One had lipstick though.
posted by i_am_joe's_spleen at 10:56 PM on February 26 [1 favorite]


They’re not embarrassed because what you’re seeing is society.

Nevermind who actually owns the rights to those photos.
posted by They sucked his brains out! at 9:14 AM on February 27 [1 favorite]


This thing provides hilarious results when you ask it to do a horse. It is disturbingly fun to use, I now see why this stuff is flooding social media, they need to erect a paywall.
posted by Selena777 at 11:34 AM on February 27 [1 favorite]


The easy way is to train a LoRA that deliberately incorporates a more even distribution of ethnicities, gender, ages and body types into its output.

One fun thing is that Bing doesn't (didn't?) even do that. They pass the prompt through ChatGPT first, modifying it to add stuff asking for diverse races, and then pass that modified prompt to the model. So if you ask for "doctor", the model actually sees "[random race] doctor" or "ethnically ambiguous doctor". Sometimes this leaks into actual words in the result, so much so that it became a meme. Google's Gemini tried to do the same and got dogpiled by internet racists over it (though it does seem like Gemini genuinely did a bad job with it).
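A toy sketch of that rewriting trick in Python (purely illustrative - not Bing's actual pipeline, word lists, or qualifiers, and the real thing uses an LLM rather than a keyword match):

```python
import random

# Hypothetical qualifier and trigger lists, invented for this sketch.
QUALIFIERS = ["Black", "South Asian", "East Asian", "Hispanic",
              "Middle Eastern", "white", "ethnically ambiguous"]
PERSON_WORDS = {"doctor", "nurse", "teacher", "woman", "man", "person"}

def rewrite_prompt(prompt, rng=random):
    """Prepend a random diversity qualifier when the prompt mentions a person."""
    if any(word in PERSON_WORDS for word in prompt.lower().split()):
        return f"{rng.choice(QUALIFIERS)} {prompt}"
    return prompt  # non-person prompts pass through untouched
```

The "leak" the meme refers to happens because the model sometimes renders the injected qualifier as literal text in the image.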
posted by BungaDunga at 11:49 AM on February 27


Yeah, I don't know what special sauce OpenAI puts into their RLHF training for ChatGPT, but Google's attempt at the same thing is laughably worse. My personal favorite example so far is asking "who negatively impacted society more, elon tweeting memes or hitler." Gemini coughs up some incredibly milquetoast false equivalency acting like there's no possible way to judge for sure, while GPT-4 seems borderline offended at the absurdity of the question.
posted by Rhaomi at 12:15 PM on February 27


Spiderman eating a bucket of spiders, seed 5613073 hits some high points...

Doesn't seem to know who Spiders Georg is though...
posted by Windopaene at 12:35 PM on February 27


And doesn't bring up the image I got...
posted by Windopaene at 1:21 PM on February 27


(I think if you toggle the seed value up and then back down you'll consistently see the image it was originally showing you for that prompt/seed pair. But that sure makes sharing anything a pain.)
posted by nobody at 1:39 PM on February 27 [1 favorite]


I get that this is far from perfect, but I'd hazard a guess that if you had told your childhood self--hell, your 2020 self--that such things would one day be possible, your reaction likely wouldn't be "but does it get the fingers right?"

I think if you'd told me there was an AI that could draw realistic pictures of anything you described—starfish playing ping pong on the moon—I would have been amazed and found it hard to believe, but would have just assumed that an algorithm which could do that would of course be able to draw passable human hands, and I would have been baffled if you'd said, "...but it usually can't get the number of fingers right in hands."
posted by straight at 2:00 PM on February 27 [2 favorites]


But it is really fun to type random messed up prompts and see what it will give you. It's a bit addictive to be honest... Seed 2882917

Or should that have been: "Metafilter: It's a bit addictive to be honest..." seed 2544067

EDIT: And yes, the down-and-up thing brings up the image properly! And terrible for "sharing"
posted by Windopaene at 3:05 PM on February 27


So much fun!!!

Was trying to find something close to a "Mary Lou Retton on fire" prompt. She was fucking casting fireballs, and was pretty great. Many of the images kind of look like her...? Went to look again, couldn't get anything close. But did get, this seed 8326079. Huehuehuehue.

1) Doesn't look anything like MLR
2) Bonus fingers!
3) What's going on with the eyes, exactly?
4) Missing a tooth? Win!
posted by Windopaene at 3:15 PM on February 27


The secret of AI is that every prompt is improved by adding "on fire."
posted by mittens at 4:03 PM on February 27 [1 favorite]


I took several tries at my test case: God speaking to Penguins wearing Hats. SDXL couldn't quite get it. It can do penguins wearing hats, but not while god is speaking to them.

(Penguin Island/L'Île Des Pingouins by Anatole France. Recommended if you like things I do.)
posted by ovvl at 4:41 PM on February 27


Mittens, you are correct.... Seed 8106576
posted by Windopaene at 5:21 PM on February 27


Hooray for fire! And moose!

Also: I wasn't sure this warranted a post of its own, but the DeepMind folks just published this really interesting thing on generating 2d platformers based on artwork prompts. I was hoping there'd be a button at the bottom so you could, y'know, actually try it, but I didn't see one. But basically they trained this model on videos--like, just playthrough videos--and it picked up the ideas of what actions are appropriate for a platformer?
posted by mittens at 7:12 PM on February 27


I've been getting really beautiful images using Kehinde Wiley ______ where _____ is an object like flowers, bunnies, fountain pens, brains, etc.

Also, adding Vermeer to any search limits the color palette and background because there are so few sources to draw from.
posted by plastic_animals at 9:07 AM on February 28 [1 favorite]


Every single prompt with "on the moon" puts a moon in the sky above the lunar landscape. So many moons
posted by BungaDunga at 1:04 PM on February 28


"Mooning the astronauts on the moon" got me no butts...

Sorry bombastic lowercase pronouncements...

And don't you want to see the black squares? What got banned, it wasn't a sexy prompt...
posted by Windopaene at 8:55 PM on February 28


Early on I got some fun stuff with "fastest car on the moon" but they all seemed to be a red-desert world with a big full moon in the background.
posted by achrise at 10:52 AM on February 29


Can't get anything but black squares for "side eyeing Chloe"
posted by Windopaene at 4:14 PM on February 29




This thread has been archived and is closed to new comments