AI-Generated Images from AI-Generated Prompts

As the world’s leading expert on a people-first approach to computer vision, I am dedicated to providing insights that enable designers, developers, and copywriters to create accessible images at the highest possible velocity. A velocity so high, in fact, you can almost hear the point whistling over their head, like this self-aggrandizing intro.

Building on my post AI-Generated Images from AI-Generated Alt Text, I am going to demonstrate the saddest, newest trend in generating alternative text for images — AI tools that use existing images to generate text prompts for AI-powered image generators.

Using the same source images from my previous post, I fed each into the CLIP Interrogator using ViT-L-14/openai (meant for Stable Diffusion), then took the text output and fed each into Stable Diffusion. I also fed my original alternative text into the same instance of Stable Diffusion to compare the results.
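
The workflow above can be sketched in Python. This is a minimal sketch under stated assumptions, not a record of the exact setup used for this post. `compare_prompts` and the two factory functions are hypothetical names. The post does not name a Stable Diffusion checkpoint, so `model_id` is left as a required parameter. The count of four images per prompt is taken from the output descriptions further down. The heavy libraries (`clip_interrogator`, `diffusers`, `Pillow`) are imported lazily so the comparison logic can be read and exercised without them installed.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases for this sketch.
Captioner = Callable[[str], str]                # image path -> text prompt
ImageGenerator = Callable[[str], List[object]]  # text prompt -> generated images


def compare_prompts(
    image_path: str,
    alt_text: str,
    captioner: Captioner,
    generator: ImageGenerator,
) -> Dict[str, object]:
    """Generate images from the human alt text and from the machine prompt."""
    machine_prompt = captioner(image_path)
    return {
        "alt_text": alt_text,
        "machine_prompt": machine_prompt,
        "from_alt_text": generator(alt_text),
        "from_machine_prompt": generator(machine_prompt),
    }


def clip_interrogator_captioner(clip_model_name: str = "ViT-L-14/openai") -> Captioner:
    """Build a captioner backed by the clip-interrogator package."""
    # Lazy imports: only needed when this factory is actually called.
    from PIL import Image
    from clip_interrogator import Config, Interrogator

    interrogator = Interrogator(Config(clip_model_name=clip_model_name))

    def caption(image_path: str) -> str:
        return interrogator.interrogate(Image.open(image_path).convert("RGB"))

    return caption


def stable_diffusion_generator(model_id: str, images_per_prompt: int = 4) -> ImageGenerator:
    """Build a generator backed by a diffusers Stable Diffusion pipeline.

    model_id is an assumption the caller must supply; the post does not say
    which checkpoint was used. The pipeline runs on CPU unless moved to a GPU.
    """
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)

    def generate(prompt: str) -> List[object]:
        return pipe(prompt, num_images_per_prompt=images_per_prompt).images

    return generate
```

Passing the captioner and generator in as plain callables keeps the comparison step independent of any one tool, so a different prompt generator can be swapped in without touching `compare_prompts`.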

As with my previous post, dear sighted reader, I want you to read this without looking at the images. Each has been hidden in a disclosure. Instead, read the alternative text I provide then read the AI-generated prompt text, and try to visualize each. Consider how they differ. Then compare the images. As you visually compare each, think about how a screen reader user might have benefited from the hand-crafted artisanal alt text.

Flat Color Illustration

A flat color illustration
A cartoonish Kawaii slice of toast with happy eyes, open smiling mouth, and reddish cheeks; there is a spatter of blood coming from the top of the toast similar to the Watchmen logo.
The first is a PNG image, the second is an SVG.

This is the alternative text for the image:

A cartoonish Kawaii slice of toast with happy eyes, open smiling mouth, and reddish cheeks; there is a spatter of blood coming from the top of the toast similar to the Watchmen logo.

CLIP Interrogator using ViT-L-14/openai generated this text prompt:

a slice of bread with a bloody face on it, a stock photo, inspired by Kanbun Master, twitch emote, kawaii cutest sticker ever, artoast8p, aliased, assassin, bun, spraying blood, choi, bengus, exploitable image, 1 5 0 4, 1 4 5 0, winking, scobillyflup

The Stable Diffusion output for each:

My alt text fed into Stable Diffusion for the flat color illustration
Four images. One is a cartoony slice of bread with a happy face, with an angry red face on its lower half, alongside photo-realistic bread. Another looks like a cyclops meatball wearing a ski goggle, but with a smile. The third is maybe a cartoon slice of bread with three eyes and rectangle mouth, alongside an upside-down speckled frowning meatball. The last is a stylized black shape that may have horns or pointy ears, with a white angry face, floating above a wide-eyed slice of rye with a trail of jelly across its eyes and mouth.
I want to know why so many meatballs.
CLIP Interrogator prompt fed into Stable Diffusion for the flat color illustration
A flat-color slice of bread with two photo-realistic bits stuck on it as ears; it has red heart eyes, black nostrils, and a rectangular mouth with red circuitry lights. A photo-realistic slice of maybe Italian bread with two large pitch black circles as eyes. A cartoon maybe slice of bread or maybe round loaf, with what appears to be a horizontal gash through its middle, which is also its face, with a large blob of blood spilled out at the end, frown and eyes following it. A cartoon maybe gelato bowl, with a cookie sticking up out of the top and two expressive angry eyes behind a few drips of blue moisture, probably sweat.
These are certainly more unsettling, at least until the bowl of gelato.

Photo

A photograph
Looking across the atrium with four levels of the gallery space / ramp visible. The ramp contains exhibits as well as crowds of people moving among them, with some people leaning over the edge of the ramp wall.
I used the image in a tweet about the Guggenheim, so the venue context was there. Wallygva at English Wikipedia, CC BY-SA 3.0.

This is the alternative text for the image (venue location provided in original surrounding context):

Looking across the atrium with four levels of the gallery space / ramp visible. The ramp contains exhibits as well as crowds of people moving among them, with some people leaning over the edge of the ramp wall.

CLIP Interrogator using ViT-L-14/openai generated this text prompt:

a group of people standing inside of a building, by Alexander Calder, trending on unsplash, gutai group, panoramic anamorphic, frank lloyd wright, stone stairway, new york times, man in white t – shirt, museum photo, imet2020, artgram

The Stable Diffusion output for each:

My alt text fed into Stable Diffusion for the photo
Four separate images broadly showing crowds of people moving through large multi-storied open spaces with glass walls and ceiling. One image is looking down into an atrium, and feels most like the source image. Another shows a very symmetric and squared set of floors. None show ramps.
These at least convey the sweeping expanse of the space, if not its main attributes.
CLIP Interrogator prompt fed into Stable Diffusion for the photo
One image of a small crowd in a large plain white room with a glass wall and polished concrete floor. Another of a black room and light tile floor with people clustered just past an orange wall and doorway that take up much of the room. A sweeping raw concrete conical room that tapers slightly toward the top to a flat glass ceiling, a couple dozen people milling around the edge of the blank walls. Four people looking at a blank white wall that has a few circles and swoops pressed into its surface.
These rooms feel too much like a human abattoir. Except maybe the one with the odd climbing wall.

Digital Illustration

A digital illustration
A photo illustration travel poster showing a cluster of metallic hot air balloons with spheroid gondolas floating above the opaque clouds of Jupiter’s atmosphere. Behind and above the balloons is a sweeping aurora of teal and purple against a black starry sky. The advertisement reads “Experience the mighty auroras of Jupiter” in metallic block text at the bottom of the poster.
From NASA’s Visions of the Future poster series. Courtesy NASA/JPL-Caltech.

This is the alternative text for the image:

A photo illustration travel poster showing a cluster of metallic hot air balloons with spheroid gondolas floating above the opaque clouds of Jupiter’s atmosphere. Behind and above the balloons is a sweeping aurora of teal and purple against a black starry sky. The advertisement reads “Experience the mighty auroras of Jupiter” in metallic block text at the bottom of the poster.

CLIP Interrogator using ViT-L-14/openai generated this text prompt:

an image of a hot air balloon flying in the sky, concept art, inspired by Barclay Shaw, cg society contest winner, space art, infused with aurora borealis, vibrant tourism poster, ryan dyar, floating molecules, official product image, poster tour, acnh, albion, 2 0 1 9, esa, listing image, images on the sales website, dark energy

The Stable Diffusion output for each:

My alt text fed into Stable Diffusion for the digital illustration
Four images. One is three slightly textured balloons against a teal sky that opens into black with a misshapen planetoid above it; at the bottom of the frame is nonsense text in a black serif typeface. The second has four balloons, all dark except a green one half-covered in a green seemingly metallic sheath, all against purple sky with white, teal, and red aurora. The third has one prominent chunky purple balloon and a few in the background, against a white-streaked teal sky and floating over a flat rocky ground. The final has cartoony flat-color yellow or pink balloons against a star field that goes from dark purple to dark pink, with a yellow and teal spiral galaxy in the distance.
A token effort at bringing in the text.
CLIP Interrogator prompt fed into Stable Diffusion for the digital illustration
An aurora made up of a ghost image of a balloon being blown nearly horizontal, with one distant balloon and the crescent edge of one very near that may also be a pumpkin. Four colorful balloons in the purple stratosphere well above the highest clouds of blue and red and yellow arcs, with a green aurora above them. A nearby perfectly normal photo-realistic balloon against a perfectly ordinary dark sky. A lone balloon above a mountain horizon with yellow, teal, green, blue, and purple aurora.
No effort at text, because the prompt did not include the text.

Enhanced Photo

A digitally enhanced photograph
an undulating, translucent star-forming region in the Carina Nebula is shown in this Webb image, hued in ambers and blues; foreground stars with diffraction spikes can be seen, as can a speckling of background points of light through the cloudy nebula
One of the first images revealed to the public from NASA’s James Webb Space Telescope. Image credit: NASA, ESA, CSA, and STScI.

This is the alternative text for the image as written by NASA:

an undulating, translucent star-forming region in the Carina Nebula is shown in this Webb image, hued in ambers and blues; foreground stars with diffraction spikes can be seen, as can a speckling of background points of light through the cloudy nebula

CLIP Interrogator using ViT-L-14/openai generated this text prompt:

a star filled sky filled with lots of stars, inspired by Kim Keever, behance, space art, header, aspect ratio 1:3, dust

The Stable Diffusion output for each:

My alt text fed into Stable Diffusion for the enhanced photo
Four versions of colorful clouds of gas in a star-laden field, each showing hard transitions between clouds that mimic what we see in most nebulae images, if not the original image, along with the four-pointed diffraction spikes.
Any of these could be a NASA image.
CLIP Interrogator prompt fed into Stable Diffusion for the enhanced photo
One image is a dense yellow cloud with diffuse edges and purple stars against a black star field. Another is a cloud of lavender dust radiating out in fuzzy lines from the center, against black with rough colored dots. A loose cloud of white and purple taking up most of the frame, not much structure, only a little bit of black visible at the edge. A handful of narrow spiked glowing triangles of color on a field of black, one large white dust cloud in the corner, and edges of the image feeling like looking through torn parchment.
These feel like images pulled from India’s festival of color (Holi).

Painting

A painting
A night sky swirling with vivid blue spirals, a dazzling golden crescent moon, and constellations depicted as radiating spheres dominate the oil-on-canvas artwork. One or two flame-like cypress trees loom over the scene to the side, their black limbs curving and undulating to the motion of the partly obscured sky. A structured settlement lies in the distance in the bottom right of the canvas, among all of this activity. The modest houses and the thin spire of a church, which stands as a beacon against undulating blue hills, are made out of straight, controlled lines.
The Starry Night by Vincent van Gogh, 1889. The alternative text is cobbled together from an image description at Famous Paintings – Taking a Look at the World’s Most Popular Paintings. I figured it was worth letting professionals describe it.

This is the alternative text for the image (used as plain text description in source, and artist provided in original context):

A night sky swirling with vivid blue spirals, a dazzling golden crescent moon, and constellations depicted as radiating spheres dominate the oil-on-canvas artwork. One or two flame-like cypress trees loom over the scene to the side, their black limbs curving and undulating to the motion of the partly obscured sky. A structured settlement lies in the distance in the bottom right of the canvas, among all of this activity. The modest houses and the thin spire of a church, which stands as a beacon against undulating blue hills, are made out of straight, controlled lines.

CLIP Interrogator using ViT-L-14/openai generated this text prompt:

a painting of a starry night, featured on pixiv, post-impressionism, in style of old painting, masterpiece album cover, matte painting of human mind, broad strokes, black and white painting, with blue light dark blue sky, paint smears, an ai generated image, starry sky in background, vincent

The Stable Diffusion output for each:

My alt text fed into Stable Diffusion for the painting
Four images again. A lone tree on a rich orange hill outlined in a light blue glow against a dark blue sky filled with points of light and long horizontal orange streamers. A dark blue sky with wide arcs of almost chalk yellow and darker blue lines, with a glowing mound of yellow ringed in purple on the ground, itself surrounded by what may be hedgerows. A painterly yellow sun, surrounded by yellow and blue flecks, circled by a light blue border, looking down from a dark sky over a few ribbons of gold and orange coming up from a rounded blue and light blue horizon as if from a great height. A pair of giant yellow-green ampersands standing in a darkened field as a classical illustration of planetary orbits around the sun fills the sky above.
These all have a vibe, though I would never confuse them for being a Van Gogh.
CLIP Interrogator prompt fed into Stable Diffusion for the painting
Four sloppy variations of Van Gogh’s “Starry Night”, the differences primarily being framing (though one has it in a wooden frame), zoom level, darkness, and blockiness of strokes.
This was clearly less a prompt than instructions to look up the image in its data store.

Takeaway

Some developers, having given up on (or been told to give up on) the underwhelming Facebook-style "looks like…" tools that describe images, have looked for better tools to free them from having to personally consider blind and bandwidth-constrained users. Sadly, these automated prompt generators themselves rely on context many users may not have, such as naming artists or styles.

At the very least, these prompts would need to be edited for human consumption. The art sites, artists, uses, and strings of numbers are meaningless on their own. Simply removing them does nothing to make the prompt itself useful as alternative text.
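
To make that last point concrete, here is a minimal Python sketch that trims two of the prompts above. It assumes the first comma-separated segment of each prompt is the base caption and that everything after it is the modifier list of sites, artists, and number strings. `base_caption` is a hypothetical helper written for this illustration, not part of CLIP Interrogator.

```python
def base_caption(prompt: str) -> str:
    """Keep only the text before the first comma, dropping the modifier list."""
    return prompt.split(",", 1)[0].strip()


# Prompts copied verbatim from the CLIP Interrogator output in this post.
TOAST_PROMPT = (
    "a slice of bread with a bloody face on it, a stock photo, "
    "inspired by Kanbun Master, twitch emote, kawaii cutest sticker ever, "
    "artoast8p, aliased, assassin, bun, spraying blood, choi, bengus, "
    "exploitable image, 1 5 0 4, 1 4 5 0, winking, scobillyflup"
)
NEBULA_PROMPT = (
    "a star filled sky filled with lots of stars, inspired by Kim Keever, "
    "behance, space art, header, aspect ratio 1:3, dust"
)

if __name__ == "__main__":
    for prompt in (TOAST_PROMPT, NEBULA_PROMPT):
        print(base_caption(prompt))
    # prints:
    # a slice of bread with a bloody face on it
    # a star filled sky filled with lots of stars
```

What survives the trim is shorter and cleaner, but it still leaves out the Kawaii style, the smiling face, and the Watchmen reference for the toast, and the Carina Nebula, the colors, and the diffraction spikes for the Webb image.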

In short, this is also a flawed approach and no organization should consider it a viable option.

Wrap-up

24 assorted tiny thumbnails, mostly from nature except for one half face of a woman. The title of this post is the seed phrase for the opening image.

If you want to see what your favorite AI prompt generation tool might come up with using the same images I used, I made a Codepen. It might be easier to use the debug mode.

You can automatically generate alternative text or prompts from many tools. Don’t.

2 Comments

Made it without ever looking at any images. Then yawned and rubbed my eye. The lid almost flipped backwards!

In response to phil sawatsky.

Well, we certainly do not want eyelid damage.
