LLMs Directly Generate Images

New multimodal models let language models create images token by token, rather than handing prompts to a separate image tool. This yields precise, editable visuals (correct text, accurate annotations) and enables conversational, iterative art direction similar to text prompting. Early flaws remain, but the control and fidelity are a step beyond prior diffusion‑only pipelines. Collapsing text and image generation into one system will reshape creative work and marketing, and raise disinformation risk, by making high‑quality visuals as steerable as prose.

Sources

No elephants: Breakthroughs in image generation

Ethan Mollick 2025.03.30 100% relevant

Mollick’s 'no elephants' example shows GPT‑4o generating an annotated, elephant‑free room on demand, fixing spelling, and restyling an infographic through iterative prompts.