New multimodal models let language models create images token by token, rather than handing prompts to a separate image tool. This yields precise, editable visuals (correct text, accurate annotations) and enables conversational, iterative art direction similar to text prompting. Early flaws remain, but the control and fidelity are a step beyond prior diffusion‑only pipelines.
— Collapsing text and image generation into one intelligent system will reshape creative work, marketing, and disinformation risk by making high‑quality visuals as steerable as prose.
Ethan Mollick
2025.03.30
100% relevant
Mollick’s 'no elephants' example shows GPT‑4o generating an annotated, elephant‑free room on demand, fixing spelling, and restyling an infographic through iterative prompts.
← Back to All Ideas