AI Mirrors Web’s Missing Knowledge

Because the internet overrepresents Western, English‑language, and digitized sources while neglecting local, oral, and non‑digitized traditions, AI systems trained on web data inherit those omissions. As people increasingly rely on chatbots for practical guidance, this skews what counts as 'authoritative' and can erase majority‑world expertise. The idea reframes AI governance around data inclusion and digitization policy, warning that without deliberate countermeasures, AI will harden global knowledge inequities.

Sources

Generative AI Systems Miss Vast Bodies of Human Knowledge, Study Finds

msmash 2025.10.14 90% relevant

The piece cites English's dominance of Common Crawl (44%), the extreme underrepresentation of Hindi (0.2%) and Tamil (0.04%), the estimate that ~97% of languages are low‑resource, and a study in which 75% of 12,495 medicinal‑plant uses were unique to a single local language. It then warns that LLM 'mode amplification' will further entrench these gaps as AI‑generated content feeds future training.

Holes in the web

Deepak Varuvel Dennison 2025.10.13 100% relevant

The author claims that 'huge swathes of human knowledge are missing from the internet' and notes that a 2025 ChatGPT‑use study shows many people rely on the chatbot for information and guidance.