Goldman Sachs’ data chief says the open web is 'already' exhausted for training large models, so builders are pivoting to synthetic data and proprietary enterprise datasets. He argues there’s still 'a lot of juice' in corporate data, but only if firms can contextualize and normalize it well.
— If proprietary data becomes the key AI input, competition, privacy, and antitrust policy will hinge on who controls and can safely share these datasets.
Tyler Cowen
2025.12.03
60% relevant
The conversation emphasizes that putting everything online created the data ecosystem AI depends on; that trajectory explains why training pivots from public web corpora toward proprietary streams (enterprise data) once the open web is exhausted — a continuation of the internet→AI data story.
Anish J. Bhave
2025.12.03
62% relevant
Bhave’s proposal depends on feeding agents proprietary factory data (process logs, inspection images, throughput metrics) and using that data to produce supervision signals and quality insights — matching the existing idea that the next AI wave pivots to corporate/enterprise datasets as the core input.
EditorDavid
2025.11.30
86% relevant
Amazon’s memo pushing engineers to use Kiro rather than third‑party code generators creates an internal feedback loop and keeps developer telemetry in‑house. It directly exemplifies the shift from training on the open web to proprietary enterprise data and workplace signals, which the existing idea flags as decisive for competitive advantage and policy.
EditorDavid
2025.11.30
50% relevant
Amazon’s use of internal AI to comb through and select customer reviews is an example of a firm mining proprietary content to create monetizable outputs, aligning with the broader shift from open‑web training data to proprietary corporate datasets powering products and campaigns.
msmash
2025.10.02
100% relevant
Neema Raphael on Goldman’s podcast: 'We’ve already run out of data,' citing DeepSeek’s use of model outputs and the need to mine enterprise data.