Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI-generated audio made with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall
0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers