Alison Smith’s Post

Alison Smith

Director of Generative AI

The AI industry is approaching a potential bottleneck: the supply of high-quality public data online is finite, and that data has so far been crucial to training increasingly powerful models. Companies like OpenAI and Google are exhausting the internet's data reserves, which is forcing a search for new data sources. For a deeper dive, check out: https://lnkd.in/eA5w77Et

To keep advancing model performance while addressing data scarcity, chip shortages, and power limitations, tech companies may consider:

🤖 Creating synthetic data from models
📺 Collecting transcripts from videos (oh, hi! YouTube)
👩‍💻 Improving data selection and curation methods
💡 Developing novel, less data-hungry training methods

Wherever there is a challenge, there is an opportunity - even for companies in other industries. Many enterprises own an abundance of rich data that they can either monetize (if not sensitive) or use to fine-tune models in ways that large tech companies cannot.

Eyes peeled, everyone. The next act of AI ingenuity is just unfolding… Would you like me to unpack each of the 4 approaches listed above? 👀

For Data-Guzzling AI Companies, the Internet Is Too Small

wsj.com
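
To make the "data selection and curation" bullet in the post a little more concrete, here is a minimal, illustrative sketch in Python: a couple of cheap heuristic quality filters plus exact-duplicate removal over a pile of raw text. The thresholds, helper names, and sample documents are all assumptions for illustration, not anything described in the WSJ article or the post.

# Illustrative only: toy quality filtering + deduplication for a raw text corpus.
# Thresholds and heuristics are assumptions, not a production pipeline.
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def passes_quality_filters(text: str) -> bool:
    """Cheap heuristics: enough words, mostly alphabetic, not wildly repetitive."""
    words = text.split()
    if len(words) < 20:                      # too short to be useful
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                    # likely markup, tables, or noise
        return False
    if len(set(words)) / len(words) < 0.3:   # heavy repetition ("walls of text")
        return False
    return True

def curate(corpus: list[str]) -> list[str]:
    """Keep documents that pass the filters and are not exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in corpus:
        if not passes_quality_filters(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

if __name__ == "__main__":
    raw = [
        "A long, informative paragraph about training data quality ..." * 3,
        "A long, informative paragraph about training data quality ..." * 3,  # duplicate
        "short junk",
    ]
    print(f"kept {len(curate(raw))} of {len(raw)} documents")

Real curation pipelines layer on much more (near-duplicate detection, model-based quality scoring, contamination checks), but the keep/drop structure is the same idea.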

Beckett D.

Innovator & Creator | AI Engineer @Yamamoto | Pioneering The First 100% Sustainable AI Solutions and Research

10mo

This is where the beauty of Invisible comes in: hiring humans at a living wage to create high-quality data for AI. Studies like the Yi model research paper have shown that high-quality data is vastly more effective than just throwing in whatever you can find. 10k awesome prompt/response pairs are way better than 100k of just 'walls of text'. We run the risk of AI, and the internet, eating itself if we keep scraping and feeding it text generated by itself and other LLMs.
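
Beckett's "10k great pairs beat 100k walls of text" point can be pictured as a simple selection step: score each prompt/response pair with cheap heuristics and keep only the best-scoring slice. The scoring rules, weights, and cutoff below are illustrative assumptions on my part, not the Yi paper's or Invisible's actual curation recipe.

# Illustrative only: rank prompt/response pairs by cheap heuristics and keep the top slice.
# The scoring rules and cutoff are assumptions, not any published curation recipe.
from dataclasses import dataclass

@dataclass
class Pair:
    prompt: str
    response: str

def score(pair: Pair) -> float:
    """Higher is better: reward substantive responses, penalize empty or bloated ones."""
    p_words = len(pair.prompt.split())
    r_words = len(pair.response.split())
    if r_words == 0 or p_words == 0:
        return 0.0
    length_score = min(r_words / 50.0, 1.0)                  # favor responses with some substance
    ratio_penalty = 0.5 if r_words > 100 * p_words else 0.0  # flag "walls of text" vs. a tiny prompt
    return length_score - ratio_penalty

def select_top(pairs: list[Pair], keep_fraction: float = 0.1) -> list[Pair]:
    """Keep only the best-scoring fraction of pairs (e.g. 10k out of 100k)."""
    ranked = sorted(pairs, key=score, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

if __name__ == "__main__":
    demo = [
        Pair("Explain overfitting.", "Overfitting is when a model memorizes noise " * 10),
        Pair("Explain overfitting.", "idk"),
        Pair("Hi", "word " * 500),  # wall of text relative to the prompt
    ]
    best = select_top(demo, keep_fraction=0.34)
    print(len(best), "pair(s) kept; top prompt:", best[0].prompt)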

Satchel Aviram

Generative AI Strategist - Booz Allen Hamilton

10mo

Although I can’t find a link to the full interview, Sam Altman shared this same sentiment when talking with Korean Silicon Valley correspondents. “In the long run, there will be a shortage of human-generated data. For this reason, we need models that can learn more with less data.” It will be interesting to see how this space evolves! Coverage of the interview @ Seoul Economic Daily: https://m.sedaily.com/NewsView/2D6O83AF81#cb

My initial take on video transcripts is that they would bias a model toward more informal language. Whether this would be helpful (or not) would depend on the use case. How would creating synthetic data avoid the risk of model collapse? I would be interested in hearing your perspective on that, Alison.


Great topic. You are becoming quite LinkedIn prolific! :)
