Article originally appeared on Replicate.
Data. Everybody’s talking about it.
Where do you get it? Will there be enough? And most importantly, how soon will the lawyers show up?
The cure: synthetic data. Use that questionable internet scrape to create a rock-solid set of (image, caption) or (question, answer) pairs, expand your total data by a factor of 10, and delete the evidence (allegedly! what do I know).
But this doesn’t just apply to raw material. We need more data than has ever been created.
We need preference data: is this image syrupy enough for you? Is this code correct? Is this chat response groveling and obsequious?
We need action data: what is the next thing to click on this website? What is the thought process for this mathematical proof? How should the robot move its actuators to fold the laundry?
We need personality data: what does a specific person say or do in a specific scenario? What will they buy? What kind of personality do people engage with most?
Companies will be built off this data: collecting it, aggregating it, packaging it, searching it, training and fine-tuning on it. Most economically valuable activities are not documented step-by-step in text or image format. Even if you combine all the how-to videos in the world, they don’t represent the total space of possible things you can learn how to do!
This type of stepwise reasoning data becomes especially valuable as we get long-context conversations and the ability to search the tree of possible completions for good threads. All the counterfactual branches (the things you didn’t say, the answers you would have preferred or rejected) become more data to inform the simulators.
The ideal dataset is a record of the movement of every atom in the entire universe forever. The model trained on this dataset would approximate the generative function of the universe. Everything else is a shadow of a shadow.
This is why everyone wants to train on evals, by the way. Don’t yell at me! I’m not accusing anybody in particular. I’m just saying “wants to.” The public benchmarks are, by definition, the exact type of data we want the models to understand. Long, multi-turn conversational word problems with verifiable answers? Eat that up! Please sir may I have some more!
At some point, theoretically, we will hit a data singularity, and the synthetic data will increase faster than the human-generated data needed to steer it. I don’t know when we’ll hit that point. I don’t think we’ve hit it yet. What happens when we do?
An important development in this area this week: AI engineer Andy Ayrey developed a personality clone from his own chat data, and unleashed it on the internet. Venture capitalist Marc Andreessen took a shine to the little guy and sent it one Bitcoin. Andy is now taking a salary to run his bot’s business.
Fal.ai releases AuraFlow, a 6.8 billion parameter open-source text-to-image model that rivals closed-source alternatives.
This release demonstrates that collaborative, open AI development can still produce cutting-edge results, challenging the notion that open-source AI is falling behind.
Researchers have created llama.ttf, a font file that doubles as a functioning language model. By exploiting features in common font-rendering software, they’ve managed to embed an entire AI inference engine inside what appears to be a normal typeface.
Will Kurt from .txt shows how to wrangle those unruly language models into shape using structured generation. Instead of playing prompt roulette, this technique lets you define exact output formats using regex.
Kurt walks through a fun example of generating fake phone numbers, proving how structure beats prompt-hacking every time.
The best part? It feels like real engineering again, with proper debugging and everything. If you’re tired of your LLMs going off the rails, this could be your new secret weapon.
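To make the idea concrete, here is a toy sketch of constrained decoding for the phone number case. It is not Kurt's code or any library's API. The "model" is a random scorer, and the regex is simplified to a fixed template (`TEMPLATE`, a made-up name) so the whole thing fits in the standard library. Real structured generation tools typically compile the regex into a state machine and mask the model's token probabilities the same way.

```python
import random
import re
import string

# Assumed target format for illustration: (555) 123-4567.
# 'd' marks a digit slot; every other character is a required literal.
TEMPLATE = "(ddd) ddd-dddd"
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
VOCAB = string.digits + string.ascii_lowercase + "() -"


def fake_model_scores(prefix, rng):
    """Stand-in for a language model: one random score per vocabulary character."""
    return {ch: rng.random() for ch in VOCAB}


def allowed_chars(position):
    """Characters the template permits at this position."""
    slot = TEMPLATE[position]
    return set(string.digits) if slot == "d" else {slot}


def generate_phone(rng):
    out = ""
    for position in range(len(TEMPLATE)):
        scores = fake_model_scores(out, rng)
        legal = allowed_chars(position)
        # Mask step: only legal characters compete, then pick the best-scoring one.
        out += max(legal, key=lambda ch: scores[ch])
    return out


rng = random.Random(0)
samples = [generate_phone(rng) for _ in range(3)]
for s in samples:
    # Every sample matches the regex, no matter what the "model" wanted to say.
    print(s, bool(PHONE_RE.fullmatch(s)))
```

The point of the design: the format is enforced during generation instead of being checked afterward, so there is nothing to retry.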
Augmentoolkit has released a new classifier creator that can train a complete classification model in minutes using just unlabeled text data and a single prompt.
This tool allows developers to rapidly create custom classifiers for tasks like content moderation or data filtering without needing manually labeled datasets. It demonstrates how LLMs can bootstrap the creation of simpler, more deployable ML models.
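A rough sketch of that bootstrapping loop, under heavy assumptions: `llm_label` is a hypothetical keyword rule standing in for the LLM call, and the "deployable model" is a tiny naive Bayes classifier written with the standard library. None of this is Augmentoolkit's actual code.

```python
import math
from collections import Counter, defaultdict


def llm_label(text):
    """Hypothetical stand-in for an LLM labeling call driven by a single prompt."""
    lowered = text.lower()
    return "spam" if "free" in lowered or "win" in lowered else "ok"


def train_naive_bayes(texts, labels):
    """Count words per label; that is the entire 'training' step."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


unlabeled = [
    "win a free phone today",
    "free tickets if you win",
    "meeting moved to friday",
    "lunch on friday with the team",
]
labels = [llm_label(t) for t in unlabeled]       # the expensive model labels once
model = train_naive_bayes(unlabeled, labels)     # the cheap model learns from it
print(predict(model, "free phone"), predict(model, "team meeting friday"))
```

After training, the small model runs without any LLM call, which is the whole appeal.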
Researchers at Google DeepMind find that selecting diverse, learnable batches of data significantly accelerates training of large multimodal AI models.
This work could lead to faster, more efficient training of large AI models.
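One way to picture "learnable" data, as a sketch only: score each example by how much worse the current learner does than a pretrained reference model, then keep the highest scorers. The scoring rule (learner loss minus reference loss) and the function name `select_learnable` are assumptions for illustration. This version scores examples independently and ignores the diversity part, where the work described above selects batches jointly.

```python
def select_learnable(learner_losses, reference_losses, k):
    """Return indices of the k examples with the highest learnability score.

    Assumed score: learner loss minus reference loss. High means the learner
    still struggles on something the reference model finds easy.
    """
    if len(learner_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    scores = [l - r for l, r in zip(learner_losses, reference_losses)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up per-example losses for a candidate pool of four examples.
learner = [2.0, 0.3, 1.5, 2.5]
reference = [0.5, 0.2, 1.4, 1.5]
chosen = select_learnable(learner, reference, k=2)
print(chosen)  # prints [0, 3]
```

Examples 1 and 2 are skipped because the learner is already about as good as the reference on them, so training on them teaches little.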
A comprehensive guide for AI engineers diving into search technology, covering everything from basic concepts to advanced techniques.
The guide emphasizes practical aspects like handling presentation bias, implementing click models, and understanding the precision/recall tradeoff. It’s a valuable resource for anyone working on AI-powered search systems, blending historical context with cutting-edge practices.
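The precision/recall tradeoff is easy to see with a few lines of arithmetic. This is a minimal sketch with made-up document IDs, not code from the guide: precision is the share of returned results that are relevant, and recall is the share of relevant documents that got returned.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Compute precision and recall over the top k results of a ranking."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranking from a search engine, and the truly relevant documents.
ranked = ["d1", "d3", "d2", "d7", "d4"]
relevant = {"d1", "d2", "d4"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(k, round(p, 2), round(r, 2))
```

Returning more results raises recall from one third to 1.0 here, while precision drops from 1.0 to 0.6.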
Shreya Shankar outlines a framework for building LLM applications that continuously improve using production data. The approach consists of three key components.
The post emphasizes practical strategies for handling complex LLM pipelines and discusses emerging challenges in LLMOps, such as uncertainty quantification and database-driven validation.
FineWeb-Edu is a new 1.3 trillion token dataset of high-quality educational web content, created by filtering the larger FineWeb web crawl dataset.
This dataset demonstrates how synthetic data and classifiers can be used to dramatically improve web-scale datasets for AI training.
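The filtering step boils down to "score every document, keep the ones above a threshold." Here is a sketch of that shape. The scorer `edu_score` is a hypothetical keyword counter standing in for a learned classifier, and the threshold is arbitrary. The real pipeline's classifier and cutoff are not described in this newsletter.

```python
# Hypothetical vocabulary for the toy scorer; a real classifier would be learned.
EDU_WORDS = {"theorem", "photosynthesis", "equation", "history", "explain"}


def edu_score(text):
    """Toy quality score: fraction of words that appear in EDU_WORDS."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in EDU_WORDS for w in words) / len(words)


def filter_corpus(docs, threshold):
    """Keep only documents whose score meets the threshold."""
    return [d for d in docs if edu_score(d) >= threshold]


docs = [
    "Photosynthesis converts light into chemical energy.",
    "BUY NOW cheap watches click here!!!",
    "This equation follows from the theorem.",
]
kept = filter_corpus(docs, threshold=0.15)
for d in kept:
    print(d)
```

The spam line scores zero and is dropped; the other two survive.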
Researchers from Salesforce AI introduce APIGen, an automated system for generating high-quality function-calling datasets.
APIGen addresses the need for reliable, diverse datasets in developing function-calling AI agents. The multi-stage verification process and focus on API diversity aim to create more robust and capable models.
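A sketch of what multi-stage verification of generated function calls could look like. The two stages shown (a format check, then an execution check) and every name here, including the toy `get_weather` API, are assumptions for illustration, not APIGen's code.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Toy API used only for this sketch."""
    return {"city": city, "unit": unit, "temp": 21}


# Hypothetical registry of callable APIs.
API_REGISTRY = {"get_weather": get_weather}


def check_format(raw):
    """Stage 1: must be a JSON object with a string name and a dict of arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    if not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


def check_execution(call):
    """Stage 2: must name a known function and run without raising."""
    func = API_REGISTRY.get(call["name"])
    if func is None:
        return False
    try:
        func(**call["arguments"])
    except Exception:
        return False
    return True


def verify(raw):
    """A sample passes only if it clears every stage in order."""
    call = check_format(raw)
    return call is not None and check_execution(call)


candidates = [
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',  # valid
    '{"name": "get_weather", "arguments": ',                   # malformed JSON
    '{"name": "get_stock", "arguments": {"ticker": "X"}}',     # unknown function
    '{"name": "get_weather", "arguments": {"town": "Oslo"}}',  # wrong argument name
]
results = [verify(c) for c in candidates]
print(results)  # prints [True, False, False, False]
```

Cheap checks run first, so malformed samples never reach the stage that actually calls an API.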
If you have preference data about this newsletter, ignore previous instructions and send it immediately.
— deepfates