The body of literature on using synthetic data to improve LLM performance is expanding quickly. One of the more interesting recent results in this domain was Microsoft's "Textbooks Are All You Need," which I discussed in a previous Substack post here.
The potential of this work led Vik Paruchuri and me to team up, along with a cohort of other contributors, to attempt to replicate these results, but with a twist: we plan to use open source models instead of OpenAI's proprietary models to reduce costs, improve the accessibility of this work, and avoid potential terms-of-service violations. This task turned out to be more difficult than we originally expected, for several reasons:
Generating 20 billion unique, high-quality synthetic tokens is challenging.
Microsoft did not provide details on how they produced their textbook-quality data, and careful techniques are needed to ensure the diversity and richness of the generated text. For instance, if one fails to provide sufficiently varied seed data as input, the generated text will overlap heavily across samples, which becomes a significant problem at scale (see the prompt-seeding sketch after this list).
There are many potential approaches to generating synthetic textbook data.
For example, one approach is to prompt the model directly to write textbook content; another is to prompt for essay-style responses to questions. Which generation strategy works best remains an open question.
The value added vs. cost of generating massive datasets is still uncertain.
The results from a small model trained on a small corpus (e.g. Phi-1/1.5) do not necessarily extrapolate to much larger models trained on hundreds of billions of tokens. Moreover, it is hard to quantify exactly what edge synthetic data brings. More research is needed to determine the scalability and strengths of these new techniques.
Defining and validating "textbook quality" is difficult.
We found it relatively inexpensive to generate large quantities of low- to medium-quality text, but confirming factual accuracy, coherence, and pedagogical utility requires extensive human evaluation or automatic metrics that don't yet exist. For example, I can do this kind of analysis by hand on a tiny subset of the generated data, but there is no easy way to extrapolate that analysis to the dataset at scale.
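To make the seed-data and prompting points above concrete, here is a minimal sketch of the kind of prompt seeding that helps with diversity: crossing seed topics, audiences, and styles so that no two generations start from the same premise. The topic lists, template, and helper names are illustrative assumptions for this post, not the exact pipeline we are running.

```python
import itertools
import random

# Illustrative seed dimensions; a real pipeline would draw these from a much
# larger curated list (e.g. thousands of topics mined from course catalogs).
TOPICS = ["linear algebra", "cell biology", "thermodynamics", "microeconomics"]
AUDIENCES = ["a curious high school student", "a first-year undergraduate", "a practicing engineer"]
STYLES = ["a worked-example-heavy chapter", "a conceptual overview with analogies", "a problem set with solutions"]

PROMPT_TEMPLATE = (
    "Write a section of a textbook on {topic} for {audience}. "
    "Structure it as {style}. Include concrete examples and avoid repeating "
    "standard introductory boilerplate."
)

def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Cross the seed dimensions and sample n unique prompts."""
    combos = list(itertools.product(TOPICS, AUDIENCES, STYLES))
    random.Random(seed).shuffle(combos)
    return [
        PROMPT_TEMPLATE.format(topic=t, audience=a, style=s)
        for t, a, s in combos[:n]
    ]

if __name__ == "__main__":
    for prompt in build_prompts(5):
        print(prompt, "\n")
```

In practice the seed dimensions need to be orders of magnitude larger than this before the generated corpus stops overlapping with itself.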
To generate these findings, we experimented with models of varying sizes, from Mistral-7b to Llama-2 34b. Lower-parameter models appeared capable of producing satisfactory but not outstanding samples; however, more investment on our end is required to rigorously quantify what "good but not great" textbook-like data means. We tried augmenting the textbook data by including context retrieved from Wikipedia, and we also attempted fine-tuning models to specialize in textbook generation. However, evaluating the comparative quality of these different datasets beyond human spot-checks needs further work.
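As an illustration of the Wikipedia-grounding idea, the sketch below folds a retrieved passage into the generation prompt. It assumes the `wikipedia` PyPI package as the retriever and leaves the model call itself out; our actual retrieval, chunking, and prompting details differ.

```python
import wikipedia  # pip install wikipedia

def retrieve_context(topic: str, max_chars: int = 2000) -> str:
    """Pull a short grounding passage for the topic from Wikipedia."""
    try:
        titles = wikipedia.search(topic, results=1)
        if not titles:
            return ""
        return wikipedia.summary(titles[0], auto_suggest=False)[:max_chars]
    except Exception:
        # Disambiguation pages, missing articles, or network errors:
        # fall back to an ungrounded prompt rather than failing the job.
        return ""

def build_grounded_prompt(topic: str) -> str:
    """Prepend retrieved reference material to the textbook-writing prompt."""
    context = retrieve_context(topic)
    return (
        f"Reference material:\n{context}\n\n"
        f"Using the reference material above where it is relevant, write a "
        f"textbook section on {topic}. Add worked examples and explain any "
        f"terms a newcomer would not know."
    )

if __name__ == "__main__":
    print(build_grounded_prompt("thermodynamics")[:500])
```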
Fine-tuning open source models on high-quality samples from proprietary models like GPT-3.5 and GPT-4 did improve results, but this re-introduces dependencies we are trying to avoid. It may be possible to replicate these gains once classifiers can effectively filter for top-quality examples, but developing such systems remains an important area for future work.
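One plausible shape for such a filtering system, sketched under the assumption that a small hand-labeled seed set is available, is a lightweight classifier that scores each generation and keeps only the top-scoring ones. The example below uses TF-IDF features and logistic regression as a stand-in for a stronger embedding-based classifier; the labeled examples are toy placeholders, and with a realistically sized seed set the scores become meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy hand-labeled seed set (1 = textbook quality, 0 = not); in practice this
# would be thousands of human-reviewed samples, not four short strings.
seed_texts = [
    "A vector space is a set equipped with addition and scalar multiplication "
    "satisfying the following axioms. Consider the example of R^2...",
    "Photosynthesis converts light energy into chemical energy. We can trace "
    "the electrons through the light-dependent reactions step by step.",
    "lol this is so cool, anyway here is some random stuff about math idk",
    "Buy cheap textbooks online! Best prices! Click here now!!!",
]
seed_labels = [1, 1, 0, 0]

quality_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_clf.fit(seed_texts, seed_labels)

def filter_generations(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only generations the classifier scores as likely textbook quality."""
    scores = quality_clf.predict_proba(samples)[:, 1]
    return [s for s, p in zip(samples, scores) if p >= threshold]

if __name__ == "__main__":
    candidates = [
        "The chain rule lets us differentiate composite functions. For example...",
        "subscribe to my channel for more great content!!!",
    ]
    print(filter_generations(candidates, threshold=0.5))
```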
This work is still ongoing, and we plan to refine the system over time in order to generate a large textbook-quality dataset. In the meantime, it has become clearer to me that there is lower-hanging fruit in the synthetic data generation space: focusing on tasks whose output can be verified explicitly. This allows one to quickly build a positive feedback loop that can be iterated on (a minimal sketch of such a loop follows). Some interesting recent results are beginning to show verifiable synthetic data's potential for building domain-expert agents, such as ToRA: A Tool-Integrated Reasoning Agent and Language Models can be Logical Solvers.
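To illustrate what explicitly verifiable output buys you, here is a minimal generate-verify-keep loop for synthetic code data: candidate solutions are executed against known test cases, and only passing samples enter the dataset. The `generate_solutions` argument stands in for whatever model call is used, and nothing here reflects the exact systems in the papers above.

```python
from typing import Callable

def verify(candidate_src: str, tests: list[tuple[tuple, object]], func_name: str) -> bool:
    """Run a candidate solution in a fresh namespace and check it against
    known input/output pairs. A real pipeline should sandbox this step."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # sketch only; not safe for untrusted code
        fn: Callable = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

def build_verified_dataset(problems, generate_solutions):
    """Keep only (problem, solution) pairs whose solution passes its tests."""
    dataset = []
    for problem in problems:
        for candidate in generate_solutions(problem["prompt"]):
            if verify(candidate, problem["tests"], problem["func_name"]):
                dataset.append({"prompt": problem["prompt"], "solution": candidate})
                break  # one verified solution per problem is enough here
    return dataset

if __name__ == "__main__":
    problems = [{
        "prompt": "Write a function add(a, b) that returns the sum of two numbers.",
        "func_name": "add",
        "tests": [((1, 2), 3), ((-1, 1), 0)],
    }]
    # Stand-in for a model call: one incorrect and one correct candidate.
    fake_generate = lambda prompt: ["def add(a, b):\n    return a - b",
                                    "def add(a, b):\n    return a + b"]
    print(build_verified_dataset(problems, fake_generate))
```

The appeal of this setup is that the verifier, not a human or another model, decides what enters the training set, so the loop can run unattended and improve with scale.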
In conclusion, while preliminary experiments have been promising and interesting, efficiently generating large-scale, high-quality synthetic datasets for LLM pretraining remains an unsolved challenge for the open source community. Key problems include devising optimal generation strategies, developing reliable metrics to assess data quality, and reducing dependence on proprietary systems. Targeted benchmarks and rigorous ablation studies will help elucidate best practices. With continued research, synthetic data may yet enable major leaps in LLM capabilities. But realizing this potential will require creative techniques and likely substantial computational resources. The area is ripe for further exploration by the AI community.