AI Can Write Near Human-Level College Textbooks

In our pursuit to replicate recent research findings, we discovered the potential to create AI-authored textbooks that rival human-authored ones in quality.

Oct 04, 2023

Recent advancements in synthetic data research have garnered significant attention. Breakthroughs like Orca, Distilling Step-by-Step, and Textbooks Are All You Need II have demonstrated that synthetic data can significantly enhance the efficiency of training and fine-tuning Large Language Models (LLMs).

Yet, a crucial, underexplored question emerges: "What role can AI play in human education?"

It's imperative to consider this, as AI's impact is likely to extend far beyond its role as a conversational agent. The emergence of synthetic textbooks supports this point, as we will cover in detail below. We came upon this understanding while attempting to replicate some recent research results.

Attempting to replicate Textbooks Are All You Need II:

This work presented a 1.3 billion parameter model, phi-1.5, that appears to match the common sense reasoning benchmark results of models nearly ten times its size that were trained on datasets more than ten times larger.

The main focus of this work was the training data, which was made up mostly of synthetic "textbook-like" data, amounting to roughly 21 billion synthetic tokens. The data was carefully constructed to impart common sense reasoning and a broad spectrum of world knowledge. The research underscored that such synthetic data shows great potential for improving the efficiency of training LLMs. We can see this work as a deep attempt to use AI to self-educate more efficiently. It is interesting because it penetrates the learning stack more deeply than similar previous efforts in fine-tuning.

While both phi-1 and phi-1.5 were generously open sourced, the synthetic data and the code to generate it were not. Given that training methods for smaller models like phi are well known, I found the data to be the most interesting part of this research, and it is still being kept secret. Therefore, I set out to replicate Microsoft’s work in a fully open source setting, where I have found others to work with. We are now attempting to further our understanding and benefit the open source AI research community.

Some Key Observations From Early Replication Attempts

Challenges in Synthetic Token Generation: Generating a staggering 21 billion synthetic tokens is no walk in the park. For perspective, a single A100 can churn out only 50M to 100M tokens per day. Beyond sheer volume, ensuring diversity in seed generation is a complex undertaking.
Structuring Output Data: There are a lot of directions we can go here. Should the output data adopt a more instructional mold, elucidating step-by-step reasoning? Or should it mirror traditional textbooks, filled with facts and somewhat deeper prose than what we find on Wikipedia?
Knowledge Domain Distribution: There's ambiguity surrounding the optimal distribution of knowledge domains. How should we apportion the dataset across primary and secondary education (grades 1-12) versus collegiate level knowledge?
Topic-wise Distribution: Another facet to consider is the distribution of the dataset across various topics. What proportion should be allotted to the humanities, math and science, etc.?
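To put the throughput figure above in perspective, here is a back-of-the-envelope calculation. It uses only the 21 billion token target and the 50M to 100M tokens per A100 per day range quoted above. The 8-GPU cluster size is a hypothetical value chosen purely for illustration, not a description of our actual setup.

```python
# Rough compute budget for generating the synthetic dataset.
# TARGET_TOKENS and the throughput range come from the text above;
# NUM_GPUS is an assumed, illustrative cluster size.
TARGET_TOKENS = 21_000_000_000
TOKENS_PER_A100_DAY_LOW = 50_000_000
TOKENS_PER_A100_DAY_HIGH = 100_000_000
NUM_GPUS = 8  # hypothetical


def gpu_days(target_tokens: int, tokens_per_gpu_day: int) -> float:
    """Number of single-GPU days needed to generate target_tokens."""
    return target_tokens / tokens_per_gpu_day


def wall_clock_days(target_tokens: int, tokens_per_gpu_day: int, num_gpus: int) -> float:
    """Elapsed days if generation is spread evenly over num_gpus GPUs."""
    return gpu_days(target_tokens, tokens_per_gpu_day) / num_gpus


worst_case = gpu_days(TARGET_TOKENS, TOKENS_PER_A100_DAY_LOW)
best_case = gpu_days(TARGET_TOKENS, TOKENS_PER_A100_DAY_HIGH)
print(f"A100-days needed: {best_case:.0f} to {worst_case:.0f}")
print(
    f"Wall-clock days on {NUM_GPUS} GPUs: "
    f"{wall_clock_days(TARGET_TOKENS, TOKENS_PER_A100_DAY_HIGH, NUM_GPUS):.2f} to "
    f"{wall_clock_days(TARGET_TOKENS, TOKENS_PER_A100_DAY_LOW, NUM_GPUS):.2f}"
)
```

Under these assumptions the full dataset costs between 210 and 420 A100-days, which is why volume alone is a real obstacle before diversity and structure are even considered.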

If you would like to explore some of the early experiments, they are on Hugging Face, here, here, and here. To read some of the most exemplary textbooks, visit GitHub here.

Onto High Quality Synthetic Textbooks

My initial goal was to quickly produce fully synthetic textbook-like data snippets. These samples were meant to cover a smattering of topics and to include significant step-by-step reasoning in many of the examples. I made the dataset in this manner because it was my best interpretation of the work performed by Microsoft. However, one could take a more literal approach to the synthetic textbook proposal, which is something investigated by Vikas (Vik) Paruchuri here. I was impressed by this work and felt there was really something to the idea of building textbooks with AI. We have since teamed up to combine our efforts in replicating this research.

Regardless of whether my initial approach or the one taken by Vik is better for training LLMs, the idea seemed worth extending further to see if we could produce something useful for humans.

The approach I took to investigate this was to build a four-step generation pipeline that does the following:

1. Scrapes all course pages from MIT OpenCourseWare.
2. Uses AI to convert the scraped pages into properly structured syllabi.
3. Uses AI to transform the syllabi into 20-chapter textbook tables of contents, complete with sections and subsections. You can see one such example here.
4. Uses AI to generate a textbook from the table of contents with the script here.
Optional: as part of step 4, augments generation by retrieving the most relevant content from Wikipedia.
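The final generation step can be sketched roughly as follows. This is a minimal illustration, not the actual script linked above: the `Chapter` and `Section` types, `iter_outline`, `build_prompt`, and `generate_book` are all hypothetical names, and `complete` and `retrieve` are placeholder callables standing in for an LLM call and a Wikipedia lookup. The sketch walks the table of contents one subsection at a time and feeds each prompt the retrieved context plus the previously generated text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional

FENCE = "`" * 3  # Markdown code fence, built here to avoid nesting fences


@dataclass
class Section:
    title: str
    subsections: List[str] = field(default_factory=list)


@dataclass
class Chapter:
    title: str
    sections: List[Section] = field(default_factory=list)


def iter_outline(chapters: List[Chapter]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one writing task per subsection (or per section if it has none)."""
    for chapter in chapters:
        for section in chapter.sections:
            for sub in section.subsections or [None]:
                yield {"chapter": chapter.title, "section": section.title, "subsection": sub}


def build_prompt(book_title: str, task: Dict[str, Optional[str]],
                 context: str, last_content: str) -> str:
    """Combine outline position, retrieved context, and the previous output."""
    lines = [
        "### Instructions:",
        f'You are writing a book titled "{book_title}".',
        f"# Title: {book_title}",
        f"## Chapter: {task['chapter']}",
        f"### Section: {task['section']}",
    ]
    if task["subsection"]:
        lines.append(f"### Subsection (optional): {task['subsection']}")
    lines += ["### Related Context", FENCE, context, FENCE]
    lines += ["### Last textbook section content:", FENCE, last_content, FENCE]
    lines.append("### Response:")
    return "\n".join(lines)


def generate_book(book_title: str, chapters: List[Chapter],
                  complete: Callable[[str], str],
                  retrieve: Optional[Callable[[str], str]] = None) -> List[str]:
    """Generate the book piece by piece, carrying the last output forward."""
    outputs: List[str] = []
    last_content = ""
    for task in iter_outline(chapters):
        query = task["subsection"] or task["section"]
        context = retrieve(query) if retrieve else ""  # optional retrieval step
        prompt = build_prompt(book_title, task, context, last_content)
        last_content = complete(prompt)  # placeholder for the LLM call
        outputs.append(last_content)
    return outputs


# Usage with a stub in place of a real model (assumption: no API is called here).
toc = [Chapter("Chapter 20: Innovation and Legal Considerations",
               [Section("20.4 Legal Compliance in Innovation",
                        ["20.4c Implementing Legal Compliance in Innovation"])])]
drafts = generate_book("Organizing for Innovative Product Development: A Comprehensive Guide",
                       toc, complete=lambda prompt: "(model output would go here)")
print(len(drafts), "piece(s) generated")
```

Passing the model and the retriever in as plain callables keeps the sketch independent of any particular LLM provider or search backend.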

To make the pipeline more concrete, I will share one randomly chosen example of the final step, along with the resulting output, for a single chapter subsection.

Here is an example prompt for book generation:

### Instructions:
You are a writing a book titled “Organizing for Innovative Product Development: A Comprehensive Guide". You are currently writing the chapter and section shown below:
(NOTE FOR READERS - EXTRACTED FROM THE TABLE OF CONTENTS)
# Title: Organizing for Innovative Product Development: A Comprehensive Guide
## Chapter: - Chapter 20: Innovation and Legal Considerations:
### Section: - Section: 20.4 Legal Compliance in Innovation:
### Subsection (optional): 20.4c Implementing Legal Compliance in Innovation
To assist you in writing the chapter, you have been provided with some related context and recent chapter contents below:
### Related Context
``` (NOTE FOR READERS - THIS IS FROM WIKIPEDIA)
## Legal compliance
Legal compliance is the process or procedure to ensure that an organization follows relevant laws, regulations and business rules. The definition of legal compliance, especially in the context of corporate legal departments, has recently been expanded to include understanding and adhering to ethical codes within entire professions, as well. There are two requirements for an enterprise to be compliant with the law, first its policies need to be consistent with the law. Second, its policies need to be complete with respect to the law.
The role of legal compliance has also been expanded to include self-monitoring the non-governed behavior with industries and corporations that could lead to workplace indiscretions.
Within the LGRC realm, it is important to keep in mind that if a strong legal governance component is in place, risk can be accurately assessed and the monitoring of legal compliance be carried out efficiently. It is also important to realize that within the LGRC framework, legal teams work closely with executive teams and other business departments to align their goals and ensure proper communication.
### Legal consistency
Legal consistency is a property that declares enterprise policies to be free of contradictions with the law. Legal consistency has been defined as not having multiple verdicts for the same case.
The antonym Legal inconsistency is defined as having two rule that contradict each other. Other common definitions of consistency refer to “treating similar cases alike”. In the enterprise context, legal consistency refers to “obedience to the law”. In the context of legal requirements validation, legal consistency is defined
```
### Last textbook section content:
``` (NOTE FOR READERS - THIS IS FROM THE PREVIOUS AI RESPONSE)
## Chapter: Innovation and Legal Considerations
### Introduction
Innovation is the lifeblood of any organization that aims to stay competitive in today's fast-paced market. However, as we venture into the realm of new product development, it is crucial to understand that innovation does not exist in a vacuum. It is intertwined with a myriad of legal considerations that can significantly impact the trajectory of product development. This chapter, "Innovation and Legal Considerations," aims to shed light on this complex interplay.
In the process of creating groundbreaking products, organizations often find themselves navigating the intricate labyrinth of intellectual property rights, patents, copyrights, and trademarks. These legal aspects serve as both a shield and a sword, protecting one's own innovations while also ensuring that the organization does not infringe upon the rights of others.
Moreover, the legal landscape is not static. It evolves with the changing technological and societal norms, making it even more critical for organizations to stay abreast of the latest legal developments. This chapter will provide a comprehensive overview of these aspects, equipping readers with the knowledge to make informed decisions during the innovation process.
In addition, we will delve into the legal implications of collaborative innovation, a common practice in today's interconnected world. We will explore how to navigate partnerships, joint ventures, and open innovation initiatives while safeguarding the organization's legal interests.
In essence, this chapter will serve as a guide to understanding the legal considerations that underpin the innovation process. It will provide a roadmap for organizations to navigate the legal landscape, ensuring that their innovative endeavors are not only groundbreaking but also legally sound.
```
Notes:
- The book is being written in the popular Markdown format.
- The context may be truncated and is meant only to provide a starting point. Feel free to expand on it or take the response in any direction that fits the prompt, but keep it in a voice that is appropriate for an advanced undergraduate course at MIT.
- Avoid making any factual claims or opinions without proper citations or context to support them, stick to the proposed context.
- Format ALL math equations with the $ and $$ delimiters to insert math expressions in TeX and LaTeX style syntax. This content is then rendered using the highly popular MathJax library. E.g. write inline math like `$y_j(n)$` and equations like `$$\Delta w = ..$$`
- If starting a new section, include `### [Section Title]`
- If starting a new subsection, include `#### [Subsection Title]`
### Response:

Here is the AI-powered completion of the above prompt:

### Section: 20.4 Legal Compliance in Innovation
Innovation, while being a catalyst for growth and development, must be conducted within the confines of the law. This is where legal compliance comes into play. Legal compliance is the process or procedure to ensure that an organization follows relevant laws, regulations, and business rules. In the context of innovation, legal compliance is crucial to ensure that the innovative process and its outcomes do not infringe upon any laws or regulations.
#### Subsection: 20.4c Implementing Legal Compliance in Innovation
Implementing legal compliance in innovation involves a series of steps that ensure the innovative process and its outcomes are in line with the relevant laws and regulations. This involves understanding the legal landscape, identifying potential legal hurdles, and taking proactive steps to address them.
The first step in implementing legal compliance in innovation is understanding the legal landscape. This involves gaining a comprehensive understanding of the laws and regulations that are relevant to the innovative process. This could include laws related to intellectual property rights, regulatory compliance, and enterprise law, among others.
The second step is identifying potential legal hurdles. This involves identifying areas where the innovative process could potentially infringe upon laws or regulations. For example, if the innovation involves the development of a new product, it is crucial to ensure that the product does not infringe upon any existing patents.
The third step is taking proactive steps to address potential legal hurdles. This could involve seeking legal advice, conducting a thorough legal review of the innovation, and implementing measures to ensure legal compliance. For example, if the innovation involves the use of a certain technology, it may be necessary to obtain a license for its use.
In addition to these steps, it is also crucial to ensure legal consistency. Legal consistency is a property that declares enterprise policies to be free of contradictions with the law. In the context of innovation, legal consistency ensures that the innovative process and its outcomes are in line with the organization's policies and do not contradict any laws or regulations.
Implementing legal compliance in innovation is not just about avoiding legal disputes or penalties. It is also about leveraging the law as a tool to protect and enhance the value of the innovation. By ensuring legal compliance, organizations can protect their innovations from potential legal disputes, enhance their value by securing intellectual property rights, and ensure their long-term sustainability.
In conclusion, legal compliance plays a crucial role in the process of innovation. By understanding the legal landscape, identifying potential legal hurdles, and taking proactive steps to address them, organizations can ensure that their innovations are legally compliant and sustainable.

This demonstration, along with the linked textbooks above, shows that the reality of AI-written textbooks is not far off. This result has ramifications that extend far beyond the interesting research result originally posed by Microsoft. It is worth stopping for a moment to reflect on these potential impacts.

AI and Education

LLMs, like GPT-4 and its predecessors, have demonstrated an uncanny ability to generate human-like text, answer questions, and even assist in complex problem-solving. In fact, roughly half of this article’s content was AI generated (with human input and guidance). LLMs' vast knowledge base, derived from extensive training on diverse datasets, allows them to provide insights on a wide range of topics. This capability is particularly beneficial in an educational context. Students, regardless of their learning pace or style, can interact with these models to get personalized explanations, delve deeper into topics of interest, or even get assistance with homework. In essence, LLMs can serve as on-demand tutors, democratizing access to quality education resources.

Recent research breakthroughs have showcased the potential of AI-generated content that rivals the quality of human-authored materials. These synthetic textbooks, curated and created by advanced AI models, can be tailored to fit specific curricula, student demographics, or even individual learner profiles. The dynamic nature of these textbooks allows for real-time updates, ensuring that the content remains current and relevant. Furthermore, they can be enriched with interactive elements, multimedia, and adaptive learning pathways, making learning more engaging and personalized. Even if textbooks begin to play a less pivotal role in direct human education, they are likely to remain as inputs to LLMs for quite some time.

However, the adoption of synthetic textbooks and LLM-assisted learning is not without challenges. Concerns about the accuracy, bias, and ethical implications of AI-generated content need to be addressed. Ensuring that these tools enhance rather than inhibit critical thinking and creativity is also paramount. The human touch, the empathy, passion, and intuition that educators bring to the classroom, remains irreplaceable. AI tools should be seen as complements, not replacements, to the traditional educational experience.

Where are we in the replication attempts?

It appears that we are able to produce high-quality textbooks, and we are continuing this effort to understand it more deeply. Moreover, our dataset has grown to over 1 billion unique and differentiated tokens. We are now running ablation pre-training studies to better understand how different datasets impact LLM learning. We are working on fine-tuning existing open source models to increase our data quality before scaling out to 20B tokens. Further, we are looking for human readers to critique and give feedback on the current state of the textbooks.

The goal is to complete a full replication of the phi-1.5 work, and likely to scale further. One such extension might be to see if it is possible to create a more competitive 7 billion parameter model.

Conclusion

In the rapidly advancing world of artificial intelligence, the potential of AI-authored textbooks promises a revolution in education. As our research and replication attempts indicate, we are on the brink of harnessing AI to produce high-quality educational content that could democratize learning across the globe. While the results are promising, a balanced approach is necessary. Embracing AI's potential should not overshadow the intrinsic value of human touch in education. As we venture into this new frontier, collaboration between humans and AI will be the key, ensuring that the resultant content is accurate, unbiased, and tailored to the diverse needs of learners worldwide.

If this interests you, we ask you to join the open source collaboration around this work. Please reach out to me or join the Discord community to get started.

Acknowledgement

I would like to thank runpod.io for their gracious support in the form of computational resources. Similarly, I owe thanks to Google Cloud Compute and OpenAI for their allocated credits.

Owen’s Substack