Generating Synthetic Data with SciPhi
SciPhi is a simple framework for generating synthetic data, fine-tuning data, and more
Modern applications increasingly rely on synthetic data to train models, test hypotheses, or even mock real-world scenarios. In the quest to break the scaling laws of Large Language Models (LLMs), recent research initiatives such as "Textbooks Are All You Need" by Microsoft Research have been pioneering.
This latest study underscores the capability of smaller Transformer-based models to produce coherent English, and highlights the use of existing LLMs to generate "textbook quality" data, enhancing the learning process in comparison to traditional web data.
Embracing this approach, the SciPhi framework is designed to generate LLM-mediated synthetic data, facilitating the creation of open source versions of models like phi-1.5.
We predict a burgeoning ecosystem of new models that will not only perform on par with much larger models in tasks like common sense reasoning but also exhibit advanced capabilities, such as step-by-step thinking and rudimentary in-context learning. Furthermore, we expect model specialization through distillation to become increasingly important over time. The goal of SciPhi is to remain open source while expanding to accommodate all of these use cases.
Installation & Setup
1. Clone the Repository
Run the following commands in your local terminal:
git clone https://github.com/emrgnt-cmplxty/sciphi.git sciphi
cd sciphi
2. Install Dependencies
SciPhi uses the Poetry package manager to ensure a seamless installation process:
poetry install -E openai_support
For additional features, like support for Anthropic or HuggingFace, you can install optional dependencies as needed.
3. Environment Setup
Copy the example environment file and modify it according to your needs:
cp .env.example .env && vim .env
If you plan on using OpenAI, paste your private key into the `.env` file. The same goes for Anthropic or any private HuggingFace model.
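For reference, tooling like this typically reads these values as environment variables. The snippet below is an illustrative sketch rather than SciPhi's actual code, and `OPENAI_API_KEY` is the conventional name used by the openai client, not necessarily the exact name SciPhi expects; check `.env.example` for the real variable names:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Load key-value pairs from .env into the process environment.
load_dotenv()

# Hypothetical variable name; use the names defined in .env.example.
openai_key = os.environ.get("OPENAI_API_KEY")
```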
Generating Your First Dataset
The power of SciPhi lies in its flexibility and ease of use. To start, let's focus on the core of the data generation process: the config.
1. Inspecting the main config for `Textbooks Are All You Need`.
Open the main YAML configuration file:
vim sciphi/data/stock_config/textbooks_are_all_you_need/main.yaml
Opening this file, you'll find fields including:
- `prompt_templates` - Various generation prompt templates
- `prompt_template_inputs` - Inputs expected across prompt templates
- `prompt_template_input_dependencies` - Inputs with dependencies on prior inputs
- `context` - Context modifiers used in the prompt templates
- `example_style` - The style of the example the LLM is prompted to create
All these parameters are parsed by the `DataConfigLoader`, which we will cover in detail later. These fields allow the user to customize the dataset's output.
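To make the overall shape concrete before diving in, here is a minimal sketch of how these fields fit together, written as a Python dictionary rather than YAML. Every value below is an illustrative placeholder, not the actual contents of `main.yaml`:

```python
# Illustrative placeholders only -- the real values live in main.yaml.
config_sketch = {
    # Generation templates mapped to relative sampling weights.
    "prompt_templates": {
        "In the context of {course_name} ... showcase {example_style}?": 10,
    },
    # Inputs expected across prompt templates.
    "prompt_template_inputs": ["course_name", "course_topic", "sub_topic"],
    # Inputs sampled as a function of a previously sampled input.
    "prompt_template_input_dependencies": {"sub_topic": "course_topic"},
    # Context modifiers inserted into the prompt template.
    "context": [
        "a description of this topic",
        "a description of a related topic",
    ],
    # The style of example the LLM is prompted to create.
    "example_style": ["an example of it with Python code"],
}
```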
Let's dive deeper into the `prompt_templates` field, which is central to synthetic data construction.
The `prompt_templates` field defines templates used to construct the generation prompts for the LLM. The numbers next to each template are relative weights, determining how often each template is sampled.
The template includes various formatting variables, like `{course_name}`, that need to be filled with values randomly sampled according to other fields in the configuration file.
For instance, `context` could specify a modifier on the selected `example_style`. The first entry might read "a description of this topic", followed by other unique context suggestions. Some of these contexts might even read "a description of a related topic" to increase variation in the output dataset. All of this machinery exists so that the data generation process can scale to a large number of distinct samples.
Let's explicitly write out the first entry we see in `prompt_templates`:
"In the context of {course_name} where the focus is on {course_topic}, the concept of {sub_topic} becomes essential. Can you provide {context} and then showcase {example_style}?"
To form a complete sample-generation instruction, all of the required formatting variables must be sampled and inserted into a selected prompt template.
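As a rough sketch of what that selection step involves (illustrative logic, not SciPhi's actual implementation), the weighted template choice can be expressed with the standard library; the second template below is purely hypothetical:

```python
import random

# Templates mapped to relative weights, as in the config file.
prompt_templates = {
    "In the context of {course_name} where the focus is on {course_topic}, "
    "the concept of {sub_topic} becomes essential. Can you provide {context} "
    "and then showcase {example_style}?": 10,
    "Within {course_name}, briefly explain {sub_topic} and give "
    "{example_style}.": 1,  # hypothetical second template
}

# Pick one template with probability proportional to its weight.
templates, weights = zip(*prompt_templates.items())
selected_template = random.choices(templates, weights=weights, k=1)[0]
```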
2. Inspecting a related course configuration.
vim sciphi/data/stock_config/textbooks_are_all_you_need/basic_python.yaml
Here, you'll see three unique fields, `course_name`, `course_topic`, and `sub_topic`, which appear in the prompt template we examined earlier.
The sub-topic is keyed on the course topic: at prompt-randomization time, the sub-topic is sampled as a function of the course topic.
Let’s once more sample the first example:
- `course_name` - Basic Python
- `course_topic` - Python Syntax
- `sub_topic(Basic Python)` - for loops
Inserting these values into our earlier prompt template, we get a complete generation prompt:
"In the context of Basic Python where the focus is on Python Syntax, the concept of for loops becomes essential. Can you provide a description of this topic and then showcase an example of it with Python code?"
3. Generating our first sample
Use the `runner.py` script to initiate the data generation process; various command-line arguments let you customize the generation:
poetry run python sciphi/examples/data_generation/runner.py --provider_name=openai --model_name=gpt-4-0613 --num_samples=1 --batch_size=1 --output_file_name=test.jsonl --example_config=textbooks_are_all_you_need --log_level=DEBUG --max_tokens_to_sample=1024
This example command generates a single sample from the `textbooks_are_all_you_need` config, with GPT-4 providing completions. The extra flag `--log_level=DEBUG` is included to show additional debug printouts.
A successful run will log the generated sample in this debug output.
4. Inspecting your first sample
The default output directory is `outputs`; let's check our output there:
vim outputs/openai/gpt_4_0613/test.jsonl
Indeed, we see the generation prompt from earlier and its completion stored in a single entry, keyed on `formatted_prompt` and `completion`, respectively.
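If you prefer to inspect the file programmatically, here is a quick sketch (assuming the two key names just mentioned):

```python
import json

# Each line of the .jsonl output is one generated sample.
with open("outputs/openai/gpt_4_0613/test.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["formatted_prompt"])
        print(entry["completion"])
```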
5. Modifying generation
To test your understanding of the data generation pipeline, modify the weight of `intermediate_mathematics` in `main.yaml` to read `1000000000` and rerun the command. You should now see a sample related to an intermediate mathematics problem.
This simple example demonstrates that the framework can robustly support user customization for dataset curricula. In the coming weeks, we'll expand available features based on community feedback.
6. Bonus - Understanding the DataConfig loader
vim sciphi/config/config.py
The `DataConfig` class defined in this file is responsible for managing the configurations for synthetic data generation within the SciPhi framework.
- Main Role: Reads and processes the primary YAML configuration file for data generation settings.
- Initialization: Loads the main configuration and extracts various settings, like prompt templates and input methods.
- Sub-Configurations: Integrates sub-configurations into the main config for nuanced data generation setups; these sub-configurations can be weighted.
- Error Handling: Catches inconsistencies in the configurations to ensure reliable data generation.
In the broader context of the SciPhi framework, the `DataConfig` class ensures that synthetic data generation is guided by precise and customizable configurations, allowing users to generate data tailored to their research and application needs.
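For intuition only, a stripped-down loader in this spirit might look like the sketch below. This is not the actual `DataConfig` code; it merely illustrates the load-the-main-config, merge-weighted-sub-configs pattern described above:

```python
import yaml  # pip install pyyaml


class MiniDataConfig:
    """Toy illustration of the pattern; not SciPhi's DataConfig."""

    def __init__(self, main_config_path: str) -> None:
        with open(main_config_path) as f:
            self.config = yaml.safe_load(f)
        # Extract core settings, failing loudly on inconsistent configs.
        self.prompt_templates = self.config.get("prompt_templates", {})
        if not self.prompt_templates:
            raise ValueError("Config must define at least one prompt template.")

    def merge_sub_config(self, sub_config_path: str, weight: float = 1.0) -> None:
        """Fold a weighted sub-configuration into the main config."""
        with open(sub_config_path) as f:
            sub_config = yaml.safe_load(f)
        self.config.setdefault("sub_configs", []).append(
            {"weight": weight, "config": sub_config}
        )
```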
Conclusion
Synthetic data generation has never been easier. With SciPhi, you can leverage the power of LLMs with ease and flexibility. Whether you're training new models, fine-tuning existing ones, or generating evaluation data, SciPhi can be configured to handle it.
For researchers, developers, and data enthusiasts alike, SciPhi aims to offer a robust solution to your synthetic data needs.