Grounding Synthetic Data with SciPhi+RAG
This tutorial shows how SciPhi can be combined with Retrieval Augmented Generation (RAG) to ground synthetic data in truthful source material.
SciPhi is a novel open-source framework currently under development, aiming to streamline the creation of high-quality synthetic datasets. In this context, a synthetic dataset is one generated partially or entirely by a Large Language Model (LLM). Recent studies have showcased the compelling results that such datasets can achieve in pre-training new, smaller LLMs, such as phi-1.5, in fine-tuning existing LLMs, and in their evaluation. SciPhi is dedicated to accelerating this progress by establishing a robust pipeline for synthetic data generation.
Before we proceed, here are a few helpful resources:
- For a deeper understanding of how SciPhi works, refer to this introductory Substack post.
- To view an example output dataset, check out HuggingFace here.
This tutorial will assume a baseline understanding of Retrieval Augmented Generation (RAG). LlamaIndex has a great deep dive on this topic here.
Why ground synthetic data?
One issue we naturally face in creating synthetic data from LLM outputs is verifying the quality of what the model produces. LLMs are known to “hallucinate”, that is, to produce incorrect or wholly fabricated responses that are presented as fact.
The motivation for combining RAG with a synthetic data generation pipeline is to ground the LLM output in truthful data, thereby increasing the accuracy and overall quality of the output data.
Grounding Strategies
There are two simple but somewhat different strategies we can use to ground synthetic data:
Strategy A: Generate a question -> Lookup truth info -> LLM generates answer
Strategy B: Lookup truth info -> Generate a question -> LLM generates answer
Strategy A is more flexible for generating varied datasets, while Strategy B may be more suitable when starting with a comprehensive seed dataset. This guide explains a simple implementation of Strategy A (sketched just below), but implementing Strategy B also holds potential for enhanced grounding of synthetic data. Most likely, the best data generation practices will incorporate both strategies in their pipeline, and Strategy B is something we are working on.
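To make the two flows concrete, here is a minimal sketch. The helper functions (`generate_question`, `retrieve_context`, `generate_answer`) are hypothetical placeholders for an LLM call or a retrieval step, not part of the SciPhi API:

```python
# Hypothetical stand-ins: in a real pipeline each would wrap an LLM call or a RAG lookup.
def generate_question(source: str) -> str:
    return f"Question derived from: {source}"  # placeholder for an LLM call

def retrieve_context(query: str) -> str:
    return f"Passages retrieved for: {query}"  # placeholder for a retrieval step

def generate_answer(question: str, context: str) -> str:
    return f"Answer to '{question}', grounded in: {context}"  # placeholder for an LLM call

def strategy_a(topic: str) -> tuple[str, str]:
    """Strategy A: generate a question, then look up truth, then answer."""
    question = generate_question(topic)
    context = retrieve_context(question)
    return question, generate_answer(question, context)

def strategy_b(seed_passage: str) -> tuple[str, str]:
    """Strategy B: start from a trusted passage, then derive the question and answer from it."""
    question = generate_question(seed_passage)
    return question, generate_answer(question, seed_passage)
```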
Implementation
We decided to use an out-of-the-box open-source solution rather than roll our own. In this instance, we didn’t think designing a custom pipeline was worth the effort because LlamaIndex already offers a reasonable solution. This choice may change as the use cases and constraints of this problem become better understood, but for now it allows us to quickly implement a robust solution.
- Integration: LlamaIndex is seamlessly integrated into SciPhi through specific configurations.
- Index Creation/Loading: An index is either created or loaded from available sources such as PDF textbooks.
- Query Engine: A query engine utilizes the index to ground synthetic data in real-world information.
- Data Generation: SciPhi generates synthetic data, which is then grounded through the LlamaIndex query engine.
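To make the index and query-engine steps concrete, here is a minimal sketch using the `llama_index` Python package. The import paths reflect older 0.x releases of the library, and the directory paths below are illustrative; SciPhi's own configuration handles these details:

```python
import os

from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./index_storage"  # illustrative location for the persisted index

if os.path.exists(PERSIST_DIR):
    # Load a previously built index from disk.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # Build a fresh index from source documents (e.g. PDF textbooks).
    documents = SimpleDirectoryReader("./textbooks").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

# The query engine is what grounds synthetic data in the indexed material.
query_engine = index.as_query_engine()
response = query_engine.query("What is 'Subset Sum (from Vertex Cover)'?")
print(response)
```

Persisting the index avoids re-embedding the source documents on every run, which matters when the grounding corpus is a full textbook.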
We chose to wrap the LlamaIndex query engine in our local LLM abstraction. This choice is convenient because a generic interface can be constructed that allows us to treat LlamaIndex as we do any other completion provider, as sketched below.
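A minimal sketch of what such a wrapper could look like; the class names and the `get_completion` method here are hypothetical illustrations, not SciPhi's actual interface:

```python
class CompletionProvider:
    """Hypothetical generic interface; SciPhi's actual abstraction may differ."""

    def get_completion(self, prompt: str) -> str:
        raise NotImplementedError


class LlamaIndexProvider(CompletionProvider):
    """Wraps a LlamaIndex query engine so it looks like any other completion provider."""

    def __init__(self, query_engine):
        self.query_engine = query_engine

    def get_completion(self, prompt: str) -> str:
        # The "completion" is the grounded answer returned by the RAG query.
        return str(self.query_engine.query(prompt))
```

Downstream generation code can then call `get_completion` without caring whether it is talking to a raw LLM or to a RAG pipeline.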
Execution Example
An example execution command demonstrates the integration of LlamaIndex with SciPhi:
```bash
poetry run python sciphi/examples/basic_data_gen/runner.py \
  --provider_name=llama-index \
  --log_level=DEBUG \
  --batch_size=1 \
  --num_samples=1 \
  --output_file_name=example_output.jsonl \
  --example_config=textbooks_are_all_you_need \
  --model=gpt-3.5-turbo-0613
```
This command initiates the data generation process for a single sample, grounding the synthetic data with LlamaIndex and producing results that are promising for achieving high-quality, accurate synthetic datasets. The `runner` script defaults to using a PDF in the remote repository corresponding to the book `Algorithms` by Jeff Erickson, which is distributed under the Creative Commons Attribution 4.0 International license.
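If you want to inspect the generated samples, the output is a JSONL file, so each line is a standalone JSON record. A small sketch for peeking at it follows; the file name matches the `--output_file_name` flag above, while the exact location and record fields depend on the runner's configuration:

```python
import json

# Each line of the output file is one generated sample.
with open("example_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Field names depend on the example config, so just show what is present.
        print(sorted(record.keys()))
```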
It is worth mentioning that grounding does not appear to add much to the pipeline shown above; this indicates that we should construct our pipeline with the grounding source in mind.
Studying with a synthetic prompt
Let’s purposefully ask a difficult question whose answer is contained inside the book referenced above. For this use case, we flipped to the end of the book and selected a relevant section header, ‘12.12 Subset Sum (from Vertex Cover)’, which is pictured below.
Performing a grounded query with LlamaIndex, we obtain the following result:
Response A (grounded, gpt-3.5-turbo-0613)
The "Subset Sum (from Vertex Cover)" is a reduction that demonstrates the NP-hardness of the SubsetSum problem. It involves assigning values to the vertices and edges of a graph in such a way that finding a vertex cover of a specific size in the graph is equivalent to finding a subset of these assigned values that adds up to a given target value. By using this reduction, it can be shown that if there is a vertex cover of size k in the graph, then there exists a subset of the assigned values that sums to the target value. Conversely, if there is a subset of the assigned values that sums to the target value, then there is a vertex cover of size k in the graph. Therefore, this reduction proves that SubsetSum is NP-hard.
Response B (ungrounded, ChatGPT)
The term "Subset Sum (from Vertex Cover)" appears to be a combination of two well-known computational problems in computer science: the Subset Sum problem and the Vertex Cover problem. Let's briefly describe each of these problems before discussing their potential relationship.
1. Subset Sum Problem:
- The Subset Sum problem is a classic NP-complete problem in computer science and mathematics.
- In this problem, you are given a set of positive integers and a target sum.
- The goal is to determine whether there exists a subset of the given set of integers whose elements sum up to the target sum.
- It can be formally defined as follows: Given a set of positive integers S = {a1, a2, ..., an} and a target sum T, find a subset S' ⊆ S such that the sum of the elements in S' equals T.
2. Vertex Cover Problem:
- The Vertex Cover problem is another well-studied NP-complete problem in graph theory.
- In this problem, you are given an undirected graph, and the objective is to find the smallest possible set of vertices (nodes) such that every edge in the graph is incident to at least one vertex in the set. This set is called the "vertex cover."
- Formally, given a graph G = (V, E), find a subset V' ⊆ V such that for every edge (u, v) in E, either u or v (or both) belongs to V'.
Now, when you mention "Subset Sum (from Vertex Cover)," it's possible that you're referring to a variant of one of these problems that incorporates elements of the other. However, it's not immediately clear how these two problems would be directly related in a standard computational context, as they address different types of computational tasks.
It's possible that you are referring to a specialized problem or algorithm that combines elements of both problems, but without more context or a specific problem statement, it's challenging to provide a detailed explanation. If you have a specific question or a more detailed description of the problem you're interested in, please provide additional information, and I'll do my best to assist you.
Conclusion:
Looking at the details above, Response A gets straight to the point and explains the "Subset Sum (from Vertex Cover)" reduction accurately, making it the better answer. Response B, on the other hand, gives general background but misses how the two problems are actually connected, so it is less helpful. In short, Response A is the clear winner for giving the right information on this tricky topic.
Summary:
SciPhi is an open-source framework designed to optimize the generation of high-quality datasets, with a new focus on grounding synthetic data to enhance accuracy. By integrating innovative strategies and leveraging existing solutions, SciPhi is beginning to demonstrate promising results in producing reliable, quality datasets for pre-training and fine-tuning Large Language Models (LLMs), such as phi-1.5. This initiative sheds light on the potential and significance of grounded synthetic data in advancing the field.