SciPhi [ΨΦ]: A framework for breaking LLM scaling laws

Overview

SciPhi is a Python package that provides two high-level features:

Configurable generation of LLM-mediated synthetic training / tuning data for LLMs.
Seamless LLM-mediated evaluation of model output.

Questions?

Join us on Discord here or contact me directly. For a SciPhi tutorial, go here.

Installation

# Repository setup
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi
# Install dependencies
# pip3 install poetry (if you don't have it)
poetry install -E all
# Setup your environment
cp .env.example .env && vim .env

Requirements

Python >= 3.11 and < 3.12
Poetry for package management
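
The version constraint above can be checked before installing. The following is a minimal sketch, not part of SciPhi; the helper name `python_version_supported` is hypothetical.

```python
import sys


def python_version_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is >= 3.11 and < 3.12."""
    # Compare only (major, minor) against the documented bounds.
    return (3, 11) <= tuple(version_info[:2]) < (3, 12)


if not python_version_supported():
    found = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {found} detected; the requirements above call for 3.11.x")
```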

Optional Feature Requirements

For additional features, you can install the optional dependencies:

poetry install -E <extra_name>

anthropic_support: For running with Anthropic models.
hf_support: For running with the HuggingFace package, which gives access to a wide variety of models.
openai_support: For running with OpenAI models.
vllm_support: For running with vLLM, useful for fast inference.
llama_index_support: For LlamaIndex, useful for grounded synthesis.
chroma_support: For Chroma support, used for large vector databases.
all: For all dependencies (except vLLM, which requires a separate install).
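
To see which optional features are usable in the current environment, one option is to probe for the modules each extra is expected to install. This is a rough sketch: the mapping from extra name to importable module is an assumption based on the usual package names, not something taken from SciPhi's pyproject.toml, and `available_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# ASSUMPTION: top-level module that each extra is expected to provide.
EXTRA_TO_MODULE = {
    "anthropic_support": "anthropic",
    "hf_support": "transformers",
    "openai_support": "openai",
    "vllm_support": "vllm",
    "llama_index_support": "llama_index",
    "chroma_support": "chromadb",
}


def available_extras(mapping=EXTRA_TO_MODULE):
    """Return the extras whose assumed module is found on the import path."""
    return sorted(
        extra for extra, module in mapping.items() if find_spec(module) is not None
    )


print("Extras that appear to be installed:", available_extras())
```

Any extra missing from the printed list can then be added with `poetry install -E <extra_name>`.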

Usage

Dataset Generation

You can use SciPhi for dataset generation by executing the relevant runner.py file with various command-line arguments.

poetry run python sciphi/examples/data_generation/runner.py --provider_name=openai --model_name=gpt-4 --log_level=DEBUG --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need
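
After a run like the one above, the output file can be inspected with a few lines of Python. This sketch assumes the file holds one JSON object per line with the keys `formatted_prompt` and `completion`, as in the Development snippet in this README; `read_jsonl` is a hypothetical helper, not a SciPhi function.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for record in read_jsonl("example_output.jsonl"): print(record["completion"])` would print each completion, with the path adjusted to wherever the run wrote its output.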

Key Command-Line Arguments

--provider_name: Which provider to use for completions (default: "openai").
--model_name: The name of the model to load from the provider (default: "gpt-3.5-turbo").
--temperature: Temperature parameter for the provided model (default: 0.7).
--example_config: Which example configuration to use (default: "textbooks_are_all_you_need").
--override_config_path: Used to override the example configurations with custom config.
--num_samples: Number of samples to generate (default: 1_024).
--output_dir: Directory path to override the default output directory with.
--output_file_name: Filename to override the default output file name with.
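
The flags and defaults listed above can be mirrored with argparse as follows. This is an illustrative sketch, not the parser in runner.py: the types and the `None` defaults for flags without a documented default are assumptions, and the provider flag is spelled `--provider_name` to match the example command.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (a sketch only)."""
    parser = argparse.ArgumentParser(description="Data generation flags")
    parser.add_argument("--provider_name", default="openai")
    parser.add_argument("--model_name", default="gpt-3.5-turbo")
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--example_config", default="textbooks_are_all_you_need")
    # ASSUMPTION: flags without a documented default fall back to None here.
    parser.add_argument("--override_config_path", default=None)
    parser.add_argument("--num_samples", type=int, default=1_024)
    parser.add_argument("--output_dir", default=None)
    parser.add_argument("--output_file_name", default=None)
    return parser


args = build_parser().parse_args(["--model_name=gpt-4", "--num_samples=8"])
print(args.provider_name, args.model_name, args.num_samples)
```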

Stock data configs

evol_instruct - A config for replicating the EvolInstruct dataset
textbooks_are_all_you_need - A config for replicating the Python textbook data from Textbooks Are All You Need [2]

Example generated data

Development

The code snippet below shows how to use SciPhi to generate synthetic data for a given LLM provider.

# Build an LLM and provider interface
llm_config = LLMConfigManager.get_config_for_provider(
    provider_name
).create(**build_llm_config(args))
llm_provider = InterfaceManager.get_provider(
    provider_name,
    model_name,
    llm_config,
)

# Initialize the data maker
data_maker = DataMaker(
    DataGeneratorMode(data_config.generator_mode),
    prompt_generator,
    prompt,
    # Optional field,
    # currently only used when generator_mode == "from_hf_dataset"
    dataset_name=data_config.dataset_name,
)

# Generate & write out the results
output_path = get_output_path(args)
logger.debug(f"Writing results to: {output_path}.")
writer = JsonlDataWriter(output_path)

for batch in data_maker.generator(args.batch_size, args.num_samples):
    completions = llm_provider.get_batch_completion(batch)
    for formatted_prompt, completion in zip(batch, completions):
        logger.debug("-" * 100)
        logger.debug(f"Formatted Prompt:\n{formatted_prompt}")
        logger.debug(f"\nCompletion:\n{completion}")
        logger.debug("-" * 100)

        # Write the results using DataWriter
        writer.write(
            [
                {
                    "formatted_prompt": formatted_prompt,
                    "completion": completion,
                }
            ]
        )
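
The loop above consumes prompts in batches of `batch_size` until `num_samples` have been produced. How `DataMaker.generator` does this internally is not shown here, so the following is only a sketch of one plausible batching scheme under that reading; `batched_prompts` is a hypothetical stand-in.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_prompts(
    prompts: Iterable[str], batch_size: int, num_samples: int
) -> Iterator[List[str]]:
    """Yield lists of at most batch_size prompts, num_samples in total."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Cap the stream at num_samples, then cut it into batches.
    limited = islice(prompts, num_samples)
    while True:
        batch = list(islice(limited, batch_size))
        if not batch:
            return
        yield batch


for batch in batched_prompts((f"prompt {i}" for i in range(10)), 3, 7):
    print(batch)
```

With this scheme the final batch may be shorter than `batch_size` when `num_samples` is not a multiple of it.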

License

This project is licensed under the Apache-2.0 License.

Datasets Generated

[1] Python Synthetic Textbooks

Sources

[1] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

[2] Textbooks Are All You Need

📖 Citation

Reference to cite if you use SciPhi in a paper:

@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}
