Weave is a flexible framework for generating and validating synthetic data across various domains. It uses large language models (LLMs) to create domain-specific datasets that can be used for training AI models, testing, and research.
Note: This project is in its very early stages and is being actively developed in public. Expect frequent changes and improvements.
GitHub Repository: https://github.com/ashikshafi08/weave.git
You can install weave directly from GitHub using pip:
    pip install git+https://github.com/ashikshafi08/weave.git
For development, you can clone the repository and install it in editable mode:
    git clone https://github.com/ashikshafi08/weave.git
    cd weave
    pip install -e .
- Modular architecture for easy extensibility
- Support for various data generators and LLM interfaces (OpenAI, Hugging Face, vLLM)
- Customizable prompt templates for different tasks
- Data validation and quality checking
- Asynchronous operations for improved performance
- Comprehensive logging for debugging and monitoring
Here's a basic example of how to use weave:
    import asyncio
    import logging

    from weave import SyntheticDataFramework, ProgrammingGenerator, OpenAIProvider


    async def main():
        # Configure logging
        logging.basicConfig(level=logging.INFO)

        # Initialize components
        data_generator = ProgrammingGenerator()
        llm_provider = OpenAIProvider(model="gpt-4o-mini", api_key="YOUR_API_KEY")

        # Create framework
        framework = SyntheticDataFramework(data_generator, llm_provider)

        # Set custom prompt templates
        framework.set_prompt_template("question_generation", "Generate a {difficulty} {language} programming question about {topic}. The answer should be: {answer}")
        framework.set_prompt_template("answer_validation", "For the {language} question: {question}\nIs this a valid answer: {proposed_answer}? Answer with Yes or No.")

        # Generate dataset
        dataset = await framework.generate_dataset(10)

        # Validate dataset
        validations = await framework.validate_dataset(dataset)

        # Evaluate dataset
        criteria = {"aspect": "code_quality", "scale": "1-10"}
        evaluations = await framework.evaluate_dataset(dataset, criteria)

        print(f"Generated {len(dataset)} samples")
        print(f"First sample: {dataset[0]}")
        print(f"First validation: {validations[0]}")
        print(f"First evaluation: {evaluations[0]}")


    if __name__ == "__main__":
        asyncio.run(main())
- weave/core/: Contains the core framework classes
- weave/generators/: Data generator implementations
- weave/llm_interfaces/: LLM interface implementations (OpenAI, Hugging Face, vLLM)
- weave/prompts/: Prompt management and templates
- weave/config/: Configuration files
- weave/examples/: Usage examples
To add a new data generator:
- Create a new file in the weave/generators/ directory.
- Implement a class that inherits from DataGenerator.
- Override the generate() and get_supported_types() methods.
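The steps above can be sketched as follows. The exact DataGenerator interface is not shown in this README, so the block defines a stand-in base class with the two methods named above. The return types, the MathGenerator class, and its parameters are assumptions made for illustration; in the real package you would inherit from weave's own DataGenerator instead.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple


class DataGenerator(ABC):
    """Stand-in for weave's DataGenerator base class (interface assumed)."""

    @abstractmethod
    def generate(self) -> Tuple[Any, Dict[str, Any]]:
        """Return one (answer, context) pair."""

    @abstractmethod
    def get_supported_types(self) -> List[str]:
        """Return the kinds of data this generator can produce."""


class MathGenerator(DataGenerator):
    """Hypothetical generator that produces simple addition problems."""

    def __init__(self, max_operand: int = 10, seed: Optional[int] = None) -> None:
        self.max_operand = max_operand
        self._rng = random.Random(seed)  # seedable for reproducible datasets

    def generate(self) -> Tuple[int, Dict[str, Any]]:
        a = self._rng.randint(1, self.max_operand)
        b = self._rng.randint(1, self.max_operand)
        # The answer is computed locally; the context is what an LLM
        # provider could use to phrase a question leading to that answer.
        context = {"operation": "addition", "operands": [a, b]}
        return a + b, context

    def get_supported_types(self) -> List[str]:
        return ["addition"]
```

Keeping the ground-truth answer in the generator, rather than asking the LLM for it, means validation can compare against a value that is known to be correct.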
To add a new LLM provider:
- Create a new file in the weave/llm_interfaces/ directory.
- Implement a class that inherits from BaseLLMProvider.
- Override the required methods such as generate_question(), validate_answer(), evaluate(), etc.
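As a rough illustration of these steps, the sketch below implements an offline provider that makes no network calls, which can be handy for tests. The method names come from the list above, but their parameters, return types, and the stand-in BaseLLMProvider class are assumptions; check the real base class in weave/llm_interfaces/ for the actual signatures.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseLLMProvider(ABC):
    """Stand-in for weave's BaseLLMProvider (signatures assumed)."""

    @abstractmethod
    async def generate_question(self, answer: Any, context: Dict[str, Any]) -> str:
        """Write a question whose correct answer is `answer`."""

    @abstractmethod
    async def validate_answer(
        self, question: str, proposed_answer: Any, correct_answer: Any
    ) -> bool:
        """Decide whether `proposed_answer` is acceptable."""

    @abstractmethod
    async def evaluate(
        self, data: Dict[str, Any], criteria: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Score one sample against the given criteria."""


class EchoProvider(BaseLLMProvider):
    """Hypothetical offline provider; a real one would call an LLM API."""

    async def generate_question(self, answer, context):
        return f"Which value matches {context}? (expected: {answer})"

    async def validate_answer(self, question, proposed_answer, correct_answer):
        # Compare as whitespace-trimmed strings so 4 and " 4" both match.
        return str(proposed_answer).strip() == str(correct_answer).strip()

    async def evaluate(self, data, criteria):
        # No model is available offline, so no score is produced.
        return {"aspect": criteria.get("aspect"), "score": None}


async def demo() -> bool:
    provider = EchoProvider()
    question = await provider.generate_question(4, {"topic": "addition"})
    return await provider.validate_answer(question, " 4", 4)
```

Running `asyncio.run(demo())` exercises the question and validation path end to end without an API key.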
Use the set_prompt_template() method of the SyntheticDataFramework or LLM provider to customize prompts for different tasks.
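To make the placeholder syntax concrete, here is a minimal sketch of how a template such as the question_generation one from the usage example might be filled in. It assumes the placeholders behave like Python str.format fields; PromptStore and its render() method are hypothetical stand-ins, not weave's actual prompt manager.

```python
from typing import Dict


class PromptStore:
    """Hypothetical stand-in for the framework's template handling."""

    def __init__(self) -> None:
        self._templates: Dict[str, str] = {}

    def set_prompt_template(self, task: str, template: str) -> None:
        self._templates[task] = template

    def render(self, task: str, **values: str) -> str:
        # str.format raises KeyError when a placeholder has no value,
        # which surfaces template/value mismatches early.
        return self._templates[task].format(**values)


store = PromptStore()
store.set_prompt_template(
    "question_generation",
    "Generate a {difficulty} {language} programming question about {topic}. "
    "The answer should be: {answer}",
)
prompt = store.render(
    "question_generation",
    difficulty="medium",
    language="python",
    topic="list slicing",
    answer="[2, 3]",
)
print(prompt)
```

Under this assumption, every placeholder named in a template must be supplied at render time, so custom templates should only use fields the generator's context actually provides.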
To see the rough plans for future development and features, check out our roadmap. This is not set in stone and is subject to change as we receive feedback and decide what features are most important.
The config/config.yaml file allows you to set up your data generator, LLM provider, and other framework parameters. Here's an example:
    data_generator:
      type: "ProgrammingGenerator"
      params:
        languages: ["python", "javascript", "java"]
        difficulties: ["easy", "medium", "hard"]
    llm_provider:
      type: "OpenAIProvider"
      params:
        model: "gpt-4o-mini"
        api_key: "YOUR_API_KEY"
    framework:
      num_samples: 100
      logging_level: "INFO"
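Since the example config contains an API key field, one option is to keep the placeholder in the file and supply the real key through the environment. The sketch below works on the dictionary a YAML parser would produce for the config above. The provider_params helper is hypothetical, and the use of the OPENAI_API_KEY variable is an assumption for illustration, not documented weave behavior.

```python
import os
from typing import Any, Dict, Mapping

# Dictionary equivalent of the example config above, as a YAML parser
# would return it (the nesting is assumed from the example).
CONFIG: Dict[str, Any] = {
    "data_generator": {
        "type": "ProgrammingGenerator",
        "params": {
            "languages": ["python", "javascript", "java"],
            "difficulties": ["easy", "medium", "hard"],
        },
    },
    "llm_provider": {
        "type": "OpenAIProvider",
        "params": {"model": "gpt-4o-mini", "api_key": "YOUR_API_KEY"},
    },
    "framework": {"num_samples": 100, "logging_level": "INFO"},
}


def provider_params(
    config: Mapping[str, Any], env: Mapping[str, str] = os.environ
) -> Dict[str, Any]:
    """Return LLM provider params, preferring an API key from the environment."""
    # Copy so the loaded config is never mutated.
    params = dict(config["llm_provider"]["params"])
    key = env.get("OPENAI_API_KEY")
    if key:
        params["api_key"] = key
    return params
```

The resulting dictionary could then be unpacked into the provider constructor, for example `OpenAIProvider(**provider_params(CONFIG))`, which keeps real credentials out of version control.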
As this project is in its early stages, contributions, suggestions, and feedback are highly welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
This project is under active development. APIs may change, and features may be added or removed. It's a learning project and is not intended for production use as of now.