A toolkit for building validated datasets.
Dataset Foundry uses the concept of data pipelines to load, generate, or validate datasets. A pipeline is a sequence of actions executed against either the dataset as a whole or the individual items within it.
For details on which actions are supported, see the actions documentation.
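To make the pipeline concept concrete, here is a minimal, self-contained sketch of the idea. This is illustrative Python only, not Dataset Foundry's actual API:

```python
# Illustrative only: a toy pipeline with the two kinds of actions
# described above. Not Dataset Foundry's real API.
dataset = [
    {"id": 1, "code": "def add(a, b): return a + b"},
    {"id": 2, "code": "def sub(a, b): return a - b"},
]

def validate_dataset(items):
    """Dataset-level action: runs once against the whole dataset."""
    assert all("code" in item for item in items), "every item needs code"
    return items

def measure_item(item):
    """Item-level action: runs once per item."""
    return {**item, "length": len(item["code"])}

# A pipeline is just a sequence of actions applied in order.
pipeline = [
    validate_dataset,
    lambda items: [measure_item(item) for item in items],
]

for action in pipeline:
    dataset = action(dataset)

print(dataset)
```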
- Clone the repository
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install the package:

  ```bash
  pip install -e .
  ```
- Create a `.env` file in the project root with your OpenAI API key:

  ```
  OPENAI_API_KEY=your_api_key_here
  ```
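To sanity-check that the key is picked up, here is a minimal sketch that assumes the environment is loaded with python-dotenv; this README does not confirm which loader Dataset Foundry itself uses:

```python
# Minimal sketch: verify the .env key is readable. Assumes python-dotenv
# (`pip install python-dotenv`); whether Dataset Foundry uses this loader
# is an assumption, not something stated in this README.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY missing from .env"
print("OPENAI_API_KEY loaded")
```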
- Default Model: gpt-4o-mini
- Default Temperature: 0.7
- Default Number of Samples: 10
- Dataset Directory: ./datasets
- Logs Directory: ./logs
```
dataset-foundry/
├── src/
│   └── dataset_foundry/
│       ├── actions/     # Actions for processing datasets and items within those datasets
│       ├── cli/         # Command-line interface tools
│       ├── core/        # Core functionality
│       └── utils/       # Utility functions
├── datasets/            # Generated datasets
├── examples/            # Example pipelines
│   └── refactorable_code/  # Example pipelines to build a dataset of code requiring refactoring
└── logs/                # Operation logs
```
Pipelines can be run from the command line using the `dataset-foundry` command:

```bash
dataset-foundry <pipeline_module> <dataset_name>
```
For example, to run the `generate_spec` pipeline to create specs for a dataset saved to `datasets/dataset1`, you would use:

```bash
dataset-foundry examples/refactorable_code/generate_spec/pipeline.py dataset1
```
Use `dataset-foundry --help` to see available arguments.
To generate a set of specs for a dataset named `samples` (here limited to two samples), you would use:

```bash
dataset-foundry examples/refactorable_code/generate_spec/pipeline.py samples --num-samples=2
```
To generate a set of functions and unit tests from the specs, you would use:

```bash
dataset-foundry examples/refactorable_code/generate_all_from_spec/pipeline.py samples
```
To run the unit tests for the generated functions, you would use:

```bash
dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples
```
If some of the unit tests fail, you can regenerate them by running:

```bash
dataset-foundry examples/refactorable_code/regenerate_unit_tests/pipeline.py samples
```
Variable substitution allows you to use variables in your prompts and in certain parameters passed to pipeline actions.
Prompt templates and certain parameters are parsed as f-strings, with the following enhancements:
- Dotted references are supported and resolve both dictionary keys and object attributes. For instance, `{spec.name}` will return the value of `spec['name']` if `spec` is a dictionary, or the value of `spec.name` if `spec` is an object.
- Formatters can be specified after a colon. For example, `{spec:yaml}` will return the `spec` object formatted as a YAML string. Supported formatters include: `yaml`, `json`, `upper`, `lower`.
For instance, if an item is being processed with an `id` of `123` and a `spec` dictionary with a `name` key of `my_function`, the following will save the `code` property of the item as a file named `item_123_my_function.py`:

```python
...
save_item(contents=Key("code"), filename="item_{id}_{spec.name}.py"),
...
```
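For readers curious how substitutions like these can work, here is a self-contained toy version of the mechanism (dotted references plus formatters). It is an illustration, not Dataset Foundry's implementation, and it omits the `yaml` formatter to stay dependency-free:

```python
# Illustrative only: a toy re-implementation of the substitution rules
# described above. Not Dataset Foundry's actual code.
import json
import re

def resolve(obj, dotted):
    """Walk a dotted path, trying dict keys first, then attributes."""
    for part in dotted.split("."):
        obj = obj[part] if isinstance(obj, dict) else getattr(obj, part)
    return obj

FORMATTERS = {
    "json": lambda v: json.dumps(v, indent=2),
    "upper": lambda v: str(v).upper(),
    "lower": lambda v: str(v).lower(),
}

def render(template, context):
    def substitute(match):
        path, _, fmt = match.group(1).partition(":")
        value = resolve(context, path)
        return FORMATTERS[fmt](value) if fmt else str(value)
    return re.sub(r"\{([^{}]+)\}", substitute, template)

item = {"id": 123, "spec": {"name": "my_function"}}
print(render("item_{id}_{spec.name}.py", item))  # item_123_my_function.py
print(render("{spec:json}", item))               # spec rendered as JSON
```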
This project is licensed under the MIT License - see the LICENSE file for details.