Mimir is a state-of-the-art error correction system.
Instructions have been tested on macOS 14.5 on an Apple M2 chip. They should also work on UNIX systems and on the amd64 processor architecture. Running Mimir on Windows is untested.
Mimir can be run using `conda` or `mamba`.
To install Mimir on your machine, follow these steps:
- Install Miniforge3 on your machine. Follow the official installation instructions.
- Clone this repository with `git clone https://github.com/philipp-jung/mimir.git`.
- Navigate into the newly cloned directory with `cd mimir`, then run `conda env create -n mimir -f environment.yml` to create a new conda environment called `mimir`.
- Run `conda activate mimir` to activate the `mimir` environment.
- Navigate into the `src/` folder in the `mimir/` directory.
- Run `python correction.py` to correct sample data errors. Set parameters at the bottom of `correction.py` to adjust the correction process.
Mimir can be run as a container as well.
- Build an image: `docker build -t <your_docker_username>/mimir:latest .` Consult the `docker buildx` documentation for cross-platform builds.
- The measurement carried out by the container is controlled by environment variables. The `CONFIG` environment variable is a serialized hashmap whose keys are all the parameters that Mimir's `Corrector` object is configured with, plus an additional parameter called `run`. The `EXPERIMENT_ID` environment variable identifies the experiment; set it to any value you like.
- To run a container that cleans the `hospital` dataset using Mimir's full ensemble of correctors, execute `docker run -e CONFIG='{"dataset": "hospital", "n_rows": null, "error_fraction": 1, "error_class": "simple_mcar", "labeling_budget": 20, "synth_tuples": 100, "auto_instance_cache_model": true, "clean_with_user_input": true, "gpdep_threshold": 0.3, "training_time_limit": 600, "llm_name_corrfm": "gpt-3.5-turbo", "feature_generators": ["auto_instance", "fd", "llm_correction", "llm_master"], "classification_model": "ABC", "vicinity_orders": [1], "vicinity_feature_generator": "naive", "n_best_pdeps": 3, "synth_cleaning_threshold": 0.9, "test_synth_data_direction": "user_data", "pdep_features": ["pr"], "fd_feature": "norm_gpdep", "sampling_technique": "greedy", "run": 0}' -e EXPERIMENT_ID=hospital-all-correctors <your_docker_username>/mimir:latest`
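
Writing the `CONFIG` value by hand is error-prone because of the nested quoting. The following is a minimal sketch of assembling the same command in Python, assuming the serialized hashmap is a JSON object as in the example command above. The helper `docker_run_command` is hypothetical and not part of Mimir; the parameter values are copied from the example.

```python
import json
import shlex

# Parameters copied from the example command; "run" is the additional key
# that is not a Corrector parameter.
config = {
    "dataset": "hospital",
    "n_rows": None,  # serialized as JSON null
    "error_fraction": 1,
    "error_class": "simple_mcar",
    "labeling_budget": 20,
    "synth_tuples": 100,
    "auto_instance_cache_model": True,
    "clean_with_user_input": True,
    "gpdep_threshold": 0.3,
    "training_time_limit": 600,
    "llm_name_corrfm": "gpt-3.5-turbo",
    "feature_generators": ["auto_instance", "fd", "llm_correction", "llm_master"],
    "classification_model": "ABC",
    "vicinity_orders": [1],
    "vicinity_feature_generator": "naive",
    "n_best_pdeps": 3,
    "synth_cleaning_threshold": 0.9,
    "test_synth_data_direction": "user_data",
    "pdep_features": ["pr"],
    "fd_feature": "norm_gpdep",
    "sampling_technique": "greedy",
    "run": 0,
}


def docker_run_command(config: dict, experiment_id: str, image: str) -> str:
    """Return a shell-quoted `docker run` command (hypothetical helper)."""
    args = [
        "docker", "run",
        "-e", f"CONFIG={json.dumps(config)}",  # JSON-serialize the hashmap
        "-e", f"EXPERIMENT_ID={experiment_id}",
        image,
    ]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


if __name__ == "__main__":
    # Replace the image name with the tag you used for `docker build`.
    print(docker_run_command(config, "hospital-all-correctors",
                             "<your_docker_username>/mimir:latest"))
```

Changing a single entry of `config`, for example `labeling_budget` or `run`, and regenerating the command avoids editing the JSON string directly.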
To run our benchmarking experiments, see the `README.md` file in the `benchmarks/` directory.
In the `notebook/` directory, we provide the code used to generate all figures in the Mimir publication.