Python package to trace sensitive information and process flows on the blockchain.
Leverages the blockchain’s inherent properties (immutability, transparency, availability, and traceability) to record and audit sequential steps in any process. Ideal for applications requiring verifiable records of actions or sensitive data trails.
Save the sequential steps of any process, for example:
- Improve reproducibility of Machine Learning models. There is a 'reproducibility crisis' in ML, and the reproducibility and traceability of ML models are the main focus of this work.
- Upload hashes of big data files.
- Trace NGO donations.
- Improve supply chain traceability.
- Save important data of scientific studies.
- Proof of authorship. Trace results with an address and a timestamp.
- Text.
- User-defined applications.
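As an illustration of the "upload hashes of big data files" use case, the sketch below computes a SHA-256 digest of a file by reading it in chunks, so the file never has to fit in memory. The helper name `file_sha256` and the chunk size are assumptions made for this example and are not part of the package's API; only the resulting digest would be written to the blockchain.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; not part of the package API.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Anyone holding a copy of the file can recompute the digest and compare it with the value stored on-chain.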
git clone https://github.com/francocerino/BlockchainTracer.git
cd BlockchainTracer
python3 -m venv blockchain_tracer_env
source blockchain_tracer_env/bin/activate
pip install .
Run this command in your console:
npx shadcn@latest add "https://v0.app/chat/b/b_g1kTbNDXhik?token=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIn0..LPIK7itf1p9wLa7I.F6HOGYSmZvQniRTrCUZMAWm8yRrZP-Yg2F7XY82pPVOJOM3thdiHDJsjuh4.FnnualxaLhk_c6dTlGTWuQ"
This case is similar to supply chain traceability, but applied to a Machine Learning pipeline. It also aims to improve reproducibility by combining the standards developed for ML with the transparency, persistence, and immutability that blockchain provides.
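One way to picture the tracing of sequential pipeline steps is a chain of digests, where each step's digest also covers the digest of the step before it. The sketch below is an illustration only: the function names `step_digest` and `chain_steps`, the record layout, and the sample values are assumptions, not the package's actual API or data format.

```python
import hashlib
import json


def step_digest(step, previous_digest):
    """Hash one step together with the digest of the previous step."""
    payload = json.dumps(
        {"previous": previous_digest, "step": step},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_steps(steps):
    """Return one digest per step; each digest depends on all earlier steps."""
    digests = []
    previous = ""  # the first step has no predecessor
    for step in steps:
        previous = step_digest(step, previous)
        digests.append(previous)
    return digests


# Placeholder pipeline steps, invented for this example.
pipeline = [
    {"name": "load_data", "rows": 1000},
    {"name": "train", "seed": 42},
    {"name": "evaluate", "accuracy": 0.9},
]
digests = chain_steps(pipeline)
```

Because every digest depends on its predecessors, changing an early step changes all later digests, which makes tampering with the recorded order detectable.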
Read the saved and related bibliography to clarify what ML reproducibility requires:
- A Survey of Data Provenance in e-Science
- Ensuring Trustworthy Neural Network Training via Blockchain
- Towards Enabling Trusted Artificial Intelligence via Blockchain
- BlockFlow: Trust in Scientific Provenance Data
- ProML: A Decentralised Platform for Provenance Management of Machine Learning Software Systems
- Blockchain Based Provenance Sharing of Scientific Workflows
- Improving Reproducibility in Machine Learning Research (2021)
- Reproducibility in Machine Learning-Driven Research (2023)
- Leakage and the reproducibility crisis in machine learning-based science (2023)
- REFORMS: Reporting Standards for Machine Learning Based Science (2023)
- Traceability for Trustworthy AI: A Review of Models and Tools (2021). Comparison of some existing frameworks for ML reproducibility.
- Reproducibility in PyTorch
- Advancing Research Reproducibility in Machine Learning through Blockchain Technology (2024). Shows a review of works related to ML reproducibility with Blockchain.
- Promoting Distributed Trust in Machine Learning and Computational Simulation via a Blockchain Network
- Blockchain analytics and Artificial Intelligence
- Automatically Tracking Metadata and Provenance of Machine Learning Experiments. Describes an approach for scikit-learn Pipelines and other libraries.
- Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers (2024)
- Model Cards for Model Reporting, and Model Cards applied to known models. Each model card can be accompanied by Datasheets, Nutrition Labels, Data Statements, or Factsheets describing the datasets the model was trained and evaluated on.
- ML Reproducibility Tools and Best Practices
Specify what differentiates this work. A solution that simultaneously provides:
- Traceability of ML models on EVM blockchains with a Python API. Python is the most widely used language in ML, and the EVM is the most widely used platform for smart contracts.
- Open source code.
- Follows the standards proposed in previous studies on ML reproducibility. Open question: would a stronger focus on narrative help reproducibility?
- Ability to trace other processes in general, while focusing on ML reproducibility.
- Trace the computing environment in which the ML model was trained.
- Use Arweave or IPFS for large data, storing its hash on the EVM blockchain.
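Tracing the computing environment could start from a snapshot like the one sketched below, built only from the Python standard library. The function name `capture_environment` and the chosen fields are assumptions for illustration; the package may record different or additional information.

```python
import platform
from importlib import metadata


def capture_environment(packages=()):
    """Collect basic interpreter, OS, and package-version information.

    Hypothetical helper; the field names are illustrative only.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": versions,
    }
```

For example, `capture_environment(["numpy", "scikit-learn"])` returns a JSON-serializable dictionary that can be hashed or stored alongside the model record.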
Fine-tune the requirements for good reproducibility:
- The NeurIPS 2019 ML reproducibility checklist from Improving Reproducibility in Machine Learning Research.
- JSON data structure with every configuration of the ML pipeline (hardware, environment, preprocessing, hyperparameters, seeds, metrics, package versions, etc.).
- Model info sheet from Leakage and the reproducibility crisis in machine learning-based science (2023).
- Standardized environment, as in Leakage and the reproducibility crisis in machine learning-based science (2023).
- Checklist from REFORMS: Reporting Standards for Machine Learning Based Science (2023).
- Minimal Description Profile from Traceability for Trustworthy AI: A Review of Models and Tools (2021).
- Model Cards for Model Reporting
- MLflow for data logging. It has a UI to compare logged models and works with well-known libraries such as scikit-learn and XGBoost.
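The JSON data structure mentioned in the list above could look like the sketch below, which also derives a digest from a canonical serialization so that the same record always hashes to the same value regardless of key order. The field names, the sample values, and the helpers `canonical_json` and `record_digest` are assumptions for illustration, not a schema defined by this project.

```python
import hashlib
import json


def canonical_json(record):
    """Serialize with sorted keys and fixed separators for a stable byte form."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def record_digest(record):
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(record).encode("utf-8")).hexdigest()


# Placeholder values, invented for this example.
run_record = {
    "hardware": {"device": "cpu"},
    "environment": {"python": "3.11"},
    "preprocessing": ["standard_scaler"],
    "hyperparameters": {"learning_rate": 0.01, "epochs": 10},
    "seeds": {"numpy": 0, "python": 0},
    "metrics": {"accuracy": 0.93},
    "package_versions": {"scikit-learn": "1.4.0"},
}
digest = record_digest(run_record)
```

Sorting keys before hashing matters because two dictionaries with the same content but different insertion order would otherwise serialize, and therefore hash, differently.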
Give the user everything needed to reproduce models.
Ensure the code is easy to use and works well:
- Python code aimed at technical users who are not necessarily familiar with blockchain.
- Integration with EVM blockchains (the most used and highly decentralized).
- The code must handle the private key securely.
- Test code.
- Decide how to handle code and binaries.
- Integration with IPFS or Arweave for large data.
- Frontend for scalability (usable by non-technical users).
- Smart contract to decentralize the code used.
- Extend to other public blockchains.
- Extend to private blockchains.
- Display option to trace data with a new address.
- Expand to more RPCs (besides Infura).
- Automate model info sheet completion.
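For the private key requirement in the list above, one common precaution is to read the key from an environment variable instead of hardcoding it or committing it to the repository. The sketch below assumes a hypothetical variable name, `BLOCKCHAIN_TRACER_PRIVATE_KEY`, and a hypothetical helper `load_private_key`; neither is defined by the package. It validates that the value is a 32-byte hex string, the size of an EVM private key, and never echoes the key in error messages.

```python
import os

HEX_DIGITS = set("0123456789abcdefABCDEF")


def load_private_key(env_var="BLOCKCHAIN_TRACER_PRIVATE_KEY"):
    """Read a hex-encoded 32-byte private key from the environment.

    Hypothetical helper; the variable name is an assumption.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before signing.")
    hex_part = key[2:] if key.startswith(("0x", "0X")) else key
    if len(hex_part) != 64 or not set(hex_part) <= HEX_DIGITS:
        # The message names the variable only, never its value.
        raise ValueError(f"{env_var} must hold a 32-byte hex-encoded key.")
    return "0x" + hex_part.lower()
```

Keeping the key out of source files and logs reduces the chance of leaking it through version control or error reports.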