Welcome! This repository documents a complete AI-driven workflow developed during a one-year gap-year internship, focused on automating data extraction from scientific articles to create a database for modeling the initial dissolution rate of glasses—a critical property for materials used in nuclear waste containment.
🧠 Goal: Extract and structure glass composition and experimental parameters from PDF articles using LLMs, Langflow, OCR, and custom Python scripts, and use this data to train a model to predict the initial dissolution rate of glass under various conditions.
- 🔬 Project Background
- 🧰 Technologies Used
- 🧠 AI/ML Techniques
- ⚙️ Project Workflow
- 📊 Final Dataset & Modeling
- 📎 License & Citation
- 📚 References
In the context of nuclear waste storage, glass is used to immobilize radioactive materials. Predicting how this glass alters (dissolves) over time in various conditions is essential for long-term safety.
The initial dissolution rate is one of the key metrics, but gathering experimental data on this is difficult due to:
- Scattered information across decades of publications
- Lack of structured datasets
- Variability in composition formats and testing conditions
| Category | Tools / Technologies |
|---|---|
| AI/LLM Workflow | Langflow (v1.2.0) |
| Language Models | Mistral Small 22B (internal API) |
| Development | Python, VS Code 1.96.2 |
| Data Parsing | PDFMiner, PyMuPDF |
| OCR | Docling, Lab OCR Scanner |
| Database | SQLite + SQLAlchemy |
| ML Modeling | Scikit-learn |
| Visualization | matplotlib, seaborn |
- Document Question Answering via Langflow
- LLM Prompt Engineering to validate articles
- OCR preprocessing for scanned papers
- Structured information extraction from PDFs
- Unit conversion and normalization of compositions
- Tabular data organization for ML
- Neural network modeling of the dissolution rate
-
Collected ~50+ articles from:
- ScienceDirect
- Google Scholar
- Web of Science
- Supervisor suggestions
-
Used Docling and a lab OCR printer to digitize 3 scanned papers
Used Langflow + LLM with a strict specification sheet prompt to ensure each article includes:
- Full glass composition (~100%)
- Explicit initial dissolution rate (with unit)
- At least one experimental parameter (e.g., pH, temperature)
📄 Output: articles/valid_articles_metadata.json
🟢 20 valid articles (out of 50+ reviewed)
Extracted names of glasses with valid data using LLM, e.g.:
1. glass_1: NBS14/18
2. glass_2: NBSA
3. glass_3: NBSC
...
LLM prompts extract composition in various units → then converted using:
convert_composition.py- Output: mol% of elements (Si, O, B, Na, etc.)
For each test on each glass:
- Composition (mol%)
- Experimental conditions
- Initial rate (V₀ or r₀)
- Other metadata (irradiation, polishing, BET surface...)
🔢 Output: structured row for each glass-test combination (107 parameters)
- All cleaned data is inserted into an SQL database
- Converted to CSV:
data/final_dataset.csv
- Used
notebooks/03_model_training.ipynb - Feature engineering on mol% + key conditions
- Trained regression model (e.g., RandomForestRegressor)
- Output:
models/trained_model.pkl
- ✅ 20 validated papers
- 🧪 70+ unique tests on glasses
- 🧾 107 parameters per test
- 📈 Model performance: R²=97% and RMSE=0.42
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
You may not use this project for commercial purposes or publish derived works without attribution.
To view a copy of this license, visit:
http://creativecommons.org/licenses/by-nc/4.0/
If you use any part of this repository—including the workflow, prompts, dataset, or model—for research or publication, you must cite this work as:
@misc{ouanzougui2025glass,
author = {OUANZOUGUI Abdelhak},
title = {Predicting Glass Dissolution Rates with AI: A Langflow-Based Workflow},
year = 2025,
url = {https://github.com/OUANZOUGUIAbdelhak/GLASS-GPT}
}