Open-source data pipeline for analyzing human rights documents, with a focus on child and LGBTQ+ digital protection.
Scrape, process, tag, and analyze policy documents from international organizations. Track 10 human rights indicators across 194 countries. Support evidence-based advocacy and research.
Website: GRIMdata.org | LittleRainbowRights.com | Documentation: docs/ | Discussions: GitHub Discussions
- Multiple data sources - International organizations (UN, AU), treaty bodies (OHCHR, UPR, UNICEF, ACERWC, ACHPR), government sources, NGOs, legal databases, research publications
- Global and regional coverage - African, global, and country-specific sources across multiple regions
- Direct URL tracking - Government postings, public notices, community organizations, business/legal sources, policy documents
- Multi-format support - PDF, DOCX, HTML document processing
- Automated scraping - Respectful, rate-limited web scraping with fallback handlers
- Regex-based tagging - Identify child rights, LGBTQ+, AI, privacy, and digital policy themes
- Versioned tags - Compare results across different tag rule sets
- Tags history - Track all tagging operations with timestamps
- 194 countries tracked with 10 human rights indicators
- 2,543 source URLs - Authoritative sources from UNESCO, UNCTAD, ILGA, UNICEF, etc.
- Automated validation - Check source URLs for availability, detect changes
- CSV exports - Summary tables, by-indicator breakdowns, regional analysis
- 68 validator tests - Comprehensive input validation
- Path traversal protection - Prevent malicious file access
- URL validation - Block javascript:, file:, and other dangerous patterns
- File size limits - Protect against file bombs
- CSV exports - Tags summaries, scorecard data, analysis results
- Metadata tracking - Complete provenance for every document
- Reproducible - Version-controlled configs and timestamps
- REST API - 14 production-ready endpoints for programmatic data access (Flask backend)
- Documents: list with filters, pagination, sorting, detail view
- Scorecard: countries summary, indicators, statistics
- Tags: frequency analysis, version management, filtering
- Timeline: temporal analysis of tags over time
- Export: CSV downloads with SPDX license headers
- API key authentication, dynamic rate limiting (100-2000 req/hr)
- Docker deployment with Redis caching and Nginx reverse proxy
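The regex-based tagging, versioned tags, and tags history listed above can be pictured with a small sketch. The rule names, patterns, and version labels below are hypothetical and chosen only for illustration; the project's real rule sets live in its version-controlled configs and may be structured differently.

```python
import re
from datetime import datetime, timezone

# Hypothetical rule sets keyed by version (not the project's actual rules).
TAG_RULES = {
    "v1": {
        "child_rights": r"\bchild(ren)?\b",
        "privacy": r"\bprivacy\b",
    },
    "v2": {
        "child_rights": r"\b(child(ren)?|minors?)\b",
        "privacy": r"\b(privacy|data protection)\b",
        "ai": r"\bartificial intelligence\b",
    },
}


def tag_text(text, version):
    """Return the sorted tags whose pattern matches the text under one rule set."""
    rules = TAG_RULES[version]
    return sorted(
        tag
        for tag, pattern in rules.items()
        if re.search(pattern, text, flags=re.IGNORECASE)
    )


def tag_with_history(text, version, history):
    """Tag the text and append a timestamped record of the operation."""
    tags = tag_text(text, version)
    history.append(
        {
            "version": version,
            "tags": tags,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    return tags


sample = "The policy covers data protection for minors using artificial intelligence."
history = []
for version in ("v1", "v2"):
    # Running the same text through two rule sets shows how results can differ.
    print(version, tag_with_history(sample, version, history))
```

Keeping each rule set under its own version key is what makes a side-by-side comparison possible: the same document can be re-tagged under a newer rule set without losing the earlier result.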
Digital systems (AI, surveillance, biometric identification, identity verification) are being deployed rapidly, and they affect children and LGBTQ+ youth. We don't yet know whether these systems help or harm, but decisions are being made RIGHT NOW with permanent consequences.
The problem: Who should control vulnerable populations' digital rights?
- Parents? May not understand digital safety (already posting kids' photos publicly)
- Governments? May weaponize systems (countries criminalizing LGBTQ+ people using biometric data for tracking)
- Companies? May lack security (data "everywhere forever, easily hacked")
Without transparent tracking: Assumptions → decisions → irreversible consequences → by the time we know we were wrong, it is TOO LATE to reverse.
This pipeline tracks digital rights deployments across 194 countries, enabling evidence-based decisions BEFORE consequences become irreversible.
Research approach: We document facts and analyze enforcement mechanisms without imposing Western-centric values. Our methodology recognizes cultural context while focusing on protecting vulnerable populations' autonomy. See Data Governance for our cultural sensitivity framework.
Methodological foundation: Research Context | Published work: Vollmer & Vollmer (2022)
- Python 3.12 (required)
- 1GB+ disk space for code and small dataset
- Internet connection for scraping
# 1. Clone the repository
git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild
# 2. Set up virtual environment
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate # On Windows: .LittleRainbow\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Initialize project structure
python init_project.py
# Run complete pipeline for AU Policy documents
python pipeline_runner.py --source au_policy
# Run with latest tags
python pipeline_runner.py --source au_policy --tags-version latest
# Process specific country (UPR documents)
python pipeline_runner.py --source upr --country kenya
# Run scorecard workflow
python pipeline_runner.py --mode scorecard --scorecard-action all
Exports appear in data/exports/ as CSV files ready for analysis.
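Once a run has finished, a quick way to see what landed in data/exports/ is to list each CSV with its columns and row count. This is a minimal sketch using only the standard library; it assumes the exports are plain UTF-8 CSV files with a header row, and the function name `summarize_exports` is made up for this example.

```python
import csv
from pathlib import Path


def summarize_exports(export_dir="data/exports"):
    """Return {file name: (column names, data row count)} for each CSV found."""
    summary = {}
    for path in sorted(Path(export_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # an empty file yields no header
            rows = sum(1 for _ in reader)
        summary[path.name] = (header, rows)
    return summary


if __name__ == "__main__":
    for name, (columns, rows) in summarize_exports().items():
        print(f"{name}: {rows} rows, columns={columns}")
```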
Don't want to run the pipeline? Access data via REST API:
# Install API dependencies
pip install -r api_requirements.txt
# Start API server
python run_api.py
Access data programmatically:
# Health check
curl http://localhost:5000/api/health
# Get all documents
curl http://localhost:5000/api/documents
# Filter by country
curl "http://localhost:5000/api/documents?country=Kenya"
# Get scorecard
curl http://localhost:5000/api/scorecard/Kenya
Python example:
import requests
# Optional: Use API key for higher rate limits
headers = {"X-API-Key": "your-api-key"}
# Get Kenya's scorecard
response = requests.get("http://localhost:5000/api/scorecard/Kenya", headers=headers)
data = response.json()["data"]
print(data["indicators"])
# Get documents about AI policy
response = requests.get("http://localhost:5000/api/documents?tags=AI", headers=headers)
documents = response.json()["data"]["items"]
# Get tag frequency for Africa
response = requests.get("http://localhost:5000/api/tags?region=Africa", headers=headers)
tags = response.json()["data"]["tags"]
# Download scorecard CSV
response = requests.get("http://localhost:5000/api/export/scorecard_summary", headers=headers)
with open("scorecard.csv", "wb") as f:
    f.write(response.content)
14 endpoints available:
- Documents: list, filter, detail (with pagination and sorting)
- Scorecard: summary, country detail, statistics
- Tags: frequency analysis, version list (filterable by country/region/year)
- Timeline: tags over time (year × tag matrix)
- Export: list formats, download CSVs
- Health: status check, system info
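To make the year × tag matrix from the Timeline endpoint concrete, here is a rough sketch of how such a matrix could be built from tagged documents. The record shape (`year` and `tags` keys) and the function name are assumptions for illustration; the API's actual response format is described in the API documentation, not here.

```python
from collections import Counter


def build_timeline(documents):
    """Count documents per (year, tag) and return (years, tags, matrix)."""
    counts = Counter()
    for doc in documents:
        # Use a set so a tag repeated within one document counts once.
        for tag in set(doc["tags"]):
            counts[(doc["year"], tag)] += 1
    years = sorted({year for year, _ in counts})
    tags = sorted({tag for _, tag in counts})
    # Rows are years, columns are tags; missing pairs count as zero.
    matrix = [[counts[(year, tag)] for tag in tags] for year in years]
    return years, tags, matrix


docs = [
    {"year": 2022, "tags": ["AI", "privacy"]},
    {"year": 2023, "tags": ["AI"]},
    {"year": 2023, "tags": ["AI", "child_rights"]},
]
years, tags, matrix = build_timeline(docs)
for year, row in zip(years, matrix):
    print(year, dict(zip(tags, row)))
```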
Features:
- API key authentication (optional, higher rate limits when authenticated)
- Rate limiting: 100 req/hr public, 1000 req/hr authenticated
- Caching: 15min-1hr TTLs for optimal performance
- Docker deployment ready with Redis and Nginx
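With limits of 100 req/hr (public) and 1000 req/hr (authenticated), a client that loops over many countries may want to back off when it is throttled. The sketch below assumes the API answers an over-limit request with HTTP 429 and may send a Retry-After header; that is common practice but is not confirmed by this README. `get_with_backoff` is a hypothetical helper, and `fetch` is any callable that returns an object with `status_code` and `headers`, such as a wrapper around `requests.get`.

```python
import time


def get_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff while it returns 429."""
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            delay = base_delay * 2 ** attempt
            # Prefer the server's Retry-After hint when it is a number of seconds.
            retry_after = response.headers.get("Retry-After")
            if retry_after:
                try:
                    delay = float(retry_after)
                except ValueError:
                    pass  # e.g. an HTTP date; keep the exponential delay
            sleep(delay)
    # Still throttled after the last attempt: hand the 429 back to the caller.
    return response
```

A call might look like `get_with_backoff(lambda u: requests.get(u, headers=headers), "http://localhost:5000/api/documents")`. Passing `fetch` and `sleep` in as parameters keeps the helper testable without a running server.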
Full API documentation: docs/api/index.md | Quick Reference | Production Deployment
Phase 1-2 Complete:
- Core pipeline (scraping, processing, tagging) - Multiple sources: 6 automated scrapers + direct URL tracking
- Scorecard system - 194 countries, 10 indicators, 2,543 source URLs tracked
- Validation & security framework - 170 tests passing (68 validator tests)
- Recommendations extraction system - Regex-based with versioning and history tracking
- Timeline exports - Global, by-country, and by-region analysis over time
- Comparison analytics - Compare tags and recommendations across versions
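The validation and security framework covers, among other things, blocking javascript: and file: URLs and protecting against path traversal. As a rough illustration of both ideas, the sketch below uses a scheme allowlist (rather than a blocklist of dangerous patterns, which may be what the project does) and a resolved-path containment check. The function names are hypothetical and are not the project's validators.

```python
from pathlib import Path
from urllib.parse import urlparse

# Only these schemes are accepted; javascript:, file:, data: and the like fail.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url):
    """Accept only absolute http(s) URLs that name a host."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return False  # malformed input such as a broken IPv6 literal
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def resolve_inside(base_dir, user_path):
    """Resolve user_path under base_dir, refusing anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # After resolving ".." segments and symlinks, the path must stay under base.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate
```

An allowlist fails closed: a scheme nobody thought to block is rejected by default, which is usually the safer choice for scraped or user-supplied URLs.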
Phase 3 Complete (8/9 tasks): Advanced processing features operational
- ISO 3166-1 alpha-2 country code mapping - 194 countries fully mapped
- Document type classifier - Multi-stage rules-based classification
- Scorecard maintenance system - Phase 1 critical updates completed (6 countries, 18 fields updated)
- Multi-format scorecard exports - CSV, XLSX, ODS, Google Sheets JSON
Phase 4 Complete (5/5 weeks): REST API backend operational
- Flask API backend (Week 1-2) - Foundation and core endpoints
  - App factory pattern, configuration management, extensions
  - Documents API (list, filter, detail) with pagination and sorting
  - Scorecard API (countries, indicators, statistics)
  - Health and system info endpoints
  - Request validation, caching (15min-1hr TTLs), error handling
- Extended APIs (Week 3) - Tags, Timeline, Export endpoints
  - Tags API: frequency analysis, version management, filtering
  - Timeline API: temporal analysis (year × tag matrices)
  - Export API: CSV downloads with SPDX license headers
- Authentication & Rate Limiting (Week 4) - Security features
  - API key authentication via X-API-Key header
  - Dynamic rate limiting (100/1000 req/hr public/authenticated)
  - Custom limits for expensive operations (exports, search)
- Production Deployment (Week 5) - Infrastructure ready
  - Docker + docker-compose configuration
  - Redis caching and rate limiting storage
  - Nginx reverse proxy with SSL/TLS
  - Complete deployment guide (678 lines)
- Testing & Quality - 104 tests passing (100% success rate)
- Interactive dashboard frontend - Planned for Phase 5
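The Export API serves CSV downloads with SPDX license headers. This README does not show how that header is laid out, so the sketch below assumes one plausible form: a leading `#` comment line carrying the SPDX identifier for the data license (CC BY 4.0, i.e. `CC-BY-4.0`). Both helper names are hypothetical.

```python
import csv
import io

# Assumed header layout; the real exports may format this differently.
SPDX_LINE = "# SPDX-License-Identifier: CC-BY-4.0"


def write_csv_with_spdx(rows, fieldnames):
    """Serialize rows to CSV text, preceded by an SPDX comment line."""
    buffer = io.StringIO()
    buffer.write(SPDX_LINE + "\n")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv_skipping_comments(text):
    """Parse CSV text while ignoring leading '#' comment lines."""
    lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    return list(csv.DictReader(lines))


text = write_csv_with_spdx([{"country": "Kenya", "tag": "AI"}], ["country", "tag"])
print(text)
print(read_csv_skipping_comments(text))
```

If the real files do use `#` comment lines, pandas users could get the same effect with `pd.read_csv(path, comment="#")`.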
See docs/ROADMAP.md for detailed roadmap and docs/api/index.md for API documentation.
- FAQ - Frequently asked questions
- Architecture - System design and components
- Glossary - Key terms and definitions
- Runbook - Complete command reference
- Scorecard Workflow - Indicator tracking system
- Data Governance - Privacy, ethics, responsible research
- Roadmap - Development phases and future features
- API Documentation - REST API endpoints, usage, examples
- API Quick Start - Fast reference for API usage
See docs/DOCS_INDEX.md for full documentation index.
Common issues:
- Virtual environment - Activate before installing dependencies
- Python version - Must use Python 3.12 specifically
- Import errors - Run commands from project root, not subdirectories
- Pre-commit failures - Run `pre-commit run --all-files` to fix formatting
See First Run Error Checklist for detailed solutions.
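Two of the common issues above (Python version and running from the wrong directory) can be checked before starting a run. This is a small, hypothetical preflight sketch, not a script shipped with the project; it assumes that `pipeline_runner.py` sits in the project root, as the quick-start commands suggest.

```python
import sys
from pathlib import Path

REQUIRED_PYTHON = (3, 12)


def is_supported_python(version_info=sys.version_info):
    """True when the interpreter's major.minor version is exactly 3.12."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def in_project_root(path="."):
    """True when pipeline_runner.py is present in the given directory."""
    return Path(path, "pipeline_runner.py").is_file()


if __name__ == "__main__":
    if not is_supported_python():
        print(f"Python 3.12 is required, found {sys.version.split()[0]}")
    if not in_project_root():
        print("Run commands from the project root (pipeline_runner.py not found)")
```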
We welcome contributions from researchers, developers, and human rights advocates!
Ways to contribute:
- Report bugs and issues
- Add new data sources (scrapers)
- Improve documentation
- Add test coverage
- Suggest features
Getting started:
- Read CONTRIBUTING.md for guidelines
- Check issues labeled good first issue
- Fork the repo and create a feature branch
- Submit a pull request
Developer setup:
# Install development tools
pip install pre-commit pytest pytest-cov
# Set up pre-commit hooks (required before committing)
pre-commit install
# Run tests
pytest tests/ -v
# Run all quality checks
pre-commit run --all-files
See CONTRIBUTING.md for detailed guidelines.
MIT License - see LICENSE file
This project uses dual licensing:
- Code (software): MIT License - Free to use, modify, and distribute
- Data & Documentation: CC BY 4.0 - Attribution required (see LICENSE-DATA)
This means:
- Use the code freely, including commercial applications
- Use and share the scorecard data with attribution
- Fork, modify, and redistribute
- Don't remove attribution from data/docs
Full license terms: LICENSE (MIT) and LICENSE-DATA (CC BY 4.0)
If you use this project in your research, please cite it:
@software{digitalchild2026,
  title = {DigitalChild: Human Rights Data Pipeline for Child and LGBTQ+ Digital Protection},
  author = {Vollmer, S.C. and Vollmer, D.T.},
  year = {2026},
  version = {2.0.0},
  url = {https://github.com/MissCrispenCakes/DigitalChild},
  doi = {10.5281/zenodo.18318098},
  note = {Available at: https://grimdata.org. ORCID: 0000-0002-3359-2810 (S.C. Vollmer)}
}
Or use the format in CITATION.cff.
For the scorecard data specifically:
Vollmer, D.T., & Vollmer, S.C. (2025). LittleRainbowRights Scorecard: Child and LGBTQ+ Digital Rights Indicators. Licensed under CC BY 4.0. Available at: https://github.com/MissCrispenCakes/DigitalChild. ORCID: 0000-0002-3359-2810 (S.C. Vollmer)
Found a security vulnerability? Do not open a public issue.
Report via GitHub Security Advisories - click "Report a vulnerability"
See SECURITY.md for full responsible disclosure policy.
This project analyzes publicly available human rights documents from:
- United Nations (OHCHR, UPR, UNICEF)
- African Union (AU Policy, ACERWC, ACHPR)
- UNESCO, UNCTAD, ILGA World, and other authoritative sources
Data sources tracked with 2,543 validated URLs ensuring transparency and verification.
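The automated validation mentioned earlier checks source URLs for availability and detects changes. How the project detects a change is not described in this README; one common approach, sketched here as an assumption, is to store a content hash per URL and compare snapshots between runs. The function names and example URLs are illustrative only.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact content."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous, current):
    """Compare two {url: fingerprint} snapshots taken at different times."""
    before, after = set(previous), set(current)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
        "changed": sorted(
            url for url in before & after if previous[url] != current[url]
        ),
    }


old = {
    "https://example.org/a": fingerprint(b"version 1"),
    "https://example.org/b": fingerprint(b"same"),
}
new = {
    "https://example.org/a": fingerprint(b"version 2"),
    "https://example.org/b": fingerprint(b"same"),
    "https://example.org/c": fingerprint(b"new page"),
}
print(detect_changes(old, new))
```

A byte-level hash flags any change, including cosmetic ones such as a rotated banner or timestamp, so a real checker would probably normalize or extract the main content before hashing.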
Built with:
- Python 3.12, BeautifulSoup4, Selenium, pandas, pypdf, pytest
- Flask, Flask-CORS, Flask-Caching for REST API
- GitHub Pages for documentation
- MkDocs Material for website
Maintained by: PhD student (passion project, please be patient with response times!)
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Website: GRIMdata.org
Support the project:
- Star this repository
- Share with researchers and advocates
- Contribute code or documentation
- Cite in your publications
Last updated: January 2026