working repo for GRIMdata.org LittleRainbowRights.com | QueerAI for the Digital Child journal/conference paper

DigitalChild

GRIMdata / LittleRainbowRights

CI CD DOI REUSE status License: MIT Data License: CC BY 4.0 Python 3.12

Open-source data pipeline for analyzing human rights documents with focus on child and LGBTQ+ digital protection.

Scrape, process, tag, and analyze policy documents from international organizations. Track 10 human rights indicators across 194 countries. Support evidence-based advocacy and research.

🌍 Website: GRIMdata.org | LittleRainbowRights.com | 📖 Documentation: docs/ | 💬 Discussions: GitHub Discussions


✨ Key Features

📥 Data Collection

  • Multiple data sources - International organizations (UN, AU), treaty bodies (OHCHR, UPR, UNICEF, ACERWC, ACHPR), government sources, NGOs, legal databases, research publications
  • Global and regional coverage - African, global, and country-specific sources across multiple regions
  • Direct URL tracking - Government postings, public notices, community organizations, business/legal sources, policy documents
  • Multi-format support - PDF, DOCX, HTML document processing
  • Automated scraping - Respectful, rate-limited web scraping with fallback handlers
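
Rate-limited scraping with fallback handlers can be sketched roughly as follows. This is an illustration of the pattern only, not the project's scraper code: `RateLimiter`, `fetch_with_fallback`, and the one-second default interval are hypothetical, and the handlers are plain callables so the sketch can be tried without network access.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float = 1.0,  # assumed value, not taken from the project
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_with_fallback(
    url: str,
    handlers: Sequence[Callable[[str], str]],
    limiter: RateLimiter,
) -> str:
    """Try each handler in order, pausing before every attempt."""
    errors = []
    for handler in handlers:
        limiter.wait()
        try:
            return handler(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"All handlers failed for {url}: {errors}")
```

Passing the clock and sleep functions as parameters keeps the limiter testable without real delays; a primary handler could wrap an HTTP client and a fallback could wrap a browser-based fetch.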

🏷️ Analysis & Tagging

  • Regex-based tagging - Identify child rights, LGBTQ+, AI, privacy, and digital policy themes
  • Versioned tags - Compare results across different tag rule sets
  • Tags history - Track all tagging operations with timestamps
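
To make the tagging idea concrete, here is a minimal sketch of regex-based tagging with versioned rule sets and a history log. The rule sets, version names (`v1`, `v2`), patterns, and function names are all hypothetical and chosen for illustration; the project's actual rules live in its own tag configuration files.

```python
import re
from datetime import datetime, timezone

# Hypothetical rule sets keyed by version; not the project's real patterns.
TAG_RULES = {
    "v1": {
        "child_rights": [r"\bchild(?:ren)?\b", r"\bminors?\b"],
        "privacy": [r"\bprivacy\b", r"\bdata protection\b"],
    },
    "v2": {
        "child_rights": [r"\bchild(?:ren)?\b", r"\bminors?\b", r"\byouth\b"],
        "privacy": [r"\bprivacy\b", r"\bdata protection\b"],
        "ai": [r"\bartificial intelligence\b", r"\bAI\b"],
    },
}


def tag_text(text: str, version: str) -> list[str]:
    """Return the sorted tags whose patterns match the text under one rule version."""
    rules = TAG_RULES[version]
    return sorted(
        tag
        for tag, patterns in rules.items()
        if any(re.search(p, text, re.IGNORECASE) for p in patterns)
    )


def tag_with_history(text: str, version: str, history: list[dict]) -> list[str]:
    """Tag the text and append a timestamped record of the operation to history."""
    tags = tag_text(text, version)
    history.append(
        {
            "version": version,
            "tags": tags,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    return tags
```

Running the same text through two versions and comparing the returned lists shows which tags a rule change adds or drops, which is the kind of comparison the versioned tags feature describes.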

📊 Scorecard System

  • 194 countries tracked with 10 human rights indicators
  • 2,543 source URLs - Authoritative sources from UNESCO, UNCTAD, ILGA, UNICEF, etc.
  • Automated validation - Check source URLs for availability, detect changes
  • CSV exports - Summary tables, by-indicator breakdowns, regional analysis

🔒 Security & Validation

  • 68 validator tests - Comprehensive input validation
  • Path traversal protection - Prevent malicious file access
  • URL validation - Block javascript:, file:, and other dangerous patterns
  • File size limits - Protect against file bombs
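
The checks listed above could look something like the sketch below. It is not the project's validator module: the function names, the allowed-scheme list, and the 50 MB cap are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web URLs are accepted
MAX_FILE_BYTES = 50 * 1024 * 1024    # assumption: 50 MB cap, not a project value


def is_safe_url(url: str) -> bool:
    """Accept only http(s) URLs with a host; rejects javascript:, file:, and similar."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def is_within_base(base_dir: str, candidate: str) -> bool:
    """Reject paths that resolve outside base_dir (e.g. via '..' or absolute paths)."""
    base = Path(base_dir).resolve()
    target = (base / candidate).resolve()
    return target.is_relative_to(base)  # Path.is_relative_to needs Python 3.9+


def is_size_allowed(size_bytes: int, limit: int = MAX_FILE_BYTES) -> bool:
    """Reject negative sizes and files larger than the configured limit."""
    return 0 <= size_bytes <= limit
```

An allowlist of schemes is used here rather than a blocklist, since it also rejects dangerous schemes that nobody thought to enumerate.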

📈 Export & Research

  • CSV exports - Tags summaries, scorecard data, analysis results
  • Metadata tracking - Complete provenance for every document
  • Reproducible - Version-controlled configs and timestamps
  • REST API - 14 production-ready endpoints for programmatic data access (Flask backend)
    • Documents: list with filters, pagination, sorting, detail view
    • Scorecard: countries summary, indicators, statistics
    • Tags: frequency analysis, version management, filtering
    • Timeline: temporal analysis of tags over time
    • Export: CSV downloads with SPDX license headers
    • API key authentication, dynamic rate limiting (100-2000 req/hr)
    • Docker deployment with Redis caching and Nginx reverse proxy

🎯 Why This Work Matters

Digital systems (AI, surveillance, biometric identification, identity verification systems) are being deployed rapidly, affecting children and LGBTQ+ youth. We don't yet know whether these systems help or harm, but decisions are being made RIGHT NOW with permanent consequences.

The problem: Who should control vulnerable populations' digital rights?

  • Parents? May not understand digital safety (already posting kids' photos publicly)
  • Governments? May weaponize systems (countries criminalizing LGBTQ+ people using biometric data for tracking)
  • Companies? May lack security (data "everywhere forever, easily hacked")

Without transparent tracking: Assumptions → decisions → irreversible consequences → by the time we know we were wrong, TOO LATE to reverse.

This pipeline tracks digital rights deployments across 194 countries, enabling evidence-based decisions BEFORE consequences become irreversible.

Research approach: We document facts and analyze enforcement mechanisms without imposing Western-centric values. Our methodology recognizes cultural context while focusing on protecting vulnerable populations' autonomy. See Data Governance for our cultural sensitivity framework.

Methodological foundation: Research Context | Published work: Vollmer & Vollmer (2022)


🚀 Quick Start

Prerequisites

  • Python 3.12 (required)
  • 1GB+ disk space for code and small dataset
  • Internet connection for scraping

Installation

# 1. Clone the repository
git clone https://github.com/MissCrispenCakes/DigitalChild.git
cd DigitalChild

# 2. Set up virtual environment
python3 -m venv .LittleRainbow
source .LittleRainbow/bin/activate  # On Windows: .LittleRainbow\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Initialize project structure
python init_project.py

Basic Usage

# Run complete pipeline for AU Policy documents
python pipeline_runner.py --source au_policy

# Run with latest tags
python pipeline_runner.py --source au_policy --tags-version latest

# Process specific country (UPR documents)
python pipeline_runner.py --source upr --country kenya

# Run scorecard workflow
python pipeline_runner.py --mode scorecard --scorecard-action all

Exports appear in data/exports/ as CSV files ready for analysis.

🆕 Using the API (Alternative to Pipeline)

Don't want to run the pipeline? Access data via REST API:

# Install API dependencies
pip install -r api_requirements.txt

# Start API server
python run_api.py

Access data programmatically:

# Health check
curl http://localhost:5000/api/health

# Get all documents
curl http://localhost:5000/api/documents

# Filter by country
curl "http://localhost:5000/api/documents?country=Kenya"

# Get scorecard
curl http://localhost:5000/api/scorecard/Kenya

Python example:

import requests

BASE_URL = "http://localhost:5000/api"

# Optional: use an API key for higher rate limits
headers = {"X-API-Key": "your-api-key"}

# Get Kenya's scorecard
response = requests.get(f"{BASE_URL}/scorecard/Kenya", headers=headers, timeout=30)
response.raise_for_status()  # stop early on HTTP errors instead of parsing an error body
data = response.json()["data"]
print(data["indicators"])

# Get documents about AI policy
response = requests.get(f"{BASE_URL}/documents", params={"tags": "AI"}, headers=headers, timeout=30)
response.raise_for_status()
documents = response.json()["data"]["items"]

# Get tag frequency for Africa
response = requests.get(f"{BASE_URL}/tags", params={"region": "Africa"}, headers=headers, timeout=30)
response.raise_for_status()
tags = response.json()["data"]["tags"]

# Download scorecard CSV
response = requests.get(f"{BASE_URL}/export/scorecard_summary", headers=headers, timeout=30)
response.raise_for_status()
with open("scorecard.csv", "wb") as f:
    f.write(response.content)

14 endpoints available:

  • Documents: list, filter, detail (with pagination and sorting)
  • Scorecard: summary, country detail, statistics
  • Tags: frequency analysis, version list (filterable by country/region/year)
  • Timeline: tags over time (year × tag matrix)
  • Export: list formats, download CSVs
  • Health: status check, system info
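
For readers unfamiliar with the year × tag matrix mentioned for the Timeline endpoint, the sketch below shows one way such a matrix can be built from tagged documents. The `build_timeline` function and the `year`/`tags` record fields are hypothetical; the API's actual response format is described in docs/api/index.md and may differ.

```python
from collections import Counter, defaultdict


def build_timeline(documents: list[dict]) -> dict[int, dict[str, int]]:
    """Count, per year, how many documents carry each tag.

    Each document is assumed to look like {"year": 2023, "tags": ["ai", "privacy"]}.
    """
    counts: dict[int, Counter] = defaultdict(Counter)
    for doc in documents:
        # set() so a tag repeated within one document is counted once
        for tag in set(doc["tags"]):
            counts[doc["year"]][tag] += 1

    # Use the same tag columns for every year so the result is a full matrix
    all_tags = sorted({tag for year_counts in counts.values() for tag in year_counts})
    return {
        year: {tag: counts[year][tag] for tag in all_tags}
        for year in sorted(counts)
    }
```

Filling missing year/tag combinations with zero keeps every row the same width, which makes the result easy to load into a table or export as CSV.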

Features:

  • API key authentication (optional, higher rate limits when authenticated)
  • Rate limiting: 100 req/hr public, 1000 req/hr authenticated
  • Caching: 15min-1hr TTLs for optimal performance
  • Docker deployment ready with Redis and Nginx

📖 Full API documentation: docs/api/index.md | Quick Reference | Production Deployment


📋 Project Status

Phase 1-2 Complete:

  • ✅ Core pipeline (scraping, processing, tagging) - Multiple sources: 6 automated scrapers + direct URL tracking
  • ✅ Scorecard system - 194 countries, 10 indicators, 2,543 source URLs tracked
  • ✅ Validation & security framework - 170 tests passing (68 validator tests)
  • ✅ Recommendations extraction system - Regex-based with versioning and history tracking
  • ✅ Timeline exports - Global, by-country, and by-region analysis over time
  • ✅ Comparison analytics - Compare tags and recommendations across versions

Phase 3 Complete (8/9 tasks): Advanced processing features operational

  • ✅ ISO 3166-1 alpha-2 country code mapping - 194 countries fully mapped
  • ✅ Document type classifier - Multi-stage rules-based classification
  • ✅ Scorecard maintenance system - Phase 1 critical updates completed (6 countries, 18 fields updated)
  • ✅ Multi-format scorecard exports - CSV, XLSX, ODS, Google Sheets JSON

Phase 4 Complete (5/5 weeks): REST API backend operational

  • ✅ Flask API backend (Week 1-2) - Foundation and core endpoints
    • App factory pattern, configuration management, extensions
    • Documents API (list, filter, detail) with pagination and sorting
    • Scorecard API (countries, indicators, statistics)
    • Health and system info endpoints
    • Request validation, caching (15min-1hr TTLs), error handling
  • ✅ Extended APIs (Week 3) - Tags, Timeline, Export endpoints
    • Tags API: frequency analysis, version management, filtering
    • Timeline API: temporal analysis (year × tag matrices)
    • Export API: CSV downloads with SPDX license headers
  • ✅ Authentication & Rate Limiting (Week 4) - Security features
    • API key authentication via X-API-Key header
    • Dynamic rate limiting (100/1000 req/hr public/authenticated)
    • Custom limits for expensive operations (exports, search)
  • ✅ Production Deployment (Week 5) - Infrastructure ready
    • Docker + docker-compose configuration
    • Redis caching and rate limiting storage
    • Nginx reverse proxy with SSL/TLS
    • Complete deployment guide (678 lines)
  • ✅ Testing & Quality - 104 tests passing (100% success rate)
  • ⏳ Interactive dashboard frontend - Planned for Phase 5

See docs/ROADMAP.md for detailed roadmap and docs/api/index.md for API documentation.


📚 Documentation

See docs/DOCS_INDEX.md for full documentation index.


🛠 Troubleshooting

Common issues:

  • Virtual environment - Activate before installing dependencies
  • Python version - Must use Python 3.12 specifically
  • Import errors - Run commands from project root, not subdirectories
  • Pre-commit failures - Run pre-commit run --all-files to fix formatting

See First Run Error Checklist for detailed solutions.


🤝 Contributing

We welcome contributions from researchers, developers, and human rights advocates!

Ways to contribute:

  • Report bugs and issues
  • Add new data sources (scrapers)
  • Improve documentation
  • Add test coverage
  • Suggest features

Getting started:

  1. Read CONTRIBUTING.md for guidelines
  2. Check issues labeled good first issue
  3. Fork the repo and create a feature branch
  4. Submit a pull request

Developer setup:

# Install development tools
pip install pre-commit pytest pytest-cov

# Set up pre-commit hooks (required before committing)
pre-commit install

# Run tests
pytest tests/ -v

# Run all quality checks
pre-commit run --all-files

See CONTRIBUTING.md for detailed guidelines.


📄 License

MIT License - see LICENSE file

This project uses dual licensing:

  • Code (software): MIT License - Free to use, modify, and distribute
  • Data & Documentation: CC BY 4.0 - Attribution required (see LICENSE-DATA)

This means:

  • ✅ Use the code freely, including commercial applications
  • ✅ Use and share the scorecard data with attribution
  • ✅ Fork, modify, and redistribute
  • ❌ Don't remove attribution from data/docs

Full license terms: LICENSE (MIT) and LICENSE-DATA (CC BY 4.0)


📖 Citation

If you use this project in your research, please cite it:

@software{digitalchild2026,
  title = {DigitalChild: Human Rights Data Pipeline for Child and LGBTQ+ Digital Protection},
  author = {Vollmer, S.C. and Vollmer, D.T.},
  year = {2026},
  version = {2.0.0},
  url = {https://github.com/MissCrispenCakes/DigitalChild},
  doi = {10.5281/zenodo.18318098},
  note = {Available at: https://grimdata.org. ORCID: 0000-0002-3359-2810 (S.C. Vollmer)}
}

Or use the format in CITATION.cff.

For the scorecard data specifically:

Vollmer, D.T., & Vollmer, S.C. (2025). LittleRainbowRights Scorecard: Child and LGBTQ+ Digital Rights Indicators. Licensed under CC BY 4.0. Available at: https://github.com/MissCrispenCakes/DigitalChild ORCID: 0000-0002-3359-2810 (S.C. Vollmer)


🔒 Security

Found a security vulnerability? Do not open a public issue.

Report via GitHub Security Advisories - click "Report a vulnerability"

See SECURITY.md for full responsible disclosure policy.


🙏 Acknowledgments

This project analyzes publicly available human rights documents from:

  • United Nations (OHCHR, UPR, UNICEF)
  • African Union (AU Policy, ACERWC, ACHPR)
  • UNESCO, UNCTAD, ILGA World, and other authoritative sources

Data sources tracked with 2,543 validated URLs ensuring transparency and verification.

Built with:

  • Python 3.12, BeautifulSoup4, Selenium, pandas, pypdf, pytest
  • Flask, Flask-CORS, Flask-Caching for REST API
  • GitHub Pages for documentation
  • MkDocs Material for website

Maintained by: PhD student (passion project, please be patient with response times!)


📞 Contact & Support

Support the project:

  • ⭐ Star this repository
  • 📢 Share with researchers and advocates
  • 💻 Contribute code or documentation
  • 📝 Cite in your publications

Last updated: January 2026
