
🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction.


GlotWeb

About GlotWeb

GlotWeb is an advanced web indexing system specifically designed to address the digital resource gap for minority languages. Our system:

  • Identifies web content in 402+ languages through multi-source aggregation
  • Validates linguistic accuracy using GlotLID language identification
  • Filters content to ensure quality while minimizing religious bias
  • Compiles 169,155+ verified web links (47% in languages absent from major datasets)

Key Features

✔ Covers languages missing from FLORES-200, MADLAD-400, and Glot500
✔ Open-source pipeline with reproducible results
✔ Interactive demo showcasing language resources (click the image below to access)

[GlotWeb demo preview image]

Getting Started

This documentation walks through GlotWeb's 4-step pipeline:

  1. Search Setup: Configure and run web searches
  2. Seed Generation: Filter initial results
  3. Crawling: Expand and validate links
  4. Cleaning: Deduplicate and finalize outputs

How to Use This Guide

  1. Follow steps sequentially (1 → 4)
  2. Each section includes:
    • Purpose explanation
    • Configuration options
    • Execution commands
    • Expected outputs
  3. Requires basic Python/Docker knowledge

Tip: For quick setup, clone the repository and use the provided configuration files as templates.

Ready to begin? Proceed to Step 1: Set up SearXNG and perform search.

Step 1: Set up SearXNG and perform search

SearXNG

Install

export PORT=8080

docker pull searxng/searxng
docker run --rm \
             -d -p ${PORT}:8080 \
             -v "${PWD}/searxng:/etc/searxng" \
             -e "BASE_URL=http://localhost:${PORT}/" \
             -e "INSTANCE_NAME=my-instance" \
             searxng/searxng

Add the JSON output format:

In the ${PWD}/searxng directory there is a file named settings.yml. In that file, enable the JSON output format in the SearXNG configuration under the search.formats key, like this:

search:
  formats:
    - html
    - json

Modify uwsgi.ini: In the ${PWD}/searxng directory there is a file named uwsgi.ini. In that file, increase buffer-size; the default is 8k, and raising it to 9k sometimes helps with 'Internal Error 500'.

Default Value:

buffer-size = 8192

Change to:

buffer-size = 9216

Search Service Script: search_service.py

This is an object-oriented Python script that leverages the Searx API to perform searches and save the results to JSON files. The script is configurable using a YAML configuration file called 'search_config.yaml'.

Features

  • Uses SearxSearchWrapper for querying multiple search engines.
  • Handles retries for failed requests.
  • Configurable search parameters through a YAML file.
  • Configurable input file, search range, output directory, and other parameters.
  • Automatically saves results in a structured JSON format.

Configuration file parameters:

searx_host: "http://127.0.0.1:8080"  # Searx instance URL
engines:
  - "bing"
  - "yahoo"
  - "qwant"
  - "duckduckgo"  # Search engines to be used
num_results: 50  # Number of results to fetch for each query
max_retries: 3  # Maximum number of retries for failed requests
retry_wait_time: 2  # Wait time (in seconds) between retries
output_file_prefix: "results"  # Prefix for output file names
output_directory: "search_dump"  # Directory to save output files
input_file: "input.txt"  # Path to the input file containing search queries
start_index: 0  # Start index for queries to process
end_index: 10  # End index for queries to process
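
As a rough illustration of how these parameters fit together, here is a minimal sketch (not the actual search_service.py) that loads search_config.yaml and issues a single query through the SearxSearchWrapper mentioned above; the import path may vary with your LangChain version.

# Minimal sketch, assuming a local SearXNG instance and the config above.
import json
import yaml
from langchain_community.utilities import SearxSearchWrapper  # path may differ by LangChain version

with open("search_config.yaml") as f:
    cfg = yaml.safe_load(f)

search = SearxSearchWrapper(searx_host=cfg["searx_host"])

# One query; the real script iterates over input.txt between start_index and end_index.
results = search.results(
    "example query",
    num_results=cfg["num_results"],
    engines=cfg["engines"],
)
print(json.dumps(results, indent=2, ensure_ascii=False))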

Input file format:

The input file should be a tab-separated file where each line contains an ISO code and a sentence for search:

ISO_CODE_1    Search query 1
ISO_CODE_2    Search query 2
aa	Itiyobbiyah agattinoona sittal xayyossa yangalen qaadoodih baari gablusaanamah angicille le.
aai	Baise orot ta’ita’imon matah toniwa’an bar hinanutitiy gewas hinawowab.
aak	O xewanɨŋo na'nɨ re rɨnɨŋɨnigɨnɨ, ‘A'mɨna' sea'yɨ e imo'nɨŋa' wonɨrɨnɨ.’

The ISO code in the input file can be in either the two-letter or the three-letter format.
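
For reference, a minimal sketch of reading this file and selecting the configured line range (the actual parsing in search_service.py may differ):

# Minimal sketch: read the tab-separated input file and keep the configured slice.
def load_queries(path, start_index, end_index):
    queries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            iso_code, query = line.split("\t", 1)
            queries.append((iso_code, query))
    return queries[start_index:end_index]

for iso_code, query in load_queries("input.txt", 0, 10):
    print(iso_code, query)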

Usage

Run the script from the repository root:

python pipeline/search_service.py

The search results will be saved in the specified output directory (e.g., search_dump) as JSON files named according to the specified prefix and index range, e.g., results_0-10.json.

Customization

You can easily adjust the following parameters in the search_config.yaml file:

  • Search engines: Add or remove engines in the engines list.
  • Search range: Modify start_index and end_index to control which lines in the input file are processed.
  • Output directory: Change output_directory to save results in a different location.

Step 2: Filter and generate seeds

Overview

This script filters the web search results from Step 1 based on domain restrictions, scrapes the corresponding web pages, and performs language identification with a FastText model, for which we chose GlotLID. The processed data is stored in JSON format, categorized by predicted language.

Prerequisites

Dependencies

Ensure you have the following Python packages installed:

pip install fasttext trafilatura urllib3 tqdm pyyaml

A configuration file is already provided in the repository and should be adjusted to your preferences. Example:

model_path: "path/to/fasttext/model"
domain_file: "path/to/domain_filter.txt"
json_filename: "path/to/input.json"
iso_list_file: "path/to/iso_list.json"
output_directory: "path/to/output"
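
As a rough sketch of the core filtering idea (scrape a page, run GlotLID, keep the link only if the prediction is confident enough), assuming the GlotLID FastText model has been downloaded locally; the real language_filter.py differs in structure:

# Minimal sketch of the filtering idea, not the actual language_filter.py.
import fasttext
import trafilatura

model = fasttext.load_model("path/to/fasttext/model")  # GlotLID model file

def classify_url(url, min_confidence=0.7):
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    text = trafilatura.extract(downloaded)
    if not text:
        return None
    # fastText expects single-line input; k=1 returns only the top label.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    label = labels[0].replace("__label__", "")  # e.g. "bpy_Beng"
    confidence = float(probs[0])
    if confidence < min_confidence:
        return None
    return {"link": url, "predicted_lid": label, "lid_confidence": confidence}

print(classify_url("https://example.org/some-page"))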

Running the Script

Execute the script from the repository root:

python pipeline/language_filter.py

Step 3: Search and scrape with seeds

Overview

This step takes the filtered seed URLs from Step 2 and performs deep crawling to discover additional web pages in the target languages. It includes:

  • Web crawling from seed URLs
  • Language detection using FastText (GlotLID)
  • Domain filtering
  • Parallel processing for efficiency
  • Comprehensive logging and metadata collection

Prerequisites

Dependencies

Ensure you have the following Python packages installed:

pip install fasttext beautifulsoup4 requests trafilatura tqdm pyyaml urllib3

Configuration

The script uses config.yaml with these key parameters:

seed_crawler:
  max_pages: 100            # Maximum pages to crawl per language
  max_time: 3600              # Maximum crawling time in seconds
  crawl_delay: 1              # Delay between requests
  to_visit_growth_factor: 50   # Threshold for detecting circular links
  max_workers: 4              # Threads for parallel processing

url_settings:
  request_timeout: 10         # Timeout for web requests
  max_url_length: 65000         # Maximum URL length to consider

language_detector:
  model_path: "path/to/model" # Path to FastText model
  minimum_confidence: 0.7     # Minimum language confidence score
  desired_language: "bpy_Beng"      # Target language code
  save_text: False            # Whether to save scraped text

output:
  directory: "output"         # Output directory
  output_file_name: "{language}_filtered.json"  # Output filename pattern

batch_processing:
  enabled: False              # Enable batch mode
  input_labels: []            # List of language codes for batch
  cooldown_between_languages: 60  # Cool-down between languages
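
A highly simplified, single-threaded sketch of the crawl loop these parameters control (the real seed_crawler.py adds parallel workers, growth-factor checks, language detection, and metadata collection):

# Illustrative crawl loop only; not the actual seed_crawler.py.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100, crawl_delay=1, request_timeout=10):
    visited, to_visit, found = set(), list(seed_urls), []
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=request_timeout)
        except requests.RequestException:
            continue
        found.append(url)
        # Collect outgoing links for further crawling.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                to_visit.append(link)
        time.sleep(crawl_delay)  # politeness delay between requests
    return found

print(crawl(["https://example.org/"], max_pages=5))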

Input Requirements

Input JSON files from Step 2 (named as [LANGUAGECODE_SCRIPT].json)

Each JSON file should contain entries with:

  • link: URL string

  • lid_confidence: Confidence score (float)

  • predicted_lid: Language code
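
For example, a single entry in such a file might look like the following (shown here as an equivalent Python dict; the values are purely illustrative):

# Hypothetical entry from a Step 2 output file such as bpy_Beng.json.
entry = {
    "link": "https://example.org/article",
    "lid_confidence": 0.93,
    "predicted_lid": "bpy_Beng",
}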

Output

For each processed language, the script generates:

[LANGUAGE_CODE]_filtered.json - Filtered URLs with metadata

meta_data/[LANGUAGE_CODE]_meta_data.json - Crawling statistics including:

  • Seed URLs used
  • All discovered links
  • Filtered links
  • Unique new links
  • Rejected links

Usage

Single Language Processing

python pipeline/seed_crawler.py

Configure desired_language in config.yaml first.

Batch Processing:

Enable batch mode in config.yaml:

batch_processing:
  enabled: True
  input_labels: ["syl_Sylo", "bpy_Beng", "akh_Latn"]  # Your target languages

Then run:

python pipeline/seed_crawler_beta.py

Customization Options

  • Crawling Behavior: Adjust max_pages and max_time to control the crawling scope; modify crawl_delay to crawl more or less aggressively
  • Language Detection: Change minimum_confidence for stricter or looser filtering; set save_text: True to store scraped content
  • Performance: Increase max_workers for faster processing (requires more CPU); adjust cooldown_between_languages for batch processing

Output:

  • Change output directory and filename patterns
  • Metadata collection is always enabled

Notes

  • The script automatically skips domains listed in your domain filter file
  • Progress bars are enabled by default (can be disabled in config)
  • Comprehensive logging helps troubleshoot issues

Step 4: Filtering and Deduplication

Step 4.1: Domain Filtering

Purpose

This script performs final domain filtering on crawled results to exclude unwanted domains from both the main output and metadata files.

Key Features

  • Loads crawled data and metadata JSON files
  • Applies domain filtering using the configured domain blocklist
  • Updates all metadata statistics after filtering
  • Handles both single-language and batch processing modes

Why Use It

  • Ensures final outputs comply with domain restrictions
  • Maintains consistency between data files and their metadata
  • Prepares clean data for subsequent deduplication steps

Usage

Configure domain_file path in config.yaml and run:

python result_filtering/final_domain_filter.py

Configuration

Uses these key config parameters:

domain_file: "path/to/domain_filter.txt"  # List of domains to exclude
output:
  directory: "output"                    # Where to find/save files
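
A minimal sketch of the domain check itself (the actual final_domain_filter.py also rewrites the metadata files):

# Minimal sketch: drop entries whose domain appears in the blocklist.
from urllib.parse import urlparse

def load_blocklist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_entries(entries, blocked_domains):
    kept = []
    for entry in entries:
        domain = urlparse(entry["link"]).netloc.lower()
        if domain not in blocked_domains:
            kept.append(entry)
    return kept

blocked = load_blocklist("path/to/domain_filter.txt")
print(filter_entries([{"link": "https://example.org/x"}], blocked))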

Output

Updates both:

[LANGUAGE]_filtered.json - With domain-filtered results

meta_data/[LANGUAGE]_meta_data.json - With filtered statistics

Step 4.2: Formatting Output for GlotWeb

Purpose

Transforms crawled language data into a structured format suitable for GlotWeb visualization, enriching it with metadata and linguistic information.

Key Features

  • Extracts language metadata (speaker counts, language family)
  • Checks inclusion in major multilingual datasets (MADLAD-400, Flores, Glot500)
  • Organizes URLs by domain with site categorization
  • Handles both single-language and batch processing

Why Use It

  • Creates standardized format for GlotWeb frontend
  • Enriches raw data with valuable linguistic metadata
  • Provides domain-level organization of web resources
  • Generates compatibility flags for popular multilingual datasets

Configuration

output:
  formated_directory: "formatted_output"  # Output directory
  formated_file_name: "{language}_formatted.json"  # Output filename pattern

Usage

python result_filtering/format_for_glotweb.py
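
As an illustration of the domain-level organization, a minimal grouping sketch (the key names here are hypothetical and not the exact GlotWeb schema):

# Minimal sketch: group filtered URLs by their domain. Key names are illustrative.
from collections import defaultdict
from urllib.parse import urlparse

def group_by_domain(entries):
    sites = defaultdict(list)
    for entry in entries:
        domain = urlparse(entry["link"]).netloc
        sites[domain].append(entry["link"])
    return {domain: {"links": links} for domain, links in sites.items()}

print(group_by_domain([
    {"link": "https://example.org/a"},
    {"link": "https://example.org/b"},
    {"link": "https://other.example/c"},
]))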

Step 4.3: Robots.txt Compliance Filtering

Purpose

Filters out domains that explicitly block Common Crawl's CCBot in their robots.txt file, ensuring compliance with website crawling policies.

Key Features

  • Checks each domain's robots.txt for CCBot restrictions
  • Removes entire domains if they block CCBot
  • Preserves all other metadata while filtering
  • Handles both single-language and batch processing

Why Use It

  • Ensures ethical web scraping compliance
  • Prevents potential legal issues
  • Maintains good web citizenship by respecting robots.txt
  • Filters before final dataset compilation

Configuration

output:
  formated_directory: "formatted_output"  # Input directory (from Step 4.2)
  cleaned_directory: "cleaned_output"    # Output directory for filtered data

Usage

python result_filtering/robots_compliance_filter.py

Process Flow

  • Loads formatted JSON from Step 4.2
  • For each domain:
    • Fetches robots.txt
    • Checks for CCBot restrictions
  • Saves a cleaned version with compliant domains only

Output

  • Maintains same structure as input
  • Only contains domains that allow CCBot
  • Saved as [LANGUAGE].json in cleaned directory

Notes

  • If robots.txt is inaccessible, assumes crawling is allowed
  • Only checks for explicit CCBot blocks (not general User-agent: *)
  • Processes domains sequentially with a 5-second timeout
  • Preserves all non-URL metadata (speaker counts, language family, etc.)
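
Following the notes above, a minimal sketch of such a check might look like this (the real robots_compliance_filter.py may differ in detail):

# Minimal sketch: treat a domain as blocked only if its robots.txt has an
# explicit "User-agent: CCBot" group containing "Disallow: /".
import requests

def blocks_ccbot(domain, timeout=5):
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # robots.txt inaccessible: assume crawling is allowed
    in_ccbot_group = False
    for line in resp.text.splitlines():
        line = line.split("#", 1)[0].strip().lower()
        if line.startswith("user-agent:"):
            in_ccbot_group = "ccbot" in line
        elif in_ccbot_group and line.replace(" ", "") == "disallow:/":
            return True
    return False

print(blocks_ccbot("example.org"))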

Step 4.4: Final Cleaning and Deduplication

Purpose

Performs final data cleaning through URL normalization and deduplication to create a polished dataset.

Process Overview

  1. HTTP/HTTPS Merging (http_merge_2.py):

    • Combines duplicate sites with different protocols (http/https)
    • Standardizes www/non-www variants
    • Preserves all unique links
  2. Hash Fragment Removal (remove_all_hash.py):

    • Removes URL fragments (#section)
    • Deduplicates URLs that only differ by fragments

Configuration

output:
  robots_filtered: "output/robots_filtered"  # Input from Step 4.3
  http_merged: "output/http_merged"         # Intermediate output
  deduplication: "output/deduplication"     # Final output

Usage

# Run protocol merging first
python result_filtering/http_merge_2.py

# Then run hash removal
python result_filtering/remove_all_hash.py

Key Features

  • Protocol-agnostic site merging
  • Consistent URL normalization
  • Fragment removal while preserving query parameters
  • Order-preserving deduplication
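
A minimal sketch of the normalization behind these two scripts (protocol and www merging plus fragment removal; the actual scripts additionally merge site-level entries in the JSON structure):

# Minimal sketch: collapse http/https and www/non-www variants, drop "#fragments"
# while keeping query parameters, then deduplicate while preserving order.
from urllib.parse import urldefrag, urlsplit, urlunsplit

def normalize(url):
    url, _fragment = urldefrag(url)  # strip "#section"
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]          # merge www/non-www variants
    return urlunsplit(("https", netloc, parts.path, parts.query, ""))

def deduplicate(urls):
    seen, kept = set(), []
    for url in urls:
        norm = normalize(url)
        if norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept

print(deduplicate([
    "http://www.example.org/page?id=1#top",
    "https://example.org/page?id=1",
]))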

Output

Cleaned JSON files with:

  • Unified site entries
  • Normalized URLs
  • No duplicate content

✅ Processed datasets in output/cleaned_output/[LANG].json containing:

  • Verified web links
  • Language metadata (speakers, family)
  • Domain categorization
  • Compatibility flags (FLORES/MADLAD/Glot500)

✅ Metadata reports in output/meta_data/ with:

  • Crawling statistics
  • Domain distributions
  • Filtering metrics

Step 5: Dataset Validation & Community Contribution

Community Auditing Request

We urgently need native speakers and linguists to validate results:

How to Audit

  1. Explore your language in the GlotWeb Demo
  2. Check 10-20 random links for:
    • Actual language content (not machine translation)
    • Cultural/educational value
    • Correct language/dialect labeling
    • Religious content (flag links whose text comes from religious scriptures)
  3. Report issues via:

Why This Matters

Impact Area            Community Role
Data Quality           Remove spam/misclassified content
Language Preservation  Identify valuable resources
NLP Development        Improve training data for LLMs

Get Involved

  • Speakers: Join us in Language auditing.
  • Researchers: Use data with citation (BibTeX forthcoming)

Native speakers of underrepresented languages are especially needed!