    Repositories list

    • Tools for managing datasets for governance and training.
      HTML
      47871383 Updated Dec 8, 2025
    • Evaluation for Shades of Bias in Text
      HTML
      1810 Updated Apr 23, 2025
    • biomedical

      Public
      Tools for curating biomedical training data for large-scale language modeling
      Python
      11848716315 Updated Dec 9, 2024
    • xmtf

      Public
      Crosslingual Generalization through Multitask Finetuning
      Jupyter Notebook
      43537110 Updated Sep 22, 2024
    • petals

      Public
      🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
      Python
      5869.9k9219 Updated Sep 7, 2024
    • bigscience

      Public
      Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
      Shell
      1031k137 Updated Jul 29, 2024
    • Megatron-DeepSpeed

      Public
      Ongoing research training transformer language models at scale, including: BERT & GPT-2
      Python
      2281.4k7647 Updated Mar 20, 2024
    • multilingual-modeling

      Public
      BLOOM+1: Adapting BLOOM model to support a new unseen language
      Python
      1774136 Updated Mar 2, 2024
    • promptsource

      Public
      Toolkit for creating, sharing and using natural language prompts.
      Python
      3773k1132 Updated Oct 23, 2023
    • Framework for BLOOM probing
      Python
      9900 Updated Oct 17, 2023
    • Python
      3359845 Updated Jul 25, 2023
    • metadata

      Public
      Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
      Python
      11312513 Updated Jun 12, 2023
    • A framework for few-shot evaluation of autoregressive language models.
      Python
      2.9k10578 Updated May 9, 2023
    • Code used for sourcing and cleaning the BigScience ROOTS corpus
      Jupyter Notebook
      42317100 Updated Mar 20, 2023
    • A list of BigScience publications
      TeX
      1310 Updated Mar 13, 2023
    • Python
      17200 Updated Dec 5, 2022
    • t-zero

      Public
      Reproduce results and replicate training for T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)
      Python
      5346382 Updated Nov 5, 2022
    • A repository for `codecarbon` logs.
      Jupyter Notebook
      51320 Updated Nov 3, 2022
    • PII Processing code to detect and remediate PII in BigScience datasets. Reference implementation for the PII Hackathon
      Python
      6971 Updated Oct 6, 2022
    • lam

      Public
      Libraries, Archives and Museums (LAM)
      788340 Updated Oct 4, 2022
    • 42500 Updated Jul 11, 2022
    • A repo for running model shrinking experiments
      Python
      41000 Updated Jun 21, 2022
    • BigScience working group on language models for historical texts
      Jupyter Notebook
      7802 Updated May 10, 2022
    • Code and Data for Evaluation WG
      Python
      2442419 Updated May 4, 2022
    • Scripts to prepare catalogue data
      Jupyter Notebook
      1853 Updated Apr 25, 2022
    • 45112 Updated Apr 22, 2022
    • Tools for evaluating model robustness and consistency
      Python
      2202 Updated Mar 9, 2022
    • 11100 Updated Feb 27, 2022
    • Python
      21112 Updated Feb 16, 2022
    • Generate statistics over datasets used in the context of BigScience
      Makefile
      1200 Updated Feb 1, 2022