    Repositories list

    • Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
      115700 Updated Jan 28, 2024
    • Big Data benchmark from Intel called HiBench
      Shell
      0000 Updated Mar 5, 2018
    • A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types for mass-testing of frameworks like Apache POI and Apache Tika
      Java
      17000 Updated Feb 9, 2018
    • A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
      Python
      48100 Updated Feb 9, 2018
    • HiBench (Public)
      HiBench is a big data benchmark suite.
      Java
      772000 Updated Jan 31, 2018
    • A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
      Jupyter Notebook
      18100 Updated Apr 24, 2017
    • DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
      Java
      7000 Updated Dec 20, 2016
    • Index URLs in Common Crawl
      Python
      47000 Updated Sep 6, 2016
    • A library of examples showing how to use the Common Crawl corpus.
      Java
      43000 Updated Aug 5, 2016