This is designed to scrape the data from the GitHub dependency graph (dependents) page into a JSON file.
- Install
  - Clone the repo
  - Run `yarn` - Installs dependencies
  - Run `npx tsc` - Compiles `index.ts`
- Run the scraper: `node index.js repoOwner/repo dependents.json`

The command line arguments for the scraper are as follows:
- (`githubOwnerAndRepo`) `repoOwner/repo` - This is what's displayed in the GitHub URL when on the repo page, e.g. for this repo it would be `spacesailor24/github-dependents-scraper`
- (`dependentsFile`) `anything.json` - This file can be named anything, but it needs to be a valid JSON file ending with the `.json` file extension
- (`resumeCrawl`) `true` or `false` - Eventually this crawler will get rate limited by GitHub; this flag allows you to resume the crawl from where it left off before receiving the rate-limit page
  - So if the crawler dies because of rate limiting, you'd start it up again with: `node index.js repoOwner/repo dependents.json true`
  - NOTE: Starting it with `node index.js repoOwner/repo dependents.json false` will override the `dependentsFile` and start scraping from the first dependents page
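The argument handling described above could be wired up roughly like this. This is a minimal sketch, not the actual contents of `index.ts`; the `parseArgs` function and `CliArgs` interface are hypothetical names chosen for illustration:

```typescript
interface CliArgs {
  githubOwnerAndRepo: string;
  dependentsFile: string;
  resumeCrawl: boolean;
}

// Hypothetical sketch of parsing the three positional arguments.
// Pass process.argv.slice(2) to it from the entry point.
function parseArgs(argv: string[]): CliArgs {
  const [githubOwnerAndRepo, dependentsFile, resumeCrawl] = argv;

  if (!githubOwnerAndRepo || !githubOwnerAndRepo.includes("/")) {
    throw new Error("expected repoOwner/repo as the first argument");
  }
  if (!dependentsFile || !dependentsFile.endsWith(".json")) {
    throw new Error("dependentsFile must end with the .json file extension");
  }

  // resumeCrawl defaults to false: start from the first dependents page.
  return { githubOwnerAndRepo, dependentsFile, resumeCrawl: resumeCrawl === "true" };
}
```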
Maybe I'll extend the crawler to be able to sort the data, but for now there's a nifty online sorter that'll do the trick!
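Until sorting is built in, a one-off Node script can sort the output file locally. This sketch assumes `dependents.json` holds a flat array of `owner/repo` strings; the actual shape of the scraped JSON may differ, and `sortDependents` is a hypothetical helper name:

```typescript
import { readFileSync, writeFileSync } from "fs";

// Hypothetical helper: sort the scraped dependents alphabetically in place.
// Assumes the JSON file contains an array of "owner/repo" strings.
function sortDependents(path: string): void {
  const dependents: string[] = JSON.parse(readFileSync(path, "utf8"));
  dependents.sort((a, b) => a.localeCompare(b));
  writeFileSync(path, JSON.stringify(dependents, null, 2));
}
```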