Potential errors when scraping new organization. Skipped repos. #477

Open
@jordanperr

I am running MASTER.sh to download all data from the NREL GitHub organization (which has 350 repos), but it is taking a very long time and I'm not sure whether this is normal. For most repositories in the org, the query returns in under a second. However, the script appears to be scraping over 4,000 repositories (possibly dependencies?).

For some repositories, it seems to take much longer and the script prints out warning-like messages such as:

Sending REST query...
Checking response...
HTTP/1.1 202 Accepted
API Status {"limit": 5000, "remaining": 4414, "reset": 1607114323}
Query accepted but not yet processed. Trying again in 3sec...
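
For context, the log above looks like a poll-until-ready loop: the REST query gets "202 Accepted" and is retried after 3 seconds. A minimal sketch of that pattern follows, assuming a hypothetical `send_query` callable that returns a `(status_code, body)` pair. It is not taken from the scraper's code, and the names and the retry cap are illustrative only.

```python
import time


def get_with_retry(send_query, max_tries=10, delay_sec=3.0, sleep=time.sleep):
    """Call send_query() until it returns a status other than 202.

    send_query is a hypothetical callable returning (status_code, body).
    The 3-second delay mirrors the log message; max_tries is an assumption.
    """
    for attempt in range(1, max_tries + 1):
        status, body = send_query()
        if status != 202:
            # The server has finished processing; hand the result back.
            return status, body
        if attempt < max_tries:
            # Still "Accepted but not yet processed": wait, then ask again.
            sleep(delay_sec)
    raise TimeoutError(f"still 202 Accepted after {max_tries} tries")
```

With a cap like `max_tries`, a repository that never leaves the 202 state would surface as an explicit error rather than an endless wait, which might make skipped repositories easier to spot.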

Also, for a very small minority of repos, I get the following error-like message:

GraphQL API error.
[{"path": ["repository", "dependencyGraphManifests"], "locations": [{"line": 1, "column": 244}], "message": "loading"}]

These two errors do not seem to occur simultaneously.
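
One way to tell this "loading" message apart from other GraphQL errors is to inspect the `errors` list in the response. The sketch below assumes the message is transient (the dependency graph not being ready yet), which I have not confirmed. `is_loading_error` is a hypothetical helper, not part of the scraper.

```python
import json


def is_loading_error(errors):
    """Return True if every GraphQL error is a "loading" message
    on the dependencyGraphManifests field.

    Treating this as retryable is an assumption, not confirmed behavior.
    """
    return bool(errors) and all(
        err.get("message") == "loading"
        and "dependencyGraphManifests" in err.get("path", [])
        for err in errors
    )


# The error payload exactly as printed by the script.
sample = json.loads(
    '[{"path": ["repository", "dependencyGraphManifests"], '
    '"locations": [{"line": 1, "column": 244}], "message": "loading"}]'
)
```

A caller could retry the query when `is_loading_error(sample)` is true and treat any other error list as a real failure.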

The script is still humming along, and I will let it finish, but I am wondering whether these messages can simply be ignored.

Update: The script has finished and I am able to view the data using the Jekyll dev server. However, it appears that at least 3 repositories (out of 350) were skipped.
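
To pin down exactly which repositories were skipped, the list of repos in the org can be compared against the names that ended up in the collected data. This is a rough sketch: `find_skipped` and both input lists are hypothetical, and how the names are read from the org and from the output files is left out because it depends on the data layout.

```python
def find_skipped(expected, collected):
    """Return the expected repo names that are missing from collected.

    Names are compared case-insensitively, since GitHub owner and
    repository names are not case-sensitive.
    """
    collected_lower = {name.lower() for name in collected}
    return sorted(name for name in expected if name.lower() not in collected_lower)


# Placeholder names for illustration only.
missing = find_skipped(
    ["NREL/repo-a", "NREL/repo-b", "NREL/repo-c"],
    ["nrel/repo-a", "NREL/repo-c"],
)
```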

Steps to reproduce:

  1. Remove all data from explore/github_data.
  2. Remove all repos and orgs from _explore/input_lists.json, and add "NREL" as an org.
  3. Create a Python environment and install the dependencies from requirements.txt.
  4. Set the GITHUB_API_TOKEN environment variable.
  5. Run ./MASTER.sh
