
Commit 7768b62

zenodo download
1 parent 7139381 · commit 7768b62

2 files changed: +15, -1 lines changed


README.md

Lines changed: 5 additions & 1 deletion
@@ -19,12 +19,16 @@ mamba env create -f environment.yml
## Example Usage
The scripts should be run in the following order. Steps 1 and 2 only need to be run once, then steps 3 and 4 can be run as many times as desired using different query terms.

+ Alternatively, we have made a file available on [Zenodo](https://zenodo.org/record/7992427) with the results of steps 1-2. This file can be downloaded and used as input to steps 3-4, avoiding the need to run steps 1-2 yourself. We've created a script to download and unzip this file: `download_from_zenodo.sh`. The download (~75GB) is slow even with a fast connection and may take 1-2+ hours due to Zenodo limitations, but it may still be faster than downloading everything from USPTO. We also recommend this option if you don't have enough hard disk space to store the full dataset in step 1 (see [Download File Sizes](#download-file-sizes)).
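A rough usage sketch for this script (the paths below are illustrative assumptions; run it from whatever directory you want to become `data_dir`):

```bash
# Sketch only: prepare a data directory and run the Zenodo download script from inside it
mkdir -p /data/patents_data                    # example path used elsewhere in this README
cd /data/patents_data
bash /path/to/repo/download_from_zenodo.sh     # adjust to where you cloned the repository
```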

### 1. Download
Download all USPTO patents for the given years:

`python download.py --years 2021 --data_dir /data/patents_data/`

+ The download speed seems to be restricted to ~5-10MB/s by USPTO, which means downloading the full set could require > 4 days if done in series. Alternatively, you can run multiple downloads in parallel. Either way, we recommend starting the downloads in a [tmux](https://github.com/tmux/tmux/wiki) session to run in the background. For example, you might consider starting several tmux sessions to do the download in parallel using a script similar to `download_tmux.sh`. Please see [Download File Sizes](#download-file-sizes) for the approximate size of the files to be downloaded and use caution to avoid filling your whole hard drive.
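A minimal sketch of such a helper (the repository's actual `download_tmux.sh` may differ; the session names and data directory are assumptions):

```bash
#!/bin/bash
# Sketch only: launch one detached tmux session per year so the downloads run in parallel.
# Assumes download.py is in the current directory and /data/patents_data/ already exists.
for y in {2001..2023}; do
    tmux new-session -d -s "uspto_$y" \
        "python download.py --years $y --data_dir /data/patents_data/"
done
# Reattach to any session later with, e.g., `tmux attach -t uspto_2021`.
```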
Additional options (an example invocation follows this list):
* `--force_redownload`: download files even if they already exist
* `--no_uncompress`: do not uncompress the `*.zip` and `*.tar` files after selecting the chemistry-related ones (to save space if downloading all years at once)
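For illustration, a hypothetical invocation appending one of these flags to the command shown above:

```bash
# Hypothetical example: download 2021 but leave the selected archives compressed to save space
python download.py --years 2021 --data_dir /data/patents_data/ --no_uncompress
```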
@@ -63,7 +67,7 @@ Additional options:

## Download File Sizes

- The following file sizes are taken from the [USPTO Bulk Data Storage System](https://bulkdata.uspto.gov) using the URLs `https://bulkdata.uspto.gov/data/patent/grant/redbook/<YEAR>/` and converting from bytes to GB. The 2023 file size is as of 23 May 2023. The download speed seems to be restricted to ~5-10MB/s, which means downloading the full set could require > 4 days if done in series. Alternatively, you can run multiple downloads in parallel. Either way, we recommend starting the downloads in a [tmux](https://github.com/tmux/tmux/wiki) session to run in the background. Note that these file sizes are for the *compressed* (zip or tar) files, so the total space required to store this data is larger than what is reported below. Use caution to avoid filling your hard drive to capacity.
+ The following file sizes are taken from the [USPTO Bulk Data Storage System](https://bulkdata.uspto.gov) using the URLs `https://bulkdata.uspto.gov/data/patent/grant/redbook/<YEAR>/` and converting from bytes to GB. The 2023 file size is as of 23 May 2023. Note that these file sizes are for the *compressed* (zip or tar) files, so the total space required to store this data is larger than what is reported below. Use caution to avoid filling your hard drive to capacity.

| **Year** | **File Size** | **Units** |
|-------|-----------|-------|

download_from_zenodo.sh

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+ #!/bin/bash
+
+ # Run this script from the directory you want to become the `data_dir` in future steps
+ # Download size is ~75GB - make sure you do this on a machine with enough space!
+
+ wget https://zenodo.org/record/7992427/files/20230528_chemical_patents_uspto.tar.gz
+ tar -xzf 20230528_chemical_patents_uspto.tar.gz
+ rm 20230528_chemical_patents_uspto.tar.gz
+ cd 20230528_chemical_patents_uspto
+ for y in {2001..2023}; do tar -xzf "$y.tar.gz" "$y"; done
