| 1 | +# PatentChem |
| 2 | + |
| 3 | +This code downloads the weekly bulk files from the [USPTO](https://developer.uspto.gov/product/patent-grant-dataxml), selects the patents relevant to chemistry, and searches the claims sections of those patents for given keywords to find the molecule SMILES strings related to those keywords. |
| 4 | + |
| 5 | +## Installation |
| 6 | +``` |
| 7 | +conda install -c conda-forge mamba |
| 8 | +mamba env create -f environment.yml |
| 9 | +``` |
| 10 | + |
| 11 | +## Example Usage |
| 12 | +Run the scripts in the following order. Steps 1 and 2 only need to be run once; steps 3 and 4 can then be run as many times as desired with different query terms. |
| 13 | + |
| 14 | + |
| 15 | +### 1. Download |
| 16 | +Download all USPTO patents for the given years: |
| 17 | + |
| 18 | +`python download.py --years 2021 --data_dir /data/patents_data/` |
| 19 | + |
| 20 | +Additional options: |
| 21 | +* `--remove_compressed`: remove the original .tar and .zip files for the weekly releases after decompressing them |
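
To make the effect of `--remove_compressed` concrete, here is a minimal sketch of the decompress-then-delete behavior described above. It is not the code in `download.py`; the function name `extract_archive` and its arguments are hypothetical, and it assumes each weekly release is a plain `.zip` or `.tar` file.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive: Path, dest: Path, remove_compressed: bool = False) -> None:
    """Unpack a .zip or .tar archive into dest (hypothetical helper, not from download.py)."""
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"Unsupported archive type: {archive}")
    # Delete the original only after extraction has finished without raising.
    if remove_compressed:
        archive.unlink()
```

Deleting the archive only after a successful extraction means an interrupted run leaves the compressed file in place, so the step can be repeated.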
| 22 | + |
| 23 | +### 2. Select |
| 24 | +Select the patents related to chemistry based on the presence of CDX files: |
| 25 | + |
| 26 | +`python select_chem.py --years 2021 --data_dir /data/patents_data/` |
| 27 | + |
| 28 | +Additional options: |
| 29 | +* `--remove_compressed`: remove all original .zip files after unzipping those relevant to chemistry |
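
The selection criterion (a patent counts as chemistry-related if it ships with CDX files) can be sketched as follows. This is an illustration under assumptions, not the logic in `select_chem.py`: it assumes each patent is a zip archive and that CDX files can be recognized by their `.cdx` extension. The helper names and the member names in the usage example are hypothetical.

```python
import io
import zipfile


def contains_cdx(zip_source) -> bool:
    """Return True if any member of the zip archive has a .cdx extension (case-insensitive)."""
    with zipfile.ZipFile(zip_source) as zf:
        return any(name.lower().endswith(".cdx") for name in zf.namelist())


def make_example_zip(member_names):
    """Build a small in-memory zip with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in member_names:
            zf.writestr(name, b"")
    buffer.seek(0)
    return buffer
```

For example, `contains_cdx(make_example_zip(["P1/P1-C00001.CDX", "P1/P1.XML"]))` is true, while an archive holding only `P2/P2.XML` is skipped.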
| 30 | + |
| 31 | +### 3. Search |
| 32 | +Provide a list of query terms to search for in the claims sections of the chemistry-related patents: |
| 33 | + |
| 34 | +`python search.py --years 2021 --data_dir /data/patents_data/ --naming opd --subject_list "organic electronic" "photodiode" "organic photovoltaic"` |
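
As a rough picture of what querying a claims section involves, the sketch below checks which query terms occur in a block of claims text. It assumes case-insensitive substring matching, which may differ from what `search.py` actually does; the function name `matching_terms` and the sample claim are made up for illustration.

```python
from typing import List, Sequence


def matching_terms(claims_text: str, subject_list: Sequence[str]) -> List[str]:
    """Return the query terms that appear in the claims text, ignoring case."""
    text = claims_text.casefold()
    return [term for term in subject_list if term.casefold() in text]


# Hypothetical claim text; the query terms are the ones from the command above.
example_claims = "1. An Organic Photovoltaic device comprising a donor polymer."
example_terms = ["organic electronic", "photodiode", "organic photovoltaic"]
example_hits = matching_terms(example_claims, example_terms)
```

Here only `"organic photovoltaic"` matches, so this patent would be kept for that term alone.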
| 35 | + |
| 36 | +### 4. Clean |
| 37 | +Output a YAML file with the SMILES strings found for the query terms and all of the patents related to each SMILES string. Also output a text file with all unique SMILES strings from the query. |
| 38 | + |
| 39 | +`python clean.py --years 2021 --naming opd` |
| 40 | + |
| 41 | +Additional options: |
| 42 | +* `--output_dir`: parent directory of `--naming` if different from `.` |
| 43 | +* `--charged_only`: include only charged molecules in output |
| 44 | +* `--neutral_only`: include only neutral molecules in output |
| 45 | +* `--mw_min`: include only molecules with molecular weight greater than this |
| 46 | +* `--mw_max`: include only molecules with molecular weight less than this |
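
The output structure and the filter options above can be sketched as follows. This is a simplified illustration, not the code in `clean.py`: the function names are hypothetical, the formal charge and molecular weight are assumed to be computed elsewhere (for example with a cheminformatics toolkit), and the bounds are treated as strict inequalities, following the "greater than" and "less than" wording of the options.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_smiles(hits: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each SMILES string to the sorted, de-duplicated patents that mention it."""
    grouped = defaultdict(set)
    for patent_id, smiles in hits:
        grouped[smiles].add(patent_id)
    return {smiles: sorted(ids) for smiles, ids in grouped.items()}


def keep_molecule(
    charge: int,
    mol_weight: float,
    charged_only: bool = False,
    neutral_only: bool = False,
    mw_min: Optional[float] = None,
    mw_max: Optional[float] = None,
) -> bool:
    """Apply the charge and molecular-weight filters to one molecule."""
    if charged_only and charge == 0:
        return False
    if neutral_only and charge != 0:
        return False
    if mw_min is not None and mol_weight <= mw_min:
        return False
    if mw_max is not None and mol_weight >= mw_max:
        return False
    return True
```

With hypothetical patent IDs, `group_by_smiles([("US-A", "CCO"), ("US-B", "CCO")])` yields one entry for `CCO` listing both patents, and the keys of the result are the unique SMILES strings that would go into the text file.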
| 47 | + |
| 48 | +## Download File Sizes |
| 49 | + |
| 50 | +The following file sizes are taken from the [USPTO Bulk Data Storage System](https://bulkdata.uspto.gov) using the URLs `https://bulkdata.uspto.gov/data/patent/grant/redbook/<YEAR>/` and converted from bytes to GB. The 2023 file size is as of 10 March 2023. The download speed appears to be restricted to ~5-10 MB/s, so downloading the full set could take more than 4 days if done in series. Alternatively, you can run multiple downloads in parallel. Either way, we recommend starting the downloads in a [tmux](https://github.com/tmux/tmux/wiki) session so they run in the background. Note that these file sizes are for the *compressed* (zip or tar) files, so the total space required to store the decompressed data is larger than what is reported below. Use caution to avoid filling your hard drive to capacity. |
| 51 | + |
| 52 | +| **Year** | **File Size** | **Units** | |
| 53 | +|-------|-----------|-------| |
| 54 | +| 2001 | 37.07 | GB | |
| 55 | +| 2002 | 37.23 | GB | |
| 56 | +| 2003 | 39.20 | GB | |
| 57 | +| 2004 | 38.95 | GB | |
| 58 | +| 2005 | 30.67 | GB | |
| 59 | +| 2006 | 39.90 | GB | |
| 60 | +| 2007 | 39.98 | GB | |
| 61 | +| 2008 | 42.08 | GB | |
| 62 | +| 2009 | 43.82 | GB | |
| 63 | +| 2010 | 61.65 | GB | |
| 64 | +| 2011 | 68.24 | GB | |
| 65 | +| 2012 | 85.42 | GB | |
| 66 | +| 2013 | 96.55 | GB | |
| 67 | +| 2014 | 111.26 | GB | |
| 68 | +| 2015 | 119.98 | GB | |
| 69 | +| 2016 | 117.42 | GB | |
| 70 | +| 2017 | 124.10 | GB | |
| 71 | +| 2018 | 148.02 | GB | |
| 72 | +| 2019 | 182.28 | GB | |
| 73 | +| 2020 | 192.45 | GB | |
| 74 | +| 2021 | 184.41 | GB | |
| 75 | +| 2022 | 185 | GB | |
| 76 | +| 2023 | 39 | GB | |
| 77 | +| **Total** | **2.05** | **TB** | |
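
The serial download estimate can be reproduced from the table with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB) and a constant transfer rate, both of which are simplifications; the function name is hypothetical.

```python
# Compressed sizes in GB, copied from the table above.
SIZES_GB = {
    2001: 37.07, 2002: 37.23, 2003: 39.20, 2004: 38.95, 2005: 30.67,
    2006: 39.90, 2007: 39.98, 2008: 42.08, 2009: 43.82, 2010: 61.65,
    2011: 68.24, 2012: 85.42, 2013: 96.55, 2014: 111.26, 2015: 119.98,
    2016: 117.42, 2017: 124.10, 2018: 148.02, 2019: 182.28, 2020: 192.45,
    2021: 184.41, 2022: 185, 2023: 39,
}

SECONDS_PER_DAY = 86_400


def serial_download_days(size_gb: float, speed_mb_per_s: float) -> float:
    """Days needed to download size_gb at a constant speed, assuming 1 GB = 1000 MB."""
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / SECONDS_PER_DAY


total_gb = sum(SIZES_GB.values())
days_at_5 = serial_download_days(total_gb, 5)
days_at_10 = serial_download_days(total_gb, 10)
```

Under these assumptions the full set takes roughly 4.8 days at 5 MB/s and about 2.4 days at 10 MB/s, which is in line with the estimate above for the slow end of the range.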
| 78 | + |
| 79 | +## Notes |
| 80 | +* Patent archives from 2001-2004 and 2009 follow a different format from other years; this code may not process patents from these years properly. |
| 81 | +* For simplicity, `clean.py` replaces "*" and "Y" substituents from Markush structures with ethyl and O groups. This might not be appropriate for your applications. |
| 82 | + |
| 83 | +## Citation |
| 84 | +If you use this code, please cite the following [manuscript](): <!-- # TODO: fill in link and rest of bibtex citation --> |
| 85 | +``` |
| 86 | +@article{patents-generative2023, |
| 87 | + title={Automated patent extraction powers generative modeling in focused chemical spaces}, |
| 88 | + author={Subramanian, Akshay and Greenman, Kevin P. and Gervaix, Alexis and Yang, Tzuhsiung and G{\'{o}}mez-Bombarelli, Rafael}, |
| 89 | + journal={TBD}, |
| 90 | + doi={TBD}, |
| 91 | + year={2023} |
| 92 | +} |
| 93 | +``` |