1- # How to Use pyrxiv CLI
1+ # How to Use ` pyrxiv ` CLI
22
33This guide explains how to use the ** pyrxiv** command-line interface (CLI) to search and download arXiv papers.
44
@@ -18,15 +18,15 @@ This guide explains how to use the **pyrxiv** command-line interface (CLI) to se
1818
1919## Installation
2020
21- Install pyrxiv using pip:
21+ Install ` pyrxiv ` using pip:
2222
2323``` bash
2424pip install pyrxiv
2525```
2626
2727## Available Commands
2828
29- pyrxiv provides two main commands:
29+ ` pyrxiv ` provides two main commands:
3030
31311 . ** ` search_and_download ` ** - Search for papers in a specific arXiv category and download them
32322 . ** ` download_pdfs ` ** - Download PDFs from existing HDF5 metadata files
@@ -39,15 +39,14 @@ pyrxiv --help
3939
4040## Pipeline Overview
4141
42- The typical ** pyrxiv** workflow follows these steps:
42+ The typical ` pyrxiv ` workflow follows these steps:
4343
44441 . ** Search and Filter** : Use ` search_and_download ` to:
45+
4546 - Fetch papers from a specific arXiv category
4647 - Optionally filter papers using a regex pattern
4748 - Download PDFs and/or save metadata to HDF5 files
48-
49492 . ** Process or Analyze** : Work with the downloaded papers and metadata
50-
51503 . ** Re-download if Needed** : Use ` download_pdfs ` to re-download PDFs from HDF5 metadata files if you previously deleted them
5251
5352## Command 1: search_and_download
@@ -72,14 +71,32 @@ pyrxiv search_and_download --category physics.optics --n-papers 10
7271
7372#### Filtering with Regex Patterns
7473
75- Download papers that match a specific regex pattern. When using ` --regex-pattern ` , pyrxiv will continue fetching papers until it finds the specified number that match the pattern:
74+ Download papers whose text matches a given regex pattern. When using `--regex-pattern`, `pyrxiv` keeps fetching papers until it finds the specified number that match the pattern:
7675
7776``` bash
7877pyrxiv search_and_download --category cond-mat.str-el --regex-pattern " DMFT|Hubbard" --n-papers 5
7978```
8079
8180** Important** : Papers that don't match the regex pattern are automatically discarded and not downloaded.
8281
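Before starting a long run, it can help to check a pattern against a few lines of sample text. The following is a minimal sketch using `grep -E`, not part of `pyrxiv` itself: the sample text is made up for illustration, and the regex engine `pyrxiv` uses may differ from `grep` in details, although simple alternations such as `DMFT|Hubbard` should behave the same way.

```bash
# Hypothetical sample lines standing in for text extracted from papers.
sample_text='We study the Hubbard model on a square lattice.
Optical response of photonic crystals.
A DMFT treatment of correlated electrons.'

# Same alternation pattern as in the command above.
pattern='DMFT|Hubbard'

# Count the lines that contain at least one match.
match_count=$(printf '%s\n' "$sample_text" | grep -E -c "$pattern")
echo "matching lines: $match_count"
```

If the count is zero for text you expect to match, the pattern is probably too narrow or has a syntax problem.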
82+ #### Resuming from a Specific Paper
83+
84+ Resume downloading from a specific arXiv ID:
85+
86+ ``` bash
87+ pyrxiv search_and_download --category cond-mat.str-el --start-id " 2201.12345v1" --n-papers 10
88+ ```
89+
90+ **Important**: You must provide the full arXiv ID, including the version suffix (e.g., `v1`).
91+
92+ Resume from the last downloaded paper in your download directory:
93+
94+ ``` bash
95+ pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
96+ ```
97+
98+ These options are useful if a search-and-download run stopped abruptly before it finished processing the requested number of papers.
99+
83100#### Saving Metadata to HDF5
84101
85102Save both PDFs and metadata to HDF5 files:
@@ -106,19 +123,7 @@ pyrxiv search_and_download --category cond-mat.str-el --n-papers 5 --save-hdf5 -
106123
107124** Note** : You must use ` --save-hdf5 ` even if you're deleting HDF5 files, as the files need to be created first before deletion.
108125
109- #### Resuming from a Specific Paper
110-
111- Resume downloading from a specific arXiv ID:
112-
113- ``` bash
114- pyrxiv search_and_download --category cond-mat.str-el --start-id " 2201.12345" --n-papers 10
115- ```
116-
117- Resume from the last downloaded paper in your download directory:
118-
119- ``` bash
120- pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
121- ```
126+ ** Note 2** : Yes, this option does not make any sense, but GitHub Copilot wrote this and thought it was funny to keep it :-)
122127
123128#### Customizing PDF Text Extraction
124129
@@ -144,19 +149,19 @@ pyrxiv search_and_download --download-path my_papers --category cond-mat.str-el
144149
145150### Options Reference
146151
147- | Option | Short | Description | Default |
148- | --------| -------| -------------| ---------|
149- | ` --download-path ` | ` -path ` | Path for downloading PDFs and HDF5 files | ` data ` |
150- | ` --category ` | ` -c ` | arXiv category to search | ` cond-mat.str-el ` |
151- | ` --n-papers ` | ` -n ` | Number of papers to download | ` 5 ` |
152- | ` --regex-pattern ` | ` -regex ` | Regex pattern to filter papers | None |
153- | ` --start-id ` | ` -s ` | arXiv ID to start from | None |
154- | ` --start-from-filepath ` | ` -sff ` | Resume from last downloaded paper | ` False ` |
155- | ` --loader ` | ` -l ` | PDF text extraction loader (` pdfminer ` or ` pypdf ` ) | ` pdfminer ` |
156- | ` --clean-text ` | ` -ct ` | Clean extracted text (remove references, whitespace) | ` True ` |
157- | ` --save-hdf5 ` | ` -h5 ` | Save metadata to HDF5 files | ` False ` |
158- | ` --delete-pdf ` | ` -dp ` | Delete PDFs after processing | ` False ` |
159- | ` --delete-hdf5 ` | ` -dh5 ` | Delete HDF5 files after processing | ` False ` |
152+ | Option | Short | Description | Default |
153+ | ------------------------- | ---------- | ------------------------------------------------------ | ------------------- |
154+ | ` --download-path ` | ` -path ` | Path for downloading PDFs and HDF5 files | ` data ` |
155+ | ` --category ` | ` -c ` | arXiv category to search | ` cond-mat.str-el ` |
156+ | ` --n-papers ` | ` -n ` | Number of papers to download | ` 5 ` |
157+ | ` --regex-pattern ` | ` -regex ` | Regex pattern to filter papers | None |
158+ | ` --start-id ` | ` -s ` | arXiv ID to start from | None |
159+ | ` --start-from-filepath ` | ` -sff ` | Resume from last downloaded paper | ` False ` |
160+ | ` --loader ` | ` -l ` | PDF text extraction loader (` pdfminer ` or ` pypdf ` ) | ` pdfminer ` |
161+ | ` --clean-text ` | ` -ct ` | Clean extracted text (remove references, whitespace) | ` True ` |
162+ | ` --save-hdf5 ` | ` -h5 ` | Save metadata to HDF5 files | ` False ` |
163+ | ` --delete-pdf ` | ` -dp ` | Delete PDFs after processing | ` False ` |
164+ | ` --delete-hdf5 ` | ` -dh5 ` | Delete HDF5 files after processing | ` False ` |
160165
161166## Command 2: download_pdfs
162167
@@ -178,8 +183,8 @@ pyrxiv download_pdfs --data-path my_papers
178183
179184### Options
180185
181- | Option | Short | Description | Default |
182- | --------| -------| -------------| ---------|
186+ | Option | Short | Description | Default |
187+ | --------------- | --------- | --------------------------- | -------- |
183188| ` --data-path ` | ` -path ` | Path where HDF5 files exist | ` data ` |
184189
185190## Complete Pipeline Examples
@@ -269,26 +274,21 @@ pyrxiv download_pdfs --data-path research_papers
269274
270275## Best Practices
271276
272- 1 . ** Start Small** : Begin with a small number of papers (` --n-papers 5 ` ) to test your setup and regex patterns.
273-
277+ 1 . ** Start Small** : Begin with a small number of papers (e.g., ` --n-papers 5 ` ) to test your setup and regex patterns.
2742782 . ** Use Meaningful Regex** : When using ` --regex-pattern ` , make sure your pattern is specific enough to avoid false positives but broad enough to capture relevant papers.
275-
2762793 . ** Save Metadata** : Use ` --save-hdf5 ` to preserve paper metadata, which is useful for later analysis and record-keeping.
277-
2782804 . ** Organize by Category** : Use different download paths for different categories to keep your papers organized:
281+
279282 ``` bash
280283 pyrxiv search_and_download --download-path papers/condensed_matter --category cond-mat.str-el
281284 pyrxiv search_and_download --download-path papers/optics --category physics.optics
282285 ```
283-
2842865 . ** Resume Capability** : Use ` --start-from-filepath True ` when continuing a previous download session to avoid re-downloading papers.
287+ 6 . ** Storage Management** :
285288
286- 6 . ** Storage Management** :
287289 - Use ` --delete-pdf ` with ` --save-hdf5 ` if you primarily need metadata and text content
288290 - Use ` download_pdfs ` later to retrieve specific PDFs when needed
289-
2902917 . ** Text Extraction** : The default ` pdfminer ` loader generally works well, but if you encounter issues with specific PDFs, try ` --loader pypdf ` .
291-
2922928 . ** Monitor Progress** : The CLI displays a progress bar during downloads. For large batches, be patient as the tool may need to fetch many papers to find matches for your regex pattern.
293293
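The "Organize by Category" practice above can be scripted when several categories are involved. The sketch below is a dry run: it only prints the commands rather than running them, and it uses only options listed in the Options Reference. The category list and the directory names under `papers/` are illustrative assumptions.

```bash
# Dry run: print one pyrxiv command per category instead of executing it.
# The categories and directory layout below are illustrative assumptions.
categories="cond-mat.str-el physics.optics"

commands=""
for category in $categories; do
    # Derive a directory name from the category,
    # e.g. physics.optics -> physics_optics
    dir_name=$(printf '%s' "$category" | tr '.-' '__')
    cmd="pyrxiv search_and_download --download-path papers/$dir_name --category $category --n-papers 5"
    echo "$cmd"
    # Collect the printed commands, one per line.
    commands="$commands$cmd
"
done
```

Once the printed commands look right, they can be run directly, or the `echo "$cmd"` line can be replaced with `$cmd` to execute each one.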
294294## Troubleshooting
@@ -297,11 +297,11 @@ pyrxiv download_pdfs --data-path research_papers
297297
298298- Try broadening your regex pattern
299299- Check that the pattern syntax is correct
300- - Remember that pyrxiv searches the full text of papers, not just titles or abstracts
300+ - Remember that ` pyrxiv ` searches the full text of papers, not just titles or abstracts
301301
302302### Downloads are slow
303303
304- - arXiv has rate limits; pyrxiv respects these
304+ - arXiv has rate limits; ` pyrxiv ` respects these
305305- When using regex filtering, the tool must download and process papers until it finds enough matches
306306- Consider reducing ` --n-papers ` or using a less restrictive regex pattern
307307
@@ -317,4 +317,4 @@ pyrxiv download_pdfs --data-path research_papers
317317
318318---
319319
320- For more information, see the [ main README] ( ../README.md ) or visit the [ pyrxiv GitHub repository] ( https://github.com/JosePizarro3/pyrxiv ) .
320+ For more information, see the [ main README] ( ../README.md ) or visit the [ ` pyrxiv ` GitHub repository] ( https://github.com/JosePizarro3/pyrxiv ) .