Commit 893f437

Merge pull request #33 from JosePizarro3/29-rewrite-readme-with-latest-developments
29 rewrite readme with latest developments
2 parents b7cf91d + 88b3310 commit 893f437

2 files changed, +324 -0 lines changed

README.md

Lines changed: 4 additions & 0 deletions
@@ -42,6 +42,10 @@ pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubb
 **Note**: When using `--regex-pattern`, the tool will continue fetching papers from arXiv until it finds the specified number of papers (`--n-papers`) that match the pattern. Papers that don't match the regex are automatically discarded.
 
+## Documentation
+
+For a comprehensive guide on how to use the CLI and recommended pipelines, see the [How to Use `pyrxiv`](docs/how_to_use_pyrxiv.md) documentation.
+
 ---
 
 # Development

docs/how_to_use_pyrxiv.md

Lines changed: 320 additions & 0 deletions
@@ -0,0 +1,320 @@
# How to Use `pyrxiv` CLI

This guide explains how to use the **pyrxiv** command-line interface (CLI) to search and download arXiv papers.

## Table of Contents

- [Installation](#installation)
- [Available Commands](#available-commands)
- [Pipeline Overview](#pipeline-overview)
- [Command 1: search_and_download](#command-1-search_and_download)
  - [Basic Usage](#basic-usage)
  - [Advanced Usage](#advanced-usage)
  - [Options Reference](#options-reference)
- [Command 2: download_pdfs](#command-2-download_pdfs)
- [Complete Pipeline Examples](#complete-pipeline-examples)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)

## Installation

Install `pyrxiv` using pip:

```bash
pip install pyrxiv
```

## Available Commands

`pyrxiv` provides two main commands:

1. **`search_and_download`** - Search for papers in a specific arXiv category and download them
2. **`download_pdfs`** - Download PDFs from existing HDF5 metadata files

To see all available commands:

```bash
pyrxiv --help
```

## Pipeline Overview

The typical `pyrxiv` workflow follows these steps:

1. **Search and Filter**: Use `search_and_download` to:

   - Fetch papers from a specific arXiv category
   - Optionally filter papers using a regex pattern
   - Download PDFs and/or save metadata to HDF5 files
2. **Process or Analyze**: Work with the downloaded papers and metadata
3. **Re-download if Needed**: Use `download_pdfs` to re-download PDFs from HDF5 metadata files if you previously deleted them

## Command 1: search_and_download

The `search_and_download` command searches arXiv for papers and downloads them to a specified directory.

### Basic Usage

Download the 5 most recent papers from the default category (`cond-mat.str-el`):

```bash
pyrxiv search_and_download
```

Download 10 papers from a specific category:

```bash
pyrxiv search_and_download --category physics.optics --n-papers 10
```

### Advanced Usage

#### Filtering with Regex Patterns

Download only papers whose text matches a regex pattern. When using `--regex-pattern`, `pyrxiv` continues fetching papers until it finds the requested number (`--n-papers`) that match the pattern:

```bash
pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubbard" --n-papers 5
```

**Important**: Papers that don't match the regex pattern are automatically discarded and are not kept.

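
The fetch-until-enough-matches behavior described above can be pictured with a minimal Python sketch. It is not `pyrxiv`'s implementation: `collect_matching` and the in-memory `papers` list are hypothetical stand-ins for the arXiv fetch and text-extraction steps, and applying the pattern with a case-sensitive `re.search` is an assumption.

```python
import re
from typing import Iterable, List, Tuple


def collect_matching(
    papers: Iterable[Tuple[str, str]], pattern: str, n_papers: int
) -> List[str]:
    """Return the IDs of the first `n_papers` papers whose text matches `pattern`.

    `papers` yields (paper_id, text) pairs; non-matching papers are discarded.
    This helper is hypothetical and only illustrates the filtering loop.
    """
    regex = re.compile(pattern)
    kept: List[str] = []
    for paper_id, text in papers:
        if len(kept) >= n_papers:
            break  # enough matches collected, stop fetching
        if regex.search(text):
            kept.append(paper_id)
    return kept


# Hypothetical sample data standing in for fetched papers.
papers = [
    ("paper-1", "We study the Hubbard model on a square lattice."),
    ("paper-2", "Optical response of a photonic crystal."),
    ("paper-3", "A DMFT study of a Mott transition."),
    ("paper-4", "Hubbard bands in a doped insulator."),
]

matched = collect_matching(papers, "DMFT|Hubbard", n_papers=2)
print(matched)  # ['paper-1', 'paper-3']
```

Note that `paper-2` is inspected and dropped, which is why a restrictive pattern can require many more fetches than `--n-papers` suggests.
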
#### Resuming from a Specific Paper

Resume downloading from a specific arXiv ID:

```bash
pyrxiv search_and_download --category cond-mat.str-el --start-id "2201.12345v1" --n-papers 10
```

**Important**: Provide the full arXiv ID, including the version suffix (e.g., `v1`).

Resume from the last downloaded paper in your download directory:

```bash
pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
```

These options are useful if a search-and-download run stopped abruptly before reaching the requested number of papers.

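
Because `--start-id` expects the full identifier including its version, it can help to check an ID before passing it on. The following is a small sketch under stated assumptions: `has_version_suffix` is a hypothetical helper (not part of `pyrxiv`), and it only covers new-style arXiv identifiers (`YYMM.NNNN` or `YYMM.NNNNN` plus `vN`), not old-style IDs of the form `archive/YYMMNNN`.

```python
import re

# New-style arXiv identifier followed by a version suffix such as v1.
# Old-style identifiers (archive/YYMMNNN) are deliberately not handled here.
VERSIONED_ID = re.compile(r"\d{4}\.\d{4,5}v\d+")


def has_version_suffix(arxiv_id: str) -> bool:
    """Return True if `arxiv_id` is a new-style arXiv ID with a version suffix."""
    return VERSIONED_ID.fullmatch(arxiv_id) is not None


print(has_version_suffix("2201.12345v1"))  # True
print(has_version_suffix("2201.12345"))    # False: version suffix is missing
```
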
#### Saving Metadata to HDF5

Save both PDFs and metadata to HDF5 files:

```bash
pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubbard" --n-papers 5 --save-hdf5
```

#### Saving Only Metadata (No PDFs)

If you only need metadata and want to save storage:

```bash
pyrxiv search_and_download --category cond-mat.str-el --n-papers 5 --save-hdf5 --delete-pdf
```

#### Saving Only PDFs (No HDF5)

If you only need PDFs and want to clean up metadata files:

```bash
pyrxiv search_and_download --category cond-mat.str-el --n-papers 5 --save-hdf5 --delete-hdf5
```

**Note**: You must use `--save-hdf5` even if you're deleting HDF5 files, as the files need to be created first before deletion.

**Note 2**: Yes, this option does not make any sense, but GitHub Copilot wrote this and thought it was funny to keep it :-)

#### Customizing PDF Text Extraction

Choose a different PDF loader:

```bash
pyrxiv search_and_download --category cond-mat.str-el --loader pypdf --n-papers 5
```

Disable text cleaning (keep references and extra whitespace):

```bash
pyrxiv search_and_download --category cond-mat.str-el --clean-text False --n-papers 5
```

#### Custom Download Path

Specify a custom directory for downloads:

```bash
pyrxiv search_and_download --download-path my_papers --category cond-mat.str-el --n-papers 5
```

### Options Reference

| Option                  | Short    | Description                                          | Default           |
| ----------------------- | -------- | ---------------------------------------------------- | ----------------- |
| `--download-path`       | `-path`  | Path for downloading PDFs and HDF5 files             | `data`            |
| `--category`            | `-c`     | arXiv category to search                             | `cond-mat.str-el` |
| `--n-papers`            | `-n`     | Number of papers to download                         | `5`               |
| `--regex-pattern`       | `-regex` | Regex pattern to filter papers                       | None              |
| `--start-id`            | `-s`     | arXiv ID to start from                               | None              |
| `--start-from-filepath` | `-sff`   | Resume from last downloaded paper                    | `False`           |
| `--loader`              | `-l`     | PDF text extraction loader (`pdfminer` or `pypdf`)   | `pdfminer`        |
| `--clean-text`          | `-ct`    | Clean extracted text (remove references, whitespace) | `True`            |
| `--save-hdf5`           | `-h5`    | Save metadata to HDF5 files                          | `False`           |
| `--delete-pdf`          | `-dp`    | Delete PDFs after processing                         | `False`           |
| `--delete-hdf5`         | `-dh5`   | Delete HDF5 files after processing                   | `False`           |

## Command 2: download_pdfs

The `download_pdfs` command downloads PDFs from existing HDF5 metadata files. This is useful if you previously saved only metadata or deleted PDFs to save space.

### Usage

Download PDFs from HDF5 files in the default `data/` directory:

```bash
pyrxiv download_pdfs
```

Download PDFs from HDF5 files in a custom directory:

```bash
pyrxiv download_pdfs --data-path my_papers
```

### Options

| Option        | Short   | Description                 | Default |
| ------------- | ------- | --------------------------- | ------- |
| `--data-path` | `-path` | Path where HDF5 files exist | `data`  |

## Complete Pipeline Examples

### Example 1: Basic Paper Collection

Collect 10 recent papers from condensed matter physics:

```bash
# Download papers
pyrxiv search_and_download --category cond-mat.str-el --n-papers 10

# Papers will be saved in ./data/
```

### Example 2: Filtered Search with Metadata

Search for papers about DMFT or Hubbard models, saving both PDFs and metadata:

```bash
# Download and filter papers
pyrxiv search_and_download \
  --category cond-mat.str-el \
  --regex-pattern "DMFT|Hubbard" \
  --n-papers 5 \
  --save-hdf5

# Work with the downloaded papers...

# If you later delete PDFs to save space, you can re-download them:
pyrxiv download_pdfs
```

### Example 3: Metadata-Only Collection

Collect metadata without keeping PDFs (useful for building a searchable database):

```bash
# Download papers, extract text, save metadata, delete PDFs
pyrxiv search_and_download \
  --category physics.optics \
  --n-papers 20 \
  --save-hdf5 \
  --delete-pdf

# Your ./data/ directory will contain only .hdf5 files with metadata and extracted text
```

### Example 4: Continuous Collection

Set up a continuous collection workflow:

```bash
# First batch
pyrxiv search_and_download --category cond-mat.str-el --n-papers 10 --save-hdf5

# Later, resume from where you left off
pyrxiv search_and_download \
  --category cond-mat.str-el \
  --start-from-filepath True \
  --n-papers 10 \
  --save-hdf5
```

### Example 5: Multi-Step Research Pipeline

A complete research workflow:

```bash
# Step 1: Collect papers matching your research topic
pyrxiv search_and_download \
  --category cond-mat.str-el \
  --regex-pattern "topological insulator|Weyl semimetal" \
  --n-papers 20 \
  --save-hdf5 \
  --download-path research_papers

# Step 2: Analyze the collected papers (your custom scripts)
# ... perform analysis on PDFs and HDF5 metadata ...

# Step 3: Clean up PDFs if you only need metadata going forward
rm research_papers/*.pdf

# Step 4: Later, re-download specific PDFs you need
pyrxiv download_pdfs --data-path research_papers
```

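
Step 3 of Example 5 removes PDFs unconditionally. A more cautious variant, sketched below in Python, deletes them only when HDF5 metadata is present in the same directory, so that `download_pdfs` still has something to work from. The helper name `remove_pdfs_if_metadata_exists` is hypothetical, and the `.hdf5` file extension is an assumption taken from the comment in Example 3.

```python
from pathlib import Path
from typing import List


def remove_pdfs_if_metadata_exists(directory: str) -> List[str]:
    """Delete *.pdf files in `directory`, but only if *.hdf5 files are present.

    Returns the names of the deleted files (empty if nothing was removed).
    Hypothetical helper; the .hdf5 extension is an assumption.
    """
    folder = Path(directory)
    if not folder.is_dir():
        return []
    if not any(folder.glob("*.hdf5")):
        return []  # no metadata to re-download from, so keep the PDFs
    removed = []
    for pdf in sorted(folder.glob("*.pdf")):
        pdf.unlink()
        removed.append(pdf.name)
    return removed
```

Calling `remove_pdfs_if_metadata_exists("research_papers")` would then take the place of the `rm` command in Step 3.
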
## Best Practices

1. **Start Small**: Begin with a small number of papers (e.g., `--n-papers 5`) to test your setup and regex patterns.
2. **Use Meaningful Regex**: When using `--regex-pattern`, make sure your pattern is specific enough to avoid false positives but broad enough to capture relevant papers.
3. **Save Metadata**: Use `--save-hdf5` to preserve paper metadata, which is useful for later analysis and record-keeping.
4. **Organize by Category**: Use different download paths for different categories to keep your papers organized:

   ```bash
   pyrxiv search_and_download --download-path papers/condensed_matter --category cond-mat.str-el
   pyrxiv search_and_download --download-path papers/optics --category physics.optics
   ```
5. **Resume Capability**: Use `--start-from-filepath True` when continuing a previous download session to avoid re-downloading papers.
6. **Storage Management**:

   - Use `--delete-pdf` with `--save-hdf5` if you primarily need metadata and text content
   - Use `download_pdfs` later to retrieve specific PDFs when needed
7. **Text Extraction**: The default `pdfminer` loader generally works well, but if you encounter issues with specific PDFs, try `--loader pypdf`.
8. **Monitor Progress**: The CLI displays a progress bar during downloads. For large batches, be patient as the tool may need to fetch many papers to find matches for your regex pattern.

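
The "Organize by Category" practice can also be scripted. The sketch below builds one `search_and_download` invocation per category using only flags listed in the Options Reference; `build_commands` and the `targets` mapping are hypothetical names introduced for illustration. The commands are printed for review rather than executed.

```python
import shlex
from typing import Dict, List


def build_commands(targets: Dict[str, str], n_papers: int = 5) -> List[List[str]]:
    """Build one `pyrxiv search_and_download` argument list per category.

    `targets` maps an arXiv category to the download path used for it.
    """
    commands = []
    for category, download_path in targets.items():
        commands.append(
            [
                "pyrxiv", "search_and_download",
                "--download-path", download_path,
                "--category", category,
                "--n-papers", str(n_papers),
                "--save-hdf5",
            ]
        )
    return commands


targets = {
    "cond-mat.str-el": "papers/condensed_matter",
    "physics.optics": "papers/optics",
}

for command in build_commands(targets):
    # Print for review; use subprocess.run(command, check=True) to execute
    # once `pyrxiv` is installed.
    print(shlex.join(command))
```
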
## Troubleshooting

### No papers match my regex pattern

- Try broadening your regex pattern
- Check that the pattern syntax is correct
- Remember that `pyrxiv` searches the full text of papers, not just titles or abstracts

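
One way to check pattern syntax before starting a long run is to compile the pattern with Python's `re` module and try it on a sentence you expect to match. This is a sketch only: `check_pattern` is a hypothetical helper, and whether `pyrxiv` applies the pattern with the same case-sensitive `re.search` semantics is an assumption.

```python
import re
from typing import Optional


def check_pattern(pattern: str, sample_text: str) -> Optional[bool]:
    """Return None if `pattern` is not valid regex syntax, else whether it matches."""
    try:
        regex = re.compile(pattern)
    except re.error as exc:
        print(f"Invalid pattern {pattern!r}: {exc}")
        return None
    return regex.search(sample_text) is not None


sample = "Dynamical mean-field theory (DMFT) applied to the Hubbard model."
print(check_pattern("DMFT|Hubbard", sample))   # True
print(check_pattern("dmft", sample))           # False: Python's re is case-sensitive by default
print(check_pattern("(DMFT|Hubbard", sample))  # None: unbalanced parenthesis
```
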
### Downloads are slow

- arXiv has rate limits; `pyrxiv` respects these
- When using regex filtering, the tool must download and process papers until it finds enough matches
- Consider reducing `--n-papers` or using a less restrictive regex pattern

### PDF extraction errors

- Try switching between `--loader pdfminer` and `--loader pypdf`
- Some papers may have PDF issues; these will be skipped automatically

### Can't find HDF5 files

- Ensure you used `--save-hdf5` when running `search_and_download`
- Check that the `--data-path` matches where you saved the files

---

For more information, see the [main README](../README.md) or visit the [`pyrxiv` GitHub repository](https://github.com/JosePizarro3/pyrxiv).
