Commit 4fb1901

Polished how_to md and README
1 parent 3a61c7d commit 4fb1901

File tree: 2 files changed, +47 / -47 lines


README.md

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubb
 
 ## Documentation
 
-For a comprehensive guide on how to use the CLI and recommended pipelines, see the [How to Use pyrxiv](docs/how_to_use_pyrxiv.md) documentation.
+For a comprehensive guide on how to use the CLI and recommended pipelines, see the [How to Use `pyrxiv`](docs/how_to_use_pyrxiv.md) documentation.
 
 ---
 
docs/how_to_use_pyrxiv.md

Lines changed: 46 additions & 46 deletions
@@ -1,4 +1,4 @@
-# How to Use pyrxiv CLI
+# How to Use `pyrxiv` CLI
 
 This guide explains how to use the **pyrxiv** command-line interface (CLI) to search and download arXiv papers.
 
@@ -18,15 +18,15 @@ This guide explains how to use the **pyrxiv** command-line interface (CLI) to se
 
 ## Installation
 
-Install pyrxiv using pip:
+Install `pyrxiv` using pip:
 
 ```bash
 pip install pyrxiv
 ```
 
 ## Available Commands
 
-pyrxiv provides two main commands:
+`pyrxiv` provides two main commands:
 
 1. **`search_and_download`** - Search for papers in a specific arXiv category and download them
 2. **`download_pdfs`** - Download PDFs from existing HDF5 metadata files
@@ -39,15 +39,14 @@ pyrxiv --help
 
 ## Pipeline Overview
 
-The typical **pyrxiv** workflow follows these steps:
+The typical `pyrxiv` workflow follows these steps:
 
 1. **Search and Filter**: Use `search_and_download` to:
+
    - Fetch papers from a specific arXiv category
    - Optionally filter papers using a regex pattern
    - Download PDFs and/or save metadata to HDF5 files
-
 2. **Process or Analyze**: Work with the downloaded papers and metadata
-
 3. **Re-download if Needed**: Use `download_pdfs` to re-download PDFs from HDF5 metadata files if you previously deleted them
 
 ## Command 1: search_and_download
@@ -72,14 +71,32 @@ pyrxiv search_and_download --category physics.optics --n-papers 10
 
 #### Filtering with Regex Patterns
 
-Download papers that match a specific regex pattern. When using `--regex-pattern`, pyrxiv will continue fetching papers until it finds the specified number that match the pattern:
+Download papers whose full text matches a specific regex pattern. When using `--regex-pattern`, `pyrxiv` will continue fetching papers until it finds the specified number that match the pattern:
 
 ```bash
 pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubbard" --n-papers 5
 ```
 
 **Important**: Papers that don't match the regex pattern are automatically discarded and not downloaded.
 
+#### Resuming from a Specific Paper
+
+Resume downloading from a specific arXiv ID:
+
+```bash
+pyrxiv search_and_download --category cond-mat.str-el --start-id "2201.12345v1" --n-papers 10
+```
+
+**Important**: You must provide the full arXiv ID, including the version suffix (e.g., `v1`).
+
+Resume from the last downloaded paper in your download directory:
+
+```bash
+pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
+```
+
+These options are useful if a previous search-and-download run stopped abruptly before processing all of the requested papers.
+
 #### Saving Metadata to HDF5
 
 Save both PDFs and metadata to HDF5 files:
@@ -106,19 +123,7 @@ pyrxiv search_and_download --category cond-mat.str-el --n-papers 5 --save-hdf5 -
 
 **Note**: You must use `--save-hdf5` even if you're deleting HDF5 files, as the files need to be created first before deletion.
 
-#### Resuming from a Specific Paper
-
-Resume downloading from a specific arXiv ID:
-
-```bash
-pyrxiv search_and_download --category cond-mat.str-el --start-id "2201.12345" --n-papers 10
-```
-
-Resume from the last downloaded paper in your download directory:
-
-```bash
-pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
-```
+**Note 2**: Yes, this option does not make any sense, but GitHub Copilot wrote this and thought it was funny to keep it :-)
 
 #### Customizing PDF Text Extraction
 
@@ -144,19 +149,19 @@ pyrxiv search_and_download --download-path my_papers --category cond-mat.str-el
 
 ### Options Reference
 
-| Option | Short | Description | Default |
-|--------|-------|-------------|---------|
-| `--download-path` | `-path` | Path for downloading PDFs and HDF5 files | `data` |
-| `--category` | `-c` | arXiv category to search | `cond-mat.str-el` |
-| `--n-papers` | `-n` | Number of papers to download | `5` |
-| `--regex-pattern` | `-regex` | Regex pattern to filter papers | None |
-| `--start-id` | `-s` | arXiv ID to start from | None |
-| `--start-from-filepath` | `-sff` | Resume from last downloaded paper | `False` |
-| `--loader` | `-l` | PDF text extraction loader (`pdfminer` or `pypdf`) | `pdfminer` |
-| `--clean-text` | `-ct` | Clean extracted text (remove references, whitespace) | `True` |
-| `--save-hdf5` | `-h5` | Save metadata to HDF5 files | `False` |
-| `--delete-pdf` | `-dp` | Delete PDFs after processing | `False` |
-| `--delete-hdf5` | `-dh5` | Delete HDF5 files after processing | `False` |
+| Option | Short | Description | Default |
+| ------------------------- | ---------- | ------------------------------------------------------ | ------------------- |
+| `--download-path` | `-path` | Path for downloading PDFs and HDF5 files | `data` |
+| `--category` | `-c` | arXiv category to search | `cond-mat.str-el` |
+| `--n-papers` | `-n` | Number of papers to download | `5` |
+| `--regex-pattern` | `-regex` | Regex pattern to filter papers | None |
+| `--start-id` | `-s` | arXiv ID to start from | None |
+| `--start-from-filepath` | `-sff` | Resume from last downloaded paper | `False` |
+| `--loader` | `-l` | PDF text extraction loader (`pdfminer` or `pypdf`) | `pdfminer` |
+| `--clean-text` | `-ct` | Clean extracted text (remove references, whitespace) | `True` |
+| `--save-hdf5` | `-h5` | Save metadata to HDF5 files | `False` |
+| `--delete-pdf` | `-dp` | Delete PDFs after processing | `False` |
+| `--delete-hdf5` | `-dh5` | Delete HDF5 files after processing | `False` |
 
 ## Command 2: download_pdfs
 
@@ -178,8 +183,8 @@ pyrxiv download_pdfs --data-path my_papers
 
 ### Options
 
-| Option | Short | Description | Default |
-|--------|-------|-------------|---------|
+| Option | Short | Description | Default |
+| --------------- | --------- | --------------------------- | -------- |
 | `--data-path` | `-path` | Path where HDF5 files exist | `data` |
 
 ## Complete Pipeline Examples
@@ -269,26 +274,21 @@ pyrxiv download_pdfs --data-path research_papers
 
 ## Best Practices
 
-1. **Start Small**: Begin with a small number of papers (`--n-papers 5`) to test your setup and regex patterns.
-
+1. **Start Small**: Begin with a small number of papers (e.g., `--n-papers 5`) to test your setup and regex patterns.
 2. **Use Meaningful Regex**: When using `--regex-pattern`, make sure your pattern is specific enough to avoid false positives but broad enough to capture relevant papers.
-
 3. **Save Metadata**: Use `--save-hdf5` to preserve paper metadata, which is useful for later analysis and record-keeping.
-
 4. **Organize by Category**: Use different download paths for different categories to keep your papers organized:
+
    ```bash
    pyrxiv search_and_download --download-path papers/condensed_matter --category cond-mat.str-el
    pyrxiv search_and_download --download-path papers/optics --category physics.optics
    ```
-
 5. **Resume Capability**: Use `--start-from-filepath True` when continuing a previous download session to avoid re-downloading papers.
+6. **Storage Management**:
 
-6. **Storage Management**:
    - Use `--delete-pdf` with `--save-hdf5` if you primarily need metadata and text content
    - Use `download_pdfs` later to retrieve specific PDFs when needed
-
 7. **Text Extraction**: The default `pdfminer` loader generally works well, but if you encounter issues with specific PDFs, try `--loader pypdf`.
-
 8. **Monitor Progress**: The CLI displays a progress bar during downloads. For large batches, be patient as the tool may need to fetch many papers to find matches for your regex pattern.
 
 ## Troubleshooting
@@ -297,11 +297,11 @@ pyrxiv download_pdfs --data-path research_papers
 
 - Try broadening your regex pattern
 - Check the pattern syntax is correct
-- Remember that pyrxiv searches the full text of papers, not just titles or abstracts
+- Remember that `pyrxiv` searches the full text of papers, not just titles or abstracts
 
 ### Downloads are slow
 
-- arXiv has rate limits; pyrxiv respects these
+- arXiv has rate limits; `pyrxiv` respects these
 - When using regex filtering, the tool must download and process papers until it finds enough matches
 - Consider reducing `--n-papers` or using a less restrictive regex pattern
 
@@ -317,4 +317,4 @@ pyrxiv download_pdfs --data-path research_papers
 
 ---
 
-For more information, see the [main README](../README.md) or visit the [pyrxiv GitHub repository](https://github.com/JosePizarro3/pyrxiv).
+For more information, see the [main README](../README.md) or visit the [`pyrxiv` GitHub repository](https://github.com/JosePizarro3/pyrxiv).
