Polished how_to md and README

JosePizarro3 · JosePizarro3 · commit 4fb190168bc3 · 2025-11-06T15:17:06.000+01:00
diff --git a/README.md b/README.md
@@ -44,7 +44,7 @@ pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubb
 
 ## Documentation
 
-For a comprehensive guide on how to use the CLI and recommended pipelines, see the [How to Use pyrxiv](docs/how_to_use_pyrxiv.md) documentation.
+For a comprehensive guide on how to use the CLI and recommended pipelines, see the [How to Use `pyrxiv`](docs/how_to_use_pyrxiv.md) documentation.
 
 ---
 
diff --git a/docs/how_to_use_pyrxiv.md b/docs/how_to_use_pyrxiv.md
@@ -1,4 +1,4 @@
-# How to Use pyrxiv CLI
+# How to Use `pyrxiv` CLI
 
 This guide explains how to use the **pyrxiv** command-line interface (CLI) to search and download arXiv papers.
 
@@ -18,15 +18,15 @@ This guide explains how to use the **pyrxiv** command-line interface (CLI) to se
 
 ## Installation
 
-Install pyrxiv using pip:
+Install `pyrxiv` using pip:
 
 ```bash
 pip install pyrxiv
 ```
 
 ## Available Commands
 
-pyrxiv provides two main commands:
+`pyrxiv` provides two main commands:
 
 1. **`search_and_download`** - Search for papers in a specific arXiv category and download them
 2. **`download_pdfs`** - Download PDFs from existing HDF5 metadata files
@@ -39,15 +39,14 @@ pyrxiv --help
 
 ## Pipeline Overview
 
-The typical **pyrxiv** workflow follows these steps:
+The typical `pyrxiv` workflow follows these steps:
 
 1. **Search and Filter**: Use `search_and_download` to:
+
    - Fetch papers from a specific arXiv category
    - Optionally filter papers using a regex pattern
    - Download PDFs and/or save metadata to HDF5 files
-
 2. **Process or Analyze**: Work with the downloaded papers and metadata
-
 3. **Re-download if Needed**: Use `download_pdfs` to re-download PDFs from HDF5 metadata files if you previously deleted them
 
 ## Command 1: search_and_download
@@ -72,14 +71,32 @@ pyrxiv search_and_download --category physics.optics --n-papers 10
 
 #### Filtering with Regex Patterns
 
-Download papers that match a specific regex pattern. When using `--regex-pattern`, pyrxiv will continue fetching papers until it finds the specified number that match the pattern:
+Download papers whose text matches a specific regex pattern. When using `--regex-pattern`, `pyrxiv` will continue fetching papers until it finds the specified number that match the pattern:
 
 ```bash
 pyrxiv search_and_download --category cond-mat.str-el --regex-pattern "DMFT|Hubbard" --n-papers 5
 ```
 
 **Important**: Papers that don't match the regex pattern are automatically discarded and not downloaded.
 
+#### Resuming from a Specific Paper
+
+Resume downloading from a specific arXiv ID:
+
+```bash
+pyrxiv search_and_download --category cond-mat.str-el --start-id "2201.12345v1" --n-papers 10
+```
+
+**Important**: You must provide the full arXiv ID, including the version suffix (e.g., `v1`).
+
+Resume from the last downloaded paper in your download directory:
+
+```bash
+pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
+```
+
+These options are useful if your search and download process stopped abruptly before reaching the requested number of papers.
+
 #### Saving Metadata to HDF5
 
 Save both PDFs and metadata to HDF5 files:
@@ -106,19 +123,7 @@ pyrxiv search_and_download --category cond-mat.str-el --n-papers 5 --save-hdf5 -
 
 **Note**: You must use `--save-hdf5` even if you're deleting HDF5 files, as the files need to be created first before deletion.
 
-#### Resuming from a Specific Paper
-
-Resume downloading from a specific arXiv ID:
-
-```bash
-pyrxiv search_and_download --category cond-mat.str-el --start-id "2201.12345" --n-papers 10
-```
-
-Resume from the last downloaded paper in your download directory:
-
-```bash
-pyrxiv search_and_download --category cond-mat.str-el --start-from-filepath True --n-papers 10
-```
+**Note 2**: Yes, this option does not make any sense, but GitHub Copilot wrote this and thought it was funny to keep it :-)
 
 #### Customizing PDF Text Extraction
 
@@ -144,19 +149,19 @@ pyrxiv search_and_download --download-path my_papers --category cond-mat.str-el
 
 ### Options Reference
 
-| Option | Short | Description | Default |
-|--------|-------|-------------|---------|
-| `--download-path` | `-path` | Path for downloading PDFs and HDF5 files | `data` |
-| `--category` | `-c` | arXiv category to search | `cond-mat.str-el` |
-| `--n-papers` | `-n` | Number of papers to download | `5` |
-| `--regex-pattern` | `-regex` | Regex pattern to filter papers | None |
-| `--start-id` | `-s` | arXiv ID to start from | None |
-| `--start-from-filepath` | `-sff` | Resume from last downloaded paper | `False` |
-| `--loader` | `-l` | PDF text extraction loader (`pdfminer` or `pypdf`) | `pdfminer` |
-| `--clean-text` | `-ct` | Clean extracted text (remove references, whitespace) | `True` |
-| `--save-hdf5` | `-h5` | Save metadata to HDF5 files | `False` |
-| `--delete-pdf` | `-dp` | Delete PDFs after processing | `False` |
-| `--delete-hdf5` | `-dh5` | Delete HDF5 files after processing | `False` |
+| Option                    | Short      | Description                                            | Default             |
+| ------------------------- | ---------- | ------------------------------------------------------ | ------------------- |
+| `--download-path`       | `-path`  | Path for downloading PDFs and HDF5 files               | `data`            |
+| `--category`            | `-c`     | arXiv category to search                               | `cond-mat.str-el` |
+| `--n-papers`            | `-n`     | Number of papers to download                           | `5`               |
+| `--regex-pattern`       | `-regex` | Regex pattern to filter papers                         | None                |
+| `--start-id`            | `-s`     | arXiv ID to start from                                 | None                |
+| `--start-from-filepath` | `-sff`   | Resume from last downloaded paper                      | `False`           |
+| `--loader`              | `-l`     | PDF text extraction loader (`pdfminer` or `pypdf`) | `pdfminer`        |
+| `--clean-text`          | `-ct`    | Clean extracted text (remove references, whitespace)   | `True`            |
+| `--save-hdf5`           | `-h5`    | Save metadata to HDF5 files                            | `False`           |
+| `--delete-pdf`          | `-dp`    | Delete PDFs after processing                           | `False`           |
+| `--delete-hdf5`         | `-dh5`   | Delete HDF5 files after processing                     | `False`           |
 
 ## Command 2: download_pdfs
 
@@ -178,8 +183,8 @@ pyrxiv download_pdfs --data-path my_papers
 
 ### Options
 
-| Option | Short | Description | Default |
-|--------|-------|-------------|---------|
+| Option          | Short     | Description                 | Default  |
+| --------------- | --------- | --------------------------- | -------- |
 | `--data-path` | `-path` | Path where HDF5 files exist | `data` |
 
 ## Complete Pipeline Examples
@@ -269,26 +274,21 @@ pyrxiv download_pdfs --data-path research_papers
 
 ## Best Practices
 
-1. **Start Small**: Begin with a small number of papers (`--n-papers 5`) to test your setup and regex patterns.
-
+1. **Start Small**: Begin with a small number of papers (e.g., `--n-papers 5`) to test your setup and regex patterns.
 2. **Use Meaningful Regex**: When using `--regex-pattern`, make sure your pattern is specific enough to avoid false positives but broad enough to capture relevant papers.
-
 3. **Save Metadata**: Use `--save-hdf5` to preserve paper metadata, which is useful for later analysis and record-keeping.
-
 4. **Organize by Category**: Use different download paths for different categories to keep your papers organized:
+
    ```bash
    pyrxiv search_and_download --download-path papers/condensed_matter --category cond-mat.str-el
    pyrxiv search_and_download --download-path papers/optics --category physics.optics
    ```
-
 5. **Resume Capability**: Use `--start-from-filepath True` when continuing a previous download session to avoid re-downloading papers.
+6. **Storage Management**:
 
-6. **Storage Management**: 
    - Use `--delete-pdf` with `--save-hdf5` if you primarily need metadata and text content
    - Use `download_pdfs` later to retrieve specific PDFs when needed
-
 7. **Text Extraction**: The default `pdfminer` loader generally works well, but if you encounter issues with specific PDFs, try `--loader pypdf`.
-
 8. **Monitor Progress**: The CLI displays a progress bar during downloads. For large batches, be patient as the tool may need to fetch many papers to find matches for your regex pattern.
 
 ## Troubleshooting
@@ -297,11 +297,11 @@ pyrxiv download_pdfs --data-path research_papers
 
 - Try broadening your regex pattern
 - Check the pattern syntax is correct
-- Remember that pyrxiv searches the full text of papers, not just titles or abstracts
+- Remember that `pyrxiv` searches the full text of papers, not just titles or abstracts
 
 ### Downloads are slow
 
-- arXiv has rate limits; pyrxiv respects these
+- arXiv has rate limits; `pyrxiv` respects these
 - When using regex filtering, the tool must download and process papers until it finds enough matches
 - Consider reducing `--n-papers` or using a less restrictive regex pattern
 
@@ -317,4 +317,4 @@ pyrxiv download_pdfs --data-path research_papers
 
 ---
 
-For more information, see the [main README](../README.md) or visit the [pyrxiv GitHub repository](https://github.com/JosePizarro3/pyrxiv).
+For more information, see the [main README](../README.md) or visit the [`pyrxiv` GitHub repository](https://github.com/JosePizarro3/pyrxiv).
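
The how-to's requirement that `--start-id` receives the full arXiv ID, including its version part, can be checked before a run starts. The following is a minimal sketch, not part of `pyrxiv`: the helper names `has_version_suffix` and `resume_from` are hypothetical, and the check only assumes that a versioned ID ends in `v` followed by digits.

```bash
# Hypothetical helper (not provided by pyrxiv): succeeds only if the given
# ID ends in a version suffix such as "v1" or "v12".
has_version_suffix() {
    printf '%s\n' "$1" | grep -Eq 'v[0-9]+$'
}

# Hypothetical wrapper: refuses to call pyrxiv when the ID has no version
# suffix; otherwise runs the resume command shown in the how-to.
resume_from() {
    if ! has_version_suffix "$1"; then
        echo "start ID '$1' has no version suffix (e.g. v1)" >&2
        return 1
    fi
    pyrxiv search_and_download --category cond-mat.str-el --start-id "$1" --n-papers 10
}
```

With these definitions, `resume_from "2201.12345"` would stop with an error message before any request is made, while `resume_from "2201.12345v1"` would pass the ID on to `pyrxiv`.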