Improve scripts code by kodsurfer · Pull Request #954 · cqfn/aibolit

kodsurfer · 2025-10-13T23:09:18Z

Summary by CodeRabbit

New Features
- CLI tool to download trending GitHub repositories with options for number, output directory, and timeout; updated program entrypoint.
Improvements
- Creates owner/repo folders, skips already-downloaded repos, reports progress/status, robustly handles network and subprocess errors, and normalizes URLs for reliable cloning.
Refactor
- Downloader reorganized into a structured, reusable class-based flow.
Tests
- Comprehensive unit tests for fetching, URL normalization, cloning behavior, CLI parsing, and end-to-end flow.

coderabbitai · 2025-10-13T23:09:41Z

Walkthrough

Refactors scripts/01-fetch-github.py into a class-based RepositoryDownloader with setup, fetching, cloning, and orchestration methods; adds CLI parsing (--nrepos, --output-dir) and main() entrypoint. Adds comprehensive unit tests for the downloader in test/scripts/test_fetch_github.py.

Changes

Cohort / File(s)	Summary of changes
Downloader refactor and CLI `scripts/01-fetch-github.py`	Replaced procedural script with `RepositoryDownloader` class (`__init__(output, url, timeout)`), added `setup_directories()`, `fetch_trending_repositories() -> List[str]`, `clone_repository(repo_url, owner, repo_name) -> bool`, `download_repositories(max_repositories)`, `parse_arguments()` and `main()`; updated `__main__` to call `main()`.
Unit tests for downloader `test/scripts/test_fetch_github.py`	Added tests covering directory setup, HTTP fetch and normalization of trending repo URLs, cloning success/failure and subprocess behavior, download workflow with limits and partial failures, argument parsing, and main invocation; mocks network, subprocess, and filesystem side-effects.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as argparse CLI
  participant RD as RepositoryDownloader
  participant GH as GitHub (HTTP)
  participant GIT as git subprocess
  rect rgb(240,248,255)
    User->>CLI: run script (--nrepos, --output-dir)
    CLI->>RD: instantiate(output, url, timeout)
    RD->>RD: setup_directories()
  end
  rect rgb(245,255,240)
    RD->>GH: GET trending page
    GH-->>RD: HTML response
    RD->>RD: parse & normalize repo URLs (ensure full GitHub .git)
  end
  loop each repo (up to max_repositories)
    RD->>RD: derive owner and repo_name
    alt repo not present locally
      RD->>GIT: git clone <repo_url> into output/owner/repo
      GIT-->>RD: success / error
      RD->>RD: record result
    else repo exists
      RD->>RD: skip clone (log)
    end
  end
  RD-->>CLI: finish (summary)
  CLI-->>User: exit

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Areas to review closely:
- URL parsing/normalization and HTML selectors in fetch_trending_repositories().
- Filesystem path handling and directory creation in setup_directories() and clone target logic.
- Subprocess invocation, error handling, and idempotency in clone_repository().
- Accuracy and isolation of mocks/assertions in test/scripts/test_fetch_github.py.

Poem

A twitch of whiskers, code in paw,
I hop through links with careful awe.
I fetch and clone, then snuggle tight,
Repos tucked in burrows for the night.
Bravo, friend — we did it right! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title Check	❓ Inconclusive	The title "Improve scripts code" is too vague and generic to meaningfully convey the nature of the changes. While the changeset does contain improvements (refactoring to a class-based RepositoryDownloader architecture and adding comprehensive unit tests), the title uses non-descriptive language that could apply to nearly any enhancement. A teammate reviewing git history would not understand whether the PR involved refactoring, bug fixes, feature additions, or performance improvements, which is critical information for understanding the codebase evolution.	Consider revising the title to be more specific and descriptive of the actual changes. For example, "Refactor repository downloader to class-based architecture" or "Restructure fetch-github script with RepositoryDownloader class and unit tests" would clearly communicate the primary changes to someone reading the history without requiring them to inspect the diff.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35f5b68 and 5b14be9.

📒 Files selected for processing (1)

test/scripts/test_fetch_github.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

test/scripts/test_fetch_github.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: latexmk

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

scripts/01-fetch-github.py (1)

74-77: URL parsing logic is fragile.

The string manipulation approach to extract owner and repo name is brittle. It assumes a specific URL format and doesn't handle edge cases like trailing slashes, different URL schemes, or malformed URLs.

Consider using urllib.parse or a regex pattern for more robust URL parsing:

+            from urllib.parse import urlparse
+            
-            path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/')
-            if len(path_parts) < 2:
-                continue
-            owner, repo_name = path_parts[0], path_parts[1]
+            parsed = urlparse(repo_url)
+            path_parts = parsed.path.strip('/').replace('.git', '').split('/')
+            if len(path_parts) != 2:
+                print(f"Skipping malformed URL: {repo_url}")
+                continue
+            owner, repo_name = path_parts[0], path_parts[1]

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 540a5cc and cb7a4a3.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

🪛 GitHub Actions: flake8

scripts/01-fetch-github.py

[error] 5-5: flake8: F401 'os' imported but unused

[error] 15-15: flake8: W291 trailing whitespace

[error] 20-20: flake8: W293 blank line contains whitespace

[error] 24-24: flake8: W293 blank line contains whitespace

[error] 33-33: flake8: W291 trailing whitespace

[error] 36-36: flake8: W291 trailing whitespace

[error] 38-38: flake8: W293 blank line contains whitespace

[error] 41-41: flake8: W291 trailing whitespace

[error] 44-44: flake8: W291 trailing whitespace

[error] 58-58: flake8: W291 trailing whitespace

[error] 62-62: flake8: W293 blank line contains whitespace

[error] 64-64: flake8: W291 trailing whitespace

[error] 69-69: flake8: W291 trailing whitespace

[error] 70-70: flake8: W291 trailing whitespace

[error] 73-73: flake8: W291 trailing whitespace

[error] 76-76: flake8: W291 trailing whitespace

[error] 77-77: flake8: W291 trailing whitespace

[error] 78-78: flake8: W291 trailing whitespace

[error] 80-80: flake8: W291 trailing whitespace

[error] 107-107: flake8: W291 trailing whitespace

[error] 1-1: flake8: General lint failure causing pipeline exit

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[error] 15-15: Trailing whitespace (trailing-whitespace) in line 15

[error] 20-20: Trailing whitespace (trailing-whitespace) in line 20

[error] 24-24: Trailing whitespace (trailing-whitespace) in line 24

[error] 33-33: Trailing whitespace (trailing-whitespace) in line 33

[error] 36-36: Trailing whitespace (trailing-whitespace) in line 36

[error] 38-38: Trailing whitespace (trailing-whitespace) in line 38

[error] 41-41: Trailing whitespace (trailing-whitespace) in line 41

[error] 44-44: Trailing whitespace (trailing-whitespace) in line 44

[error] 58-58: Trailing whitespace (trailing-whitespace) in line 58

[error] 62-62: Trailing whitespace (trailing-whitespace) in line 62

[error] 64-64: Trailing whitespace (trailing-whitespace) in line 64

[error] 69-69: Trailing whitespace (trailing-whitespace) in line 69

[error] 70-70: Trailing whitespace (trailing-whitespace) in line 70

[error] 73-73: Trailing whitespace (trailing-whitespace) in line 73

[error] 76-76: Trailing whitespace (trailing-whitespace) in line 76

[error] 77-77: Trailing whitespace (trailing-whitespace) in line 77

[error] 78-78: Trailing whitespace (trailing-whitespace) in line 78

[error] 80-80: Trailing whitespace (trailing-whitespace) in line 80

[error] 107-107: Trailing whitespace (trailing-whitespace) in line 107

[error] 5-5: Unused import 'os' (unused-import) detected by pylint

🪛 GitHub Actions: ruff

scripts/01-fetch-github.py

[error] 5-5: F401: 'os' imported but unused

[error] 16-16: Q000: Double quotes found but single quotes preferred

[error] 18-18: Q000: Double quotes found but single quotes preferred

[error] 30-30: Q000: Double quotes found but single quotes preferred

[error] 36-36: Q000: Double quotes found but single quotes preferred

[error] 43-43: Q000: Double quotes found but single quotes preferred

[error] 54-54: Q000: Double quotes found but single quotes preferred

[error] 57-57: Q000: Double quotes found but single quotes preferred

[error] 60-60: Q000: Double quotes found but single quotes preferred

[error] 64-64: Q000: Double quotes found but single quotes preferred

[error] 68-68: Q000: Double quotes found but single quotes preferred

[error] 78-78: Q000: Double quotes found but single quotes preferred

[error] 81-81: Q000: Double quotes found but single quotes preferred

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: make
GitHub Check: latexmk

🔇 Additional comments (5)

scripts/01-fetch-github.py (5)

16-23: Good use of Path and sensible defaults.

The initialization and setup logic is clean. Using pathlib.Path for directory operations and providing sensible defaults for output directory and timeout is a solid choice.

39-61: Solid clone implementation with good error handling.

The repository cloning logic is well-structured:

Creates owner directories as needed

Checks for existing repositories to avoid redundant clones

Captures and reports both subprocess failures and exceptions

Returns boolean status for caller to track success

84-103: Clean argument parsing with sensible defaults.

The parse_arguments function is well-structured with clear help text and appropriate defaults for both the number of repositories and output directory.

106-110: Clean main function orchestration.

The main function effectively wires together argument parsing, downloader instantiation, and execution flow in a clear, linear manner.

113-114: Proper main entry point.

The if __name__ == '__main__' block correctly calls the main() function, following Python best practices.

scripts/01-fetch-github.py

coderabbitai · 2025-10-13T23:12:56Z

scripts/01-fetch-github.py

+            print(f"Error fetching trending repositories: {e}", file=sys.stderr)
+            raise
+        soup = BeautifulSoup(response.text, 'html.parser')
+        repositories = []        


⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace.

Lines 33, 36, and 38 have trailing whitespace causing linter failures.

Also applies to: 36-36, 38-38

🧰 Tools

🪛 GitHub Actions: flake8

[error] 33-33: flake8: W291 trailing whitespace

🪛 GitHub Actions: pylint

[error] 33-33: Trailing whitespace (trailing-whitespace) in line 33

🤖 Prompt for AI Agents

In scripts/01-fetch-github.py around lines 33, 36 and 38, there are trailing spaces causing linter failures; remove the trailing whitespace at the end of those lines (and any other accidental trailing spaces in the file), save the file, and re-run the linter to confirm the warnings are resolved.

scripts/01-fetch-github.py

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

scripts/01-fetch-github.py (2)

15-19: Use a Requests Session and set a User-Agent.

Improves connection reuse and avoids UA-related blocks. Add Session in init and use it.

     def __init__(self, output_dir: str = "target/01"):
         self.output_dir = Path(output_dir)
-        self.trending_url = "https://github.com/trending/java?since=daily"
+        self.trending_url = 'https://github.com/trending/java?since=daily'
         self.request_timeout = 30
+        self.session = requests.Session()
+        self.session.headers.update(
+            {'User-Agent': 'aibolit/01-fetch-github (+https://github.com/cqfn/aibolit)'}
+        )
@@
-            response = requests.get(self.trending_url, timeout=self.request_timeout)
+            response = self.session.get(self.trending_url, timeout=self.request_timeout)

Based on learnings

Also applies to: 26-26

83-102: Validate CLI: ensure --nrepos is non-negative.

Prevents surprising early exit with negative values.

-    parser.add_argument(
+    parser.add_argument(
         '--nrepos',
-        type=int,
+        type=positive_int,
         required=False,
         default=100,
         help='Number of repositories to download (default: 100)'
     )

Add helper (outside this range):

def positive_int(value: str) -> int:
    ivalue = int(value)
    if ivalue < 0:
        raise argparse.ArgumentTypeError('nrepos must be >= 0')
    return ivalue

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 540a5cc and aed4fb2.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🪛 GitHub Actions: flake8

scripts/01-fetch-github.py

[error] 14-14: flake8: W291 trailing whitespace.

[error] 19-19: flake8: W293 blank line contains whitespace.

[error] 23-23: flake8: W293 blank line contains whitespace.

[error] 32-32: flake8: W291 trailing whitespace.

[error] 35-35: flake8: W291 trailing whitespace.

[error] 37-37: flake8: W293 blank line contains whitespace.

[error] 40-40: flake8: W291 trailing whitespace.

[error] 43-43: flake8: W291 trailing whitespace.

[error] 57-57: flake8: W291 trailing whitespace.

[error] 61-61: flake8: W293 blank line contains whitespace.

[error] 63-63: flake8: W291 trailing whitespace.

[error] 68-68: flake8: W291 trailing whitespace.

[error] 69-69: flake8: W291 trailing whitespace.

[error] 72-72: flake8: W291 trailing whitespace.

[error] 75-75: flake8: W291 trailing whitespace.

[error] 76-76: flake8: W291 trailing whitespace.

[error] 79-79: flake8: W291 trailing whitespace.

[error] 106-106: flake8: W291 trailing whitespace.

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[error] 14-14: Trailing whitespace (trailing-whitespace)

[error] 19-19: Trailing whitespace (trailing-whitespace)

[error] 23-23: Trailing whitespace (trailing-whitespace)

[error] 32-32: Trailing whitespace (trailing-whitespace)

[error] 35-35: Trailing whitespace (trailing-whitespace)

[error] 37-37: Trailing whitespace (trailing-whitespace)

[error] 40-40: Trailing whitespace (trailing-whitespace)

[error] 43-43: Trailing whitespace (trailing-whitespace)

[error] 57-57: Trailing whitespace (trailing-whitespace)

[error] 61-61: Trailing whitespace (trailing-whitespace)

[error] 63-63: Trailing whitespace (trailing-whitespace)

[error] 68-68: Trailing whitespace (trailing-whitespace)

[error] 69-69: Trailing whitespace (trailing-whitespace)

[error] 72-72: Trailing whitespace (trailing-whitespace)

[error] 75-75: Trailing whitespace (trailing-whitespace)

[error] 76-76: Trailing whitespace (trailing-whitespace)

[error] 77-77: Trailing whitespace (trailing-whitespace)

[error] 79-79: Trailing whitespace (trailing-whitespace)

[error] 106-106: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff

scripts/01-fetch-github.py

[error] 15-15: Q000: Double quotes found but single quotes preferred

[error] 17-17: Q000: Double quotes found but single quotes preferred

[error] 29-29: Q000: Double quotes found but single quotes preferred

[error] 35-35: Q000: Double quotes found but single quotes preferred

[error] 42-42: Q000: Double quotes found but single quotes preferred

[error] 53-53: Q000: Double quotes found but single quotes preferred

[error] 56-56: Q000: Double quotes found but single quotes preferred

[error] 59-59: Q000: Double quotes found but single quotes preferred

[error] 63-63: Q000: Double quotes found but single quotes preferred

[error] 67-67: Q000: Double quotes found but single quotes preferred

[error] 77-77: Q000: Double quotes found but single quotes preferred

[error] 80-80: Q000: Double quotes found but single quotes preferred

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: latexmk
GitHub Check: make

🔇 Additional comments (2)

scripts/01-fetch-github.py (2)

14-23: Nice refactor to a class with Path-based dirs.

Cleaner structure and directory setup look good.

10-11: Dependencies are declared and pinned
beautifulsoup4==4.14.2 and requests==2.32.5 are pinned in pyproject.toml.

scripts/01-fetch-github.py

coderabbitai · 2025-10-13T23:20:26Z

scripts/01-fetch-github.py

+    def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
+        owner_dir = self.output_dir / owner
+        owner_dir.mkdir(exist_ok=True)        
+        if (owner_dir / repo_name).exists():
+            print(f"Repository {owner}/{repo_name} already exists, skipping...")
+            return True        
+        try:
+            result = subprocess.run(
+                ['git', 'clone', repo_url],
+                cwd=owner_dir,
+                capture_output=True,
+                text=True,
+                check=False
+            )
+            if result.returncode == 0:
+                print(f"Successfully cloned {owner}/{repo_name}")
+                return True
+            else:
+                print(f"Failed to clone {owner}/{repo_name}: {result.stderr}")
+                return False                
+        except subprocess.SubprocessError as e:
+            print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr)
+            return False
+


⚠️ Potential issue | 🟠 Major

Harden cloning: shallow clone, timeout, parents=True, and handle missing git.

Prevents hangs, reduces bandwidth, and avoids crashes when git is absent.

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool: owner_dir = self.output_dir / owner - owner_dir.mkdir(exist_ok=True) + owner_dir.mkdir(parents=True, exist_ok=True) if (owner_dir / repo_name).exists(): - print(f"Repository {owner}/{repo_name} already exists, skipping...") + print(f'Repository {owner}/{repo_name} already exists, skipping...') return True try: result = subprocess.run( - ['git', 'clone', repo_url], + ['git', 'clone', '--depth', '1', repo_url], cwd=owner_dir, capture_output=True, text=True, - check=False + check=False, + timeout=300, ) if result.returncode == 0: - print(f"Successfully cloned {owner}/{repo_name}") + print(f'Successfully cloned {owner}/{repo_name}') return True else: - print(f"Failed to clone {owner}/{repo_name}: {result.stderr}") + print(f'Failed to clone {owner}/{repo_name}: {result.stderr}') return False + except FileNotFoundError: + print('git not found on PATH; cannot clone repositories.', file=sys.stderr) + return False except subprocess.SubprocessError as e: - print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr) + print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr) return False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:

owner_dir = self.output_dir / owner

owner_dir.mkdir(exist_ok=True)

if (owner_dir / repo_name).exists():

print(f"Repository {owner}/{repo_name} already exists, skipping...")

return True

try:

result = subprocess.run(

['git', 'clone', repo_url],

cwd=owner_dir,

capture_output=True,

text=True,

check=False

)

if result.returncode == 0:

print(f"Successfully cloned {owner}/{repo_name}")

return True

else:

print(f"Failed to clone {owner}/{repo_name}: {result.stderr}")

return False

except subprocess.SubprocessError as e:

print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr)

return False

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:

owner_dir = self.output_dir / owner

owner_dir.mkdir(parents=True, exist_ok=True)

if (owner_dir / repo_name).exists():

print(f'Repository {owner}/{repo_name} already exists, skipping...')

return True

try:

result = subprocess.run(

['git', 'clone', '--depth', '1', repo_url],

cwd=owner_dir,

capture_output=True,

text=True,

check=False,

timeout=300,

)

if result.returncode == 0:

print(f'Successfully cloned {owner}/{repo_name}')

return True

else:

print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')

return False

except FileNotFoundError:

print('git not found on PATH; cannot clone repositories.', file=sys.stderr)

return False

except subprocess.SubprocessError as e:

print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)

return False

🧰 Tools

🪛 GitHub Actions: flake8

[error] 40-40: flake8: W291 trailing whitespace.

[error] 43-43: flake8: W291 trailing whitespace.

[error] 57-57: flake8: W291 trailing whitespace.

[error] 61-61: flake8: W293 blank line contains whitespace.

🪛 GitHub Actions: pylint

[error] 40-40: Trailing whitespace (trailing-whitespace)

[error] 43-43: Trailing whitespace (trailing-whitespace)

[error] 57-57: Trailing whitespace (trailing-whitespace)

[error] 61-61: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff

[error] 42-42: Q000: Double quotes found but single quotes preferred

[error] 53-53: Q000: Double quotes found but single quotes preferred

[error] 56-56: Q000: Double quotes found but single quotes preferred

[error] 59-59: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents

In scripts/01-fetch-github.py around lines 38 to 61, the clone_repository method should be hardened: create owner_dir with owner_dir.mkdir(parents=True, exist_ok=True); perform a shallow clone by adding '--depth', '1' to the git args to reduce bandwidth and time; pass a timeout (e.g. timeout=60) to subprocess.run and catch subprocess.TimeoutExpired to handle hangs; also catch FileNotFoundError to produce a clear error when git is not installed; keep checking result.returncode and surface result.stderr on failure. Ensure you add those exception handlers and use the parents=True mkdir flag.

scripts/01-fetch-github.py

coderabbitai · 2025-10-13T23:20:26Z

scripts/01-fetch-github.py

+        for repo_url in repositories:
+            if downloaded_count >= max_repositories:
+                break                
+            path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/')
+            if len(path_parts) < 2:
+                continue                
+            owner, repo_name = path_parts[0], path_parts[1]            
+            print(f"Processing {owner}/{repo_name}...")            
+            if self.clone_repository(repo_url, owner, repo_name):
+                downloaded_count += 1        


⚠️ Potential issue | 🟠 Major

Parse owner/repo via urlparse for correctness across URL shapes.

String replace is brittle; use URL parsing.

- path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/') - if len(path_parts) < 2: - continue - owner, repo_name = path_parts[0], path_parts[1] - print(f"Processing {owner}/{repo_name}...") + parsed = urlparse(repo_url) + parts = [p for p in parsed.path.strip('/').split('/') if p] + if len(parts) < 2: + continue + owner, repo_name = parts[0], parts[1].split('.git', 1)[0] + print(f'Processing {owner}/{repo_name}...') if self.clone_repository(repo_url, owner, repo_name): downloaded_count += 1

Add import (outside this range):

from urllib.parse import urlparse

🧰 Tools

🪛 GitHub Actions: flake8

[error] 72-72: flake8: W291 trailing whitespace.

[error] 75-75: flake8: W291 trailing whitespace.

[error] 76-76: flake8: W291 trailing whitespace.

[error] 79-79: flake8: W291 trailing whitespace.

🪛 GitHub Actions: pylint

[error] 72-72: Trailing whitespace (trailing-whitespace)

[error] 75-75: Trailing whitespace (trailing-whitespace)

[error] 76-76: Trailing whitespace (trailing-whitespace)

[error] 77-77: Trailing whitespace (trailing-whitespace)

[error] 79-79: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff

[error] 77-77: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents

In scripts/01-fetch-github.py around lines 70 to 79, parsing owner/repo by string replace is brittle; use urllib.parse.urlparse to handle different GitHub URL shapes. Update the loop to parse repo_url with urlparse, extract the path, strip leading/trailing slashes, split on '/' and validate there are at least two segments, then assign owner and repo_name from those segments (remove a trailing .git from repo_name if present). Also add the import "from urllib.parse import urlparse" at the top of the file.

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aed4fb2 and 5126f0a.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[error] 32-32: Pylint: Line too long (104/100) (C0301). Command 'make pylint' failed with exit code 2.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: latexmk
GitHub Check: make

🔇 Additional comments (7)

scripts/01-fetch-github.py (7)

14-18: LGTM!

The class initialization is well-structured with sensible defaults and proper use of Path for filesystem operations.

20-21: LGTM!

The directory setup correctly uses parents=True to ensure intermediate directories are created.

23-42: Good error handling and URL normalization.

The method correctly handles HTTP errors and normalizes repository URLs. The CSS selectors properly target GitHub Trending entries.

68-86: Approve the overall coordination logic.

The method correctly orchestrates fetching and cloning with appropriate error handling and progress reporting. Once the URL parsing is fixed, this will be robust.

89-108: LGTM!

The argument parsing is well-structured with clear descriptions and sensible defaults.

111-115: LGTM!

The main function provides a clean entry point with proper initialization sequence.

118-119: LGTM!

The entry point correctly uses the standard Python idiom.

scripts/01-fetch-github.py

coderabbitai · 2025-10-13T23:54:39Z

scripts/01-fetch-github.py

+    def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
+        owner_dir = self.output_dir / owner
+        owner_dir.mkdir(exist_ok=True)
+        if (owner_dir / repo_name).exists():
+            print(f'Repository {owner}/{repo_name} already exists, skipping...')
+            return True
+        try:
+            result = subprocess.run(
+                ['git', 'clone', repo_url],
+                cwd=owner_dir,
+                capture_output=True,
+                text=True,
+                check=False
+            )
+            if result.returncode == 0:
+                print(f'Successfully cloned {owner}/{repo_name}')
+                return True
+            else:
+                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
+                return False
+        except subprocess.SubprocessError as e:
+            print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
+            return False


⚠️ Potential issue | 🟠 Major

Address unresolved hardening concerns from previous review.

The clone operation still lacks several critical improvements requested in the previous review:

Line 46: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested paths.

Lines 52: Missing --depth 1 flag for shallow cloning (reduces bandwidth and time).

Line 51: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).

Line 64: Missing FileNotFoundError handler to gracefully handle missing git executable.

Apply this diff to address all concerns:

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool: owner_dir = self.output_dir / owner - owner_dir.mkdir(exist_ok=True) + owner_dir.mkdir(parents=True, exist_ok=True) if (owner_dir / repo_name).exists(): print(f'Repository {owner}/{repo_name} already exists, skipping...') return True try: result = subprocess.run( - ['git', 'clone', repo_url], + ['git', 'clone', '--depth', '1', repo_url], cwd=owner_dir, capture_output=True, text=True, - check=False + check=False, + timeout=300, ) if result.returncode == 0: print(f'Successfully cloned {owner}/{repo_name}') return True else: print(f'Failed to clone {owner}/{repo_name}: {result.stderr}') return False + except FileNotFoundError: + print('git not found on PATH; cannot clone repositories.', file=sys.stderr) + return False + except subprocess.TimeoutExpired: + print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr) + return False except subprocess.SubprocessError as e: print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr) return False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:

owner_dir = self.output_dir / owner

owner_dir.mkdir(exist_ok=True)

if (owner_dir / repo_name).exists():

print(f'Repository {owner}/{repo_name} already exists, skipping...')

return True

try:

result = subprocess.run(

['git', 'clone', repo_url],

cwd=owner_dir,

capture_output=True,

text=True,

check=False

)

if result.returncode == 0:

print(f'Successfully cloned {owner}/{repo_name}')

return True

else:

print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')

return False

except subprocess.SubprocessError as e:

print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)

return False

def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:

owner_dir = self.output_dir / owner

owner_dir.mkdir(parents=True, exist_ok=True)

if (owner_dir / repo_name).exists():

print(f'Repository {owner}/{repo_name} already exists, skipping...')

return True

try:

result = subprocess.run(

['git', 'clone', '--depth', '1', repo_url],

cwd=owner_dir,

capture_output=True,

text=True,

check=False,

timeout=300,

)

if result.returncode == 0:

print(f'Successfully cloned {owner}/{repo_name}')

return True

else:

print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')

return False

except FileNotFoundError:

print('git not found on PATH; cannot clone repositories.', file=sys.stderr)

return False

except subprocess.TimeoutExpired:

print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)

return False

except subprocess.SubprocessError as e:

print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)

return False

🤖 Prompt for AI Agents

In scripts/01-fetch-github.py around lines 44 to 66, the clone_repository implementation needs hardening: create owner_dir with owner_dir.mkdir(parents=True, exist_ok=True) to handle nested paths; run git with a shallow clone flag by adding '--depth', '1' to the git clone args; pass a sensible timeout to subprocess.run (e.g., timeout=60) to avoid hanging; and add a separate except FileNotFoundError handler to log a clear error when the git executable is missing (return False), keeping the existing SubprocessError handler for other subprocess errors.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

48-70: Critical hardening issues remain unaddressed from previous review.

The clone_repository method still lacks several critical improvements that were requested in the previous review but have not been implemented:

Line 50: owner_dir.mkdir(exist_ok=True) is missing parents=True, which will cause failures if the owner path has nested directories.
Line 56: The git clone command is missing --depth 1 for shallow cloning, wasting bandwidth and time by cloning full repository history.
Line 55: subprocess.run is missing the timeout parameter, allowing the process to hang indefinitely on unresponsive remotes.
Line 68: Only catches SubprocessError, but missing a FileNotFoundError handler to provide a clear error message when the git executable is not installed.

Apply this diff to address all unresolved hardening concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5126f0a and fc5fa0e.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: make
GitHub Check: latexmk

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

48-70: Address unresolved hardening concerns from previous review.

The clone operation still lacks critical improvements flagged in previous reviews:

Line 50: owner_dir.mkdir(exist_ok=True) missing parents=True for nested paths
Line 56: Missing --depth 1 flag for shallow cloning (reduces bandwidth/time)
Line 55: Missing timeout parameter on subprocess.run (prevents hangs)
Line 68: Missing FileNotFoundError handler for missing git executable

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

🧹 Nitpick comments (1)

scripts/01-fetch-github.py (1)

115-119: LGTM! Consider making URL and timeout configurable.

The main function flow is correct. Optionally, you could add CLI arguments for --url and --timeout to make these configurable instead of hardcoded on line 117.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc5fa0e and 239e8d3.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

🪛 GitHub Actions: flake8

scripts/01-fetch-github.py

[error] 15-15: SyntaxError: non-default argument follows default argument (E999) during linting. This caused 'flake8 --max-line-length=120' to fail.

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[error] 15-15: Pylint parsing failed: 'non-default argument follows default argument' (syntax-error) in 01-fetch-github.py:15

🪛 GitHub Actions: ruff

scripts/01-fetch-github.py

[error] 15-15: Syntax error: Parameter without a default cannot follow a parameter with a default in init signature.

🪛 GitHub Actions: ty

scripts/01-fetch-github.py

[error] 15-15: Step: uvx ty==0.0.1-alpha.8 check. Command failed with exit code 1. invalid-syntax: Parameter without a default cannot follow a parameter with a default in init(self, output_dir: str = 'target/01', trend_url, timeout).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: latexmk
GitHub Check: make

🔇 Additional comments (4)

scripts/01-fetch-github.py (4)

20-21: LGTM!

The directory setup correctly uses parents=True to handle nested paths.

23-46: LGTM!

The implementation is robust with proper error handling, multiple CSS selector fallbacks, and defensive URL normalization. The timeout and error propagation ensure the caller can handle failures appropriately.

93-112: LGTM!

The argument parser is well-structured with sensible defaults and clear help text.

122-123: LGTM!

Standard and correct __main__ guard implementation.

scripts/01-fetch-github.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

48-70: Critical: Address unresolved hardening concerns from previous reviews.

The clone operation still lacks several critical improvements requested in previous reviews:

Line 50: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested directory paths safely.
Line 56: Missing --depth 1 flag for shallow cloning (significantly reduces bandwidth and time).
Line 55: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).
Line 68: Missing FileNotFoundError handler to gracefully handle missing git executable.

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 239e8d3 and c60fa36.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[error] 117-117: Pylint: Line too long (C0301) 106/100 in scripts/01-fetch-github.py:117. Command that failed: 'make pylint' (uv run pylint aibolit test scripts --ignore=scripts/target).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: make
GitHub Check: latexmk

🔇 Additional comments (1)

scripts/01-fetch-github.py (1)
15-18: Critical: Fix parameter ordering in __init__ signature.

Parameters with default values must come after parameters without defaults. This syntax error is blocking all CI pipelines (flake8, pylint, ruff, ty).

Apply this diff to fix the parameter ordering:
-    def __init__(self, output, url, timeout):
+    def __init__(self, url, timeout, output='target/01'):
Alternatively, if all parameters should be explicit:
-    def __init__(self, output, url, timeout):
+    def __init__(self, output: str, url: str, timeout: int):
Then update the call site at line 117 to pass all three arguments explicitly.

Likely an incorrect or invalid review comment.

scripts/01-fetch-github.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

48-70: Critical: Implement all requested clone hardening improvements.

This method still lacks the critical improvements requested in multiple previous reviews:

Line 50: mkdir missing parents=True — will fail on nested owner paths.
Line 56: Missing --depth 1 flag — wastes bandwidth and time on full clones.
Line 55: Missing timeout parameter — process can hang indefinitely on unresponsive remotes.
Line 68: Missing FileNotFoundError handler — crashes ungracefully when git is not installed.

Apply this diff to implement all hardening improvements:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

🧹 Nitpick comments (1)

scripts/01-fetch-github.py (1)
15-18: Add type hints to __init__ parameters.

The parameters lack type annotations, reducing code clarity and preventing static type checkers from catching errors.

Apply this diff:
-    def __init__(self, output, url, timeout):
+    def __init__(self, output: str, url: str, timeout: int):
         self.output_dir = Path(output)
         self.trending_url = url
         self.request_timeout = timeout

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c60fa36 and e396bee.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🔇 Additional comments (2)

scripts/01-fetch-github.py (2)

20-46: LGTM!

The directory setup and repository fetching logic are well implemented:

setup_directories correctly uses parents=True to handle nested paths.

fetch_trending_repositories has robust error handling and uses a fallback chain of DOM selectors.

Defensive parsing of href attributes prevents crashes on malformed HTML.

93-127: LGTM!

The CLI argument parsing, main function, and module guard are well structured:

parse_arguments properly configures defaults and help text.

main orchestrates the download workflow cleanly.

The __main__ guard correctly invokes main().

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

48-70: Critical: Address unresolved hardening concerns from previous reviews.

The clone operation still lacks several critical improvements requested in multiple previous reviews:

Line 50: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested paths.
Line 56: Missing --depth 1 flag for shallow cloning (reduces bandwidth and time).
Line 55: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).
Line 68: Missing FileNotFoundError handler to gracefully handle missing git executable.
Missing TimeoutExpired handler for the timeout parameter.

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e396bee and 9ac9c53.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: make
GitHub Check: latexmk

scripts/01-fetch-github.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

scripts/01-fetch-github.py (1)

49-71: Harden git clone: shallow clone, timeout, parents=True, and handle missing git/timeouts.

Prevents hangs, reduces bandwidth, and improves error handling. This was flagged previously and remains unresolved.

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
-                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
+                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}', file=sys.stderr)
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

🧹 Nitpick comments (4)

scripts/01-fetch-github.py (4)

24-47: Trending parsing: solid selectors and normalization. Minor hardening suggestion.

Consider adding a User-Agent to reduce chances of 403s from GitHub.

-            response = requests.get(self.trending_url, timeout=self.request_timeout)
+            response = requests.get(
+                self.trending_url,
+                timeout=self.request_timeout,
+                headers={'User-Agent': 'aibolit/1.0 (+https://github.com/cqfn/aibolit)'}
+            )

73-79: Minor: print fetch failure to stderr.

-            print('Failed to fetch repository list. Exiting.')
+            print('Failed to fetch repository list. Exiting.', file=sys.stderr)

8-8: Use built‑in generics (list[str]) to drop typing.List.

-from typing import List
+# from typing import List  # no longer needed with built-in generics

-    def fetch_trending_repositories(self) -> List[str]:
+    def fetch_trending_repositories(self) -> list[str]:
-        repositories: List[str] = []
+        repositories: list[str] = []

Also applies to: 24-25, 32-33

96-107: Validate --nrepos as positive integer.

-def parse_arguments() -> argparse.Namespace:
+def parse_arguments() -> argparse.Namespace:
+    def positive_int(value: str) -> int:
+        iv = int(value)
+        if iv < 1:
+            raise argparse.ArgumentTypeError('nrepos must be a positive integer')
+        return iv
@@
-        '--nrepos',
-        type=int,
+        '--nrepos',
+        type=positive_int,
         required=False,
         default=100,

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef4522 and 75b1a21.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

🪛 GitHub Actions: pylint

scripts/01-fetch-github.py

[warning] 12-12: C0411: standard import 'urllib.parse.urlparse' should be placed before third party imports 'requests', 'bs4.BeautifulSoup' (wrong-import-order)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: make
GitHub Check: latexmk

🔇 Additional comments (2)

scripts/01-fetch-github.py (2)

16-19: Constructor typing and order look good.

84-90: Good switch to urlparse; confirm Python baseline for removesuffix.

str.removesuffix needs Python 3.9+. If 3.8 must be supported, use rsplit('.git', 1)[0] instead.

coderabbitai · 2025-10-14T16:14:45Z

scripts/01-fetch-github.py

+import requests
 from bs4 import BeautifulSoup
+from urllib.parse import urlparse


⚠️ Potential issue | 🟠 Major

Fix import order to unblock pylint (standard libs before third‑party).

Move urlparse above requests/bs4 and separate groups with a blank line.

-import requests -from bs4 import BeautifulSoup -from urllib.parse import urlparse +from urllib.parse import urlparse + +import requests +from bs4 import BeautifulSoup

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import requests

from bs4 import BeautifulSoup

from urllib.parse import urlparse

from urllib.parse import urlparse

import requests

from bs4 import BeautifulSoup

🧰 Tools

🪛 GitHub Actions: pylint

[warning] 12-12: C0411: standard import 'urllib.parse.urlparse' should be placed before third party imports 'requests', 'bs4.BeautifulSoup' (wrong-import-order)

🤖 Prompt for AI Agents

In scripts/01-fetch-github.py around lines 10 to 12, the standard-library import from urllib.parse (urlparse) should be moved above third-party imports and groups separated by a blank line to satisfy pylint; reorder so standard-library imports come first (urllib.parse import), add a single blank line, then keep requests and bs4 imports below.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

scripts/01-fetch-github.py (2)

56-71: Major: Add shallow cloning, timeout, and missing error handlers.

These hardening improvements were requested in previous reviews but remain unaddressed:

Line 57: Missing --depth 1 for shallow cloning (wastes bandwidth and time downloading full history)
Line 56: Missing timeout parameter (could hang indefinitely)
Line 69: Missing FileNotFoundError handler (crashes unclearly if git not installed)

Apply this diff:

         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

51-51: Critical: Add parents=True to prevent failures on nested paths.

This issue was flagged in previous reviews but remains unresolved. Line 51 uses mkdir(exist_ok=True) without parents=True, which will fail if the owner name requires intermediate directory creation.

Apply this diff:

-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)

🧹 Nitpick comments (1)

scripts/01-fetch-github.py (1)

118-126: LGTM! Clean entrypoint with proper initialization.

The main function correctly:

Parses CLI arguments
Converts output path to Path object
Initializes downloader with appropriate parameters
Calls methods in correct order

Optional: Consider making the trending URL and timeout configurable via CLI arguments for flexibility:

     parser.add_argument(
         '--output-dir',
         type=str,
         required=False,
         default='target/01',
         help='Output directory for downloaded repositories (default: target/01)'
     )
+    parser.add_argument(
+        '--url',
+        type=str,
+        required=False,
+        default='https://github.com/trending/java?since=daily',
+        help='Trending URL to fetch repositories from'
+    )
+    parser.add_argument(
+        '--timeout',
+        type=int,
+        required=False,
+        default=30,
+        help='HTTP request timeout in seconds (default: 30)'
+    )
     return parser.parse_args()

Then update main:

     downloader = RepositoryDownloader(
         Path(args.output_dir),
-        'https://github.com/trending/java?since=daily',
-        30
+        args.url,
+        args.timeout
     )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef4522 and 91392da.

📒 Files selected for processing (1)

scripts/01-fetch-github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

scripts/01-fetch-github.py (1)

stubs/bs4/__init__.pyi (1)

BeautifulSoup (7-37)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: latexmk
GitHub Check: make

🔇 Additional comments (7)

scripts/01-fetch-github.py (7)

4-12: LGTM! Imports are clean and necessary.

All imports are used and properly organized (standard library first, then third-party).

15-19: LGTM! Constructor is well-typed and clear.

Type hints are present and correct, parameter order is valid, and attribute names are descriptive.

21-22: LGTM! Directory creation is robust.

Using parents=True and exist_ok=True together ensures the directory structure is created safely without failing on existing paths.

24-47: LGTM! Robust repository fetching with good error handling.

The implementation includes:

Timeout on HTTP request

Proper exception handling with re-raise

Multiple CSS selector fallbacks for different GitHub layouts

Defensive href extraction preventing KeyError/AttributeError

URL normalization ensuring consistent format

73-93: LGTM! Clean orchestration with proper URL parsing.

The method correctly:

Handles fetch errors gracefully

Uses urlparse for robust URL parsing (addressing previous brittleness concerns)

Validates path structure before extraction

Uses removesuffix('.git') for clean repo name extraction

Respects max_repositories limit

96-115: LGTM! Standard argparse usage with clear defaults.

The CLI configuration is straightforward with sensible defaults (100 repos, 'target/01' directory) and descriptive help text.

129-130: LGTM! Standard Python entrypoint.

Correct usage of the __main__ guard calling the main function.

kodsurfer · 2025-10-14T16:30:25Z

@yegor256 @literally-bug-creator hello, please check PR

deyna256 · 2025-10-14T16:36:17Z

@kodsurfer First things first, we should cover the code with tests, and then we can patch it
@yegor256 wdyt?

kodsurfer · 2025-10-15T09:34:58Z

@literally-bug-creator @yegor256 I can add tests for new code

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

test/scripts/test_fetch_github.py (2)
178-199: Consider testing production code directly instead of duplicating logic.

The _normalize_url helper method duplicates the URL normalization logic from fetch_trending_repositories. This creates a maintenance burden: if the production logic changes, this test helper must also be updated, and the test might pass even when production behavior is incorrect.

Consider one of these approaches:

Option 1: Test the normalization through fetch_trending_repositories itself (which you're already doing in lines 35-62).

Option 2: If URL normalization is complex enough to warrant separate testing, extract it as a standalone method in the production code:

In scripts/01-fetch-github.py:
@staticmethod
def normalize_github_url(href: str) -> str:
    """Normalize a GitHub repository URL."""
    if not href:
        return href
    if href.startswith('/'):
        href = f'https://github.com{href}'
    if not href.endswith('.git'):
        href = f'{href}.git'
    return href
Then test it directly:
def test_url_parsing_and_normalization(self):
    """Test URL parsing and .git suffix normalization"""
    test_cases = [...]
    for input_href, expected_url in test_cases:
        with self.subTest(href=input_href):
            normalized = RepositoryDownloader.normalize_github_url(input_href)
            self.assertEqual(normalized, expected_url)
1-237: Consider standardizing quote style for consistency.

The linter (ruff) flags multiple instances where double quotes are used instead of single quotes. While this doesn't affect functionality, consistent quote usage improves code readability.

If your project prefers single quotes (as indicated by the ruff configuration), you can run an automated formatter to fix all instances at once:
#!/bin/bash
# Use ruff to automatically fix quote style
ruff check --select Q --fix test/scripts/test_fetch_github.py

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91392da and 35f5b68.

📒 Files selected for processing (1)

test/scripts/test_fetch_github.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

test/scripts/test_fetch_github.py (1)

scripts/01-fetch-github.py (6)

RepositoryDownloader (15-93)

setup_directories (21-22)

fetch_trending_repositories (24-47)

clone_repository (49-71)

download_repositories (73-93)

parse_arguments (96-115)

🪛 GitHub Actions: flake8

test/scripts/test_fetch_github.py

[error] 4-4: F401 'os' imported but unused.

[error] 8-8: F401 'unittest.mock.MagicMock' imported but unused.

[error] 220-220: W291 trailing whitespace.

[error] 224-224: E302 expected 2 blank lines, found 1.

[error] 228-228: W291 trailing whitespace.

[error] 230-230: F841 local variable 'mock_main' is assigned to but never used.

🪛 GitHub Actions: make

test/scripts/test_fetch_github.py

[error] 12-12: ModuleNotFoundError: No module named 'RepositoryDownloader' during test collection. Ensure the dependency is installed or adjust import path.

🪛 GitHub Actions: ruff

test/scripts/test_fetch_github.py

[error] 4-4: F401: 'os' imported but unused

[error] 8-8: F401: 'unittest.mock.MagicMock' imported but unused

[warning] 18-18: Q000: Double quotes found but single quotes preferred

[warning] 21-21: Q000: Double quotes found but single quotes preferred

[warning] 40-40: Q001: Double quotes found but single quotes preferred

[warning] 55-55: Q000: Double quotes found but single quotes preferred

[warning] 60-60: Q000: Double quotes found but single quotes preferred

[warning] 66-66: Q001: Double quotes found but single quotes preferred

[warning] 75-75: Q000: Double quotes found but single quotes preferred

[warning] 85-85: Q000: Double quotes found but single quotes preferred

[warning] 88-88: Q000: Double quotes found but single quotes preferred

[warning] 89-89: Q000: Double quotes found but single quotes preferred

[warning] 90-90: Q000: Double quotes found but single quotes preferred

[warning] 95-95: Q000: Double quotes found but single quotes preferred

[warning] 105-105: Q000: Double quotes found but single quotes preferred

[warning] 116-116: Q000: Double quotes found but single quotes preferred

[warning] 126-126: Q000: Double quotes found but single quotes preferred

[warning] 131-131: Q000: Double quotes found but single quotes preferred

[warning] 132-132: Q000: Double quotes found but single quotes preferred

[warning] 133-133: Q000: Double quotes found but single quotes preferred

[warning] 140-140: Q000: Double quotes found but single quotes preferred

[warning] 141-141: Q000: Double quotes found but single quotes preferred

[warning] 152-152: Q000: Double quotes found but single quotes preferred

[warning] 162-162: Q000: Double quotes found but single quotes preferred

[warning] 169-169: Q000: Double quotes found but single quotes preferred

[warning] 181-181: Q000: Double quotes found but single quotes preferred

[warning] 230-230: F841: Local variable mock_main is assigned to but never used

🪛 GitHub Actions: ty

test/scripts/test_fetch_github.py

[error] 12-12: Cannot resolve imported module RepositoryDownloader.

[error] 206-206: Cannot resolve imported module your_module.

[error] 219-219: Cannot resolve imported module your_module.

[error] 231-231: Cannot resolve imported module your_module.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: latexmk

🔇 Additional comments (4)

test/scripts/test_fetch_github.py (4)
15-33: LGTM!

The test class setup, teardown, and directory creation test are well-structured. Proper use of temporary directories ensures test isolation.

124-135: LGTM!

The test correctly validates that existing repositories are skipped without requiring mocks.

236-237: LGTM!

Standard unittest main guard is correctly implemented.

35-62: Add missing @patch decorator.

The test method expects a mock_get parameter but lacks the required @patch decorator.

Apply this diff:
+@patch('scripts.01-fetch-github.requests.get')
 def test_fetch_trending_repositories_success(self, mock_get):
Note: Adjust the patch path to match your actual module structure (e.g., scripts.fetch_github.requests.get if using Python-friendly naming).

Likely an incorrect or invalid review comment.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+# SPDX-FileCopyrightText: Copyright (c) 2019-2025 Aibolit
+# SPDX-License-Identifier: MIT
+
+import os


⚠️ Potential issue | 🔴 Critical

Fix critical import errors and remove unused imports.

Multiple import issues detected:

Line 4: os is imported but never used

Line 8: MagicMock is imported but never used

Line 12: The import statement is completely incorrect and will cause ModuleNotFoundError

Apply this diff to fix the imports:

-import os import tempfile import unittest from pathlib import Path -from unittest.mock import Mock, patch, MagicMock +from unittest.mock import Mock, patch import subprocess import requests -import RepositoryDownloader +from scripts['01-fetch-github'] import RepositoryDownloader

Note: The actual import path may need adjustment based on your project structure. Common alternatives:

from scripts.fetch_github import RepositoryDownloader (if the file is renamed)

Add the scripts directory to sys.path before importing

Also applies to: 8-8, 12-12

🧰 Tools

🪛 GitHub Actions: flake8

[error] 4-4: F401 'os' imported but unused.

🪛 GitHub Actions: ruff

[error] 4-4: F401: 'os' imported but unused

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 4, 8 and 12 remove unused imports and fix the broken import: delete the unused "import os" (line 4) and remove the unused "MagicMock" import (line 8), then replace the incorrect module import on line 12 with a correct path to the tested class (for example use "from scripts.fetch_github import RepositoryDownloader" or the proper relative import for your package); if the test runner cannot find the scripts package, add a short sys.path append before importing to include the project root (e.g. insert sys.path manipulation above the import) so the import resolves.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+    def test_fetch_trending_repositories_request_exception(self, mock_get):
+        """Test handling of request exceptions"""
+        mock_get.side_effect = requests.RequestException("Network error")
+
+        with self.assertRaises(requests.RequestException):
+            self.downloader.fetch_trending_repositories()
+
+    def test_fetch_trending_repositories_empty_anchors(self, mock_get):
+        """Test handling of empty repository list"""
+        mock_response = Mock()
+        mock_response.raise_for_status.return_value = None
+        mock_response.text = "<html><body></body></html>"
+        mock_get.return_value = mock_response
+
+        repositories = self.downloader.fetch_trending_repositories()
+        self.assertEqual(repositories, [])


⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

Both test methods expect a mock_get parameter but lack the required @patch decorator.

Apply this diff:

+@patch('scripts.01-fetch-github.requests.get') def test_fetch_trending_repositories_request_exception(self, mock_get): """Test handling of request exceptions""" mock_get.side_effect = requests.RequestException("Network error") with self.assertRaises(requests.RequestException): self.downloader.fetch_trending_repositories() +@patch('scripts.01-fetch-github.requests.get') def test_fetch_trending_repositories_empty_anchors(self, mock_get):

🧰 Tools

🪛 GitHub Actions: ruff

[warning] 66-66: Q001: Double quotes found but single quotes preferred

[warning] 75-75: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 64 to 79, the two test methods declare a mock_get parameter but are missing the @patch decorator that supplies it; add @patch('requests.get') above both test_fetch_trending_repositories_request_exception and test_fetch_trending_repositories_empty_anchors so the mock_get argument is injected, and import patch from unittest.mock if not already imported.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+    def test_clone_repository_success(self, mock_subprocess):
+        """Test successful repository cloning"""
+        mock_result = Mock()
+        mock_result.returncode = 0
+        mock_result.stderr = ""
+        mock_subprocess.return_value = mock_result
+        result = self.downloader.clone_repository(
+            "https://github.com/owner/repo.git",
+            "owner",
+            "repo"
+        )
+        self.assertTrue(result)
+        mock_subprocess.assert_called_once_with(
+            ['git', 'clone', 'https://github.com/owner/repo.git'],
+            cwd=self.test_output_dir / "owner",
+            capture_output=True,
+            text=True,
+            check=False
+        )
+
+    def test_clone_repository_failure(self, mock_subprocess):
+        """Test repository cloning failure"""
+        mock_result = Mock()
+        mock_result.returncode = 1
+        mock_result.stderr = "Authentication failed"
+        mock_subprocess.return_value = mock_result
+        result = self.downloader.clone_repository(
+            "https://github.com/owner/repo.git",
+            "owner",
+            "repo"
+        )
+        self.assertFalse(result)
+
+    def test_clone_repository_subprocess_error(self, mock_subprocess):
+        """Test handling of subprocess errors"""
+        mock_subprocess.side_effect = subprocess.SubprocessError("Git not found")
+        result = self.downloader.clone_repository(
+            "https://github.com/owner/repo.git",
+            "owner",
+            "repo"
+        )
+        self.assertFalse(result)


⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

Three test methods expect a mock_subprocess parameter but lack the required @patch decorator for subprocess.run.

Apply this diff:

+@patch('subprocess.run') def test_clone_repository_success(self, mock_subprocess): """Test successful repository cloning""" mock_result = Mock() mock_result.returncode = 0 mock_result.stderr = "" mock_subprocess.return_value = mock_result ... +@patch('subprocess.run') def test_clone_repository_failure(self, mock_subprocess): """Test repository cloning failure""" ... +@patch('subprocess.run') def test_clone_repository_subprocess_error(self, mock_subprocess):

🧰 Tools

🪛 GitHub Actions: ruff

[warning] 85-85: Q000: Double quotes found but single quotes preferred

[warning] 88-88: Q000: Double quotes found but single quotes preferred

[warning] 89-89: Q000: Double quotes found but single quotes preferred

[warning] 90-90: Q000: Double quotes found but single quotes preferred

[warning] 95-95: Q000: Double quotes found but single quotes preferred

[warning] 105-105: Q000: Double quotes found but single quotes preferred

[warning] 116-116: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 81 to 122, the three tests take a mock_subprocess argument but do not patch subprocess.run; add @patch('subprocess.run') decorators above test_clone_repository_success, test_clone_repository_failure, and test_clone_repository_subprocess_error so the mock_subprocess parameter is provided by the test framework; ensure the patch import (from unittest.mock import patch) is present at the top of the file if missing.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+    def test_download_repositories_success(self, mock_clone, mock_fetch):
+        """Test successful download of multiple repositories"""
+        mock_fetch.return_value = [
+            "https://github.com/owner1/repo1.git",
+            "https://github.com/owner2/repo2.git",
+            "https://github.com/owner3/repo3.git"
+        ]
+        mock_clone.return_value = True
+        self.downloader.download_repositories(max_repositories=2)
+        self.assertEqual(mock_clone.call_count, 2)
+        mock_fetch.assert_called_once()
+
+    def test_download_repositories_partial_success(self, mock_clone, mock_fetch):
+        """Test download with some repositories failing to clone"""
+        mock_fetch.return_value = [
+            "https://github.com/owner1/repo1.git",
+            "https://github.com/owner2/repo2.git",
+            "https://github.com/owner3/repo3.git"
+        ]
+        mock_clone.side_effect = [True, False, True]
+        self.downloader.download_repositories(max_repositories=3)
+        self.assertEqual(mock_clone.call_count, 3)
+
+    def test_download_repositories_fetch_fails(self, mock_fetch):
+        """Test download when fetching repository list fails"""
+        mock_fetch.side_effect = requests.RequestException("Fetch failed")
+        self.downloader.download_repositories(max_repositories=10)
+        mock_fetch.assert_called_once()
+
+    def test_download_repositories_invalid_urls(self, mock_clone, mock_fetch):
+        """Test handling of invalid repository URLs"""
+        mock_fetch.return_value = [
+            "https://github.com/owner1/repo1.git",
+            "invalid_url",
+            "https://github.com/singlepart",
+            "https://github.com/owner2/repo2.git"
+        ]
+        mock_clone.return_value = True
+        self.downloader.download_repositories(max_repositories=10)
+        self.assertEqual(mock_clone.call_count, 2)


⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

All four test methods expect mock parameters but lack the required @patch decorators.

Apply this diff:

+@patch.object(RepositoryDownloader, 'fetch_trending_repositories') +@patch.object(RepositoryDownloader, 'clone_repository') def test_download_repositories_success(self, mock_clone, mock_fetch): """Test successful download of multiple repositories""" ... +@patch.object(RepositoryDownloader, 'fetch_trending_repositories') +@patch.object(RepositoryDownloader, 'clone_repository') def test_download_repositories_partial_success(self, mock_clone, mock_fetch): """Test download with some repositories failing to clone""" ... +@patch.object(RepositoryDownloader, 'fetch_trending_repositories') def test_download_repositories_fetch_fails(self, mock_fetch): """Test download when fetching repository list fails""" ... +@patch.object(RepositoryDownloader, 'fetch_trending_repositories') +@patch.object(RepositoryDownloader, 'clone_repository') def test_download_repositories_invalid_urls(self, mock_clone, mock_fetch):

Note: When using multiple decorators, they are applied bottom-up, so the parameter order matches the decorator order from bottom to top.

🧰 Tools

🪛 GitHub Actions: ruff

[warning] 140-140: Q000: Double quotes found but single quotes preferred

[warning] 141-141: Q000: Double quotes found but single quotes preferred

[warning] 152-152: Q000: Double quotes found but single quotes preferred

[warning] 162-162: Q000: Double quotes found but single quotes preferred

[warning] 169-169: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 137 to 176, the four test methods accept mock parameters but are missing the @patch decorators; add two decorators to each test so the bottom decorator patches the clone function and the top decorator patches the fetch function (e.g. top: @patch('scripts.fetch_github.fetch_repository_list'), bottom: @patch('scripts.fetch_github.clone_repository')) so the parameter order (mock_clone, mock_fetch) matches decorator order bottom-to-top; apply these two decorators to each of the four test methods.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+class ArgumentParserTestCase(unittest.TestCase):
+    def test_parse_arguments_default(self):
+        """Test argument parser with default values"""
+        with patch('sys.argv', ['script_name']):
+            from your_module import parse_arguments
+            args = parse_arguments()
+            self.assertEqual(args.nrepos, 100)
+            self.assertEqual(args.output_dir, 'target/01')
+
+    def test_parse_arguments_custom(self):
+        """Test argument parser with custom values"""
+        test_args = [
+            'script_name',
+            '--nrepos', '50',
+            '--output-dir', 'custom/directory'
+        ]
+        with patch('sys.argv', test_args):
+            from your_module import parse_arguments
+            args = parse_arguments()            
+            self.assertEqual(args.nrepos, 50)
+            self.assertEqual(args.output_dir, 'custom/directory')


⚠️ Potential issue | 🔴 Critical

Replace placeholder module name and fix formatting.

The tests import from 'your_module' which is a placeholder and will cause import errors. Additionally, Line 220 has trailing whitespace.

Apply this diff:

def test_parse_arguments_default(self): """Test argument parser with default values""" with patch('sys.argv', ['script_name']): - from your_module import parse_arguments + from scripts['01-fetch-github'] import parse_arguments args = parse_arguments() self.assertEqual(args.nrepos, 100) self.assertEqual(args.output_dir, 'target/01') def test_parse_arguments_custom(self): """Test argument parser with custom values""" test_args = [ 'script_name', '--nrepos', '50', '--output-dir', 'custom/directory' ] with patch('sys.argv', test_args): - from your_module import parse_arguments - args = parse_arguments() + from scripts['01-fetch-github'] import parse_arguments + args = parse_arguments() self.assertEqual(args.nrepos, 50) self.assertEqual(args.output_dir, 'custom/directory')

Note: Adjust the import path based on your project structure (e.g., scripts.fetch_github if using Python-friendly naming).

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 GitHub Actions: flake8

[error] 220-220: W291 trailing whitespace.

🪛 GitHub Actions: ty

[error] 206-206: Cannot resolve imported module your_module.

[error] 219-219: Cannot resolve imported module your_module.

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 202 to 222, the tests import from the placeholder 'your_module' which causes import errors and line 220 contains trailing whitespace; update the import to the real module path used by the project (for example replace "from your_module import parse_arguments" with "from scripts.fetch_github import parse_arguments" or the correct package/module path), ensure both test cases use that real import, and remove the trailing whitespace at line 220 so the file has no extraneous spaces.

coderabbitai · 2025-10-27T16:06:40Z

test/scripts/test_fetch_github.py

+def test_main_execution(mock_path, mock_downloader):
+    """Test main function execution"""
+    mock_args = Mock()
+    mock_args.nrepos = 50
+    mock_args.output_dir = 'test/output'    
+    with patch('your_module.parse_arguments', return_value=mock_args):
+        with patch('your_module.main') as mock_main:
+            from your_module import main
+            main()
+            mock_downloader.assert_called_once()


⚠️ Potential issue | 🔴 Critical

Fix multiple critical issues with test structure and logic.

This test has several problems that prevent it from working:

Not a proper test method: It's a standalone function, not a method of a test class, so unittest won't discover or run it

Placeholder module names: References to 'your_module' will cause import errors

Confused mocking logic: Patches main, imports main, calls main, but never verifies the mock

Undefined mock: Asserts mock_downloader but it's neither defined nor passed as a parameter

Missing decorators: mock_path and mock_downloader parameters have no corresponding @patch decorators

Formatting: Missing blank line before function definition and trailing whitespace on Line 228

Recommended approach - create a proper test method in the ArgumentParserTestCase class:

class ArgumentParserTestCase(unittest.TestCase): # ... existing tests ... @patch('scripts.01-fetch-github.RepositoryDownloader') @patch('scripts.01-fetch-github.Path') def test_main_execution(self, mock_path, mock_downloader_class): """Test main function execution""" mock_args = Mock() mock_args.nrepos = 50 mock_args.output_dir = 'test/output' mock_downloader_instance = Mock() mock_downloader_class.return_value = mock_downloader_instance with patch('scripts.01-fetch-github.parse_arguments', return_value=mock_args): from scripts['01-fetch-github'] import main main() # Verify RepositoryDownloader was instantiated correctly mock_downloader_class.assert_called_once() # Verify download_repositories was called mock_downloader_instance.setup_directories.assert_called_once() mock_downloader_instance.download_repositories.assert_called_once_with(50)

🧰 Tools

🪛 GitHub Actions: flake8

[error] 224-224: E302 expected 2 blank lines, found 1.

[error] 228-228: W291 trailing whitespace.

[error] 230-230: F841 local variable 'mock_main' is assigned to but never used.

🪛 GitHub Actions: ruff

[warning] 230-230: F841: Local variable mock_main is assigned to but never used

🪛 GitHub Actions: ty

[error] 231-231: Cannot resolve imported module your_module.

🤖 Prompt for AI Agents

In test/scripts/test_fetch_github.py around lines 224-233, the standalone test function should be converted into a proper unittest method inside the existing ArgumentParserTestCase class, remove trailing whitespace and add a blank line before the method; add @patch decorators to mock scripts.01-fetch-github.Path and scripts.01-fetch-github.RepositoryDownloader (order them so the mock args match the signature), create a mock_args object and make parse_arguments return it via patch('scripts.01-fetch-github.parse_arguments', return_value=mock_args), have the RepositoryDownloader mock return a mock instance and call main() imported from scripts.01-fetch-github, then assert the downloader class was instantiated once and that the instance methods (setup_directories and download_repositories) were called with the expected arguments (download_repositories called with mock_args.nrepos).

cqfn#683 Update 01-fetch-github.py

cb7a4a3

cqfn#683 Update 01-fetch-github.py

aed4fb2

coderabbitai bot reviewed Oct 13, 2025

View reviewed changes

kodsurfer added 4 commits October 14, 2025 02:37

cqfn#683 Update 01-fetch-github.py

8c3afc6

cqfn#683 Update 01-fetch-github.py

aa9b846

cqfn#683 Update 01-fetch-github.py

a969fb2

cqfn#683 Update 01-fetch-github.py

5126f0a

coderabbitai bot reviewed Oct 13, 2025

View reviewed changes

kodsurfer added 2 commits October 14, 2025 14:07

cqfn#683 Update 01-fetch-github.py

131672b

cqfn#683 Update 01-fetch-github.py

fc5fa0e

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

cqfn#683 Update 01-fetch-github.py

239e8d3

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

scripts/01-fetch-github.py Outdated Show resolved Hide resolved

cqfn#683 Update 01-fetch-github.py

c60fa36

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

scripts/01-fetch-github.py Outdated Show resolved Hide resolved

cqfn#683 Update 01-fetch-github.py

e396bee

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

cqfn#683 Update 01-fetch-github.py

9ac9c53

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

scripts/01-fetch-github.py Outdated Show resolved Hide resolved

scripts/01-fetch-github.py Outdated Show resolved Hide resolved

kodsurfer added 4 commits October 14, 2025 18:52

cqfn#683 Update 01-fetch-github.py

4e00864

cqfn#683 Update 01-fetch-github.py

8ef4522

cqfn#683 Update 01-fetch-github.py

75b1a21

cqfn#683 Update 01-fetch-github.py

91392da

coderabbitai bot reviewed Oct 14, 2025

View reviewed changes

kodsurfer added 3 commits October 27, 2025 18:57

Merge branch 'cqfn:master' into 683

82f6076

cqfn#683 add test_fetch_github.py

35f5b68

cqfn#683 Update test_fetch_github.py

5b14be9

coderabbitai bot reviewed Oct 27, 2025

View reviewed changes

Merge branch 'cqfn:master' into 683

759c426

Conversation

kodsurfer commented Oct 13, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Oct 13, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

kodsurfer commented Oct 14, 2025

Uh oh!

deyna256 commented Oct 14, 2025

Uh oh!

kodsurfer commented Oct 15, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

kodsurfer commented Oct 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 13, 2025 •

edited

Loading