Skip to content

Improve scripts code#954

Open
kodsurfer wants to merge 20 commits intocqfn:masterfrom
kodsurfer:683
Open

Improve scripts code#954
kodsurfer wants to merge 20 commits intocqfn:masterfrom
kodsurfer:683

Conversation

@kodsurfer
Copy link
Contributor

@kodsurfer kodsurfer commented Oct 13, 2025

Summary by CodeRabbit

  • New Features

    • CLI tool to download trending GitHub repositories with options for number, output directory, and timeout; updated program entrypoint.
  • Improvements

    • Creates owner/repo folders, skips already-downloaded repos, reports progress/status, robustly handles network and subprocess errors, and normalizes URLs for reliable cloning.
  • Refactor

    • Downloader reorganized into a structured, reusable class-based flow.
  • Tests

    • Comprehensive unit tests for fetching, URL normalization, cloning behavior, CLI parsing, and end-to-end flow.

@coderabbitai
Copy link

coderabbitai bot commented Oct 13, 2025

Walkthrough

Refactors scripts/01-fetch-github.py into a class-based RepositoryDownloader with setup, fetching, cloning, and orchestration methods; adds CLI parsing (--nrepos, --output-dir) and main() entrypoint. Adds comprehensive unit tests for the downloader in test/scripts/test_fetch_github.py.

Changes

Cohort / File(s) Summary of changes
Downloader refactor and CLI
scripts/01-fetch-github.py
Replaced procedural script with RepositoryDownloader class (__init__(output, url, timeout)), added setup_directories(), fetch_trending_repositories() -> List[str], clone_repository(repo_url, owner, repo_name) -> bool, download_repositories(max_repositories), parse_arguments() and main(); updated __main__ to call main().
Unit tests for downloader
test/scripts/test_fetch_github.py
Added tests covering directory setup, HTTP fetch and normalization of trending repo URLs, cloning success/failure and subprocess behavior, download workflow with limits and partial failures, argument parsing, and main invocation; mocks network, subprocess, and filesystem side-effects.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as argparse CLI
  participant RD as RepositoryDownloader
  participant GH as GitHub (HTTP)
  participant GIT as git subprocess
  rect rgb(240,248,255)
    User->>CLI: run script (--nrepos, --output-dir)
    CLI->>RD: instantiate(output, url, timeout)
    RD->>RD: setup_directories()
  end
  rect rgb(245,255,240)
    RD->>GH: GET trending page
    GH-->>RD: HTML response
    RD->>RD: parse & normalize repo URLs (ensure full GitHub .git)
  end
  loop each repo (up to max_repositories)
    RD->>RD: derive owner and repo_name
    alt repo not present locally
      RD->>GIT: git clone <repo_url> into output/owner/repo
      GIT-->>RD: success / error
      RD->>RD: record result
    else repo exists
      RD->>RD: skip clone (log)
    end
  end
  RD-->>CLI: finish (summary)
  CLI-->>User: exit
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Areas to review closely:
    • URL parsing/normalization and HTML selectors in fetch_trending_repositories().
    • Filesystem path handling and directory creation in setup_directories() and clone target logic.
    • Subprocess invocation, error handling, and idempotency in clone_repository().
    • Accuracy and isolation of mocks/assertions in test/scripts/test_fetch_github.py.

Poem

A twitch of whiskers, code in paw,
I hop through links with careful awe.
I fetch and clone, then snuggle tight,
Repos tucked in burrows for the night.
Bravo, friend — we did it right! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The title "Improve scripts code" is too vague and generic to meaningfully convey the nature of the changes. While the changeset does contain improvements (refactoring to a class-based RepositoryDownloader architecture and adding comprehensive unit tests), the title uses non-descriptive language that could apply to nearly any enhancement. A teammate reviewing git history would not understand whether the PR involved refactoring, bug fixes, feature additions, or performance improvements, which is critical information for understanding the codebase evolution. Consider revising the title to be more specific and descriptive of the actual changes. For example, "Refactor repository downloader to class-based architecture" or "Restructure fetch-github script with RepositoryDownloader class and unit tests" would clearly communicate the primary changes to someone reading the history without requiring them to inspect the diff.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35f5b68 and 5b14be9.

📒 Files selected for processing (1)
  • test/scripts/test_fetch_github.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/scripts/test_fetch_github.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: latexmk

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
scripts/01-fetch-github.py (1)

74-77: URL parsing logic is fragile.

The string manipulation approach to extract owner and repo name is brittle. It assumes a specific URL format and doesn't handle edge cases like trailing slashes, different URL schemes, or malformed URLs.

Consider using urllib.parse or a regex pattern for more robust URL parsing:

+            from urllib.parse import urlparse
+            
-            path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/')
-            if len(path_parts) < 2:
-                continue
-            owner, repo_name = path_parts[0], path_parts[1]
+            parsed = urlparse(repo_url)
+            path_parts = parsed.path.strip('/').replace('.git', '').split('/')
+            if len(path_parts) != 2:
+                print(f"Skipping malformed URL: {repo_url}")
+                continue
+            owner, repo_name = path_parts[0], path_parts[1]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 540a5cc and cb7a4a3.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
🪛 GitHub Actions: flake8
scripts/01-fetch-github.py

[error] 5-5: flake8: F401 'os' imported but unused


[error] 15-15: flake8: W291 trailing whitespace


[error] 20-20: flake8: W293 blank line contains whitespace


[error] 24-24: flake8: W293 blank line contains whitespace


[error] 33-33: flake8: W291 trailing whitespace


[error] 36-36: flake8: W291 trailing whitespace


[error] 38-38: flake8: W293 blank line contains whitespace


[error] 41-41: flake8: W291 trailing whitespace


[error] 44-44: flake8: W291 trailing whitespace


[error] 58-58: flake8: W291 trailing whitespace


[error] 62-62: flake8: W293 blank line contains whitespace


[error] 64-64: flake8: W291 trailing whitespace


[error] 69-69: flake8: W291 trailing whitespace


[error] 70-70: flake8: W291 trailing whitespace


[error] 73-73: flake8: W291 trailing whitespace


[error] 76-76: flake8: W291 trailing whitespace


[error] 77-77: flake8: W291 trailing whitespace


[error] 78-78: flake8: W291 trailing whitespace


[error] 80-80: flake8: W291 trailing whitespace


[error] 107-107: flake8: W291 trailing whitespace


[error] 1-1: flake8: General lint failure causing pipeline exit

🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[error] 15-15: Trailing whitespace (trailing-whitespace) in line 15


[error] 20-20: Trailing whitespace (trailing-whitespace) in line 20


[error] 24-24: Trailing whitespace (trailing-whitespace) in line 24


[error] 33-33: Trailing whitespace (trailing-whitespace) in line 33


[error] 36-36: Trailing whitespace (trailing-whitespace) in line 36


[error] 38-38: Trailing whitespace (trailing-whitespace) in line 38


[error] 41-41: Trailing whitespace (trailing-whitespace) in line 41


[error] 44-44: Trailing whitespace (trailing-whitespace) in line 44


[error] 58-58: Trailing whitespace (trailing-whitespace) in line 58


[error] 62-62: Trailing whitespace (trailing-whitespace) in line 62


[error] 64-64: Trailing whitespace (trailing-whitespace) in line 64


[error] 69-69: Trailing whitespace (trailing-whitespace) in line 69


[error] 70-70: Trailing whitespace (trailing-whitespace) in line 70


[error] 73-73: Trailing whitespace (trailing-whitespace) in line 73


[error] 76-76: Trailing whitespace (trailing-whitespace) in line 76


[error] 77-77: Trailing whitespace (trailing-whitespace) in line 77


[error] 78-78: Trailing whitespace (trailing-whitespace) in line 78


[error] 80-80: Trailing whitespace (trailing-whitespace) in line 80


[error] 107-107: Trailing whitespace (trailing-whitespace) in line 107


[error] 5-5: Unused import 'os' (unused-import) detected by pylint

🪛 GitHub Actions: ruff
scripts/01-fetch-github.py

[error] 5-5: F401: 'os' imported but unused


[error] 16-16: Q000: Double quotes found but single quotes preferred


[error] 18-18: Q000: Double quotes found but single quotes preferred


[error] 30-30: Q000: Double quotes found but single quotes preferred


[error] 36-36: Q000: Double quotes found but single quotes preferred


[error] 43-43: Q000: Double quotes found but single quotes preferred


[error] 54-54: Q000: Double quotes found but single quotes preferred


[error] 57-57: Q000: Double quotes found but single quotes preferred


[error] 60-60: Q000: Double quotes found but single quotes preferred


[error] 64-64: Q000: Double quotes found but single quotes preferred


[error] 68-68: Q000: Double quotes found but single quotes preferred


[error] 78-78: Q000: Double quotes found but single quotes preferred


[error] 81-81: Q000: Double quotes found but single quotes preferred

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: make
  • GitHub Check: latexmk
🔇 Additional comments (5)
scripts/01-fetch-github.py (5)

16-23: Good use of Path and sensible defaults.

The initialization and setup logic is clean. Using pathlib.Path for directory operations and providing sensible defaults for output directory and timeout is a solid choice.


39-61: Solid clone implementation with good error handling.

The repository cloning logic is well-structured:

  • Creates owner directories as needed
  • Checks for existing repositories to avoid redundant clones
  • Captures and reports both subprocess failures and exceptions
  • Returns boolean status for caller to track success

84-103: Clean argument parsing with sensible defaults.

The parse_arguments function is well-structured with clear help text and appropriate defaults for both the number of repositories and output directory.


106-110: Clean main function orchestration.

The main function effectively wires together argument parsing, downloader instantiation, and execution flow in a clear, linear manner.


113-114: Proper main entry point.

The if __name__ == '__main__' block correctly calls the main() function, following Python best practices.

print(f"Error fetching trending repositories: {e}", file=sys.stderr)
raise
soup = BeautifulSoup(response.text, 'html.parser')
repositories = []
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove trailing whitespace.

Lines 33, 36, and 38 have trailing whitespace causing linter failures.

Also applies to: 36-36, 38-38

🧰 Tools
🪛 GitHub Actions: flake8

[error] 33-33: flake8: W291 trailing whitespace

🪛 GitHub Actions: pylint

[error] 33-33: Trailing whitespace (trailing-whitespace) in line 33

🤖 Prompt for AI Agents
In scripts/01-fetch-github.py around lines 33, 36 and 38, there are trailing
spaces causing linter failures; remove the trailing whitespace at the end of
those lines (and any other accidental trailing spaces in the file), save the
file, and re-run the linter to confirm the warnings are resolved.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
scripts/01-fetch-github.py (2)

15-19: Use a Requests Session and set a User-Agent.

Improves connection reuse and avoids UA-related blocks. Add Session in init and use it.

     def __init__(self, output_dir: str = "target/01"):
         self.output_dir = Path(output_dir)
-        self.trending_url = "https://github.com/trending/java?since=daily"
+        self.trending_url = 'https://github.com/trending/java?since=daily'
         self.request_timeout = 30
+        self.session = requests.Session()
+        self.session.headers.update(
+            {'User-Agent': 'aibolit/01-fetch-github (+https://github.com/cqfn/aibolit)'}
+        )
@@
-            response = requests.get(self.trending_url, timeout=self.request_timeout)
+            response = self.session.get(self.trending_url, timeout=self.request_timeout)

Based on learnings

Also applies to: 26-26


83-102: Validate CLI: ensure --nrepos is non-negative.

Prevents surprising early exit with negative values.

-    parser.add_argument(
+    parser.add_argument(
         '--nrepos',
-        type=int,
+        type=positive_int,
         required=False,
         default=100,
         help='Number of repositories to download (default: 100)'
     )

Add helper (outside this range):

def positive_int(value: str) -> int:
    ivalue = int(value)
    if ivalue < 0:
        raise argparse.ArgumentTypeError('nrepos must be >= 0')
    return ivalue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 540a5cc and aed4fb2.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: flake8
scripts/01-fetch-github.py

[error] 14-14: flake8: W291 trailing whitespace.


[error] 19-19: flake8: W293 blank line contains whitespace.


[error] 23-23: flake8: W293 blank line contains whitespace.


[error] 32-32: flake8: W291 trailing whitespace.


[error] 35-35: flake8: W291 trailing whitespace.


[error] 37-37: flake8: W293 blank line contains whitespace.


[error] 40-40: flake8: W291 trailing whitespace.


[error] 43-43: flake8: W291 trailing whitespace.


[error] 57-57: flake8: W291 trailing whitespace.


[error] 61-61: flake8: W293 blank line contains whitespace.


[error] 63-63: flake8: W291 trailing whitespace.


[error] 68-68: flake8: W291 trailing whitespace.


[error] 69-69: flake8: W291 trailing whitespace.


[error] 72-72: flake8: W291 trailing whitespace.


[error] 75-75: flake8: W291 trailing whitespace.


[error] 76-76: flake8: W291 trailing whitespace.


[error] 79-79: flake8: W291 trailing whitespace.


[error] 106-106: flake8: W291 trailing whitespace.

🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[error] 14-14: Trailing whitespace (trailing-whitespace)


[error] 19-19: Trailing whitespace (trailing-whitespace)


[error] 23-23: Trailing whitespace (trailing-whitespace)


[error] 32-32: Trailing whitespace (trailing-whitespace)


[error] 35-35: Trailing whitespace (trailing-whitespace)


[error] 37-37: Trailing whitespace (trailing-whitespace)


[error] 40-40: Trailing whitespace (trailing-whitespace)


[error] 43-43: Trailing whitespace (trailing-whitespace)


[error] 57-57: Trailing whitespace (trailing-whitespace)


[error] 61-61: Trailing whitespace (trailing-whitespace)


[error] 63-63: Trailing whitespace (trailing-whitespace)


[error] 68-68: Trailing whitespace (trailing-whitespace)


[error] 69-69: Trailing whitespace (trailing-whitespace)


[error] 72-72: Trailing whitespace (trailing-whitespace)


[error] 75-75: Trailing whitespace (trailing-whitespace)


[error] 76-76: Trailing whitespace (trailing-whitespace)


[error] 77-77: Trailing whitespace (trailing-whitespace)


[error] 79-79: Trailing whitespace (trailing-whitespace)


[error] 106-106: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff
scripts/01-fetch-github.py

[error] 15-15: Q000: Double quotes found but single quotes preferred


[error] 17-17: Q000: Double quotes found but single quotes preferred


[error] 29-29: Q000: Double quotes found but single quotes preferred


[error] 35-35: Q000: Double quotes found but single quotes preferred


[error] 42-42: Q000: Double quotes found but single quotes preferred


[error] 53-53: Q000: Double quotes found but single quotes preferred


[error] 56-56: Q000: Double quotes found but single quotes preferred


[error] 59-59: Q000: Double quotes found but single quotes preferred


[error] 63-63: Q000: Double quotes found but single quotes preferred


[error] 67-67: Q000: Double quotes found but single quotes preferred


[error] 77-77: Q000: Double quotes found but single quotes preferred


[error] 80-80: Q000: Double quotes found but single quotes preferred

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: latexmk
  • GitHub Check: make
🔇 Additional comments (2)
scripts/01-fetch-github.py (2)

14-23: Nice refactor to a class with Path-based dirs.

Cleaner structure and directory setup look good.


10-11: Dependencies are declared and pinned
beautifulsoup4==4.14.2 and requests==2.32.5 are pinned in pyproject.toml.

Comment on lines 38 to 61
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(exist_ok=True)
if (owner_dir / repo_name).exists():
print(f"Repository {owner}/{repo_name} already exists, skipping...")
return True
try:
result = subprocess.run(
['git', 'clone', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False
)
if result.returncode == 0:
print(f"Successfully cloned {owner}/{repo_name}")
return True
else:
print(f"Failed to clone {owner}/{repo_name}: {result.stderr}")
return False
except subprocess.SubprocessError as e:
print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr)
return False

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Harden cloning: shallow clone, timeout, parents=True, and handle missing git.

Prevents hangs, reduces bandwidth, and avoids crashes when git is absent.

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)        
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
-            print(f"Repository {owner}/{repo_name} already exists, skipping...")
+            print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True        
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
-                print(f"Successfully cloned {owner}/{repo_name}")
+                print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
-                print(f"Failed to clone {owner}/{repo_name}: {result.stderr}")
+                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False                
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
-            print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr)
+            print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(exist_ok=True)
if (owner_dir / repo_name).exists():
print(f"Repository {owner}/{repo_name} already exists, skipping...")
return True
try:
result = subprocess.run(
['git', 'clone', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False
)
if result.returncode == 0:
print(f"Successfully cloned {owner}/{repo_name}")
return True
else:
print(f"Failed to clone {owner}/{repo_name}: {result.stderr}")
return False
except subprocess.SubprocessError as e:
print(f"Error cloning {owner}/{repo_name}: {e}", file=sys.stderr)
return False
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(parents=True, exist_ok=True)
if (owner_dir / repo_name).exists():
print(f'Repository {owner}/{repo_name} already exists, skipping...')
return True
try:
result = subprocess.run(
['git', 'clone', '--depth', '1', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False,
timeout=300,
)
if result.returncode == 0:
print(f'Successfully cloned {owner}/{repo_name}')
return True
else:
print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
return False
except FileNotFoundError:
print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
return False
except subprocess.SubprocessError as e:
print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
return False
🧰 Tools
🪛 GitHub Actions: flake8

[error] 40-40: flake8: W291 trailing whitespace.


[error] 43-43: flake8: W291 trailing whitespace.


[error] 57-57: flake8: W291 trailing whitespace.


[error] 61-61: flake8: W293 blank line contains whitespace.

🪛 GitHub Actions: pylint

[error] 40-40: Trailing whitespace (trailing-whitespace)


[error] 43-43: Trailing whitespace (trailing-whitespace)


[error] 57-57: Trailing whitespace (trailing-whitespace)


[error] 61-61: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff

[error] 42-42: Q000: Double quotes found but single quotes preferred


[error] 53-53: Q000: Double quotes found but single quotes preferred


[error] 56-56: Q000: Double quotes found but single quotes preferred


[error] 59-59: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents
In scripts/01-fetch-github.py around lines 38 to 61, the clone_repository method
should be hardened: create owner_dir with owner_dir.mkdir(parents=True,
exist_ok=True); perform a shallow clone by adding '--depth', '1' to the git args
to reduce bandwidth and time; pass a timeout (e.g. timeout=60) to subprocess.run
and catch subprocess.TimeoutExpired to handle hangs; also catch
FileNotFoundError to produce a clear error when git is not installed; keep
checking result.returncode and surface result.stderr on failure. Ensure you add
those exception handlers and use the parents=True mkdir flag.

Comment on lines 70 to 79
for repo_url in repositories:
if downloaded_count >= max_repositories:
break
path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/')
if len(path_parts) < 2:
continue
owner, repo_name = path_parts[0], path_parts[1]
print(f"Processing {owner}/{repo_name}...")
if self.clone_repository(repo_url, owner, repo_name):
downloaded_count += 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Parse owner/repo via urlparse for correctness across URL shapes.

String replace is brittle; use URL parsing.

-            path_parts = repo_url.replace('https://github.com/', '').replace('.git', '').split('/')
-            if len(path_parts) < 2:
-                continue                
-            owner, repo_name = path_parts[0], path_parts[1]            
-            print(f"Processing {owner}/{repo_name}...")            
+            parsed = urlparse(repo_url)
+            parts = [p for p in parsed.path.strip('/').split('/') if p]
+            if len(parts) < 2:
+                continue
+            owner, repo_name = parts[0], parts[1].split('.git', 1)[0]
+            print(f'Processing {owner}/{repo_name}...')
             if self.clone_repository(repo_url, owner, repo_name):
                 downloaded_count += 1        

Add import (outside this range):

from urllib.parse import urlparse
🧰 Tools
🪛 GitHub Actions: flake8

[error] 72-72: flake8: W291 trailing whitespace.


[error] 75-75: flake8: W291 trailing whitespace.


[error] 76-76: flake8: W291 trailing whitespace.


[error] 79-79: flake8: W291 trailing whitespace.

🪛 GitHub Actions: pylint

[error] 72-72: Trailing whitespace (trailing-whitespace)


[error] 75-75: Trailing whitespace (trailing-whitespace)


[error] 76-76: Trailing whitespace (trailing-whitespace)


[error] 77-77: Trailing whitespace (trailing-whitespace)


[error] 79-79: Trailing whitespace (trailing-whitespace)

🪛 GitHub Actions: ruff

[error] 77-77: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents
In scripts/01-fetch-github.py around lines 70 to 79, parsing owner/repo by
string replace is brittle; use urllib.parse.urlparse to handle different GitHub
URL shapes. Update the loop to parse repo_url with urlparse, extract the path,
strip leading/trailing slashes, split on '/' and validate there are at least two
segments, then assign owner and repo_name from those segments (remove a trailing
.git from repo_name if present). Also add the import "from urllib.parse import
urlparse" at the top of the file.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aed4fb2 and 5126f0a.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[error] 32-32: Pylint: Line too long (104/100) (C0301). Command 'make pylint' failed with exit code 2.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: latexmk
  • GitHub Check: make
🔇 Additional comments (7)
scripts/01-fetch-github.py (7)

14-18: LGTM!

The class initialization is well-structured with sensible defaults and proper use of Path for filesystem operations.


20-21: LGTM!

The directory setup correctly uses parents=True to ensure intermediate directories are created.


23-42: Good error handling and URL normalization.

The method correctly handles HTTP errors and normalizes repository URLs. The CSS selectors properly target GitHub Trending entries.


68-86: Approve the overall coordination logic.

The method correctly orchestrates fetching and cloning with appropriate error handling and progress reporting. Once the URL parsing is fixed, this will be robust.


89-108: LGTM!

The argument parsing is well-structured with clear descriptions and sensible defaults.


111-115: LGTM!

The main function provides a clean entry point with proper initialization sequence.


118-119: LGTM!

The entry point correctly uses the standard Python idiom.

Comment on lines +44 to +66
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(exist_ok=True)
if (owner_dir / repo_name).exists():
print(f'Repository {owner}/{repo_name} already exists, skipping...')
return True
try:
result = subprocess.run(
['git', 'clone', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False
)
if result.returncode == 0:
print(f'Successfully cloned {owner}/{repo_name}')
return True
else:
print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
return False
except subprocess.SubprocessError as e:
print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
return False
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Address unresolved hardening concerns from previous review.

The clone operation still lacks several critical improvements requested in the previous review:

  1. Line 46: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested paths.
  2. Lines 52: Missing --depth 1 flag for shallow cloning (reduces bandwidth and time).
  3. Line 51: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).
  4. Line 64: Missing FileNotFoundError handler to gracefully handle missing git executable.

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(exist_ok=True)
if (owner_dir / repo_name).exists():
print(f'Repository {owner}/{repo_name} already exists, skipping...')
return True
try:
result = subprocess.run(
['git', 'clone', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False
)
if result.returncode == 0:
print(f'Successfully cloned {owner}/{repo_name}')
return True
else:
print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
return False
except subprocess.SubprocessError as e:
print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
return False
def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
owner_dir = self.output_dir / owner
owner_dir.mkdir(parents=True, exist_ok=True)
if (owner_dir / repo_name).exists():
print(f'Repository {owner}/{repo_name} already exists, skipping...')
return True
try:
result = subprocess.run(
['git', 'clone', '--depth', '1', repo_url],
cwd=owner_dir,
capture_output=True,
text=True,
check=False,
timeout=300,
)
if result.returncode == 0:
print(f'Successfully cloned {owner}/{repo_name}')
return True
else:
print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
return False
except FileNotFoundError:
print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
return False
except subprocess.TimeoutExpired:
print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
return False
except subprocess.SubprocessError as e:
print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
return False
🤖 Prompt for AI Agents
In scripts/01-fetch-github.py around lines 44 to 66, the clone_repository
implementation needs hardening: create owner_dir with
owner_dir.mkdir(parents=True, exist_ok=True) to handle nested paths; run git
with a shallow clone flag by adding '--depth', '1' to the git clone args; pass a
sensible timeout to subprocess.run (e.g., timeout=60) to avoid hanging; and add
a separate except FileNotFoundError handler to log a clear error when the git
executable is missing (return False), keeping the existing SubprocessError
handler for other subprocess errors.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

48-70: Critical hardening issues remain unaddressed from previous review.

The clone_repository method still lacks several critical improvements that were requested in the previous review but have not been implemented:

  1. Line 50: owner_dir.mkdir(exist_ok=True) is missing parents=True, which will cause failures if the owner path has nested directories.
  2. Line 56: The git clone command is missing --depth 1 for shallow cloning, wasting bandwidth and time by cloning full repository history.
  3. Line 55: subprocess.run is missing the timeout parameter, allowing the process to hang indefinitely on unresponsive remotes.
  4. Line 68: Only catches SubprocessError, but missing a FileNotFoundError handler to provide a clear error message when the git executable is not installed.

Apply this diff to address all unresolved hardening concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5126f0a and fc5fa0e.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: make
  • GitHub Check: latexmk

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

48-70: Address unresolved hardening concerns from previous review.

The clone operation still lacks critical improvements flagged in previous reviews:

  1. Line 50: owner_dir.mkdir(exist_ok=True) missing parents=True for nested paths
  2. Line 56: Missing --depth 1 flag for shallow cloning (reduces bandwidth/time)
  3. Line 55: Missing timeout parameter on subprocess.run (prevents hangs)
  4. Line 68: Missing FileNotFoundError handler for missing git executable

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
🧹 Nitpick comments (1)
scripts/01-fetch-github.py (1)

115-119: LGTM! Consider making URL and timeout configurable.

The main function flow is correct. Optionally, you could add CLI arguments for --url and --timeout to make these configurable instead of hardcoded on line 117.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc5fa0e and 239e8d3.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
🪛 GitHub Actions: flake8
scripts/01-fetch-github.py

[error] 15-15: SyntaxError: non-default argument follows default argument (E999) during linting. This caused 'flake8 --max-line-length=120' to fail.

🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[error] 15-15: Pylint parsing failed: 'non-default argument follows default argument' (syntax-error) in 01-fetch-github.py:15

🪛 GitHub Actions: ruff
scripts/01-fetch-github.py

[error] 15-15: Syntax error: Parameter without a default cannot follow a parameter with a default in init signature.


[error] 15-15: Syntax error: Parameter without a default cannot follow a parameter with a default in init signature.

🪛 GitHub Actions: ty
scripts/01-fetch-github.py

[error] 15-15: Step: uvx ty==0.0.1-alpha.8 check. Command failed with exit code 1. invalid-syntax: Parameter without a default cannot follow a parameter with a default in init(self, output_dir: str = 'target/01', trend_url, timeout).


[error] 15-15: Step: uvx ty==0.0.1-alpha.8 check. Command failed with exit code 1. invalid-syntax: Parameter without a default cannot follow a parameter with a default in init(self, output_dir: str = 'target/01', trend_url, timeout).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: latexmk
  • GitHub Check: make
🔇 Additional comments (4)
scripts/01-fetch-github.py (4)

20-21: LGTM!

The directory setup correctly uses parents=True to handle nested paths.


23-46: LGTM!

The implementation is robust with proper error handling, multiple CSS selector fallbacks, and defensive URL normalization. The timeout and error propagation ensure the caller can handle failures appropriately.


93-112: LGTM!

The argument parser is well-structured with sensible defaults and clear help text.


122-123: LGTM!

Standard and correct __main__ guard implementation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

48-70: Critical: Address unresolved hardening concerns from previous reviews.

The clone operation still lacks several critical improvements requested in previous reviews:

  1. Line 50: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested directory paths safely.
  2. Line 56: Missing --depth 1 flag for shallow cloning (significantly reduces bandwidth and time).
  3. Line 55: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).
  4. Line 68: Missing FileNotFoundError handler to gracefully handle missing git executable.

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 239e8d3 and c60fa36.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[error] 117-117: Pylint: Line too long (C0301) 106/100 in scripts/01-fetch-github.py:117. Command that failed: 'make pylint' (uv run pylint aibolit test scripts --ignore=scripts/target).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: make
  • GitHub Check: latexmk
🔇 Additional comments (1)
scripts/01-fetch-github.py (1)

15-18: Critical: Fix parameter ordering in __init__ signature.

Parameters with default values must come after parameters without defaults. This syntax error is blocking all CI pipelines (flake8, pylint, ruff, ty).

Apply this diff to fix the parameter ordering:

-    def __init__(self, output, url, timeout):
+    def __init__(self, url, timeout, output='target/01'):

Alternatively, if all parameters should be explicit:

-    def __init__(self, output, url, timeout):
+    def __init__(self, output: str, url: str, timeout: int):

Then update the call site at line 117 to pass all three arguments explicitly.

Likely an incorrect or invalid review comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

48-70: Critical: Implement all requested clone hardening improvements.

This method still lacks the critical improvements requested in multiple previous reviews:

  1. Line 50: mkdir missing parents=True — will fail on nested owner paths.
  2. Line 56: Missing --depth 1 flag — wastes bandwidth and time on full clones.
  3. Line 55: Missing timeout parameter — process can hang indefinitely on unresponsive remotes.
  4. Line 68: Missing FileNotFoundError handler — crashes ungracefully when git is not installed.

Apply this diff to implement all hardening improvements:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
🧹 Nitpick comments (1)
scripts/01-fetch-github.py (1)

15-18: Add type hints to __init__ parameters.

The parameters lack type annotations, reducing code clarity and preventing static type checkers from catching errors.

Apply this diff:

-    def __init__(self, output, url, timeout):
+    def __init__(self, output: str, url: str, timeout: int):
         self.output_dir = Path(output)
         self.trending_url = url
         self.request_timeout = timeout
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c60fa36 and e396bee.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🔇 Additional comments (2)
scripts/01-fetch-github.py (2)

20-46: LGTM!

The directory setup and repository fetching logic are well implemented:

  • setup_directories correctly uses parents=True to handle nested paths.
  • fetch_trending_repositories has robust error handling and uses a fallback chain of DOM selectors.
  • Defensive parsing of href attributes prevents crashes on malformed HTML.

93-127: LGTM!

The CLI argument parsing, main function, and module guard are well structured:

  • parse_arguments properly configures defaults and help text.
  • main orchestrates the download workflow cleanly.
  • The __main__ guard correctly invokes main().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

48-70: Critical: Address unresolved hardening concerns from previous reviews.

The clone operation still lacks several critical improvements requested in multiple previous reviews:

  1. Line 50: owner_dir.mkdir(exist_ok=True) should use parents=True to handle nested paths.
  2. Line 56: Missing --depth 1 flag for shallow cloning (reduces bandwidth and time).
  3. Line 55: Missing timeout parameter on subprocess.run (prevents hangs on unresponsive clones).
  4. Line 68: Missing FileNotFoundError handler to gracefully handle missing git executable.
  5. Missing TimeoutExpired handler for the timeout parameter.

Apply this diff to address all concerns:

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e396bee and 9ac9c53.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: make
  • GitHub Check: latexmk

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
scripts/01-fetch-github.py (1)

49-71: Harden git clone: shallow clone, timeout, parents=True, and handle missing git/timeouts.

Prevents hangs, reduces bandwidth, and improves error handling. This was flagged previously and remains unresolved.

     def clone_repository(self, repo_url: str, owner: str, repo_name: str) -> bool:
         owner_dir = self.output_dir / owner
-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
         if (owner_dir / repo_name).exists():
             print(f'Repository {owner}/{repo_name} already exists, skipping...')
             return True
         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
-                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
+                print(f'Failed to clone {owner}/{repo_name}: {result.stderr}', file=sys.stderr)
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False
🧹 Nitpick comments (4)
scripts/01-fetch-github.py (4)

24-47: Trending parsing: solid selectors and normalization. Minor hardening suggestion.

Consider adding a User-Agent to reduce chances of 403s from GitHub.

-            response = requests.get(self.trending_url, timeout=self.request_timeout)
+            response = requests.get(
+                self.trending_url,
+                timeout=self.request_timeout,
+                headers={'User-Agent': 'aibolit/1.0 (+https://github.com/cqfn/aibolit)'}
+            )

73-79: Minor: print fetch failure to stderr.

-            print('Failed to fetch repository list. Exiting.')
+            print('Failed to fetch repository list. Exiting.', file=sys.stderr)

8-8: Use built‑in generics (list[str]) to drop typing.List.

-from typing import List
+# from typing import List  # no longer needed with built-in generics

-    def fetch_trending_repositories(self) -> List[str]:
+    def fetch_trending_repositories(self) -> list[str]:
-        repositories: List[str] = []
+        repositories: list[str] = []

Also applies to: 24-25, 32-33


96-107: Validate --nrepos as positive integer.

-def parse_arguments() -> argparse.Namespace:
+def parse_arguments() -> argparse.Namespace:
+    def positive_int(value: str) -> int:
+        iv = int(value)
+        if iv < 1:
+            raise argparse.ArgumentTypeError('nrepos must be a positive integer')
+        return iv
@@
-        '--nrepos',
-        type=int,
+        '--nrepos',
+        type=positive_int,
         required=False,
         default=100,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef4522 and 75b1a21.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
🪛 GitHub Actions: pylint
scripts/01-fetch-github.py

[warning] 12-12: C0411: standard import 'urllib.parse.urlparse' should be placed before third party imports 'requests', 'bs4.BeautifulSoup' (wrong-import-order)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: make
  • GitHub Check: latexmk
🔇 Additional comments (2)
scripts/01-fetch-github.py (2)

16-19: Constructor typing and order look good.


84-90: Good switch to urlparse; confirm Python baseline for removesuffix.

str.removesuffix needs Python 3.9+. If 3.8 must be supported, use rsplit('.git', 1)[0] instead.

Comment on lines 10 to 12
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix import order to unblock pylint (standard libs before third‑party).

Move urlparse above requests/bs4 and separate groups with a blank line.

-import requests
-from bs4 import BeautifulSoup
-from urllib.parse import urlparse
+from urllib.parse import urlparse
+
+import requests
+from bs4 import BeautifulSoup
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup
🧰 Tools
🪛 GitHub Actions: pylint

[warning] 12-12: C0411: standard import 'urllib.parse.urlparse' should be placed before third party imports 'requests', 'bs4.BeautifulSoup' (wrong-import-order)

🤖 Prompt for AI Agents
In scripts/01-fetch-github.py around lines 10 to 12, the standard-library import
from urllib.parse (urlparse) should be moved above third-party imports and
groups separated by a blank line to satisfy pylint; reorder so standard-library
imports come first (urllib.parse import), add a single blank line, then keep
requests and bs4 imports below.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
scripts/01-fetch-github.py (2)

56-71: Major: Add shallow cloning, timeout, and missing error handlers.

These hardening improvements were requested in previous reviews but remain unaddressed:

  1. Line 57: Missing --depth 1 for shallow cloning (wastes bandwidth and time downloading full history)
  2. Line 56: Missing timeout parameter (could hang indefinitely)
  3. Line 69: Missing FileNotFoundError handler (crashes unclearly if git not installed)

Apply this diff:

         try:
             result = subprocess.run(
-                ['git', 'clone', repo_url],
+                ['git', 'clone', '--depth', '1', repo_url],
                 cwd=owner_dir,
                 capture_output=True,
                 text=True,
-                check=False
+                check=False,
+                timeout=300,
             )
             if result.returncode == 0:
                 print(f'Successfully cloned {owner}/{repo_name}')
                 return True
             else:
                 print(f'Failed to clone {owner}/{repo_name}: {result.stderr}')
                 return False
+        except FileNotFoundError:
+            print('git not found on PATH; cannot clone repositories.', file=sys.stderr)
+            return False
+        except subprocess.TimeoutExpired:
+            print(f'Timeout cloning {owner}/{repo_name}', file=sys.stderr)
+            return False
         except subprocess.SubprocessError as e:
             print(f'Error cloning {owner}/{repo_name}: {e}', file=sys.stderr)
             return False

51-51: Critical: Add parents=True to prevent failures on nested paths.

This issue was flagged in previous reviews but remains unresolved. Line 51 uses mkdir(exist_ok=True) without parents=True, which will fail if the owner name requires intermediate directory creation.

Apply this diff:

-        owner_dir.mkdir(exist_ok=True)
+        owner_dir.mkdir(parents=True, exist_ok=True)
🧹 Nitpick comments (1)
scripts/01-fetch-github.py (1)

118-126: LGTM! Clean entrypoint with proper initialization.

The main function correctly:

  • Parses CLI arguments
  • Converts output path to Path object
  • Initializes downloader with appropriate parameters
  • Calls methods in correct order

Optional: Consider making the trending URL and timeout configurable via CLI arguments for flexibility:

     parser.add_argument(
         '--output-dir',
         type=str,
         required=False,
         default='target/01',
         help='Output directory for downloaded repositories (default: target/01)'
     )
+    parser.add_argument(
+        '--url',
+        type=str,
+        required=False,
+        default='https://github.com/trending/java?since=daily',
+        help='Trending URL to fetch repositories from'
+    )
+    parser.add_argument(
+        '--timeout',
+        type=int,
+        required=False,
+        default=30,
+        help='HTTP request timeout in seconds (default: 30)'
+    )
     return parser.parse_args()

Then update main:

     downloader = RepositoryDownloader(
         Path(args.output_dir),
-        'https://github.com/trending/java?since=daily',
-        30
+        args.url,
+        args.timeout
     )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ef4522 and 91392da.

📒 Files selected for processing (1)
  • scripts/01-fetch-github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/01-fetch-github.py (1)
stubs/bs4/__init__.pyi (1)
  • BeautifulSoup (7-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: latexmk
  • GitHub Check: make
🔇 Additional comments (7)
scripts/01-fetch-github.py (7)

4-12: LGTM! Imports are clean and necessary.

All imports are used and properly organized (standard library first, then third-party).


15-19: LGTM! Constructor is well-typed and clear.

Type hints are present and correct, parameter order is valid, and attribute names are descriptive.


21-22: LGTM! Directory creation is robust.

Using parents=True and exist_ok=True together ensures the directory structure is created safely without failing on existing paths.


24-47: LGTM! Robust repository fetching with good error handling.

The implementation includes:

  • Timeout on HTTP request
  • Proper exception handling with re-raise
  • Multiple CSS selector fallbacks for different GitHub layouts
  • Defensive href extraction preventing KeyError/AttributeError
  • URL normalization ensuring consistent format

73-93: LGTM! Clean orchestration with proper URL parsing.

The method correctly:

  • Handles fetch errors gracefully
  • Uses urlparse for robust URL parsing (addressing previous brittleness concerns)
  • Validates path structure before extraction
  • Uses removesuffix('.git') for clean repo name extraction
  • Respects max_repositories limit

96-115: LGTM! Standard argparse usage with clear defaults.

The CLI configuration is straightforward with sensible defaults (100 repos, 'target/01' directory) and descriptive help text.


129-130: LGTM! Standard Python entrypoint.

Correct usage of the __main__ guard calling the main function.

@kodsurfer
Copy link
Contributor Author

@yegor256 @literally-bug-creator hello, please check PR

@deyna256
Copy link
Member

@kodsurfer First things first, we should cover the code with tests, and then we can patch it
@yegor256 wdyt?

@kodsurfer
Copy link
Contributor Author

@literally-bug-creator @yegor256 I can add tests for new code

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
test/scripts/test_fetch_github.py (2)

178-199: Consider testing production code directly instead of duplicating logic.

The _normalize_url helper method duplicates the URL normalization logic from fetch_trending_repositories. This creates a maintenance burden: if the production logic changes, this test helper must also be updated, and the test might pass even when production behavior is incorrect.

Consider one of these approaches:

Option 1: Test the normalization through fetch_trending_repositories itself (which you're already doing in lines 35-62).

Option 2: If URL normalization is complex enough to warrant separate testing, extract it as a standalone method in the production code:

In scripts/01-fetch-github.py:

@staticmethod
def normalize_github_url(href: str) -> str:
    """Normalize a GitHub repository URL."""
    if not href:
        return href
    if href.startswith('/'):
        href = f'https://github.com{href}'
    if not href.endswith('.git'):
        href = f'{href}.git'
    return href

Then test it directly:

def test_url_parsing_and_normalization(self):
    """Test URL parsing and .git suffix normalization"""
    test_cases = [...]
    for input_href, expected_url in test_cases:
        with self.subTest(href=input_href):
            normalized = RepositoryDownloader.normalize_github_url(input_href)
            self.assertEqual(normalized, expected_url)

1-237: Consider standardizing quote style for consistency.

The linter (ruff) flags multiple instances where double quotes are used instead of single quotes. While this doesn't affect functionality, consistent quote usage improves code readability.

If your project prefers single quotes (as indicated by the ruff configuration), you can run an automated formatter to fix all instances at once:

#!/bin/bash
# Use ruff to automatically fix quote style
ruff check --select Q --fix test/scripts/test_fetch_github.py
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91392da and 35f5b68.

📒 Files selected for processing (1)
  • test/scripts/test_fetch_github.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/scripts/test_fetch_github.py (1)
scripts/01-fetch-github.py (6)
  • RepositoryDownloader (15-93)
  • setup_directories (21-22)
  • fetch_trending_repositories (24-47)
  • clone_repository (49-71)
  • download_repositories (73-93)
  • parse_arguments (96-115)
🪛 GitHub Actions: flake8
test/scripts/test_fetch_github.py

[error] 4-4: F401 'os' imported but unused.


[error] 8-8: F401 'unittest.mock.MagicMock' imported but unused.


[error] 220-220: W291 trailing whitespace.


[error] 224-224: E302 expected 2 blank lines, found 1.


[error] 228-228: W291 trailing whitespace.


[error] 230-230: F841 local variable 'mock_main' is assigned to but never used.

🪛 GitHub Actions: make
test/scripts/test_fetch_github.py

[error] 12-12: ModuleNotFoundError: No module named 'RepositoryDownloader' during test collection. Ensure the dependency is installed or adjust import path.

🪛 GitHub Actions: ruff
test/scripts/test_fetch_github.py

[error] 4-4: F401: 'os' imported but unused


[error] 8-8: F401: 'unittest.mock.MagicMock' imported but unused


[warning] 18-18: Q000: Double quotes found but single quotes preferred


[warning] 21-21: Q000: Double quotes found but single quotes preferred


[warning] 40-40: Q001: Double quotes found but single quotes preferred


[warning] 55-55: Q000: Double quotes found but single quotes preferred


[warning] 60-60: Q000: Double quotes found but single quotes preferred


[warning] 66-66: Q001: Double quotes found but single quotes preferred


[warning] 75-75: Q000: Double quotes found but single quotes preferred


[warning] 85-85: Q000: Double quotes found but single quotes preferred


[warning] 88-88: Q000: Double quotes found but single quotes preferred


[warning] 89-89: Q000: Double quotes found but single quotes preferred


[warning] 90-90: Q000: Double quotes found but single quotes preferred


[warning] 95-95: Q000: Double quotes found but single quotes preferred


[warning] 105-105: Q000: Double quotes found but single quotes preferred


[warning] 116-116: Q000: Double quotes found but single quotes preferred


[warning] 126-126: Q000: Double quotes found but single quotes preferred


[warning] 131-131: Q000: Double quotes found but single quotes preferred


[warning] 132-132: Q000: Double quotes found but single quotes preferred


[warning] 133-133: Q000: Double quotes found but single quotes preferred


[warning] 140-140: Q000: Double quotes found but single quotes preferred


[warning] 141-141: Q000: Double quotes found but single quotes preferred


[warning] 152-152: Q000: Double quotes found but single quotes preferred


[warning] 162-162: Q000: Double quotes found but single quotes preferred


[warning] 169-169: Q000: Double quotes found but single quotes preferred


[warning] 181-181: Q000: Double quotes found but single quotes preferred


[warning] 181-181: Q000: Double quotes found but single quotes preferred


[warning] 230-230: F841: Local variable mock_main is assigned to but never used

🪛 GitHub Actions: ty
test/scripts/test_fetch_github.py

[error] 12-12: Cannot resolve imported module RepositoryDownloader.


[error] 206-206: Cannot resolve imported module your_module.


[error] 219-219: Cannot resolve imported module your_module.


[error] 231-231: Cannot resolve imported module your_module.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: latexmk
🔇 Additional comments (4)
test/scripts/test_fetch_github.py (4)

15-33: LGTM!

The test class setup, teardown, and directory creation test are well-structured. Proper use of temporary directories ensures test isolation.


124-135: LGTM!

The test correctly validates that existing repositories are skipped without requiring mocks.


236-237: LGTM!

Standard unittest main guard is correctly implemented.


35-62: Add missing @patch decorator.

The test method expects a mock_get parameter but lacks the required @patch decorator.

Apply this diff:

+@patch('scripts.01-fetch-github.requests.get')
 def test_fetch_trending_repositories_success(self, mock_get):

Note: Adjust the patch path to match your actual module structure (e.g., scripts.fetch_github.requests.get if using Python-friendly naming).

Likely an incorrect or invalid review comment.

# SPDX-FileCopyrightText: Copyright (c) 2019-2025 Aibolit
# SPDX-License-Identifier: MIT

import os
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix critical import errors and remove unused imports.

Multiple import issues detected:

  • Line 4: os is imported but never used
  • Line 8: MagicMock is imported but never used
  • Line 12: The import statement is completely incorrect and will cause ModuleNotFoundError

Apply this diff to fix the imports:

-import os
 import tempfile
 import unittest
 from pathlib import Path
-from unittest.mock import Mock, patch, MagicMock
+from unittest.mock import Mock, patch
 import subprocess
 import requests

-import RepositoryDownloader
+from scripts['01-fetch-github'] import RepositoryDownloader

Note: The actual import path may need adjustment based on your project structure. Common alternatives:

  • from scripts.fetch_github import RepositoryDownloader (if the file is renamed)
  • Add the scripts directory to sys.path before importing

Also applies to: 8-8, 12-12

🧰 Tools
🪛 GitHub Actions: flake8

[error] 4-4: F401 'os' imported but unused.

🪛 GitHub Actions: ruff

[error] 4-4: F401: 'os' imported but unused

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 4, 8 and 12 remove unused
imports and fix the broken import: delete the unused "import os" (line 4) and
remove the unused "MagicMock" import (line 8), then replace the incorrect module
import on line 12 with a correct path to the tested class (for example use "from
scripts.fetch_github import RepositoryDownloader" or the proper relative import
for your package); if the test runner cannot find the scripts package, add a
short sys.path append before importing to include the project root (e.g. insert
sys.path manipulation above the import) so the import resolves.

Comment on lines +64 to +79
def test_fetch_trending_repositories_request_exception(self, mock_get):
"""Test handling of request exceptions"""
mock_get.side_effect = requests.RequestException("Network error")

with self.assertRaises(requests.RequestException):
self.downloader.fetch_trending_repositories()

def test_fetch_trending_repositories_empty_anchors(self, mock_get):
"""Test handling of empty repository list"""
mock_response = Mock()
mock_response.raise_for_status.return_value = None
mock_response.text = "<html><body></body></html>"
mock_get.return_value = mock_response

repositories = self.downloader.fetch_trending_repositories()
self.assertEqual(repositories, [])
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

Both test methods expect a mock_get parameter but lack the required @patch decorator.

Apply this diff:

+@patch('scripts.01-fetch-github.requests.get')
 def test_fetch_trending_repositories_request_exception(self, mock_get):
     """Test handling of request exceptions"""
     mock_get.side_effect = requests.RequestException("Network error")

     with self.assertRaises(requests.RequestException):
         self.downloader.fetch_trending_repositories()

+@patch('scripts.01-fetch-github.requests.get')
 def test_fetch_trending_repositories_empty_anchors(self, mock_get):
🧰 Tools
🪛 GitHub Actions: ruff

[warning] 66-66: Q001: Double quotes found but single quotes preferred


[warning] 75-75: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 64 to 79, the two test methods
declare a mock_get parameter but are missing the @patch decorator that supplies
it; add @patch('requests.get') above both
test_fetch_trending_repositories_request_exception and
test_fetch_trending_repositories_empty_anchors so the mock_get argument is
injected, and import patch from unittest.mock if not already imported.

Comment on lines +81 to +122
def test_clone_repository_success(self, mock_subprocess):
"""Test successful repository cloning"""
mock_result = Mock()
mock_result.returncode = 0
mock_result.stderr = ""
mock_subprocess.return_value = mock_result
result = self.downloader.clone_repository(
"https://github.com/owner/repo.git",
"owner",
"repo"
)
self.assertTrue(result)
mock_subprocess.assert_called_once_with(
['git', 'clone', 'https://github.com/owner/repo.git'],
cwd=self.test_output_dir / "owner",
capture_output=True,
text=True,
check=False
)

def test_clone_repository_failure(self, mock_subprocess):
"""Test repository cloning failure"""
mock_result = Mock()
mock_result.returncode = 1
mock_result.stderr = "Authentication failed"
mock_subprocess.return_value = mock_result
result = self.downloader.clone_repository(
"https://github.com/owner/repo.git",
"owner",
"repo"
)
self.assertFalse(result)

def test_clone_repository_subprocess_error(self, mock_subprocess):
"""Test handling of subprocess errors"""
mock_subprocess.side_effect = subprocess.SubprocessError("Git not found")
result = self.downloader.clone_repository(
"https://github.com/owner/repo.git",
"owner",
"repo"
)
self.assertFalse(result)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

Three test methods expect a mock_subprocess parameter but lack the required @patch decorator for subprocess.run.

Apply this diff:

+@patch('subprocess.run')
 def test_clone_repository_success(self, mock_subprocess):
     """Test successful repository cloning"""
     mock_result = Mock()
     mock_result.returncode = 0
     mock_result.stderr = ""
     mock_subprocess.return_value = mock_result
     ...

+@patch('subprocess.run')
 def test_clone_repository_failure(self, mock_subprocess):
     """Test repository cloning failure"""
     ...

+@patch('subprocess.run')
 def test_clone_repository_subprocess_error(self, mock_subprocess):
🧰 Tools
🪛 GitHub Actions: ruff

[warning] 85-85: Q000: Double quotes found but single quotes preferred


[warning] 88-88: Q000: Double quotes found but single quotes preferred


[warning] 89-89: Q000: Double quotes found but single quotes preferred


[warning] 90-90: Q000: Double quotes found but single quotes preferred


[warning] 95-95: Q000: Double quotes found but single quotes preferred


[warning] 105-105: Q000: Double quotes found but single quotes preferred


[warning] 116-116: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 81 to 122, the three tests
take a mock_subprocess argument but do not patch subprocess.run; add
@patch('subprocess.run') decorators above test_clone_repository_success,
test_clone_repository_failure, and test_clone_repository_subprocess_error so the
mock_subprocess parameter is provided by the test framework; ensure the patch
import (from unittest.mock import patch) is present at the top of the file if
missing.

Comment on lines +137 to +176
def test_download_repositories_success(self, mock_clone, mock_fetch):
"""Test successful download of multiple repositories"""
mock_fetch.return_value = [
"https://github.com/owner1/repo1.git",
"https://github.com/owner2/repo2.git",
"https://github.com/owner3/repo3.git"
]
mock_clone.return_value = True
self.downloader.download_repositories(max_repositories=2)
self.assertEqual(mock_clone.call_count, 2)
mock_fetch.assert_called_once()

def test_download_repositories_partial_success(self, mock_clone, mock_fetch):
"""Test download with some repositories failing to clone"""
mock_fetch.return_value = [
"https://github.com/owner1/repo1.git",
"https://github.com/owner2/repo2.git",
"https://github.com/owner3/repo3.git"
]
mock_clone.side_effect = [True, False, True]
self.downloader.download_repositories(max_repositories=3)
self.assertEqual(mock_clone.call_count, 3)

def test_download_repositories_fetch_fails(self, mock_fetch):
"""Test download when fetching repository list fails"""
mock_fetch.side_effect = requests.RequestException("Fetch failed")
self.downloader.download_repositories(max_repositories=10)
mock_fetch.assert_called_once()

def test_download_repositories_invalid_urls(self, mock_clone, mock_fetch):
"""Test handling of invalid repository URLs"""
mock_fetch.return_value = [
"https://github.com/owner1/repo1.git",
"invalid_url",
"https://github.com/singlepart",
"https://github.com/owner2/repo2.git"
]
mock_clone.return_value = True
self.downloader.download_repositories(max_repositories=10)
self.assertEqual(mock_clone.call_count, 2)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Add missing @patch decorators.

All four test methods expect mock parameters but lack the required @patch decorators.

Apply this diff:

+@patch.object(RepositoryDownloader, 'fetch_trending_repositories')
+@patch.object(RepositoryDownloader, 'clone_repository')
 def test_download_repositories_success(self, mock_clone, mock_fetch):
     """Test successful download of multiple repositories"""
     ...

+@patch.object(RepositoryDownloader, 'fetch_trending_repositories')
+@patch.object(RepositoryDownloader, 'clone_repository')
 def test_download_repositories_partial_success(self, mock_clone, mock_fetch):
     """Test download with some repositories failing to clone"""
     ...

+@patch.object(RepositoryDownloader, 'fetch_trending_repositories')
 def test_download_repositories_fetch_fails(self, mock_fetch):
     """Test download when fetching repository list fails"""
     ...

+@patch.object(RepositoryDownloader, 'fetch_trending_repositories')
+@patch.object(RepositoryDownloader, 'clone_repository')
 def test_download_repositories_invalid_urls(self, mock_clone, mock_fetch):

Note: When using multiple decorators, they are applied bottom-up, so the parameter order matches the decorator order from bottom to top.

🧰 Tools
🪛 GitHub Actions: ruff

[warning] 140-140: Q000: Double quotes found but single quotes preferred


[warning] 141-141: Q000: Double quotes found but single quotes preferred


[warning] 152-152: Q000: Double quotes found but single quotes preferred


[warning] 162-162: Q000: Double quotes found but single quotes preferred


[warning] 169-169: Q000: Double quotes found but single quotes preferred

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 137 to 176, the four test
methods accept mock parameters but are missing the @patch decorators; add two
decorators to each test so the bottom decorator patches the clone function and
the top decorator patches the fetch function (e.g. top:
@patch('scripts.fetch_github.fetch_repository_list'), bottom:
@patch('scripts.fetch_github.clone_repository')) so the parameter order
(mock_clone, mock_fetch) matches decorator order bottom-to-top; apply these two
decorators to each of the four test methods.

Comment on lines 202 to 222
class ArgumentParserTestCase(unittest.TestCase):
def test_parse_arguments_default(self):
"""Test argument parser with default values"""
with patch('sys.argv', ['script_name']):
from your_module import parse_arguments
args = parse_arguments()
self.assertEqual(args.nrepos, 100)
self.assertEqual(args.output_dir, 'target/01')

def test_parse_arguments_custom(self):
"""Test argument parser with custom values"""
test_args = [
'script_name',
'--nrepos', '50',
'--output-dir', 'custom/directory'
]
with patch('sys.argv', test_args):
from your_module import parse_arguments
args = parse_arguments()
self.assertEqual(args.nrepos, 50)
self.assertEqual(args.output_dir, 'custom/directory')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Replace placeholder module name and fix formatting.

The tests import from 'your_module' which is a placeholder and will cause import errors. Additionally, Line 220 has trailing whitespace.

Apply this diff:

 def test_parse_arguments_default(self):
     """Test argument parser with default values"""
     with patch('sys.argv', ['script_name']):
-        from your_module import parse_arguments
+        from scripts['01-fetch-github'] import parse_arguments
         args = parse_arguments()
         self.assertEqual(args.nrepos, 100)
         self.assertEqual(args.output_dir, 'target/01')

 def test_parse_arguments_custom(self):
     """Test argument parser with custom values"""
     test_args = [
         'script_name',
         '--nrepos', '50',
         '--output-dir', 'custom/directory'
     ]
     with patch('sys.argv', test_args):
-        from your_module import parse_arguments
-        args = parse_arguments()            
+        from scripts['01-fetch-github'] import parse_arguments
+        args = parse_arguments()
         self.assertEqual(args.nrepos, 50)
         self.assertEqual(args.output_dir, 'custom/directory')

Note: Adjust the import path based on your project structure (e.g., scripts.fetch_github if using Python-friendly naming).

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 GitHub Actions: flake8

[error] 220-220: W291 trailing whitespace.

🪛 GitHub Actions: ty

[error] 206-206: Cannot resolve imported module your_module.


[error] 219-219: Cannot resolve imported module your_module.

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 202 to 222, the tests import
from the placeholder 'your_module' which causes import errors and line 220
contains trailing whitespace; update the import to the real module path used by
the project (for example replace "from your_module import parse_arguments" with
"from scripts.fetch_github import parse_arguments" or the correct package/module
path), ensure both test cases use that real import, and remove the trailing
whitespace at line 220 so the file has no extraneous spaces.

Comment on lines 224 to 233
def test_main_execution(mock_path, mock_downloader):
"""Test main function execution"""
mock_args = Mock()
mock_args.nrepos = 50
mock_args.output_dir = 'test/output'
with patch('your_module.parse_arguments', return_value=mock_args):
with patch('your_module.main') as mock_main:
from your_module import main
main()
mock_downloader.assert_called_once()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix multiple critical issues with test structure and logic.

This test has several problems that prevent it from working:

  1. Not a proper test method: It's a standalone function, not a method of a test class, so unittest won't discover or run it
  2. Placeholder module names: References to 'your_module' will cause import errors
  3. Confused mocking logic: Patches main, imports main, calls main, but never verifies the mock
  4. Undefined mock: Asserts mock_downloader but it's neither defined nor passed as a parameter
  5. Missing decorators: mock_path and mock_downloader parameters have no corresponding @patch decorators
  6. Formatting: Missing blank line before function definition and trailing whitespace on Line 228

Recommended approach - create a proper test method in the ArgumentParserTestCase class:

class ArgumentParserTestCase(unittest.TestCase):
    # ... existing tests ...

    @patch('scripts.01-fetch-github.RepositoryDownloader')
    @patch('scripts.01-fetch-github.Path')
    def test_main_execution(self, mock_path, mock_downloader_class):
        """Test main function execution"""
        mock_args = Mock()
        mock_args.nrepos = 50
        mock_args.output_dir = 'test/output'
        mock_downloader_instance = Mock()
        mock_downloader_class.return_value = mock_downloader_instance

        with patch('scripts.01-fetch-github.parse_arguments', return_value=mock_args):
            from scripts['01-fetch-github'] import main
            main()

            # Verify RepositoryDownloader was instantiated correctly
            mock_downloader_class.assert_called_once()
            # Verify download_repositories was called
            mock_downloader_instance.setup_directories.assert_called_once()
            mock_downloader_instance.download_repositories.assert_called_once_with(50)
🧰 Tools
🪛 GitHub Actions: flake8

[error] 224-224: E302 expected 2 blank lines, found 1.


[error] 228-228: W291 trailing whitespace.


[error] 230-230: F841 local variable 'mock_main' is assigned to but never used.

🪛 GitHub Actions: ruff

[warning] 230-230: F841: Local variable mock_main is assigned to but never used

🪛 GitHub Actions: ty

[error] 231-231: Cannot resolve imported module your_module.

🤖 Prompt for AI Agents
In test/scripts/test_fetch_github.py around lines 224-233, the standalone test
function should be converted into a proper unittest method inside the existing
ArgumentParserTestCase class, remove trailing whitespace and add a blank line
before the method; add @patch decorators to mock scripts.01-fetch-github.Path
and scripts.01-fetch-github.RepositoryDownloader (order them so the mock args
match the signature), create a mock_args object and make parse_arguments return
it via patch('scripts.01-fetch-github.parse_arguments', return_value=mock_args),
have the RepositoryDownloader mock return a mock instance and call main()
imported from scripts.01-fetch-github, then assert the downloader class was
instantiated once and that the instance methods (setup_directories and
download_repositories) were called with the expected arguments
(download_repositories called with mock_args.nrepos).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants