
Conversation

@PawsFunctions
Contributor

Bug fix: #79 - Failed to get all the existing links when the entire response is filled with duplicates

Edit: updated get_existing_links to use the new search API and changed the cursor logic to use nextCursor from the response.
New API: https://docs.linkwarden.app/api/search-links
The old links API was deprecated 6 months ago: linkwarden/docs@4846ad7

ADD: delete_links function to delete all duplicate links (OPT_DELETE_DUPLICATE).
ADD: DEBUG environment variable to replicate -d.
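
For reference, the nextCursor-based paging could look roughly like the sketch below. This is only an illustration: the function name, endpoint path, query parameters, and response fields other than nextCursor are assumptions based on the linked docs page, not verified against the actual StarWarden code or Linkwarden API.

```python
import requests


def iter_search_links(linkwarden_url, linkwarden_token, collection_id):
    """Hypothetical sketch of cursor-based paging over the search endpoint."""
    headers = {"Authorization": f"Bearer {linkwarden_token}"}
    cursor = None
    while True:
        params = {"collectionId": collection_id}
        if cursor is not None:
            params["cursor"] = cursor
        resp = requests.get(
            f"{linkwarden_url.rstrip('/')}/api/v1/search",  # placeholder path, see docs link above
            headers=headers,
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        for link in body.get("data", {}).get("links", []):  # field names assumed
            yield link
        cursor = body.get("data", {}).get("nextCursor")  # nextCursor drives pagination
        if not cursor:
            break
```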


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues and left some high-level feedback:

  • In delete_links, you catch Timeout but it’s not defined in this file unless imported elsewhere; consider either importing requests.exceptions.Timeout explicitly or catching requests.Timeout/requests.RequestException consistently to avoid a runtime NameError.
  • get_existing_links performs duplicate deletion only after the generator has been fully consumed, which makes the side effect dependent on the caller exhausting the iterator; consider exposing duplicate cleanup as an explicit function or making the behavior non-lazy to avoid surprising partial-iteration behavior.
  • When duplicate deletion is enabled, logging the full list of duplicate IDs at info level (duplicate_link_ids) can become very noisy for large collections; consider logging only counts or truncating the list, and reserve detailed IDs for debug logging (see the sketch after this list).
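
A minimal illustration of the first and third points above, assuming helper and variable names (delete_links, duplicate_link_ids) taken from the diff; the delete endpoint path is a placeholder rather than the verified Linkwarden API:

```python
import logging

import requests
from requests.exceptions import RequestException, Timeout  # explicit import avoids a NameError

logger = logging.getLogger(__name__)


def delete_links(linkwarden_url, linkwarden_token, duplicate_link_ids):
    """Hypothetical sketch: delete the given link IDs, logging counts at info and IDs at debug."""
    headers = {"Authorization": f"Bearer {linkwarden_token}"}
    # Keep info-level logging terse; the full ID list only shows up with -d / DEBUG.
    logger.info("Deleting %d duplicate links", len(duplicate_link_ids))
    logger.debug("Duplicate link IDs: %s", duplicate_link_ids)
    deleted = 0
    for link_id in duplicate_link_ids:
        url = f"{linkwarden_url.rstrip('/')}/api/v1/links/{link_id}"  # placeholder path
        try:
            response = requests.delete(url, headers=headers, timeout=30)
            response.raise_for_status()
            deleted += 1
        except Timeout:
            logger.error("Timed out deleting link %s", link_id)
        except RequestException as e:
            logger.error("Error deleting link %s: %s", link_id, e)
    return deleted
```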

## Individual Comments

### Comment 1
<location> `starwarden/linkwarden_api.py:13-22` </location>
<code_context>

-def get_existing_links(linkwarden_url, linkwarden_token, collection_id):
-    url = f"{linkwarden_url.rstrip('/')}/api/v1/links"
+def get_existing_links(linkwarden_url, linkwarden_token, collection_id, delete_duplicate=False):
+    """
+    Get existing links from a collection with optional duplicate detection and deletion.
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Duplicate deletion behavior depends on the caller fully exhausting the generator, which could be surprising.

Because duplicate detection and `delete_links` run only after the `while` loop finishes, they only occur if the caller fully consumes the generator. Any caller that stops early (e.g., via `break`) will never trigger deletion, even with `delete_duplicate=True`. To avoid this implicit coupling, consider either making `get_existing_links` eager (non-generator) or moving duplicate deletion into a separate, explicit API.

Suggested implementation:

```python
def get_existing_links(linkwarden_url, linkwarden_token, collection_id, delete_duplicate=False):
    """
    Get all existing links from a collection, with optional duplicate detection and deletion.

    This function eagerly fetches all pages of links into memory before performing
    any duplicate detection or deletion. This guarantees that duplicate handling
    does not depend on the caller exhausting a generator.

    Args:
        linkwarden_url: Base URL of Linkwarden instance.
        linkwarden_token: Authorization token for the API.
        collection_id: Collection ID to fetch links from.
        delete_duplicate: If True, automatically delete duplicate links found.

    Returns:
        A list of all link URLs (both unique and duplicates). When
        delete_duplicate=True, the list still contains the URLs as originally
        returned by the API; duplicates are deleted after collection, not filtered
        out of the returned list.
    """
```

To fully implement the eager behavior and remove the dependency on exhausting a generator, you should also (a sketch follows this list):

1. Inside `get_existing_links`:
   - Initialize an in-memory list before the paging loop, e.g. `all_links = []`.
   - Replace any `yield` statements in the `while`/paging loop with code that appends to this list (e.g. `all_links.append(url)` or `all_links.extend(urls)`).
   - Ensure the paging loop runs to completion unconditionally (i.e., do not return early from inside the loop).
2. After the paging loop finishes:
   - Run your duplicate detection on `all_links` to determine which links are duplicates.
   - If `delete_duplicate` is True, call your existing deletion helper (e.g. `delete_links(...)`) with the set/list of duplicate link IDs.
3. At the end of the function:
   - `return all_links` instead of using a generator return pattern.
4. If there are callers that rely on streaming behavior for very large collections, consider:
   - Introducing a separate streaming helper (e.g. `iter_existing_links(...)`) that exposes the generator behavior but does **not** perform any deletion, while keeping `get_existing_links` as the eager, side-effecting API.
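
Putting those steps together, a non-lazy version could look roughly like the sketch below. The search endpoint path and the response field names other than nextCursor are assumptions carried over from the PR description, and delete_links refers to the helper added in this PR (its exact signature is assumed):

```python
import requests


def get_existing_links(linkwarden_url, linkwarden_token, collection_id, delete_duplicate=False):
    """Eager sketch: collect every page first, then handle duplicates."""
    headers = {"Authorization": f"Bearer {linkwarden_token}"}
    all_links = []           # every URL returned by the API, in order
    seen_urls = set()        # URLs seen so far, used to spot duplicates
    duplicate_link_ids = []  # IDs of links whose URL was already seen
    cursor = None
    while True:
        params = {"collectionId": collection_id}
        if cursor is not None:
            params["cursor"] = cursor
        response = requests.get(
            f"{linkwarden_url.rstrip('/')}/api/v1/search",  # placeholder path
            headers=headers,
            params=params,
            timeout=30,
        )
        response.raise_for_status()
        body = response.json()
        for link in body.get("data", {}).get("links", []):  # field names assumed
            all_links.append(link["url"])
            if link["url"] in seen_urls:
                duplicate_link_ids.append(link["id"])
            else:
                seen_urls.add(link["url"])
        cursor = body.get("data", {}).get("nextCursor")
        if not cursor:
            break
    # Deletion now runs unconditionally once paging finishes,
    # instead of depending on a caller exhausting a generator.
    if delete_duplicate and duplicate_link_ids:
        delete_links(linkwarden_url, linkwarden_token, duplicate_link_ids)
    return all_links
```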
</issue_to_address>

### Comment 2
<location> `README.md:12` </location>
<code_context>
 - Export GitHub stars to Linkwarden
 - Export to existing collection or create new colletion
 - Keep Linkwarden collection up-to-date with your GitHub stars using the docker image.
-
</code_context>

<issue_to_address>
**issue (typo):** Fix the spelling of "colletion" to "collection".

```suggestion
- Export to existing collection or create new collection
```
</issue_to_address>


Owner

@rtuszik rtuszik left a comment


Thanks for this PR!

Please address my comments and the currently failing pytest/lint errors so I can merge this.

Comment on lines 107 to 195
"""
Fetch all collections from the Linkwarden instance.
Args:
linkwarden_url: Base URL of the Linkwarden instance
linkwarden_token: Authorization token for the API
Returns:
list: List of collection objects
Raises:
Exception: If there's an error fetching collections from the API
"""
url = f"{linkwarden_url.rstrip('/')}/api/v1/collections"
headers = {
"Authorization": f"Bearer {linkwarden_token}",
"Content-Type": "application/json",
}
try:
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()
collections = data.get("response", [])
logger.debug(f"Fetched collections: {json.dumps(collections, indent=2)}")
return collections
except requests.RequestException as e:
raise Exception(f"Error fetching collections from Linkwarden: {str(e)}") from e


def create_collection(linkwarden_url, linkwarden_token, name, description=""):
"""
Create a new collection in Linkwarden.
Args:
linkwarden_url: Base URL of the Linkwarden instance
linkwarden_token: Authorization token for the API
name: Name of the collection to create
description: Optional description for the collection (default: empty string)
Returns:
dict: The created collection object if successful, None otherwise
"""
url = f"{linkwarden_url.rstrip('/')}/api/v1/collections"
headers = {
"Authorization": f"Bearer {linkwarden_token}",
"Content-Type": "application/json",
}
data = {"name": name, "description": description}
try:
logger.debug(f"Attempting to create collection: {name}")
response = requests.post(
url,
headers=headers,
json=data,
timeout=30,
)
logger.debug(f"Response status code: {response.status_code}")
logger.debug(f"Response content: {response.text}")
response.raise_for_status()
created_collection = response.json().get("response", {})
logger.info(f"Created collection: {created_collection}")
return created_collection
except Timeout:
logger.error("Request timed out while creating new collection in Linkwarden")
return None
except requests.RequestException as e:
logger.error(f"Error creating new collection in Linkwarden: {str(e)}")
if hasattr(e, "response") and e.response is not None:
logger.error(f"Response status code: {e.response.status_code}")
logger.error(f"Response content: {e.response.text}")
return None


def upload_link(linkwarden_url, linkwarden_token, collection_id, repo, tags):
"""
Upload a repository as a link to Linkwarden.
Args:
linkwarden_url: Base URL of the Linkwarden instance
linkwarden_token: Authorization token for the API
collection_id: ID of the collection to add the link to
repo: Repository object containing repository information
tags: List of tags to associate with the link
Returns:
str or None: The ID of the created link if successful,
LINK_EXISTS_GLOBALLY if link already exists,
None if there was an error
"""
Owner


Please get rid of all the AI-generated docstring comments that were added. They don't add any value and are inconsistent with the rest of the codebase.

Contributor Author


will send new PR tomorrow 👍



Successfully merging this pull request may close these issues.

Bug: Failed to get all the existing links when there are too many duplicate links
