feat: Google Docs, Files, PDF URLs need to be converted for their export URL #312

Closed
wants to merge 3 commits into from

@vtempest vtempest commented Jun 2, 2025

See here about embedded PDFs and Google Docs: docling-project/docling#1682 (comment)

Also take a look at YouTube-to-transcript extraction: https://airesearch.js.org/functions/extractor/url-to-content/youtube-to-text

mergify bot commented Jun 2, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce conventional commit

This rule is failing.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
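
To see why the original title failed this rule and the retitled one passes, the pattern can be tried locally. This is a minimal sketch: the regular expression is copied from the rule above, while the helper name `is_conventional_title` is hypothetical and not part of Mergify or this repository.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper for illustration only.
    """
    return CONVENTIONAL_TITLE.match(title) is not None


old_title = "Google Docs, Files, PDF URLs need to be converted for their export URL"
new_title = "feat: " + old_title

print(is_conventional_title(old_title))  # False
print(is_conventional_title(new_title))  # True
```

An optional scope and a breaking-change marker are also accepted by the pattern, for example `fix(parser)!: ...`.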

@vtempest vtempest changed the title from "Google Docs, Files, PDF URLs need to be converted for their export URL" to "feat: Google Docs, Files, PDF URLs need to be converted for their export URL" Jun 2, 2025
@dolfim-ibm
Contributor

@vtempest nice PR, I tried it out and it worked, thanks!

I left a comment about the export format, and I was wondering if we shouldn't extend it already also for docs.google.com/spreadsheets/ and docs.google.com/presentation/.

Regarding the CI tests, please apply the pre-commit styling with

uv run pre-commit run --all-files

@vtempest
Author

vtempest commented Jun 6, 2025

> @vtempest nice PR, I tried it out and it worked, thanks!
>
> I left a comment about the export format, and I was wondering if we shouldn't extend it already also for docs.google.com/spreadsheets/ and docs.google.com/presentation/.
>
> Regarding the CI tests, please apply the pre-commit styling with
>
> uv run pre-commit run --all-files

Great thinking Dolfim!

I added support for the 4 main Google Drive types.

I linted it to ensure proper str-to-AnyHttpUrl casting.

        # Google Docs, Files, PDF URLs, Spreadsheets, Presentations: convert to export URL
        google_doc_id = re.search(
            r"google\.com\/(file|document|spreadsheets|presentation)\/d\/([\w-]+)",
            str(http_url),
        )
        if google_doc_id:
            doc_type = google_doc_id.group(1)
            doc_id = google_doc_id.group(2)

            if doc_type == "file":
                http_url = TypeAdapter(AnyHttpUrl).validate_python(
                    f"https://drive.google.com/uc?export=download&id={doc_id}"
                )
            elif doc_type == "document":
                http_url = TypeAdapter(AnyHttpUrl).validate_python(
                    f"https://docs.google.com/document/d/{doc_id}/export?format=docx"
                )
            elif doc_type == "spreadsheets":
                http_url = TypeAdapter(AnyHttpUrl).validate_python(
                    f"https://docs.google.com/spreadsheets/d/{doc_id}/export?format=xlsx"
                )
            elif doc_type == "presentation":
                http_url = TypeAdapter(AnyHttpUrl).validate_python(
                    f"https://docs.google.com/presentation/d/{doc_id}/export/pptx"
                )
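
The snippet above is an excerpt from inside a larger function, so it does not run on its own. A self-contained, table-driven sketch of the same mapping might look like the following. The function name `to_google_export_url` and the constant names are hypothetical, and the sketch returns plain strings rather than wrapping the result in `TypeAdapter(AnyHttpUrl).validate_python(...)` as the snippet does, so that it has no dependency beyond the standard library.

```python
import re

# Export URL templates keyed by the path segment that follows google.com/.
# These are the same four URLs used in the snippet above.
EXPORT_TEMPLATES = {
    "file": "https://drive.google.com/uc?export=download&id={doc_id}",
    "document": "https://docs.google.com/document/d/{doc_id}/export?format=docx",
    "spreadsheets": "https://docs.google.com/spreadsheets/d/{doc_id}/export?format=xlsx",
    "presentation": "https://docs.google.com/presentation/d/{doc_id}/export/pptx",
}

# Same pattern as the snippet, without the unnecessary escaping of "/".
GOOGLE_DOC_RE = re.compile(
    r"google\.com/(file|document|spreadsheets|presentation)/d/([\w-]+)"
)


def to_google_export_url(url: str) -> str:
    """Return the export URL for a Google Drive/Docs link, else the URL unchanged.

    Hypothetical helper for illustration; not the function used in the PR.
    """
    match = GOOGLE_DOC_RE.search(url)
    if match is None:
        return url
    doc_type, doc_id = match.groups()
    return EXPORT_TEMPLATES[doc_type].format(doc_id=doc_id)


print(to_google_export_url("https://docs.google.com/document/d/abc123_-X/edit?usp=sharing"))
print(to_google_export_url("https://example.com/paper.pdf"))
```

Keeping the templates in a dictionary means a further document type needs one new entry plus one more alternative in the regular expression, instead of another `elif` branch.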

@dolfim-ibm
Contributor

Cool! But I think there is an issue with your commits. The snippet you posted above is not shown in the PR.

@vtempest vtempest closed this Jun 7, 2025
@vtempest vtempest deleted the patch-1 branch June 7, 2025 00:47