[Bug]: Trying to run crawl4ai on https://platform.openai.com/docs outputs nothing #972

Closed
aliak00 opened this issue Apr 10, 2025 · 4 comments

aliak00 commented Apr 10, 2025

crawl4ai version

0.5.0.post8

Expected Behavior

Hi! I was running crawl4ai on a few sites and happened to try https://platform.openai.com/docs. I would expect to get the page back as markdown.

Current Behavior

Empty response.

❯ .venv/bin/crwl "https://platform.openai.com/docs/overview" -o markdown



~/dev/quillia/apps/backend/products/web-crawler-py main* 
❯

Is this reproducible?

Yes

Inputs Causing the Bug

Just defaults, I don't think I have any configuration for anything:


.venv/bin/crwl "https://platform.openai.com/docs/overview" -o markdown

Steps to Reproduce

Run:

.venv/bin/crwl "https://platform.openai.com/docs/overview" -o markdown

Code snippets

OS

macOS

Python version

3.13

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

@aliak00 aliak00 added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Apr 10, 2025
ntohidi (Collaborator) commented Apr 11, 2025

Hi @aliak00
OpenAI has implemented very strong bot detection, so we need a different approach to get past it. One idea is an identity-based browser setup: create a browser instance, log in to the page, and save the user data profile directory. Then launch a new browser instance that uses the same profile to preserve your identity.

The CLI doesn’t support that kind of setup at the moment.
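For readers who want to try the identity-based setup from Python rather than the CLI, here is a minimal sketch. It assumes `BrowserConfig` accepts `use_managed_browser` and `user_data_dir`, as crawl4ai's identity-based crawling docs describe; check those names against your installed version. The helper names and the profile path are hypothetical, and reusing a profile does not guarantee that the site's bot detection will let the crawl through.

```python
import asyncio
from pathlib import Path


def profile_browser_kwargs(profile_dir: str, headless: bool = True) -> dict:
    """Build BrowserConfig keyword arguments that reuse a saved browser profile.

    Assumption: BrowserConfig takes `use_managed_browser` and `user_data_dir`.
    """
    return {
        "headless": headless,
        "use_managed_browser": True,
        # Directory of a profile you created earlier by logging in manually.
        "user_data_dir": str(Path(profile_dir).expanduser()),
    }


async def crawl_with_profile(url: str, profile_dir: str):
    """Crawl `url` with a browser that reuses the profile stored in `profile_dir`."""
    # Imported lazily so the helper above works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(**profile_browser_kwargs(profile_dir))
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url=url)


if __name__ == "__main__":
    # Hypothetical profile location; replace with your own saved profile.
    result = asyncio.run(
        crawl_with_profile(
            "https://platform.openai.com/docs/overview",
            "~/.crawl4ai/profiles/my-profile",
        )
    )
    print(result.success)
```

The profile directory has to exist and hold a logged-in session before this runs; creating it is the manual "log in and save the profile" step described above.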

ntohidi (Collaborator) commented Apr 11, 2025

@aliak00 I will close the issue, but feel free to continue asking questions if you have any.

@ntohidi ntohidi closed this as completed Apr 11, 2025
@ntohidi ntohidi added ❓ Question Q&A and removed 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Apr 11, 2025
aliak00 (Author) commented Apr 11, 2025

Aha! Ok yeah that makes sense. Thank you :)

Follow-up question: is there a way to just be told something like "yeah, this website is not crawlable because of their own security measures or something"?

ntohidi (Collaborator) commented Apr 11, 2025


Nice idea! You can now check result.success; if it’s false, something went wrong, though it won’t explicitly say that bot detection was the cause. Still, it’s a good starting point.

@unclecode @aravindkarnam , we can add this to the backlog.
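As a rough illustration of the `result.success` check, the sketch below classifies a crawl result into "failed", "empty", or "ok". It assumes the result object exposes `success`, `status_code`, `error_message`, and `markdown`, and that `str(result.markdown)` yields the markdown text. The "possibly blocked" label is only a heuristic for the empty-output case reported in this issue, not something crawl4ai reports itself, and `describe_crawl` is a hypothetical helper name.

```python
from types import SimpleNamespace


def describe_crawl(result) -> str:
    """Return a short human-readable verdict for a crawl result."""
    if not getattr(result, "success", False):
        reason = getattr(result, "error_message", None) or "no error message"
        status = getattr(result, "status_code", None)
        return f"failed (status={status}): {reason}"

    markdown = getattr(result, "markdown", None)
    text = str(markdown).strip() if markdown is not None else ""
    if not text:
        # Heuristic only: an empty page can have causes other than bot detection.
        return "empty: crawl succeeded but produced no markdown (possibly blocked)"
    return "ok"


if __name__ == "__main__":
    # Stand-in objects that mimic the assumed result attributes.
    blocked = SimpleNamespace(success=False, status_code=403, error_message="blocked")
    blank = SimpleNamespace(success=True, markdown="")
    print(describe_crawl(blocked))
    print(describe_crawl(blank))
```

Checking for empty markdown as well as `success` matters here because the original report shows the CLI printing nothing rather than raising an error.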
