Hello everyone!
I'm a Java developer who has recently written a few web crawlers in Java to scrape news articles and official announcement documents. To be frank, I'm still a novice at web crawling: at the moment I can only scrape basic information and files successfully. However, when I run a crawler several times in succession, it hits temporary connection failures and cannot continue scraping.
I've tried using some low-cost proxy pools, but the results have been less than satisfactory.
Now, following a recommendation, I'd like to try crawl4AI to tackle a task in my development workflow: scraping and downloading annual reports from the official Bursa Malaysia website (https://www.bursamalaysia.com/) for analysis. However, I have no idea where to start.
If implementing this in Java, here are the steps I would follow:
Step 1: Call the list retrieval API endpoint
https://www.bursamalaysia.com/api/v1/announcements/search?ann_type=company&company=0129&keyword=&dt_ht=&dt_lt=&cat=AR%2CARCO&sub_type=&mkt=&sec=&subsec=&per_page=20&page=1&_=1767088639073
The 13-digit number at the very end of the URL (the value of the trailing "_" parameter) is the current timestamp in milliseconds.
This list API returns the following response:
{
  "recordsTotal": 22,
  "recordsFiltered": 22,
  "category_message": "",
  "data": [
    [
      1,
      "\u003cdiv class='d-lg-none'\u003e31 Oct\u003cbr/\u003e2025\u003c/div\u003e\u003cdiv class='d-lg-inline-block d-none'\u003e31 Oct 2025\u003c/div\u003e",
      "\u003ca href='/trade/trading_resources/listing_directory/company-profile?stock_code=0129' target=_blank\u003eSILVER RIDGE HOLDINGS BHD\u003c/a\u003e",
      "\u003ca href='/market_information/announcements/company_announcement/announcement_details?ann_id=3604910' target=_blank\u003eAnnual Report \u0026 CG Report - 2025\u003c/a\u003e"
    ],
    [
      2,
      "\u003cdiv class='d-lg-none'\u003e30 Oct\u003cbr/\u003e2024\u003c/div\u003e\u003cdiv class='d-lg-inline-block d-none'\u003e30 Oct 2024\u003c/div\u003e",
      "\u003ca href='/trade/trading_resources/listing_directory/company-profile?stock_code=0129' target=_blank\u003eSILVER RIDGE HOLDINGS BHD\u003c/a\u003e",
      "\u003ca href='/market_information/announcements/company_announcement/announcement_details?ann_id=3496498' target=_blank\u003eAnnual Report \u0026 CG Report - 2024\u003c/a\u003e"
    ]...
  ]
}
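To make the step concrete, here is a minimal sketch of Step 1 in plain Java (java.net.http). It assumes the API answers a bare GET without extra headers, and it pulls the ann_id values out of the embedded HTML with a simple regex rather than a full JSON parser:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnouncementList {
    public static void main(String[] args) throws Exception {
        // Step 1: call the announcement list API. The trailing "_" value is the
        // current timestamp in milliseconds (the 13-digit number in the example URL).
        String url = "https://www.bursamalaysia.com/api/v1/announcements/search"
                + "?ann_type=company&company=0129&keyword=&dt_ht=&dt_lt="
                + "&cat=AR%2CARCO&sub_type=&mkt=&sec=&subsec=&per_page=20&page=1"
                + "&_=" + System.currentTimeMillis();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // The "data" rows embed links like ...announcement_details?ann_id=3604910,
        // so a simple regex is enough to collect every ann_id on the page.
        List<String> annIds = new ArrayList<>();
        Matcher m = Pattern.compile("ann_id=(\\d+)").matcher(body);
        while (m.find()) {
            annIds.add(m.group(1));
        }
        System.out.println(annIds); // e.g. [3604910, 3496498, ...]
    }
}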
Step 2: Call the detail retrieval API endpoint
https://disclosure.bursamalaysia.com/FileAccess/viewHtml?e=3604910
Here, the parameter 3604910 corresponds to the ann_id=3604910 returned by the list API in Step 1.
This endpoint returns an HTML page whose "Attachments" section embeds the download link for the report; the link path found there is what Step 3 uses.
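A similar sketch for Step 2. The regex that extracts the download path from the Attachments section is an assumption about the page markup, based only on the link shape shown in Step 3:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnouncementDetail {
    // Step 2: fetch the detail page for one ann_id and pull out the attachment path
    // under "Attachments", e.g. /FileAccess/apbursaweb/download?id=...&name=EA_DS_ATTACHMENTS
    static String findDownloadPath(String annId) throws Exception {
        String url = "https://disclosure.bursamalaysia.com/FileAccess/viewHtml?e=" + annId;
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Assumed link shape; the raw HTML may encode "&" as "&amp;".
        Matcher m = Pattern
                .compile("/FileAccess/apbursaweb/download\\?id=\\d+&(?:amp;)?name=EA_DS_ATTACHMENTS")
                .matcher(html);
        return m.find() ? m.group().replace("&amp;", "&") : null;
    }
}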
Step 3: Call the download API endpoint
https://disclosure.bursamalaysia.com/FileAccess/apbursaweb/download?id=247313&name=EA_DS_ATTACHMENTS
The path for this download endpoint (/FileAccess/apbursaweb/download?id=247313&name=EA_DS_ATTACHMENTS) is taken directly from the link returned in Step 2.
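Finally, a sketch of Step 3: saving the attachment to disk with the same HTTP client. It assumes the path extracted in Step 2 is passed in, and the target filename is purely illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class AnnouncementDownload {
    // Step 3: download the attachment found in Step 2 and save it to disk.
    static Path download(String downloadPath, Path target) throws Exception {
        String url = "https://disclosure.bursamalaysia.com" + downloadPath;
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // in case the endpoint redirects
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<Path> response = client.send(request, HttpResponse.BodyHandlers.ofFile(target));
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // The output filename is just an illustrative choice.
        download("/FileAccess/apbursaweb/download?id=247313&name=EA_DS_ATTACHMENTS",
                 Path.of("annual_report_2025.pdf"));
    }
}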
These three steps enable the querying and downloading of the corresponding annual reports. I'm wondering if crawl4AI offers a simpler, more streamlined solution for this task. Thank you very much!