- Python 3.9+
- Open a terminal or command prompt.
- Clone this repository:

  ```bash
  git clone [email protected]:oxybrain/scraper-api-setup.git
  ```

- Navigate to your project directory:

  ```bash
  cd scraper-api-setup/
  ```

- Create a virtual environment using Python 3.9:

  ```bash
  python3.9 -m venv venv
  ```

- Activate the virtual environment:
  - On Mac/Linux:

    ```bash
    source venv/bin/activate
    ```

  - On Windows:

    ```bash
    venv\Scripts\activate
    ```
- Ensure your virtual environment is activated.
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
The only setup you need is a list of URLs and a payload of Scraper API parameters, which can be generated in our playground. Below you will find the general steps to run our Scraper API and a few examples of the different ways to run it.
Steps:
- Generate a payload by describing your scraping needs in the Oxylabs playground (https://dashboard.oxylabs.io/?route=/api-playground).
- Copy the payload and store it either in `runtime_files/payload.json` or as a new file on your device (an example is shown after these steps).
- Store the URLs you wish to scrape in `runtime_files/urls.txt` or as a new file on your device. Each URL must be separated by a newline.
- Run the script which registers/performs your scraping jobs (ensure you have completed the installation steps above and have the environment activated):

  ```bash
  python3.9 run.py
  ```
- This will start a setup wizard, where you have to provide information about your account and a few other details.
- Enter your Oxylabs username and password. They will be cached in the `.env` file. If you later wish to change them, just remove this file.
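For reference, here is what the two input files might look like. This is a minimal illustrative payload; the actual parameters depend on what you generate in the playground, and the values below are placeholders.

`runtime_files/payload.json`:

```json
{
    "source": "universal",
    "render": "html",
    "geo_location": "United States"
}
```

`runtime_files/urls.txt` (one URL per line):

```
https://example.com/page-1
https://example.com/page-2
```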
- Initiate the `run.py` script:

  ```bash
  python3.9 run.py
  ```

- When asked `Select where you wish to store the results.`, enter `1` to choose `Locally`.
- When asked `Full path to existing directory where results should be stored:`, enter the path to the directory where completed jobs should be stored. This directory must already exist on your device.
- That's it. Completed jobs will be stored in the desired location after some time, depending on the number of URLs you asked to scrape.
Image of wizard for reference
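Once jobs have completed, you can inspect the downloaded results. Below is a minimal sketch, assuming each completed job is delivered as a JSON file in the directory you specified (the path is a placeholder):

```python
import json
from pathlib import Path

# Placeholder path; use the directory you entered in the wizard.
results_dir = Path("/path/to/results")

# Assumes each completed job is stored as a separate JSON file.
for result_file in sorted(results_dir.glob("*.json")):
    with result_file.open() as f:
        job = json.load(f)
    print(result_file.name, list(job.keys()))
```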
Currently, we support only AWS S3 and Google Cloud Storage.
- Initiate the `run.py` script:

  ```bash
  python3.9 run.py
  ```

- When asked `Select where you wish to store the results.`, enter `2` to choose `Cloud`.
- When asked `Path to cloud bucket and directory/partition where results should be stored:`, enter the path to your cloud directory where completed jobs should be stored. Bucket permissions should be adjusted by following these instructions: https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/cloud-storage.
- When asked `Do you want to schedule urls to be scraped repetitively?(y/n)`, enter `n`, as for now we just want to scrape all URLs once.
Image of wizard for reference
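To verify that results are landing in your bucket, you can list the uploaded objects. Here is a minimal sketch for AWS S3 using `boto3`, assuming it is installed and your AWS credentials are configured; the bucket and prefix names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix; use the path you entered in the wizard.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="scraper-results/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```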
- Initiate the `run.py` script:

  ```bash
  python3.9 run.py
  ```

- When asked `Select where you wish to store the results.`, enter `2` to choose `Cloud`.
- When asked `Path to cloud bucket and directory/partition where results should be stored:`, enter the path to your cloud directory where completed jobs should be stored. Bucket permissions should be adjusted by following these instructions: https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/cloud-storage.
- When asked `Do you want to schedule urls to be scraped repetitively?(y/n)`, enter `y`.
- Next, you have to select the frequency:
  - If you choose `Daily`, you will need to enter the hour at which scraping jobs will run each day.
  - If you choose `Weekly`, you will need to select the weekday on which scraping jobs will run each week.
  - If you choose `Monthly`, you will need to enter the day on which scraping jobs will run each month.
- Lastly, you will be asked to specify the date and time when the scheduler should stop. You MUST enter the end time in the format `2032-12-21 12:34:45`. You will be able to stop the scheduler before the end time; an example of how to do this is in the next section.
- You will get a schedule id, which you should save.
Image of wizard for reference
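The end time must match the `YYYY-MM-DD HH:MM:SS` pattern shown above. A quick way to check a value before entering it in the wizard:

```python
from datetime import datetime

# The wizard expects this exact format, e.g. "2032-12-21 12:34:45".
end_time = "2032-12-21 12:34:45"
datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S")  # raises ValueError if malformed
```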
- Initiate the `stop.py` script:

  ```bash
  python3.9 stop.py
  ```

- Enter your username and password.
- Select the schedule id you wish to deactivate.
Image of wizard for reference
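For reference, a schedule can also be deactivated directly via the Oxylabs Scheduler API (likely similar to what `stop.py` does under the hood, though that is an assumption). Below is a sketch using `requests`; the endpoint path follows the Oxylabs Scheduler documentation but should be verified against the current docs, and the credentials and schedule id are placeholders:

```python
import requests

USERNAME = "your-oxylabs-username"  # placeholder
PASSWORD = "your-oxylabs-password"  # placeholder
SCHEDULE_ID = "12345"               # the schedule id you saved earlier

# Deactivate the schedule (endpoint path per Oxylabs Scheduler docs; verify).
response = requests.put(
    f"https://data.oxylabs.io/v1/schedules/{SCHEDULE_ID}/state",
    json={"active": False},
    auth=(USERNAME, PASSWORD),
    timeout=30,
)
response.raise_for_status()
print(f"Schedule {SCHEDULE_ID} deactivated")
```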