Note: Beginner learning project focused on web scraping skills and foundational scraping logic.
- Python 3.11
- Selenium
- WebDriver Manager
- BeautifulSoup4
- Pandas
This project is an initial iteration of a Python-based web scraper designed to gather data on used car and motorcycle listings from OLX Portugal (OLX.pt). It leverages modern web scraping techniques to collect structured listing data.
- Multi-Page Scraping: Effectively navigates through multiple search result pages by manipulating the `?page=N` URL parameter, demonstrating pagination handling.
- Efficient Browser Automation: Utilizes Selenium and WebDriver Manager to control a Chrome instance, crucially reusing a single browser session across multiple page requests for improved efficiency.
- Robust Waiting Strategy: Implements Selenium's `WebDriverWait` to wait intelligently for essential page elements (like the main listing grid) to load, minimizing errors caused by fixed delays and slow-loading dynamic content.
- Adaptive Data Extraction: Employs BeautifulSoup4 with CSS selectors (prioritizing `data-testid` attributes where available, with class-based fallbacks) to parse the rendered HTML and extract key listing fields.
- Structured Output: Organizes scraped data using Pandas DataFrames and saves the results to a CSV file for easy access and further analysis.
- Forward-Thinking Design: Includes analysis of OLX.pt's URL structure, identifying parameters for filtering (price, year, fuel type, etc.) and sorting (price, date). This lays the groundwork for dynamic, user-configurable search URLs.
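The pagination and waiting strategy described above can be sketched as follows. This is an illustrative sketch, not the project's actual code: the category URL and the `listing-grid` test id are assumptions, and the `?page=N` handling mirrors the URL-parameter approach the README describes.

```python
# Sketch: one reused Chrome session, page-by-page URLs, explicit waits.
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

# Assumed OLX.pt cars category URL (illustrative).
BASE_URL = "https://www.olx.pt/carros-motos-e-barcos/carros/"

def build_page_url(base: str, page: int) -> str:
    """Set or overwrite the ?page=N query parameter on a search URL."""
    parts = urlparse(base)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

def scrape_pages(max_pages: int = 3) -> list[str]:
    # Selenium imports kept local so build_page_url works without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager

    # Start Chrome once and reuse the same session for every page request.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    pages_html = []
    try:
        for page in range(1, max_pages + 1):
            driver.get(build_page_url(BASE_URL, page))
            # Wait for the listing grid instead of sleeping a fixed time.
            WebDriverWait(driver, 15).until(
                EC.presence_of_element_located(
                    # Assumed data-testid; the real attribute may differ.
                    (By.CSS_SELECTOR, '[data-testid="listing-grid"]')
                )
            )
            pages_html.append(driver.page_source)
    finally:
        driver.quit()
    return pages_html
```

Keeping the browser alive across iterations avoids the startup cost of launching Chrome for every page, which is the main efficiency win mentioned above.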
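The extraction and output steps might look like the sketch below. The selectors (`l-card`, `ad-price`, the class fallbacks) and field names are assumptions for illustration; the real OLX.pt markup should be checked before relying on them.

```python
# Sketch: data-testid-first parsing with class-based fallbacks, saved via pandas.
from bs4 import BeautifulSoup
import pandas as pd

def extract_listings(html: str) -> list[dict]:
    """Parse one rendered results page into a list of row dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # Prefer stable data-testid hooks; fall back to a class selector (assumed names).
    cards = soup.select('[data-testid="l-card"]') or soup.select("div.listing-card")
    rows = []
    for card in cards:
        title_el = card.select_one("h6") or card.select_one(".title")
        price_el = card.select_one('[data-testid="ad-price"]') or card.select_one(".price")
        rows.append({
            "title": title_el.get_text(strip=True) if title_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
        })
    return rows

def save_listings(rows: list[dict], path: str = "listings.csv") -> pd.DataFrame:
    """Collect rows into a DataFrame and write them to CSV."""
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)
    return df
```

The `or` fallback pattern keeps extraction working when an ad variant lacks the preferred `data-testid` attribute, which is the adaptivity the feature list refers to.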
Developing this scraper involved overcoming challenges common to scraping large commercial platforms:
- Initial Tooling/Environment Compatibility: Early exploration involved evaluating different browser automation tools (like Playwright) within specific server environments on Windows. Encountered compatibility issues that ultimately led to adopting Selenium with WebDriver Manager.
- Dynamic Content & Structure Variations: OLX.pt required browser automation (Selenium) to ensure all listing content was loaded correctly. Variations in HTML structure between different ad types were handled with prioritized `data-testid` selectors and class-based fallbacks.
- Efficient Pagination: Dynamically detecting the total number of pages proved unreliable due to how the pagination controls were rendered. Solution: Adopted a pragmatic approach using a configurable page limit (`MAX_PAGES_TO_SCRAPE`).
- Status: Initial working version. Successfully scrapes core data fields from a defined number of pages within the OLX.pt cars category. Saves combined data to CSV.
- Next Steps:
- Implement dynamic URL generation based on user-defined filters and sorting options.
- Add comprehensive data cleaning and parsing logic (e.g., numeric price, year/km extraction, date parsing).
- Refine selectors further for edge cases.
- Integrate database storage for persistent data and tracking changes.
- Develop logic for more efficient scraping runs (e.g., only fetching newest ads).
- Clone the repository.
- Navigate into the project directory (`autoscannerpt`).
- Create and activate a Python virtual environment (e.g., `python -m venv myenv`, then `myenv\Scripts\activate` on Windows).
- Install dependencies: `pip install -r requirements.txt`
- Run the scraper: `python src/autoscannerpt/scraper.py`
- (Modify configuration constants like `MAX_PAGES_TO_SCRAPE` directly in `src/autoscannerpt/scraper.py` for now.)