⚠️ SECURITY ALERT: This repository had exposed secrets in its history. See SECURITY.md for immediate actions.

🔀 FORK NOTICE: This is a heavily modified fork. Original: ardakaraosmanoglu/kibris-emlak-101evler-scraper
Changes: +2,000 lines, Telegram bot, notifications, Docker support, comprehensive scanning, professional architecture.
This project scrapes property listing data from 101evler.com (specifically for Northern Cyprus) and extracts the details into a structured CSV file.
- Scrapes listing URLs from search result pages.
- Uses Playwright for search pages to handle dynamic content.
- Saves individual listing pages as HTML files.
- Avoids re-scraping already saved listings and search pages.
- Extracts detailed information from saved HTML listing pages using BeautifulSoup.
- Handles potential errors during scraping and extraction.
- Calculates approximate monthly rent in TL (based on a 14x multiplier and current exchange rates from the Turkish Central Bank).
- Outputs extracted data to a CSV file (`property_details.csv`).
- Includes a continuous run mode for `extract_data.py`.
- Automatically detects the total number of pages and listings.
- Pauses for 10 minutes when blocked by access controls, then automatically resumes.
- Stops when two consecutive blocking attempts are detected.
Prerequisites:

- Python 3.8+
- pip (Python package installer)
Installation:

- Clone the repository:

  ```
  git clone <your-repository-url>
  cd <repository-directory>
  ```

- Create a virtual environment (recommended):

  ```
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Install Playwright browsers: the `crawl4ai` library uses Playwright, so you may need to install its browser binaries the first time (a quick sanity check follows below):

  ```
  playwright install
  ```
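If you want to confirm the browsers installed correctly, a quick sanity check like the one below (not part of this project) should open and close a headless Chromium without errors:

```python
# Optional sanity check: verify Playwright and its Chromium build are usable.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()
```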
Scraping (`scraper.main`): this script crawls the search result pages to find listing URLs and then scrapes each listing's HTML content.
- Configuration:
  * Open `src/scraper/main.py`.
  * Modify `base_search_url`: change the path (e.g., `/magusa`, `/girne`, `/lefkosa`) to target different areas or property types (e.g., `kiralik-daire`, `satilik-villa`). The base should look like `https://www.101evler.com/kibris/<listing_type>/<area>`.
  * You can also adjust `output_dir` (default: `listings`) and `pages_dir` (default: `pages`) if needed; a hedged sketch of these settings appears after this section.
- Command-line Arguments:
  * `--max-pages`: specify the maximum number of search pages to scrape.

    ```
    python -m scraper.main --max-pages 15
    ```

    If not specified, the script automatically detects and uses the total number of pages on the website.
- Run the scraper:

  ```
  python -m scraper.main
  ```
The script will:
* Fetch the first search page and automatically determine the total number of pages and listings.
* Fetch search result pages (using Playwright) up to the determined maximum number of pages.
* Save search page HTML to the `pages/` directory.
* Extract listing URLs from these pages.
* Fetch individual listing pages (without Playwright).
* Save listing HTML to the `listings/` directory.
* Skip already downloaded search pages and listings.
* Log progress and delays to the console.
* When access is blocked, wait for 10 minutes and retry automatically.
* If still blocked on the second attempt, stop.
* Failed listing URLs are saved to `listings/failed/failed_urls.txt` for potential retries.
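For orientation, the configuration values described above might look roughly like this inside `src/scraper/main.py`; the exact variable names and defaults below are a hedged sketch based on this README, not the file's actual contents:

```python
# Illustrative values only -- check src/scraper/main.py for the real names and defaults.
base_search_url = "https://www.101evler.com/kibris/kiralik-daire/lefkosa"  # .../<listing_type>/<area>
output_dir = "listings"  # where individual listing HTML files are saved
pages_dir = "pages"      # where search-result page HTML files are saved
```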
Data extraction (`scraper.extract_data`): this script parses the saved HTML files in the `listings/` directory and extracts property details into a CSV file.
- Run the extractor:

  ```
  python -m scraper.extract_data
  ```
The script will:
* Read all .html files from the `listings/` directory.
* Skip listings already present in the output CSV.
* Parse HTML using BeautifulSoup to extract details like price, location, features, dates, agency, etc.
* Fetch current TRY exchange rates for price conversion.
* Calculate an estimated 14x monthly rent in TL (`price_tl_14x`); a hedged sketch of this calculation appears after this section.
* Append the extracted data to `property_details.csv`.
* Update existing TL prices in the CSV based on current exchange rates.
- Continuous Mode:

  To run the extractor periodically (e.g., if the scraper runs in the background or via cron):

  ```
  python -m scraper.extract_data --continuous [INTERVAL_MINUTES] [MAX_RUNS]
  ```

  * `INTERVAL_MINUTES`: wait time in minutes between runs (default: 30).
  * `MAX_RUNS`: maximum number of times to run (default: 10).
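As a rough illustration of the extraction, TL conversion, and continuous-run steps above, the sketch below parses a saved page with BeautifulSoup, applies the 14x conversion, and loops at the documented interval. The CSS selectors, the exact meaning of the 14x figure, and the helper names are assumptions, not the project's actual code:

```python
# Hedged sketch of extraction, TL conversion, and continuous mode.
import time
from bs4 import BeautifulSoup

def extract_listing(html):
    """Parse one saved listing page into a flat dict (CSS selectors are hypothetical)."""
    soup = BeautifulSoup(html, "html.parser")
    price_node = soup.select_one(".price")        # hypothetical selector
    location_node = soup.select_one(".location")  # hypothetical selector
    return {
        "price_raw": price_node.get_text(strip=True) if price_node else None,
        "location": location_node.get_text(strip=True) if location_node else None,
    }

def price_tl_14x(amount, rate_to_try):
    # Assumed: convert the listed price to TRY at the current rate, then apply the 14x multiplier.
    return amount * rate_to_try * 14

def run_continuously(extract_once, interval_minutes=30, max_runs=10):
    # Mirrors --continuous: run, then wait INTERVAL_MINUTES between runs, up to MAX_RUNS times.
    for run in range(1, max_runs + 1):
        extract_once()
        if run < max_runs:
            time.sleep(interval_minutes * 60)
```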
Output files:

- `listings/`: directory containing the raw HTML of individual property listings.
- `pages/`: directory containing the raw HTML of search result pages.
- `property_details.csv`: CSV file containing the extracted and structured property data.
The script now automatically determines the total number of pages and listings by:
- Making an API request to mimic the website's JavaScript behavior
- Analyzing HTML content to find pagination information
- Calculating the total pages based on total listings (assuming 30 listings per page; see the sketch below)
- Selecting the most reliable source of information
This means you no longer need to manually set the maximum number of pages to scrape. The script will:
- Print the detected total listings and total pages
- Use the detected values for scraping
- Allow you to override with the `--max-pages` argument if needed
- Stop automatically when reaching empty pages
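The page-count heuristic boils down to a simple ceiling division; a minimal sketch, assuming the 30-listings-per-page figure mentioned above:

```python
# Minimal sketch of the total-page calculation (assumes 30 listings per page).
import math

def detect_total_pages(total_listings, listings_per_page=30):
    return math.ceil(total_listings / listings_per_page)

print(detect_total_pages(1234))  # 1,234 listings -> 42 pages
```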
The script implements a smart cooldown system when access is blocked:
- When a blocking page is detected (Cloudflare or other access controls), the script:
  - Displays the message `!!! Erişim engellendi. 10 dakika bekleniyor ve tekrar denenecek... !!!` ("Access blocked. Waiting 10 minutes, then retrying...")
  - Waits for 10 minutes (cooldown period)
  - Automatically retries the same request
- This allows temporary IP restrictions or rate limiting to expire before continuing.
- If blocked again on the second attempt, the script stops to prevent further blocking.
This feature makes the scraper more resilient against temporary access restrictions and allows for unattended operation.
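A minimal sketch of this retry flow, assuming a simple marker-based block check (the real script's detection logic may differ):

```python
# Hedged sketch of the 10-minute cooldown and stop-on-second-block behaviour.
import time

COOLDOWN_SECONDS = 10 * 60  # 10-minute cooldown

def looks_blocked(html):
    # Hypothetical check; the real script inspects the page for Cloudflare/access-control markers.
    return "cf-browser-verification" in html or "Access denied" in html

def fetch_with_cooldown(fetch, url):
    html = fetch(url)
    if not looks_blocked(html):
        return html
    print("!!! Erişim engellendi. 10 dakika bekleniyor ve tekrar denenecek... !!!")
    time.sleep(COOLDOWN_SECONDS)
    html = fetch(url)
    if looks_blocked(html):
        raise RuntimeError("Blocked on the second attempt as well -- stopping.")
    return html
```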
For the full list of Python dependencies, see `requirements.txt`.
The easiest way to run this scraper is using Docker, which handles all dependencies including Playwright browsers automatically.
- Docker installed (version 20.10+)
- Docker Compose installed (version 1.29+)
Quick start:

- Build the Docker image:

  ```
  docker-compose build
  ```

- Run the scraper (one-time execution):

  ```
  docker-compose run --rm scraper python -m scraper.main
  ```

- Run data extraction:

  ```
  docker-compose run --rm scraper python -m scraper.extract_data
  ```

- Generate reports:

  ```
  docker-compose run --rm scraper python -m scraper.report
  ```

- Run orchard analysis:

  ```
  docker-compose run --rm scraper python -m scraper.orchard_analysis
  ```

- Generate Word report for agents:

  ```
  docker-compose run --rm scraper python -m scraper.generate_agent_report
  ```

- Search examples:

  ```
  # Basic search
  docker-compose run --rm scraper python -m scraper.search basic "guzelyurt arsa" --out reports/search_results.xlsx

  # Advanced search
  docker-compose run --rm scraper python -m scraper.search advanced --city guzelyurt --property-type arsa --min-donum 5
  ```
End-to-end example to scrape, extract, and report all Lefkoşa rentals with max ₺30,000:
```
# 1) Scrape Lefkoşa rental listings (adjust max pages as needed)
docker-compose run --rm scraper python -m scraper.main --city lefkosa --listing-type kiralik --property-type daire --max-pages 15

# 2) Extract data from saved HTML into CSV
docker-compose run --rm scraper python -m scraper.extract_data

# 3) Generate filtered rental report (Markdown/Excel output under reports/)
docker-compose run --rm scraper python -m scraper.report lefkosa-rent --max-price-try 30000
```

Notes:

- Step 1 parameters depend on your CLI in `scraper.main` (city/listing-type/subtype). If they are not available, run the default scrape and rely on the step 3 filter.
- The report command focuses on KKTC Lefkoşa rentals and includes only entries where the normalized price in TRY is ≤ 30,000 (see the sketch after these notes).
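For reference, the filter applied by the `lefkosa-rent` report is roughly equivalent to the pandas expression below; the column names are assumptions based on this README, not the CSV's guaranteed schema:

```python
# Hedged equivalent of the lefkosa-rent filter (column names are assumptions).
import pandas as pd

df = pd.read_csv("property_details.csv")
rentals = df[
    df["city"].str.contains("lefkosa", case=False, na=False)
    & (df["listing_type"] == "kiralik")
    & (df["price_try"] <= 30000)
]
rentals.to_excel("reports/lefkosa_rent_filtered.xlsx", index=False)
```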
All important data is persisted via Docker volumes:
- `property_details.csv` - Main database
- `pages/` - Search result HTML pages
- `listings/` - Individual listing HTML files
- `reports/` - Generated reports (MD, XLSX, DOCX)
- `temp/` - Temporary files
These directories are mapped to your local filesystem, so data persists even if containers are removed.
To run the scraper continuously in the background:
```
docker-compose up -d scraper
```

View logs:

```
docker-compose logs -f scraper
```

Stop the service:

```
docker-compose down
```

To enable automated scheduling with cron:
- Edit the `crontab` file to configure your schedule
- Start the scheduler service:

  ```
  docker-compose --profile scheduler up -d scraper-scheduler
  ```
Example cron schedule:
- Daily scraping at 2 AM
- Data extraction at 2:30 AM
- Report generation at 3 AM
- Weekly orchard analysis on Mondays at 4 AM
Default resource limits:
- CPU: 1-2 cores
- Memory: 1-2 GB
Adjust in `docker-compose.yml` under `deploy.resources` if needed.
For debugging or manual operations:
```
docker-compose run --rm scraper /bin/bash
```

Add custom environment variables in `docker-compose.yml` under the `environment` section:

```
environment:
  - PYTHONUNBUFFERED=1
  - CUSTOM_VAR=value
```

Quick command reference:

```
# Build image
docker-compose build
# Run specific script
docker-compose run --rm scraper python <script.py>
# Start service in background
docker-compose up -d
# View logs
docker-compose logs -f [service-name]
# Stop all services
docker-compose down
# Remove volumes (WARNING: deletes data)
docker-compose down -v
# Shell access
docker-compose run --rm scraper /bin/bash
# Check container status
docker-compose ps
# Restart service
docker-compose restart scraper
```

Troubleshooting:

Playwright browser issues:
- Browsers are pre-installed in the Docker image
- If issues occur, rebuild: `docker-compose build --no-cache`
Permission errors:
- Ensure the mounted directories are writable
- On Linux: `chmod -R 777 pages listings reports temp`
Out of memory:
- Increase memory limits in docker-compose.yml
- Or restart Docker Desktop and allocate more resources
Network timeouts:
- Check your internet connection
- Increase timeout values in scripts if needed
Data not persisting:
- Verify volume mounts in docker-compose.yml
- Check that local directories exist before running
Project structure:

```
.
├─ src/
│  └─ scraper/
│     ├─ __init__.py
│     ├─ main.py                    # Scraper entrypoint (python -m scraper.main)
│     ├─ extract_data.py            # HTML → CSV extractor (python -m scraper.extract_data)
│     ├─ report.py                  # Reports + CLI (general, guzelyurt-land, lefkosa-rent)
│     ├─ excel_report.py            # Excel aggregations
│     ├─ search.py                  # Basic/advanced search + export
│     ├─ orchard_analysis.py        # Orchard/land analysis
│     └─ generate_agent_report.py   # Agent-facing DOCX
├─ listings/                        # Saved listing HTML files (persisted)
├─ pages/                           # Saved search page HTML files (persisted)
├─ reports/                         # Generated reports (MD/XLSX/DOCX)
├─ temp/                            # Temporary files
├─ docker-compose.yml               # Orchestration
├─ Dockerfile                       # Multi-stage Docker image (PYTHONPATH=/app/src)
├─ crontab                          # Optional cron jobs (module-based)
├─ requirements.txt                 # Python dependencies
└─ README.md                        # This file
```
Additional command examples:

```
python -m scraper.orchard_analysis --city guzelyurt --property-type arsa --listing-type Sale --min-donum 1 --export-json reports/guzelyurt_orchard_summary.json
python -m scraper.orchard_analysis --min-donum 10 --core-city-token guzelyurt --core-district-tokens piyalepasa,merkez --export-xlsx reports/guzelyurt_orchard_pricing_core10.xlsx
python -m scraper.generate_agent_report
python -m scraper.search advanced --city guzelyurt --property-type arsa --min-donum 5 --sort price_per_donum_try:asc --out reports/arama_guzelyurt_arsa_5donum.xlsx
python -m scraper.report
python -m scraper.excel_report
```

You have a remote configured at:
```
origin  https://github.com/ardakaraosmanoglu/kibris-emlak-101evler-scraper (fetch)
origin  https://github.com/ardakaraosmanoglu/kibris-emlak-101evler-scraper (push)
```
If you can already push with your Git credentials, simply run a normal push from the repo root:
```
git push -u origin main
```

If you prefer using a Personal Access Token (PAT) without storing credentials globally, use the helper scripts in `scripts/`:
- Create a classic GitHub PAT with the `repo` scope.
- Push using the PAT (Basic auth header under the hood):

  ```
  # Usage
  PowerShell -ExecutionPolicy Bypass -File scripts/git_push_with_token.ps1 -Username <github-username> -Token <your_pat> -RemoteUrl "https://github.com/<owner>/<repo>.git" -Branch main
  ```

To validate remote access with a PAT without pushing:

```
PowerShell -ExecutionPolicy Bypass -File scripts/git_lsremote_with_token.ps1 -Username <github-username> -Token <your_pat> -RemoteUrl "https://github.com/<owner>/<repo>.git"
```

Notes:
- For fine-grained PATs, ensure permissions include “Contents: Read and Write” and “Administration: Read and Write” if you plan to create repos via API.
- Interactive alternative: install Git Credential Manager or the GitHub CLI (`gh auth login`); `git push` will then open a browser window to authenticate.
- Never commit your PAT or include it in URLs stored in git history.
- Do NOT commit secrets (token, API key, password, cookies) — even in comments.
- Store secrets in `.env` (gitignored) or CI secrets. Use `.env.example` as a template (see the sketch after this list).
- Local protection: pre-commit runs Gitleaks on staged changes.
  - Install once: `pip install pre-commit`, then `pre-commit install`.
  - Optional full run: `pre-commit run --all-files`.
- CI protection: GitHub Actions runs Gitleaks on every push/PR.
- If a leak happens: revoke/rotate the key, remove it from files, rewrite history with `git-filter-repo`, and force-push.
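For illustration, application code can load such secrets at runtime with `python-dotenv`; the variable name below is hypothetical:

```python
# Hedged example of reading secrets from .env instead of hard-coding them
# (requires python-dotenv; TELEGRAM_BOT_TOKEN is a hypothetical variable name).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment
token = os.environ.get("TELEGRAM_BOT_TOKEN")
if token is None:
    raise RuntimeError("TELEGRAM_BOT_TOKEN is not set; copy .env.example to .env and fill it in.")
```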
See `docs/SECURITY.md` for a detailed guide.