
LinkedIn scraper that outputs PDF and Markdown reports; still under development.


MasterCraftArc/Scrappy


Enhanced LinkedIn Scraper

A comprehensive LinkedIn scraper designed to extract ALL posts since account creation with enhanced text cleaning, engagement metrics, image downloads, and professional report generation. Features advanced post detection, comprehensive scrolling strategies, and intelligent deduplication.

Features

  • Comprehensive post detection: Multiple selectors to catch all post types
  • Advanced scrolling strategy: Ensures ALL posts are loaded since account creation
  • Intelligent deduplication: Prevents duplicate posts using multiple methods
  • Enhanced text cleaning: Superior UI element removal and formatting
  • Progress tracking: Real-time updates during scraping process
  • Time-based limits: Prevents infinite scrolling with configurable timeouts
  • Better engagement extraction: More accurate metrics collection
  • Robust error handling: Continues scraping despite individual post errors
  • Multiple output formats: Automatically generates JSON, Markdown, and PDF reports
  • Professional reports: Beautiful reports with statistics, images, and comprehensive data
  • Image downloads: Downloads and processes post images with optimization
  • Authentication support: Email/password and cookie-based login
  • Configurable limits: Control post count and scroll behavior

Setup

1. Create and activate virtual environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Install Playwright browsers

playwright install

Configuration

Environment Variables (Optional)

Set these environment variables for authentication:

export LINKEDIN_EMAIL="your-email@example.com"
export LINKEDIN_PASSWORD="your-password"
export LINKEDIN_COOKIES="path/to/cookies.txt"
export LINKEDIN_STATE="auth_state.json"
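For reference, these can be read with standard environment lookups. A minimal sketch (the variable names match the exports above; the actual attribute names inside the scraper may differ):

```python
import os

# Illustrative: reading the authentication settings from the environment.
# Missing optional values fall back to None; the state file has a default.
email = os.environ.get("LINKEDIN_EMAIL")
password = os.environ.get("LINKEDIN_PASSWORD")
cookies_path = os.environ.get("LINKEDIN_COOKIES")
state_path = os.environ.get("LINKEDIN_STATE", "auth_state.json")
```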

Cookie Authentication

Alternatively, export your browser cookies in Netscape format and save them to your_linkedin_cookies.txt.
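Playwright does not read Netscape-format cookie files directly, so they need to be converted into its cookie dict format first. A minimal sketch of such a converter (`parse_netscape_cookies` is a hypothetical helper, not part of this repo's code):

```python
# Sketch: convert Netscape cookies.txt content into the dict format that
# Playwright's context.add_cookies() expects. Hypothetical helper.
def parse_netscape_cookies(text):
    """Parse Netscape cookies.txt content into Playwright cookie dicts."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        parts = line.split("\t")
        if len(parts) != 7:
            continue  # malformed row
        # Netscape column order: domain, subdomain flag, path, secure,
        # expiry (epoch seconds), name, value
        domain, _flag, path, secure, expires, name, value = parts
        cookies.append({
            "name": name,
            "value": value,
            "domain": domain,
            "path": path,
            "expires": float(expires),
            "secure": secure.upper() == "TRUE",
        })
    return cookies
```

The resulting list can then be passed to `context.add_cookies(...)` before navigating to LinkedIn.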

Usage

Basic Usage

# Scrape ALL posts from an account (comprehensive mode) - Generates JSON, Markdown, and PDF
python enhanced_linkedin_scraper.py --url "https://www.linkedin.com/in/username/recent-activity/" --headed

# Scrape with time limit (30 minutes default)
python enhanced_linkedin_scraper.py --url "URL" --max-time 45 --headed

# Scrape specific number of posts
python enhanced_linkedin_scraper.py --url "URL" --max-posts 100 --headed

# Generate only specific output formats
python enhanced_linkedin_scraper.py --url "URL" --output-format pdf --headed
python enhanced_linkedin_scraper.py --url "URL" --output-format markdown --headed

# With authentication
python enhanced_linkedin_scraper.py --url "URL" --email "your@email.com" --password "password" --headed

Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--url` | LinkedIn profile activity URL | Required |
| `--max-posts` | Maximum posts to scrape | None (all) |
| `--max-time` | Maximum time in minutes | 30 |
| `--headed` | Run with visible browser | False |
| `--email` | LinkedIn email | From env var |
| `--password` | LinkedIn password | From env var |
| `--cookies` | Netscape cookies file | From env var |
| `--state` | Playwright storage state file | `auth_state.json` |
| `--no-images` | Skip image downloads | False |
| `--output-format` | Output format: json, markdown, pdf, all | all |

Output Files

The scraper generates multiple output formats:

  • linkedin_posts_enhanced_YYYYMMDD-HHMMSS.json - Complete raw data
  • linkedin_posts_report_YYYYMMDD-HHMMSS.md - Markdown report with statistics
  • linkedin_posts_report_YYYYMMDD-HHMMSS.pdf - Professional PDF report with images
  • linkedin_images/ - Downloaded images directory
  • auth_state.json - Saved authentication state
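The timestamp suffix in the filenames above follows a standard strftime pattern. An illustrative sketch of how such names are formed:

```python
from datetime import datetime

# Illustrative: build the YYYYMMDD-HHMMSS suffix used by the output files.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
json_name = f"linkedin_posts_enhanced_{stamp}.json"
md_name = f"linkedin_posts_report_{stamp}.md"
```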

Report Features

Markdown Reports

  • Comprehensive statistics (post count, engagement, word count)
  • Chronologically ordered posts
  • Author information and engagement metrics
  • Embedded images with proper linking
  • Clean formatting suitable for documentation

PDF Reports

  • Professional layout with title page
  • Summary statistics table
  • Post content with proper formatting
  • Embedded images with automatic scaling
  • Page breaks for optimal readability
  • Clickable LinkedIn URLs

Key Improvements

1. Comprehensive Post Detection

  • Uses 9+ different selectors to catch all post types
  • Handles various LinkedIn post formats and layouts
  • Detects posts that basic scrapers might miss
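A sketch of how multi-selector detection can work with Playwright's async API (the selector list below is illustrative, not the scraper's actual list):

```python
# Illustrative selectors for different LinkedIn post layouts; the real
# scraper uses a longer, different list.
POST_SELECTORS = [
    "div.feed-shared-update-v2",
    "div.occludable-update",
    "article[data-urn]",
]

async def find_posts(page):
    """Collect post elements matched by any selector, without duplicates."""
    seen, posts = set(), []
    for selector in POST_SELECTORS:
        for handle in await page.query_selector_all(selector):
            # Prefer LinkedIn's post URN for identity; fall back to the
            # element handle itself when the attribute is absent.
            urn = await handle.get_attribute("data-urn") or id(handle)
            if urn not in seen:
                seen.add(urn)
                posts.append(handle)
    return posts
```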

2. Advanced Scrolling Strategy

  • Intelligent scrolling: Multiple techniques to trigger content loading
  • Time-based limits: Prevents infinite scrolling (30-minute default)
  • Alternative loading methods: Tries multiple approaches when regular scrolling fails
  • Progress tracking: Real-time updates on posts found and time remaining
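A condensed sketch of such a loop, assuming Playwright's async API (the selector, timings, and idle threshold are illustrative):

```python
import asyncio
import time

async def scroll_until_done(page, max_minutes=30, idle_rounds=5):
    """Scroll until no new posts load for `idle_rounds` passes or time is up."""
    deadline = time.monotonic() + max_minutes * 60
    last_count, idle = 0, 0
    while time.monotonic() < deadline and idle < idle_rounds:
        await page.mouse.wheel(0, 4000)   # primary: wheel scroll
        await page.keyboard.press("End")  # alternative loading trigger
        await asyncio.sleep(2)            # give lazy content time to load
        count = len(await page.query_selector_all("div.feed-shared-update-v2"))
        idle = idle + 1 if count == last_count else 0  # track stalled passes
        last_count = count
    return last_count
```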

3. Enhanced Text Cleaning

  • 50+ removal patterns: Removes UI elements, buttons, metadata
  • Smart deduplication: Prevents duplicate content using multiple methods
  • Better formatting: Preserves meaningful content while removing noise
  • Word/character counts: Provides content metrics
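The cleaning and deduplication steps can be sketched as follows (the patterns shown are a small illustrative subset, not the scraper's full 50+ list):

```python
import hashlib
import re

# Illustrative subset of UI-noise patterns; the scraper removes many more.
UI_PATTERNS = [
    r"^\s*Like\s*$", r"^\s*Comment\s*$", r"^\s*Repost\s*$",
    r"^\s*Share\s*$", r"^\s*Follow\s*$", r"\u2026see more",
]

def clean_post_text(raw):
    """Drop lines that match UI patterns and collapse excess blank lines."""
    kept = [line for line in raw.splitlines()
            if not any(re.search(p, line) for p in UI_PATTERNS)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()

def dedup_key(text):
    """Stable fingerprint of normalized content for duplicate detection."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()
```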

4. Robust Error Handling

  • Continues on errors: Individual post failures don't stop the entire process
  • Multiple fallbacks: Tries different extraction methods if one fails
  • Detailed logging: Comprehensive debugging information
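The continue-on-error pattern with fallback extraction methods can be sketched as (the extractor interface here is hypothetical):

```python
import logging

logger = logging.getLogger("scraper")

def extract_all(posts, extractors):
    """Run extractors per post; one failure never aborts the whole run."""
    results, errors = [], 0
    for i, post in enumerate(posts):
        for extract in extractors:  # fallbacks, in priority order
            try:
                results.append(extract(post))
                break  # first successful method wins
            except Exception as exc:
                logger.debug("post %d: %s failed: %s",
                             i, extract.__name__, exc)
        else:
            errors += 1  # every method failed for this post
    return results, errors
```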

Troubleshooting

If you encounter issues:

  1. Run with --headed to see what's happening in the browser
  2. Check if the URL is accessible and contains posts
  3. Verify authentication credentials if scraping private content
  4. LinkedIn may have updated their HTML structure - check for updates

Requirements

  • Python 3.7+
  • playwright
  • reportlab
  • Pillow
  • requests
  • asyncio-pool
  • typing-extensions

Legal Notice

This tool is for educational and research purposes. Ensure compliance with LinkedIn's Terms of Service and applicable laws when using this scraper.
