A LinkedIn scraper designed to extract every post from a profile's activity feed, going back to account creation, with text cleaning, engagement metrics, image downloads, and report generation in JSON, Markdown, and PDF. It combines multi-selector post detection, time-limited scrolling strategies, and deduplication.
- Comprehensive post detection: Multiple selectors to catch all post types
- Advanced scrolling strategy: Ensures ALL posts are loaded since account creation
- Intelligent deduplication: Prevents duplicate posts using multiple methods
- Enhanced text cleaning: Superior UI element removal and formatting
- Progress tracking: Real-time updates during scraping process
- Time-based limits: Prevents infinite scrolling with configurable timeouts
- Better engagement extraction: More accurate metrics collection
- Robust error handling: Continues scraping despite individual post errors
- Multiple output formats: Automatically generates JSON, Markdown, and PDF reports
- Professional reports: Beautiful reports with statistics, images, and comprehensive data
- Image downloads: Downloads and processes post images with optimization
- Authentication support: Email/password and cookie-based login
- Configurable limits: Control post count and scroll behavior
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
playwright install

Set these environment variables for authentication:
export LINKEDIN_EMAIL="your-email@example.com"
export LINKEDIN_PASSWORD="your-password"
export LINKEDIN_COOKIES="path/to/cookies.txt"
export LINKEDIN_STATE="auth_state.json"

You can use browser cookies by saving them in Netscape format to your_linkedin_cookies.txt.
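
For reference, a Netscape-format cookie file is tab-separated with seven fields per line: domain, subdomain flag, path, secure flag, expiry, name, and value. The following is a minimal sketch of how such a file could be converted into cookie dictionaries of the shape Playwright accepts. The function name `parse_netscape_cookies` is hypothetical and is not taken from the scraper's code.

```python
from pathlib import Path
from typing import Any, Dict, List, Union


def parse_netscape_cookies(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Parse a Netscape cookies.txt file into Playwright-style cookie dicts."""
    cookies: List[Dict[str, Any]] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Some exporters mark HttpOnly cookies with a "#HttpOnly_" prefix.
        http_only = line.startswith("#HttpOnly_")
        if http_only:
            line = line[len("#HttpOnly_"):]
        elif not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # skip malformed lines instead of failing
        domain, _subdomains, cookie_path, secure, expires, name, value = fields
        cookie: Dict[str, Any] = {
            "name": name,
            "value": value,
            "domain": domain,
            "path": cookie_path,
            "secure": secure.upper() == "TRUE",
            "httpOnly": http_only,
        }
        # An expiry of 0 denotes a session cookie, so leave "expires" unset.
        if expires.isdigit() and int(expires) > 0:
            cookie["expires"] = int(expires)
        cookies.append(cookie)
    return cookies
```

The resulting list could then be passed to Playwright's `BrowserContext.add_cookies`.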
# Scrape ALL posts from an account (comprehensive mode) - Generates JSON, Markdown, and PDF
python enhanced_linkedin_scraper.py --url "https://www.linkedin.com/in/username/recent-activity/" --headed
# Scrape with a 45-minute time limit (the default is 30 minutes)
python enhanced_linkedin_scraper.py --url "URL" --max-time 45 --headed
# Scrape specific number of posts
python enhanced_linkedin_scraper.py --url "URL" --max-posts 100 --headed
# Generate only specific output formats
python enhanced_linkedin_scraper.py --url "URL" --output-format pdf --headed
python enhanced_linkedin_scraper.py --url "URL" --output-format markdown --headed
# With authentication
python enhanced_linkedin_scraper.py --url "URL" --email "your@email.com" --password "password" --headed

| Argument | Description | Default |
|---|---|---|
| `--url` | LinkedIn profile activity URL | Required |
| `--max-posts` | Maximum posts to scrape | None (all) |
| `--max-time` | Maximum time in minutes | 30 |
| `--headed` | Run with visible browser | False |
| `--email` | LinkedIn email | From env var |
| `--password` | LinkedIn password | From env var |
| `--cookies` | Netscape cookies file | From env var |
| `--state` | Playwright storage state file | auth_state.json |
| `--no-images` | Skip image downloads | False |
| `--output-format` | Output format: json, markdown, pdf, all | all |
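
As an illustration of how the arguments in the table fit together, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an assumption about how such a CLI could be declared, not the script's actual source. The function name `build_parser` is hypothetical, and the fallback to the `LINKEDIN_*` environment variables is inferred from the "From env var" entries.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults (sketch)."""
    parser = argparse.ArgumentParser(description="LinkedIn post scraper (sketch)")
    parser.add_argument("--url", required=True, help="LinkedIn profile activity URL")
    parser.add_argument("--max-posts", type=int, default=None, help="Maximum posts (default: all)")
    parser.add_argument("--max-time", type=int, default=30, help="Maximum time in minutes")
    parser.add_argument("--headed", action="store_true", help="Run with visible browser")
    # Credentials fall back to environment variables when flags are omitted.
    parser.add_argument("--email", default=os.environ.get("LINKEDIN_EMAIL"))
    parser.add_argument("--password", default=os.environ.get("LINKEDIN_PASSWORD"))
    parser.add_argument("--cookies", default=os.environ.get("LINKEDIN_COOKIES"))
    parser.add_argument("--state", default=os.environ.get("LINKEDIN_STATE", "auth_state.json"))
    parser.add_argument("--no-images", action="store_true", help="Skip image downloads")
    parser.add_argument(
        "--output-format",
        choices=["json", "markdown", "pdf", "all"],
        default="all",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```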
The scraper generates multiple output formats:
- `linkedin_posts_enhanced_YYYYMMDD-HHMMSS.json`: Complete raw data
- `linkedin_posts_report_YYYYMMDD-HHMMSS.md`: Markdown report with statistics
- `linkedin_posts_report_YYYYMMDD-HHMMSS.pdf`: Professional PDF report with images
- `linkedin_images/`: Downloaded images directory
- `auth_state.json`: Saved authentication state
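
The `YYYYMMDD-HHMMSS` part of each file name is a timestamp. A minimal sketch of how names in this pattern could be generated, assuming local time is used (the helper `output_paths` is hypothetical):

```python
from datetime import datetime
from typing import Dict, Optional


def output_paths(now: Optional[datetime] = None) -> Dict[str, str]:
    """Build timestamped output file names following the documented pattern."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return {
        "json": f"linkedin_posts_enhanced_{stamp}.json",
        "markdown": f"linkedin_posts_report_{stamp}.md",
        "pdf": f"linkedin_posts_report_{stamp}.pdf",
    }
```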
- Comprehensive statistics (post count, engagement, word count)
- Chronologically ordered posts
- Author information and engagement metrics
- Embedded images with proper linking
- Clean formatting suitable for documentation
- Professional layout with title page
- Summary statistics table
- Post content with proper formatting
- Embedded images with automatic scaling
- Page breaks for optimal readability
- Clickable LinkedIn URLs
- Uses 9+ different selectors to catch all post types
- Handles various LinkedIn post formats and layouts
- Detects posts that basic scrapers might miss
- Intelligent scrolling: Multiple techniques to trigger content loading
- Time-based limits: Prevents infinite scrolling (30-minute default)
- Alternative loading methods: Tries multiple approaches when regular scrolling fails
- Progress tracking: Real-time updates on posts found and time remaining
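
The time-limited scrolling described above can be sketched as a loop that stops at a deadline, at a post limit, or after several scrolls that load nothing new. This is a simplified illustration, not the scraper's actual code. The function `scroll_until_done`, its parameters, and the stall threshold are all assumptions, and the browser actions are passed in as callables so the loop stays independent of Playwright.

```python
import time
from typing import Callable, Optional


def scroll_until_done(
    scroll_once: Callable[[], None],
    count_posts: Callable[[], int],
    max_minutes: float = 30,
    max_posts: Optional[int] = None,
    max_stalls: int = 5,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Scroll until the time limit, the post limit, or repeated stalls."""
    deadline = clock() + max_minutes * 60
    seen = count_posts()
    stalls = 0
    while clock() < deadline and stalls < max_stalls:
        if max_posts is not None and seen >= max_posts:
            break
        scroll_once()
        current = count_posts()
        if current > seen:
            seen, stalls = current, 0  # new content loaded, reset stall counter
        else:
            stalls += 1  # nothing new appeared after this scroll
        remaining = max(0.0, deadline - clock())
        print(f"posts found: {seen}, time remaining: {remaining / 60:.1f} min")
    return seen
```

With Playwright's sync API, `scroll_once` could wrap something like `page.mouse.wheel(0, 2000)` and `count_posts` could wrap `page.locator(selector).count()`, where `selector` is whichever post selector applies.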
- 50+ removal patterns: Removes UI elements, buttons, metadata
- Smart deduplication: Prevents duplicate content using multiple methods
- Better formatting: Preserves meaningful content while removing noise
- Word/character counts: Provides content metrics
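
To make the cleaning and deduplication steps concrete, here is a small sketch that strips a few UI strings with regular expressions and drops posts whose URL or normalized text has already been seen. The three patterns are illustrative placeholders rather than the scraper's real list, and the names `UI_PATTERNS`, `clean_text`, and `dedupe` are hypothetical.

```python
import hashlib
import re
from typing import Any, Dict, Iterable, List

# Illustrative patterns only; the scraper's actual pattern list is not shown here.
UI_PATTERNS = [
    re.compile(r"^\s*(Like|Comment|Repost|Send|Share)\s*$", re.IGNORECASE),
    re.compile(r"^\s*(…|\.\.\.)?\s*see more\s*$", re.IGNORECASE),
    re.compile(r"^\s*\d+\s+(comments?|reposts?|reactions?)\s*$", re.IGNORECASE),
]


def clean_text(raw: str) -> str:
    """Drop lines that look like UI chrome and collapse extra whitespace."""
    lines = []
    for line in raw.splitlines():
        if any(pattern.match(line) for pattern in UI_PATTERNS):
            continue
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


def dedupe(posts: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first post for each URL or text hash, and add content metrics."""
    seen_keys = set()
    unique = []
    for post in posts:
        text = clean_text(post.get("text", ""))
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        keys = {("hash", digest)}
        if post.get("url"):
            keys.add(("url", post["url"]))
        if keys & seen_keys:
            continue  # same URL or same normalized text already kept
        seen_keys |= keys
        unique.append(
            {**post, "text": text, "word_count": len(text.split()), "char_count": len(text)}
        )
    return unique
```

Checking both the URL and a hash of the cleaned text is one way to catch a post that appears twice under different selectors or with slightly different surrounding markup.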
- Continues on errors: Individual post failures don't stop the entire process
- Multiple fallbacks: Tries different extraction methods if one fails
- Detailed logging: Comprehensive debugging information
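
The error-handling behaviour listed above can be sketched as a chain of extractors tried in order, with failures logged and skipped rather than raised. This is an assumed structure for illustration. The names `extract_with_fallbacks` and `extract_all` are hypothetical, and the extractors are plain callables standing in for whatever extraction methods the scraper uses.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List, Optional, Sequence

logger = logging.getLogger("scraper_sketch")

Extractor = Callable[[Any], Optional[Dict[str, Any]]]


def extract_with_fallbacks(element: Any, extractors: Sequence[Extractor]) -> Optional[Dict[str, Any]]:
    """Return the first non-empty result, trying each extractor in order."""
    for extractor in extractors:
        try:
            result = extractor(element)
        except Exception as exc:  # one failing method should not abort the post
            name = getattr(extractor, "__name__", repr(extractor))
            logger.debug("%s failed: %s", name, exc)
            continue
        if result:
            return result
    return None


def extract_all(elements: Iterable[Any], extractors: Sequence[Extractor]) -> List[Dict[str, Any]]:
    """Extract every element, skipping those where all extractors fail."""
    posts = []
    for index, element in enumerate(elements):
        post = extract_with_fallbacks(element, extractors)
        if post is None:
            logger.warning("Skipping post %d: no extractor succeeded", index)
            continue
        posts.append(post)
    return posts
```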
If you encounter issues:
- Run with `--headed` to see what's happening in the browser
- Check if the URL is accessible and contains posts
- Verify authentication credentials if scraping private content
- LinkedIn may have updated their HTML structure - check for updates
- Python 3.7+
- playwright
- reportlab
- Pillow
- requests
- asyncio-pool
- typing-extensions
This tool is for educational and research purposes. Ensure compliance with LinkedIn's Terms of Service and applicable laws when using this scraper.