
LinkedIn scraper that outputs PDF and Markdown reports; still under development.


MasterCraftArc/Scrappy


Enhanced LinkedIn Scraper

A comprehensive LinkedIn scraper designed to extract ALL posts since account creation with enhanced text cleaning, engagement metrics, image downloads, and professional report generation. Features advanced post detection, comprehensive scrolling strategies, and intelligent deduplication.

Features

  • Comprehensive post detection: Multiple selectors to catch all post types
  • Advanced scrolling strategy: Ensures ALL posts are loaded since account creation
  • Intelligent deduplication: Prevents duplicate posts using multiple methods
  • Enhanced text cleaning: Superior UI element removal and formatting
  • Progress tracking: Real-time updates during scraping process
  • Time-based limits: Prevents infinite scrolling with configurable timeouts
  • Better engagement extraction: More accurate metrics collection
  • Robust error handling: Continues scraping despite individual post errors
  • Multiple output formats: Automatically generates JSON, Markdown, and PDF reports
  • Professional reports: Beautiful reports with statistics, images, and comprehensive data
  • Image downloads: Downloads and processes post images with optimization
  • Authentication support: Email/password and cookie-based login
  • Configurable limits: Control post count and scroll behavior

Setup

1. Create and activate virtual environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install dependencies

pip install -r requirements.txt

3. Install Playwright browsers

playwright install

Configuration

Environment Variables (Optional)

Set these environment variables for authentication:

export LINKEDIN_EMAIL="your-email@example.com"
export LINKEDIN_PASSWORD="your-password"
export LINKEDIN_COOKIES="path/to/cookies.txt"
export LINKEDIN_STATE="auth_state.json"
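For reference, these can be read with standard environment lookups. A minimal sketch (the variable names match the exports above; the actual attribute names inside the scraper may differ):

```python
import os

# Illustrative: reading the authentication settings from the environment.
# Missing optional values fall back to None; the state file has a default.
email = os.environ.get("LINKEDIN_EMAIL")
password = os.environ.get("LINKEDIN_PASSWORD")
cookies_path = os.environ.get("LINKEDIN_COOKIES")
state_path = os.environ.get("LINKEDIN_STATE", "auth_state.json")
```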

Cookie Authentication

Alternatively, export your browser cookies in Netscape format and save them to your_linkedin_cookies.txt.
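Playwright does not read Netscape-format cookie files directly, so they need to be converted into its cookie dict format first. A minimal sketch of such a converter (`parse_netscape_cookies` is a hypothetical helper, not part of this repo's code):

```python
# Sketch: convert Netscape cookies.txt content into the dict format that
# Playwright's context.add_cookies() expects. Hypothetical helper.
def parse_netscape_cookies(text):
    """Parse Netscape cookies.txt content into Playwright cookie dicts."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        parts = line.split("\t")
        if len(parts) != 7:
            continue  # malformed row
        # Netscape column order: domain, subdomain flag, path, secure,
        # expiry (epoch seconds), name, value
        domain, _flag, path, secure, expires, name, value = parts
        cookies.append({
            "name": name,
            "value": value,
            "domain": domain,
            "path": path,
            "expires": float(expires),
            "secure": secure.upper() == "TRUE",
        })
    return cookies
```

The resulting list can then be passed to `context.add_cookies(...)` before navigating to LinkedIn.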

Usage

Basic Usage

# Scrape ALL posts from an account (comprehensive mode) - Generates JSON, Markdown, and PDF
python enhanced_linkedin_scraper.py --url "https://www.linkedin.com/in/username/recent-activity/" --headed

# Scrape with time limit (30 minutes default)
python enhanced_linkedin_scraper.py --url "URL" --max-time 45 --headed

# Scrape specific number of posts
python enhanced_linkedin_scraper.py --url "URL" --max-posts 100 --headed

# Generate only specific output formats
python enhanced_linkedin_scraper.py --url "URL" --output-format pdf --headed
python enhanced_linkedin_scraper.py --url "URL" --output-format markdown --headed

# With authentication
python enhanced_linkedin_scraper.py --url "URL" --email "your@email.com" --password "password" --headed

Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--url` | LinkedIn profile activity URL | Required |
| `--max-posts` | Maximum posts to scrape | None (all) |
| `--max-time` | Maximum time in minutes | 30 |
| `--headed` | Run with visible browser | False |
| `--email` | LinkedIn email | From env var |
| `--password` | LinkedIn password | From env var |
| `--cookies` | Netscape cookies file | From env var |
| `--state` | Playwright storage state file | `auth_state.json` |
| `--no-images` | Skip image downloads | False |
| `--output-format` | Output format: json, markdown, pdf, all | all |

Output Files

The scraper generates multiple output formats:

  • linkedin_posts_enhanced_YYYYMMDD-HHMMSS.json - Complete raw data
  • linkedin_posts_report_YYYYMMDD-HHMMSS.md - Markdown report with statistics
  • linkedin_posts_report_YYYYMMDD-HHMMSS.pdf - Professional PDF report with images
  • linkedin_images/ - Downloaded images directory
  • auth_state.json - Saved authentication state
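The timestamp suffix in the filenames above follows a standard strftime pattern. An illustrative sketch of how such names are formed:

```python
from datetime import datetime

# Illustrative: build the YYYYMMDD-HHMMSS suffix used by the output files.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
json_name = f"linkedin_posts_enhanced_{stamp}.json"
md_name = f"linkedin_posts_report_{stamp}.md"
```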

Report Features

Markdown Reports

  • Comprehensive statistics (post count, engagement, word count)
  • Chronologically ordered posts
  • Author information and engagement metrics
  • Embedded images with proper linking
  • Clean formatting suitable for documentation

PDF Reports

  • Professional layout with title page
  • Summary statistics table
  • Post content with proper formatting
  • Embedded images with automatic scaling
  • Page breaks for optimal readability
  • Clickable LinkedIn URLs

Key Improvements

1. Comprehensive Post Detection

  • Uses 9+ different selectors to catch all post types
  • Handles various LinkedIn post formats and layouts
  • Detects posts that basic scrapers might miss
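A sketch of how multi-selector detection can work with Playwright's async API (the selector list below is illustrative, not the scraper's actual list):

```python
# Illustrative selectors for different LinkedIn post layouts; the real
# scraper uses a longer, different list.
POST_SELECTORS = [
    "div.feed-shared-update-v2",
    "div.occludable-update",
    "article[data-urn]",
]

async def find_posts(page):
    """Collect post elements matched by any selector, without duplicates."""
    seen, posts = set(), []
    for selector in POST_SELECTORS:
        for handle in await page.query_selector_all(selector):
            # Prefer LinkedIn's post URN for identity; fall back to the
            # element handle itself when the attribute is absent.
            urn = await handle.get_attribute("data-urn") or id(handle)
            if urn not in seen:
                seen.add(urn)
                posts.append(handle)
    return posts
```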

2. Advanced Scrolling Strategy

  • Intelligent scrolling: Multiple techniques to trigger content loading
  • Time-based limits: Prevents infinite scrolling (30-minute default)
  • Alternative loading methods: Tries multiple approaches when regular scrolling fails
  • Progress tracking: Real-time updates on posts found and time remaining
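A condensed sketch of such a loop, assuming Playwright's async API (the selector, timings, and idle threshold are illustrative):

```python
import asyncio
import time

async def scroll_until_done(page, max_minutes=30, idle_rounds=5):
    """Scroll until no new posts load for `idle_rounds` passes or time is up."""
    deadline = time.monotonic() + max_minutes * 60
    last_count, idle = 0, 0
    while time.monotonic() < deadline and idle < idle_rounds:
        await page.mouse.wheel(0, 4000)   # primary: wheel scroll
        await page.keyboard.press("End")  # alternative loading trigger
        await asyncio.sleep(2)            # give lazy content time to load
        count = len(await page.query_selector_all("div.feed-shared-update-v2"))
        idle = idle + 1 if count == last_count else 0  # track stalled passes
        last_count = count
    return last_count
```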

3. Enhanced Text Cleaning

  • 50+ removal patterns: Removes UI elements, buttons, metadata
  • Smart deduplication: Prevents duplicate content using multiple methods
  • Better formatting: Preserves meaningful content while removing noise
  • Word/character counts: Provides content metrics
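The cleaning and deduplication steps can be sketched as follows (the patterns shown are a small illustrative subset, not the scraper's full 50+ list):

```python
import hashlib
import re

# Illustrative subset of UI-noise patterns; the scraper removes many more.
UI_PATTERNS = [
    r"^\s*Like\s*$", r"^\s*Comment\s*$", r"^\s*Repost\s*$",
    r"^\s*Share\s*$", r"^\s*Follow\s*$", r"\u2026see more",
]

def clean_post_text(raw):
    """Drop lines that match UI patterns and collapse excess blank lines."""
    kept = [line for line in raw.splitlines()
            if not any(re.search(p, line) for p in UI_PATTERNS)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()

def dedup_key(text):
    """Stable fingerprint of normalized content for duplicate detection."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()
```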

4. Robust Error Handling

  • Continues on errors: Individual post failures don't stop the entire process
  • Multiple fallbacks: Tries different extraction methods if one fails
  • Detailed logging: Comprehensive debugging information
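The continue-on-error pattern with fallback extraction methods can be sketched as (the extractor interface here is hypothetical):

```python
import logging

logger = logging.getLogger("scraper")

def extract_all(posts, extractors):
    """Run extractors per post; one failure never aborts the whole run."""
    results, errors = [], 0
    for i, post in enumerate(posts):
        for extract in extractors:  # fallbacks, in priority order
            try:
                results.append(extract(post))
                break  # first successful method wins
            except Exception as exc:
                logger.debug("post %d: %s failed: %s",
                             i, extract.__name__, exc)
        else:
            errors += 1  # every method failed for this post
    return results, errors
```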

Troubleshooting

If you encounter issues:

  1. Run with --headed to see what's happening in the browser
  2. Check if the URL is accessible and contains posts
  3. Verify authentication credentials if scraping private content
  4. LinkedIn may have updated their HTML structure - check for updates

Requirements

  • Python 3.7+
  • playwright
  • reportlab
  • Pillow
  • requests
  • asyncio-pool
  • typing-extensions

Legal Notice

This tool is for educational and research purposes. Ensure compliance with LinkedIn's Terms of Service and applicable laws when using this scraper.
