@chrishamant chrishamant commented Jun 5, 2025

I know this is just an unprompted feature and this is total AI slop, but this project was written well enough that it was super easy to whip out what I needed. My goal was to cheaply scrape some content from a site that doesn't offer an API, doesn't take kindly to bots/scraping, and requires logging in (not trying to violate the TOS). I figured I could just manually click through the site, let devtools collect all the requests, save the HAR, and process it later. I guess this is a slightly grey area? My intentions are pure... I figured there had to be a HAR library and I found this one, but it didn't quite do what I wanted (hargo dump serves a different purpose upon closer inspection). So I cloned this and asked my buddy Claude to help, and was able to one-shot this addition, which worked fine for what I needed... Sorry for the unprompted PR. I mean no offense if you hate the feature, the means by which it was authored, or the code itself. Thought I'd submit this in case you or someone else deemed it useful.

Thank you for your work!

  • Add new 'extract' command with alias 'e' to extract response content from HAR files
  • Support two organization modes: by domain (default) and by content type (--sort flag)
  • Content type organization groups files into directories: images/, json/, html/, css/, javascript/, fonts/, etc.
  • Smart filename generation with proper extensions based on MIME types
  • Handle filename collisions with incremental naming (image_001.jpg, posts_002.json)
  • Special handling for API responses (posts.json, api_response.json)
  • Generate CSV manifest file mapping original URLs to extracted file paths
  • Base64 decode response content when needed
  • Add comprehensive VS Code debug configurations for all commands
  • Update README with extract command documentation
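For readers skimming the list above, here is a minimal sketch of three of the listed behaviours: grouping by content type, incremental collision naming, and base64 decoding. It is not the code from this PR. The names `categoryDir`, `uniqueName`, and `decodeBody`, the `other` fallback directory, and the choice to leave the first occurrence of a filename unsuffixed are all assumptions made for illustration. The one detail taken from the HAR 1.2 format itself is that an encoded body is marked with `content.encoding` set to `base64`.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"mime"
	"strings"
)

// categoryDir maps a response MIME type to an output directory name.
// Hypothetical mapping; the "other" fallback is an assumption.
func categoryDir(mimeType string) string {
	mediaType, _, err := mime.ParseMediaType(mimeType)
	if err != nil {
		// Fall back to the raw value when the header is malformed or empty.
		mediaType = strings.ToLower(strings.TrimSpace(mimeType))
	}
	switch {
	case strings.HasPrefix(mediaType, "image/"):
		return "images"
	case strings.HasPrefix(mediaType, "font/"):
		return "fonts"
	case mediaType == "application/json":
		return "json"
	case mediaType == "text/html":
		return "html"
	case mediaType == "text/css":
		return "css"
	case mediaType == "application/javascript", mediaType == "text/javascript":
		return "javascript"
	default:
		return "other"
	}
}

// uniqueName returns base+ext if it is unused, otherwise base_001+ext,
// base_002+ext, and so on. Assumption: the first occurrence has no suffix.
func uniqueName(used map[string]bool, base, ext string) string {
	name := base + ext
	for i := 1; used[name]; i++ {
		name = fmt.Sprintf("%s_%03d%s", base, i, ext)
	}
	used[name] = true
	return name
}

// decodeBody returns the raw bytes of a HAR response body, decoding it
// only when the entry's content.encoding field says "base64".
func decodeBody(text, encoding string) ([]byte, error) {
	if strings.EqualFold(encoding, "base64") {
		return base64.StdEncoding.DecodeString(text)
	}
	return []byte(text), nil
}

func main() {
	used := map[string]bool{}
	body, err := decodeBody("aGVsbG8=", "base64")
	if err != nil {
		panic(err)
	}
	dir := categoryDir("image/jpeg")
	fmt.Println(dir+"/"+uniqueName(used, "image", ".jpg"), len(body))
	fmt.Println(dir+"/"+uniqueName(used, "image", ".jpg"), len(body))
}
```

Tracking the set of names already handed out, rather than a per-name counter, keeps a generated name from clashing with a file that already carries a numeric suffix.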

Owner

@mrichman mrichman left a comment

Add a suite of unit tests and useful comments and I'll consider merging your AI slop.

- Add new 'extract' command with alias 'e' to extract response content from HAR files
- Support two organization modes: by domain (default) and by content type (--sort flag)
- Content type organization groups files into directories: images/, json/, html/, css/, javascript/, fonts/, etc.
- Smart filename generation with proper extensions based on MIME types
- Handle filename collisions with incremental naming (image_001.jpg, posts_002.json)
- Special handling for API responses (posts.json, api_response.json)
- Generate CSV manifest file mapping original URLs to extracted file paths
- Base64 decode response content when needed
- Add comprehensive VS Code debug configurations for all commands
- Update README with extract command documentation
- Add initial test harness/structure with simple stab at CI
@chrishamant
Author

I spent a few $$ and added some more comments (their usefulness is for sure a matter of taste and questionable), and fixed the Makefile to add coverage and run the newly added tests for the extract functionality... I also sent a YOLO-mode addition that uses a GitHub Actions workflow to run the tests in CI. I'm more familiar with GitLab, so I can't judge the quality of this approach. https://github.com/chrishamant/hargo/actions/runs/15482304554 (I'd have used your Dockerfile and run the integration stuff you have set up if I had my druthers, but I'm somewhat limited on time atm).

Since I opened the request from my fork's master branch, I didn't see a way to update the PR to source from the work branch I made for the follow-up request, so I instead just amended my previous commit and force-pushed, FYI. (Thinking about it now, you could have squashed when you merged instead of accepting two commits, but 🤷‍♀️)

@chrishamant chrishamant requested a review from mrichman June 6, 2025 03:55