Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bergamot #1064

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Add bergamot #1064

wants to merge 7 commits into from

Conversation

mlduff
Copy link
Contributor

@mlduff mlduff commented Apr 16, 2024

This scraper adds support for dashboard.bergamot.app. Would of liked a few more test URLs, but used the following in testing.

Had to implement a legacy test - not sure if I used the right method to specify the origin URL (which is different to the actual request URL). I ended out having a yield for the first URL (that the user would pass in), with the associated testhtml not actually being used.

Resolves #986

@jayaddison
Copy link
Collaborator

jayaddison commented Apr 17, 2024

Had to implement a legacy test - not sure if I used the right method to specify the origin URL (which is different to the actual request URL). I ended out having a yield for the first URL (that the user would pass in), with the associated testhtml not actually being used.

That seems OK to me @mlduff - a web browser would load the testhtml HTML page and then load any related JavaScript, and subsequently should make the JSON request you found previously. In addition, people using this library would likely expect to pass HTML from a webpage in -- meaning that they've made a HTTP GET request to request the relevant dashboard page, and then the scraper will make the subsequent HTTP GET. So the test describes the expected usage and behaviour, which is what we want.

@jayaddison
Copy link
Collaborator

This generally looks good to me, thank you @mlduff - a few small suggestions/questions and then I'll re-review.

Comment on lines 30 to 31
def author(self):
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to find some basic authorship information when we can, even if it is not perfect/ideal.

When a source_url is found, I think we could return the domain name.. not ideal, but better than nothing. I've asked in the bugreport (#986) whether we know of any example recipes that contain more author info.

@mlduff
Copy link
Contributor Author

mlduff commented Apr 17, 2024

I had a look through the JS on the site and I have added cook_time as I believe it should be implemented based on:

{className:"recipe__time_title"},n("Prep")),a.a.createElement("div",{className:"recipe__time_content"},Ce(r.time.prepTime))),!!r.time.cookTime&&a.a.createElement("div",{className:"recipe__time column"},a.a.createElement("div",{className:"recipe__time_title"},n("Cook")),a.a.createElement("div",{className:"recipe__time_content"},Ce(r.time.cookTime))),!!r.time.totalTime&&a.a.createElement("div",{className:"recipe__time column"},a.a.createElement("div",

Couldn't find author using the same method @jayaddison

@jayaddison
Copy link
Collaborator

Ok, thanks @mlduff. I think retaining the authorship info is fairly important, so I'd like to pause until we get some more ideas / suggestions.

@mlduff
Copy link
Contributor Author

mlduff commented Apr 19, 2024

Ok, thanks @mlduff. I think retaining the authorship info is fairly important, so I'd like to pause until we get some more ideas / suggestions.

@jayaddison would it be completely off the cards to do an extra request to the origin website? That's the only way I can see getting any extra information as it seems bergamot does not support author at all.

If you are satisfied that bergamot doesn't support author whatsoever, would you merge the PR? Or does this mean we have to decline it?

@jayaddison
Copy link
Collaborator

@mlduff hmm, I'm not sure that I'd be entirely happy with making an additional request either; it would seem kinda unexpected for some code that appears to scrape from HTML at a URL to make requests to other domains -- and fairly arbitrarily, based on whatever the response of Bergamot included. I don't expect Bergamot would intentionally link to non-related sites for any reason, but that kind of thing can happen and can cause security/privacy vulnerabilities in web browsers; so I'd prefer to maintain simplicity and avoid that.

@mlduff
Copy link
Contributor Author

mlduff commented Apr 20, 2024

@jayaddison yeah that makes sense. Is there anything I can do to help get this one in, or do we just have to wait for bergamot to reply and see if they will add author? If they say no, can we proceed with just the source domain?

@mlduff
Copy link
Contributor Author

mlduff commented Apr 25, 2024

@jayaddison Sorry, just bumping on the above - I have implemented the source domain for the author field - is it okay to proceed with this PR?

@jayaddison
Copy link
Collaborator

Thanks @mlduff - no, I don't want to proceed with this given that it isn't providing attribution for an author name that we know exists on an origin recipe. Source domain could arguably provide some of that, but I'd prefer to wait until we can retrieve the author credit from the page we're scraping.

@mlduff
Copy link
Contributor Author

mlduff commented May 4, 2024

Should I closethis PR for now @jayaddison? Or leave it up?

@jayaddison
Copy link
Collaborator

Should I closethis PR for now @jayaddison? Or leave it up?

I've been on the other side of this situation a few times now -- providing a contribution to a project that it is unclear will ever be accepted, sometimes for a reason that they've communicated, or sometimes simply because maintainers are unavailable -- and my usual preference is to keep the pull request open, because it's:

  • More likely that other people will notice and not repeat the work.
  • More likely that other people could base their own contributions on my work so far.
  • More likely that the maintainers may reappear, provide feedback, and/or figure out a way to get the changes into an acceptable state, or reject them.
  • Less likely for me to forget and re-implement it myself.

However: it can be frustrating sometimes if your PR doesn't get merged for a long time. Usually I ping the maintainers once every so often (say once per 3 months or so) if I think the PR is still valid -- but sometimes I just close them to reduce my mental overhead if it's been a long time with no progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scraper request: bergamot.app
2 participants