@ongzexuan ongzexuan commented Jun 10, 2020

_extract_post_id uses the class _5pcq to get the post ID. This sometimes fails when several elements carry that class, some of which have an effectively empty href attribute. This PR proposes a fix: check that the href attribute contains a URL (beginning with '/') before extracting.

Sample output from printing item.find_all(class_="_5pcq") is shown below. The last item in the list has '#' in its href attribute, which causes the resulting post_id to be '#'.

To replicate:
python scraper.py -p TheStraitsTimes -l 1

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114673327115?__xts__%5B0%5D=68.ARAQBh0K_NMj_mQAANUH_3XvHEDd3zLc83FLEcu4VcfDAdkM6z1PAP4Izat-cL4tQmNTMr_W875cfYO3vqYneCqXcjuRt9Q1tiYK64NKoaEUtHoyIyAjcZi6jHtUrCB60YZfPvwidqL6Aw6Vm7yIdE7amIjP-yTjI25iMi-EH7xYHzCLxG1U83eUuG-L4xX73BaqcA8MtjD6aeI-EFfelvwRVHDV5GlwwgN2cGDrcv5_--KTGPV8mNO9UFtcj4BdxBG45bb4QZrpTE-PxmdnjHAIjbauy89o3zXPRG8t5LsfThBfy5UYs0M3PcVsiJi8UJswS-_QJDDTwFMnozEp&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591800911" title="Wednesday, June 10, 2020 at 7:55 AM"><span class="timestampContent" id="js_6">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114743227115?__xts__%5B0%5D=68.ARCAUPaCRHFNlhpvP2W3jDKjTebqzmTZplSSOjw7Q6sLY5VjEDPitgFQ1kYPbbGkhEiMNdN4ZLR2BjaCFdWQe5V3pDbTZ73LXDRBsFjuGX_WX2BFnx0r1xDjP2OXiYNx9B1YJOEmnVbPJg6M817WmCRTUmSjsCECgHKDAaLin8z7bP3s0XjTqaEXxtmINF7Beqwi4lqMhx8D8HQG5rgZqFzCjMOpo8s_glZV36SHwX1z2fFLpF4iudosAK-005XvhBBIfs66n5UZe9AsQmvd0QsMbjfVQIN_JqGY4-mn8VjW8XjZRzBKFEUCir2efcX5bAitc0MVnQ2Fdn0Pdzyj&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591803018" title="Wednesday, June 10, 2020 at 8:30 AM"><span class="timestampContent" id="js_9">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>
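
The check described above could look roughly like the following. This is a minimal sketch, not the actual diff: the helper names `first_post_href` and `post_id_from_href` are hypothetical, and taking the post ID from the last path segment before the query string is an assumption based on the sample hrefs. Plain dicts stand in for the BeautifulSoup tags, since both support `.get("href")`, and the sample href is shortened from the output above.

```python
from typing import Iterable, Optional


def first_post_href(links: Iterable) -> Optional[str]:
    """Return the first href that begins with '/', or None if there is none.

    `links` is any iterable of tag-like objects exposing .get("href"),
    e.g. the result of item.find_all(class_="_5pcq").
    """
    for link in links:
        href = link.get("href") or ""
        # Skip placeholder hrefs such as '#' on the privacy indicator.
        if href.startswith("/"):
            return href
    return None


def post_id_from_href(href: str) -> str:
    # Assumption: the post ID is the last path segment before the query string.
    path = href.split("?", 1)[0]
    return path.rstrip("/").rsplit("/", 1)[-1]


# Stand-ins for the tags printed above (href shortened for readability).
links = [
    {"class": "_5pcq", "href": "#"},
    {"class": "_5pcq", "href": "/TheStraitsTimes/posts/10157114673327115?__tn__=-R"},
]

href = first_post_href(links)
print(post_id_from_href(href) if href else None)
```

Returning None when no element has a usable href leaves it to the caller to decide how to handle a post without a permalink, rather than silently storing '#'.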
