Commit 62e2064

Check language of VOT document
If an English translation isn’t yet available, requesting the English translation will return the French original. This checks whether the returned document actually is the English version. If not, it raises a `NoWorkingUrlError` exception, which will be treated in the same way as if the EP website had returned a 404 (i.e. the scraper will be retried later). Fixes #1167
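The check described above can be sketched in isolation. This is a minimal illustration, not the scraper's actual code: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, `NoWorkingUrlError` is a local stand-in for the project's exception, and `check_language` and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET


class NoWorkingUrlError(Exception):
    """Local stand-in for the scraper's exception of the same name."""


def check_language(xml_text: str, expected: str = "en") -> None:
    """Raise NoWorkingUrlError if the document is not in the expected language."""
    root = ET.fromstring(xml_text)

    if root.tag != "file":
        raise ValueError("Missing root element `file` in VOT list")

    # Compare case-insensitively, as the commit does with `.lower()`.
    language = root.get("language", "").lower()

    if language != expected:
        raise NoWorkingUrlError(
            f"Requested {expected!r} version of document, but received language {language!r}."
        )


# Hypothetical sample documents; the real VOT files contain much more data.
check_language('<file language="EN"><votes/></file>')  # passes silently

try:
    check_language('<file language="FR"><votes/></file>')
except NoWorkingUrlError as exc:
    print(f"Would retry later: {exc}")
```

Raising a dedicated exception keeps the "translation not published yet" case distinct from genuine parsing failures, so callers can decide to retry instead of giving up.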
1 parent b5a2de3 commit 62e2064

3 files changed: +37 -2 lines changed

backend/howtheyvote/scrapers/votes.py

Lines changed: 18 additions & 1 deletion
@@ -19,7 +19,7 @@
     VotePosition,
     VoteResult,
 )
-from .common import BeautifulSoupScraper, RequestCache, ScrapingError
+from .common import BeautifulSoupScraper, NoWorkingUrlError, RequestCache, ScrapingError
 from .helpers import (
     fill_missing_by_reference,
     normalize_name,
@@ -363,6 +363,23 @@ def _url(self) -> str:
         return f"{self.BASE_URL}/PV-{self.term}-{date}-VOT_EN.xml"
 
     def _extract_data(self, doc: BeautifulSoup) -> Iterator[Fragment | None]:
+        root = doc.select_one("file")
+
+        if not root:
+            raise ScrapingError("Missing root element `file` in VOT list")
+
+        # https://github.com/python/typeshed/issues/8755
+        language = cast(str, root["language"]).lower()
+
+        if language != "en":
+            # If an English translation isn’t yet available, requesting the English translation
+            # will return the French original. In case a French document is returned, we raise
+            # `NoWorkingUrlError`. Pipelines catching this exception will usually be re-run
+            # later (rather than being marked as permanently failed).
+            raise NoWorkingUrlError(
+                f"Requested English version of document, but received language {language}."
+            )
+
         for vote_tag in doc.select("votes vote"):
             # The source data often contains sections with additional information (such as
             # corrections). These are also modeled as "votes" (even though there was no
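
The code comment in the hunk above says that pipelines catching `NoWorkingUrlError` are usually re-run later instead of being marked as permanently failed. A rough sketch of that distinction follows. Everything here is hypothetical: `Status`, `run_once`, and the exception classes are stand-ins written for this example, not the project's pipeline code.

```python
from enum import Enum
from typing import Callable


class ScrapingError(Exception):
    """Stand-in for a permanent scraping failure."""


class NoWorkingUrlError(Exception):
    """Stand-in for 'document not available (yet)'."""


class Status(Enum):
    SUCCESS = "success"
    RETRY_LATER = "retry_later"
    FAILED = "failed"


def run_once(scrape: Callable[[], None]) -> Status:
    """Run a scraper callable once and classify the outcome."""
    try:
        scrape()
    except NoWorkingUrlError:
        # Same handling as a 404: the document may appear later.
        return Status.RETRY_LATER
    except ScrapingError:
        # The document exists but could not be parsed as expected.
        return Status.FAILED
    return Status.SUCCESS


def french_only() -> None:
    raise NoWorkingUrlError("received language fr")


print(run_once(french_only))
```

A scheduler could then re-queue runs that end in `Status.RETRY_LATER` and alert only on `Status.FAILED`.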
