Commit bdf7738

Check language of VOT document
If an English translation isn’t yet available, requesting the English translation will return the French original. This checks whether the returned document actually is the English version. If not, it raises a `NoWorkingUrlError` exception, which will be treated in the same way as if the EP website had returned a 404 (i.e. the scraper will be retried later).

Fixes #1167
1 parent b5a2de3 commit bdf7738

3 files changed: +31 -2 lines changed
backend/howtheyvote/scrapers/votes.py

Lines changed: 12 additions & 1 deletion
@@ -19,7 +19,7 @@
     VotePosition,
     VoteResult,
 )
-from .common import BeautifulSoupScraper, RequestCache, ScrapingError
+from .common import BeautifulSoupScraper, NoWorkingUrlError, RequestCache, ScrapingError
 from .helpers import (
     fill_missing_by_reference,
     normalize_name,
@@ -363,6 +363,17 @@ def _url(self) -> str:
         return f"{self.BASE_URL}/PV-{self.term}-{date}-VOT_EN.xml"

     def _extract_data(self, doc: BeautifulSoup) -> Iterator[Fragment | None]:
+        language = doc.select_one("file")["language"].lower()
+
+        if language != "en":
+            # If an English translation isn’t yet available, requesting the English translation
+            # will return the French original. In case a French document is returned, we raise
+            # `NoWorkingUrlError`. Pipelines catching this exception will usually be re-run
+            # later (rather than being marked as permanently failed).
+            raise NoWorkingUrlError(
+                f"Requested English version of document, but received language {language}."
+            )
+
         for vote_tag in doc.select("votes vote"):
             # The source data often contains sections with additional information (such as
             # corrections). These are also modeled as "votes" (even though there was no
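
The language check in this hunk can be tried in isolation with a minimal sketch like the one below. It is illustrative only and makes several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of the scraper's BeautifulSoup document, the `NoWorkingUrlError` class is a local stand-in for the one imported from `.common`, `check_language` and `run_scraper` are hypothetical helpers, and the two sample documents are made up (real VOT files contain far more structure). The error message is formatted with an f-string so the received language code appears in it.

```python
import xml.etree.ElementTree as ET


class NoWorkingUrlError(Exception):
    """Local stand-in for the exception imported from `.common` (assumption)."""


def check_language(xml_text: str, expected: str = "en") -> str:
    """Return the document language, or raise if it is not the expected one."""
    root = ET.fromstring(xml_text)
    # `iter` also yields the root element itself, so this works whether or
    # not <file> is the outermost element (the real layout is an assumption).
    file_tag = next(root.iter("file"), None)
    if file_tag is None or "language" not in file_tag.attrib:
        # Design choice of this sketch: a missing attribute is also treated
        # as "no usable document" rather than as a hard failure.
        raise NoWorkingUrlError("Document has no <file language=...> element.")
    language = file_tag.attrib["language"].lower()
    if language != expected:
        raise NoWorkingUrlError(
            f"Requested {expected!r} version of document, but received language {language!r}."
        )
    return language


def run_scraper(xml_text: str) -> str:
    """Hypothetical pipeline step: map the exception to a 'retry later' outcome."""
    try:
        check_language(xml_text)
    except NoWorkingUrlError:
        return "retry_later"
    return "ok"


# Made-up sample documents, not real EP data.
ENGLISH_DOC = '<file language="EN"><votes/></file>'
FRENCH_DOC = '<file language="FR"><votes/></file>'

print(run_scraper(ENGLISH_DOC))  # ok
print(run_scraper(FRENCH_DOC))   # retry_later
```

Raising the same exception that a 404 would produce means a pipeline like the hypothetical `run_scraper` needs no separate code path for a document that is not yet translated.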
