The purpose of discours_scraper.py is to collect the recent speeches delivered by the French government into a CSV file. The speeches are taken from the following link: speech
The resulting CSV file contains the following columns:
titre,date,discours
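As a rough illustration of that layout, here is a minimal sketch using only the standard csv module. The column names come from the list above; the helper names (write_speeches, read_speeches) and the sample row, including its date format, are invented for the example and are not taken from discours_scraper.py.

```python
import csv
import io

# Column names as listed in this README.
COLUMNS = ["titre", "date", "discours"]


def write_speeches(rows, handle):
    """Write speech records (dicts keyed by COLUMNS) as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_speeches(handle):
    """Read speech records back as a list of dicts."""
    return list(csv.DictReader(handle))


# Hypothetical sample row: a speech body usually contains commas and line
# breaks, which the csv module quotes automatically.
sample = [
    {
        "titre": "Exemple de titre",
        "date": "08/06/2023",
        "discours": "Premier paragraphe.\nSecond paragraphe, avec une virgule.",
    }
]

# newline="" disables newline translation, as the csv module recommends.
buffer = io.StringIO(newline="")
write_speeches(sample, buffer)
buffer.seek(0)
loaded = read_speeches(buffer)
```

With a real file, the same helpers can be used with open("speeches.csv", "w", newline="", encoding="utf-8"), where the file name is again only a placeholder.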
You can choose the range of pages to scrape by setting the class variables pages_begin and pages_end. This is quite practical when your internet connection drops during scraping: you can restart from the page where it stopped.
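The idea of restarting from a given page can be sketched as follows. Only the names pages_begin and pages_end come from the script; the class name, the page_numbers method, the default values, and the choice to treat pages_end as inclusive are assumptions made for this example.

```python
class DiscoursScraperSketch:
    """Hypothetical stand-in for the scraper class; not the real implementation."""

    # Class variables controlling which result pages are visited.
    pages_begin = 1
    pages_end = 5  # assumed inclusive in this sketch

    def page_numbers(self):
        """Return the page numbers that a run would visit, in order."""
        return list(range(self.pages_begin, self.pages_end + 1))


# A first run covers the whole range.
first_run = DiscoursScraperSketch().page_numbers()

# If the connection drops while page 4 is being scraped, move pages_begin
# to 4 and run again instead of starting over from page 1.
DiscoursScraperSketch.pages_begin = 4
resumed = DiscoursScraperSketch().page_numbers()
```

Because the values are class variables, changing them once affects every instance created afterwards.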
08/06/2023: I noticed that some of the speeches contain <br> and not <p/>. I therefore take the whole inner HTML content and remove the tags afterwards.
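One possible way to remove the tags from an inner HTML string is sketched below with the standard library's html.parser. This is not necessarily how discours_scraper.py does it; the names TextExtractor and strip_tags, and the decision to turn <br> and paragraph boundaries into line breaks, are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, turning <br> and <p> boundaries into newlines."""

    BREAK_TAGS = {"br", "p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing forms such as <br/>.
        if tag in self.BREAK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(inner_html):
    """Return the text of an HTML fragment, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


with_br = strip_tags("Bonjour.<br>Merci <b>beaucoup</b>.<br/>Au revoir.")
with_p = strip_tags("<p>Un &amp; deux</p><p>Trois</p>")
```

Both markup styles end up as plain text with one line per paragraph, which is why working from the inner HTML covers speeches written with either tag.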
- Selenium 4
- ChromeDriverManager, to avoid having to manage incompatibilities between the current version of Chrome (114 as of 8 June 2023) and the driver version.
- You may complicate your life if you run the notebook in Colab. I have only tested it with JupyterLab :-)
- Like all scrapers, the script depends on the structure of the website. This structure may change, and the CSS selectors may then need to be updated. Adjustments may therefore be necessary to collect the data.