Repository for http://griesheim-transparent.de - a search engine for documents of the city parliament of Griesheim (Hesse, district Darmstadt-Dieburg, Germany).
- scraper: Scrapy-based web scraper for the "Ratsinformationssystem" (council information system), a.k.a. SessionNet: https://sessionnet.owl-it.de/griesheim/bi/info.asp
- frontend: Django frontend and management jobs for analysis
- solr: Solr Search Platform configuration
The full service is built on several microservices that are required at indexing time and for data storage (see the docker compose file for details and the sketch after the following list):
- A PostgreSQL database to store scraped metadata.
- Apache Tika for PDF metadata extraction, text extraction and OCR.
- fpurchess/preview-service for thumbnail generation.
- Gotenberg to convert various file formats to PDF.
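For orientation, the following sketch shows how an indexing job can talk to two of these services over HTTP. It is not the project's actual code; the host names and ports (`tika:9998`, `gotenberg:3000`) are assumptions based on common compose setups, and the conversion route assumes Gotenberg 7+.

```python
# Minimal sketch, not the repository's actual code. Host names/ports are
# assumed docker compose service names.
import requests

TIKA_URL = "http://tika:9998"            # assumption: default Tika server port
GOTENBERG_URL = "http://gotenberg:3000"  # assumption: default Gotenberg port


def extract_text(pdf_bytes: bytes) -> str:
    """Extract plain text from a PDF via Tika server's PUT /tika endpoint."""
    resp = requests.put(
        f"{TIKA_URL}/tika",
        data=pdf_bytes,
        headers={"Accept": "text/plain"},
    )
    resp.raise_for_status()
    return resp.text


def convert_to_pdf(path: str) -> bytes:
    """Convert an office document to PDF via Gotenberg's LibreOffice route."""
    with open(path, "rb") as fh:
        resp = requests.post(
            f"{GOTENBERG_URL}/forms/libreoffice/convert",
            files={"files": (path, fh)},
        )
    resp.raise_for_status()
    return resp.content
```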
- The scraper Docker image runs a cron job that scrapes SessionNet regularly and stores the scraped metadata in PostgreSQL and the binary files in the datastore.
- The frontend management task also runs as a cron job and periodically syncs the scraped data into the Solr index for searching (see the indexing sketch below the list). This includes:
  - Combining metadata from scraped meetings, meeting agendas, consultations, etc.
  - Converting non-PDF documents to PDF
  - Extracting document metadata and content from PDFs with pdfact, Tika and/or Tika + Tesseract (OCR)
  - Generating preview images with the preview-service
- The frontend Django app makes the data available to the user by querying the Solr search platform (query sketch below).
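The indexing step could look roughly like the following Django management command. This is only a sketch: the `Document` model, its field names, the Solr core name and the command name are illustrative assumptions, not the repository's actual schema.

```python
# Sketch of a periodic "sync to Solr" management command. The Document model,
# its fields and the Solr core name are assumptions for illustration only.
import pysolr
from django.core.management.base import BaseCommand

from documents.models import Document  # assumed app/model name

SOLR_URL = "http://solr:8983/solr/documents"  # assumed Solr core


class Command(BaseCommand):
    help = "Sync scraped documents into the Solr index"

    def handle(self, *args, **options):
        solr = pysolr.Solr(SOLR_URL, always_commit=True)
        docs = [
            {
                "id": doc.id,
                "title": doc.title,
                "meeting": doc.meeting_name,      # combined meeting/agenda metadata
                "content": doc.extracted_text,    # text from pdfact/Tika/OCR
                "preview": doc.preview_url,       # thumbnail from preview-service
            }
            for doc in Document.objects.all()
        ]
        solr.add(docs)
        self.stdout.write(f"Indexed {len(docs)} documents")
```

Cron would then invoke something like `python manage.py sync_solr` (the command name here is made up).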
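On the read path, a frontend view could query Solr roughly as follows; again only a sketch, with field names, highlighting parameters and template path chosen for illustration.

```python
# Sketch of a Django search view backed by pysolr; parameter and template
# names are assumptions, not the project's actual configuration.
import pysolr
from django.shortcuts import render

solr = pysolr.Solr("http://solr:8983/solr/documents")  # assumed Solr core


def search(request):
    query = request.GET.get("q", "*:*")
    results = solr.search(query, rows=20, hl="true", **{"hl.fl": "content"})
    return render(request, "search/results.html",
                  {"query": query, "results": results})
```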
License: MIT