The purpose of this API is to expose all the data that was available to scrape from https://www.urparts.com/index.cfm/page/catalogue/.
We're using uv to manage our dependencies.
There is a Makefile with a couple of handy recipes. Let's go through them:
- `make format` reformats the whole codebase using `ruff`
- `make check` does static code analysis using `ruff` & `mypy`
- `make test` runs the test suite for the whole application
- `make build` builds a Docker image for the application
- `make migrate` applies `alembic` migrations to the database
- `make scrape` runs the data scraping process
- `make up` runs the API
- `make down` takes down all Docker containers
We're using aiohttp to request all the pages in the parts catalogue,
and BeautifulSoup4 to parse the data of interest.
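
As a rough illustration of that fetch-and-parse step (the URL is real, but the CSS selector and function names below are assumptions for the sketch, not the scraper's actual code):

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

CATALOGUE_URL = "https://www.urparts.com/index.cfm/page/catalogue"


async def fetch_manufacturers(session: aiohttp.ClientSession) -> list[str]:
    # Fetch the catalogue landing page and pull out the manufacturer names.
    async with session.get(CATALOGUE_URL) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    # The selector is a placeholder; the real markup must be inspected.
    return [link.get_text(strip=True) for link in soup.select("div.allmakes a")]


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        print(await fetch_manufacturers(session))


if __name__ == "__main__":
    asyncio.run(main())
```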
Everything runs in an asynchronous context, although not every step is parallelized:
manufacturers are processed sequentially (one by one), while each manufacturer's
categories, models & parts are parsed in parallel.
A semaphore limits the number of concurrent requests so we don't abuse the website too much (without the limit, requests started raising timeouts).
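
A minimal sketch of that concurrency model, with a semaphore bounding the in-flight requests; the limit, selectors, and helper names here are illustrative assumptions:

```python
import asyncio
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup


async def fetch(
    session: aiohttp.ClientSession, semaphore: asyncio.Semaphore, url: str
) -> str:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()


async def scrape_manufacturer(
    session: aiohttp.ClientSession, semaphore: asyncio.Semaphore, url: str
) -> list[str]:
    # Fetch one manufacturer page, then fan out over its category links
    # concurrently; the selector is a placeholder for the real markup.
    soup = BeautifulSoup(await fetch(session, semaphore, url), "html.parser")
    category_urls = [urljoin(url, a["href"]) for a in soup.select("a[href]")]
    return await asyncio.gather(
        *(fetch(session, semaphore, u) for u in category_urls)
    )


async def scrape(manufacturer_urls: list[str]) -> None:
    # A limit of 20 concurrent requests is illustrative, not the tuned value.
    semaphore = asyncio.Semaphore(20)
    async with aiohttp.ClientSession() as session:
        for url in manufacturer_urls:  # manufacturers go one by one
            await scrape_manufacturer(session, semaphore, url)
```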
A full scrape takes between 6 and 7 minutes for ~4.4M parts (and far fewer manufacturers, categories & models) on a 2021 M1 Pro.
The task is now CPU-bound rather than IO-bound: because the requests are parallelized, waiting on the network is no longer the bottleneck, so expect heavy CPU usage.
After starting the application, the OpenAPI specification can be found at http://localhost:8080/docs.
The overall structure of the repository & API is based on domains; each domain has its own category & directory.
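
For example, a layout along these lines (the directory names here are illustrative, not the actual tree):

```
app/
├── manufacturers/
├── categories/
├── models/
└── parts/
```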
To keep the solution simple, no service/controller layer was introduced.
For demonstration purposes, there's just one test.
It won't pass if there are already any categories in the database (there's no schema separation).