Flickr is an easily accessible source of user-generated pictures. Each picture comes with textual data, e.g. a title, a description, and comments. Furthermore, using the transformers package with a Hugging Face model, it is possible to generate a short caption for a picture.
Apart from that, districts in Vienna also have associated textual data, for example their Wikipedia entries or descriptions of points of interest (POIs) in each district.
Using spaCy and gensim's Doc2Vec, we use all of these textual clues to build an embedding space. Each picture is then assigned the district whose embedding is most similar to the embedding of the picture's texts. We verify the experiment using the geographic coordinates associated with the pictures.
A more complex approach, which first uses a random subset of the geotagged pictures as training data, is explored at the end.
The Viennese district boundaries and the POI information are from data.gv.at.
Pictures are from Flickr, queried via API.
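A hedged sketch of how such a query might be constructed against Flickr's REST API (`flickr.photos.search` and the parameters shown are part of the public API; the API key and the Vienna bounding box below are placeholders):

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

def build_search_url(api_key, bbox, per_page=250):
    """Build a flickr.photos.search request URL restricted to a
    geographic bounding box (min_lon, min_lat, max_lon, max_lat)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "bbox": ",".join(str(c) for c in bbox),
        "has_geo": 1,                      # only geotagged pictures
        "extras": "geo,description,tags",  # include coordinates and texts
        "format": "json",
        "nojsoncallback": 1,
        "per_page": per_page,
    }
    return FLICKR_REST + "?" + urlencode(params)

# Rough bounding box around Vienna (placeholder coordinates).
url = build_search_url("YOUR_API_KEY", (16.18, 48.12, 16.58, 48.32))
print(url)
```

The JSON response can then be paged through and each photo's title, description, and coordinates collected for the experiment.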
The model used to generate image captions is from Hugging Face:
- The concrete model is Salesforce/blip-image-captioning-large