A CNN-LSTM-based image captioning model trained on the Flickr30k dataset. To learn more about how this project works, check out the documentation.
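The CNN-LSTM pairing can be sketched as follows. This is a minimal illustrative model, not the repo's actual architecture: the class name, layer sizes, and the tiny stand-in CNN (a real setup would use a pretrained backbone) are all assumptions.

```python
# Illustrative CNN-LSTM captioner: a CNN encodes the image into one
# feature vector, which seeds an LSTM that decodes the caption tokens.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):  # hypothetical name, not the repo's class
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: tiny stand-in for a pretrained backbone
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (B, 64, 1, 1)
            nn.Flatten(),              # -> (B, 64)
            nn.Linear(64, embed_dim),  # project to the embedding size
        )
        # LSTM decoder conditioned on the image feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        embeds = self.embed(captions)               # (B, T, E)
        # feed the image feature as the first "token" of the sequence
        inputs = torch.cat([feats, embeds], dim=1)  # (B, T+1, E)
        out, _ = self.lstm(inputs)
        return self.fc(out)                         # (B, T+1, vocab_size)
```

At train time the logits are compared against the shifted caption with cross-entropy; at inference the decoder is run one token at a time.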
- clone the repository: `git clone https://github.com/sorohere/capage.git`
- download weights and dataset: run the `run.sh` script, or download manually from flickr30k
- install dependencies: `pip install -r requirement.txt`
The dataset used for this project is Flickr30k, which consists of:
- around 31,000 unique images.
- 5 captions per image, resulting in approximately 155,000 image-caption pairs.
i. images folder: contains all the images used for training and evaluation.
ii. captions file: `captions.txt`, a text file mapping each image to its corresponding caption.
   - each line follows the format: `image_name, caption`
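Loading this file into an image-to-captions mapping can be sketched as below. The function name and the split-on-first-comma choice are assumptions; the repo's actual loader may differ.

```python
# Sketch: parse a captions file where each line is "image_name, caption"
# into a dict mapping each image to its list of captions.
from collections import defaultdict

def load_captions(path):
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split only on the first comma: captions may contain commas
            image_name, caption = line.split(",", 1)
            captions[image_name.strip()].append(caption.strip())
    return dict(captions)
```

Splitting only on the first comma matters because captions are free text and frequently contain commas themselves.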
To start training the model: `python scripts/train.py`. The vocabulary and trained model will be saved in `scripts/checkpoints/`.
Ensure you have a trained model and vocabulary saved; if not, train the model yourself (check out the scripts). To generate captions for images: `python inference.py`
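Caption generation at inference time typically works as a greedy decoding loop: feed the most likely token back into the decoder until an end token appears. A hedged sketch, assuming a hypothetical per-token `step` callable rather than the repo's actual `inference.py` interface:

```python
# Sketch of greedy decoding: `step(token_id, hidden)` is a hypothetical
# function returning (logits over the vocabulary, new hidden state).
import torch

@torch.no_grad()
def greedy_caption(step, start_id, end_id, max_len=20):
    tokens = [start_id]
    hidden = None
    for _ in range(max_len):
        logits, hidden = step(tokens[-1], hidden)
        next_id = int(logits.argmax())  # pick the most likely token
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the <start> token
```

Beam search is a common drop-in upgrade over this greedy loop when caption quality matters more than speed.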