We are tired of scrolling through Netflix aimlessly for hours on end hoping to come across a movie that interests us. While several movie recommendation systems exist, they are largely based on previously collected data and are not equipped to process real-time parameters like current mood. We set out to create a comprehensive movie recommender that engages in an interactive conversation with the user to output the optimal movie suggestion.
The project began with a dataset of IMDB’s top 1,000 movies from 1920–2019 that lists the duration, genre, cast, and other details for each entry (https://www.kaggle.com/datasets/omarhanyy/imdb-top-1000?resource=download). We combined it with a second dataset that labels over 14,000 movies with tags drawn from a list of 69 words and phrases, such as “feel good”, “gut-wrenching”, and “psychedelic” (https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags). A four-feature vector was then created for each of the 1,000 movies. The first feature is a value from 0–9 that represents the decade in which the movie was produced; the second is a value from 0–2 that represents whether the movie is short (<60 min), medium (60–120 min), or long (>120 min); the third is a binary (multi-hot) encoding of twenty 0s and 1s, one per genre, marking every genre the movie belongs to; and the fourth is a binary (multi-hot) encoding of sixty-nine 0s and 1s, one per tag.
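The encoding described above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the function names, the shortened `GENRES` and `TAGS` lists, and the assumption that the 1920s map to decade index 0 are all hypothetical.

```python
from typing import Iterable, List, Sequence

# Hypothetical, shortened vocabularies for illustration only;
# the project used 20 genres and 69 tags.
GENRES = ["Action", "Comedy", "Drama", "Horror"]
TAGS = ["feel good", "gut-wrenching", "psychedelic"]


def decade_index(year: int) -> int:
    """Map a release year in 1920-2019 to 0-9 (assumes 1920s -> 0)."""
    if not 1920 <= year <= 2019:
        raise ValueError(f"year {year} is outside 1920-2019")
    return (year - 1920) // 10


def duration_bucket(minutes: int) -> int:
    """0 = short (<60 min), 1 = medium (60-120 min), 2 = long (>120 min)."""
    if minutes < 60:
        return 0
    if minutes <= 120:
        return 1
    return 2


def multi_hot(selected: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return one 0/1 flag per vocabulary entry, 1 where the entry is selected."""
    chosen = set(selected)
    return [1 if item in chosen else 0 for item in vocabulary]


def encode_movie(year: int, minutes: int,
                 genres: Iterable[str], tags: Iterable[str]) -> List[int]:
    """Flatten the four features into a single numeric vector."""
    return (
        [decade_index(year), duration_bucket(minutes)]
        + multi_hot(genres, GENRES)
        + multi_hot(tags, TAGS)
    )


# A made-up 142-minute drama from 1994 tagged "feel good".
example_vector = encode_movie(1994, 142, ["Drama"], ["feel good"])
print(example_vector)  # [7, 2, 0, 0, 1, 0, 1, 0, 0]
```

With the full vocabularies, the same layout would give a vector of 1 + 1 + 20 + 69 = 91 numbers per movie.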
The web app was built using Python Flask. Most of our team had never used Flask before, so there was a steep learning curve. JavaScript and HTML were used to create our front end, an interactive chatbot that prompts and responds to user input. We used the OpenAI library and ChatGPT to generate unique and engaging responses to the user’s choices. A profanity filter is applied to ChatGPT’s outputs to limit controversial or offensive speech.
The user’s inputs are classified as either constraints or preferences. The selection process begins with the constraints, such as maturity rating, director preference, and cast preference, which narrow down the dataset. BreadBot then offers the user the opportunity to enter open-ended text to describe anything they wish to express (e.g., “I want something with a lot of explosions”, “give me something gut-wrenching and intimate”, “I like long montages”). We used the natural language processing (NLP) tools of the Cohere API to identify the genres and tags best expressed in the user’s text. The classifier tool was trained on 40 examples and assigns the text to one of the 20 standard genres; asking for “explosions”, for example, maps directly to action movies. The embedding tool was used to calculate embeddings for the 69 subjective tags and for the textbox input, and the cosine similarity between them determines the 4 best-matching tags. Other subjective inputs, like genre and duration preferences, are then combined with the NLP results to create a unique input vector for the user’s submission that is structured identically to the movie-specific vectors created earlier.
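The tag-matching step can be illustrated with a small sketch. It assumes the embeddings have already been retrieved (in our case from the Cohere API; no API call is shown here), and the function names and the tiny three-dimensional vectors are made up for the example. Real embeddings have far more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_tags(text_embedding: Sequence[float],
             tag_embeddings: Dict[str, Sequence[float]],
             k: int = 4) -> List[str]:
    """Return the k tags whose embeddings are most similar to the text embedding."""
    ranked = sorted(
        tag_embeddings,
        key=lambda tag: cosine_similarity(text_embedding, tag_embeddings[tag]),
        reverse=True,
    )
    return ranked[:k]


# Invented toy embeddings, standing in for the real ones.
toy_tag_embeddings = {
    "feel good": [0.0, 1.0, 0.0],
    "gut-wrenching": [1.0, 0.0, 0.0],
    "psychedelic": [0.0, 0.0, 1.0],
}
toy_text_embedding = [0.9, 0.1, 0.0]  # pretend embedding of the user's text

best_tags = top_tags(toy_text_embedding, toy_tag_embeddings, k=2)
print(best_tags)  # ['gut-wrenching', 'feel good']
```

The selected tags would then be switched on in the tag portion of the user's input vector, so that it lines up with the movie vectors.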
Cosine similarity is calculated between the input vector and each movie vector, and the film with the highest similarity is sent to the front end as the output. We used Beautiful Soup to scrape the poster of the output movie from a Google Images query.
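A minimal sketch of the ranking step is shown below. It is one possible arrangement rather than our exact implementation: each movie vector is scaled to unit length once up front, so every later query only needs a dot product per movie. The helper names, movie titles, and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (an all-zero vector stays all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return [0.0] * len(vector)
    return [x / norm for x in vector]


def build_index(movie_vectors: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Normalize every movie vector once so it can be reused across queries."""
    return {title: normalize(vec) for title, vec in movie_vectors.items()}


def best_match(user_vector: Sequence[float],
               index: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the title with the highest cosine similarity and its score."""
    query = normalize(user_vector)

    def score(title: str) -> float:
        # For unit-length vectors, the dot product equals the cosine similarity.
        return sum(q * m for q, m in zip(query, index[title]))

    winner = max(index, key=score)
    return winner, score(winner)


# Made-up titles and shortened vectors (decade, duration, two genre flags).
movies = {
    "Movie A": [7, 2, 0, 1],
    "Movie B": [1, 0, 1, 0],
}
index = build_index(movies)
title, similarity = best_match([7, 2, 0, 1], index)
print(title)  # Movie A
```

Because the index is built once, the per-request cost is a single pass over the movies remaining after the constraints have been applied.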
Using the Cohere API posed a challenge, as it was our first time using external APIs for machine learning. Learning how to deal with errors and bad requests took time. Moreover, managing communication between the front end and back end, especially when dealing with large amounts of variable user input and data, was challenging. We used Ajax in the development of the front end to make this process easier.
We are proud to have pulled off what we thought was too ambitious an idea. It is really cool how we were able to integrate so many different tools to create this project as a whole. We learned to take advantage of the tools out there to facilitate the process of creating a large-scale project.
Given the time constraints for this hackathon project, there are several improvements that we wish to implement in the future. One of the issues with our algorithm is its run time, because we iterate excessively through the dataset when calculating the similarity for each movie. Although calculating these similarities is essential, there is likely a more optimized method of delivering the movie suggestion. We love our sleek UI, but given more time, we would like to integrate more graphics, designs, colours, and perhaps an audio feature into the chatbot. Furthermore, it would be great to extend this project to allow groups of people to choose a movie that is optimal for all of them by combining individual preferences. Lastly, we would want to extend our dataset beyond just 1,000 of the most popular movies on IMDB.