We first briefly describe the problems that arise in a recommender system and the existing algorithms that address them. We then describe our innovative algorithm for preference elicitation.
Let us assume that we have
Thus we have a matrix
For example, on Mangaki,
We implemented this algorithm for our recommendations.
- Compute a similarity score between users
- Determine the $k$ nearest neighbors of a given user $u$
- Recommend to user $u$ what his nearest neighbors liked and that he has not watched yet.
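The three steps above can be sketched as follows. This is a minimal illustration, not the Mangaki implementation: it assumes ratings are stored in a dense NumPy array with 1 for "liked", -1 for "disliked" and 0 for "not rated", and it uses the dot product of two rating rows as the similarity score. The function name `recommend` and the sample data are hypothetical.

```python
import numpy as np


def recommend(ratings, user, k=2, n_items=3):
    """Return up to n_items works liked by the k nearest neighbors of `user`.

    ratings: (n_users, n_works) array; assumed encoding is
    1 = liked, -1 = disliked, 0 = not rated.
    """
    # Step 1: similarity between `user` and every user (dot product of rows).
    similarity = (ratings @ ratings[user]).astype(float)
    similarity[user] = -np.inf  # a user is not their own neighbor

    # Step 2: indices of the k most similar users.
    neighbors = np.argsort(-similarity, kind="stable")[:k]

    # Step 3: sum the neighbors' ratings, hide works the user already rated,
    # and keep only works with a positive aggregate score.
    scores = ratings[neighbors].sum(axis=0).astype(float)
    scores[ratings[user] != 0] = -np.inf
    ranked = np.argsort(-scores, kind="stable")
    return [int(i) for i in ranked[:n_items] if scores[i] > 0]


# Made-up data: 4 users (rows) and 4 works (columns).
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, -1, 0, 1],
    [-1, -1, 0, 1],
])
print(recommend(ratings, user=0, k=2))  # [2, 3]
```

A dense array keeps the sketch short; a real userbase would more likely need a sparse matrix, since most users rate only a small fraction of the works.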
The similarity score is the dot product of the ratings, where
All our algorithms are compared using cross-validation.
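As an illustration of what such a comparison can look like, here is a minimal k-fold cross-validation sketch. It assumes ratings are available as `(user, work, rating)` triples and that algorithms are scored by the fraction of held-out ratings they predict exactly; both the data layout and the metric are assumptions, and the names `k_fold_splits`, `cross_validate` and `majority_baseline` are hypothetical.

```python
import random
from collections import Counter


def k_fold_splits(ratings, n_folds=5, seed=0):
    """Yield (train, test) lists built from (user, work, rating) triples."""
    if len(ratings) < n_folds:
        raise ValueError("need at least one rating per fold")
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    for fold in range(n_folds):
        # Every n_folds-th triple, starting at `fold`, is held out.
        test = shuffled[fold::n_folds]
        train = [r for i, r in enumerate(shuffled) if i % n_folds != fold]
        yield train, test


def cross_validate(ratings, fit, n_folds=5, seed=0):
    """Mean fraction of held-out ratings predicted correctly.

    `fit(train)` must return a function predict(user, work) -> rating.
    """
    accuracies = []
    for train, test in k_fold_splits(ratings, n_folds, seed):
        predict = fit(train)
        hits = sum(predict(u, w) == r for u, w, r in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


def majority_baseline(train):
    """Trivial reference algorithm: always predict the most common rating."""
    most_common = Counter(r for _, _, r in train).most_common(1)[0][0]
    return lambda user, work: most_common
```

Any algorithm that can be wrapped in a `fit` function of this shape can then be scored on the same folds, which makes the scores directly comparable.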
This is another algorithm for matrix completion. We used the implementation provided by scikit-learn, an open source Python library for machine learning.
For which works should we ask a newcomer “Did you like this movie?” in order to build his profile efficiently?
- Items should be popular, in order to be known by the user, so he can provide a rating.
- Items should be controversial, because for example, the fact that you liked Star Wars does not provide a lot of information about your tastes.
Our idea is to find the items that best split the userbase into three sets of roughly the same probability mass, see \ref{test}.
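One way to make this criterion concrete is sketched below. It assumes the three sets are the users who liked a work, those who disliked it, and those who do not know it, and it scores each work by the Shannon entropy of that split, which is maximal when the three sets have equal mass. The choice of entropy, the function names and the sample data are illustrative assumptions, not necessarily the criterion used on Mangaki.

```python
from collections import Counter
from math import log


def split_entropy(answers):
    """Entropy (in nats) of the like / dislike / unknown split for one work."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * log(c / total) for c in counts.values())


def best_questions(answers_by_work, n=1):
    """Return the n works whose split of the userbase is closest to even.

    answers_by_work: {work: [one answer per user]}, with answers assumed
    to be "like", "dislike" or "unknown".
    """
    return sorted(
        answers_by_work,
        key=lambda work: split_entropy(answers_by_work[work]),
        reverse=True,
    )[:n]


# Made-up answers from 9 users for three hypothetical works.
example = {
    "popular": ["like"] * 8 + ["dislike"],          # almost everyone likes it
    "obscure": ["unknown"] * 8 + ["like"],          # almost nobody knows it
    "divisive": ["like"] * 3 + ["dislike"] * 3 + ["unknown"] * 3,
}
print(best_questions(example))  # ['divisive']
```

The score penalizes a work that nearly everyone likes, since its answer carries little information, as well as a work that few users know.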
Python is a mature, widely known language, used successfully by a large number of companies and organizations. It was also the natural choice for us, as it is our favorite language for its readability and simplicity.
Most actions (rating, loading of works metadata such as title and poster) are made using AJAX calls. We plan to move to React.js.
Django is described as a "framework for perfectionists with deadlines". It proved to be an empowering tool that removes the boilerplate we would otherwise have encountered while developing Mangaki.
Django powers Pinterest and Instagram, and also Facebook for various behind-the-scenes utilities. As a consequence of this success, many Django packages are available to speed up development. Another strong point of Django is its crystal-clear documentation.
The Mangaki project currently uses two large Django applications:
- Mangaki, which is the core of the web application.
- IRL, which manages the real-life matters around the Mangaki community (partners and events).
Django enables us to:
- use its own ORM (object-relational mapping), which makes modeling our data easier
- use PostgreSQL databases
- create our own admin commands to analyze user data and apply our algorithms to it (findneighbors.py, for example)
- easily version control our database with migrations
Every Mangaki instance is easily configured with a settings.py file, together with a secret.py file that contains all sensitive data.
We provide initial, anonymous seed data (called fixtures, in JSON format), which are loaded when a new Mangaki instance is provisioned.
Thanks to the django-extensions package, it is possible to use IPython notebooks to keep track of ideas and make reproducible demos and visualizations.
A Vagrantfile allows anyone to create Mangaki instances easily with Vagrant and a provisioning folder that contains an Ansible playbook.
Azure makes it easy to create virtual machines. This matters to us because we can spawn virtual machines on the fly, or act on the infrastructure (advanced security policies with JSON Web Token), through a RESTful API.
We can experiment freely while keeping our production website alive.
Azure is the key to enabling this: when we spawn virtual machines, a front Nginx web server routes incoming requests:
- api.mangaki.fr to the API server
- mangaki.fr to the production server
- dev.mangaki.fr to the development server
- research.mangaki.fr to the research server
The idea is to split our platform into isolated services. This lets us use the failover feature of Nginx in the following way:
    upstream mangaki_production {
        server {{ main_mangaki_production_server_ip }};
        server {{ fallback_mangaki_production_server_ip }} backup;
    }
Thanks to this, we can plug in a confd-like tool to provision our Nginx configuration in real time from the Azure API.
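The core of such a confd-like tool can be sketched in a few lines of Python: render the upstream block from the current server addresses, and rewrite the file only when something changed. This is a minimal sketch under stated assumptions: how the addresses are obtained from the Azure API is deliberately left out, the function names are hypothetical, and the IP addresses are documentation placeholders.

```python
from pathlib import Path

# Same structure as the upstream block above; doubled braces are literal braces.
UPSTREAM_TEMPLATE = """upstream {name} {{
    server {main_ip};
    server {fallback_ip} backup;
}}
"""


def render_upstream(name, main_ip, fallback_ip):
    """Return an Nginx upstream block with one main and one backup server."""
    return UPSTREAM_TEMPLATE.format(
        name=name, main_ip=main_ip, fallback_ip=fallback_ip
    )


def write_if_changed(path, content):
    """Write `content` to `path`; return True if the file was created or changed."""
    path = Path(path)
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True


# Placeholder addresses; in practice they would come from the infrastructure API.
config = render_upstream("mangaki_production", "192.0.2.10", "192.0.2.11")
print(config)
```

A return value of `True` from `write_if_changed` signals that Nginx needs to reload its configuration, so unchanged polls cost nothing.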
As contributors to an open source service, we want to test and peer review our new features before putting them into production. Coordinating all the changes is difficult, though: we want a pipeline where we can aggregate a set of features, check their impact and push them to production.
Basically, we would like to have different environments of Mangaki, linked to a copy or a subset of the production database.
This is possible with Azure: we can dynamically spawn new routes in the Nginx router, or simply aggregate them in a dev / staging environment, check the changes, discuss them with contributors and push them to production.
Thus, Azure is a powerful infrastructure that makes our organization more agile: by writing smart integrations, scripts and triggers, we make our life easier and can focus on the code.