This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Opinion mining service on a Persian dataset

The idea was proposed and implemented over two days at the Snapp Hackathon event. The output is a complete product capable of opinion mining. Open source is holy, so we are going to publish our experience in as much detail as we remember, hoping to help anyone who may need it.

Our current service design

design

This service can easily become real-time by replacing the simulator with a scraper, which was not our goal in the MVP.

Process

  1. Data: In any ML project, the first and most important factor is a rich dataset. The dataset can lead a project to success or failure, and that is where our trouble began: finding a Persian dataset is neither easy nor free. Our dataset contained 21 thousand comments on different subjects such as hotels and cellphones, and it also included some news, so I think we were not biased toward a single subject. We bought the dataset, so we are not allowed to publish it, but if you are working on a similar project, email me for the dataset.
    Flaws and challenges: Your dataset should represent the real data you want to predict on. The data we trained our model on was very clean, without any emojis or marks, so it was not a good representation of real comments. Keep in mind that many people comment ironically: the words are nice, but the meaning is the complete opposite. Your model should be complex enough to capture the meaning in these cases.

  2. Model: Once you have the data, you can start implementing and training your model. You don't need to implement the model from scratch; we used the code from this famous repository.
    Flaws and challenges: As far as I know, one of the best models for opinion mining is conditional random fields, but it just doesn't work on a Persian dataset because nltk doesn't support Persian and we don't have a comparably powerful library for it. So we had to give up on that and train a simple SVM model (using tf-idf to reduce the weights of common words that carry no value). We didn't have time to train another model, but never overlook the benefit of ensemble models. I should mention that we implemented an RNN model too but couldn't train it because the Jupyter notebook kernel died.

  3. Save trained model: We trained the model on my local server and wanted to use it on the cloud. We didn't have enough resources to train it there, and even if we had, it obviously would not have been wise to do so. We used "pickle" to save the trained model to a file with a ".sav" extension, then loaded the model on the cloud and used it.

  4. Software engineering:
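
As a rough illustration of steps 2 and 3 above, here is a minimal sketch of training a tf-idf + SVM classifier and moving it between machines with pickle. It is not the code we ran during the event: it assumes scikit-learn (`TfidfVectorizer`, `LinearSVC`), and the toy comments, the labels, the helper names and the `sentiment_model.sav` file name are all made up for the example.

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory

# Assumption: scikit-learn is used for both the tf-idf features and the SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the purchased (unpublished) dataset.
comments = [
    "هتل عالی بود",        # "the hotel was great"
    "خدمات خیلی خوب بود",  # "the service was very good"
    "گوشی بسیار بد است",   # "the phone is very bad"
    "کیفیت افتضاح بود",    # "the quality was awful"
]
labels = ["positive", "positive", "negative", "negative"]


def train(texts, targets):
    """Fit a tf-idf + linear SVM pipeline on raw comment strings."""
    # Keeping the vectorizer and the classifier in one pipeline means a
    # single pickle file carries the vocabulary together with the weights.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(texts, targets)
    return model


def save_model(model, path):
    """Serialize the fitted pipeline to disk (we used a .sav extension)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pipeline saved by save_model. Only unpickle trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)


model = train(comments, labels)

# A temporary directory stands in for "local server -> cloud" here.
with TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "sentiment_model.sav"
    save_model(model, model_path)
    restored = load_model(model_path)
    predictions = restored.predict(comments).tolist()
```

The restored pipeline accepts raw strings, so the serving side does not need to rebuild the tf-idf vocabulary. Note that a pickled scikit-learn model is generally only safe to load with the same library version it was saved with, so the training machine and the cloud environment should be kept in sync.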

Result

Here is the result on our test data.

results