28th of June Update #5
Company Names: Clustering using Semi-Supervised Learning
I built a proof of concept of a clustering model that uses semi-supervised learning to group company names by their name and address (further variables could be included in the future). To test it, I used the data set that we have been manually cleaning on the spreadsheet and trained the model on a subset of the manually validated names:
Overall, the model saw 3910 rows of data such as:
These rows correspond to 276 manually identified companies. The 3910 rows produce 15,288,100 pairs (3910 × 3910). I trained the model in a semi-supervised setting by manually reviewing 130 pairs of rows and marking them as "the same company" (114) or "not the same company" (17). The model then clustered the data and output, for each row, the ID of the cluster it belongs to. It found 331 clusters. The following table shows the model output with the cluster ID and confidence score.
To assess the result, I calculated the precision and recall of the process against the ground truth that we created manually on the spreadsheet.
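The update does not say exactly how precision and recall were computed. One common way to score a clustering against manual labels is pairwise: a pair of rows counts as a positive when both rows sit in the same cluster. The sketch below assumes that definition; the function names and the toy data are hypothetical and are not taken from the spreadsheet.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of row-index pairs (i, j), i < j, that share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_precision_recall(true_labels, predicted_labels):
    """Pairwise precision and recall of predicted clusters against ground truth.

    Assumption: both lists are aligned row by row.
    """
    true_pairs = same_cluster_pairs(true_labels)
    predicted_pairs = same_cluster_pairs(predicted_labels)
    correct = len(true_pairs & predicted_pairs)
    # Guard against empty pair sets to avoid division by zero.
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Made-up toy data: the model splits one true company into two clusters.
truth = ["acme", "acme", "acme", "globex", "globex"]
predicted = [0, 0, 1, 2, 2]
precision, recall = pairwise_precision_recall(truth, predicted)
print(precision, recall)
```

In this toy case every predicted pair is correct, but splitting one company into two clusters loses half of the true pairs. That is the same kind of over-splitting as finding more clusters than there are real companies.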
To further test the model and check that it is not overfitting, I applied the same test to data outside the training set: 9000 company names, which the model grouped into 61 clusters but which were actually identified as 30 companies. Here the precision and recall were somewhat lower, but still acceptable.
The goal now is to make this scale so that it finds more clusters. This test ran in memory from a CSV file, but the library can connect to a PostgreSQL database and work with more rows. Note that this test covers only 9000 company names, while the entire dataset has around 10,000,000 company names for consignee names alone.
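To give a rough sense of why scaling needs care, the number of candidate pairs grows quadratically with the number of rows. The sketch below is plain arithmetic on the row counts quoted in this update. The helper names are hypothetical, and it does not model how the library actually selects pairs to compare.

```python
def ordered_pairs(n):
    """All (row, row) combinations, including self-pairs: n * n."""
    return n * n


def unordered_pairs(n):
    """Distinct pairs of different rows: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Row counts quoted in this update: training subset, held-out test, full consignee set.
for n in (3910, 9000, 10_000_000):
    print(f"{n:>12,} rows -> {ordered_pairs(n):>22,} ordered, "
          f"{unordered_pairs(n):>22,} unordered pairs")
```

Even counting each unordered pair once, 10,000,000 names give roughly 5 × 10^13 pairs, so comparing every pair directly is unlikely to be practical at that size.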
1.1 Question ID = 1 (6,231 questions of type "How much did
1.2 Question ID = 2 (20,000 out of 46,872 questions of type "How much did
1.3 Question ID = 3 (15,000 out of 34,955 questions of type "How much
For next week (on my side):
Then for the week after:
Fine-tuning: