This is the codebase for the ShopeeSense submission to the Shopee Data Science Competition. The competition requires participants to extract attributes from the image and the title of a listed product. The products come from three categories: Mobile, Beauty, and Fashion.
For each category, we built four algorithms whose outputs we ensemble to produce the final prediction. The four algorithms are:
- Term Frequency-Inverse Document Frequency (TF-IDF), Singular Value Decomposition (SVD), and a Gradient Boosted Decision Tree (GBDT) for the product title (sketched below).
- A Recurrent Neural Network with the ASGD Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture for the product title.
- A Convolutional Neural Network with a pretrained ResNet34 for the product image (sketched below).
- A heuristic algorithm based on the product title.
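To make the first component concrete, here is a minimal sketch of a TF-IDF -> SVD -> GBDT pipeline for product titles, assuming scikit-learn and LightGBM; the hyperparameters are placeholders, and the actual feature engineering lives in `model/text/lgb/`.

```python
# Minimal sketch of the TF-IDF -> SVD -> GBDT title model (illustrative only;
# see model/text/lgb/ for the real feature engineering and tuning).
import lightgbm as lgb
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def fit_title_model(titles, labels):
    """Vectorize titles, reduce dimensionality, then fit a GBDT classifier."""
    featurizer = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # sparse n-gram features
        TruncatedSVD(n_components=128),                 # dense latent features
    )
    features = featurizer.fit_transform(titles)
    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(features, labels)
    return featurizer, model
```

Likewise, the image model can be sketched as an ImageNet-pretrained torchvision ResNet34 with its final layer replaced; the actual training code is in `model/image/pytorch/`.

```python
# Sketch of the image model: pretrained ResNet34 with a new classifier head.
import torch.nn as nn
from torchvision import models

def build_image_model(n_classes: int) -> nn.Module:
    model = models.resnet34(pretrained=True)               # pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, n_classes)  # new classifier head
    return model
```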
The overall architecture of the model is shown in `try.png` in the repository root.
These instructions will get you a copy of the project up and running on your local machine for development and deployment purposes.
Below are the things you need to install and how to install them.
We use pipenv for managing project dependencies and Python environments (i.e. virtual environments). All direct package dependencies (e.g. NumPy), as well as all the packages used during development (e.g. flake8 for code linting, IPython for interactive console sessions, etc.), are described in the `Pipfile`. Their precise downstream dependencies are described in `Pipfile.lock`.
First, install pyenv:

```
sudo apt install pyenv
```
Other installation instructions can be found in the official pyenv documentation.
Then add pyenv to your shell and reload your profile:

```
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
source ~/.bash_profile
```
To see all available Python versions:

```
pyenv install --list
```
To install Python 3.7.2:

```
pyenv install 3.7.2
```
To install Python 3.7.2 and pin it for this project's directory:

```
cd <project-root>
pyenv install 3.7.2
pyenv local 3.7.2
```
To get started with pipenv, first install it. Assuming that a global version of Python is available on your system and on the PATH, this can be achieved by running:

```
pip install pipenv
```
Then set up the project's pipenv environment for the first time:

```
pipenv --python 3.7.2
```
For more information, including advanced configuration options, see the official pipenv documentation.
To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (`https://www.kaggle.com/<username>/account`) and select 'Create API Token'. This will trigger the download of `kaggle.json`, a file containing your API credentials. Place this file in the location `~/.kaggle/kaggle.json` (on Windows in the location `C:\Users\<Windows-username>\.kaggle\kaggle.json` - you can check the exact location, sans drive, with `echo %HOMEPATH%`). You can define a shell environment variable `KAGGLE_CONFIG_DIR` to change this location to `$KAGGLE_CONFIG_DIR/kaggle.json` (on Windows it will be `%KAGGLE_CONFIG_DIR%\kaggle.json`).
For your security, ensure that other users of your computer do not have read access to your credentials. On Unix-based systems you can do this with the following command:
```
chmod 600 ~/.kaggle/kaggle.json
```
You can also choose to export your Kaggle username and token to the environment:
```
export KAGGLE_USERNAME=datadinosaur
export KAGGLE_KEY=xxxxxxxxxxxxxx
```
In addition, you can export any other configuration value that would normally be in `$HOME/.kaggle/kaggle.json` in the format `KAGGLE_<variable>` (note uppercase). For example, if the file had the variable "proxy", you would export `KAGGLE_PROXY` and it would be discovered by the client.
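Once the credentials are in place, the official `kaggle` Python package can also be used directly. A minimal sketch (the competition slug below is a placeholder; `make prepare-data`, described later, automates the download):

```python
# Authenticate and download competition files with the official kaggle package.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json or the KAGGLE_* env vars
api.competition_download_files("<competition-slug>", path="data/")  # placeholder
```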
Make sure that you're in the project's root directory (the same one in which the `Pipfile` resides), and then run:

```
make setup
```
This single command ensures that pyenv and pipenv are installed on your machine, and installs all of the project's package dependencies. Next, run:

```
make prepare-data
```
This will do the following:
- Creates the `/data` folder
- Downloads all the data from the Shopee Kaggle dataset into `/data`, and unzips all of it
- Renames all files that end with `_info_val_competition.csv` to `_info_test_competition.csv`
- Splits the training files into a development dataset and a validation dataset, `_info_dev_competition.csv` and `_info_val_competition.csv`

The development and validation datasets are subsets of the training dataset: we train the model on the development dataset and use the validation dataset to estimate the model's performance.
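Conceptually, the split (presumably handled by `model/common/split_data.py`) is equivalent to something like the following sketch; the input file name and the 80/20 ratio are illustrative assumptions:

```python
# Illustrative dev/validation split, assuming pandas and scikit-learn.
# The file name and split ratio are assumptions, not the exact ones used.
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("data/mobile_data_info_train_competition.csv")
dev_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
dev_df.to_csv("data/mobile_data_info_dev_competition.csv", index=False)
val_df.to_csv("data/mobile_data_info_val_competition.csv", index=False)
```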
To download the product images, run:

```
bash bins/dl_large_files.sh
```

This script will try to download the Beauty, Fashion, and Mobile image files.
```
├── bins
│   ├── data.sh
│   ├── dl_large_files.sh
│   ├── fastai.sh
│   ├── img.sh
│   ├── lgb.sh
│   └── setup_linux.sh
├── config.sh
├── data
│   └── fono_api
├── __init__.py
├── LICENSE
├── Makefile
├── model
│   ├── common
│   │   ├── __init__.py
│   │   ├── split_data.py
│   │   └── topic.py
│   ├── heuristic
│   │   ├── fashion_library.json
│   │   ├── __init__.py
│   │   └── mobile
│   │       ├── enricher.py
│   │       ├── extractor.py
│   │       └── __init__.py
│   ├── image
│   │   ├── fastai
│   │   │   ├── __init__.py
│   │   │   ├── main.py
│   │   │   └── ml_model.py
│   │   ├── __init__.py
│   │   └── pytorch
│   │       ├── dataset.py
│   │       ├── __init__.py
│   │       ├── test.py
│   │       ├── train.py
│   │       └── validation.py
│   ├── __init__.py
│   ├── leak
│   │   ├── __init__.py
│   │   └── main.py
│   └── text
│       ├── common
│       │   ├── __init__.py
│       │   └── prediction.py
│       ├── fastai
│       │   ├── class_model.py
│       │   ├── __init__.py
│       │   ├── lm_model.py
│       │   └── main.py
│       ├── __init__.py
│       ├── lgb
│       │   ├── config.py
│       │   ├── eta_zoo.py
│       │   ├── __init__.py
│       │   ├── lgb_model.py
│       │   ├── main.py
│       │   └── tuning.py
│       └── utils
│           ├── common.py
│           └── __init__.py
├── notebooks
│   ├── adi
│   │   ├── combine_answer.ipynb
│   │   ├── concat_submission.ipynb
│   │   ├── fastai_clasifier.ipynb
│   │   ├── fastai_lm.ipynb
│   │   ├── ida_beauty_prediction.ipynb
│   │   ├── ida_fashion_prediction.ipynb
│   │   ├── image_fastai.ipynb
│   │   ├── image_fastai_pycharm.ipynb
│   │   ├── image_folder_management.ipynb
│   │   ├── __init__.py
│   │   ├── kyle_enricher.ipynb
│   │   ├── Kyle LM.ipynb
│   │   ├── kyle_prediction.ipynb
│   │   ├── kyle_validation.ipynb
│   │   ├── leak_answer.ipynb
│   │   ├── lgb_model.ipynb
│   │   └── submission.ipynb
│   ├── ida
│   │   └── fashion_heuristics_generate_review_heuristics_library.ipynb
│   ├── kyle
│   │   ├── first.py
│   │   └── __init__.py
│   └── placeholder
├── output
│   ├── logs
│   │   └── placeholder
│   ├── model_checkpoint
│   │   └── placeholder
│   ├── result
│   │   └── placeholder
│   └── result_metadata
│       └── placeholder
├── Pipfile
├── Pipfile.lock
├── README.md
├── try.png
└── utils
    ├── api_keys.py
    ├── common.py
    ├── envs.py
    ├── fonAPI.py
    ├── __init__.py
    ├── logger.py
    └── pytorch
        ├── callbacks
        │   ├── callback.py
        │   ├── __init__.py
        │   ├── loss.py
        │   └── optim.py
        ├── __init__.py
        └── utils
            ├── checkpoint.py
            ├── common.py
            ├── __init__.py
            └── lr_finder.py
```
All of the main algorithms are located under the `model/` directory. Each of them (`model/image/fastai`, `model/image/pytorch`, `model/text/lgb`, `model/text/fastai`) has a `main.py` which can be run directly to obtain the final predictions. `config.sh` needs to be run before each of these scripts to initialize the environment variables, e.g. `source config.sh && python model/text/lgb/main.py`.
Ensembling the predictions from each of the ML models can be done using the following Jupyter notebook: `notebooks/adi/concat_submission.ipynb`.
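At its core, combining the per-model outputs amounts to reading each model's submission file and merging them. A rough sketch with hypothetical file names (the actual logic lives in the notebook above):

```python
# Rough sketch of merging per-model submission files (paths are hypothetical).
import pandas as pd

paths = [
    "output/result/lgb_submission.csv",
    "output/result/fastai_text_submission.csv",
    "output/result/image_submission.csv",
]
frames = [pd.read_csv(path) for path in paths]
submission = pd.concat(frames, ignore_index=True)
submission.to_csv("output/result/final_submission.csv", index=False)
```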
The source code for the Mobile, Beauty, and Fashion heuristic algorithms is under `model/heuristic` (a simplified sketch of the shared title-matching idea follows the list):

- Run the mobile heuristics on your dataset using `notebooks/adi/kyle_prediction.ipynb`.
- Run the beauty heuristics on your dataset using `model/heuristics/beauty_heuristics.py`.
- Run the fashion heuristics on your dataset using `model/heuristics/fashion_heuristics.py`.
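The sketch below matches a library of known attribute values, such as `model/heuristic/fashion_library.json`, against a product title; the `{attribute: [values]}` library schema is an assumption for illustration:

```python
# Simplified title-matching heuristic. The library schema is assumed to be
# {attribute: [known values]}; see model/heuristic/ for the real logic.
import json
import re

def extract_attributes(title: str, library: dict) -> dict:
    """Return {attribute: value} for each library value found in the title."""
    found = {}
    lowered = title.lower()
    for attribute, values in library.items():
        for value in values:
            if re.search(r"\b" + re.escape(value.lower()) + r"\b", lowered):
                found[attribute] = value
                break  # keep the first matching value per attribute
    return found

with open("model/heuristic/fashion_library.json") as f:
    library = json.load(f)

print(extract_attributes("floral maxi dress red size m", library))
```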
This project is built with the following packages:

- NumPy - the fundamental package for scientific computing with Python
- tqdm - a fast, extensible progress bar for Python and CLI
- pandas - powerful Python data analysis toolkit
- seaborn - statistical data visualization
- PyTorch - deep learning framework for Python
- torchvision - image toolkit for PyTorch
- ipykernel - IPython kernel for Jupyter
- kaggle - official Kaggle API
- scikit-learn - machine learning in Python
- scikit-image - image processing in Python
- category-encoders - categorical encoding for Python
- xgboost - optimized distributed gradient boosting library
- lightgbm - gradient boosting framework that uses tree-based learning algorithms
- argparse - parser for command-line options, arguments and sub-commands
- fastai - making neural nets uncool again
Please read `CONTRIBUTING.md` for details on our code of conduct and the process for submitting pull requests to us.
Authors:

- Aditya Kelvianto Sidharta (AdityaSidharta)
- Kyle Tan (kyleissuper)
- Idawati Bustan (idawatibustan)
- Nikolas Lee (nykznykz)
This project is licensed under the MIT License - see the `LICENSE` file for details.
A huge thank you to the Shopee Data Science team for organizing this hackathon.