Predicts the Indonesian word spoken in a single speech audio clip. Built with Python audio libraries (librosa, noisereduce, and SciPy), MFCC (Mel-Frequency Cepstral Coefficients) features, and a random forest model with hyperparameter tuning.
The dataset used in this repository is an audio dataset retrieved from the Kaggle link below.
The dataset contains seven classes, each with roughly 210 to 213 WAV files. Each class is a single spoken word: BEGAL, KEBAKARAN, KECELAKAAN, MALING, PENCURI, RAMPOK, and TABRAKAN. The visualization below shows how the audio files are distributed across the classes.
Every file in each class is in WAV format. Below are example audio clips from the dataset.
The recordings also contain two types of ambient noise: rain and road ambience. In addition, each clip has silence at its start and end, so the raw audio is not clean enough to process directly. Preprocessing is therefore needed to remove the noise and trim the silent parts.
The preprocessing techniques used on this dataset are noise reduction with the noisereduce library and silence trimming with librosa. Below are example results of the audio after preprocessing.
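
The idea behind silence trimming can be sketched without any audio library. The function below is a NumPy-only stand-in, not the repository's code: it drops leading and trailing frames whose RMS level falls more than `top_db` decibels below the loudest frame, which is the same idea librosa exposes as `librosa.effects.trim`. The name `trim_silence`, the parameter defaults, and the synthetic test clip are all assumptions made for illustration, and noise reduction is not covered here.

```python
import numpy as np

def trim_silence(y, top_db=30.0, frame_length=2048, hop_length=512):
    """Hypothetical helper: trim quiet frames from both ends of a mono signal."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        return y, (0, len(y))
    n_frames = 1 + (len(y) - frame_length) // hop_length
    # RMS level of each (overlapping) frame.
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Level in dB relative to the loudest frame; the floor avoids log(0).
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    keep = np.flatnonzero(db > -top_db)
    if keep.size == 0:
        return y[:0], (0, 0)
    start = keep[0] * hop_length
    end = min(len(y), keep[-1] * hop_length + frame_length)
    return y[start:end], (start, end)

# Synthetic clip (assumed 16 kHz): 1 s silence, 1 s tone, 1 s silence.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
clip = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed, (start, end) = trim_silence(clip)
```

Because the decision is made per frame, the trimmed clip keeps a short margin of silence on each side of the spoken word rather than cutting exactly at its first and last sample.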
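MFCC is a feature extraction technique that represents audio frequencies the way human hearing perceives them. The main reason for using MFCC is that human hearing does not perceive frequency on a linear scale: a given difference between two low frequencies is easier to hear than the same difference between two high frequencies. MFCC extraction consists of 6 steps.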
- Pre-emphasis
- Audio framing
- Audio windowing
- Mel filterbank
- Logarithm of the filterbank energies
- Discrete Cosine Transform
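
The six steps above can be sketched with NumPy and SciPy. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `mfcc_sketch`, the frame size (400 samples), hop (160 samples), FFT size (512), 26 mel filters, 13 coefficients, and the 0.97 pre-emphasis factor are common textbook values chosen here, not values taken from the source. The power spectrum (FFT) is computed between windowing and the mel filterbank, since the filterbank operates on it.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_sketch(y, sr, n_mfcc=13, n_mels=26, n_fft=512,
                frame_len=400, hop=160, preemph=0.97):
    """Hypothetical helper: returns an array of shape (n_frames, n_mfcc)."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # 1. Pre-emphasis: boost high frequencies with a first-order filter.
    y = np.append(y[0], y[1:] - preemph * y[:-1])
    # 2. Framing: slice the signal into overlapping frames.
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx]
    # 3. Windowing: taper each frame with a Hamming window.
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame (input to the filterbank).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 4. Mel filterbank: triangular filters evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    mel_energy = power @ fbank.T
    # 5. Log: compress the dynamic range; the floor avoids log(0).
    log_mel = np.log(np.maximum(mel_energy, 1e-10))
    # 6. DCT: decorrelate and keep the first n_mfcc coefficients.
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

# Usage on a synthetic 1-second tone at an assumed 16 kHz sample rate.
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
coeffs = mfcc_sketch(tone, sr)
```

Each row of `coeffs` describes one frame. A fixed-length feature vector per clip, as a classifier needs, would still require a summary step such as averaging over frames or padding to a common length; which one the repository uses is not stated here.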
The dataset is split into two parts, a training set and a test set, with an 8:2 ratio. Because the features are high-dimensional, PCA (principal component analysis) is then used to reduce the dimensionality, which makes the data easier for the model to learn from.
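
A minimal sketch of the split and PCA steps with scikit-learn follows. The random feature matrix is a placeholder standing in for the flattened MFCC features, and the stratified split, the fixed `random_state`, and the choice to keep 95% of the variance are assumptions made for illustration; the repository's actual settings may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Placeholder data: 210 clips, 200 features each, 7 balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 200))
y = np.repeat(np.arange(7), 30)

# 8:2 split; stratify keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit PCA on the training set only, then apply the same projection to both.
pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
```

Fitting PCA only on the training data keeps information from the test set out of the projection, so the later evaluation stays honest.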
The model used is a random forest classifier, evaluated with standard metrics: accuracy, F1-score, recall, and precision. Overall, the model reaches an accuracy of at least 80%.
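
The training and evaluation steps might look like the sketch below. It runs on synthetic data from `make_classification`, so its scores say nothing about the real dataset. Grid search is an assumed tuning method, and the parameter grid, 3-fold cross-validation, and macro averaging are illustrative choices rather than the repository's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 7-class data standing in for the PCA-reduced MFCC features.
X, y = make_classification(
    n_samples=350, n_features=20, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameter tuning: try every combination in a small, assumed grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best model found on the held-out test set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(search.best_params_)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging weights all seven classes equally, which suits a dataset whose classes are roughly balanced.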



