DATA SCIENCE FINAL MOTTO
BERLIN JUNE 2025, 8:16 AM
Preface: It is up to you to find the tools that suit you. You are free to use any language of your choice for this module.
The role of the data scientist is to predict "the future" by training machine learning models on past data. They must also take the initiative: explain what value implementing their models could bring, and build tools that support decision making.
Work with SQL - dockerized PSQL environment for local engineering.
Visualize tables via pgAdmin, wrangle tables, create tables manually, other 101s.
========================
ETL - extract, transform, load.
The size of the datasets caused many issues, and in the end this was quite a time-consuming challenge. The dataset had roughly 20+ million product records for various actions, such as 'view', 'put_to_cart' or 'purchase'. The data spanned 5 months, from October 2022 to February 2023.
Key takeaways: importing, merging and transforming data, advanced PSQL engineering.
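A minimal sketch of the import / merge / transform steps, not the pipeline actually used in the module. It assumes the data arrives as one CSV file per month, and it uses Python's built-in sqlite3 as a stand-in for the dockerized PostgreSQL instance so the example runs anywhere. The column names (event_time, event_type, product_id, price) and the sample rows are hypothetical. Rows are loaded in small batches so a 20+ million row file never has to fit in memory at once.

```python
import csv
import io
import sqlite3
from itertools import islice

HEADER = "event_time,event_type,product_id,price\n"

# Hypothetical samples standing in for two monthly export files.
OCT_CSV = HEADER + (
    "2022-10-01 00:00:01,view,1001,9.99\n"
    "2022-10-01 00:00:05,put_to_cart,1001,9.99\n"
    "2022-10-01 00:01:00,view,1002,\n"
)
NOV_CSV = HEADER + (
    "2022-11-01 08:30:00,view,1003,24.50\n"
    "2022-11-01 08:31:12,purchase,1003,24.50\n"
)

def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable of rows."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, size))
        if not batch:
            return
        yield batch

def transform(row):
    """Cast text fields to proper types; an empty price becomes NULL."""
    price = float(row["price"]) if row["price"] else None
    return (row["event_time"], row["event_type"], int(row["product_id"]), price)

# SQLite in memory, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_time TEXT, event_type TEXT, product_id INTEGER, price REAL)"
)

# Merge: every monthly file is appended to the same table.
for raw in (OCT_CSV, NOV_CSV):
    reader = csv.DictReader(io.StringIO(raw))            # extract
    for batch in batches(reader, size=2):
        conn.executemany(                                # load
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            [transform(r) for r in batch],               # transform
        )
conn.commit()

# Quick sanity check: number of events per action type.
counts = dict(conn.execute(
    "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
))
print(counts)
```

On a real PostgreSQL instance, the bulk-load step is usually done with COPY rather than batched INSERTs; the batching idea above mainly matters when rows need transforming in Python first.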
Graphs, Charts, presentation of data.
Pandas, Seaborn, Matplotlib.
More PSQL engineering, data presentation and pattern identification - the clustering.
=======================
- Dataset exploration (visual) - seaborn, polars (pandas)
- Finding correlating trends using visual tools - charts (seaborn), plots, histograms.
Related activities:
- Standardization = Scaling
- Preparation for training of models
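Standardization can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code from the module: the function names fit_scaler and apply_scaler and the price values are made up for the example. The point it shows is that the mean and standard deviation are learned from the training data only and then reused unchanged on the test data.

```python
import math

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)

def apply_scaler(column, mean, std):
    """Turn raw values into z-scores: (value - mean) / std."""
    return [(v - mean) / std for v in column]

# Hypothetical feature values.
train_prices = [10.0, 20.0, 30.0, 40.0, 50.0]
test_prices = [30.0, 60.0]

mean, std = fit_scaler(train_prices)                 # fit on training data only
train_scaled = apply_scaler(train_prices, mean, std)
test_scaled = apply_scaler(test_prices, mean, std)   # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. Test values can fall outside the training range, which is expected. scikit-learn's StandardScaler follows the same fit-then-transform pattern.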
- PCA (Principal Component Analysis)
- Analysing an existing model's competence using a confusion matrix
- Multicollinearity reduction (dimension reduction) of a set and its features: understanding where the inflation is through the VIF (variance inflation factor) score
- Relation between multicollinearity and the correlation coefficient
- Evaluating and working with regression / classification models
- Identifying convergence issues
- Working with a voting classifier
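The link between the correlation coefficient and the VIF score can be made concrete for the simplest case. With exactly two predictors, regressing one on the other gives R^2 = r^2, where r is their Pearson correlation, so VIF = 1 / (1 - r^2). The sketch below covers only that two-predictor case; the feature names and values are hypothetical. With more than two predictors, each feature is regressed on all the others, and the VIF can be high even when every pairwise correlation looks moderate.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """VIF of either predictor in a model with exactly two predictors.

    Here R^2 of one predictor regressed on the other equals r^2,
    so VIF = 1 / (1 - r^2). Undefined when the predictors are
    perfectly correlated (r = 1 or r = -1).
    """
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical feature columns (illustrative values only).
price = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
price_with_tax = [12.1, 14.3, 18.2, 22.6, 29.0, 35.9]  # almost a multiple of price
views = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]                 # only weakly related to price

vif_collinear = vif_two_predictors(price, price_with_tax)
vif_independent = vif_two_predictors(price, views)

print(f"price vs price_with_tax: VIF = {vif_collinear:.1f}")
print(f"price vs views:          VIF = {vif_independent:.3f}")
```

A VIF close to 1 means the feature carries information the other one does not. A very large VIF means one of the two is nearly redundant, which makes it a candidate for removal or for merging through PCA.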
Note: although denoted as days, these projects were spread out over 7 days.