A Hackathon ML Product
12th September 2020 - 14th September 2020 (Online)
Team Name: Maven
Team Members:
This dataset is a Covid-19 Dataset given as three parts:
- Variable_Description.xlsx:
- This file contains a description of all the variables available in the dataset
- Training_data.xlsx:
- This is the training dataset on which the model has to be trained; it contains the parameters of each city as of 1st September 2020
- Test_data.xlsx:
- This is the test data on which accuracy of the model will be computed. It also contains Time Series data of Foreign Visitors to be used for Part – 02
Null Values
We found null values in the following fields:
- Population [2011]
- Population [2001]
- Sex Ratio
- Median Age
- Avg Temp
- SWM
- Toilets Avl
- Water Purity
- H Index
- Female Population
- # of hospitals
- Foreign Visitors
Categorical Variable Fields
- swm - 3 categories
- city_type - 21 categories
Continuous Variable Fields
- population2011 - 739 non-null float64
- population2001 - 295 non-null float64
- sex_ratio - 777 non-null float64
- median_age - 769 non-null float64
- avg_temp - 770 non-null float64
- toilets_avl - 761 non-null float64
- water_purity - 629 non-null float64
- h_index - 647 non-null float64
- fem_population - 646 non-null float64
- hospitals - 772 non-null float64
- foreign_visitors - 697 non-null float64
- covid_cases - 787 non-null float64
We applied the following preprocessing steps to make the data suitable for producing the required results
Null value Handling
Two types of null values are present:
- Trailing null values - We have removed them directly from the dataset
- Null values in particular fields - These values are handled by considering some properties
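The two cases above can be sketched with pandas as follows. This is only an illustration on made-up values: the README does not say which properties were used to fill field-level nulls, so the per-`city_type` median used here is an assumption, not the team's actual rule.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Training_data.xlsx; all values are made up.
df = pd.DataFrame({
    "city_type": ["A", "A", "B", "B", None],
    "median_age": [27.0, np.nan, 31.0, 33.0, np.nan],
    "hospitals": [12.0, 8.0, np.nan, 20.0, np.nan],
})

# 1) Trailing null rows: drop rows in which every column is empty.
df = df.dropna(how="all").reset_index(drop=True)

# 2) Field-level nulls. ASSUMPTION: fill with the median of the same
#    city_type, then fall back to the overall column median.
for col in ["median_age", "hospitals"]:
    group_median = df.groupby("city_type")[col].transform("median")
    df[col] = df[col].fillna(group_median).fillna(df[col].median())

print(df)
```

Filling within a group keeps the imputed value closer to comparable cities than a single global median would, but any other property-based rule could be substituted in step 2.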
Converting Categorical values
- We need to convert the categorical values of the swm field and the city_type field to numerical values
- This is done by making a dictionary and then mapping it onto the dataset
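A minimal sketch of the dictionary-and-map approach, assuming pandas. The category labels below (`LOW`, `MEDIUM`, `HIGH`, `Metro`, `Town`) are hypothetical placeholders; the real labels come from the dataset.

```python
import pandas as pd

# Hypothetical labels; the actual swm categories are not listed in this README.
df = pd.DataFrame({
    "swm": ["HIGH", "LOW", "MEDIUM", "LOW"],
    "city_type": ["Metro", "Town", "Metro", "Town"],
})

# Hand-written dictionary for the 3-category field.
swm_codes = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
df["swm"] = df["swm"].map(swm_codes)

# For a field with many categories (city_type has 21), the dictionary can be
# built from the observed values instead of being typed out.
city_type_codes = {
    label: code for code, label in enumerate(sorted(df["city_type"].unique()))
}
df["city_type"] = df["city_type"].map(city_type_codes)

print(df)
```

`Series.map` leaves any label that is missing from the dictionary as NaN, so it is worth checking for new nulls after the mapping.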
Correlation Matrix comparison before the preprocessing and after the preprocessing
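One way to produce such a comparison is sketched below with pandas on made-up numbers. The median fill is only a placeholder for the actual preprocessing, and the three columns are a small subset chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values using three of the continuous field names.
raw = pd.DataFrame({
    "population2011": [1200.0, 560.0, np.nan, 3100.0, 890.0],
    "hospitals": [14.0, 6.0, 9.0, np.nan, 10.0],
    "covid_cases": [300.0, 120.0, 210.0, 900.0, 250.0],
})

# Before: DataFrame.corr() ignores missing values pair by pair.
corr_before = raw.corr()

# After: placeholder preprocessing (median fill), assumed for illustration.
cleaned = raw.fillna(raw.median())
corr_after = cleaned.corr()

# Element-wise change in each correlation coefficient.
delta = (corr_after - corr_before).round(3)
print(delta)
```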
Aim - To predict the population growth rate of any given city over a year, a month, or a day, using a mathematical and statistical approach.
Technique Used and Reason:
The technique is essentially formula-based: it determines the population at a given point in time, and it borrows from the annuity technique that financial institutions use to work out how much a person will have to pay after a given period. Part of the annuity method was used for the following reasons:
- This method gives promising results, most likely within plus or minus 0.5 to 3 % of the actual value
- The values it produces can be checked at regular intervals, and it can be applied to both long periods (years) and short periods (days)
- Many aspects of this technique can be tailored for a variety of purposes.
- The end results can closely match the actual population rate, as most analyses use this formula to assess counts in bigger cities such as Chennai, Bangalore, and Mumbai.
Other techniques Considered:
We considered predicting Covid-19 cases directly with growth ratio and doubling rate formulas. The main issue is that the growth rate formula can only be applied to two consecutive values, such as the counts for one day and the next. In this dataset, consecutive dates and times are not given, so this method cannot be used here.
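To make the limitation concrete, here is a small sketch of the two formulas. The function names are illustrative, and the doubling-time expression assumes exponential growth between the two observations. Both functions need a pair of consecutive counts, which is what the dataset lacks.

```python
import math


def growth_ratio(cases_prev: float, cases_next: float) -> float:
    """Ratio between two consecutive case counts."""
    return cases_next / cases_prev


def doubling_time(cases_prev: float, cases_next: float, days_between: float = 1.0) -> float:
    """Days needed for cases to double, assuming exponential growth."""
    ratio = growth_ratio(cases_prev, cases_next)
    if ratio <= 1.0:
        raise ValueError("cases are not growing, so they never double")
    return days_between * math.log(2) / math.log(ratio)


# Made-up counts for two consecutive days.
print(growth_ratio(100, 110))
print(doubling_time(100, 110))
```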
Formulas Used and Explanation:
- To calculate the growth rate over a period of 10 years (2001 to 2011) (in %)
- Formula Used:
- Population Rate over 10 Years = ((Population of 2011 - Population of 2001) / Population of 2001) * 100
- To calculate the growth rate over a period of 1 year (in %)
- Formula Used:
- Population Rate of 1 year = Population Rate over 10 Years / 10
- To calculate the growth rate for 1 month (in %)
- Formula Used:
- Population Rate for 1 month = Population Rate of 1 year / 12
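The three steps can be sketched in a few lines of Python. The 10-year rate is taken here as the percentage change between the two census counts, (Population of 2011 - Population of 2001) / Population of 2001 * 100. The function names and the example populations are illustrative only.

```python
def growth_rate_10_years(pop_2001: float, pop_2011: float) -> float:
    """Percentage change in population between 2001 and 2011."""
    return (pop_2011 - pop_2001) / pop_2001 * 100


def growth_rate_1_year(rate_10_years: float) -> float:
    """Yearly rate, spreading the 10-year rate evenly over 10 years."""
    return rate_10_years / 10


def growth_rate_1_month(rate_1_year: float) -> float:
    """Monthly rate, spreading the yearly rate evenly over 12 months."""
    return rate_1_year / 12


# Made-up city: 1,000,000 people in 2001 and 1,200,000 in 2011.
decade = growth_rate_10_years(1_000_000, 1_200_000)
year = growth_rate_1_year(decade)
month = growth_rate_1_month(year)
print(round(decade, 2), round(year, 2), round(month, 4))
```

Dividing by 10 and by 12 is a linear approximation: it spreads the growth evenly and ignores compounding, which is usually acceptable for small rates over short horizons.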
1) Death Rate Hypothesis
By using the relation between the female and male populations and the covid cases, we can arguably derive a death rate as a ratio of male to female. This hypothesis could provide a factor that minimizes the difference between the population and the covid cases, and that factor could show up in the correlation matrix.
But this hypothesis fails because:
- It depends on other factors that are not present in the data, such as the death rate from causes other than covid.
- The disease has not been around long enough for more data to be available.
- Using categorical data to find a growth rate would give more accurate results.
2) Monthly Covid cases Hypothesis
The monthly covid cases were derived from the train data, and this feature was highly correlated with the train data. However, the feature could not be used on the test data, because calculating it requires the covid data of the cities in the test data, and the test data contains different cities from the train data.
COVID cases prediction PART 01
- Linear Regression: Linear regression was used first, to model the relationship between the independent and dependent attributes by fitting a linear equation. Because the dataset has a large number of independent attributes, the model does not fit well.
- Decision Tree: The training dataset is multi-dimensional. A decision tree predicts COVID cases by splitting on different attributes across its branches. It can be relatively inaccurate and unstable when additional data is given for prediction.
- Random forest classification: The RandomForest classifier aims to overcome the issues of the decision tree algorithm and to obtain accurate results. Because it is an ensemble learning model that constructs a multitude of decision trees, it achieves effective results, at the cost of higher complexity.
- AdaBoost Regressor: The AdaBoost regressor was chosen to improve accuracy by adjusting the weights iteratively. However, overfitting is the major issue in this case.
- Gradient Boosting Model: The gradient boosting algorithm was chosen to improve accuracy, because it builds a powerful ensemble from weaker learners.
- MLP Regression: The neural network approach is implemented with a Multi-Layer Perceptron. Regularisation is applied to avoid overfitting.
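A comparison of these model families could be set up along the following lines with scikit-learn. This is a sketch on synthetic data, not the team's actual pipeline: the hyperparameters are arbitrary, the 5-fold R² scoring is an assumption, and the regressor variants are used throughout (including for the random forest) because the target is a numeric case count.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed training features and covid_cases.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    # alpha is the L2 regularisation strength of the MLP; inputs are scaled first.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), alpha=1e-2, max_iter=2000, random_state=0),
    ),
}

# Mean 5-fold cross-validated R^2 for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation gives each model the same held-out folds, which makes overfitting (for example in AdaBoost or a deep decision tree) visible as a gap between training fit and validation score.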
