Flipr Hackathon 6.0 - 2020

A Hackathon ML Product

12th September 2020 - 14th September 2020 (Online)

Team Name: Maven

Team Members:

DataSet

The Covid-19 dataset is provided in three parts:

Variable_Description.xlsx:
- This file contains the description of all the variables available in the dataset.
Training_data.xlsx:
- This is the training dataset on which the model is trained. It contains the parameters of a city as of 1st September 2020.
Test_data.xlsx:
- This is the test data on which the accuracy of the model is computed. It also contains time series data of Foreign Visitors, to be used for Part – 02.

Variables Description

Data Review

Null Values

Null values are present in the following fields:

Population [2011]
Population [2001]
Sex Ratio
Median Age
Avg Temp
SWM
Toilets Avl
Water Purity
H Index
Female Population
# of hospitals
Foreign Visitors

Categorical Variable Fields

swm - 3 categories
city_type - 21 categories

Continuous Variable Fields

population2011 - 739 non-null float64
population2001 - 295 non-null float64
sex_ratio - 777 non-null float64
median_age - 769 non-null float64
avg_temp - 770 non-null float64
toilets_avl - 761 non-null float64
water_purity - 629 non-null float64
h_index - 647 non-null float64
fem_population - 646 non-null float64
hospitals - 772 non-null float64
foreign_visitors - 697 non-null float64
covid_cases - 787 non-null float64
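
The null counts and field types listed above can be reproduced with a few pandas calls. The following is a minimal sketch on a made-up toy frame: the values and the swm category names are hypothetical stand-ins, not the contents of Training_data.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data (values are made up)
df = pd.DataFrame({
    "population2011": [120000.0, np.nan, 98000.0, 45000.0],
    "sex_ratio": [940.0, 912.0, np.nan, 955.0],
    "swm": ["HIGH", None, "LOW", "MEDIUM"],
    "covid_cases": [310.0, 120.0, 95.0, 40.0],
})

# Number of missing entries per field
null_counts = df.isnull().sum()

# Fields that contain at least one null value
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

# Split the fields into continuous (numeric) and categorical (everything else)
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print("Categorical:", categorical_cols)
print("Continuous:", continuous_cols)
```

On the real data, df would instead be loaded from the spreadsheet (for example with pandas.read_excel), and df.info() prints the non-null counts in the same form as the list above.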

Data Preprocessing

We applied the following preprocessing steps to make the data suitable for producing the required results.

Null value Handling

Two types of null values are present:

Trailing null values - These are removed directly from the dataset.
Null values in particular fields - These are handled individually, based on the properties of each field.
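
The two cases can be sketched as follows. Dropping the all-empty trailing rows follows the description above; filling the remaining gaps with the column median is only an assumption made for illustration, since the exact per-field rule is not stated here. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the last two rows are entirely empty (trailing nulls)
df = pd.DataFrame({
    "population2011": [120000.0, np.nan, 98000.0, np.nan, np.nan],
    "median_age": [27.0, 31.0, np.nan, np.nan, np.nan],
    "covid_cases": [310.0, 120.0, 95.0, np.nan, np.nan],
})

# Case 1: remove rows in which every field is null
df = df.dropna(how="all").reset_index(drop=True)

# Case 2 (assumption): fill the remaining nulls with the median of each column
df = df.fillna(df.median(numeric_only=True))

print(df)
```

Other fill strategies (mean, mode, or a value derived from a related field) fit the same pattern by changing the argument passed to fillna.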

Converting Categorical values

The categorical values of the swm and city_type fields need to be converted to numerical values.
This is done by building a dictionary for each field and then mapping it onto the dataset.
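
A minimal sketch of the dictionary mapping, using pandas Series.map. The category names and the codes assigned to them are hypothetical; the real swm field has 3 categories and city_type has 21.

```python
import pandas as pd

# Hypothetical category values, for illustration only
df = pd.DataFrame({
    "swm": ["HIGH", "LOW", "MEDIUM", "LOW"],
    "city_type": ["Town", "Metro", "Town", "Village"],
})

# Small field: write the dictionary by hand
swm_mapping = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
df["swm"] = df["swm"].map(swm_mapping)

# Larger field: build the dictionary from the sorted unique values
city_type_mapping = {
    name: code for code, name in enumerate(sorted(df["city_type"].unique()))
}
df["city_type"] = df["city_type"].map(city_type_mapping)

print(df)
```

Any value missing from a dictionary becomes NaN after map, so it is worth checking for new nulls after this step.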

Correlation Matrix comparison before the preprocessing and after the preprocessing
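
One way such a comparison could be produced is with pandas DataFrame.corr, computed once on the raw frame and once after the categorical fields are encoded. This is a sketch on made-up values with a hypothetical swm encoding, not the figures from the notebooks.

```python
import pandas as pd

# Hypothetical raw data: swm is still a text field here
raw = pd.DataFrame({
    "swm": ["HIGH", "LOW", "MEDIUM", "LOW", "HIGH"],
    "hospitals": [12.0, 3.0, 7.0, 2.0, 15.0],
    "covid_cases": [310.0, 95.0, 180.0, 60.0, 400.0],
})

# Before preprocessing: only numeric fields can enter the correlation matrix
corr_before = raw.corr(numeric_only=True)

# After preprocessing: swm is encoded, so it appears in the matrix as well
processed = raw.assign(swm=raw["swm"].map({"LOW": 0, "MEDIUM": 1, "HIGH": 2}))
corr_after = processed.corr()

print(corr_before)
print(corr_after)
```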

Population Growth Rate Computation

Aim - To predict the population growth rate of any given city over any consecutive year, month or day, using a mathematical and statistical approach.

Technique Used and Reason:

The technique is a formula-based approach that determines the population at a given time period. It also borrows from the annuity technique, which financial institutions use to assess the amount a person has to pay the institution after a given period of time. Part of the annuity method was used for the following reasons:

This method produces promising results, typically within plus or minus 0.5 to 3 % of the actual values.
The values produced can be checked at regular intervals, and the method can be applied to both long periods (years) and short periods (days).
Many aspects of this technique can be tailored for a variety of purposes.
The end results can closely match the actual population rate, as most analyses use this formula to assess the counts in bigger cities such as Chennai, Bangalore and Mumbai.

Other techniques Considered:

Directly predicting the spread of Covid-19 using growth ratio formulas and doubling rate formulas was also considered. The main issue is that the growth rate formula can only be applied to two consecutive values, such as the first day and the next day. In this dataset, case counts for consecutive dates are not given, so this method cannot be used here.
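
For reference, the growth ratio and doubling time for two consecutive daily counts can be sketched as below. The function names and the example counts are hypothetical, and the doubling-time expression assumes a constant day-over-day ratio greater than 1.

```python
import math


def growth_ratio(cases_previous_day, cases_current_day):
    """Ratio of today's case count to yesterday's case count."""
    return cases_current_day / cases_previous_day


def doubling_time_days(ratio):
    """Days needed for cases to double at a constant daily growth ratio (> 1)."""
    return math.log(2) / math.log(ratio)


# Made-up counts for two consecutive days
ratio = growth_ratio(100, 110)
days_to_double = doubling_time_days(ratio)
print(ratio, days_to_double)
```

Both functions need counts for at least two consecutive days per city, which is exactly the input the dataset does not provide.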

Formulas Used and Explanation:

To calculate the growth rate over a period of time (2001 – 2011) (in %)

Formula Used:
- Population Rate over 10 Years = ((Population of 2011 - Population of 2001) / Population of 2001) * 100

To calculate the growth rate over a period of 1 year (in %)

Formula Used:
- Population Rate of 1 year = Population Rate over 10 Years / 10

To calculate the growth rate for 1 month (in %)

Formula Used:
- Population Rate for 1 month = Population Rate of 1 year / 12
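
A minimal sketch of the three growth-rate calculations, taking the ten-year rate as the percentage change (Population of 2011 - Population of 2001) / Population of 2001 * 100. The function names and the example populations are hypothetical.

```python
def growth_rate_10_years(population_2001, population_2011):
    """Percentage change in population between 2001 and 2011."""
    return (population_2011 - population_2001) / population_2001 * 100


def growth_rate_1_year(rate_10_years):
    """Average yearly rate: the ten-year rate spread evenly over 10 years."""
    return rate_10_years / 10


def growth_rate_1_month(rate_1_year):
    """Average monthly rate: the yearly rate spread evenly over 12 months."""
    return rate_1_year / 12


# Made-up example: a city growing from 100,000 to 120,000 people
rate_10y = growth_rate_10_years(100_000, 120_000)
rate_1y = growth_rate_1_year(rate_10y)
rate_1m = growth_rate_1_month(rate_1y)
print(rate_10y, rate_1y, rate_1m)
```

For this example the ten-year rate is about 20 %, the yearly rate about 2 % and the monthly rate about 0.17 %. Dividing by 10 and by 12 gives simple linear averages; a compounded rate would be slightly lower.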

Hypothesis

1) Death Rate Hypothesis

Using the relation between the female and male population and the covid cases, we can arguably derive a death rate as a ratio of male to female. This hypothesis could provide a factor that minimizes the difference between the population and the covid cases, and that factor could show up in the correlation matrix.

But this hypothesis fails because:

It depends on other factors that are not present in the data, such as the death rate due to causes other than covid.
The disease has not been around long enough to provide more data on it.
Using categorical data to find a growth rate would give more accurate results.

2) Monthly Covid cases Hypothesis

The monthly covid cases were computed from the train data, and the resulting feature was highly correlated with the train data. However, the feature could not be used on the test data, because computing it requires the covid cases of the cities in the test data, and the test data contains different cities than the train data.

Regression Models

COVID cases prediction PART 01

Linear Regression: Linear regression is used as the starting point to model the relationship between the independent and dependent attributes by fitting a linear equation. Because the dataset has a large number of independent attributes, the model does not fit well.

Decision Tree: The training dataset is multi-dimensional. The decision tree predicts covid cases by splitting on multiple attributes across its branches. It can be relatively inaccurate and unstable when additional data is given for prediction.

Random Forest: The random forest aims to overcome the issues of the decision tree algorithm and obtain more accurate results. Because it is an ensemble learning model built from a multitude of decision trees, it achieves effective results, at the cost of higher complexity.

AdaBoost Regressor: The AdaBoost regressor is chosen to improve accuracy by adjusting the weights iteratively. Overfitting is the major issue in this case.

Gradient Boosting Model: The gradient boosting algorithm is chosen to improve accuracy, because it builds a powerful ensemble out of weaker learners.

MLP Regression: The neural network approach is implemented with a Multi-Layer Perceptron. Regularisation is applied to avoid overfitting.
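
The model families above can be compared side by side with scikit-learn. This is a sketch on synthetic data, not the configuration used in Flipr_Model_Predictions.ipynb: the hyperparameters, the synthetic target and the choice of mean absolute error are all assumptions, and RandomForestRegressor is used because the target is a continuous case count.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed features and the covid_cases target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical settings; alpha is the L2 regularisation strength of the MLP
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "MLP": MLPRegressor(
        hidden_layer_sizes=(32,), alpha=1e-3, max_iter=2000, random_state=0
    ),
}

# Fit each model and record its mean absolute error on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:.3f}")
```

The ranking on this synthetic, purely linear target says nothing about the ranking on the hackathon data; the sketch only shows how the models can be evaluated on a common split.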

J-Kiruthika/Prediction-of-Covid-cases
