Assignment 2019B-3: Mining over datasets

In this exercise you are asked to apply data mining practices to existing datasets. You need to provide:

one report, containing individual sections per dataset, which will describe the method, tools and results of the analysis, based on the (per-dataset) questions and requirements, elaborated below. Each group will be assigned
one (link to a) file which will contain all needed information to reproduce the analyses.
one presentation, summarizing the work so that it can be presented in 10 minutes.

The report should end with a brief mention (in bullet-points) of who in the group worked on what part of the analyses (and the corresponding time taken).

My evaluation will take into account:

the clarity of writing and coherence of the report
how completely the parameters of the applied method(s) are reported
the explanation of why the selected methods were appropriate
the reproducibility of the process

Dataset 1 link

and especially the file "Στατιστικά στοιχεία εγκληματικότητας 2016" (Crime statistics 2016)

Tasks

Cluster the types of crimes based on how successful the police are at tackling/solving them.
Cluster the types of crimes and explain what each cluster represents.
Identify outliers in crime types and explain what they represent/why they are outliers.
Try to predict the super-category (e.g. ΕΠΙΚΡΑΤΕΙΑ/ΚΛΟΠΕΣ-ΔΙΑΡΡΗΞΕΙΣ, ...) of a record given only its numeric fields: τελ/να (committed), απόπειρες (attempts), εξιχνιάσεις (solved cases), ημεδαποί (nationals), αλλοδαποί (foreigners). Explain the main factors behind the decision and report the performance under cross-validation.
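As a starting point for the first clustering task, "success of police" could be expressed as a clearance rate per crime type and then clustered. The sketch below is only one possible reading: it assumes the rate is solved cases divided by committed plus attempted cases, uses a plain one-dimensional k-means written from scratch, and runs on invented toy numbers (the crime names and counts are hypothetical, not taken from the dataset).

```python
from statistics import mean

def clearance_rate(committed, attempted, solved):
    """Assumed definition: solved / (committed + attempted)."""
    total = committed + attempted
    return solved / total if total else 0.0

def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on scalars (k >= 2); returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread centroids over the sorted values.
    centroids = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy records: (crime type, committed, attempted, solved).
toy_records = [
    ("crime_a", 100, 10, 99),
    ("crime_b", 200, 0, 170),
    ("crime_c", 1000, 100, 110),
    ("crime_d", 500, 0, 75),
    ("crime_e", 50, 0, 30),
]
rates = [clearance_rate(c, a, s) for _, c, a, s in toy_records]
centroids, labels = kmeans_1d(rates, k=2)
for (name, *_), rate, label in zip(toy_records, rates, labels):
    print(f"{name}: clearance={rate:.2f} cluster={label}")
```

Whatever definition and algorithm are finally chosen, the report should state them together with the number of clusters and the initialisation, since these are the parameters needed to reproduce the result.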

Dataset 2 link

Tasks

Provide an overview of the dataset size, features, and distribution of feature values.
Select a random subset of 1000 instances from the dataset. Then identify the 20 features that are least useful in predicting the class and report them.
After removing these 20 features, search for the top 3 association rules with support of at least 0.1 and confidence of at least 0.5 and report them.
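The support and confidence thresholds in the last task can be made concrete with a small sketch. It assumes each instance has been turned into a set of "feature=value" items, and it only enumerates rules with a single item on each side; a full miner such as Apriori would also consider larger itemsets. The item names and transactions are invented for illustration.

```python
from itertools import permutations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def pairwise_rules(transactions, min_support=0.1, min_confidence=0.5):
    """Return rules (antecedent, consequent, support, confidence), best first."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        joint = support({a, b}, transactions)
        if joint < min_support:
            continue
        # confidence(a -> b) = support(a and b) / support(a)
        confidence = joint / support({a}, transactions)
        if confidence >= min_confidence:
            rules.append((a, b, joint, confidence))
    # Rank by confidence first, then by support.
    rules.sort(key=lambda r: (r[3], r[2]), reverse=True)
    return rules

# Hypothetical toy transactions.
toy_transactions = [
    frozenset({"f1=high", "f2=high"}),
    frozenset({"f1=high", "f2=high", "class=pos"}),
    frozenset({"f1=high"}),
    frozenset({"class=pos"}),
]
rules = pairwise_rules(toy_transactions, min_support=0.1, min_confidence=0.5)
for a, b, s, c in rules[:3]:
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

Ranking by confidence and then support is a choice made here for the sketch; the task does not say how "top 3" should be ordered, so the report should state the criterion used.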

Dataset 3 link

Airlines dataset, inspired by the regression dataset from Elena Ikonomovska.

Provide an overview of the dataset size, features, and distribution of feature values.
Describe the average delays per airport/airline.
Identify and report the most prominent rules of association between delays and point of origin AND/OR point of arrival.
Try to predict the delay given all other features and report the appropriate performance on cross-validation.
Identify patterns/rules regarding delays and try to explain when delays should be expected, based on these patterns.
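For the average-delay and prediction tasks, a simple baseline helps put later results in context. The sketch below computes the mean delay per group and then evaluates, with k-fold cross-validation, a baseline that predicts the training-fold mean of each group. The column names `airline` and `delay`, the toy records, and the choice of mean absolute error are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_delay_by(records, key):
    """Average the 'delay' field per distinct value of `key`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row["delay"])
    return {group: mean(delays) for group, delays in groups.items()}

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds indices with i == index % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def cv_mae_group_mean(records, key, k):
    """Cross-validated mean absolute error of the group-mean baseline."""
    fold_errors = []
    for train, test in k_fold_indices(len(records), k):
        train_rows = [records[i] for i in train]
        means = mean_delay_by(train_rows, key)
        fallback = mean(r["delay"] for r in train_rows)  # for unseen groups
        fold_errors.append(mean(
            abs(records[i]["delay"] - means.get(records[i][key], fallback))
            for i in test
        ))
    return mean(fold_errors)

# Hypothetical toy records.
toy_flights = [
    {"airline": "AA", "delay": 10},
    {"airline": "BB", "delay": 30},
    {"airline": "AA", "delay": 20},
    {"airline": "BB", "delay": 50},
    {"airline": "AA", "delay": 30},
    {"airline": "BB", "delay": 40},
]
averages = mean_delay_by(toy_flights, "airline")
baseline_mae = cv_mae_group_mean(toy_flights, "airline", k=3)
print(averages, baseline_mae)
```

The folds here are assigned by index without shuffling, which keeps the sketch deterministic; on real data, shuffled or stratified folds with a fixed seed would usually be preferable, and the seed should be reported for reproducibility.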

Dataset 4 link

Description: link

Tasks

Summarize the data to help understand the overall picture of religious groups over the US.
Which counties have the highest ratio of Orthodox Christian members per capita?
Can you find the 3 most extreme (outlier) counties with respect to the distribution of their churches across religions?
Where would you establish an inter-religious discussion centre to maximize its impact? Support the proposal with data analysis results.
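Two of the tasks above reduce to simple per-county computations, sketched here under stated assumptions. The per-capita ratio is taken as members divided by county population, and "extreme" counties are taken to be those whose share of churches per religion is farthest, in total variation distance, from the overall share across all counties. Both definitions, the field names, and the toy numbers are hypothetical choices for illustration; other outlier criteria are equally defensible.

```python
from collections import Counter

def top_counties_by_ratio(rows, n):
    """Rank counties by members / population, highest first."""
    def ratio(row):
        return row["members"] / row["population"] if row["population"] else 0.0
    return [row["county"] for row in sorted(rows, key=ratio, reverse=True)[:n]]

def shares(counts):
    """Convert raw counts per religion into fractions that sum to 1."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def most_atypical(church_counts, n):
    """Counties whose church distribution differs most from the overall one."""
    overall = Counter()
    for counts in church_counts.values():
        overall.update(counts)
    overall_shares = shares(overall)
    distance = {
        county: total_variation(shares(counts), overall_shares)
        for county, counts in church_counts.items()
    }
    return sorted(distance, key=distance.get, reverse=True)[:n]

# Hypothetical toy data.
toy_members = [
    {"county": "A", "members": 10, "population": 1000},
    {"county": "B", "members": 50, "population": 100000},
    {"county": "C", "members": 30, "population": 600},
]
toy_churches = {
    "A": {"x": 5, "y": 5},
    "B": {"x": 6, "y": 4},
    "C": {"x": 0, "y": 10},
}
top_ratio = top_counties_by_ratio(toy_members, n=2)
extreme = most_atypical(toy_churches, n=1)
print(top_ratio, extreme)
```

Note that very small counties can dominate both rankings, so a minimum population or church count may be worth applying and documenting.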
