Assignment 2019B-3: Mining over datasets

In this exercise you are asked to apply data mining practices to existing datasets. You need to provide:

  • one report, with a separate section per dataset, describing the method, tools, and results of the analysis in response to the per-dataset questions and requirements detailed below. Each group will be assigned

  • one (link to a) file which will contain all needed information to reproduce the analyses.

  • one presentation summarizing the work, suitable for a 10-minute talk.

The report should end with a brief bullet-point list of who in the group worked on which part of the analyses, and how much time each part took.

My evaluation will take into account:

  • the clarity of writing and coherence of the report

  • the completeness of the reported parameters for the method(s) used

  • the explanation of why the selected methods were appropriate

  • the reproducibility of the process

Dataset 1 link

In particular, use the file "Στατιστικά στοιχεία εγκληματικότητας 2016" ("Crime statistics 2016").

Tasks

  • Cluster the types of crimes based on how successful the police are at handling/solving them.
  • Cluster the types of crimes and explain what each cluster represents.
  • Identify outliers in crime types and explain what they represent/why they are outliers.
  • Try to predict the super-category (e.g. ΕΠΙΚΡΑΤΕΙΑ/ΚΛΟΠΕΣ-ΔΙΑΡΡΗΞΕΙΣ, i.e. nationwide/thefts-burglaries, ...) of a record given only its numeric fields: τελ/να (completed offences), απόπειρες (attempts), εξιχνιάσεις (solved cases), ημεδαποί (nationals), αλλοδαποί (foreigners). Explain the main factors behind the decision and report the performance under cross-validation.
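
As a starting point for the first and third tasks, the sketch below derives a clearance rate per crime type and splits the types into two groups with a plain one-dimensional k-means. It is only an illustration under stated assumptions: the records are invented toy values, the names `clearance_rate` and `kmeans_1d` are hypothetical, and treating "solved / (completed + attempts)" as the measure of police success is one possible reading of the task, not a prescribed definition.

```python
# Hypothetical toy records: (crime type, completed, attempted, solved).
# The numbers are invented for illustration, not taken from the dataset.
records = [
    ("type_a", 100, 20, 90),
    ("type_b", 400, 100, 50),
    ("type_c", 50, 0, 45),
    ("type_d", 1000, 200, 120),
    ("type_e", 80, 20, 85),
]


def clearance_rate(completed, attempted, solved):
    """Share of recorded cases (completed + attempted) that were solved."""
    total = completed + attempted
    return solved / total if total else 0.0


def kmeans_1d(values, k=2, iterations=20):
    """Lloyd's algorithm on scalars; assumes k >= 2 and len(values) >= k."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted range.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the previous center if a group ends up empty.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


rates = {name: clearance_rate(c, a, s) for name, c, a, s in records}
centers = kmeans_1d(list(rates.values()), k=2)

# Assign each crime type to the index of its nearest center.
labels = {
    name: min(range(len(centers)), key=lambda j: abs(rate - centers[j]))
    for name, rate in rates.items()
}
print(labels)
```

On real data the number of clusters, the feature set, and the distance measure are all choices that the report should justify and state explicitly.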

Dataset 2 link

Tasks

  • Provide an overview of the dataset size, features, and distribution of feature values.
  • Select a random subset of 1000 instances from the dataset. Then identify the 20 features that are least useful in predicting the class and report them.
  • Having removed the useless features, search for the top 3 associations with support of at least 0.1 and confidence of at least 0.5 and report them.
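
To make the support and confidence thresholds in the last task concrete, here is a minimal sketch that enumerates single-item rules X -> Y over a handful of transactions and keeps those with support of at least 0.1 and confidence of at least 0.5. The transactions and the helper names `support` and `pair_rules` are hypothetical; a real analysis would typically use a dedicated Apriori or FP-Growth implementation and also consider larger itemsets.

```python
from itertools import permutations

# Hypothetical toy transactions: each set stands for the items
# (e.g. feature=value pairs) present in one instance.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "d"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)


def pair_rules(rows, min_support=0.1, min_confidence=0.5):
    """Enumerate single-item rules X -> Y that pass both thresholds."""
    items = sorted(set().union(*rows))
    rules = []
    for x, y in permutations(items, 2):
        s = support({x, y}, rows)
        if s < min_support:
            continue
        # confidence(X -> Y) = support(X and Y) / support(X)
        conf = s / support({x}, rows)
        if conf >= min_confidence:
            rules.append((x, y, s, conf))
    # Rank by confidence, then by support, highest first.
    return sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)


top3 = pair_rules(transactions)[:3]
for x, y, s, conf in top3:
    print(f"{x} -> {y}: support={s:.2f}, confidence={conf:.2f}")
```

Ranking by confidence first is one possible ordering; the report should state which measure was used to decide what counts as a "top" association.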

Dataset 3 link

Airlines dataset, inspired by the regression dataset from Elena Ikonomovska.

Tasks

  • Provide an overview of the dataset size, features, and distribution of feature values.
  • Describe the average delays per airport/airline.
  • Identify and report the most prominent rules of association between delays and point of origin AND/OR point of arrival.
  • Try to predict the delay given all other features and report an appropriate performance measure under cross-validation.
  • Identify patterns/rules regarding delays and try to explain when delays should be expected, based on these patterns.
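
The per-airline averages and the cross-validated evaluation can be sketched as follows. This is a baseline only, under stated assumptions: the flights are invented toy rows, the column layout and the names `mean_delay_by` and `cross_validated_mae` are hypothetical, and the "model" simply predicts each airline's mean training delay. Any real predictor should at least beat such a baseline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy flights: (airline, origin airport, delay in minutes).
flights = [
    ("AA", "ATL", 10.0), ("AA", "ATL", 20.0), ("AA", "JFK", 30.0),
    ("BB", "JFK", 0.0), ("BB", "ATL", 5.0), ("BB", "JFK", 10.0),
]


def mean_delay_by(rows, key_index):
    """Average the delay (last field) per value of the chosen column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[-1])
    return {key: mean(vals) for key, vals in groups.items()}


def cross_validated_mae(rows, k=3):
    """k-fold CV of a baseline predicting each airline's mean training delay."""
    fold_errors = []
    for fold in range(k):
        # Interleaved folds: row i goes to fold i % k.
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        per_airline = mean_delay_by(train, 0)
        # Airlines unseen in training fall back to the overall training mean.
        fallback = mean(r[-1] for r in train)
        errors = [abs(r[-1] - per_airline.get(r[0], fallback)) for r in test]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


print(mean_delay_by(flights, 0))   # average delay per airline
print(mean_delay_by(flights, 1))   # average delay per origin airport
print(cross_validated_mae(flights, k=3))
```

The interleaved split assumes the rows are in no meaningful order; with real data, shuffle with a fixed seed first so the evaluation stays reproducible.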

Dataset 4 link

Description: link

Tasks

  • Summarize the data to help understand the overall picture of religious groups over the US.
  • Which counties have the highest number of Orthodox Christian members per capita?
  • Can you find the 3 most extreme (outlier) counties with respect to the distribution of their churches across religions?
  • Where would you set up an interfaith discussion centre to maximize its impact? Support your proposal with data analysis results.
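
For the per-capita and outlier questions, one possible approach is sketched below: rank counties by members per inhabitant, then measure how far each county's distribution across groups lies from the pooled distribution, using total variation distance. Everything here is an assumption for illustration: the county names, populations, and counts are invented, the function names are hypothetical, and total variation distance is just one of several reasonable ways to define "extreme". The same distance applies unchanged to church counts instead of member counts.

```python
# Hypothetical toy counties: name -> (population, {religious group: count}).
counties = {
    "county_x": (1000, {"orthodox": 50, "catholic": 300, "baptist": 200}),
    "county_y": (2000, {"orthodox": 20, "catholic": 500, "baptist": 480}),
    "county_z": (500, {"orthodox": 5, "catholic": 10, "baptist": 400}),
}


def per_capita(data, group):
    """Counties sorted by count of `group` per inhabitant, highest first."""
    ratios = {name: counts.get(group, 0) / pop
              for name, (pop, counts) in data.items()}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


def shares(counts):
    """Normalize raw counts into a distribution that sums to 1."""
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


def distance_from_pooled(data):
    """Total variation distance of each county from the pooled distribution."""
    pooled = {}
    for _, counts in data.values():
        for g, n in counts.items():
            pooled[g] = pooled.get(g, 0) + n
    pooled_shares = shares(pooled)
    result = {}
    for name, (_, counts) in data.items():
        local = shares(counts)
        result[name] = 0.5 * sum(abs(local.get(g, 0.0) - p)
                                 for g, p in pooled_shares.items())
    return result


ranking = per_capita(counties, "orthodox")
distances = distance_from_pooled(counties)
most_extreme = max(distances, key=distances.get)
print(ranking[0], most_extreme)
```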