Usecase clustering #90

Open
wants to merge 8 commits into
base: gh-pages

juliambr
Collaborator

I wrote a use case for clustering. Please review it :)

An overview over all learners can be found [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html). You can also call the \texttt{listLearners} command for our specific task.
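
For reference, a minimal sketch of the `listLearners` call mentioned above, assuming the mlr package is installed. Passing the task type as a string is one way to restrict the list to clustering learners; the variable name `cluster.lrns` is only illustrative and does not come from the PR.

```r
library(mlr)

# List the learners available for clustering tasks.
# warn.missing.packages = FALSE silences the note about learners whose
# backing packages are not installed locally.
cluster.lrns = listLearners("cluster", warn.missing.packages = FALSE)

# The result is a data.frame; the class column holds the learner ids.
head(cluster.lrns$class)
```

The ids in the `class` column are what gets passed to `makeLearner`.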


```{r, warning=FALSE, eval = FALSE}

Collaborator

You can leave out warning=FALSE: Travis has all packages installed, so no warning will be produced.

Contributor

@schiffner left a comment

Hi, looks very good already.
I just skimmed over it very quickly and commented on some technical stuff.

set.seed(1234)
```

This is a use case for clustering with the [%mlr] package. We consider the [agriculture](https://www.rdocumentation.org/packages/cluster/versions/1.10.0/topics/agriculture) dataset that contains observations about $n=12$ countries including

Contributor

@schiffner Mar 10, 2017

Please use [agriculture](&cluster::agriculture).
(The build script for the tutorial will expand this to the correct link.)
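
A quick way to check the dataset description is to load and inspect the data. This is a minimal sketch that assumes only the `cluster` package, which ships with standard R distributions as a recommended package; it is not part of the PR.

```r
# Load the dataset without attaching the whole cluster package.
data(agriculture, package = "cluster")

# One row per country, with the two numeric columns x and y
# that the use case later plots against each other.
str(agriculture)
dim(agriculture)
```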


```{r, fig.width = 5}
library("cluster")
data(agriculture)

Contributor

Please use data(agriculture, package = "cluster")


So let's have a look at the data first.

```{r, fig.width = 5}

Contributor

Please specify the aspect ratio (fig.asp) instead of the fig.width.
(This works better for the pdf version of the tutorial.)


* define the learning task ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/task/index.html)),
* select a learning method ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/learner/index.html)),
* train the learner with data ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)),

Contributor

I think "train the learner" is sufficient.

We now have to define a clustering task. Notice that a clustering task doesn't have a target variable.

```{r message = FALSE}
library(mlr)

Contributor

You don't need library(mlr), so you can also leave out the message = FALSE option.


Tuning will address the question of choosing the best hyperparameters for our problem.

We first create a search space for the number of clusters $k$, e. g. $k \in \lbrace 2, 3, 4, 5 \rbrace$. Further we define an optimization algorithm and a [resampling strategy](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html).

Contributor

As above, you need to link to resample.md.


Finally, by combining all the previous pieces, we can tune the parameter $k$ by calling \texttt{tuneParams}. We will use discrete_ps with grid search and the silhouette coefficient as optimization criterion:

Contributor

  • [&tuneParams]
  • I would also mention 3-fold cross-validation.

discrete_ps = makeParamSet(makeDiscreteParam("centers", values = c(2, 3, 4, 5)))
ctrl = makeTuneControlGrid()
res = tuneParams(cluster.lrn, agri.task, measures = silhouette, resampling = cv3,
par.set = discrete_ps, control = ctrl)

Contributor

Could you please indent code by 2 spaces?


This is our final clustering for our problem.

```{r, fig.width= 5}

Contributor

Please use fig.asp.

```{r, fig.width= 5}
plot(y ~ x, col = tuned.pred$data$response, data = agriculture)

Contributor

Could you please use the getter function (I think getPredictionResponse should work here)?

head(data)
```

Our aim - as mentioned before - is to predict which kind of people would have survided.

Collaborator

Typo: "survided" should be "survived".


#### Preprocessing

The data set is corrected regarding their data types.

Collaborator

I would do str(data) to show the different types, then mention how and why they need to be corrected.
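
A minimal sketch of what this suggestion could look like. The use case's actual `data` object is not shown in this thread, so the data frame below is a hypothetical stand-in, and its column names and values are assumptions made purely for illustration.

```r
# Hypothetical stand-in for the use case's data; columns are assumptions.
data = data.frame(
  survived = c(0, 1, 1, 0),
  class = c(3, 1, 3, 2),
  sex = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Show the types as read in: the categorical columns are numeric/character.
str(data)

# Convert the categorical columns to factors so that learners treat them
# as categories rather than as numbers or free text.
data$survived = factor(data$survived)
data$class = factor(data$class)
data$sex = factor(data$sex)

# Check the corrected types.
str(data)
```

Showing `str(data)` before and after the conversion makes it visible which columns changed and why the correction was needed.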

Collaborator

pat-s commented Jun 21, 2018

@juliambr Do you still have motivation to finish this up here? Would be great! 🎉

5 participants