-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
87 lines (68 loc) · 2.87 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
output: github_document
bibliography: vignettes/bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.path="man/figures/")
```
# dobin <img src='man/figures/logo.png' align="right" height="132.5" />
<!-- badges: start -->
[![R-CMD-check](https://github.com/sevvandi/dobin/workflows/R-CMD-check/badge.svg)](https://github.com/sevvandi/dobin/actions)
<!-- badges: end -->
[![](https://cranlogs.r-pkg.org/badges/dobin)](https://cran.r-project.org/package=dobin)
The R package **dobin** constructs a set of basis vectors tailored for outlier detection as described in [@dobin]. According to Collins English dictionary, "dob in" is an informal verb meaning to inform against specially to the police. Naming credits goes to Rob Hyndman.
## Installation
You can install dobin from CRAN:
```{r ins1, eval=FALSE}
install.packages("dobin")
```
Or you can install the development version from [GitHub](https://github.com/sevvandi/dobin)
```{r ins2, eval=FALSE}
install.packages("devtools")
devtools::install_github("sevvandi/dobin")
```
## Example
A bimodal distribution in six dimensions, with 5 outliers in the middle. We consider 805 observations in six dimensions. Of these 805 observations, 800 observations are non-outliers; 400 observations are centred at $(5,0,0,0,0,0)$ and the other 400 centred at $(-5,0,0,0,0,0)$. The outlier distribution consists of $5$ points with mean $(0,0,0,0,0,0)$ and standard deviations $0.2$ in the first dimension and are similar to other observations in other dimensions.
```{r bimodal}
library(dobin)
library(ggplot2)
set.seed(1)
# A bimodal distribution in six dimensions, with 5 outliers in the middle.
X <- data.frame(
x1 = c(rnorm(400,mean=5), rnorm(5, mean=0, sd=0.2), rnorm(400, mean=-5)),
x2 = rnorm(805),
x3 = rnorm(805),
x4 = rnorm(805),
x5 = rnorm(805),
x6 = rnorm(805)
)
labs <- c(rep("Norm",400), rep("Out",5), rep("Norm",400))
out <- dobin(X)
autoplot(out)
```
To see the outliers in a different colour we plot again.
```{r plotting}
XX <- cbind.data.frame(out$coords[ ,1:2], as.factor(labs))
colnames(XX) <- c("DB1", "DB2", "labs" )
ggplot(XX, aes(DB1, DB2, color=labs)) + geom_point() + theme_bw()
```
To compare, we perform PCA on the same dataset. The first two principal components are shown in the figure below:
```{r bimodal_pca}
set.seed(1)
# A bimodal distribution in six dimensions, with 5 outliers in the middle.
X <- data.frame(
x1 = c(rnorm(400,mean=5), rnorm(5, mean=0, sd=0.2), rnorm(400, mean=-5)),
x2 = rnorm(805),
x3 = rnorm(805),
x4 = rnorm(805),
x5 = rnorm(805),
x6 = rnorm(805)
)
labs <- c(rep("Norm",400), rep("Out",5), rep("Norm",400))
out <- prcomp(X, scale=TRUE)
XX <- cbind.data.frame(out$x[ ,1:2], as.factor(labs))
colnames(XX) <- c("PC1", "PC2", "labs" )
ggplot(XX, aes(PC1, PC2, color=labs)) + geom_point() + theme_bw()
```
# References