-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.qmd
193 lines (145 loc) · 8.21 KB
/
README.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
---
format: gfm
---
<!-- README.md is generated from README.qmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(width = 100)
```
# weird <img src="man/figures/weird-hex.png" align="right" width = 150 />
<!-- badges: start -->
[![R build status](https://github.com/robjhyndman/weird-package/workflows/R-CMD-check/badge.svg)](https://github.com/robjhyndman/weird-package/actions)
<!-- badges: end -->
## Overview
The weird package contains functions and data used in the book [*That's Weird: Anomaly Detection Using R*](https://OTexts.com/weird/) by Rob J Hyndman. It also loads several packages needed to do the analysis described in the book.
## Installation
You can install the **stable** version from [CRAN](https://cran.r-project.org/package=weird) with:
``` r
install.packages("weird")
```
You can install the **development** version of weird from [GitHub](https://github.com/robjhyndman/weird-package) with:
``` r
# install.packages("devtools")
devtools::install_github("robjhyndman/weird-package")
```
## Usage
`library(weird)` will load the following packages:
* [dplyr](https://dplyr.tidyverse.org), for data manipulation.
* [ggplot2](https://ggplot2.tidyverse.org), for data visualisation.
* [distributional](https://cran.r-project.org/package=distributional), for handling probability distributions.
You also get a condensed summary of conflicts with other packages you have loaded:
```{r usage}
library(weird)
```
## Example: Old Faithful Geyser data
The `oldfaithful` data set contains eruption data from the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA, from 1 January 2015 to 1 October 2021. The data were obtained from the [geysertimes.org](https://geysertimes.org) website. Recordings are incomplete, especially during the winter months when observers may not be present. There also appear to be some recording errors. The data set contains `r NROW(oldfaithful)` observations of 3 variables: `time` giving the time at which each eruption began, `duration` giving the length of the eruption in seconds, and `waiting` giving the time to the next eruption in seconds. In the analysis below, we omit the eruption with `duration` greater than 1 hour as this is likely to be a recording error. Some of the long `waiting` values are probably due to omitted eruptions, and so we also omit eruptions with `waiting` greater than 2 hours.
```{r oldfaithful}
oldfaithful
```
## Kernel density estimates
The package provides the `kde_bandwidth()` function for estimating the bandwidth of a kernel density estimate, `dist_kde()` for constructing the distribution, and `gg_density()` for plotting the resulting density. The figure below shows the kernel density estimate of the `duration` variable obtained using these functions.
```{r of-density}
of <- oldfaithful |>
filter(duration < 3600, waiting < 7200)
dist_kde(of$duration) |>
gg_density(show_points = TRUE, jitter = TRUE) +
labs(x = "Duration (seconds)")
```
The same functions also work with bivariate data. The figure below shows the kernel density estimate of the `duration` and `waiting` variables.
```{r of-density2}
of |>
select(duration, waiting) |>
dist_kde() |>
gg_density(show_points = TRUE, alpha = 0.15) +
labs(x = "Duration (seconds)", y = "Waiting time (seconds)")
```
## Statistical tests
Some old methods of anomaly detection used statistical tests. While these are not recommended, they are still widely used, and are provided in the package for comparison purposes.
```{r of-test}
of |> filter(peirce_anomalies(duration))
of |> filter(chauvenet_anomalies(duration))
of |> filter(grubbs_anomalies(duration))
of |> filter(dixon_anomalies(duration))
```
In this example, they only detect the tiny 1-second duration, which is almost certainly a recording error. An explanation of these tests is provided in [Chapter 4 of the book](https://otexts.com/weird/04-tests.html)
## Boxplots
Boxplots are widely used for anomaly detection. Here are three variations of boxplots applied to the `duration` variable.
```{r of-boxplot}
#| fig-height: 1.5
of |>
ggplot(aes(x = duration)) +
geom_boxplot() +
scale_y_discrete() +
labs(y = "", x = "Duration (seconds)")
of |> gg_hdrboxplot(duration) +
labs(x = "Duration (seconds)")
of |> gg_hdrboxplot(duration, show_points = TRUE) +
labs(x = "Duration (seconds)")
```
The latter two plots are highest density region (HDR) boxplots, which allow the bimodality of the data to be seen. The dark shaded region contains 50% of the observations, while the lighter shaded region contains 99% of the observations. The plots use vertical jittering to reduce overplotting, and highlight potential outliers (those points lying outside the 99% HDR). An explanation of these plots is provided in [Chapter 5 of the book](https://otexts.com/weird/05-boxplots.html).
It is also possible to produce bivariate boxplots. Several variations are provided in the package. Here are two types of bagplot.
```{r of-boxplot2}
of |>
gg_bagplot(duration, waiting) +
labs(x = "Duration (seconds)", y = "Waiting time (seconds)")
of |>
gg_bagplot(duration, waiting, scatterplot = TRUE) +
labs(x = "Duration (seconds)", y = "Waiting time (seconds)")
```
And here are two types of HDR boxplot
```{r of-boxplot3}
of |>
gg_hdrboxplot(duration, waiting) +
labs(x = "Duration (seconds)", y = "Waiting time (seconds)")
of |>
gg_hdrboxplot(duration, waiting, scatterplot = TRUE) +
labs(x = "Duration (seconds)", y = "Waiting time (seconds)")
```
The latter two plots show possible outliers in black (again, defined as points outside the 99% HDR).
## Scoring functions
Several functions are provided for providing anomaly scores for all observations.
* The `surprisals()` function uses either a fitted statistical model, or a kernel density estimate, to compute density scores.
* The `stray_scores()` function uses the stray algorithm to compute anomaly scores.
* The `lof_scores()` function uses local outlier factors to compute anomaly scores.
* The `glosh_scores()` function uses the Global-Local Outlier Score from Hierarchies algorithm to compute anomaly scores.
* The `lookout_prob()` function uses the lookout algorithm of [Kandanaarachchi & Hyndman (2022)]( https://robjhyndman.com/publications/lookout/) to compute anomaly probabilities.
Here are the top 0.02% most anomalous observations identified by each of the methods.
```{r of-scores}
of |>
mutate(
surprisal = surprisals(cbind(duration, waiting), probability = FALSE),
strayscore = stray_scores(cbind(duration, waiting)),
lofscore = lof_scores(cbind(duration, waiting), k = 150),
gloshscore = glosh_scores(cbind(duration, waiting)),
lookout = lookout_prob(cbind(duration, waiting))
) |>
filter(
surprisal > quantile(surprisal, prob = 0.998) |
strayscore > quantile(strayscore, prob = 0.998) |
lofscore > quantile(lofscore, prob = 0.998) |
gloshscore > quantile(gloshscore, prob = 0.998) |
lookout < 0.002
) |>
arrange(lookout)
```
The `surprisals()` function can also compute the probability of obtaining surprisal values at least as extreme as those observed. In fact, this is the default behaviour, obtained
when `probability = TRUE`.
```{r}
of |>
mutate(
surprisal = surprisals(cbind(duration, waiting), probability = FALSE),
prob = surprisals(cbind(duration, waiting))
) |>
arrange(prob)
```
## Robust multivariate scaling
Some anomaly detection methods require the data to be scaled first, so all observations are on the same scale. However, many scaling methods are not robust to anomalies. The `mvscale()` function provides a multivariate robust scaling method, that optionally takes account of the relationships betwen variables, and uses robust estimates of center, scale and covariance by default. The centers are removed using medians, the scale function is the IQR, and the covariance matrix is estimated using a robust OGK estimate. The data are scaled using the Cholesky decomposition of the inverse covariance. Then the scaled data are returned. The scaled variables are rotated to be orthogonal, so are renamed as `z1`, `z2`, etc. Non-rotated scaling is possible by setting `cov = NULL`.
```{r of-mvscale}
mvscale(of)
mvscale(of, cov = NULL)
```