```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
# Selection {#chap-selection}
Auditors must often evaluate balances or populations that include a large quantity of items. As it is not possible to individually examine all of these items, they must select a subset, or sample, from the total population to make a statement about a specific characteristic of the population. Several selection methodologies, which are widely accepted in the audit context, are available for this purpose. This chapter discusses the most frequently used sampling methodologies for audit sampling and demonstrates how to select a sample using these methods in R.
## Sampling Units
Selecting a subset from the population requires knowledge of the sampling units: the physical representations of the population that needs to be audited. Generally, the auditor has to choose between two types of sampling units: individual items in the population or individual monetary units in the population. In order to perform statistical selection, the population must be divided into individual sampling units that can be assigned a probability to be included in the sample. The total collection of all sampling units which have been assigned a selection probability is called the sampling frame.
### Items
A sampling unit for record (i.e., attributes) sampling is generally a characteristic of an item in the population. For example, suppose that you inspect a population of receipts. A possible sampling unit for record sampling can be the date of payment of the receipt. When a sampling unit (e.g., date of payment) is selected by the sampling method, the population item that corresponds to the sampled unit is included in the sample.
### Monetary Units
A sampling unit for monetary unit sampling differs from a sampling unit for record sampling in that it is an individual monetary unit within an item or transaction, such as an individual dollar. For example, a single sampling unit can be the 10$^{\text{th}}$ dollar from a specific receipt in the population. When a sampling unit (e.g., individual dollar) is selected by the sampling method, the population item that includes the sampling unit is included in the sample.
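To make the link between a monetary unit and its item concrete, the sketch below maps individual dollars to receipts through the cumulative sum of the book values. The receipt values are made up for illustration, and `unit_to_item()` is a hypothetical helper, not a function from **jfa**.

```r
# Hypothetical book values (in whole dollars) of three receipts
book_values <- c(100, 250, 50)

# Last dollar covered by each receipt:
# receipt 1 holds dollars 1 to 100, receipt 2 holds 101 to 350, receipt 3 holds 351 to 400
upper <- cumsum(book_values)

# Hypothetical helper: index of the receipt that contains a given dollar
unit_to_item <- function(unit, upper) {
  which(unit <= upper)[1]
}

unit_to_item(10, upper)   # the 10th dollar belongs to receipt 1
unit_to_item(101, upper)  # the 101st dollar belongs to receipt 2
unit_to_item(400, upper)  # the last dollar belongs to receipt 3
```

Because larger receipts cover more dollars, they are more likely to contain a selected monetary unit.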
## Sampling Methods
This section discusses four sampling methods that are commonly used in audit sampling. The methods that will be discussed are:
- Random sampling
- Fixed interval sampling
- Cell sampling
- Modified sieve sampling
First, let's get some notation out of the way. As discussed in Chapter 2, the population size $N$ is defined as the total set of individual sampling units (denoted by $x_i$).
\begin{equation}
N = \{x_1, x_2, \dots, x_N\}.
\end{equation}
In statistical sampling, every sampling unit $x_i$ in the population should receive a selection probability $p(x_i)$. The purpose of the sampling method is to provide a framework to assign selection probabilities to each of the sampling units, and subsequently draw sampling units from the population until a set of size $n$ has been created.
To illustrate how the resulting sample differs for various sampling methods, we will use the `BuildIt` data set included in the **jfa** package. These data can be loaded into R using the code below. For simplicity, we will use a sample size of $n$ = 10 for all examples.
```{r}
data(BuildIt)
n <- 10
```
### Random Sampling
Random sampling is the simplest and most straightforward selection method. It gives every sampling unit in the population an equal chance of being selected, meaning that every combination of sampling units has the same probability of being selected as every other combination of the same number of sampling units. Simply put, the algorithm draws a random selection of size $n$ from the sampling units. Therefore, the selection probability for each sampling unit is defined as:
\begin{equation}
p(x) = \frac{1}{N}.
\end{equation}
To make this procedure visually intuitive, @fig-selection-random below provides an illustration of the random sampling method.
![Illustration of random sampling, which involves selecting a subset of items from a population in such a way that every sampling unit in the population has an equal chance of being included in the sample. ](img/selection_random.png){#fig-selection-random fig-align="center"}
- **Advantage(s):** The random sampling method yields an optimal random selection, with the additional advantage that the sample can be easily extended by applying the same method again.
- **Disadvantage(s):** Because the selection probabilities are equal for all sampling units, there is no guarantee that items with a large monetary value in the population will be included in the sample.
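The equal-chance property can be checked empirically. The sketch below is a small simulation on a made-up population (the sizes `N_toy` and `n_toy` and the seed are assumptions chosen for illustration): a single draw selects each unit with probability $1/N$, so across $n$ draws without replacement each unit should end up in the sample with a frequency close to $n/N$.

```r
set.seed(123)  # arbitrary seed, only for reproducibility of the sketch

N_toy <- 20     # hypothetical population size
n_toy <- 5      # hypothetical sample size
reps  <- 10000  # number of simulated samples

# Count how often each sampling unit ends up in a sample
counts <- integer(N_toy)
for (r in seq_len(reps)) {
  drawn <- sample.int(N_toy, size = n_toy, replace = FALSE)
  counts[drawn] <- counts[drawn] + 1L
}

# Observed inclusion frequency per unit; all should be close to n_toy / N_toy
round(counts / reps, 2)
```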
#### Record Sampling
Random sampling can easily be coded in base R. First, we need a vector of the possible items (rows) in the population that can be selected. When performing record sampling, we can simply use R's built-in `sample.int()` function to draw a random sample from the row indices `1:nrow(BuildIt)` of the items and store the result in a variable `items`.
```{r}
set.seed(1)
items <- sample.int(nrow(BuildIt), size = n, replace = FALSE)
items
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
The sample can be reproduced in **jfa** via the `selection()` function. This function takes as input the population data, the sample size, and the characteristics of the sampling method. The argument `units` allows you to specify that you want to use record sampling (`units = "items"`), while the `method` argument enables you to specify that you are performing random sampling (`method = 'random'`).
```{r}
set.seed(1)
selection(BuildIt, size = n, units = "items", method = "random")$sample
```
An alternative to specifying the desired sample size through the `size` argument is to provide an object generated by the `planning()` function to the `selection()` function. For instance, the following code utilizes the `planning()` function to plan a sample size based on a performance materiality of 0.03, or three percent, and a sampling risk of 0.05, or five percent, which can be passed directly to `selection()` to select the sample from the `BuildIt` population.
```{r}
selection(BuildIt, size = planning(materiality = 0.03), units = "items", method = "random")
```
The ability of one function to accept input from another function allows for the implementation of a workflow in which the `planning()` function and the `selection()` function are sequentially linked. Additionally, the use of R's native pipe operator `|>` further simplifies this process.
```{r}
planning(materiality = 0.03) |>
selection(data = BuildIt, units = "items", method = "random")
```
The `selection()` function has three additional arguments which you can use to preprocess your population before selection. These arguments are `order`, `decreasing` and `randomize`.
The `order` argument takes as input a column name in `data` which determines the order of the population. For example, you can order the population from lowest book value to highest book value before engaging in the selection. In this case, you should use the `decreasing = FALSE` (its default value) argument.
```{r}
set.seed(1)
selection(BuildIt, size = n, order = "bookValue", units = "items", method = "random")$sample
```
The `randomize` argument can be used to randomly shuffle the items in the population before selection. For example, you can randomly shuffle the population before engaging in the selection using `randomize = TRUE`.
```{r}
set.seed(1)
selection(BuildIt, size = n, randomize = TRUE, units = "items", method = "random")$sample
```
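For intuition, the two preprocessing steps can be imitated in base R on a small made-up data frame. This is only a sketch of what ordering and shuffling a population mean; it is not necessarily how **jfa** implements these arguments, and the data frame `toy` and its columns are hypothetical.

```r
# Hypothetical population of five items
toy <- data.frame(
  id = 1:5,
  value = c(300, 120, 560, 80, 410)
)

# Ascending order by value, analogous to ordering on a column with decreasing = FALSE
toy_sorted <- toy[order(toy$value, decreasing = FALSE), ]
toy_sorted$id

# Random shuffle of the rows, analogous to randomize = TRUE
set.seed(1)
toy_shuffled <- toy[sample.int(nrow(toy)), ]
toy_shuffled$id
```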
#### Monetary Unit Sampling
When we are performing monetary unit sampling, we have to consider that each item in the population consists of multiple smaller units (i.e., the monetary units), which means that items with a higher book value should get a higher probability of being selected. The `sample.int()` function facilitates weighted selection via the `prob` argument, which takes a vector of values and, using normalization, computes the weights for selection. The call below is similar to before, but in this case we use the book values in the column `bookValue` of the data set to weigh the items and store the result in a variable `items`.
```{r}
set.seed(1)
items <- sample.int(nrow(BuildIt), size = n, replace = TRUE, prob = BuildIt[["bookValue"]])
items
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
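The normalization of the weights can be made explicit with a small example. The sketch below uses made-up book values (an assumption for illustration) and shows that each item's selection weight is its share of the total value, and that the observed selection frequencies approach these weights over many draws.

```r
# Hypothetical book values of four items
values <- c(50, 150, 300, 500)

# Selection weights after normalization: each item's share of the total value
weights <- values / sum(values)
weights

# Weighted draws with replacement; larger items should appear more often
set.seed(1)
draws <- sample.int(length(values), size = 10000, replace = TRUE, prob = values)
round(table(draws) / length(draws), 2)
```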
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use monetary unit sampling (`units = "values"`), while the `method` argument enables you to specify that you are performing random sampling (`method = 'random'`). Note that you should provide the name of the column in the data that contains the monetary units via the `values` argument.
```{r}
set.seed(1)
selection(BuildIt, size = n, units = "values", method = "random", values = "bookValue")$sample
```
### Fixed Interval Sampling
Fixed interval sampling is a method designed for yielding representative samples from monetary populations. The algorithm determines a uniform interval on the (optionally ranked) sampling units. Next, a starting point is handpicked or randomly selected in the first interval and a sampling unit is selected throughout the population at each of the uniform intervals from the starting point. For example, if the interval has a width of 10 sampling units and sampling unit number 5 is chosen as the starting point, the sampling units 5, 15, 25, etc. are selected to be included in the sample.
The width of the interval $I$ can be determined by dividing the number of sampling units in the population by the required sample size:
\begin{equation}
I = \frac{N}{n},
\end{equation}
in which $n$ is the required sample size and $N$ is the total number of sampling units in the population.
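As a worked example of this calculation, the sketch below uses a made-up population of 100 sampling units and a sample size of 10 (both assumptions for illustration), which reproduces the pattern 5, 15, 25, etc. described earlier.

```r
N_toy <- 100  # hypothetical number of sampling units
n_toy <- 10   # hypothetical sample size

# Interval width
I_toy <- N_toy / n_toy

# Starting at unit 5, take one unit per interval
start_toy <- 5
selected <- start_toy + I_toy * 0:(n_toy - 1)
selected
```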
If the space between the selected sampling units is equal, the selection probability for each sampling unit is theoretically defined as:
\begin{equation}
p(x) = \frac{1}{I},
\end{equation}
with the property that the space between selected units $i$ (of which the first one is the starting point) is the same as the interval $I$, see @fig-selection-interval below. However, in practice the selection is deterministic and completely depends on the chosen starting point (using `start`).
![Illustration of fixed interval sampling. The population is represented by the horizontal line, and the vertical lines indicate the intervals of size *I* at which sampling units are selected. By using fixed interval sampling, equal spacing between sampling units is ensured, which means that every $I^{\text{th}}$ unit in the population is included in the sample.](img/selection_interval.png){#fig-selection-interval fig-align="center"}
The fixed interval method yields a sample that allows every sampling unit in the population an equal chance of being selected. However, the fixed interval method has the property that all items in the population with a monetary value larger than the interval $I$ have a selection probability of one because at least one of these items' sampling units is always selected. Note that, if the population is arranged randomly with respect to its deviation pattern, fixed interval sampling is equivalent to random selection.
- **Advantage(s):** The advantage of the fixed interval sampling method is that it is often simple to understand and fast to perform. Another advantage is that, in monetary unit sampling, all items that are greater than the calculated interval will be included in the sample. In record sampling, since units can be ranked on the basis of value, there is also a guarantee that some large items will be in the sample.
- **Disadvantage(s):** A pattern in the population can coincide with the selected interval, rendering the sample less representative. What is sometimes seen as an added complication for this method is that the sample is hard to extend after drawing the initial sample. This is due to the chance of selecting the same sampling unit. However, by removing the already selected sampling units from the population and redrawing the intervals this problem can be efficiently solved.
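The claim that items larger than the interval are always selected in monetary unit sampling can be verified by brute force. The sketch below uses made-up book values and a hypothetical helper `select_items()`; it tries every whole-number starting point in the first interval and checks that the large item is selected each time.

```r
# Hypothetical book values; item 2 (400) exceeds the interval computed below
values <- c(50, 400, 30, 120, 200, 200)
n_toy <- 5
I_toy <- sum(values) / n_toy  # interval of 200 monetary units
upper <- cumsum(values)

# Items selected by fixed interval sampling for a given starting point
select_items <- function(start) {
  units <- start + I_toy * 0:(n_toy - 1)
  vapply(units, function(u) which(u <= upper)[1], integer(1))
}

# Check every whole-number starting point in the first interval
always_in <- all(vapply(
  seq_len(I_toy),
  function(s) 2L %in% select_items(s),
  logical(1)
))
always_in
```

Item 2 spans monetary units 51 to 450, which is wider than one interval, so the second selected unit always falls inside it regardless of the starting point.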
#### Record Sampling
To code fixed interval sampling in a record sampling context, we first have to compute the size of the interval we are working with. This is computed by dividing the number of items in the population by the desired sample size $n$. Suppose the auditor wants to select a sample of 10 items, then the interval is computed by:
```{r}
interval <- nrow(BuildIt) / n
```
Next, we have to determine the starting point. We are going to take the fifth unit in each interval in this case.
```{r}
start <- 5
```
To find which rows are part of the sample, we execute the following code:
```{r}
items <- ceiling(start + interval * 0:(n - 1))
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use record sampling (`units = "items"`), while the `method` argument enables you to specify that you are performing fixed interval sampling (`method = 'interval'`). Note that, by default, the first sampling unit from each interval is selected. However, this can be changed by setting the argument `start` to a different value.
```{r}
selection(BuildIt, size = n, units = "items", method = "interval", start = start)$sample
```
#### Monetary Unit Sampling
In monetary unit sampling, the only difference is that we are computing the interval on the basis of the booked values in the column `bookValue` of the data set. In this case, the starting point `start = 5` determines which monetary unit from each interval is selected.
```{r}
interval <- sum(BuildIt[["bookValue"]]) / n
```
To find which units are part of the sample, we execute the following code:
```{r}
units <- start + interval * 0:(n - 1)
```
To determine which items are part of the sample, we can run the following for loop. Note that negative book values should not be included in the cumulative sum, so the code below first sets them to zero.
```{r}
all_units <- ifelse(BuildIt[["bookValue"]] < 0, 0, BuildIt[["bookValue"]])
all_items <- 1:nrow(BuildIt)
items <- numeric(n)
for (i in 1:n) {
item <- which(units[i] <= cumsum(all_units))[1]
items[i] <- all_items[item]
}
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
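The lookup from monetary units to items can also be written without an explicit loop. The sketch below, on made-up book values (an assumption for illustration), compares the loop-style lookup with a vectorized alternative based on base R's `findInterval()`; it is one possible formulation, not necessarily the one used inside **jfa**.

```r
# Hypothetical book values, including one negative value
values <- c(120, -40, 300, 80, 500)

# Negative values are set to zero so they do not contribute monetary units
pos_values <- pmax(values, 0)
upper <- cumsum(pos_values)

# Fixed interval selection of n_toy monetary units, starting at unit 5
n_toy <- 4
I_toy <- sum(pos_values) / n_toy
units <- 5 + I_toy * 0:(n_toy - 1)

# Loop-style lookup: first item whose cumulative value reaches the unit
items_loop <- vapply(units, function(u) which(u <= upper)[1], integer(1))

# Vectorized lookup: number of cumulative sums strictly below the unit, plus one
items_vec <- findInterval(units, upper, left.open = TRUE) + 1L

items_loop
items_vec
```

Both approaches return the same indices here, and the item with the negative book value is never selected because it covers no monetary units.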
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use monetary unit sampling (`units = "values"`), while the `method` argument enables you to specify that you are performing fixed interval sampling (`method = 'interval'`). Note that you should provide the name of the column in the data that contains the monetary units via the `values` argument.
```{r}
selection(BuildIt, size = n, units = "values", method = "interval", values = "bookValue", start = start)$sample
```
### Cell Sampling
The cell sampling method divides the (optionally ranked) population into a set of intervals $I$ that are computed through the previously given equations. Within each interval, a sampling unit is selected by randomly drawing a number between 1 and the interval range $I$. This causes the space $i$ between the sampling units to vary. The procedure is displayed in @fig-selection-cell.
Like in the fixed interval sampling method, the selection probability for each sampling unit is defined as:
\begin{equation}
p(x) = \frac{1}{I}.
\end{equation}
![Illustration of cell sampling. In this illustration, the population is first divided into distinct cells of size $I$ and subsequently a random sampling unit is selected within each cell such that the space between units $i$ varies.](img/selection_cell.png){#fig-selection-cell fig-align="center"}
The cell sampling method has the property that all items in the population with a monetary value larger than twice the interval $I$ have a selection probability of one.
- **Advantage(s):** More sets of samples are possible than in fixed interval sampling, as there is no systematic interval $i$ to determine the selections. It is argued that the cell sampling algorithm offers a solution to the pattern problem in fixed interval sampling.
- **Disadvantage(s):** A disadvantage of this sampling method is that not all items in the population with a monetary value larger than the interval have a selection probability of one. Besides, population items can be in two adjacent cells, thereby creating the possibility that an item is included in the sample twice.
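These properties can be illustrated with a small simulation. The sketch below uses made-up book values, an arbitrary seed, and a hypothetical helper `cell_sample()`: one item has a value between $I$ and $2I$ and straddles two cells, while another exceeds $2I$ and therefore fully covers at least one cell.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility of the sketch

# Hypothetical book values: item 2 (250) lies between I and 2I, item 3 (650) exceeds 2I
values <- c(100, 250, 650)
n_toy <- 5
I_toy <- sum(values) / n_toy  # interval of 200 monetary units
upper <- cumsum(values)
bounds <- I_toy * 0:n_toy     # cell boundaries: 0, 200, ..., 1000

# One cell sample: draw one monetary unit uniformly within each cell
cell_sample <- function() {
  units <- runif(n_toy, min = bounds[-length(bounds)], max = bounds[-1])
  vapply(units, function(u) which(u <= upper)[1], integer(1))
}

sims <- replicate(5000, cell_sample())  # matrix with one column per sample

p_item2 <- mean(apply(sims, 2, function(s) 2L %in% s))       # item 2 is sometimes missed
p_item3 <- mean(apply(sims, 2, function(s) 3L %in% s))       # item 3 is always selected
p_twice <- mean(apply(sims, 2, function(s) sum(s == 2L) > 1)) # item 2 is sometimes selected twice
c(p_item2, p_item3, p_twice)
```

In this setup item 2 covers half of the first cell and three quarters of the second, so it is missed whenever both draws fall outside it and selected twice whenever both fall inside it.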
#### Record Sampling
To code cell sampling in a record sampling context, we again have to compute the size of the interval we are working with:
```{r}
interval <- nrow(BuildIt) / n
```
Next, we have to randomly determine which items are going to be selected in each interval.
```{r}
set.seed(1)
starts <- floor(runif(n, 0, interval))
```
To find which rows are part of the sample, we execute the following code:
```{r}
items <- floor(starts + interval * 0:(n - 1))
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use record sampling (`units = "items"`), while the `method` argument enables you to specify that you are performing cell sampling (`method = 'cell'`).
```{r}
set.seed(1)
selection(BuildIt, size = n, units = "items", method = "cell")$sample
```
#### Monetary Unit Sampling
In monetary unit sampling, the only difference is that we are computing the interval on the basis of the booked values in the column `bookValue` of the data set. In this case, a random draw within each interval determines which monetary unit from that interval is selected.
```{r}
interval <- sum(BuildIt[["bookValue"]]) / n
```
To determine which items are part of the sample, we can run the following for loop. Note that negative book values should not be included in the cumulative sum, so the code below first sets them to zero.
```{r}
set.seed(1)
all_units <- ifelse(BuildIt[["bookValue"]] < 0, 0, BuildIt[["bookValue"]])
all_items <- 1:nrow(BuildIt)
intervals <- 0:n * interval
items <- numeric(n)
for (i in 1:n) {
unit <- stats::runif(1, intervals[i], intervals[i + 1])
item <- which(unit <= cumsum(all_units))[1]
items[i] <- all_items[item]
}
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use monetary unit sampling (`units = "values"`), while the `method` argument enables you to specify that you are performing cell sampling (`method = 'cell'`). Note that you should provide the name of the column in the data that contains the monetary units via the `values` argument.
```{r}
set.seed(1)
selection(BuildIt, size = n, units = "values", method = "cell", values = "bookValue")$sample
```
### Modified Sieve Sampling
The fourth option for the sampling method is modified sieve sampling (Hoogduin, Hall, & Tsay, 2010). The algorithm starts by selecting a standard uniform random number $R_i$ between 0 and 1 for each item in the population. Next, the sieve ratio:
\begin{equation}
S_i = \frac{Y_i}{R_i}
\end{equation}
is computed for each item by dividing the book value of that item by the random number. Lastly, the items in the population are sorted by their sieve ratio $S$ (in decreasing order) and the top $n$ items are selected for inspection. In contrast to the classical sieve sampling method (Rietveld, 1978), the modified sieve sampling method provides precise control over sample sizes.
#### Monetary Unit Sampling
```{r}
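set.seed(1)
# Set negative book values to zero so that they get a sieve ratio of zero
all_units <- ifelse(BuildIt[["bookValue"]] < 0, 0, BuildIt[["bookValue"]])
all_items <- 1:nrow(BuildIt)
# Sieve ratio: book value divided by a standard uniform random number
ri <- all_units / stats::runif(length(all_items), 0, 1)
# Sort the items by decreasing sieve ratio and keep the top n
items <- all_items[order(-ri)]
items <- items[1:n]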
```
You can then select the sample from the population using the selected indices stored in `items`.
```{r}
BuildIt[items, ]
```
The sample can be reproduced in **jfa** via the `selection()` function. The argument `units` allows you to specify that you want to use monetary unit sampling (`units = "values"`), while the `method` argument enables you to specify that you are performing modified sieve sampling (`method = 'sieve'`). Note that you should provide the name of the column in the data that contains the monetary units via the `values` argument.
```{r}
set.seed(1)
selection(BuildIt, size = n, units = "values", method = "sieve", values = "bookValue")$sample
```
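To see how the sieve ratio favors large items while keeping the sample size fixed, the sketch below repeats the procedure many times on made-up book values. The values, the seed, and the helper `sieve_sample()` are assumptions for illustration only.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of the sketch

# Hypothetical book values; item 5 is much larger than the others
values <- c(10, 20, 30, 40, 900)
n_toy <- 2

# One modified sieve sample: rank items by book value divided by a uniform draw
sieve_sample <- function() {
  ratio <- values / runif(length(values))
  order(ratio, decreasing = TRUE)[seq_len(n_toy)]
}

sims <- replicate(5000, sieve_sample())  # matrix with one column per sample

# Inclusion frequency per item; larger book values tend to be included more often
incl <- vapply(seq_along(values), function(i) mean(colSums(sims == i) > 0), numeric(1))
round(incl, 2)
```

Every simulated sample contains exactly `n_toy` distinct items, which reflects the control over the sample size mentioned above.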
## Practical Exercises
1. Select a random sample of 120 items from the `BuildIt` data set.
::: {.content-visible when-format="html"}
<details>
<summary>Click to reveal answer</summary>
Selecting a random sample of items can be done using the `selection()` function with the additional arguments `size = 120`, `method = "random"` and `units = "items"`.
```{r}
selec <- selection(data = BuildIt, size = 120, method = "random", units = "items")
head(selec[["sample"]], 5)
```
</details>
:::
2. Select a sample of 240 monetary units from the `BuildIt` data set using a fixed interval selection method. Use a starting point of 12.
::: {.content-visible when-format="html"}
<details>
<summary>Click to reveal answer</summary>
Selecting a fixed interval sample of monetary units can be done using the `selection()` function with the arguments `size = 240`, `method = "interval"` and `units = "values"`. Additionally, for fixed interval monetary unit sampling, the book values must be provided via the argument `values = "bookValue"`. The starting point is indicated using `start = 12`.
```{r}
selec <- selection(data = BuildIt, size = 240, method = "interval", units = "values", values = "bookValue", start = 12)
head(selec[["sample"]], 5)
```
</details>
:::
::: {.content-visible when-format="pdf"}
\clearpage
## Answers to the Exercises
1. Selecting a random sample of 120 items can be done using the `selection()` function with the additional arguments `size = 120`, `method = "random"` and `units = "items"`.
```{r}
selec <- selection(data = BuildIt, size = 120, method = "random", units = "items")
head(selec[["sample"]], 5)
```
2. Selecting a fixed interval sample of 240 monetary units can be done using the `selection()` function with the arguments `size = 240`, `method = "interval"` and `units = "values"`. Additionally, for fixed interval monetary unit sampling, the book values must be provided via the argument `values = "bookValue"`. The starting point is indicated using `start = 12`.
```{r}
selec <- selection(data = BuildIt, size = 240, method = "interval", units = "values", values = "bookValue", start = 12)
head(selec[["sample"]], 5)
```
:::