```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
# Planning {#chap-planning}
One of the key considerations in audit sampling is determining the appropriate sample size to reduce the sampling risk to an appropriately low level, while minimizing audit effort. To quantify the sampling risk, it is necessary to specify the statistical model that connects the data to the population misstatement, referred to as the likelihood. This chapter will delve into three standard likelihoods commonly employed in audit sampling: the hypergeometric likelihood, the binomial likelihood, and the Poisson likelihood.
Note that planning is an optimization exercise that can sometimes be avoided altogether. If you are using the Bayesian approach to audit sampling, it is not required to plan a specific sample size in advance [@Derks2022]. That is because in Bayesian inference, the posterior distribution after seeing each item is used as the prior distribution for the next item. That means that you can simply start sampling and monitor the evidence in the data over time. However, to get an idea of how many samples are required to reduce the sampling risk to an appropriately low level, planning a sample using a Bayesian approach can still be good practice.
## Required Information
First, planning a minimum sample requires knowledge of the conditions that lead to acceptance or rejection of the population (i.e., the sampling objectives). Typically, sampling objectives can be classified into one or both of the following:
- **Hypothesis testing**: The goal of the sample is to obtain evidence for or against the claim that the misstatement in the population is lower than a given value (i.e., the performance materiality).
- **Estimation**: The goal of the sample is to obtain an accurate estimate of the misstatement in the population with a certain precision.
Second, it is advised to specify the expected (or tolerable) misstatements in the sample. The expected misstatements are the misstatements that you allow in the sample, while still retaining the desired amount of assurance about the population. It is strongly recommended to set the value for the expected misstatements in the sample conservatively, to minimize the chance of the observed misstatements in the sample exceeding the expected misstatements, which would imply that, in the end, insufficient work has been done to reduce the sampling risk to an appropriately low level.
Finally, next to determining the sampling objective(s) and the expected misstatements, it is important to determine the statistical distribution linking the sample outcomes to the population misstatement. This distribution is called the likelihood (i.e., `poisson`, `binomial`, or `hypergeometric`). All three aforementioned likelihoods are commonly used in an audit sampling context; however, `poisson` is the default likelihood in **jfa** because it is the most conservative of the three. In the subsections below, we elaborate on the three standard likelihoods for audit sampling and demonstrate how they can be used to obtain a minimum sample size.
In **jfa**, determining an appropriate sample size is achieved via the `planning()` function.
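For a quick impression of how the sample size behaves, the sketch below calls `planning()` for zero to three expected misstatements under the default `poisson` likelihood; it assumes, for illustration, that the returned object stores the minimum sample size in its `n` element.
```{r}
# Sketch: minimum sample size for 0 to 3 expected misstatements
# (default poisson likelihood, 5 percent sampling risk)
sapply(0:3, function(k) planning(materiality = 0.03, expected = k)$n)
```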
![When the sampling objectives, the expected misstatements and the likelihood are set, there exists only one sample size that minimizes the amount of audit effort. Image available under a [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.](img/book_puzzle.png){#fig-planning-1 fig-align="center" width=65%}
## The Hypergeometric Likelihood
The hypergeometric distribution is a discrete probability distribution that is commonly used to model the number of events occurring in a fixed number of trials when the population size is known. It assumes that samples are drawn from the population without replacement, and is therefore the likelihood that most closely resembles the audit practice. For our purpose, we can use the hypergeometric distribution as a likelihood to model the number of misstatements that are expected to be found in the sample.
The probability mass function (PMF) of the hypergeometric distribution is given by:
\begin{equation}
p(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}},
\end{equation}
where $k$ is the number of misstatements in the sample, $n$ is the sample size, $N$ is the population size and $K$ is the total number of misstatements assumed in the population. The assumed number of misstatements $K$ is a whole number: a linear extrapolation of the maximum tolerable misstatement rate (i.e., the performance materiality) $\theta_{max}$ to the total population of size $N$. In the equation below, $\lceil...\rceil$ is the ceiling function, which means that $\lceil 1.2 \rceil$ = 2.
\begin{equation}
K = \lceil \theta_{max} N \rceil.
\end{equation}
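In R, this extrapolation is a single call to `ceiling()`; for the example population used below ($N$ = 1000, $\theta_{max}$ = 0.03):
```{r}
ceiling(0.03 * 1000) # K = 30 assumed misstatements in the population
```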
Let's consider how to use the hypergeometric likelihood to calculate the minimum sample size needed to reduce the sampling risk to an appropriately low level.
### Classical Planning
In classical planning using the hypergeometric likelihood, the following statistical model is specified:
\begin{equation}
k \sim \text{Hypergeometric}(n, N, K)
\end{equation}
Given the performance materiality $\theta_{max}$, we can compute $K$ and solve for the minimum sample size $n$ needed to reduce the sampling risk to an appropriately low level. This sample size is also dependent on the number of misstatements that the auditor expects, or tolerates, in the sample.
#### No Expected Misstatements
If the auditor does not expect any misstatements in the sample, they can set $k$ = 0, which consequently determines how the sample size can be calculated. For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03 in a population of $N$ = 1000 items, then $K$ = $\lceil 0.03 \cdot 1000 \rceil$ = 30 and the minimum sample size under the assumption of no expected misstatements in the sample is $n$ = 94.
```{r}
plan <- planning(materiality = 0.03, expected = 0, conf.level = 0.95, likelihood = "hypergeometric", N.units = 1000)
plan
```
The sample size of 94 can be confirmed by checking that 94 is the minimum integer that results in less than five percent probability of finding no misstatements, given the assumption that the population misstatement is truly 0.03, or three percent. The `dhyper()` function calculates the probability of observing $k$ misstatements in a sample of $n$ items, given the hypergeometric distribution with $N$ items and $K$ assumed misstatements in the population. By calculating this probability for $n$ = 93, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
K <- ceiling(0.03 * 1000)
dhyper(x = 0, m = K, n = 1000 - K, k = 93) < 0.05
```
However, for $n$ = 94 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
dhyper(x = 0, m = K, n = 1000 - K, k = 94) < 0.05
```
We can make this sample size visually intuitive by showing the hypergeometric($k$ | 94, 1000, 30) distribution and highlighting the probability for $k$ = 0, see @fig-planning-hyper-nomis. This probability is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-hyper-nomis", fig.cap="The hypergeometric($k$ | 94, 1000, 30) distribution showing the probability for $k$ = 0 misstatements as a red bar."}
plot(plan)
```
The `planning()` function has two additional arguments that are not shown in the call above: `by` and `max`. The argument `by` sets the increment between possible sample sizes for consideration. For example, `by = 10` considers only samples of size 10, 20, 30, and so forth.
```{r}
planning(materiality = 0.03, likelihood = "hypergeometric", N.units = 1000, by = 10)
```
The argument `max` sets the sample size at which the algorithm terminates. This can be used to avoid too many iterations of the algorithm at very low values of the performance materiality. For instance, `max = 50` throws an error if more than 50 samples are required.
```{r error=TRUE}
planning(materiality = 0.03, likelihood = "hypergeometric", N.units = 1000, max = 50)
```
#### Expected Misstatements
If the auditor expects misstatements in the sample, they can set $k$ to any integer value, which consequently determines how the sample size can be calculated. As another example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03 in a population of $N$ = 1000 items, then the required sample size under the assumption of one expected misstatement in the sample is $n$ = 147.
```{r}
plan <- planning(materiality = 0.03, expected = 1, likelihood = "hypergeometric", N.units = 1000)
plan
```
Once again, the sample size of 147 can be confirmed by checking that 147 is the minimum integer that results in less than five percent probability of finding 0 or 1 misstatements, given the assumption that the population misstatement is truly three percent. By calculating this probability for $n$ = 146, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
sum(dhyper(x = 0:1, m = K, n = 1000 - K, k = 146)) < 0.05
```
However, for $n$ = 147 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
sum(dhyper(x = 0:1, m = K, n = 1000 - K, k = 147)) < 0.05
```
Like before, we can make this sample size visually intuitive by showing the hypergeometric($k$ | 147, 1000, 30) distribution and highlighting the probabilities for $k$ = 0 and $k$ = 1, see @fig-planning-hyper-mis. The sum of these probabilities is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-hyper-mis", fig.cap="The hypergeometric($k$ | 147, 1000, 30) distribution showing the probability for $k$ = 0 and $k$ = 1 misstatements as a red bar."}
plot(plan)
```
### Bayesian Planning
Performing Bayesian planning with the hypergeometric likelihood [@Dyer1993] requires that you specify a prior distribution for the total misstatements $K$. Practically, this means that you should provide an input for the `prior` argument in the `planning()` function.
Setting `prior = TRUE` performs Bayesian planning using a default prior conjugate to the specified `likelihood` (i.e., a beta-binomial prior). Because this is a Bayesian analysis, the following statistical model is specified:
\begin{align}
k &\sim \text{Hypergeometric}(n, N, K) \\
K &\sim \text{Beta-binomial}(N, \alpha, \beta)
\end{align}
The beta-binomial prior distribution is the conjugate prior to the hypergeometric likelihood (see this [list](https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions) of conjugate priors), which means that the posterior distribution of $K$ can be determined analytically. For example, if the prior distribution for $K$ is:
\begin{equation}
K \sim \text{Beta-binomial}(N, \alpha, \beta) \,\,\,\,\,\,\,\,\,\, K = 0, \ldots, N
\end{equation}
and the auditor has observed a sample of $n$ items containing $k$ misstatements, then the posterior distribution for $K$ is:
\begin{equation}
K \sim \text{Beta-binomial}(N - n, \alpha + k, \beta + n - k) \,\,\,\,\,\,\,\,\,\, K = k, k + 1, \ldots, N - n + k.
\end{equation}
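This conjugate update can be verified numerically in base R. The sketch below uses a hypothetical population of $N$ = 20 items and a hypothetical sample of $n$ = 10 items containing $k$ = 1 misstatement; the beta-binomial PMF is written out by hand so that no extra packages are needed.
```{r}
# Sketch: grid-based Bayes' rule versus the analytic posterior
dbetabinom <- function(x, size, a, b) {
  choose(size, x) * beta(x + a, size - x + b) / beta(a, b)
}
N <- 20; n <- 10; k <- 1 # hypothetical population and sample
K <- 0:N
prior <- dbetabinom(K, N, 1, 1) # beta-binomial(N, 1, 1) prior on K
lik <- dhyper(x = k, m = K, n = N - K, k = n) # hypergeometric likelihood
posterior <- prior * lik / sum(prior * lik) # Bayes' rule on the grid
# Matches the shifted beta-binomial(N - n, 1 + k, 1 + n - k) given above
all.equal(posterior, dbetabinom(K - k, N - n, 1 + k, 1 + n - k))
```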
#### No Expected Misstatements
Planning for no expected misstatements in the sample can be done by setting the value for the `expected` argument to zero. If we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.1 in a population of $N$ = 20 items, then the required sample size under the assumption of zero expected misstatements in the sample is $n$ = 15. The command below uses a default beta-binomial($N$, 1, 1) prior distribution to plan this sample, since `planning()` is given the hypergeometric likelihood.
```{r}
plan <- planning(materiality = 0.1, likelihood = "hypergeometric", N.units = 20, prior = TRUE)
```
The `summary()` function can be used to obtain relevant information about the planning.
```{r}
summary(plan)
```
You can inspect how the prior distribution compares to the expected posterior distribution by using the `plot()` function, see @fig-planning-bbinom-nomis. The expected posterior distribution is the posterior distribution that would occur if you actually observed the planned sample containing the expected misstatements. Note that the posterior distribution is only defined in the range [$k$; $N - n + k$], since a part of the population has already been seen.
```{r label="fig-planning-bbinom-nomis", fig.cap="The beta-binomial prior and posterior distribution on the range [$k$; $N - n + k$] after seeing no misstatements in the sample."}
plot(plan)
```
#### Expected Misstatements
Planning for expected misstatements in the sample can be done by setting the value for the `expected` argument to a different value than zero. For example, the code below calculates the minimum sample size to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.1 in a population of $N$ = 50 items, given one expected misstatement in the sample. This sample size is $n$ = 32.
```{r}
plan <- planning(materiality = 0.1, expected = 1, likelihood = "hypergeometric", N.units = 50, prior = TRUE)
plan
```
Like before, you can inspect how the prior distribution compares to the expected posterior distribution by using the `plot()` function, see @fig-planning-bbinom-mis below for the output of this call.
```{r label="fig-planning-bbinom-mis", fig.cap="The beta-binomial prior and posterior distribution on the range [$k$; $N - n + k$] after seeing one misstatement in the sample."}
plot(plan)
```
## The Binomial Likelihood
The binomial distribution is a discrete probability distribution that is commonly used to model the number of events occurring in a fixed number of trials. It is similar to the hypergeometric distribution, however, it assumes that samples are drawn from the population with replacement. For our purpose, we can use the binomial distribution as a likelihood to model the number of misstatements that are expected to be found in the sample.
In audit sampling, the binomial likelihood is often used to approximate the hypergeometric likelihood since it is easier to work with (i.e., it only has two parameters: $\theta$ and $n$, while the hypergeometric has three: $n$, $N$, and $K$). However, the binomial likelihood is more conservative than the hypergeometric likelihood, meaning that resulting sample sizes will be higher.
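This difference is easy to see with the planning function itself. The sketch below compares the two sample sizes for the running example ($\theta_{max}$ = 0.03, no expected misstatements, $N$ = 1000 for the hypergeometric case), again assuming that the planning object stores the minimum sample size in its `n` element.
```{r}
# Binomial (99) versus hypergeometric (94) sample size for the same population
planning(materiality = 0.03, likelihood = "binomial")$n
planning(materiality = 0.03, likelihood = "hypergeometric", N.units = 1000)$n
```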
The probability mass function (PMF) of the binomial distribution is given by:
\begin{equation}
p(k; n, \theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n - k},
\end{equation}
where $k$ is the number of misstatements in the sample, $n$ is the sample size and $\theta$ is the probability of misstatement in the population. Let's consider how to use the binomial likelihood to calculate the minimum sample size needed to reduce the sampling risk to an appropriately low level.
### Classical Planning
In classical planning using the binomial likelihood, the following statistical model is specified:
\begin{equation}
k \sim \text{Binomial}(n, \theta_{max})
\end{equation}
#### No Expected Misstatements
If the auditor does not expect any misstatements in the sample, they can set $k$ = 0, which consequently determines how the sample size can be calculated. Given a performance materiality $\theta_{max}$, we can solve for the minimum sample size $n$ needed to reduce the sampling risk to an appropriately low level. A useful trick to utilize is that, if we do not expect any misstatements in the sample, the formula for the minimum required sample size reduces to:
\begin{equation}
n = \lceil\frac{\ln(\alpha)}{\ln(1 - \theta_{max})}\rceil.
\end{equation}
For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of zero expected misstatements in the sample is $n$ = 99.
```{r}
ceiling(log(1 - 0.95) / log(1 - 0.03))
```
In **jfa**, this sample size can be replicated using the following code:
```{r}
plan <- planning(materiality = 0.03, likelihood = "binomial")
plan
```
The sample size of 99 can be confirmed by checking that 99 is the minimum integer that results in less than five percent probability of finding 0 misstatements, given the assumption that the population misstatement is truly three percent. The `dbinom()` function calculates the probability of observing $k$ misstatements in a sample of $n$ items given an assumed misstatement probability $\theta_{max}$. By calculating this probability for $n$ = 98, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
dbinom(x = 0, size = 98, prob = 0.03) < 0.05
```
However, for $n$ = 99 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
dbinom(x = 0, size = 99, prob = 0.03) < 0.05
```
We can make this sample size visually intuitive by showing the binomial($k$ | 99, 0.03) distribution and highlighting the probability for $k$ = 0, see @fig-planning-binom-nomis. This probability is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-binom-nomis", fig.cap="The binomial($k$ | 99, 0.03) distribution showing the probability for $k$ = 0 misstatement as a red bar."}
plot(plan)
```
#### Expected Misstatements
However, if the number of expected misstatements in the sample is non-zero, it becomes more difficult to solve the formula for $n$. Hence, the auditor will need to set $k$ to a different integer value, which consequently determines how the sample size is calculated. Here, we can iteratively try every value of $n$ and return the smallest integer that satisfies the sampling objectives, as the sketch below illustrates.
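A minimal base-R sketch of this search for one expected misstatement is shown below; it returns the smallest $n$ for which the probability of finding at most one misstatement drops below the sampling risk.
```{r}
# Iterative search: smallest n with P(k <= 1 | n, 0.03) < 0.05
n <- 1
while (pbinom(q = 1, size = n, prob = 0.03) >= 0.05) {
  n <- n + 1
}
n # 157
```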
In **jfa**, this can be done by adjusting the `expected` argument in the `planning()` function. For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of one expected misstatement in the sample is $n$ = 157.
```{r}
plan <- planning(materiality = 0.03, expected = 1, likelihood = "binomial")
```
Once again, the sample size of 157 can be confirmed by checking that 157 is the minimum integer that results in less than five percent probability of finding 0 or 1 misstatements, given the assumption that the population misstatement is truly three percent. By calculating this probability for $n$ = 156, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
sum(dbinom(x = 0:1, size = 156, prob = 0.03)) < 0.05
```
However, for $n$ = 157 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
sum(dbinom(x = 0:1, size = 157, prob = 0.03)) < 0.05
```
Like before, we can make this sample size visually intuitive by showing the binomial($k$ | 157, 0.03) distribution and highlighting the probabilities for $k$ = 0 and $k$ = 1, see @fig-planning-binom-mis. The sum of these probabilities is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-binom-mis", fig.cap="The binomial($k$ | 157, 0.03) distribution showing the probability for $k$ = 0 and $k$ = 1 misstatements as a red bar."}
plot(plan)
```
#### Expected Misstatement Rate
When the expected misstatement rate in the sample $\theta_{exp}$ is assessed, the value for $k$ can be determined as $k$ = $n\theta_{exp}$, which consequently determines how the sample size can be calculated.
To account for the fact that $k$ can have non-integer values in this case, we can use a well-known similarity between the binomial distribution and the beta distribution to plan the sample size. The upper bound for any binomial($k$; $n$, $\theta$) distributed variable can also be obtained via percentiles of the beta(1 + $k$, $n - k$) distribution.
For example, the upper bound for a sample of $n$ = 10 items containing $k$ = 2 misstatements, when calculated via the traditional `binom.test()` is:
```{r}
ub_binom <- binom.test(x = 2, n = 10, p = 0.03, conf.level = 0.95, alternative = "less")$conf.int[2]
ub_binom
```
When calculated via the beta relationship, the upper bound is:
```{r}
ub_beta <- qbeta(p = 0.95, shape1 = 1 + 2, shape2 = 10 - 2)
ub_beta
```
It can be validated that the two approaches result in the same upper bound via:
```{r}
ub_binom == ub_beta
```
This relationship between the binomial likelihood and the beta distribution is deliberately **not** used in **jfa**. That is because, in the case of the binomial distribution, the auditing standards round the tolerable misstatements upwards to a whole number. For example, if we try to call the `planning()` function with the argument `expected = 1.5`, **jfa** will internally convert this to `expected = 2` and base the sample size on this in order to stay compliant with @AICPA-A. The resulting sample size is $n$ = 208 in this case.
```{r}
planning(materiality = 0.03, expected = 1.5, likelihood = "binomial")
```
### Bayesian Planning
Performing Bayesian planning using the binomial likelihood requires that you specify a prior distribution for the parameter $\theta$. Practically, this means that you should provide an input for the `prior` argument in the `planning()` function.
Setting `prior = TRUE` performs Bayesian planning using a default prior conjugate to the specified `likelihood` (i.e., a beta prior). Because this is a Bayesian analysis, the following statistical model is specified:
\begin{align}
k &\sim \text{Binomial}(n, \theta) \\
\theta &\sim \text{Beta}(\alpha, \beta)
\end{align}
The beta prior distribution is the conjugate prior for the binomial likelihood (see this [list](https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions) of conjugate priors), which means that the posterior distribution of $\theta$ can be determined analytically. For example, if the prior distribution for $\theta$ is:
\begin{equation}
\theta \sim \text{Beta}(\alpha, \beta) \,\,\,\,\,\,\,\,\,\, \theta \in [0, 1]
\end{equation}
and the auditor has observed a sample of $n$ items containing $k$ misstatements, then the posterior distribution for $\theta$ is:
\begin{equation}
\theta \sim \text{Beta}(\alpha + k, \beta + n - k) \,\,\,\,\,\,\,\,\,\, \theta \in [0, 1].
\end{equation}
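This update is easy to verify in base R. The sketch below takes the default beta(1, 1) prior and a hypothetical sample of $n$ = 10 items containing $k$ = 1 misstatement, and computes the 95 percent upper bound of the resulting beta posterior.
```{r}
# Conjugate update of the beta prior for a hypothetical sample
alpha <- 1; beta <- 1 # beta(1, 1) prior
n <- 10; k <- 1 # hypothetical sample outcome
qbeta(p = 0.95, shape1 = alpha + k, shape2 = beta + n - k) # posterior upper bound
```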
For example, the command below uses a default beta($\alpha$ = 1, $\beta$ = 1) prior distribution to plan the sample, since `planning()` is given the binomial likelihood. If we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of zero expected misstatements in the sample is $n$ = 98.
```{r}
plan <- planning(materiality = 0.03, expected = 0, conf.level = 0.95, likelihood = "binomial", prior = TRUE)
```
The `summary()` function can be used to obtain relevant information about the planning.
```{r}
summary(plan)
```
You can inspect how the prior distribution compares to the expected posterior distribution by using the `plot()` function, see @fig-planning-beta-nomis. The expected posterior distribution is the posterior distribution that would occur if you actually observed the planned sample containing the expected misstatements.
```{r label="fig-planning-beta-nomis", fig.cap="The beta prior and posterior distribution on the range [0; 1] after seeing no misstatements in the sample of 98 units."}
plot(plan)
```
The input for the `prior` argument can also be an object created by the `auditPrior` function. If `planning()` receives a prior for which there is no conjugate likelihood available, it will numerically derive the posterior distribution. For example, the command below uses a Normal(0, 0.05) prior distribution to plan the sample using the binomial likelihood. Because this is a Bayesian analysis, the following statistical model is specified:
\begin{align}
k &\sim \text{Binomial}(n, \theta) \\
\theta &\sim \text{Normal}(\mu = 0, \sigma = 0.05)
\end{align}
```{r}
prior <- auditPrior(method = "param", likelihood = "normal", alpha = 0, beta = 0.05)
plan <- planning(materiality = 0.03, likelihood = "poisson", prior = prior)
```
The `summary()` function can be used to obtain relevant information about the planning.
```{r}
summary(plan)
```
The resulting sample size under this prior is $n$ = 90, a reduction of eight items compared to the default beta(1, 1) prior distribution. @fig-planning-beta-nomis2 shows this prior and posterior distribution.
```{r label="fig-planning-beta-nomis2", fig.cap="The normal prior distribution in this example contains risk-reducing information. The posterior distribution has roughly the same upper bound as the beta posterior in the previous example and occurs after seeing no misstatements in a sample of 90 units."}
plot(plan)
```
## The Poisson Likelihood
The Poisson distribution is a discrete probability distribution that is commonly used to model the number of events occurring in a fixed time or space. We can use the Poisson distribution as a likelihood to model the number of misstatements that are expected to be found in the sample.
In audit sampling, the Poisson likelihood is often used to approximate the binomial likelihood since it is easier to work with (i.e., it only has one parameter: $\lambda$, while the binomial has two parameters: $\theta$ and $n$). However, the Poisson likelihood is more conservative than the binomial likelihood, meaning that resulting sample sizes will be higher.
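Again, this can be checked with the planning function itself: under the same conditions as before, the Poisson sample size (100) is slightly larger than the binomial one (99). A minimal sketch, assuming the `n` element of the planning object:
```{r}
planning(materiality = 0.03, likelihood = "poisson")$n
planning(materiality = 0.03, likelihood = "binomial")$n
```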
The probability mass function (PMF) of the Poisson distribution is given by:
\begin{equation}
p(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!},
\end{equation}
where $k$ is the number of misstatements in the sample, and $\lambda$ is the average number of misstatements expected in the sample. The average number of misstatements is related to the misstatement rate in the population, denoted by $\theta$, and the sample size, $n$, by the following equation:
\begin{equation}
\lambda = n \theta.
\end{equation}
Let's consider how to use the Poisson likelihood to calculate the minimum sample size needed to reduce the sampling risk to an appropriately low level.
### Classical Planning
In classical planning using the Poisson likelihood, the following statistical model is specified:
\begin{equation}
k \sim \text{Poisson}(n \theta_{max})
\end{equation}
#### No Expected Misstatements
Given the performance materiality $\theta_{max}$ and the Poisson likelihood, we can solve for the minimum sample size $n$ needed to reduce the sampling risk to an appropriately low level. A useful trick to utilize is that, if we do not expect any misstatements in the sample, the formula for the required sample size reduces to:
\begin{equation}
n = \lceil - \frac{\ln(\alpha)}{\theta_{max}} \rceil.
\end{equation}
For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of zero expected misstatements in the sample is $n$ = 100.
```{r}
ceiling(-log(1 - 0.95) / 0.03)
```
In **jfa**, this sample size can be replicated using the following code:
```{r}
plan <- planning(materiality = 0.03, likelihood = "poisson")
```
The sample size of 100 can be confirmed by checking that 100 is the minimum integer that results in less than five percent probability of finding no misstatements, given the assumption that the population misstatement is truly three percent. The `dpois()` function calculates the probability of observing $k$ misstatements in a sample of $n$ items given the rate $\lambda$ = $n\theta_{max}$. By calculating this probability for $n$ = 99, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
dpois(x = 0, lambda = 99 * 0.03) < 0.05
```
However, for $n$ = 100 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
dpois(x = 0, lambda = 100 * 0.03) < 0.05
```
We can make this visually intuitive by showing the Poisson($k$ | 100 $\cdot$ 0.03) distribution and highlighting the probability for $k$ = 0, see @fig-planning-pois-nomis. This probability is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-pois-nomis", fig.cap="The Poisson($k$ | 100 $\\cdot$ 0.03) distribution showing the probability for $k$ = 0 misstatements as a red bar."}
plot(plan)
```
#### Expected Misstatements
However, if the number of expected misstatements in the sample is non-zero, it becomes more difficult to solve the formula for $n$ algebraically. Hence, the auditor will need to set $k$ to a different integer value. Next, we can iteratively try every value of $n$ and return the smallest integer that satisfies the sampling objectives.
For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of one expected misstatement in the sample is $n$ = 159.
```{r}
plan <- planning(materiality = 0.03, expected = 1, likelihood = "poisson")
```
Once again, the sample size of 159 can be confirmed by checking that 159 is the minimum integer that results in less than five percent probability of finding $k$ = 0 or $k$ = 1 misstatements, given the assumption that the population misstatement is truly three percent. By calculating this probability for $n$ = 158, we can show that this sample size is insufficient as the relevant probability is higher than the sampling risk $\alpha$.
```{r}
sum(dpois(x = 0:1, lambda = 158 * 0.03)) < 0.05
```
However, for $n$ = 159 the relevant probability is lower than the sampling risk $\alpha$ and thus the sample size is considered to be sufficient.
```{r}
sum(dpois(x = 0:1, lambda = 159 * 0.03)) < 0.05
```
Like before, we can make this visually intuitive by showing the Poisson($k$ | 159 $\cdot$ 0.03) distribution and highlighting the probabilities for $k$ = 0 and $k$ = 1, see @fig-planning-pois-mis. The sum of these probabilities is lower than the required sampling risk $\alpha$ = 0.05.
```{r label="fig-planning-pois-mis", fig.cap="The Poisson($k$ | 159 $\\cdot$ 0.03) distribution showing the probability for $k$ = 0 and $k$ = 1 misstatements as a red bar."}
plot(plan)
```
#### Expected Misstatement Rate
When the expected misstatement rate in the sample $\theta_{exp}$ is assessed, the value for $k$ can be determined as $k$ = $n\theta_{exp}$, which consequently determines how the sample size can be calculated.
To account for the fact that $k$ can have non-integer values in this case, we use a well-known similarity between the Poisson distribution and the gamma distribution to plan the sample size. The upper bound for any Poisson($k$; $n\theta$) distributed variable can also be obtained via percentiles of the gamma(1 + $k$, $n$) distribution.
For example, the upper bound for a sample of $n$ = 10 items containing $k$ = 2 misstatements, when calculated via the traditional `poisson.test()` is:
```{r}
ub_pois <- poisson.test(x = 2, T = 10, r = 0.03, alternative = "less")$conf.int[2]
ub_pois
```
When calculated via the relationship with the gamma distribution, the upper bound is:
```{r}
ub_gamma <- qgamma(p = 0.95, shape = 1 + 2, rate = 10)
ub_gamma
```
It can be validated that the two approaches result in the same upper bound via:
```{r}
ub_pois == ub_gamma
```
This relationship between the Poisson likelihood and the gamma distribution is used under the hood in **jfa**. For example, if we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of 1.5 expected misstatements in the sample is $n$ = 185.
```{r}
planning(materiality = 0.03, expected = 1.5, likelihood = "poisson")
```
The sample size of 185 can be confirmed by checking that 185 is the minimum integer for which the 95 percent upper bound on the population misstatement rate, given 1.5 expected misstatements, falls below the performance materiality of three percent. By calculating this upper bound for $n$ = 184, we can show that this sample size is insufficient as the relevant upper bound is higher than the performance materiality $\theta_{max}$.
```{r}
qgamma(p = 0.95, shape = 1 + 1.5, rate = 184) < 0.03
```
However, for $n$ = 185 the relevant upper bound is lower than the performance materiality $\theta_{max}$ and thus the sample size is sufficient.
```{r}
qgamma(p = 0.95, shape = 1 + 1.5, rate = 185) < 0.03
```
### Bayesian Planning
Performing Bayesian planning with the Poisson likelihood requires that you specify a prior distribution for the parameter $\theta$. Practically, this means that you should provide an input for the `prior` argument in the `planning()` function.
Setting `prior = TRUE` performs Bayesian planning using a default prior conjugate to the specified `likelihood` (i.e., a gamma prior). Because this is a Bayesian analysis, the following statistical model is specified:
\begin{align}
k &\sim \text{Poisson}(n\theta) \\
\theta &\sim \text{Gamma}(\alpha, \beta)
\end{align}
The gamma prior distribution is the conjugate prior for the Poisson likelihood (see this [list](https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions) of conjugate priors), which means that the posterior distribution of $\theta$ can be determined analytically. For example, if the prior distribution for $\theta$ is:
\begin{equation}
\theta \sim \text{Gamma}(\alpha, \beta) \,\,\,\,\,\,\,\,\,\, \theta \in [0, \infty]
\end{equation}
and the auditor has observed a sample of $n$ items containing $k$ misstatements, then the posterior distribution for $\theta$ is:
\begin{equation}
\theta \sim \text{Gamma}(\alpha + k, \beta + n) \,\,\,\,\,\,\,\,\,\, \theta \in [0, \infty].
\end{equation}
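As with the beta prior, this update is straightforward to check in base R. The sketch below takes the default gamma(1, 1) prior and a hypothetical sample of $n$ = 10 items containing $k$ = 1 misstatement, and computes the 95 percent upper bound of the resulting gamma posterior.
```{r}
# Conjugate update of the gamma prior for a hypothetical sample
alpha <- 1; beta <- 1 # gamma(1, 1) prior
n <- 10; k <- 1 # hypothetical sample outcome
qgamma(p = 0.95, shape = alpha + k, rate = beta + n) # posterior upper bound
```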
For example, the command below uses a default gamma($\alpha$ = 1, $\beta$ = 1) prior distribution to plan the sample, since `planning()` is given the Poisson likelihood. If we want to achieve an assurance level of 95 percent ($\alpha$ = 0.05) for a performance materiality of $\theta_{max}$ = 0.03, then the required sample size under the assumption of zero expected misstatements in the sample is $n$ = 99.
```{r}
plan <- planning(materiality = 0.03, likelihood = "poisson", prior = TRUE)
```
The `summary()` function can be used to obtain relevant information about the planning.
```{r}
summary(plan)
```
You can inspect how the prior distribution compares to the expected posterior distribution by using the `plot()` function, see @fig-planning-gamma-nomis. The expected posterior distribution is the posterior distribution that would occur if you actually observed the planned sample containing the expected misstatements.
```{r label="fig-planning-gamma-nomis", fig.cap="The gamma prior and posterior distribution on the range [0; $\\infty$] after seeing no misstatements in the sample of 99 units."}
plot(plan)
```
The input for the `prior` argument can also be an object created by the `auditPrior` function. If `planning()` receives a prior for which there is no conjugate likelihood available, it will numerically derive the posterior distribution. For example, the command below uses a Normal(0, 0.05) prior distribution to plan the sample using the Poisson likelihood. Concretely, this means that the following statistical model is specified:
\begin{align}
k &\sim \text{Poisson}(n\theta) \\
\theta &\sim \text{Normal}(\mu = 0, \sigma = 0.05)
\end{align}
```{r}
prior <- auditPrior(method = "param", likelihood = "normal", alpha = 0, beta = 0.05)
plan <- planning(materiality = 0.03, likelihood = "poisson", prior = prior)
```
The `summary()` function can be used to obtain relevant information about the planning.
```{r}
summary(plan)
```
The resulting sample size under this prior is $n$ = 91, a reduction of eight items compared to the default gamma(1, 1) prior. @fig-planning-gamma-nomis2 shows this prior and posterior distribution.
```{r label="fig-planning-gamma-nomis2", fig.cap="The normal prior distribution in this example contains risk-reducing information. The posterior distribution has roughly the same upper bound as the one in the previous example and occurs after seeing no misstatements in the sample of 91 units."}
plot(plan)
```
## Multi-Stage Sampling
A multi-stage sampling plan, as opposed to a single-stage one, allows for an intermediate evaluation of the sample. In the first stage of a multi-stage sampling plan, the auditor selects an initial sample of size $n_1$. If this sample contains a tolerable number of misstatements (usually 0), the auditor can approve the population. However, if the sample contains more misstatements than the tolerable number, the auditor cannot approve the population. In such a scenario, the initial sample is supplemented with a second sample of $n_2$ items. If this additional sample contains a tolerable number of misstatements, the population can still be approved. If not, the population should either be rejected or a third sample of $n_3$ items should be added.
In the classical (i.e., frequentist) methodology, multi-stage sampling plans can be formulated by decomposing the sampling risk into multiple components. For example, if the auditor initially plans for $k$ = 0 misstatements but considers extending the sample if $k$ = 1 misstatement is discovered, then the probability of erroneously rejecting the hypothesis of material misstatement comprises:
- $p(k \leq 0 | n_1, \theta_{max})$: The probability of finding $k$ = 0 misstatements in the first sample of $n_1$ items; plus
- $p(k = 1 | n_1, \theta_{max})$: The probability of finding $k$ = 1 misstatement in the first sample of $n_1$ items; multiplied by
- $p(k \leq 0 | n_2, \theta_{max})$: The probability of finding $k$ = 0 misstatements in the second sample of $n_2$ items.
The sum of these probabilities should be less than or equal to the sampling risk $\alpha$.
\begin{equation}
p(k \leq 0 | n_1, \theta_{max}) + p(k = 1 | n_1, \theta_{max}) \cdot p(k \leq 0 | n_2, \theta_{max}) \leq \alpha
\end{equation}
To constrain the problem, **jfa** sets the size of the sample extension equal to the initial sample size ($n_1$ = $n_2$ = $n_s$). This implies that the sample size per stage in this two-stage sampling plan is the smallest integer $n_s$ that fulfills the following condition.
\begin{equation}
p(k \leq 0 | n_s, \theta_{max}) + p(k = 1 | n_s, \theta_{max}) \cdot p(k \leq 0 | n_s, \theta_{max}) \leq \alpha
\end{equation}
In **jfa**, multi-stage sampling plans can be calculated by supplying an integer vector to the `expected` argument of the `planning()` function, where each element gives the number of misstatements after which the corresponding stage should be extended. For example, the following code calculates the required sample size if the auditor initially plans for $k$ = 0 misstatements but considers extending the sample if $k$ = 1 misstatement is discovered. The required sample size per stage is $n_s$ = 103, resulting in a total sample size (if both stages are necessary) of $n$ = 206.
```{r}
multiplan <- planning(materiality = 0.03, likelihood = "binomial", expected = c(1, 0))
print(multiplan)
```
To confirm this calculation, we need to ensure that the probability of incorrectly rejecting the null hypothesis of material misstatement, under the binomial distribution, is less than the sampling risk $\alpha$ = 0.05.
```{r}
p_k0_n1 <- pbinom(q = 0, size = 103, prob = 0.03)
p_k1_n1 <- dbinom(x = 1, size = 103, prob = 0.03)
p_k0_n2 <- pbinom(q = 0, size = 103, prob = 0.03)
p_k0_n1 + p_k1_n1 * p_k0_n2 < 0.05
```
The minimum sample size per stage, $n_s$ = 103, is only slightly larger than the minimum sample size for the first stage if the auditor opts for a single-stage sampling plan expecting $k$ = 0, which is $n$ = 99.
```{r}
planning(materiality = 0.03, likelihood = "binomial", expected = 0)
```
However, the total sample size, $n$ = 206, is considerably larger than the minimum sample size if the auditor opts for a single-stage sampling plan expecting $k$ = 1, which is $n$ = 157. This illustrates the cost of allowing a sample size extension in the classical approach.
```{r}
planning(materiality = 0.03, likelihood = "binomial", expected = 1)
```
The sample size per stage, $n_s$ = 103, is determined under the constraint $n_1$ = $n_2$ = $n_s$. However, this equality is not mandatory. If the auditor opts for an initial sample size larger than $n_s$, they can decrease the size of the follow-up sample. To illustrate this tradeoff, you can use the `plot()` function, see @fig-planning-multi. The figure includes textual information specifying the total sample size under the equality constrained multi-stage sampling plan (red) and under multiple choices for the initial sample size $n_1$.
```{r label="fig-planning-multi", fig.cap="The preferred initial sample size $n_1$ versus the required follow-up sample size $n_2$."}
plot(multiplan)
```
As another example, consider a three-stage sampling plan where the auditor plans to extend the sample after finding $k$ = 3 misstatements in the first stage, extend further if they find $k$ = 1 misstatement in the second stage, and still be able to approve the population if they find $k$ = 0 misstatements in the third stage. The required sample size in each of the three stages is $n_s$ = 208.
```{r}
planning(materiality = 0.03, likelihood = "binomial", expected = c(3, 1, 0))
```
This can be confirmed by ensuring that the probability of incorrectly rejecting the null hypothesis of the population containing three percent misstatement is below the acceptable sampling risk $\alpha$ = 0.05.
```{r}
p_k2_n1 <- pbinom(q = 2, size = 208, prob = 0.03)
p_k3_n1 <- dbinom(x = 3, size = 208, prob = 0.03)
p_k0_n2 <- pbinom(q = 0, size = 208, prob = 0.03)
p_k1_n2 <- dbinom(x = 1, size = 208, prob = 0.03)
p_k0_n3 <- pbinom(q = 0, size = 208, prob = 0.03)
p_k2_n1 + p_k3_n1 * p_k0_n2 + p_k3_n1 * p_k1_n2 * p_k0_n3 < 0.05
```
Unlike the previous example, the minimum sample size per stage, $n_s$ = 208, is not larger than the minimum sample size for the first stage if the auditor decides to use a single-stage sampling plan expecting $k$ = 2 misstatements, which is also $n$ = 208.
```{r}
planning(materiality = 0.03, likelihood = "binomial", expected = 2)
```
The Bayesian approach to audit sampling uses the posterior distribution from the observed sample as a prior distribution for a potential second sample in an iterative manner. For this reason, this approach does not affect the sampling risk based on the number of tests [@Rouder2014]. Hence, in the Bayesian framework, it is entirely appropriate to continue sampling until there is enough evidence to make a decision [@Edwards1963]. This means that if you find $k$ = 1 misstatement in the initial $n_1$ = 99 samples, calculating the sample size extension simply involves computing the single-stage sample size under the expectation of $k$ = 1 tolerable misstatement. The total required sample size then becomes $n$ = 158, and the sample size extension therefore amounts to $n_2$ = 158 - 99 = 59 items.
```{r}
planning(materiality = 0.03, expected = 1, prior = TRUE)
```
## Prior Distributions
In principle, any distribution that covers the range of $\theta$ can be used as a prior distribution. However, some distributions are more suitable than others. For instance, the beta-binomial, beta and gamma distributions are all commonly used because they are so-called conjugate distributions, that is, they stay in the same family when updated by the data. The **jfa** package provides the ability to construct a prior distribution for audit sampling. More specifically, the `auditPrior()` function is used to specify a prior distribution that can be used as input for the `prior` argument in the `planning()` (and `evaluation()`) function. Below is an enumeration of the several ways that a prior distribution can be constructed using the `auditPrior` function.
### Default Prior
The default prior distributions are created using `method = 'default'`. There are no explicit rules for what constitutes a default prior. However, **jfa**'s default priors satisfy two criteria. First, they contain relatively little information about the population misstatement and, second, they are proper (i.e., they integrate to 1). For completeness, all default priors in **jfa** are provided in the following list.
* `likelihood = 'poisson'`: gamma($\alpha$ = 1, $\beta$ = 1)
* `likelihood = 'binomial'`: beta($\alpha$ = 1, $\beta$ = 1)
* `likelihood = 'hypergeometric'`: beta-binomial($N$, $\alpha$ = 1, $\beta$ = 1)
* `likelihood = 'normal'`: normal($\mu$ = 0, $\sigma$ = 1000)
* `likelihood = 'uniform'`: uniform(*min* = 0, *max* = 1)
* `likelihood = 'cauchy'`: Cauchy($x_0$ = 0, $\gamma$ = 1000)
* `likelihood = 't'`: Student-t(*df* = 1)
* `likelihood = 'chisq'`: chi-squared(*df* = 1)
* `likelihood = 'exponential'`: exponential($\lambda$ = 1)
For instance, to create a default prior distribution using the binomial likelihood (i.e., a beta(1, 1) prior), you can use the following code, which stores the prior in the `prior` object. You can then use the `summary()` function to obtain relevant information about the prior distribution.
```{r}
prior <- auditPrior(method = "default", likelihood = "binomial")
summary(prior)
```
All prior distributions can be visually inspected via the `plot()` function, see @fig-planning-prior-default.
```{r label="fig-planning-prior-default", fig.cap="The default beta(1, 1) prior distribution."}
plot(prior)
```
Furthermore, the `predict()` function produces the predictions of the prior distribution on the data level for a sample of `n` items. For example, the command below requests the prediction of the default beta(1, 1) prior for a hypothetical sample of 6 items.
```{r}
predict(prior, n = 6)
```
The predictions of the prior distribution can be visualized using the `plot()` function, see @fig-planning-prior-predict.
```{r label="fig-planning-prior-predict", fig.cap="The predictions of the beta(1, 1) prior distribution concerming the possible misstatements in an an intended sample of $n$ = 6."}
plot(predict(prior, n = 6))
```
### Parametric Prior
You can manually specify the parameters of the prior distribution with `method = 'param'` and the `alpha` and `beta` arguments, which correspond to the first and (optionally) second parameter of the prior as described above. For example, the commands below create a beta(2, 10) prior distribution, a normal(0.025, 0.05) prior distribution and a Student-t(0.01) prior distribution.
```{r}
auditPrior(method = "param", likelihood = "binomial", alpha = 2, beta = 10)
auditPrior(method = "param", likelihood = "normal", alpha = 0.025, beta = 0.05)
auditPrior(method = "param", likelihood = "t", alpha = 0.01)
```
### Improper Prior
You can construct an improper prior distribution with classical properties using `method = 'strict'`. The posterior distribution resulting from this prior yields the same results as the classical methodology with respect to sample sizes and upper limits, but is only proper once a single non-misstated unit is present in the sample [@Derks2022b]. For example, the command below creates an improper beta(1, 0) prior distribution. This method requires the `poisson`, `binomial` or `hypergeometric` likelihood.
```{r}
auditPrior(method = "strict", likelihood = "binomial")
```
### Impartial Prior
You can incorporate the assumption that tolerable misstatement is equally likely as intolerable misstatement [@Derks2022b] using `method = 'impartial'`. For example, the command below creates an impartial beta prior distribution for a performance materiality of five percent. This method requires that you specify a value for the `materiality`.
```{r}
auditPrior(method = "impartial", likelihood = "binomial", materiality = 0.05)
```
### Probability of Tolerable Misstatement
You can manually assign prior probabilities to the hypothesis of tolerable misstatement and the hypothesis of intolerable misstatement [@Derks2021] with `method = 'hyp'` in combination with `p.hmin`. For example, the command below incorporates the information that the hypothesis of tolerable misstatement has a prior probability of 60 percent into a beta distribution. Naturally, this method requires that you specify a value for the `materiality`.
```{r}
auditPrior(method = "hyp", likelihood = "binomial", materiality = 0.05, p.hmin = 0.6)
```
### Audit Risk Model
You can translate risk assessments from the Audit Risk Model (inherent risk and internal control risk) into a prior distribution [@Derks2021] using `method = 'arm'` in combination with the `ir` and `cr` arguments. For example, the command below incorporates the information that the inherent risk is equal to 90 percent and internal control risk is equal to 60 percent into a beta prior distribution. This method requires the `poisson`, `binomial` or `hypergeometric` likelihood.
```{r}
auditPrior(method = "arm", likelihood = "binomial", materiality = 0.05, ir = 0.9, cr = 0.6)
```
### Bayesian Risk Assessment Model
You can incorporate information about the mode and the upper bound of the prior distribution using `method = 'bram'`. For example, the code below incorporates the information that the mode of the prior distribution is one percent and the upper bound is 60 percent into a beta prior distribution. This method requires the `poisson`, `binomial` or `hypergeometric` likelihood.
```{r}
auditPrior(method = "bram", likelihood = "binomial", materiality = 0.05, expected = 0.01, ub = 0.6)
```
### Earlier Sample
You can incorporate information from an earlier sample into the prior distribution [@Derks2021] using `method = 'sample'` in combination with `x` and `n`. For example, the command below incorporates the information from an earlier sample of 30 items in which 0 misstatements were found into a beta prior distribution. This method requires the `poisson`, `binomial` or `hypergeometric` likelihood.
```{r}
auditPrior(method = "sample", likelihood = "binomial", x = 0, n = 30)
```
### Weighted Earlier Sample
You can incorporate information from last year's results, weighted by a factor [@Derks2021], into the prior distribution using `method = 'power'` in combination with `x` and `n`. For example, the command below incorporates the information from last year's results (a sample of 58 items in which 0 misstatements were found), weighted by a factor `delta` = 0.7, into a beta prior distribution. This method requires the `poisson`, `binomial` or `hypergeometric` likelihood.
```{r}
auditPrior(method = "power", likelihood = "binomial", x = 0, n = 58, delta = 0.7)
```
### Nonparametric Prior
You can construct a nonparametric prior from a set of draws using `method = 'nonparam'` in combination with `samples`. For example, the command below creates a prior based on 1000 draws from a beta(1, 10) distribution. The `likelihood` argument is not required and will be ignored in this method.
```{r}
auditPrior(method = "nonparam", samples = stats::rbeta(1000, 1, 10))
```
## Practical Examples
This section contains practical examples of how to conduct the planning of statistical audit samples and demonstrates how to set up a prior distribution based on various types of relevant audit information.
### Audit Risk Model
In our first example, an auditor is performing tests of details on a population of the auditee. For instance, let's say an auditor is performing an audit on a company's accounts payable transactions. The company has a total of $N$ = 1000 accounts payable transactions for the year. Rather than testing all 1000 transactions, the auditor can choose to test a sample of the transactions. The performance materiality for the accounts payable is set to $\theta_{max}$ = 0.03 (3 percent), and the audit risk is set to $\alpha$ = 0.05, or 5 percent. Based on the results of last year's audit, where the most likely estimate of the misstatement was one percent, the auditor wants to tolerate one percent misstatements in the sample before giving an unqualified opinion on the population.
```{r}
ar <- 0.05 # Audit risk
materiality <- 0.03 # Performance materiality
expected <- 0.01 # Expected misstatement rate
```
Before tests of details, the auditor has assessed the risk of material misstatement via the Audit Risk Model. In this example, the auditor has assessed the effectiveness of the company's internal controls, such as its segregation of duties and its risk management processes, and has determined that they are sufficient to prevent or detect material misstatements. Because the internal control systems were effective, the auditor assesses the control risk as medium. The auditor's firm defines the risk categories low, medium, and high as 50 percent, 60 percent, and 100 percent, respectively. According to the Audit Risk Model, the detection risk can be calculated as a function of the audit risk, the inherent risk and the control risk.
```{r}
ir <- 1 # Inherent risk
cr <- 0.6 # Control risk
dr <- ar / (ir * cr) # Detection risk
dr
```
By using the detection risk of 8.33 percent as the sampling risk for this population, the auditor can plan for a sample while taking into account the risk-reducing information. The required minimum sample size is 174 in this case.
```{r}
planning(materiality = 0.03, expected = expected, conf.level = 1 - dr)
```
The example above is a frequentist one. However, the auditor is free to apply a Bayesian philosophy in planning the sample. For example, the risk assessments from the ARM can be incorporated into a prior distribution. This can be done using `method = "arm"` in the `auditPrior()` function, which takes the values of the inherent risk probability `ir` and the control risk probability `cr`. Hence, the prior distribution in this example can be constructed using the following command:
```{r}
prior <- auditPrior(method = "arm", materiality = 0.03, expected = expected, ir = ir, cr = cr)
```
The `summary()` function can be used to obtain relevant information about the prior distribution.
```{r}
summary(prior)
```
Furthermore, the prior distribution can be visualized with a call to the `plot()` function, see @fig-planning-prior-example1.
```{r label="fig-planning-prior-example1", fig.cap="The prior distribution constructed on the basis of the assessments of inherent risk and control risk."}
plot(prior)
```
By using the prior distribution to incorporate the assessments of the inherent risk and the control risk, the auditor can plan a sample while taking into account the risk-reducing information. The required minimum sample size is also 174 in this case.
```{r}
planning(materiality = 0.03, expected = expected, conf.level = 1 - ar, prior = prior)
```
### Benchmark Analysis
The auditor may incorporate information obtained through analytical procedures [@Derks2021], such as a benchmark analysis, into the prior distribution for $\theta$. While we have previously discussed methods for constructing a prior distribution based on existing knowledge, there is no set procedure for incorporating information obtained through analytical procedures, as these procedures can vary significantly depending on the type of information being incorporated into the prior distribution. Therefore, it is important to thoroughly substantiate the data and assumptions used in this approach and to carefully consider how these assumptions are incorporated into the prior distribution.
One way to construct a prior distribution on the basis of data is through the use of regression models, such as benchmarking the relationship between sales and costs of sales within the auditee's specific industry sector. The **jfa** package includes a data set `benchmark` that can be used for this example.
```{r}
data(benchmark)
head(benchmark)
```
The auditee's sum of sales is $298,112,312 and their sum of booked cost of sales is $223,994,405. This is indicated by a blue dot in @fig-planning-scatter-1 below, which visualizes the industry sales versus the cost of sales.
```{r}
C_real <- 223994405
```
```{r label="fig-planning-scatter-1", fig.cap="Scatter plot of the industry sales versus the cost of sales.", echo = FALSE, fig.width = 7, fig.height = 6, fig.align = "center"}
breaks <- pretty(c(benchmark$sales, benchmark$costofsales))
labels <- paste0("$", format(breaks / 1000000, scientific = FALSE), "M")
p <- ggplot(data = benchmark, mapping = aes(x = sales, y = costofsales)) +
geom_point(shape = 21, fill = "darkgray", size = 2) +
geom_point(
data = data.frame(x = 298112312, y = C_real), shape = 21,
mapping = aes(x = x, y = y), fill = "dodgerblue", size = 2.5,
inherit.aes = FALSE, show.legend = FALSE
) +
scale_x_continuous(
name = "Sales", breaks = breaks,
limits = range(breaks), labels = labels
) +
scale_y_continuous(
name = "Cost of sales", breaks = breaks,
limits = range(breaks), labels = labels
) +
geom_segment(x = -Inf, xend = -Inf, y = min(breaks), yend = max(breaks)) +
geom_segment(x = min(breaks), xend = max(breaks), y = -Inf, yend = -Inf)
p <- jfa:::.theme_jfa(p)
p
```
The relationship between the sales $S$ and the cost of sales $C$ can be modelled by a linear equation:
\begin{equation}
C = \beta_0 + \beta_1 \cdot S + \epsilon.
\end{equation}
In practice, this relationship is often more complex than is presented above, and the auditor must carefully construct and evaluate the applied regression model. However, for ease of understanding, we will continue our example with this toy model. The auditor can estimate the regression model using the following command:
```{r}
fit <- lm(costofsales ~ 1 + sales, data = benchmark)
summary(fit)
```
The predicted cost of sales for the auditee, based on the industry benchmark, can be computed as follows:
```{r}
C_pred <- predict(fit, newdata = data.frame(sales = 298112312), interval = "prediction", level = 0.90)[1]
C_pred
```
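The command above keeps only the point prediction (the first element of the output). The full output of `predict.lm()` also contains the lower and upper bounds of the 90 percent prediction interval (the `lwr` and `upr` columns), which convey the uncertainty in the benchmark prediction.
```{r}
# Inspect the full prediction interval instead of only the point estimate
predict(fit, newdata = data.frame(sales = 298112312), interval = "prediction", level = 0.90)
```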
The fitted regression line and the predicted cost of sales (red dot) are visualized in @fig-planning-scatter-2 below.
```{r label="fig-planning-scatter-2", fig.cap="Scatter plot of the industry sales versus the cost of sales including the regression line and the auditee's (predicted) cost of sales.", echo = FALSE, fig.width = 7, fig.height = 6, fig.align = "center"}
breaks <- pretty(c(benchmark$sales, benchmark$costofsales))
labels <- paste0("$", format(breaks / 1000000, scientific = FALSE), "M")
p <- ggplot(data = benchmark, mapping = aes(x = sales, y = costofsales)) +
geom_smooth(
stat = "smooth", formula = y ~ x, method = "lm",
colour = "black", linewidth = 0.5
) +
geom_point(shape = 21, fill = "darkgray", size = 2) +
geom_point(
data = data.frame(x = 298112312, y = C_real), shape = 21,
mapping = aes(x = x, y = y), fill = "dodgerblue", size = 2.5,
inherit.aes = FALSE, show.legend = FALSE
) +
geom_point(
data = data.frame(x = 298112312, y = C_pred), shape = 21,
mapping = aes(x = x, y = y), fill = "firebrick", size = 2.5,
inherit.aes = FALSE, show.legend = FALSE
) +
scale_x_continuous(
name = "Sales", breaks = breaks,
limits = range(breaks), labels = labels
) +
scale_y_continuous(
name = "Cost of sales", breaks = breaks,
limits = range(breaks), labels = labels
) +
geom_segment(x = -Inf, xend = -Inf, y = min(breaks), yend = max(breaks)) +
geom_segment(x = min(breaks), xend = max(breaks), y = -Inf, yend = -Inf)
p <- jfa:::.theme_jfa(p)
p
```
The prior distribution can be justified by the data and the numerical prediction of the auditee's cost of sales. In this analytical procedure, the prior distribution on $\theta$ can utilize the relative error distribution from the linear regression. This relative error distribution, which is a Normal($\mu$, $\sigma$) distribution, captures the uncertainty in the prediction of the cost of sales through the use of linear regression, scaled to be a percentage of the total cost of sales. The mean $\mu$ of the prior distribution is determined by the relative deviation of the auditee's booked cost of sales from the predicted cost of sales according to the benchmark data, $\frac{C - \hat{C}}{C}$.
```{r}
mu <- (C_real - C_pred) / C_real
mu
```
The standard deviation of the prior distribution is expressed through the standard deviation of the distribution of $\epsilon$:
```{r}
stdev <- sd(fit$residuals) / C_real
stdev
```
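An essentially equivalent estimate can be obtained from the residual standard error of the fitted model via `sigma()`, which divides by the residual degrees of freedom rather than $n - 1$; the result should closely match the value above.
```{r}
# Alternative: the residual standard error, scaled by the booked cost of sales
sigma(fit) / C_real
```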
The Normal(0.019, 0.05) prior distribution can be constructed through a call to `auditPrior()`, where the likelihood of the prior is specified as `normal`. We call the function with `method = "param"` to manually specify the parameters of the prior distribution.
```{r}
prior <- auditPrior(method = "param", likelihood = "normal", alpha = mu, beta = stdev)
summary(prior)
```
The specified prior distribution can be visualized using the `plot()` function, see @fig-planning-prior-example2.
```{r label="fig-planning-prior-example2", fig.cap="The prior distribution constructed on the basis of the benchmark analysis."}
plot(prior)
```
The performance materiality for this example is set to $\theta_{max}$ = 0.05, or five percent, and the audit risk is set to $\alpha$ = 0.05, or five percent. By using this prior distribution, the required minimum sample size is $n$ = 50.
```{r}
plan <- planning(materiality = 0.05, likelihood = "binomial", prior = prior)
plan
```
You can inspect how the prior distribution compares to the expected posterior distribution by using the `plot()` function, see @fig-planning-prior-posterior-example2.
```{r label="fig-planning-prior-posterior-example2", fig.cap="The prior and expected posterior distribution for this example after seeing a sample of $n$ = 50 containing no misstatements."}
plot(plan)
```
By using a frequentist approach, the required minimum sample size is $n$ = 59. Thus, by performing the analytical procedure and incorporating this information into the prior distribution, the auditor has achieved a reduction in sample size of 9 items.
```{r}
plan <- planning(materiality = 0.05, likelihood = "binomial")
plan
```
### Predictive Modeling
As a final example, we consider a case where the auditor incorporates information about the probability of misstatement obtained through a predictive analysis into the prior distribution for $\theta$. In this example, the auditor is conducting an audit on a long-term client that they have been working with for fifteen years. Hence, the auditor has access to a historical data set called `history`, which contains the samples from all of the previous audits that have been conducted on this auditee over the past fifteen years.
```{r}
history <- read.csv("https://github.com/koenderks/sasr/raw/master/data/ch3_history.csv", colClasses = c("factor", "numeric", "numeric"))
head(history)
```
It should be noted that for all historical sample items, there are three known characteristics: whether they contained a misstatement (designated as `k`), the number of full-time equivalent employees (FTEs) who had access to that item within the internal computer systems of the auditee (designated as `ftes`), and the number of days that the item was outstanding (designated as `days`). Additionally, the `ftes` and `days` characteristics are also known for all items in the current year's population. Of course, it is unknown if any misstatements exist within the population of the current year.
```{r}
population <- read.csv("https://github.com/koenderks/sasr/raw/master/data/ch3_population.csv")
head(population)
```
The objective of this analytical procedure is to forecast potential misstatements within the population of the current year. In order for the information obtained through this procedure to serve as prior knowledge in a Bayesian analysis, the procedure must yield a distribution of the probability of misstatement. Therefore, the auditor employs a machine learning technique known as Random Forest [@Hastie2009] to learn the relationship between misstatements, the number of full-time equivalent employees, and the number of outstanding days in the historical data set.
```{r}
set.seed(1)
fit <- randomForest::randomForest(formula = k ~ ftes + days, data = history)
```
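Before relying on the model's predictions, the auditor can gauge its performance by printing the fitted object, which reports the out-of-bag (OOB) error estimate and the confusion matrix computed by `randomForest`.
```{r}
# The out-of-bag error estimate gives a built-in indication of accuracy
print(fit)
```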
The auditor specifically uses the random forest technique due to its ability to provide a distribution of the misstatement probabilities. The probabilistic predictions for the unseen misstatements in the `population` data can be obtained by calling the `predict()` function with the argument `type = "prob"`.
```{r}
predictions <- predict(object = fit, newdata = population, type = "prob")
```
These predictions come in a probabilistic format, which means that for each item in the population of the current year there is a predicted probability that the item is misstated. These probabilities are stored in the second column of the `predictions` matrix.
```{r}
head(predictions)
```
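As a minimal sketch of how these probabilities can be examined, the second column can be extracted and summarized to get a feel for the predicted misstatement risk across the population (the variable name `prob_misstatement` is ours, not part of the package).
```{r}
# Extract the predicted misstatement probabilities and summarize them
prob_misstatement <- predictions[, 2]
summary(prob_misstatement)
```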