---
title: "Data on noise"
author: "pdt"
date: "February 20, 2018"
output: html_document
---
There is an issue of extreme noise at some sites.
We may be able to use information provided in the database to determine which runs are likely invalid, or which time periods should be removed from the database because of these problems.
Most of these sites should be re-run with more stringent parameters in the tag-finder.
We should re-run those that we can immediately, and manually retire batches from the database.
The proposed workflow at present is, for each processed batch, to calculate the proportion of runs of length 2 or 3, the number of tags detected, and the number of different projects. We can come up with definitive criteria later, but the worst batches should be obvious; those batches should be re-run, and the previous data retired from the database. A rough sketch of this screening follows below.
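The following is a minimal sketch of that per-batch screening, not a definitive implementation: it assumes a flat data frame with one row per hit and the alltags columns batchID, runID, runLen, motusTagID, and tagProjID (built later in this document), and the runLen cutoff of 3 is illustrative.
```{r, eval=FALSE}
## Sketch only: per-batch screening stats. `df` is assumed to be a flat data
## frame of hits with columns batchID, runID, runLen, motusTagID, tagProjID.
screen.batches <- function(df) {
  df %>%
    distinct(batchID, runID, runLen, motusTagID, tagProjID) %>% ## collapse hits to runs
    group_by(batchID) %>%
    summarize(p.short = mean(runLen <= 3),      ## proportion of short runs
              n.tag = n_distinct(motusTagID),   ## number of tags detected
              n.proj = n_distinct(tagProjID)) %>%
    arrange(desc(p.short))                      ## worst batches first
}
```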
A future enhancement will provide the ability for a user to decide themselves whether a batch should be retired/replaced from their local database.
NB. For a list of the parameters for running noisy sites, see:
https://github.com/jbrzusto/find_tags/blob/new_server/find_tags_motus.cpp
Even with more stringent tag finding, users will still be required to filter their data, since some noise is intermittent, and we don't know the trade-off between the stringency of tag-finder parameters and the prevalence of false positives.
Information on the amount of noise at a site can be determined from three main sources:
A) On the receiver API
- the number of pulses per hour: noisy periods will have a large number (relative to the number of tags detected).
- the number of tags detected: noisy periods will have large numbers of tags.
- the distribution of run lengths: noisy periods will have many runs of length 2 and 3.
Ideally, when a user calls for data on a given receiver, they would be provided with these data (it is also a convenient way to get an empirical estimate of whether a receiver was working or not).
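As a concrete way to get at A) without the API, the receiver's .motus file should already hold hourly pulse counts. A minimal sketch, assuming the SQLite database contains a pulseCounts table with columns batchID, ant, hourBin, and count, where hourBin is hours since the Unix epoch (check this against the actual schema):
```{r, eval=FALSE}
## Sketch: hourly pulse counts per antenna, straight from the receiver
## database. The pulseCounts table and its columns are assumptions here;
## hourBin is taken to be hours since 1970-01-01. nsy1.sql is opened below.
pulse.df <- tbl(nsy1.sql, "pulseCounts") %>%
  collect() %>%
  mutate(ts.h = as.POSIXct(hourBin * 3600, origin = "1970-01-01", tz = "GMT"))
p <- ggplot(data = pulse.df, aes(ts.h, count, colour = factor(ant)))
p + geom_line()
```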
B) In the local tag database
Filtering after this point is mostly for assessing the veracity of short runLen hits.
Here we examine this empirically, using known noisy and quiet sites.
First, get some data from a known noisy site and a known quiet site. We should expand this to multiple noisy/quiet sites, to see to what extent anything we find generalizes.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)
require(motus)
require(lubridate)  ## for year(); lubridate is not attached by require(tidyverse)
Sys.setenv(TZ = "GMT")
```
Get data from a noisy site and a non-noisy site. These were suggested by Zoe. I haven't (as of 20/2/2018) downloaded all of Earl Rowe.
Here are some others (from Zoe):
- Koffler - SG-3214BBBK6103, deployment 2729
- Formosa - SG-3615BBBK1E34, deployments 3071, 3742, 3984
- Old Cut - SG-5113BBBK2972, deployments 2549, 1124, 746, 634, 1503. This might be intermittent; most noise is from a high number of true tags in the area.
- West Port Bruce - SG-3214BBBK4137, deployment 2292. Also might be intermittent due to lots of tags in the area.
```{r, message=FALSE}
tagme("SG-3214BBBK6103") ## Koffler
tagme("SG-5113BBBK2786") ## Earl Rowe PP
tagme("SG-3615BBBK1E34", new=TRUE) ## Formosa
tagme("SG-5113BBBK2972", update=TRUE) ## Old Cut
```
```{r}
nsy1.sql <- tagme("SG-3214BBBK6103", update=TRUE)
nsy2.sql <- tagme("SG-5113BBBK2972", update=TRUE)
nsy3.sql <- tagme("SG-3615BBBK1E34", update=TRUE)
qte1.sql <- tagme("SG-5113BBBK2786", update=TRUE)
nsy1.tbl <- tbl(nsy1.sql, "alltags")
nsy2.tbl <- tbl(nsy2.sql, "alltags")
nsy3.tbl <- tbl(nsy3.sql, "alltags")
qte1.tbl <- tbl(qte1.sql, "alltags")
## select variables and make a posix time
## since we are doing this across a bunch of sites, make a little function
## to pull the tbl into a flat file
get.df <- function(in.tbl) {
  out.df <- select(in.tbl,
                   motusTagID, hitID, runID, batchID, noise, ts, sig, runLen,
                   freqsd, sigsd, slop, burstSlop, port, site=recvSiteName) %>%
    distinct() %>% collect() %>%
    mutate(ts = as.POSIXct(ts, origin="1970-01-01", tz = "GMT"),
           year = year(ts))
  return(out.df)
}
nsy1.df <- get.df(nsy1.tbl) %>%
  mutate(site = "Koffler",
         type = "Noisy")
nsy2.df <- get.df(nsy2.tbl) %>%
  mutate(site = "Old Cut",
         type = "Noisy")
nsy3.df <- get.df(nsy3.tbl) %>%
  mutate(site = "Formosa",
         type = "Noisy")
qte1.df <- get.df(qte1.tbl) %>%
  mutate(site = "Earl Rowe",
         type = "Quiet")
## centre time within each run, and clean up the funny burstSlop (issue for JB?)
all.df <- bind_rows(nsy1.df, nsy2.df, nsy3.df, qte1.df) %>%
  group_by(runID) %>%
  mutate(ts.z = ts - mean(ts),
         ts.h = plyr::round_any(ts, 3600),
         burstSlop = ifelse(burstSlop < -2, 0, burstSlop))
```
Save the state so one can just pick up here ...
```{r}
saveRDS(all.df, "all.df.RDS")
```
```{r}
all.df <- readRDS("all.df.RDS")
```
The first question is how to identify where there is noise. That can be done by simply looking at the proportion of runLen = 2 and runLen = 3 within each batch. If a batch is all or mostly noise, the run lengths should follow an exponential (geometric, since run lengths are discrete) distribution.
There is probably some theoretical way to figure out what the parameters of that distribution should be, but we can also just explore it empirically.
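One quick empirical check, sketched below rather than run: fit a geometric distribution to runLen - 2 by maximum likelihood and compare observed and fitted proportions (all.df comes from the chunks above; the site name is just an example).
```{r, eval=FALSE}
## Sketch: do run lengths at a noisy site look geometric? Runs start at
## length 2, so model runLen - 2 ~ Geom(p); the MLE is p = 1/(mean(runLen) - 1).
rl <- all.df %>%
  filter(site == "Koffler") %>%   ## example site; try each in turn
  distinct(runID, runLen) %>%     ## one row per run
  pull(runLen)
p.hat <- 1 / (mean(rl) - 1)
rbind(observed = as.numeric(table(factor(rl, levels = 2:6))) / length(rl),
      fitted = dgeom(0:4, prob = p.hat))  ## proportions for runLen = 2..6
```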
So first look at the mean runLen across batches, for noisy and not-noisy sites.
```{r}
noise.df <- group_by(all.df, site, batchID) %>%
  summarize(mn.rl = mean(runLen),
            min.ts = min(ts),
            max.ts = max(ts),
            length = difftime(max.ts, min.ts, units="hours"),
            n.tag = length(unique(motusTagID)))
```
The problem is that the range of lengths of batches is huge. So, although we eventually need to re-run batches, we first need to look at shorter time intervals.
```{r}
noise2.df <- group_by(all.df, site, batchID, ts.h) %>%
  summarize(mn.rl = mean(runLen),
            min.ts = min(ts),
            max.ts = max(ts),
            length = difftime(max.ts, min.ts, units="hours"),
            n.tag = length(unique(motusTagID)))
```
```{r}
p <- ggplot(data=noise2.df, aes(log(mn.rl), ts.h, colour=site))
p + geom_point()
```
```{r}
p <- ggplot(data=noise2.df, aes(log10(mn.rl), log10(n.tag), colour=factor(batchID)))
p + geom_jitter() + facet_wrap(~site)
```
Merge the data on noise back into the main data frame.
```{r}
## why isn't tmp.df the same size as all.df? Different batches maybe?
## (likely: noise2.df is keyed by site, batchID and ts.h, but the join below
## uses only site and ts.h, so hours spanning more than one batch get duplicated)
tmp.df <- left_join(all.df, noise2.df, by=c("site", "ts.h")) %>%
  ungroup()
## need to change this back to all.df here once I figure out where the
## extra hits are coming from.
```
TODO: change both.df to all.df from here on, and go back to the narrative of site first, then run, and so on.
Some simple first steps.
Check with John, but the meanings of the relevant variables are ...
- noise: an estimate of underlying noise?
- freqsd: the SD of the frequency offset, within a hit
- sigsd: the SD of the signal strength, within a hit
- slop: the sum? of the absolute differences between the expected and actual times of pulses
- burstSlop: like slop, but between hits (note that some values are off by a factor of -25 or -10)
We can consider 'noise' then to be within a tag (e.g. at the level of a hit or a run) or among tags (e.g. within a duration or a batch).
Initially, we can readily see that the distribution of runLen is very different at the two types of sites. This is expected, since the noisy sites are mostly false positives, which should follow a Poisson-like distribution.
This also shows that there are two reasons for trying to sort out the issue of noise: one is to filter good data from bad sites (e.g. which of the longer runs at the noisy sites are actually valid), and the other is to filter bad data from good sites (e.g. which of the short runs at the quiet site are actually good).
```{r}
both.df <- all.df   ## stop-gap per the TODO above, so the code below runs
both.df %>% filter(runLen < 10) %>%
  select(runLen, runID, type) %>%
  distinct() %>%
  group_by(runLen, type) %>%
  tally() %>%
  spread(type, value=n)
```
We can start by summarizing each of the stats above by runLen.
```{r}
statsum <- both.df %>%
  group_by(runLen, type) %>%
  summarize(
    mn.slop = mean(slop),
    var.slop = var(slop),
    mn.bslop = mean(burstSlop),
    var.bslop = var(burstSlop),
    mn.freqsd = mean(freqsd),
    var.freqsd = var(freqsd),
    mn.noise = mean(noise),
    var.noise = var(noise),
    mn.sigsd = mean(sigsd),
    var.sigsd = var(sigsd)
  )
statsum
```
This shows that (at least for this limited data set) there is potentially some information in each of these statistics.
We can plot some of these statistics to see how they vary with runLen, filtering out the very long runs (which mostly come from the quiet site).
```{r}
statsum <- gather(statsum, key="Variable", value="value", -runLen, -type)
p <- ggplot(data=filter(statsum, runLen < 75),
            aes(runLen, value, colour=type))
p + geom_point() + geom_line() + facet_wrap(~Variable, scales="free", ncol=5)
```
From the plots above, slop and freqsd seem the most promising. Each shows a decreasing mean with runLen (where we are in effect treating runLen as a proxy for validity, on the realistic assumption that long runs of false positives are less plausible than short ones).
Let's look at the distribution of mean slop, within runs, for noisy sites with short run lengths. What we're looking for here is a (possibly) bi-modal relationship, suggesting an easy way to distinguish good from bad runs.
```{r}
slop.df <- filter(both.df, type == "Noisy") %>%
  group_by(runID, runLen) %>%
  summarize(mn.slop = mean(slop),
            mn.freqsd = mean(freqsd))
p <- ggplot(data=slop.df, aes(mn.slop, mn.freqsd))
p + geom_point(alpha=0.1, size=0.4) + facet_wrap(~(runLen > 3))
```
There do appear to be two groups, most obviously at the longer runLens: those with mn.slop < 0.0015 and mn.freqsd < 0.15 (approximately). We could classify all of the points this way and see whether any of the other statistics have explanatory power towards that classification.
```{r}
z.var <- function(in.var) {return((in.var - mean(in.var))/sd(in.var))}
class.df <- both.df %>%
  group_by(runLen, runID, type) %>%
  summarize(
    mn.slop = mean(slop),
    var.slop = var(slop),
    mn.bslop = mean(burstSlop),
    var.bslop = var(burstSlop),
    mn.freqsd = mean(freqsd),
    var.freqsd = var(freqsd),
    mn.noise = mean(noise),
    var.noise = var(noise),
    mn.sigsd = mean(sigsd),
    var.sigsd = var(sigsd)
  ) %>%
  ungroup() %>%
  mutate(good = ifelse(mn.freqsd < 0.15 & mn.slop < 0.0015, TRUE, FALSE),
         z.slop = z.var(mn.slop),
         z.bslop = z.var(mn.bslop),
         z.freqsd = z.var(mn.freqsd),
         z.noise = z.var(mn.noise),
         z.sigsd = z.var(mn.sigsd),
         z.vslop = z.var(var.slop),
         z.vbslop = z.var(var.bslop),
         z.vfreqsd = z.var(var.freqsd),
         z.vnoise = z.var(var.noise),
         z.vsigsd = z.var(var.sigsd))
```
A first model shows the obvious -- there are many more good runs in the quiet site than in the noisy site.
```{r}
m1 <- glm(good ~ z.bslop + z.noise + z.sigsd + runLen + type,
          binomial, data=class.df)
anova(m1)
summary(m1)
```
A table of the classification by site type:
```{r}
with(class.df, table(good, type))
```
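As an aside, a sketch (not part of the analysis) of how the run-level flag might eventually be used: join it back to the hit-level data and drop putative false positives. The runLen > 10 backstop is an assumption, and the thresholds behind good are still provisional.
```{r, eval=FALSE}
## Sketch: carry the run-level classification back to the hits, keeping runs
## flagged good plus all long runs.
filtered.df <- both.df %>%
  left_join(select(class.df, runID, good), by = "runID") %>%
  filter(good | runLen > 10)
```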
Fit another model for just the noisy sites.
```{r}
m2 <- glm(good ~ z.bslop + z.noise + z.sigsd +
            z.vbslop + z.vnoise + z.vsigsd + runLen,
          binomial, data=filter(class.df, type == "Noisy"))
anova(m2)
summary(m2)
```
Some boxplots of the z-scored statistics, by run length and site type.
```{r }
statsum2 <- gather(class.df, key="Variable", value="value",
                   -runLen, -type, -runID, -good) %>%
  mutate(rl = ifelse(runLen > 10, 10, runLen))
p <- ggplot(data=filter(statsum2, runLen < 30,
                        Variable %in% c("z.bslop", "z.freqsd", "z.noise",
                                        "z.sigsd", "z.slop")),
            aes(factor(rl), value, colour=good, fill=good))
p + geom_boxplot() + facet_wrap(type~Variable, scales="free", ncol=5)
```
So now we should look at these statistics within noisy and non-noisy hours (as defined by runLen and number of tags), and at the relationship between the noisy periods and freqsd.
```{r}
p <- ggplot(data=sample_n(tmp.df, 10000),
            aes(log10(mn.rl), freqsd, colour=as.factor(runLen > 4)))
p + geom_point() + facet_wrap(~type)
```
## pretty messy below here ... ignore for now.
Look at how freqsd varies among runs (just the longer runs) at the noisy sites.
```{r}
p <- ggplot(data=filter(both.df, runLen > 9, type == "Noisy"),
            aes(ts.z, freqsd, colour=factor(runID)))
p + geom_point() + geom_line()
```
What about variance in freqsd? Summarize some things by runID
```{r}
## summarize to one row per run (mutate here would repeat values per hit)
both.sum.df <- group_by(both.df, type, runID) %>%
  summarize(
    mn.nse = mean(noise),
    mn.fsd = mean(freqsd),
    var.fsd = var(freqsd),
    len = length(noise))
```
```{r}
p <- ggplot(data=filter(both.sum.df, len > 10),
            aes(mn.nse, mn.fsd, colour=type))
p + geom_point()
```
```{r}
p <- ggplot(data=filter(both.sum.df, len < 30),
            aes(factor(len), mn.fsd, colour=type, fill=type))
p + geom_boxplot()
```