---
title: "Data on noise"
author: "pdt"
date: "February 20, 2018"
output: html_document
---
There is an issue of extreme noise at some sites.
We may be able to use information provided in the database to determine which runs are likely invalid, or which time periods should be removed from the database because of these problems.
Most of these sites should be re-run with more stringent parameters in the tag-finder.
We should re-run those that we can immediately, and manually retire batches from the database.
The proposed workflow at present is, for each processed batch, to calculate the proportion of runs of length 2 or 3, the number of tags detected, and the number of different projects. We can come up with definitive criteria later, but the worst batches should be obvious; those batches should be re-run, and the previous data retired from the database. A rough sketch of this screening follows below.
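The following is a minimal sketch of that per-batch screening, not a definitive implementation: it assumes a flat data frame with one row per hit and the alltags columns batchID, runID, runLen, motusTagID, and tagProjID (built later in this document), and the runLen cutoff of 3 is illustrative.
```{r, eval=FALSE}
## Sketch only: per-batch screening stats. `df` is assumed to be a flat data
## frame of hits with columns batchID, runID, runLen, motusTagID, tagProjID.
screen.batches <- function(df) {
  df %>%
    distinct(batchID, runID, runLen, motusTagID, tagProjID) %>% ## collapse hits to runs
    group_by(batchID) %>%
    summarize(p.short = mean(runLen <= 3),      ## proportion of short runs
              n.tag = n_distinct(motusTagID),   ## number of tags detected
              n.proj = n_distinct(tagProjID)) %>%
    arrange(desc(p.short))                      ## worst batches first
}
```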
A future enhancement will provide the ability for a user to decide themselves whether a batch should be retired/replaced from their local database.
NB. For a list of the parameters for running noisy sites, see:
https://github.com/jbrzusto/find_tags/blob/new_server/find_tags_motus.cpp
Even with more stringent tag finding, users will still be required to filter their data, since some noise is intermittent, and we don't know the trade-off between the stringency of tag-finder parameters and the prevalence of false positives.
Information on the amount of noise at a site can be determined from three main sources:
A) On the receiver API
- the number of pulses per hour: noisy periods will have a large number (relative to the number of tags detected).
- the number of tags detected: noisy periods will have large numbers of tags.
- the distribution of run lengths: noisy periods will have many runs of length 2 and 3.
Ideally, when a user calls for data on a given receiver, they would be provided with these data (it is also a convenient way to get an empirical estimate of whether a receiver was working or not).
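As a concrete way to get at A) without the API, the receiver's .motus file should already hold hourly pulse counts. A minimal sketch, assuming the SQLite database contains a pulseCounts table with columns batchID, ant, hourBin, and count, where hourBin is hours since the Unix epoch (check this against the actual schema):
```{r, eval=FALSE}
## Sketch: hourly pulse counts per antenna, straight from the receiver
## database. The pulseCounts table and its columns are assumptions here;
## hourBin is taken to be hours since 1970-01-01. nsy1.sql is opened below.
pulse.df <- tbl(nsy1.sql, "pulseCounts") %>%
  collect() %>%
  mutate(ts.h = as.POSIXct(hourBin * 3600, origin = "1970-01-01", tz = "GMT"))
p <- ggplot(data = pulse.df, aes(ts.h, count, colour = factor(ant)))
p + geom_line()
```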
B) In the local tag database
Filtering after this point is mostly for assessing the veracity of short runLen hits.
Here we examine this empirically, using known noisy and quiet sites.
First, get some data from a known noisy site and a known quiet site. We should expand this to multiple noisy/quiet sites, to see to what extent anything we find generalizes.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(tidyverse)
require(motus)
require(lubridate)  ## for year(); lubridate is not attached by require(tidyverse)
Sys.setenv(TZ = "GMT")
```
Get data from a noisy site and a non-noisy site. These were suggested by Zoe. I haven't (as of 20/2/2018) downloaded all of Earl Rowe.
Here are some others (from Zoe):
- Koffler - SG-3214BBBK6103, deployment 2729
- Formosa - SG-3615BBBK1E34, deployments 3071, 3742, 3984
- Old Cut - SG-5113BBBK2972, deployments 2549, 1124, 746, 634, 1503. This might be intermittent; most noise is from a high number of true tags in the area.
- West Port Bruce - SG-3214BBBK4137, deployment 2292. Also might be intermittent due to lots of tags in the area.
```{r, message=FALSE}
tagme("SG-3214BBBK6103") ## Koffler
tagme("SG-5113BBBK2786") ## Earl Rowe PP
tagme("SG-3615BBBK1E34", new=TRUE) ## Formosa
tagme("SG-5113BBBK2972", update=TRUE) ## Old Cut
```
```{r}
nsy1.sql <- tagme("SG-3214BBBK6103", update=TRUE)
nsy2.sql <- tagme("SG-5113BBBK2972", update=TRUE)
nsy3.sql <- tagme("SG-3615BBBK1E34", update=TRUE)
qte1.sql <- tagme("SG-5113BBBK2786", update=TRUE)
nsy1.tbl <- tbl(nsy1.sql, "alltags")
nsy2.tbl <- tbl(nsy2.sql, "alltags")
nsy3.tbl <- tbl(nsy3.sql, "alltags")
qte1.tbl <- tbl(qte1.sql, "alltags")
## select variables and make a posix time
## since we are doing this across a bunch of sites, make a little function
## to pull the tbl into a flat file
get.df <- function(in.tbl) {
  out.df <- select(in.tbl,
                   motusTagID, hitID, runID, batchID, noise, ts, sig, runLen,
                   freqsd, sigsd, slop, burstSlop, port, site=recvSiteName) %>%
    distinct() %>% collect() %>%
    mutate(ts = as.POSIXct(ts, origin="1970-01-01", tz = "GMT"),
           year = year(ts))
  return(out.df)
}
nsy1.df <- get.df(nsy1.tbl) %>%
  mutate(site = "Koffler",
         type = "Noisy")
nsy2.df <- get.df(nsy2.tbl) %>%
  mutate(site = "Old Cut",
         type = "Noisy")
nsy3.df <- get.df(nsy3.tbl) %>%
  mutate(site = "Formosa",
         type = "Noisy")
qte1.df <- get.df(qte1.tbl) %>%
  mutate(site = "Earl Rowe",
         type = "Quiet")
## centre time within each run, and clean up the funny burstSlop (issue for JB?)
all.df <- bind_rows(nsy1.df, nsy2.df, nsy3.df, qte1.df) %>%
  group_by(runID) %>%
  mutate(ts.z = ts - mean(ts),
         ts.h = plyr::round_any(ts, 3600),
         burstSlop = ifelse(burstSlop < -2, 0, burstSlop))
```
Save the state so one can just pick up here ...
```{r}
saveRDS(all.df, "all.df.RDS")
```
```{r}
all.df <- readRDS("all.df.RDS")
```
The first question is how to identify where there is noise. That can be done by simply looking at the proportion of runLen = 2 and runLen = 3 within each batch. If a batch is all or mostly noise, the run lengths should follow an exponential (geometric, since run lengths are discrete) distribution.
There is probably some theoretical way to figure out what the parameters of that distribution should be, but we can also just explore it empirically.
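One quick empirical check, sketched below rather than run: fit a geometric distribution to runLen - 2 by maximum likelihood and compare observed and fitted proportions (all.df comes from the chunks above; the site name is just an example).
```{r, eval=FALSE}
## Sketch: do run lengths at a noisy site look geometric? Runs start at
## length 2, so model runLen - 2 ~ Geom(p); the MLE is p = 1/(mean(runLen) - 1).
rl <- all.df %>%
  filter(site == "Koffler") %>%   ## example site; try each in turn
  distinct(runID, runLen) %>%     ## one row per run
  pull(runLen)
p.hat <- 1 / (mean(rl) - 1)
rbind(observed = as.numeric(table(factor(rl, levels = 2:6))) / length(rl),
      fitted = dgeom(0:4, prob = p.hat))  ## proportions for runLen = 2..6
```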
So first look at the mean runLen across batches, for noisy and not-noisy sites.
```{r}
noise.df <- group_by(all.df, site, batchID) %>%
  summarize(mn.rl = mean(runLen),
            min.ts = min(ts),
            max.ts = max(ts),
            length = difftime(max.ts, min.ts, units="hours"),
            n.tag = length(unique(motusTagID)))
```
The problem is that the range of lengths of batches is huge. So, although we eventually need to re-run batches, we first need to look at shorter time intervals.
```{r}
noise2.df <- group_by(all.df, site, batchID, ts.h) %>%
  summarize(mn.rl = mean(runLen),
            min.ts = min(ts),
            max.ts = max(ts),
            length = difftime(max.ts, min.ts, units="hours"),
            n.tag = length(unique(motusTagID)))
```
```{r}
p <- ggplot(data=noise2.df, aes(log(mn.rl), ts.h, colour=site))
p + geom_point()
```
```{r}
p <- ggplot(data=noise2.df, aes(log10(mn.rl), log10(n.tag), colour=factor(batchID)))
p + geom_jitter() + facet_wrap(~site)
```
Merge the data on noise back into the main data frame.
```{r}
## why isn't tmp.df the same size as all.df? Different batches maybe?
## (likely: noise2.df is keyed by site, batchID and ts.h, but the join below
## uses only site and ts.h, so hours spanning more than one batch get duplicated)
tmp.df <- left_join(all.df, noise2.df, by=c("site", "ts.h")) %>%
  ungroup()
## need to change this back to all.df here once I figure out where the
## extra hits are coming from.
```
TODO: change both.df to all.df from here on, and go back to the narrative of site first, then run, and so on.
Some simple first steps.
Check with John, but the meanings of the relevant variables are ...
- noise: an estimate of underlying noise?
- freqsd: the SD of the frequency offset, within a hit
- sigsd: the SD of the signal strength, within a hit
- slop: the sum? of the absolute differences between the expected and actual times of pulses
- burstSlop: like slop, but between hits (note that some values are off by a factor of -25 or -10)
We can consider 'noise' then to be within a tag (e.g. at the level of a hit or a run) or among tags (e.g. within a duration or a batch).
Initially, we can readily see that the distribution of runLen is very different at the two types of sites. This is expected, since the noisy sites are mostly false positives, which should follow a Poisson-like distribution.
This also shows that there are two reasons for trying to sort out the issue of noise: one is to filter good data from bad sites (e.g. which of the longer runs at the noisy sites are actually valid), and the other is to filter bad data from good sites (e.g. which of the short runs at the quiet site are actually good).
```{r}
both.df <- all.df   ## stop-gap per the TODO above, so the code below runs
both.df %>% filter(runLen < 10) %>%
  select(runLen, runID, type) %>%
  distinct() %>%
  group_by(runLen, type) %>%
  tally() %>%
  spread(type, value=n)
```
We can start by summarizing each of the stats above by runLen.
```{r}
statsum <- both.df %>%
  group_by(runLen, type) %>%
  summarize(
    mn.slop = mean(slop),
    var.slop = var(slop),
    mn.bslop = mean(burstSlop),
    var.bslop = var(burstSlop),
    mn.freqsd = mean(freqsd),
    var.freqsd = var(freqsd),
    mn.noise = mean(noise),
    var.noise = var(noise),
    mn.sigsd = mean(sigsd),
    var.sigsd = var(sigsd)
  )
statsum
```
This shows that (at least for this limited data set) there is potentially some information in each of these statistics.
We can plot some of these statistics to see how they vary with runLen, filtering out the very long runs (which mostly come from the quiet site).
```{r}
statsum <- gather(statsum, key="Variable", value="value", -runLen, -type)
p <- ggplot(data=filter(statsum, runLen < 75),
            aes(runLen, value, colour=type))
p + geom_point() + geom_line() + facet_wrap(~Variable, scales="free", ncol=5)
```
From the plots above, slop and freqsd seem the most promising. Each shows a decreasing mean with runLen (where we are in effect treating runLen as a proxy for validity, on the realistic assumption that long runs of false positives are less plausible than short ones).
Let's look at the distribution of mean slop, within runs, for noisy sites with short run lengths. What we're looking for here is a (possibly) bi-modal relationship, suggesting an easy way to distinguish good from bad runs.
```{r}
slop.df <- filter(both.df, type == "Noisy") %>%
  group_by(runID, runLen) %>%
  summarize(mn.slop = mean(slop),
            mn.freqsd = mean(freqsd))
p <- ggplot(data=slop.df, aes(mn.slop, mn.freqsd))
p + geom_point(alpha=0.1, size=0.4) + facet_wrap(~(runLen > 3))
```
There do appear to be two groups, most obviously at the longer runLens: those with mn.slop < 0.0015 and mn.freqsd < 0.15 (approximately). We could classify all of the points this way and see whether any of the other statistics have explanatory power towards that classification.
```{r}
z.var <- function(in.var) {return((in.var - mean(in.var))/sd(in.var))}
class.df <- both.df %>%
  group_by(runLen, runID, type) %>%
  summarize(
    mn.slop = mean(slop),
    var.slop = var(slop),
    mn.bslop = mean(burstSlop),
    var.bslop = var(burstSlop),
    mn.freqsd = mean(freqsd),
    var.freqsd = var(freqsd),
    mn.noise = mean(noise),
    var.noise = var(noise),
    mn.sigsd = mean(sigsd),
    var.sigsd = var(sigsd)
  ) %>%
  ungroup() %>%
  mutate(good = ifelse(mn.freqsd < 0.15 & mn.slop < 0.0015, TRUE, FALSE),
         z.slop = z.var(mn.slop),
         z.bslop = z.var(mn.bslop),
         z.freqsd = z.var(mn.freqsd),
         z.noise = z.var(mn.noise),
         z.sigsd = z.var(mn.sigsd),
         z.vslop = z.var(var.slop),
         z.vbslop = z.var(var.bslop),
         z.vfreqsd = z.var(var.freqsd),
         z.vnoise = z.var(var.noise),
         z.vsigsd = z.var(var.sigsd))
```
A first model shows the obvious -- there are many more good runs in the quiet site than in the noisy site.
```{r}
m1 <- glm(good ~ z.bslop + z.noise + z.sigsd + runLen + type,
          binomial, data=class.df)
anova(m1)
summary(m1)
```
A table of the classification by site type:
```{r}
with(class.df, table(good, type))
```
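As an aside, a sketch (not part of the analysis) of how the run-level flag might eventually be used: join it back to the hit-level data and drop putative false positives. The runLen > 10 backstop is an assumption, and the thresholds behind good are still provisional.
```{r, eval=FALSE}
## Sketch: carry the run-level classification back to the hits, keeping runs
## flagged good plus all long runs.
filtered.df <- both.df %>%
  left_join(select(class.df, runID, good), by = "runID") %>%
  filter(good | runLen > 10)
```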
Fit another model for just the noisy sites.
```{r}
m2 <- glm(good ~ z.bslop + z.noise + z.sigsd +
            z.vbslop + z.vnoise + z.vsigsd + runLen,
          binomial, data=filter(class.df, type == "Noisy"))
anova(m2)
summary(m2)
```
Some boxplots of the z-scored statistics, by run length and site type.
```{r }
statsum2 <- gather(class.df, key="Variable", value="value",
                   -runLen, -type, -runID, -good) %>%
  mutate(rl = ifelse(runLen > 10, 10, runLen))
p <- ggplot(data=filter(statsum2, runLen < 30,
                        Variable %in% c("z.bslop", "z.freqsd", "z.noise",
                                        "z.sigsd", "z.slop")),
            aes(factor(rl), value, colour=good, fill=good))
p + geom_boxplot() + facet_wrap(type~Variable, scales="free", ncol=5)
```
So now we should look at these statistics within noisy and non-noisy hours (as defined by runLen and number of tags), and at the relationship between the noisy periods and freqsd.
```{r}
p <- ggplot(data=sample_n(tmp.df, 10000),
            aes(log10(mn.rl), freqsd, colour=as.factor(runLen > 4)))
p + geom_point() + facet_wrap(~type)
```
## pretty messy below here ... ignore for now.
Look at how freqsd varies among runs (just the longer runs) at the noisy sites.
```{r}
p <- ggplot(data=filter(both.df, runLen > 9, type == "Noisy"),
            aes(ts.z, freqsd, colour=factor(runID)))
p + geom_point() + geom_line()
```
What about variance in freqsd? Summarize some things by runID
```{r}
## summarize to one row per run (mutate here would repeat values per hit)
both.sum.df <- group_by(both.df, type, runID) %>%
  summarize(
    mn.nse = mean(noise),
    mn.fsd = mean(freqsd),
    var.fsd = var(freqsd),
    len = length(noise))
```
```{r}
p <- ggplot(data=filter(both.sum.df, len > 10),
            aes(mn.nse, mn.fsd, colour=type))
p + geom_point()
```
```{r}
p <- ggplot(data=filter(both.sum.df, len < 30),
            aes(factor(len), mn.fsd, colour=type, fill=type))
p + geom_boxplot()
```