This repository has been archived by the owner on May 1, 2020. It is now read-only.

runs are chained to very old ones, past the tsEnd date. #430

Open
denislepage opened this issue Jan 12, 2019 · 7 comments
@denislepage
Collaborator

At least a few recent batches contain runs that are marked as continuations of very old runs, for tags that should be long dead.

This results in old tags showing up in detection timelines.

select c.batchID, b.runID, b.motusTagID, b.tsBegin, b.tsEnd,
       min(a.ts) as min_ts, max(a.ts) as max_ts
from hits a
  inner join runs b on a.runID = b.runID
  inner join batches c on a.batchID = c.batchID
where a.ts > b.tsEnd and c.batchID > 217000
group by c.batchID, b.runID, b.motusTagID, b.tsBegin, b.tsEnd
order by c.batchID

This was pointed out by Phil for this deployment.
https://motus.org/data/tagDeploymentDetections?id=8377

The tag involved (20442) appears in the table below.

This seems to have started only a few days ago. I'm running a query now to identify the earliest batchID, but that is a slow query. I'll post an update if the issue started earlier than I think.

batchID runID motusTagID tsBegin tsEnd min_ts max_ts
217213 259 20442 1467829829.9299 1467829888.4142 1538240751.7887 1538241131.6877
217213 261 -7 1467840344.2572 1467840365.6879 1538240747.9028 1538241127.7989
217213 262 20433 1467872885.6687 1467872928.2599 1538240747.9028 1538241007.8317
217213 263 -7 1467875903.8191 1467875962.3318 1538240751.789 1538241131.6876
217213 264 -7 1467875930.3086 1467876068.2843 1538240747.9029 1538240987.8373
217213 265 -11 1467880021.6371 1467880081.7267 1538240751.7888 1538241011.7196
217260 30381 15586 1441260327.6529 1441260418.8059 1538317608.1012 1538317988.0047
217260 30382 15567 1441260350.7835 1441260704.3378 1538317608.1012 1538318047.9897
217260 30383 15549 1441260353.0237 1441260666.2063 1538317608.1011 1538318027.9948
217260 30384 15549 1441260403.4986 1441260625.8264 1538317612.5377 1538317992.4445
217260 30385 15569 1441260404.0514 1441260848.52 1538317612.5378 1538318032.4348
217260 30386 14679 1441260644.5382 1441260684.149 1538317612.5377 1538318092.42
217300 39766270 29270 1532849927.5349 1532850124.0139 1538247105.8853 1538248645.4969
217300 39766272 -580 1535162114.3128 1535162185.1724 1538247105.8852 1538248645.4969
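For reference, the diagnostic above can be exercised against a toy sqlite database in Python; the table contents below are invented, and only the column names follow the schema discussed in this issue:

```python
# Sketch of the diagnostic query above: find hits whose timestamp falls
# after the tsEnd of the run they are attached to. Data is invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table runs (runID integer, motusTagID integer, tsBegin real, tsEnd real);
create table hits (hitID integer, batchID integer, runID integer, ts real);
-- a 2016 run...
insert into runs values (259, 20442, 1467829829.93, 1467829888.41);
-- ...with a hit attached to it more than two years later
insert into hits values (1, 217213, 259, 1538240751.79);
""")

rows = con.execute("""
    select h.batchID, h.runID, r.motusTagID
    from hits as h
    join runs as r on h.runID = r.runID
    where h.ts > r.tsEnd
""").fetchall()
assert rows == [(217213, 259, 20442)]
```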
@jbrzusto
Owner

jbrzusto commented Jan 14, 2019

tl;dr:

This is a big bug that has left a significant amount of data in the
master DB incorrect: any hits written as part of runs continued from
a previous batch will almost certainly have the wrong runID, and so
will claim to be from the wrong tag and wrong antenna!

Weird. Working backward from the master DB on sgdata:

MariaDB [motus]> select distinct(runID) from hits where batchID=217213;
+-------+
| runID |
+-------+
|   264 |
|   261 |
|   262 |
|   263 |
|   259 |
|   265 |
+-------+
MariaDB [motus]> select distinct runID from batchRuns where batchID=217213;
+----------+
| runID    |
+----------+
| 34202595 |
+----------+

We have lost referential integrity here: the only runID recorded
in the batchRuns table for batch 217213 is 34202595, and yet we have six
runIDs in the hits records for the same batch that are implausibly low.

Did this error come from the tag finder? Or did it arise in pushing
records to the master DB?

We find the receiver to which this batch belongs:

MariaDB [motus]> select motusDeviceID from batches where batchID=217213;
+---------------+
| motusDeviceID |
+---------------+
|            31 |
+---------------+

and look up its serno in the metadatadb:

$ meta "select serno from recvDeps where deviceID=31 limit 1"
SG-1614BBBK1869

We examine this receiver's DB using sqlite3 -column /sgm/recv/SG-1614BBBK1869.motus:

sqlite> select * from motusTX where batchID + offsetBatchID = 217213;
batchID     tsMotus          offsetBatchID  offsetRunID  offsetHitID
----------  ---------------  -------------  -----------  -----------
362         1547035579.2565  216851         0            854251524  

The motusTX table records the offsets for batchIDs, runIDs, and hitIDs
between the master table and the receiver table; IDs for runs and hits are assigned
sequentially when a batch is transferred, by reserving IDs via the master DB's maxKeys table.

So, e.g. batch 217213 in the master table is batch 217213 - offsetBatchID = 217213 - 216851 = 362
in the receiver's batches table. This record shows the extremely unlikely value of offsetRunID = 0,
which would only be valid if this were the very first block of runs to be added to the master database.
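The offset arithmetic can be sketched as follows; the helper names here are hypothetical, but the numbers are the ones from this batch:

```python
# Hypothetical helpers illustrating the motusTX offset scheme: master IDs
# are receiver-local IDs plus a per-batch offset recorded at transfer time.

def to_master(local_id, offset):
    """Translate a receiver-local ID to a master-DB ID."""
    return local_id + offset

def to_local(master_id, offset):
    """Translate a master-DB ID back to a receiver-local ID."""
    return master_id - offset

# From the motusTX row above: offsetBatchID = 216851 for SG-1614BBBK1869
assert to_local(217213, 216851) == 362      # master batch -> local batch
assert to_master(362, 216851) == 217213     # local batch -> master batch
```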

So somehow, offsetRunID was assigned a clearly bogus value.

Looking at the code in motusServer/R/pushToMotus.R, we find this clause starting at line 186:

## get count of new runs and 1st run ID for this batch
runInfo = sql("select count(*), min(runID) from runs where batchIDbegin = %d", b$batchID)

if (runInfo[1,1] > 0) {

    ## reserve the required number of runIDs
    firstMotusRunID = motusReserveKeys("runs", runInfo[1,1])
    offsetRunID = firstMotusRunID - runInfo[1,2]

However, for this specific batch, there are no runs that begin in the batch; only some runs
which are extended in it, as we see by using the query in the code above on the receiver database:

sqlite> select count(*), min(runID) from runs where batchIDbegin=362;
count(*)    min(runID)
----------  ----------
0

And so the if condition above evaluates to FALSE. Unfortunately, that means offsetRunID is
left unset (or rather, is left at 0, having been set to that at line 125).
Then, when hits are written to the master DB, their local (i.e. specific to the receiver DB) runIDs are not
correctly converted to global (i.e. specific to the master DB) runIDs in this block (which starts at line 353):

            res = dbSendQuery(con, sprintf("select * from hits where batchID = %d order by hitID", b$batchID))

            repeat {
                hits = dbFetch(res, CHUNK_ROWS)
                if (nrow(hits) == 0)
                    break
                hits$hitID   = hits$hitID   + offsetHitID
                hits$runID   = hits$runID   + offsetRunID
                hits$batchID = hits$batchID + offsetBatchID
                dbWriteTable(mtcon, "hits", hits, append=TRUE, row.names=FALSE)

                ## copy the helper field tagDepProjectID from the value for the associated run

Moreover, this reveals that the runID for all hits from continued
runs will be recorded incorrectly, since the value of offsetRunID
being used in the correction above, even when set, is only valid for
runs started in the current batch, not for those being continued. For
the latter, a different offsetRunID value will apply for each
distinct batch from which runs are being continued!

We need to clean up this code to use the correct runID for each reference
to a run!
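A minimal sketch (in Python rather than the R of pushToMotus.R, with invented offset values) of why a single additive offsetRunID cannot translate hits from continued runs:

```python
# Each run's correct master runID uses the offset of the batch in which the
# run BEGAN, but the buggy code applies one offset (that of the current
# batch, or 0 when no runs begin in it) to every hit. Offsets are invented.
offsets_by_origin_batch = {360: 34_202_000, 362: 34_500_000}

hits = [
    {"runID": 259,   "origin_batch": 360},  # run continued from batch 360
    {"runID": 30381, "origin_batch": 362},  # run beginning in batch 362
]

# Buggy: one offset for every hit in the current batch (362)
buggy = [h["runID"] + offsets_by_origin_batch[362] for h in hits]

# Correct: offset chosen per run, by the batch in which the run began
correct = [h["runID"] + offsets_by_origin_batch[h["origin_batch"]] for h in hits]

assert buggy[0] != correct[0]  # continued run: wrong master runID
assert buggy[1] == correct[1]  # new run: translated correctly
```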

@jbrzusto
Owner

One approach is to create a map from local runIDs to master runIDs, populating the latter either from newly-reserved keys (for runs beginning in this batch), or from keys corrected by offsets (for runs begun in previous batches).
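A sketch of that mapping approach, with a reserve_keys callable standing in for motusReserveKeys (all names and values here are illustrative, not the actual implementation):

```python
# Build an explicit local->master runID map: newly-reserved keys for runs
# beginning in this batch, offset-corrected keys for continued runs.
def build_run_id_map(new_local_run_ids, continued_run_offsets, reserve_keys):
    """
    new_local_run_ids:     local runIDs of runs beginning in this batch
    continued_run_offsets: {local_runID: offsetRunID of originating batch}
    reserve_keys:          callable(n) -> first of n reserved master runIDs
    """
    run_map = {}
    if new_local_run_ids:
        first = reserve_keys(len(new_local_run_ids))
        for i, rid in enumerate(sorted(new_local_run_ids)):
            run_map[rid] = first + i
    for rid, offset in continued_run_offsets.items():
        run_map[rid] = rid + offset
    return run_map

# Toy key reservation: hands out consecutive master runIDs
next_key = [34_202_595]
def reserve_keys(n):
    first = next_key[0]
    next_key[0] += n
    return first

m = build_run_id_map([100, 101], {259: 34_202_000}, reserve_keys)
assert m == {100: 34_202_595, 101: 34_202_596, 259: 34_202_259}
```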

@jbrzusto
Owner

The bigger job will be correcting the master database.

Options:

  1. start from scratch, re-exporting every receiver; this is clean but we have to make sure to not
    reset the maxKeys table so that all batch, run, and hit IDs are new.
  2. correct hits in-place. Not sure what the client-side motus.org or R package capabilities are for
    dealing with this.

Option 1 looks like the better way to go. It can be coupled with a complete re-run of every receiver,
if we choose.

@denislepage
Collaborator Author

denislepage commented Jan 14, 2019 via email

@jbrzusto
Owner

Nope - it's been in the code for years:

$ cd /home/sg/src/motusServer/R
$ git blame pushToMotus.R
...
8691d441 package/R/pushToMotus.R (john brzustowski 2016-04-06 15:48:49 +0000 360) hits$runID   = hits$runID   + offsetRunID
...

jbrzusto added a commit that referenced this issue Jan 15, 2019
…atch

This is the bug fix for #430, but closing it requires other actions.
@jbrzusto
Owner

jbrzusto commented Jan 15, 2019

TODO

  • fix bug (see 57d989f)

  • identify magnitude of problem in existing master database
    Somewhat tricky. We need to know for which hits the runID has been stored
    incorrectly. This happened in two ways:

    • a resumed batch had no new runs in it, only continued runs; hits
      for these runs would have used an offsetRunID of 0, meaning their
      runIDs are too small. These hits can be detected in the master
      DB because their runID will not be in the batchRuns table for
      their batchID. (Note that the runIDs in the batchRuns table have
      been recorded correctly.)

    • hits in a resumed batch that did have new runs. Any hits in
      that batch which are in continued runs will have a runID that is
      too large, but is not large enough to match the runID of any new
      run. It is very likely that the incorrect runID will also not be present
      in the batchRuns table for that hit's batchID.

    So checking a hit's (batchID, runID) pair against the batchRuns
    table will catch almost all cases of hits with incorrect runIDs.
    We can improve the query speed by limiting it to only those batches
    which were resumed; i.e. which have the same monoBN but later
    (not too much later) tsStart as the preceding batch for the same
    receiver:

create temporary table resumedBatches as
select
   t2.batchID
from
   batches as t1
   join batches as t2 on
      t1.motusDeviceID = t2.motusDeviceID
      and t1.batchID < t2.batchID
      and t1.monoBN = t2.monoBN
      and t1.tsStart < t2.tsStart
      and t2.tsStart - t1.tsEnd < 1000

We can then run the search query as:

create table hits_with_bad_runIDs as
select t2.hitID from
   resumedBatches as t1
   join hits as t2 on t1.batchID = t2.batchID
   left join batchRuns as t3 on t2.batchID=t3.batchID and t2.runID=t3.runID
where t3.batchID is null
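The left-join check can be exercised against a toy sqlite database; the table contents below are invented:

```python
# Flag hits whose (batchID, runID) pair is absent from batchRuns: the
# left join leaves batchRuns columns NULL for unmatched hits.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table hits (hitID integer, batchID integer, runID integer);
create table batchRuns (batchID integer, runID integer);
insert into hits values (1, 217213, 259), (2, 217213, 34202595);
insert into batchRuns values (217213, 34202595);
""")

bad = con.execute("""
    select h.hitID
    from hits as h
    left join batchRuns as r
      on h.batchID = r.batchID and h.runID = r.runID
    where r.batchID is null
""").fetchall()
assert bad == [(1,)]  # hit 1's runID is not recorded for its batch
```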

There are 3284 resumed batches with a total of 32024675 hits.
Running the second query shows there are 306968 hits in 93 batches with incorrect runIDs, and
a total of 443 distinct incorrect runIDs.

@jbrzusto
Owner

see #431 for a combined TO DO.
