-
Notifications
You must be signed in to change notification settings - Fork 37
Description
There must be exactly one prior per variable per PFT. However, at least in the BETY dump used by default on Docker, there are at least 4 PFT-variable combinations with multiple priors:
priors_pfts <- tbl(con, "priors") %>%
inner_join(tbl(con, "pfts_priors"),
by = c("id" = "prior_id"),
suffix = c(".priors", ".pfts_priors")) %>%
collect()
priors_pfts %>%
group_by(pft_id, variable_id) %>%
count() %>%
filter(n > 1)
# # A tibble: 4 x 3
# # Groups: pft_id, variable_id [4]
# pft_id variable_id n
# <S3: integer64> <S3: integer64> <int>
# 1 16 7 2
# 2 1000000122 41 2
# 3 1000000130 1000000012 2
# 4 1000000130 1000000015 2
First, and easiest, we need to remove the problematic records from the current BETY dump (@robkooper?). Here are the duplicate records, grouped by PFT and variable, and sorted (within group) by age (oldest updated first).
# A tibble: 8 x 7
# Groups: variable_id, pft_id [4]
pft_id variable_id id.priors id.pfts_priors distn parama paramb
<S3: intege> <S3: integer6> <S3: integer> <S3: integer6> <chr> <dbl> <dbl>
1 16 7 189 597 unif 0. 100
2 16 7 197 1000001394 gamma 1.50e+0 0.4
3 1000000122 41 139 1000001711 unif 0. 10
4 1000000122 41 1000000321 1000001698 unif 1.00e-3 1
5 1000000130 1000000012 1000000123 1000002030 norm 6.60e+3 10
6 1000000130 1000000012 1000000167 1000002020 norm 3.80e+3 10
7 1000000130 1000000015 1000000310 1000002044 norm 8.60e-1 0.025
8 1000000130 1000000015 1000000311 1000002024 norm 1.10e+0 0.025
More generally, we should try to figure out a way to make sure these records don't creep into the database again. Unfortunately, it looks like there is no way to impose PostgreSQL constraints across tables, which is what would be required here. Perhaps a short term solution is to periodically run some external validation of BETY (possibly based on my dplyr
code above) that makes sure that duplicates don't exist in this table.
Related to #185.