Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor convert_input to Perform tasks via helper function #3338

Open
wants to merge 29 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
614b8f9
Shift functions to check for missing files
Sweetdevil144 Jul 18, 2024
838af61
Update CHANGELOG
Sweetdevil144 Jul 18, 2024
d5e8d24
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Jul 25, 2024
f22b962
Remove unutilized variables from convert_input
Sweetdevil144 Jul 25, 2024
d884203
Update logger statements in convert_input
Sweetdevil144 Jul 25, 2024
68d9516
Added seperate function to check machine info
Sweetdevil144 Jul 25, 2024
5208b02
Update input args to get machine info
Sweetdevil144 Jul 25, 2024
f570646
Correct roxygen documentations
Sweetdevil144 Jul 25, 2024
e479c46
Update tests
Sweetdevil144 Jul 25, 2024
d9e911b
Merge branch 'PecanProject:develop' into gsoc/convert-input
Sweetdevil144 Jul 31, 2024
ed581f7
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Jul 31, 2024
63ac964
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 5, 2024
0f9ac13
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 9, 2024
b98617f
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 12, 2024
4b771d3
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 14, 2024
63f270f
Refactor extra variables in `run.meta.anbalysis`
Sweetdevil144 Aug 14, 2024
dbb7a6d
Merge branch 'PecanProject:develop' into gsoc/convert-input
Sweetdevil144 Aug 16, 2024
74003d9
get existing machine info using helper function
Sweetdevil144 Aug 21, 2024
95fb810
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 28, 2024
2bcb7c4
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Aug 31, 2024
fcae9bd
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Sep 3, 2024
c8e8a02
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Sep 5, 2024
766174f
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Sep 15, 2024
d9074df
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Sep 21, 2024
94c92f0
Merge branch 'develop' into gsoc/convert-input
Sweetdevil144 Oct 7, 2024
a578be2
Applied changes as suggested by @infotroph
Sweetdevil144 Oct 9, 2024
293a68b
Minor review changes
Sweetdevil144 Oct 9, 2024
f7f6926
Update base/db/R/get.machine.info.R
Sweetdevil144 Oct 9, 2024
8f820b0
Apply suggestions from code review
Sweetdevil144 Oct 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
# Change Log

All notable changes are kept in this file. All changes made should be added to the section called
`Unreleased`. Once a new release is made this file will be updated to create a new `Unreleased`
section for the next release.
Expand All @@ -9,6 +10,8 @@ For more information about this file see also [Keep a Changelog](http://keepacha

### Added

- Refactor `convert_input` to Perform tasks via helper function. Subtask of [#3307](https://github.com/PecanProject/pecan/issues/3307)

### Fixed

### Changed
Expand Down
111 changes: 111 additions & 0 deletions base/db/R/add.database.entries.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
#' Return new arrangement of database while adding code to deal with ensembles
#'
#' @param result list of results from the download function
#' @param con database connection
#' @param start_date start date of the data
#' @param end_date end date of the data
#' @param write whether to write to the database
#' @param overwrite Logical: If a file already exists, create a fresh copy?
#' @param insert.new.file whether to insert a new file
#' @param input.args input arguments obtained from the convert_input function
#' @param machine machine information
#' @param mimetype data product specific file format
#' @param formatname format name of the data
#' @param allow.conflicting.dates whether to allow conflicting dates
#' @param ensemble ensemble id
#' @param ensemble_name ensemble name
#' @param existing.input existing input records
#' @param existing.dbfile existing dbfile records
#' @param input input records
#' @return list of input and dbfile ids
#'
#' @author Betsy Cowdery, Michael Dietze, Ankur Desai, Tony Gardella, Luke Dramko

add.database.entries <- function(
result, con, start_date,
end_date, write, overwrite,
insert.new.file, input.args,
machine, mimetype, formatname,
allow.conflicting.dates, ensemble,
ensemble_name, existing.input,
existing.dbfile, input) {
if (write) {
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
# Setup newinput. This list will contain two variables: a vector of input IDs and a vector of DB IDs for each entry in result.
# This list will be returned.
newinput <- list(input.id = NULL, dbfile.id = NULL) # Blank vectors are null.
for (i in 1:length(result)) { # Master for loop
id_not_added <- TRUE

if (!is.null(existing.input) && nrow(existing.input[[i]]) > 0 &&
(existing.input[[i]]$start_date != start_date || existing.input[[i]]$end_date != end_date)) {
# Updating record with new dates
db.query(paste0("UPDATE inputs SET start_date='", start_date, "', end_date='", end_date, "' WHERE id=", existing.input[[i]]$id), con)
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
id_not_added <- FALSE

# The overall structure of this loop has been set up so that exactly one input.id and one dbfile.id will be written to newinput every iteration.
newinput$input.id <- c(newinput$input.id, existing.input[[i]]$id)
newinput$dbfile.id <- c(newinput$dbfile.id, existing.dbfile[[i]]$id)
}

if (overwrite) {
# A bit hacky, but need to make sure that all fields are updated to expected values (i.e., what they'd be if convert_input was creating a new record)
if (!is.null(existing.input) && nrow(existing.input[[i]]) > 0) {
db.query(paste0("UPDATE inputs SET name='", basename(dirname(result[[i]]$file[1])), "' WHERE id=", existing.input[[i]]$id), con)
}

if (!is.null(existing.dbfile) && nrow(existing.dbfile[[i]]) > 0) {
db.query(paste0("UPDATE dbfiles SET file_path='", dirname(result[[i]]$file[1]), "', file_name='", result[[i]]$dbfile.name[1], "' WHERE id=", existing.dbfile[[i]]$id), con)
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
}
}

# If there is no ensemble then for each record there should be one parent
# But when you have ensembles, all of the members have one parent !!
parent.id <- if (is.numeric(ensemble)) {
ifelse(is.null(input[[i]]), NA, input[[1]]$id)
} else {
ifelse(is.null(input[[i]]), NA, input[[i]]$id)
}


if ("newsite" %in% names(input.args) && !is.null(input.args[["newsite"]])) {
site.id <- input.args$newsite
}

if (insert.new.file && id_not_added) {
dbfile.id <- dbfile.insert(in.path = dirname(result[[i]]$file[1]), in.prefix = result[[i]]$dbfile.name[1], "Input", existing.input[[i]]$id, con, reuse = TRUE, hostname = machine$hostname)
newinput$input.id <- c(newinput$input.id, existing.input[[i]]$id)
newinput$dbfile.id <- c(newinput$dbfile.id, dbfile.id)
} else if (id_not_added) {
# This is to tell input.insert if we are writing ensembles
# Why does it need it? Because it checks for inputs with the same time period, site, and machine
# and if it returns something it does not insert anymore, but for ensembles, it needs to bypass this condition
ens.flag <- if (!is.null(ensemble) | is.null(ensemble_name)) TRUE else FALSE
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved

new_entry <- dbfile.input.insert(
in.path = dirname(result[[i]]$file[1]),
in.prefix = result[[i]]$dbfile.name[1],
siteid = site.id,
startdate = start_date,
enddate = end_date,
mimetype = mimetype,
formatname = formatname,
parentid = parent.id,
con = con,
hostname = machine$hostname,
allow.conflicting.dates = allow.conflicting.dates,
ens = ens.flag
)

newinput$input.id <- c(newinput$input.id, new_entry$input.id)
newinput$dbfile.id <- c(newinput$dbfile.id, new_entry$dbfile.id)
}
} # End for loop

successful <- TRUE
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
return(newinput)
} else {
PEcAn.logger::logger.warn("Input was not added to the database")
successful <- TRUE
return(NULL)
}
}
49 changes: 49 additions & 0 deletions base/db/R/check.missing.files.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
#' Function to check if result has empty or missing files
#'
#' @param result A list of dataframes with file paths
#' @param outname Name of the output file
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
#' @param existing.input Existing input records
#' @param existing.dbfile Existing dbfile records
#' @return A list of dataframes with file paths, a list of strings with the output file name, a list of existing input records, and a list of existing dbfile records
#'
#' @author Betsy Cowdery, Michael Dietze, Ankur Desai, Tony Gardella, Luke Dramko

check_missing_files <- function(result, outname, existing.input = NULL, existing.dbfile = NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
check_missing_files <- function(result, outname, existing.input = NULL, existing.dbfile = NULL) {
check_missing_files <- function(result, existing.input = NULL, existing.dbfile = NULL) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(need to update calls too)

result_sizes <- purrr::map_dfr(
result,
~ dplyr::mutate(
.,
file_size = purrr::map_dbl(file, file.size),
missing = is.na(file_size),
empty = file_size == 0
)
)

if (any(result_sizes$missing) || any(result_sizes$empty)) {
log_format_df <- function(df) {
formatted_df <- rbind(colnames(df), format(df))
formatted_text <- purrr::reduce(formatted_df, paste, sep = " ")
paste(formatted_text, collapse = "\n")
}
Comment on lines +22 to +26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pull this function definition out to the end of the file -- I think the runtime overhead of nested function definitions is small, but I like to avoid them for clarity anyway.


PEcAn.logger::logger.severe(
"Requested Processing produced empty files or Nonexistent files:\n",
log_format_df(result_sizes[, c(1, 8, 9, 10)]),
"\n Table of results printed above.",
wrap = FALSE
)
}

# Insert into Database
outlist <- unlist(strsplit(outname, "_"))

# Wrap in a list for consistant processing later
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
if (exists("existing.input") && is.data.frame(existing.input)) {
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
existing.input <- list(existing.input)
}

if (exists("existing.dbfile") && is.data.frame(existing.dbfile)) {
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
existing.dbfile <- list(existing.dbfile)
}
return(list(result_sizes, outlist, existing.input, existing.dbfile))
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
}
150 changes: 15 additions & 135 deletions base/db/R/convert_input.R
Original file line number Diff line number Diff line change
Expand Up @@ -384,7 +384,7 @@ convert_input <-
if (!is.null(ensemble) && ensemble) {
return.all <-TRUE

}else{
} else{
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
return.all <- FALSE
}
existing.dbfile <- dbfile.input.check(siteid = site.id,
Expand Down Expand Up @@ -734,143 +734,23 @@ convert_input <-
#--------------------------------------------------------------------------------------------------#
# Check if result has empty or missing files

result_sizes <- purrr::map_dfr(
result,
~ dplyr::mutate(
.,
file_size = purrr::map_dbl(file, file.size),
missing = is.na(file_size),
empty = file_size == 0
)
)

if (any(result_sizes$missing) || any(result_sizes$empty)){
log_format_df = function(df){
rbind(colnames(df), format(df))
purrr::reduce( paste, sep=" ") %>%
paste(collapse="\n")
}

PEcAn.logger::logger.severe(
"Requested Processing produced empty files or Nonexistant files :\n",
log_format_df(result_sizes[,c(1,8,9,10)]),
"\n Table of results printed above.",
wrap = FALSE)
}

# Insert into Database
outlist <- unlist(strsplit(outname, "_"))

# Wrap in a list for consistant processing later
if (exists("existing.input") && is.data.frame(existing.input)) {
existing.input <- list(existing.input)
}

if (exists("existing.dbfile") && is.data.frame(existing.dbfile)) {
existing.dbfile <- list(existing.dbfile)
}
checked.missing.files <- check_missing_files(result, outname, existing.input, existing.dbfile)

# Unwrap parameters after performing checks for missing files
result_sizes <- checked.missing.files$result_sizes;
outlist <- checked.missing.files$outlist;
existing.input <- checked.missing.files$existing.input;
existing.dbfile <- checked.missing.files$existing.dbfile;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is their any other way to unwrap these variables in a single context line. For example, inheriting the above list and utilising the list$var syntax in-place. A lot of variable operations(assignment and reassignment) would consume extra memory in the heap. Should we also implement a Garbage Collection (gc()) or is it automatically applied in R ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than try to unwrap them in one line I'd consider whether they need to be assigned separate names at all -- e.g. can all places below that refer to result_sizes refer to checked.missing.files$result_sizes instead? If that's what you're suggesting by "utilising the list$var syntax in-place", then yes, I like that approach.

That said these objects are all very small and the runtime/memory overhead from reassignment will be negligible and R's reference-counting garbage collection will take care of the cleanup for us. Happily though, the same optimization gives (what I think is) a substantial improvement in code clarity, so I endorse it on those grounds.


#---------------------------------------------------------------#
# New arrangement of database adding code to deal with ensembles.
if (write) {

# Setup newinput. This list will contain two variables: a vector of input IDs and a vector of DB IDs for each entry in result.
# This list will be returned.
newinput = list(input.id = NULL, dbfile.id = NULL) #Blank vectors are null.
for(i in 1:length(result)) { # Master for loop
id_not_added <- TRUE

if (exists("existing.input") && nrow(existing.input[[i]]) > 0 &&
(existing.input[[i]]$start_date != start_date || existing.input[[i]]$end_date != end_date)) {

# Updating record with new dates
db.query(paste0("UPDATE inputs SET start_date='", start_date, "', end_date='",
end_date, "' WHERE id=", existing.input[[i]]$id),
con)
id_not_added = FALSE

# The overall structure of this loop has been set up so that exactly one input.id and one dbfile.id will be written to newinput every interation.
newinput$input.id = c(newinput$input.id, existing.input[[i]]$id)
newinput$dbfile.id = c(newinput$dbfile.id, existing.dbfile[[i]]$id)
}

if (overwrite) {
# A bit hacky, but need to make sure that all fields are updated to expected
# values (i.e., what they'd be if convert_input was creating a new record)
if (exists("existing.input") && nrow(existing.input[[i]]) > 0) {
db.query(paste0("UPDATE inputs SET name='", basename(dirname(result[[i]]$file[1])),
"' WHERE id=", existing.input[[i]]$id), con)

}

if (exists("existing.dbfile") && nrow(existing.dbfile[[i]]) > 0) {
db.query(paste0("UPDATE dbfiles SET file_path='", dirname(result[[i]]$file[1]),
"', ", "file_name='", result[[i]]$dbfile.name[1],
"' WHERE id=", existing.dbfile[[i]]$id), con)

}
}

# If there is no ensemble then for each record there should be one parent
#But when you have ensembles, all of the members have one parent !!
if (is.numeric(ensemble)){
parent.id <- ifelse(is.null(input[i]), NA, input[1]$id)
}else{
parent.id <- ifelse(is.null(input[i]), NA, input[i]$id)
}



if ("newsite" %in% names(input.args) && !is.null(input.args[["newsite"]])) {
site.id <- input.args$newsite
}

if (insert.new.file && id_not_added) {
dbfile.id <- dbfile.insert(in.path = dirname(result[[i]]$file[1]),
in.prefix = result[[i]]$dbfile.name[1],
'Input', existing.input[[i]]$id,
con, reuse=TRUE, hostname = machine$hostname)
newinput$input.id <- c(newinput$input.id, existing.input[[i]]$id)
newinput$dbfile.id <- c(newinput$dbfile.id, dbfile.id)
} else if (id_not_added) {

# This is to tell input.insert if we are wrting ensembles
# Why does it need it ? bc it checks for inputs with the same time period, site and machine
# and if it returns somethings it does not insert anymore, but for ensembles it needs to bypass this condition
if (!is.null(ensemble) | is.null(ensemble_name)){
ens.flag <- TRUE
}else{
ens.flag <- FALSE
}

new_entry <- dbfile.input.insert(in.path = dirname(result[[i]]$file[1]),
in.prefix = result[[i]]$dbfile.name[1],
siteid = site.id,
startdate = start_date,
enddate = end_date,
mimetype,
formatname,
parentid = parent.id,
con = con,
hostname = machine$hostname,
allow.conflicting.dates = allow.conflicting.dates,
ens=ens.flag
)


newinput$input.id <- c(newinput$input.id, new_entry$input.id)
newinput$dbfile.id <- c(newinput$dbfile.id, new_entry$dbfile.id)
}

} #End for loop

successful <- TRUE
return(newinput)
} else {
PEcAn.logger::logger.warn("Input was not added to the database")
successful <- TRUE
return(NULL)
}
Sweetdevil144 marked this conversation as resolved.
Show resolved Hide resolved
return (add.database.entries(result, con, start_date,
end_date, write, overwrite,
insert.new.file, input.args,
machine, mimetype, formatname,
allow.conflicting.dates, ensemble,
ensemble_name, existing.input,
existing.dbfile, input))
} # convert_input


Expand Down
Loading
Loading