Changes from 3 commits
42 commits
3d0f427
Enable set -o pipefail to catch errors in piped commands
DavidHuber-NOAA Nov 20, 2025
4ad6542
Refactor cleanup task and add checks for gempak files
DavidHuber-NOAA Nov 20, 2025
00372f3
Merge remote-tracking branch 'emc/develop' into feature/save_gempak_1p00
DavidHuber-NOAA Nov 20, 2025
8e4322b
Shellcheck fixes
DavidHuber-NOAA Nov 20, 2025
c6a23cf
Ignore 2312
DavidHuber-NOAA Nov 20, 2025
484235c
Remove unused variable
DavidHuber-NOAA Nov 20, 2025
749f917
Disable pipefail exits for now
DavidHuber-NOAA Nov 21, 2025
ceeae5e
Shellcheck issues
DavidHuber-NOAA Nov 21, 2025
cea9f0e
Merge branch 'feature/save_gempak_1p00' of github.com:davidhuber-noaa…
DavidHuber-NOAA Nov 21, 2025
7006ec5
Fix bufrsnd tarball creation
DavidHuber-NOAA Nov 21, 2025
4efde5a
Correct archiving to actually pull the data used
DavidHuber-NOAA Nov 21, 2025
214eb50
Correct positional arguments
DavidHuber-NOAA Nov 21, 2025
ebb0e33
Tar all bufr files
DavidHuber-NOAA Nov 21, 2025
cdf5a8f
Use a find command instead
DavidHuber-NOAA Nov 21, 2025
ef21b22
Merge remote-tracking branch 'emc/develop' into feature/save_gempak_1p00
DavidHuber-NOAA Nov 25, 2025
3303cc4
Do not load gh anymore
DavidHuber-NOAA Nov 25, 2025
473c7b5
Enable archiving on WCOSS2 by default
DavidHuber-NOAA Nov 25, 2025
7cda242
Make the bufr soundings and surface files EE2-compliant
DavidHuber-NOAA Nov 25, 2025
41effff
Update soundings tarball name to be EE2-compliant
DavidHuber-NOAA Nov 26, 2025
c8bc24c
Correct atmos product dependencies
DavidHuber-NOAA Dec 1, 2025
30395ab
Replace tabs with spaces
DavidHuber-NOAA Dec 1, 2025
a88f475
Increase memory for gempak jobs (needed for meta task)
DavidHuber-NOAA Dec 1, 2025
159dc68
Make cleanup script smarter
DavidHuber-NOAA Dec 1, 2025
f15662b
Merge develop; fix conflicts
DavidHuber-NOAA Dec 1, 2025
73f4bcc
Reduce RTOFS retention to 24 hours
DavidHuber-NOAA Dec 1, 2025
a8d5f41
Remove extra }
DavidHuber-NOAA Dec 2, 2025
cd59c0f
Rename RTOFS and GEMPAK variables
DavidHuber-NOAA Dec 2, 2025
d2ebe7f
Clarify RTOFS cleanup language
DavidHuber-NOAA Dec 2, 2025
9331409
Add DO_BUFRSND dependency on gempak/bufr sounding outputs
DavidHuber-NOAA Dec 2, 2025
19e44a1
Linter cleanup
DavidHuber-NOAA Dec 2, 2025
844d528
Rename variables to more meaningful names
DavidHuber-NOAA Dec 2, 2025
28659b9
Only delete empty RUN.PDY directories
DavidHuber-NOAA Dec 2, 2025
38793d2
Add check for invalid GEMPAK retention
DavidHuber-NOAA Dec 2, 2025
8e6333a
Merge remote-tracking branch 'emc/develop' into feature/save_gempak_1p00
DavidHuber-NOAA Dec 2, 2025
76bc130
Merge remote-tracking branch 'emc/develop' into feature/save_gempak_1p00
DavidHuber-NOAA Dec 8, 2025
f677e4f
Merge remote-tracking branch 'emc/develop' into feature/save_gempak_1p00
DavidHuber-NOAA Dec 9, 2025
bb7a31a
Re-disable HPSS archiving on WCOSS2 by default
DavidHuber-NOAA Dec 9, 2025
4a8bb34
Disable C48_S2SW on WCOSS2
DavidHuber-NOAA Dec 9, 2025
8382dea
Turn on HPSS archiving for extended tests
DavidHuber-NOAA Dec 9, 2025
fb0f744
Revert the use of 'bufr_target'
DavidHuber-NOAA Dec 9, 2025
246487c
Remove now-unused variable
DavidHuber-NOAA Dec 9, 2025
865a53d
Correct GEMPAK_CLEANUP check
DavidHuber-NOAA Dec 9, 2025
24 changes: 16 additions & 8 deletions dev/parm/config/gfs/config.cleanup
@@ -9,22 +9,30 @@ source "${EXPDIR}/config.resources" cleanup
export CLEANUP_COM="YES" # NO=retain ROTDIR. YES default in cleanup.sh

#--starting and ending hours of previous cycles to be removed from rotating directory
export RMOLDSTD=144
export RMOLDEND=24
# Selectively remove files between SELECTIVE_CLEANUP_MIN and SELECTIVE_CLEANUP_MAX hours old, based on exclude_string
# Remove all RTOFS files older than RTOFS_CLEANUP_MAX hours
# Remove all files, except GEMPAK files, older than SELECTIVE_CLEANUP_MAX hours, based on gempak_exclude_string
# Remove all files older than the max of all *_CLEANUP_MAX variables
# Retain all files newer than SELECTIVE_CLEANUP_MIN hours

if [[ "${DO_GEMPAK}" == "YES" && "${RUN}" == "gfs" ]]; then
export RMOLDSTD=346
export RMOLDEND=222
export SELECTIVE_CLEANUP_MAX=144
export SELECTIVE_CLEANUP_MIN=24
export GEMPAK_CLEANUP_MAX=240

Contributor Author:

@ChristopherHill-NOAA Can you verify that this is the correct setting for the *gfs_1p00_* files needed for the GEMPAK meta task? This will delete the target files that are older than 240 hours before the current cycle at the end of the cycle (e.g. if the current cycle is 2022012000, delete *gfs_1p00_* files from cycles before 2022011000).

ChristopherHill-NOAA (Contributor), Dec 1, 2025:

@DavidHuber-NOAA This is generally correct - only the *gfs_1p00* files are considered as being needed by the GEMPAK meta task as early as -240 hours before the active cycle, and such data older than -240 hours can be deleted.

From the COMROOT/C96_atm3DVar_extended_gempak test case, it is observed that only the products/atmos/gempak/1p00 subdirectories and their content (i.e. *gfs_1p00* files) are retained within the cycle range of 2022010300 to 2022010718 -- which are -168 and -54 hours relative to the last cycle 2022011000 of the test case, respectively. Before and after this cycle range, all (spoofed) subdirectories and files under products/atmos remain present.

ChristopherHill-NOAA (Contributor), Dec 1, 2025:

It is supposed that the desired effect is a) to retain only *gfs_1p00* files within the range of {GEMPAK_CLEANUP_MAX} to {SELECTIVE_CLEANUP_MIN} and b) to delete those *gfs_1p00* files older than {GEMPAK_CLEANUP_MAX}, since the value of GEMPAK_CLEANUP_MAX is assumed to be greater than the value of SELECTIVE_CLEANUP_MAX. It is further observed that the products/atmos/gempak/1p00/gfs_1p00* data content remains present in directories representing cycles earlier than the value of {GEMPAK_CLEANUP_MAX}. It is noted that the SELECTIVE_CLEANUP_{MIN/MAX} variables are intended only to focus file cleanup over a limited range of cycles.

Contributor Author:

@ChristopherHill-NOAA Thank you for checking the directories. I had created a bug during a merge after testing and did not thoroughly test again. I have now fixed that bug and further improved the scripting. Would you mind checking the COM directory again?

For clarity:

  • My test case runs from cycle 2021122018 through 2022011000. I have spoofed data from 2021122200 through 2022011000.
  • I am only running my test on the last cycle (2022011000).
  • I am working on the assumption that when files are needed for the GEMPAK meta task, 240 hours (40 cycles) worth of products/atmos/gempak/1p00/gfs_1p00* files are required (i.e. back to cycle 2021123100).
  • prepbufr, cnvstat.tar, and analysis.atm.a*.nc files should be retained for 144 hours (24 cycles, or back to 2022010400)
  • The cleanup task starts looking in directories 24 hours older than the maximum retention period (240 hours in the RUN=gfs case) meaning 264 hours before 2022011000, or cycle 2021123000.
  • Directories older than 240 hours (i.e. gfs.20211230/00, gfs.20211230/06, gfs.20211230/12, and gfs.20211230/18) should be deleted
  • Empty RUN.PDY directories (i.e. gfs.20211230) should also be deleted

Please correct me if I am wrong about any of these bullets. If everything looks correct, I will run another test for cycle 2022011006 to verify that the cleanup increments by a cycle for all excepted files.
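
The retention arithmetic in the bullets above can be checked with the same `date --utc -d` idiom that exglobal_cleanup.sh uses. This is only a sketch: it assumes GNU date and a 6-hour cycle interval, and `retention_cutoff` is a hypothetical helper name that does not exist in the workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): print the YYYYMMDDHH cycle
# that lies a given number of hours before a cycle. Assumes GNU date.
retention_cutoff() {
    local pdy=$1 cyc=$2 hours=$3
    date --utc +%Y%m%d%H -d "${pdy} ${cyc} -${hours} hours"
}

# Values taken from the bullets above, for the last cycle 2022011000
gempak_cutoff=$(retention_cutoff 20220110 00 240)     # 240 h / 6 h per cycle = 40 cycles
selective_cutoff=$(retention_cutoff 20220110 00 144)  # 144 h / 6 h per cycle = 24 cycles
scan_start=$(retention_cutoff 20220110 00 264)        # 24 h beyond the 240 h maximum

echo "GEMPAK cutoff:    ${gempak_cutoff}"
echo "Selective cutoff: ${selective_cutoff}"
echo "Scan start:       ${scan_start}"
```

If the bullets are right, the three values should come out as 2021123100, 2022010400, and 2021123000 respectively.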

ChristopherHill-NOAA (Contributor), Dec 2, 2025:

The bullet points appear to be correct, and representative of the intended effort to clean all but *gfs_1p00* files out to -240 hours and to clean those remaining *gfs_1p00* files beyond -240 hours.

From the latest test, it is observed that the gfs.20211230 directory is missing, and assumed to have been deleted by the cleanup script. The *gfs_1p00* data remain for cycles 2021123100 through 2022010900, with the directories representing cycles 2021122918 and older being unaffected.

Some of the *gfs_1p00* files are unexpectedly missing under each of the cycle 18 directories. Specifically missing from gfs.yyyymmdd/18/products/atmos/gempak/1p00 are *gfs_1p00* files with the following suffixes:
f003, f015, f027, f033, f039, f042, f045, f051, f054
These files may have been inadvertently deleted, or were not captured in the spoofing scheme.

Contributor Author:

@ChristopherHill-NOAA I believe this was caused by the ptmp scrubber. I have rerun the test case from the beginning, i.e.

  • starting at 2021122018 and ending on 2022011000
  • spoofed data from 2021122200 through 2022011000
    • The filenames were mimicked by inventorying the contents of gfs.20211221 and gdas.20211221
  • The gdas_cleanup and gfs_cleanup jobs were then run on 2022011000
    • This resulted in the removal of all GFS cycles on 20211230 and all GDAS cycles on 20220103 from the COM directory
  • Note that the forecasts now go out to 384 hours instead of 240

I then verified that the *gfs_1p00* files were present at/after cycle 2021123100 at all forecast hours.

You can find the COM directory here: /lfs/h2/emc/ptmp/david.huber/rt_gempak3/COMROOT/C96_atm3DVar_extended_gempak3
and the associated EXPDIR here: /lfs/h2/emc/ptmp/david.huber/rt_gempak3/EXPDIR/C96_atm3DVar_extended_gempak3

Does this look good to you?

Contributor:

@DavidHuber-NOAA The latest workflow run - with modified GEMPAK file cleanup through case C96_atm3DVar_extended_gempak3 - appears to provide the desired results. Each of the expected files is confirmed to be present in the gfs.yyyymmdd/cc/products/atmos/gempak/1p00 directories over the cycle range of 2021123100 to 2022010900 following invocation of exglobal_cleanup.sh with GEMPAK_CLEANUP=240.

Contributor Author:

Great, thank you for confirming @ChristopherHill-NOAA!

export RTOFS_CLEANUP_MAX=48
if [[ ${SELECTIVE_CLEANUP_MIN} -gt ${SELECTIVE_CLEANUP_MAX} ]]; then
echo "FATAL ERROR: Invalid selective cleanup times: "
echo " SELECTIVE_CLEANUP_MIN=${SELECTIVE_CLEANUP_MIN} > SELECTIVE_CLEANUP_MAX=${SELECTIVE_CLEANUP_MAX}"
exit 1
fi

# Specify the list of files to exclude from the first stage of cleanup
# Because arrays cannot be exported, list is a single string of comma-
# separated values. This string is split to form an array at runtime.
export exclude_string=""
case ${RUN} in
gdas | gfs) exclude_string="*prepbufr*, *cnvstat.tar*, *analysis.atm.a*.nc" ;;
enkf*) exclude_string="*f006.ens*" ;;
gdas | gfs) exclude_string+="*prepbufr*, *cnvstat.tar*, *analysis.atm.a*.nc" ;;
enkf*) exclude_string+="*f006.ens*" ;;
*) exclude_string="" ;;
esac
export exclude_string

echo "END: config.cleanup"
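
The comment above notes that arrays cannot be exported, so the exclude list travels as one comma-separated string and is split at runtime. A minimal sketch of that round trip follows, using the same `IFS=", " read -r -a` idiom that appears in exglobal_cleanup.sh; `split_patterns` and the global array `patterns` are hypothetical names chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch: pass a list through the environment as one string, then rebuild
# the array on the receiving side. split_patterns is a hypothetical helper.
split_patterns() {
    local pattern_string=$1
    # Both "," and " " act as separators; read does not glob-expand the
    # patterns, so "*prepbufr*" stays literal. Sets the global array `patterns`.
    IFS=", " read -r -a patterns <<< "${pattern_string}"
}

export exclude_string="*prepbufr*, *cnvstat.tar*, *analysis.atm.a*.nc"
split_patterns "${exclude_string}"

echo "Number of patterns: ${#patterns[@]}"
for pattern in "${patterns[@]}"; do
    echo "  ${pattern}"
done
```

One consequence of this design is that a pattern containing a space or a comma would be split into separate entries, so the patterns have to avoid both characters.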
100 changes: 66 additions & 34 deletions scripts/exglobal_cleanup.sh
@@ -6,78 +6,107 @@
# Remove DATAoutput from the forecast model run
# TODO: Handle this better
DATAfcst="${DATAROOT}/${RUN}fcst.${PDY:-}${cyc}"
if [[ -d "${DATAfcst}" ]]; then rm -rf "${DATAfcst}"; fi
if [[ -d "${DATAfcst}" ]];
then rm -rf "${DATAfcst}";
fi
#DATAefcs="${DATAROOT}/${RUN}efcs???${PDY:-}${cyc}"
rm -rf "${DATAROOT}/${RUN}efcs"*"${PDY:-}${cyc}"
###############################################################

if [[ "${CLEANUP_COM:-YES}" == NO ]] ; then
exit 0

Check warning (Code scanning / shellcheck): purge_every_days appears unused. Verify use (or export if used externally).
fi

SELECTIVE_CLEANUP_MIN=${SELECTIVE_CLEANUP_MIN:-24}
SELECTIVE_CLEANUP_MAX=${SELECTIVE_CLEANUP_MAX:-120}
RTOFS_CLEANUP_MAX=${RTOFS_CLEANUP_MAX:-48}
GEMPAK_CLEANUP_MAX=${GEMPAK_CLEANUP_MAX:-240}
###############################################################
# Clean up previous cycles; various depths

# Step back every assim_freq hours and remove old rotating directories
# for successful cycles (defaults from 24h to 120h).
# Retain files needed by Fit2Obs
last_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${RMOLDEND:-24} hours")
first_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${RMOLDSTD:-120} hours")
last_rtofs=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${RMOLDRTOFS:-48} hours")
last_selective_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${SELECTIVE_CLEANUP_MIN} hours")
first_selective_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${SELECTIVE_CLEANUP_MAX} hours")
last_rtofs_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${RTOFS_CLEANUP_MAX} hours")
last_gempak_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${GEMPAK_CLEANUP_MAX} hours")
exclude_string="${exclude_string:-}"

# Find the last date among all cleanup targets
max_cleanup_max="${SELECTIVE_CLEANUP_MAX:-120}"
for cleanup_max in "${RTOFS_CLEANUP_MAX}" "${GEMPAK_CLEANUP_MAX}"; do
if [[ ${cleanup_max} -gt ${max_cleanup_max} ]]; then
max_cleanup_max=${cleanup_max}
fi
done

last_date=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${max_cleanup_max} hours")

function remove_files() {
local directory=$1
shift
if [[ ! -d ${directory} ]]; then
echo "No directory ${directory} to remove files from, skipping"
return
fi
local find_exclude_string=""
for exclude in "$@"; do
find_exclude_string+="${find_exclude_string} -name ${exclude} -or"
# Find all files and links in the directory and store as an array
# Run find only once for efficiency
flist=($(find "${directory}" -type f -or -type l))

# Now remove those files that match the exclude patterns
for exclude_pattern in "$@"; do
# Use a temporary array to hold files that do not match the exclude pattern
temp_flist=()
for file in "${flist[@]}"; do
case "$(basename "${file}")" in
${exclude_pattern})
# Match found, skip this file
;;
*)
# No match, keep this file
temp_flist+=("${file}")
;;
esac
done
flist=("${temp_flist[@]}")
done
# Chop off any trailing or
find_exclude_string="${find_exclude_string[*]/%-or}"
# Remove all regular files that do not match
# shellcheck disable=SC2086
if [[ -n "${find_exclude_string}" ]]; then
# String is non-empty → use exclusion
find "${directory}" -type f -not \( ${find_exclude_string} \) -ignore_readdir_race -delete
else
# String is empty → no exclusion
find "${directory}" -type f -ignore_readdir_race -delete
fi

# Remove all symlinks that do not match
# shellcheck disable=SC2086
if [[ -n "${find_exclude_string}" ]]; then
# String is non-empty → use exclusion
find "${directory}" -type l -not \( ${find_exclude_string} \) -ignore_readdir_race -delete
else
# String is empty → no exclusion
find "${directory}" -type l -ignore_readdir_race -delete
fi
# Delete all files in flist.
for file in "${flist[@]}"; do
rm -f "${file}"
done

# Remove any empty directories
find "${directory}" -type d -empty -delete
}

for (( current_date=first_date; current_date <= last_date; \
# Now start removing old COM files/directories
for (( current_date=first_selective_date; current_date <= last_date; \
current_date=$(date --utc +%Y%m%d%H -d "${current_date:0:8} ${current_date:8:2} +${assim_freq} hours") )); do
current_PDY="${current_date:0:8}"
current_cyc="${current_date:8:2}"
rtofs_dir="${ROTDIR}/rtofs.${current_PDY}"
rocotolog="${EXPDIR}/logs/${current_date}.log"

# Extend the exclude list for gempak files if needed
if [[ "${RUN}" == "gfs" && ${current_date} -lt ${last_gempak_date} && "${DO_GEMPAK}" == "YES" ]]; then
# Provide the gempak exclude pattern(s)
exclude_string+=", *gfs_1p00_*"
fi

# Check if the cycle completed successfully by looking at the rocoto log
if [[ -f "${rocotolog}" ]]; then
# TODO: This needs to be revamped to not look at the rocoto log.
# shellcheck disable=SC2312
if [[ $(tail -n 1 "${rocotolog}") =~ "This cycle is complete: Success" ]]; then
YMD="${current_PDY}" HH="${current_cyc}" declare_from_tmpl \
COMOUT_TOP:COM_TOP_TMPL
if [[ -d "${COMOUT_TOP}" ]]; then
IFS=", " read -r -a exclude_list <<< "${exclude_string:-}"
IFS=", " read -r -a exclude_list <<< "${exclude_string}"
remove_files "${COMOUT_TOP}" "${exclude_list[@]:-}"
fi
if [[ -d "${rtofs_dir}" ]] && (( current_date < last_rtofs )); then rm -rf "${rtofs_dir}" ; fi
# Remove all rtofs directories in each RUN older than last_rtofs_date
rtofs_dir="${ROTDIR}/rtofs.${current_PDY}"
if [[ -d "${rtofs_dir}" ]] && (( current_date < last_rtofs_date )); then rm -rf "${rtofs_dir}" ; fi
fi
fi
done
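
The pattern filtering in remove_files relies on an unquoted variable in a `case` arm being treated as a glob. The sketch below isolates that idea in a hypothetical helper, `matches_any`, which tests one file name against every pattern in a single pass instead of rebuilding the file list once per pattern. It is not the PR's implementation, and the file names are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if a file name matches any of the glob patterns.
matches_any() {
    local name=$1
    shift
    local pattern
    for pattern in "$@"; do
        # Left unquoted on purpose so the case arm treats it as a glob
        # shellcheck disable=SC2254
        case "${name}" in
            ${pattern}) return 0 ;;
            *) ;;
        esac
    done
    return 1
}

# Made-up file names for illustration only
patterns=("*prepbufr*" "*cnvstat.tar*" "*gfs_1p00_*")
for name in "gfs.t00z.prepbufr" "gfs_1p00_2022011000f003" "gfs.t00z.atmf006.nc"; do
    if matches_any "${name}" "${patterns[@]}"; then
        echo "keep   ${name}"
    else
        echo "delete ${name}"
    fi
done
```

Calling the helper with no patterns returns non-zero, so every file would be treated as deletable, which mirrors the empty exclude list case.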
@@ -108,13 +137,16 @@
fi

# Remove $RUN.$rPDY for the older of GDATE or RDATE
GDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${RMOLDSTD:-120} hours")
GDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${max_cleanup_max} hours")
RDATE=$(date --utc +%Y%m%d%H -d "${PDY} ${cyc} -${FHMAX_GFS} hours")
if (( GDATE < RDATE )); then
RDATE=${GDATE}
fi

deletion_target="${ROTDIR}/${RUN}.${RDATE:0:8}"
if [[ -d ${deletion_target} ]]; then rm -rf "${deletion_target}"; fi
if [[ -d "${deletion_target}" ]]; then
rm -rf "${deletion_target}"
fi

# sync and wait to avoid filesystem synchronization issues
sync && sleep 1
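
GDATE above is derived from max_cleanup_max, the largest of the retention windows. A minimal sketch of that selection as a reusable function is shown here; `max_of` is a hypothetical name, and the values are the ones set in config.cleanup for the RUN=gfs GEMPAK case.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the largest of one or more integer arguments.
max_of() {
    local max=$1
    shift
    local value
    for value in "$@"; do
        if (( value > max )); then
            max=${value}
        fi
    done
    printf '%s\n' "${max}"
}

SELECTIVE_CLEANUP_MAX=144
RTOFS_CLEANUP_MAX=48
GEMPAK_CLEANUP_MAX=240

max_cleanup_max=$(max_of "${SELECTIVE_CLEANUP_MAX}" "${RTOFS_CLEANUP_MAX}" "${GEMPAK_CLEANUP_MAX}")
echo "Largest retention window: ${max_cleanup_max} hours"
```

The arithmetic comparison assumes plain decimal integers; a value with a leading zero would be read as octal by bash.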
4 changes: 2 additions & 2 deletions ush/preamble.sh
@@ -33,8 +33,8 @@ declare -x PS4='+ $(basename ${BASH_SOURCE[0]:-${FUNCNAME[0]:-"Unknown"}})[${LIN

set_strict() {
if [[ ${STRICT:-"YES"} == "YES" ]]; then
# Exit on error and undefined variable
set -eu
# Exit on error, undefined variable, or error in a pipeline (e.g. if any command in "cmd | cmd2" fails)
set -euo pipefail
fi
}
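
The effect of adding `pipefail` can be seen in isolation with a pipeline whose first command fails and whose last command succeeds. This is a small sketch that runs each case in a separate `bash -c` so the option does not leak into the calling shell; the variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Without pipefail, a pipeline reports the status of its last command only
without_pipefail=$(bash -c 'false | true; echo $?')

# With pipefail, a failure anywhere in the pipeline is reported
with_pipefail=$(bash -c 'set -o pipefail; false | true; echo $?')

echo "without pipefail: ${without_pipefail}"
echo "with pipefail:    ${with_pipefail}"
```

Combined with `set -e`, the non-zero status in the second case would end the script, which is why piped commands that are allowed to fail need explicit handling once this option is on.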
