
Add HiPerGator scripts for automated dataset release validation and reporting #47


Draft
wants to merge 12 commits into
base: main
from

Conversation

Copilot

@Copilot Copilot AI commented Jun 30, 2025

This PR implements a complete solution for periodically validating MillionTrees dataset releases on HiPerGator infrastructure. The scripts automate running release tests and generate comprehensive reports with dataset size information to confirm that releases are well-structured and accessible.

What's Added

Core Scripts

  • submit_test_release.sh - SLURM submission script configured for HiPerGator (32 GB RAM, 24-hour time limit, GPU partition)
  • generate_dataset_report.py - Standalone Python script that generates comprehensive dataset reports with version history and size tracking
  • HIPERGATOR_TESTING.md - Complete documentation covering setup, usage, and scheduling
  • test_hipergator_scripts.py - Validation test suite to ensure scripts work correctly

Key Features

Automated Dataset Validation:

  • Runs all tests in tests/test_release.py with full pytest output capture
  • Validates dataset downloads, structure, format, and PyTorch dataloader functionality
  • Proper error handling and exit code propagation for SLURM job status
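As a minimal sketch of the exit code propagation described above (not the code in `submit_test_release.sh`, which is a shell script), a wrapper could run the tests, write the combined output to a log file, and hand the exit code back to the caller so that SLURM records a failed job when tests fail. The helper name `run_and_log` and the log path are hypothetical; the pytest arguments are the ones quoted in this PR's firewall warning.

```python
import subprocess
import sys
from pathlib import Path


def run_and_log(cmd, log_path):
    """Run cmd, write combined stdout/stderr to log_path, return the exit code."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        # stderr is merged into stdout so the log keeps the original ordering.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# pytest invocation as quoted in this PR; the log location is an assumption.
PYTEST_CMD = [sys.executable, "-m", "pytest", "tests/test_release.py", "-v", "--tb=short"]

# Typical use inside a job script (not executed here):
#     sys.exit(run_and_log(PYTEST_CMD, "logs/pytest_output.log"))
```

Returning the code rather than calling `sys.exit` inside the helper leaves room to generate the report before the job exits.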

Comprehensive Reporting:

  • Extracts dataset size information from all versions (current total: ~77.86 GB)
  • Generates markdown reports with download URLs, compressed sizes, and test results
  • Tracks dataset growth over time for infrastructure planning

Robust Logging:

  • Creates timestamped log directories with organized output files
  • Maintains latest symlinks for easy access to recent results
  • Captures both test results and report generation logs
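The timestamped directory plus `latest` symlink pattern can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the function name `make_log_dir`, the timestamp format, and the link name `latest` are all hypothetical choices.

```python
from datetime import datetime
from pathlib import Path


def make_log_dir(base_dir, now=None):
    """Create base_dir/<timestamp>/ and point base_dir/latest at it."""
    now = now or datetime.now()
    base = Path(base_dir)
    run_dir = base / now.strftime("%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=False)

    latest = base / "latest"
    # is_symlink() is also true for a dangling link, so stale links are replaced.
    if latest.is_symlink():
        latest.unlink()
    # A relative target keeps the link valid if base_dir is moved as a whole.
    latest.symlink_to(run_dir.name, target_is_directory=True)
    return run_dir
```

Each run gets its own directory, and `latest` always resolves to the most recent one.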

Dataset Size Summary

| Dataset | Latest Version | Compressed Size |
| --- | --- | --- |
| TreePolygons | 0.2 | 70.24 GB |
| TreePoints | 0.2 | 1.36 GB |
| TreeBoxes | 0.2 | 6.26 GB |
| Total | - | 77.86 GB |
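The total in the size summary above can be checked, and the table rendered, with a short sketch like the one below. The sizes are copied from this PR; the function names and the dictionary layout are assumptions for illustration and may not match `generate_dataset_report.py`.

```python
# Compressed sizes in GB, copied from the Dataset Size Summary in this PR.
SIZES_GB = {
    "TreePolygons": ("0.2", 70.24),
    "TreePoints": ("0.2", 1.36),
    "TreeBoxes": ("0.2", 6.26),
}


def total_size_gb(sizes):
    """Sum the per-dataset sizes, rounded to two decimals."""
    return round(sum(size for _version, size in sizes.values()), 2)


def render_markdown_table(sizes):
    """Render the size summary as a markdown table with a total row."""
    lines = [
        "| Dataset | Latest Version | Compressed Size |",
        "| --- | --- | --- |",
    ]
    for name, (version, size) in sizes.items():
        lines.append(f"| {name} | {version} | {size:.2f} GB |")
    lines.append(f"| Total | - | {total_size_gb(sizes):.2f} GB |")
    return "\n".join(lines)
```

Deriving the total from the per-dataset rows keeps the reported total in step with the rows when a new version is added.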

Usage

On HiPerGator

sbatch submit_test_release.sh

Generate Reports Locally

python generate_dataset_report.py
python generate_dataset_report.py --test-exit-code 0  # include test results
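For readers wondering how the optional `--test-exit-code` flag might feed into the report, here is a hedged sketch. Only the flag name comes from this PR; the parser setup, the helper `status_line`, and the wording of the status text are assumptions and may differ from the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the flag is optional, so the default is None."""
    parser = argparse.ArgumentParser(description="Generate a dataset report.")
    parser.add_argument(
        "--test-exit-code",
        type=int,
        default=None,
        help="Exit code from the release test run; omit to skip test results.",
    )
    return parser.parse_args(argv)


def status_line(exit_code):
    """Translate an optional exit code into a one-line report entry."""
    if exit_code is None:
        return "Tests: not run"
    if exit_code == 0:
        return "Tests: PASSED"
    return f"Tests: FAILED (exit code {exit_code})"
```

Keeping `None` distinct from `0` lets a locally generated report say that tests were not run instead of implying that they passed.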

Scheduling

The job can be scheduled via cron for weekly validation (the entry below runs every Sunday at 02:00):

0 2 * * 0 cd /home/b.weinstein/MillionTrees && sbatch submit_test_release.sh
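The five cron fields in this entry are minute, hour, day of month, month, and day of week, so `0 2 * * 0` fires at 02:00 every Sunday. As a sanity check on that reading, the sketch below computes the next matching time for this one schedule; `next_weekly_run` is a hypothetical helper, not part of the PR, and it does not parse cron syntax in general.

```python
from datetime import datetime, timedelta


def next_weekly_run(now):
    """Return the next Sunday 02:00 strictly after `now`."""
    # datetime.weekday(): Monday is 0 and Sunday is 6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = now.replace(hour=2, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        # Already past 02:00 on a Sunday, so move to the following week.
        candidate += timedelta(days=7)
    return candidate
```

For example, a call with Monday, June 30, 2025 at 15:29 gives Sunday, July 6, 2025 at 02:00.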

Benefits

  • Quality Assurance: Ensures dataset releases remain accessible and properly formatted
  • Infrastructure Planning: Tracks dataset growth for storage and bandwidth planning
  • Automated Monitoring: Provides regular validation of the MillionTrees benchmark integrity
  • Documentation: Generates up-to-date reports with current dataset information

The scripts are ready for deployment on HiPerGator and provide a comprehensive solution for maintaining dataset quality assurance through automated testing and reporting.

Fixes #46.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • data.rc.ufl.edu
    • Triggering command: python -m pytest tests/test_release.py -v --tb=short (dns block)

If you need me to access, download, or install something from one of these locations, you can either:



@Copilot Copilot AI changed the title [WIP] Hipergator tests for test/test_release Add HiPerGator scripts for automated dataset release validation and reporting Jun 30, 2025
@Copilot Copilot AI requested a review from bw4sz June 30, 2025 15:29
Copilot finished work on behalf of bw4sz June 30, 2025 15:29
@bw4sz
Collaborator

bw4sz commented Jun 30, 2025

@ethanwhite do you see this?

@ethanwhite
Member

@bw4sz - yes, I see it

bw4sz and others added 7 commits June 30, 2025 16:29
* Create AutoArborist for automated tree imagery and annotation generation

* Implement AutoArborist data preparation for tree location annotations

* split into 100m crops

* rebase main

* merging auto-arborist

* remove bloomington data

* mend

* checking out separate branch for tess

* Delete data_prep/AutoArborist/AutoArborist_summary.md

* Delete data_prep/AutoArborist/calgary_annotations.csv

---------

Co-authored-by: Cursor Agent <[email protected]>
Co-authored-by: Josh Veitch-Michaelis <[email protected]>
Development

Successfully merging this pull request may close these issues.

Hipergator tests for test/test_release
3 participants