
example_submission_scripts #350

Draft: wants to merge 1 commit into main

Conversation

ligerzero-ai (Contributor) commented Oct 1, 2024

@jan-janssen as discussed

Summary by CodeRabbit

  • New Features

    • Introduced job submission scripts for high-performance computing environments, tailored to different systems (NCI Gadi CPU, NCI Gadi GPU, and Pawsey Setonix).
    • Each script is configured with specific resource requirements and software environments for running VASP simulations.
  • Documentation

    • Added comments and notes in scripts regarding storage access requirements and job parameters for clarity.


coderabbitai bot commented Oct 1, 2024

Walkthrough

The changes involve three job submission scripts for high-performance computing (HPC) environments, each tailored for different systems and configurations. The scripts specify job parameters such as resource allocations, software versions, and execution commands for the VASP software. They include directives for managing CPU and GPU resources, memory requirements, and output logging, ensuring compatibility with the respective computing environments.

Changes

Files: example_hpc_submission_scripts/australia_nci_gadi.sh, example_hpc_submission_scripts/australia_nci_gadi_gpu.sh, example_hpc_submission_scripts/australia_pawsey_setonix.sh

Change Summary: Each script is a job submission script for an HPC environment, detailing resource requirements, software versions, execution commands for VASP, and output logging. The GPU script specifies additional parameters for GPU usage.
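For orientation, here is a minimal sketch of the kind of PBS submission script these files contain, assembled from directives quoted in the review below; the resource values are illustrative, not the committed ones.

#!/bin/bash
#PBS -l walltime=1:00:00   # wall-clock limit; increase for longer VASP runs
#PBS -l mem=10GB           # memory request
#PBS -l ncpus=8            # CPU cores
#PBS -l wd                 # run the job in the submission directory

module load vasp/5.4.4     # pin a specific VASP version for reproducibility

mpirun vasp_std > vasp.log 2>&1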

Poem

In the land of compute where the rabbits play,
Scripts are crafted for a busy day.
With VASP in tow, they hop and run,
Allocating resources, oh what fun!
Memory and CPUs, all in a line,
Hopping through jobs, making them shine! 🐇✨



coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)
example_hpc_submission_scripts/australia_pawsey_setonix.sh (1)

7-7: Consider using a generic job name for the example script

The current job name TSR_RATTLE_struct_1871_2_Mn_8.sh is very specific. For an example submission script, it might be more appropriate to use a generic name like vasp_job or include a placeholder (e.g., ${JOB_NAME}) that users can easily replace.

Would you like me to suggest a more generic job name or a way to make it easily customizable?
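For instance, the job name can be overridden at submission time without editing the script; a minimal sketch (-J is the standard sbatch job-name flag; the name here is just a placeholder):

sbatch -J vasp_job australia_pawsey_setonix.sh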

example_hpc_submission_scripts/australia_nci_gadi_gpu.sh (3)

1-8: Resource allocation looks good, consider fine-tuning memory.

The PBS directives for resource allocation are well-structured and appropriate for a GPU job on the NCI Gadi system. The walltime, CPU, and GPU requests are reasonable.

Consider fine-tuning the memory request if you find that your jobs consistently use less than 160GB. This could potentially improve queue times and resource utilization. Monitor your job's memory usage and adjust accordingly.


14-16: Helpful storage access note, consider making it more prominent.

The comment about storage access requirements is informative and includes a useful link to documentation. This is valuable information for users who need access to additional directories.

Consider making this note more prominent, perhaps by adding a clear TODO comment or moving it to the top of the script. This will ensure users don't overlook this important configuration step. For example:

# TODO: If job needs access to specific directories, uncomment and modify the following line:
# #PBS -l storage=scratch/ab12+gdata/yz98

18-18: Execution command is well-structured, consider adding error handling and job monitoring.

The mpirun command is well-constructed, using the correct number of processes and an appropriate mapping strategy for GPU jobs. Redirecting output to a log file is also good practice.

Consider enhancing the script with error handling and job monitoring:

  1. Add error checking:

     mpirun -np $PBS_NGPUS --map-by ppr:1:numa vasp_std-gpu >vasp.log 2>&1 || { echo "VASP job failed"; exit 1; }

  2. Implement basic job monitoring:

     echo "Job started at $(date)"
     mpirun -np $PBS_NGPUS --map-by ppr:1:numa vasp_std-gpu >vasp.log 2>&1 || { echo "VASP job failed"; exit 1; }
     echo "Job finished at $(date)"

  3. Consider adding a simple progress check, e.g., periodically echoing the size of the log file:

     (while true; do sleep 300; echo "Log file size: $(wc -c <vasp.log) bytes"; done) &
     mpirun -np $PBS_NGPUS --map-by ppr:1:numa vasp_std-gpu >vasp.log 2>&1 || { echo "VASP job failed"; exit 1; }
     kill $!  # kill the background progress-checking process

These additions will make the script more robust and informative.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed between the base of the PR (ea84b24) and 1ba9220.

📒 Files selected for processing (3)
  • example_hpc_submission_scripts/australia_nci_gadi.sh (1 hunks)
  • example_hpc_submission_scripts/australia_nci_gadi_gpu.sh (1 hunks)
  • example_hpc_submission_scripts/australia_pawsey_setonix.sh (1 hunks)
🧰 Additional context used
🪛 Shellcheck
example_hpc_submission_scripts/australia_pawsey_setonix.sh

[warning] 14-14: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🔇 Additional comments (8)
example_hpc_submission_scripts/australia_nci_gadi.sh (4)

12-13: Consider uncommenting and adjusting the storage directive if needed

The comment about including a PBS storage directive is informative. However, if your job actually requires access to specific storage paths, you should uncomment and adjust this directive.

For example, if you need access to /scratch/ab12/ and /g/data/yz98/, you would add:

#PBS -l storage=scratch/ab12+gdata/yz98

To check if your job might need access to specific storage paths, you could search for file paths in your VASP input files:

#!/bin/bash
# Description: Search for file paths in VASP input files

# Search for file paths in POSCAR, INCAR, KPOINTS, and POTCAR files
# (no -t type filter: these files have no extension, so rg's txt type would exclude them)
rg '/' POSCAR INCAR KPOINTS POTCAR

If you see paths outside the job's default accessible areas, you'll need to include the appropriate storage directive.


15-15: Consider adding error handling and job completion notification

The VASP execution command is correct, and redirecting output to a log file is good practice. However, consider enhancing the script with error handling and job completion notification. This can help with monitoring job status and debugging if issues occur.

Here's an example of how you could modify the execution line:

mpirun vasp_std > vasp.log 2>&1 || { echo "VASP execution failed"; exit 1; }
echo "VASP job completed successfully" | mail -s "VASP Job Status" [email protected]

This modification will:

  1. Capture both stdout and stderr in the log file.
  2. Exit with an error message if VASP execution fails.
  3. Send an email notification upon successful job completion.

To check if error handling is commonly used in other scripts, you could search for similar patterns:

#!/bin/bash
# Description: Search for error handling patterns in shell scripts

# Search for error handling patterns in shell scripts
rg -t sh '(set -e|exit 1|\|\| \{)'

# Search for job completion notification patterns
rg -t sh '(echo.*completed|mail -s)'

Consider implementing similar error handling and notification mechanisms if they're commonly used in other scripts in your repository.


1-7: Review and possibly adjust resource allocations

The resource allocations look reasonable for a small VASP job, but consider the following:

  1. The walltime of 1 hour might be too short for complex VASP calculations. Consider increasing it based on your specific use case.
  2. 10GB of memory might not be sufficient for larger systems. Adjust this based on your system size and complexity.
  3. While 8 CPUs is reasonable, VASP can often utilize more for better performance. Consider increasing if your calculations would benefit from more parallelization.

To help determine appropriate resource allocations, run a short test job and inspect its actual resource usage, as in the sketch below, then adjust the requests based on the results.
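A minimal sketch, assuming the PBS Pro scheduler used on Gadi ($PBS_JOBID is set automatically inside a job; substitute an explicit job ID when querying from a login node):

#!/bin/bash
# Show what the job has actually consumed (memory, CPU time, walltime)
qstat -f "$PBS_JOBID" | grep resources_used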


9-10: Consider checking for newer VASP versions

Loading a specific version (5.4.4) is good for reproducibility. However, this might not be the latest version of VASP. Newer versions could offer bug fixes, performance improvements, or new features.

To check available VASP versions on the system, you could query the module system:
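A minimal sketch, assuming the environment-modules setup standard on NCI systems:

module avail vasp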

Consider updating to the latest stable version if it's available and compatible with your calculations.

example_hpc_submission_scripts/australia_pawsey_setonix.sh (2)

2-2: Verify intentionality of commented SBATCH directives

The --nodes=1 and --exclusive SBATCH directives are currently commented out. If these are meant to be active, please uncomment them. If they're intentionally commented for flexibility, consider adding a comment explaining why.

Could you confirm if these lines should be active or if they're intentionally commented out for flexibility?

Also applies to: 12-12


15-15: Verify the necessity of unlimited stack size

Setting an unlimited stack size with ulimit -s unlimited is common in some HPC environments, but it's not always necessary and could potentially lead to issues if a program has a memory leak.

Could you confirm if VASP 5.4.4 specifically requires an unlimited stack size on the Setonix system? If not, consider either removing this line or setting a large but finite limit.
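If a finite limit is preferred, a minimal sketch (the value is illustrative, not a Setonix recommendation):

# ulimit -s takes kilobytes; 524288 KB = 512 MB
ulimit -s 524288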

example_hpc_submission_scripts/australia_nci_gadi_gpu.sh (2)

9-9: Good use of working directory directive.

The -l wd directive ensures the job runs in the submission directory, which is a best practice. It simplifies script management and reduces the chance of file path-related errors.


11-12: Good practice loading a specific VASP version, verify if it's the most appropriate.

Loading a specific version of VASP (6.2.1) is excellent for reproducibility. The comment is also helpful.

Please verify that 6.2.1 is the most appropriate VASP version for your needs on this system. You can check available versions by querying the module system:
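As in the earlier sketch, assuming an environment-modules setup:

module avail vasp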

Consider updating to a newer version if available and compatible with your research requirements.

From example_hpc_submission_scripts/australia_pawsey_setonix.sh:

#SBATCH --mem=32GB
##SBATCH --exclusive
module load vasp/5.4.4
cd "$PBS_O_WORKDIR"

⚠️ Potential issue

Consider using SLURM_SUBMIT_DIR and add error handling

  1. This script appears to be for a SLURM-based system (Setonix), but it's using PBS_O_WORKDIR. Consider using SLURM_SUBMIT_DIR instead, which is the SLURM equivalent.

  2. Add error handling to the cd command as suggested by Shellcheck.

Here's a suggested modification:

-cd "$PBS_O_WORKDIR"
+cd "$SLURM_SUBMIT_DIR" || { echo "Failed to change directory"; exit 1; }

This change ensures that the script exits if it fails to change to the correct directory, preventing potential issues with file access or job execution in the wrong location.

🧰 Tools
🪛 Shellcheck

[warning] 14-14: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

cd "$PBS_O_WORKDIR"
ulimit -s unlimited
run_cmd="srun --export=ALL -N 1 -n 32"
$run_cmd vasp_std &> vasp.log

🛠️ Refactor suggestion

Enhance error handling and output management

While the current command will capture both stdout and stderr in vasp.log, consider implementing more robust error handling and output management. Here are some suggestions:

  1. Separate stdout and stderr for easier debugging.
  2. Add a check for the exit status of the VASP run.
  3. Consider using a timestamp in the log filename to prevent overwriting in case of multiple runs.

Here's a suggested modification:

-$run_cmd vasp_std &> vasp.log
+log_file="vasp_$(date +%Y%m%d_%H%M%S).log"
+$run_cmd vasp_std > "$log_file" 2>&1
+exit_status=$?
+if [ $exit_status -ne 0 ]; then
+    echo "VASP run failed with exit status $exit_status. Check $log_file for details." >&2
+    exit $exit_status
+fi

This change separates stdout and stderr (but still captures both in the log file), checks the exit status of the VASP run, uses a timestamp in the log filename, and provides an informative error message if the run fails.


jan-janssen marked this pull request as draft on October 30, 2024, 10:21