Add python 3.11 support #26566
Conversation
Job Documentation on 1aa97f5 wanted to post the following: View the site here. This comment will be updated on new commits.

Job Coverage on 1aa97f5 wanted to post the following: Framework coverage: coverage did not change. Modules coverage: coverage did not change. Full coverage reports: Reports. This comment will be updated on new commits.
Job Apptainer build GCC min on 5433882: invalidated by @milljm. Error during download/extract/detection of FBLASLAPACK.
There are a lot of fixes required in order to make Python 3.11 work...
On Linux, there does not appear to be a dependency solution for Python 3.11 together with HDF5=1.12.x=mpi_mpich* and VTK. If we bump HDF5 to 1.14, there is a solution, but historically doing so takes a lot of work. Edit:
- {{ moose_libclang13 }}
- {{ moose_ld64 }} # [osx]
- {{ moose_libclang }} # [osx]
- {{ moose_libclang13 }} # [osx]
really not sure how we've been successful while this was in place (look at lines 42 and 43 above!)
how does this libclang versioning work? Previously we were using clang 12 but we link libclang13, and now clang 15 is still linked with libclang13?
- {{ moose_libgfortran }}
- {{ moose_libgfortran5 }}
- {{ moose_hdf5 }}
- hdf5=1.14.3=mpi*
This is now necessary. conda_build_config.yaml does not support variant wildcards (as far as I know).
I'm curious whether you run into this failure. It has always failed for me with my system 3.11 python.

Yes, this PR certainly does run into this... https://civet.inl.gov/job/2004305/
I'm glad it's not just my environment 😅 |
Force-pushed from e4721b1 to 7d43667.
#go to the beginning of the file and see if its actually started running the binary
out_dupe.seek(0)
output = out_dupe.read()
if not out_dupe.closed:
I am very concerned that this works. I would expect the inability to read the file to cause this test to fail. But it does not. So I don't understand what is going on here.
Full disclosure: I worked on patching a bug with signal checkpoints, but didn't touch this send_signal function and don't think I fully understand it. That being said, shouldn't the time.sleep, os.kill, and break lines below be indented to be under the if not out_dupe.closed check? My assumption is that if out_dupe is already closed, then the moose binary has finished running. In that case, os.kill to a nonexistent PID should raise a ProcessLookupError. If this isn't happening, and the tests are always passing, my guess is that this code is always executed when the binary is running and out_dupe is not closed. I will poke around at this a bit more.
I have simplified the logic into the attached script (pid_example.txt: https://github.com/idaholab/moose/files/14030841/pid_example.txt). The 'test' is running `echo test ; sleep 1`.

When the script is run with line 37 commented and line 38 uncommented (i.e., time.sleep(5)), os.kill is still able to send a signal to the pid even though the process should have already terminated (the timer gives a wait time > 5 s). However, when the script is run with line 37 uncommented and line 38 commented (i.e., os.waitpid), os.kill is not able to send the signal because the process has already terminated, and the timer gives a wait time on the order of 1 s. So it seems reasonable that using time.sleep(5) without waitpid would give the subprocess PLENTY of time to terminate. I am not sure how to explain this behavior. Are you? @permcody?
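For reference, here is a minimal reconstruction of the experiment described above. This is not the attached pid_example.txt itself; the child command, the 5 s sleep, and the choice of SIGUSR1 are assumptions made for illustration:

```python
import os
import signal
import subprocess
import time

# Stand-in "test": prints and then sleeps for one second, like the
# `echo test ; sleep 1` command described above.
proc = subprocess.Popen("echo test ; sleep 1", shell=True)

start = time.monotonic()

# Variant 1: sleep well past the child's lifetime without reaping it.
time.sleep(5)

# Variant 2: swap the sleep above for a blocking reap instead.
# os.waitpid(proc.pid, 0)

print(f"waited {time.monotonic() - start:.1f} s")

try:
    # With Variant 1 the child has exited but lingers as an unreaped
    # zombie, so its PID still exists and os.kill() succeeds (the signal
    # simply has no effect). With Variant 2 the child has been reaped,
    # the PID is gone, and this raises ProcessLookupError.
    os.kill(proc.pid, signal.SIGUSR1)
    print("os.kill succeeded")
except ProcessLookupError:
    print("os.kill failed: the process has already been reaped")
```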
I myself don't know how to explain anything. All I know is that ever since I went to Python 3.11 on my machine, this test has failed every time I run run_tests, and I'll be really happy when it doesn't fail anymore.
This may have to do with reaping an exited process. When a process terminates it doesn't actually go away; it hangs around until the parent process closes it out. Until that occurs, the “dead” process is left in a “zombie” state. We can take a look at this tomorrow and get to the bottom of it.
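A small, hypothetical sketch of that reaping behaviour (placeholder command, not MOOSE code): a non-blocking os.waitpid both collects the zombie and lets the caller distinguish "still running" from "already exited".

```python
import os
import signal
import subprocess
import time

proc = subprocess.Popen(["sleep", "1"])  # placeholder for the real binary
time.sleep(2)  # the child has exited by now, but is a zombie until reaped

# Non-blocking reap: returns (0, 0) while the child is still running,
# or (pid, status) once the zombie has been collected.
pid, status = os.waitpid(proc.pid, os.WNOHANG)
if pid == 0:
    print("child is still running")
else:
    print(f"child reaped, exit code {os.waitstatus_to_exitcode(status)}")
    try:
        os.kill(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        print("the PID no longer exists once the zombie has been reaped")
```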
Well the test just failed on Intel Mac (https://civet.inl.gov/job/2016334/). It usually passes.
misc/signal_handler.test_signal_parallel: Working Directory: /opt/civet1/build/moose/test/tests/misc/signal_handler
misc/signal_handler.test_signal_parallel: Running command: mpiexec -n 3 /opt/civet1/build/moose/test/moose_test-opt -i simple_transient_diffusion_scaled.i Executioner/num_steps=70 --error --error-override --timing Outputs/perf_graph=true --no-gdb-backtrace
misc/signal_handler.test_signal_parallel:
misc/signal_handler.test_signal_parallel:
misc/signal_handler.test_signal_parallel: Exit Code: 30
misc/signal_handler.test_signal_parallel: ################################################################################
misc/signal_handler.test_signal_parallel: Tester failed, reason: CRASH
misc/signal_handler.test_signal_parallel:
[1.090s] FAIL misc/signal_handler.test_signal_parallel FAILED (CRASH) [min_cpus=3]
Let’s dive into this tomorrow
@@ -1 +1 @@
Subproject commit 7c03afd53c9c7fd7fcb6e3e20bae37cf1b124986
Subproject commit 31948b018e9bea83c138035e952d48065458ba4a
Caution! This contains a submodule update
Force-pushed from b7a32e3 to 2d9b88f.
That was fun. I rebased against my origin, forgetting that I had included a merge from Casey's origin in the mix (his PETSc update), and also forgetting that Casey's branch was merged into next. So... I redid it all, rebasing against upstream/next properly.
Force-pushed from 307c96c to 8032089.
Another git adventure brought to you by milljm. Thank you @cticenhour for helping me sort it out!
Because of Zscaler, the recipe does download VTK. Need to push so I can pull onto another machine... so lame.
Having a pin for libclang only for OSX targets was a mistake. Remove petsc `make check` for now... Add CURL_CA_BUNDLE; source: url: paths use `curl`, and therefore we need to pass CA certs.
And here are the Mac Intel dependency pins...
During configure of libMesh package, we need zlib
Add logic which allows the method to continue if the file(s) are not closed.
Python 3.11 now requires you to place your global regex flags at the beginning of the regular expression (see the regex sketch after this commit list).
As Casey joked: "Don't use MPI, ESPECIALLY don't use MPI.......BUT maybe here, sure, use MPI." I love it.
The main part of this commit was reworking the logic in SignalTester.send_signal. Namely, the check to see whether the moose binary has started running and writing to the output file was changed from an approach that copied and read the output file to a much cheaper approach that simply queries the current position in the output file (see the sketch after this commit list). Additionally, the 'signal' parameter previously did not do anything; this has been patched. Finally, the 'sleep_time' parameter has been removed because it was not deemed necessary to the testing process (there should be no such thing as a signal being sent too early after the binary has started to run; the checkpoint will simply be output at the end of the next time step).
Update documentation template configuration with new versions.
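To illustrate the regex change mentioned in the commit list above (the pattern and string here are made up for demonstration, not taken from the test harness): Python 3.11 turned the old deprecation warning for misplaced global flags into a hard error.

```python
import re

text = "first line\nTARGET appears on a later line"

# On Python 3.11+ a global flag in the middle of a pattern raises
# re.error("global flags not at the start of the expression ...");
# older versions only emitted a DeprecationWarning.
try:
    re.search(r"first.*(?s)TARGET", text)
except re.error as exc:
    print(f"rejected: {exc}")

# Fix: move the global flag to the very beginning, or pass it as an argument.
print(bool(re.search(r"(?s)first.*TARGET", text)))
print(bool(re.search(r"first.*TARGET", text, re.DOTALL)))
```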
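And a hedged sketch of the cheaper "has the binary started writing output?" check described in the SignalTester commit message above. The function name, parameters, and the size-based check are placeholders chosen for illustration, not the actual SignalTester.send_signal implementation:

```python
import os
import signal
import time

def signal_once_output_starts(outfile, pid, sig=signal.SIGUSR1,
                              poll=0.05, timeout=30.0):
    """Send `sig` to `pid` as soon as `outfile` has received any output.

    Rather than copying the file object and reading its full contents,
    just ask the OS how much has been written so far.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # The size of the file is enough to know whether the binary has
        # started writing; there is no need to read the data itself.
        if not outfile.closed and os.fstat(outfile.fileno()).st_size > 0:
            # May raise ProcessLookupError if the process already exited
            # and was reaped; callers can decide how to handle that.
            os.kill(pid, sig)
            return True
        time.sleep(poll)
    return False
```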
Force-pushed from fb9c0e4 to c92c0dc.
After rebasing on next to get the latest fixes to next, I find that I need to bump the rev-parse HEAD, as my previous commit is failing the versioner unit tests.
This is almost ready for 'Next' scrutiny... Just need some help with the exodiff. @lindsayad, would you like to take a stab at it?
wow didn't expect there to be so many!
These failures are crazy. It's like MOOSE meshing has changed with this PR, but only on ARM, even though this doesn't touch PETSc/libMesh/MOOSE code. Tests are even failing in serial, so it's not the mpich bump either.
I just looked at the gold/partial_circle_rad_in.e from the meshgenerators/circular_correction_generator.partial_curve_rad failure, and it's definitely not a robust test. The mesh is being generated via XYDelaunay, and the trouble with "generate a Delaunay triangulation of this boundary" in regression tests is that typically the result is not unique.

Looking at the bottom right corner of the mesh, where the exodiff complains, we can see nodes at (0,-0.8), (0,-0.9), (0,-1), (-0.1,-1), (-0.2,-1). The three bottom right nodes are going to form one triangle ... and then the remaining part of those nodes' convex hull is a perfectly symmetrical trapezoid. In the gold solution, the trapezoid is divided by the diagonal connecting (-0.2,-1) to (0,-0.9), and that's a Delaunay mesh, but based on the exodiff the failing solution divides that trapezoid by the diagonal connecting (-0.1,-1) to (0,-0.8), and technically that's still a success because that's also a Delaunay mesh. Yay symmetry.

I'm not sure what the best solution is here. When I created my own XYDelaunay tests I tried to add asymmetries to the domain to prevent any ambiguity from showing up in them ... but Civet is saying that two of my own tests have failed too!?
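For anyone who wants to check that symmetry argument numerically, here is a quick in-circle test (editorial sketch, not part of the PR) using the node coordinates quoted above; all four trapezoid corners turn out to lie on one circle, so either diagonal is an equally valid Delaunay choice.

```python
# Circumcircle of three of the trapezoid corners, then test the fourth.
def circumcircle(a, b, c):
    """Return the circumcenter and squared radius of the circle through a, b, c."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2

# The "perfectly symmetrical trapezoid" corner nodes quoted above.
a, b, c, p = (0.0, -0.8), (0.0, -0.9), (-0.1, -1.0), (-0.2, -1.0)
center, r2 = circumcircle(a, b, c)
dist2 = (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2

# center == (-0.15, -0.85) and r2 == dist2 == 0.025: the fourth node sits
# exactly on the circumcircle, so the Delaunay in-circle test is degenerate
# and the triangulator may legally pick either diagonal.
print(center, r2, dist2)
```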
What the hell. I just looked at the diffs on this PR. I stand by my explanation of why those tests might get knocked over by a feather, but where is the feather? There should at least be something changing the executables, the underlying malloc/free, something that could get us iterating over elements in a different order.
That's what I'm saying. The failures on this PR are insane.
Example stack traces from the parallel hangs:

process 0
frame #14: 0x000000010196cfd4 libmpifort.12.dylib`mpi_allreduce_ + 144
frame #15: 0x000000010af695b8 libpetsc.3.20.dylib`mumps_propinfo_ + 72
frame #16: 0x000000010ae96b5c libpetsc.3.20.dylib`dmumps_solve_driver_ + 15164
frame #17: 0x000000010aef86f4 libpetsc.3.20.dylib`dmumps_ + 1748
frame #18: 0x000000010aefda2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
frame #19: 0x000000010aef683c libpetsc.3.20.dylib`dmumps_c + 3552
frame #20: 0x000000010a5e3ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
frame #21: 0x000000010a60f3c4 libpetsc.3.20.dylib`MatSolve + 292
frame #22: 0x000000010a9b5f6c libpetsc.3.20.dylib`PCApply_LU + 84
frame #23: 0x000000010a99f96c libpetsc.3.20.dylib`PCApply + 204
frame #24: 0x000000010a9a15ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
frame #25: 0x000000010aba0af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
frame #26: 0x000000010ab48aac libpetsc.3.20.dylib`KSPSolve_Private + 1056
frame #27: 0x000000010ab48648 libpetsc.3.20.dylib`KSPSolve + 16
frame #28: 0x000000010abda684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
frame #29: 0x000000010ac18ff8 libpetsc.3.20.dylib`SNESSolve + 1372
process 1
frame #14: 0x0000000104cb0fd4 libmpifort.12.dylib`mpi_allreduce_ + 144
frame #15: 0x000000010e2b98b0 libpetsc.3.20.dylib`mumps_sol_rhsmapinfo_ + 160
frame #16: 0x000000010e1debdc libpetsc.3.20.dylib`dmumps_solve_driver_ + 31676
frame #17: 0x000000010e23c6f4 libpetsc.3.20.dylib`dmumps_ + 1748
frame #18: 0x000000010e241a2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
frame #19: 0x000000010e23a83c libpetsc.3.20.dylib`dmumps_c + 3552
frame #20: 0x000000010d927ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
frame #21: 0x000000010d9533c4 libpetsc.3.20.dylib`MatSolve + 292
frame #22: 0x000000010dcf9f6c libpetsc.3.20.dylib`PCApply_LU + 84
frame #23: 0x000000010dce396c libpetsc.3.20.dylib`PCApply + 204
frame #24: 0x000000010dce55ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
frame #25: 0x000000010dee4af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
frame #26: 0x000000010de8caac libpetsc.3.20.dylib`KSPSolve_Private + 1056
frame #27: 0x000000010de8c648 libpetsc.3.20.dylib`KSPSolve + 16
frame #28: 0x000000010df1e684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
frame #29: 0x000000010df5cff8 libpetsc.3.20.dylib`SNESSolve + 1372
I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine. But I still get the same exodiff that Civet is reporting. EDIT:
It sounds like most recently on your personal machine you do get the hang though right? That was with your newest mpich stack? |
Yeah, crazy! This is the first time I've seen it hang on non-INL equipment. Simple hello-world examples run to completion just fine.
Add OpenMPI MOOSE Conda Packages variant. Add Python 3.11 Closes idaholab#26839 idaholab#26566
Closing in favor of #26839, which includes (will include) all the Python 3.11 fixes.

Add Python 3.11 to our array of supported Python versions.
Closes #26560