
Conversation

@milljm (Contributor) commented Jan 16, 2024

Add Python 3.11 to our array of supported Python versions.

Closes #26560

@moosebuild (Contributor) commented Jan 17, 2024

Job Documentation on 1aa97f5 wanted to post the following:

View the site here

This comment will be updated on new commits.

@moosebuild (Contributor) commented Jan 17, 2024

Job Coverage on 1aa97f5 wanted to post the following:

Framework coverage: coverage did not change
Modules coverage: coverage did not change
Full coverage reports: Reports

This comment will be updated on new commits.

@moosebuild (Contributor)

Job Apptainer build GCC min on 5433882: invalidated by @milljm

Error during download/extract/detection of FBLASLAPACK

@milljm (Contributor Author) commented Jan 17, 2024

There is a lot to fix in order to make Python 3.11 work...

@milljm (Contributor Author) commented Jan 17, 2024

On Linux, there does not appear to be a solution combining Python 3.11, hdf5=1.12.x=mpi_mpich*, and VTK:

# this is a conflict
conda create -n testing python=3.11 vtk hdf5=1.12.1=mpi_mpich_*

If we bump HDF5 to 1.14, there is a solution. But historically doing so takes a lot of work.
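A quick way to let the solver confirm this without actually building the environment (assuming our usual channel configuration) is a dry-run:

# hypothetical probe: with hdf5 1.14 the solver reports a solution
conda create -n testing --dry-run python=3.11 vtk "hdf5=1.14.*=mpi_mpich_*"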

Edit:
Gross. Moving to hdf5 1.14 requires that we also move to mpich 4.1.2, which we know causes issues.

@loganharbour loganharbour changed the title 26560 add python311 Add python 3.11 support Jan 17, 2024
- {{ moose_libclang13 }}
- {{ moose_ld64 }} # [osx]
- {{ moose_libclang }} # [osx]
- {{ moose_libclang13 }} # [osx]
@milljm (Contributor Author) Jan 18, 2024

really not sure how we've been successful while this was in place (look at lines 42 and 43 above!)

Member:

How does this libclang versioning work? Previously we were using clang 12 but linking libclang13, and now clang 15 is still linked against libclang13?

- {{ moose_libgfortran }}
- {{ moose_libgfortran5 }}
- {{ moose_hdf5 }}
- hdf5=1.14.3=mpi*
@milljm (Contributor Author)

This is now necessary. conda_build_config.yaml does not support variant wildcards (as far as I know).
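A rough sketch of the distinction (illustrative contents, not the actual MOOSE recipe files): a conda_build_config.yaml variant value can pin a version but, as far as we know, cannot carry a build-string wildcard, which is why the hdf5=1.14.3=mpi* pin above is written directly in meta.yaml.

# conda_build_config.yaml (illustrative): version pins only
moose_hdf5:
  - hdf5 1.14.3   # no way to also express the "=mpi*" build-string variant here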

@lindsayad (Member)
I'm curious whether you run into this failure. It has always failed for me with my system 3.11 python

misc/signal_handler.test_signal: Working Directory: /home/lindad/projects/PR_user_training/test/tests/misc/signal_handler
misc/signal_handler.test_signal: Running command: /home/lindad/projects/PR_user_training/test/moose_test-opt -i simple_transient_diffusion_scaled.i --error --error-override --no-gdb-backtrace
misc/signal_handler.test_signal: Python exception encountered:
misc/signal_handler.test_signal: 
misc/signal_handler.test_signal: Traceback (most recent call last):
misc/signal_handler.test_signal:   File "/home/lindad/projects/PR_user_training/python/TestHarness/schedulers/RunParallel.py", line 54, in run
misc/signal_handler.test_signal:     job.run()
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/schedulers/Job.py", line 231, in run
misc/signal_handler.test_signal:     self.__tester.run(self.timer, self.options)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/Tester.py", line 420, in run
misc/signal_handler.test_signal:     self.runCommand(timer, options)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/SignalTester.py", line 70, in runCommand
misc/signal_handler.test_signal:     self.send_signal(self.process.pid)
misc/signal_handler.test_signal:   File "/home/lindad/projects/moose/python/TestHarness/testers/SignalTester.py", line 43, in send_signal
misc/signal_handler.test_signal:     out_dupe.seek(0)
misc/signal_handler.test_signal:   File "/usr/lib/python3.11/tempfile.py", line 947, in seek
misc/signal_handler.test_signal:     return self._file.seek(*args)
misc/signal_handler.test_signal:            ^^^^^^^^^^^^^^^^^^^^^^
misc/signal_handler.test_signal: ValueError: seek of closed file
misc/signal_handler.test_signal: 

@milljm (Contributor Author) commented Jan 18, 2024

I'm curious whether you run into this failure. It has always failed for me with my system 3.11 python

Yes, this PR certainly does run into this... https://civet.inl.gov/job/2004305/
Just one of many issues I will have to tackle!

@lindsayad (Member)

I'm glad it's not just my environment 😅

#go to the beginning of the file and see if its actually started running the binary
out_dupe.seek(0)
output = out_dupe.read()
if not out_dupe.closed:
@milljm (Contributor Author)

I am very concerned that this works. I would expect the inability to read the file to cause this test to fail, but it does not, so I don't understand what is going on here.

Contributor:

Full disclosure: I worked on patching a bug with signal checkpoints, but didn't touch this send_signal function and don't think I fully understand it. That being said, shouldn't the time.sleep, os.kill, and break lines below be indented to be under the if not out_dupe.closed check? My assumption is that if out_dupe is already closed, then the moose binary has finished running. In that case, os.kill to a nonexistent PID should raise a ProcessLookupError. If this isn't happening, and the tests are always passing, my guess is that this code is always executed when the binary is running and out_dupe is not closed. I will poke around at this a bit more.
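A standalone sketch of the flow being suggested here (hypothetical helper with assumed names; not the actual SignalTester.send_signal, and it still races if the file is closed between the check and the seek):

import os
import time

def send_signal_when_running(out_dupe, pid, sig, poll=0.05):
    # Poll the duplicated output file; only sleep/kill/break while it is
    # still open and the binary has started writing output.
    while not out_dupe.closed:
        out_dupe.seek(0)
        if out_dupe.read():
            time.sleep(poll)       # let the solve get underway
            os.kill(pid, sig)      # ProcessLookupError if the PID is gone
            return True
        time.sleep(poll)
    return False                   # file closed before any output appeared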

@pbehne (Contributor) Jan 23, 2024

Attachment: pid_example.txt

I have simplified the logic into the attached script. The 'test' is running echo test ; sleep 1. When the script is run with line 37 commented and line 38 uncommented (i.e., time.sleep(5)), os.kill is still able to send a signal to the pid even though the process should have already terminated (the timer gives a wait time > 5 s). However, when the script is run with line 37 uncommented and line 38 commented (i.e., os.waitpid), os.kill is not able to send the signal because the process has already terminated. However, the timer gives a wait time on the order of 1 s, so it seems reasonable that using time.sleep(5) without waitpid would give the subprocess plenty of time to terminate. I am not sure how to explain this behavior. Are you, @permcody?

Member:

I myself don't know how to explain anything. All I know is that ever since I went to Python 3.11 on my machine this test has failed every time I run run_tests, and I'll be really happy when it doesn't fail anymore.

Member:

This may have to do with reaping an exited process. When a process terminates it doesn't actually end; it hangs around until the parent process closes it out. Until that occurs, the "dead" process is left in a "zombie" state. We can take a look at this tomorrow and get to the bottom of it.
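A self-contained, POSIX-only demo of that reaping behavior (an approximation of what pid_example.txt explores; details assumed):

import os
import signal
import subprocess
import time

proc = subprocess.Popen("echo test ; sleep 1", shell=True)

time.sleep(5)                         # child exited long ago, but is not reaped
try:
    os.kill(proc.pid, signal.SIGTERM)
    print("kill succeeded: the exited child is still a zombie")
except ProcessLookupError:
    print("kill failed: PID already gone")

os.waitpid(proc.pid, 0)               # reap the zombie
try:
    os.kill(proc.pid, signal.SIGTERM)
except ProcessLookupError:
    print("after waitpid the PID no longer exists: ProcessLookupError")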

@milljm (Contributor Author)

Well the test just failed on Intel Mac (https://civet.inl.gov/job/2016334/). It usually passes.

misc/signal_handler.test_signal_parallel: Working Directory: /opt/civet1/build/moose/test/tests/misc/signal_handler
misc/signal_handler.test_signal_parallel: Running command: mpiexec -n 3 /opt/civet1/build/moose/test/moose_test-opt -i simple_transient_diffusion_scaled.i Executioner/num_steps=70 --error --error-override --timing Outputs/perf_graph=true --no-gdb-backtrace
misc/signal_handler.test_signal_parallel: 
misc/signal_handler.test_signal_parallel: 
misc/signal_handler.test_signal_parallel: Exit Code: 30
misc/signal_handler.test_signal_parallel: ################################################################################
misc/signal_handler.test_signal_parallel: Tester failed, reason: CRASH
misc/signal_handler.test_signal_parallel: 
[1.090s]     FAIL misc/signal_handler.test_signal_parallel FAILED (CRASH) [min_cpus=3]

@permcody (Member) commented Jan 24, 2024 via email

@@ -1 +1 @@
Subproject commit 7c03afd53c9c7fd7fcb6e3e20bae37cf1b124986
Subproject commit 31948b018e9bea83c138035e952d48065458ba4a
Contributor:

Caution! This contains a submodule update


@milljm force-pushed the 26560-add-python311 branch from b7a32e3 to 2d9b88f on January 29, 2024 at 14:07
@milljm (Contributor Author) commented Jan 29, 2024

That was fun. I rebased against my origin, forgetting that I had included a merge from Casey's origin in the mix (his PETSc update), and also forgetting that Casey's branch had already been merged into next.

So... I redid it all, rebasing against upstream/next properly.

@milljm force-pushed the 26560-add-python311 branch from 307c96c to 8032089 on January 29, 2024 at 16:59
@milljm (Contributor Author) commented Jan 29, 2024

Another git adventure brought to you by milljm.

Thank you @cticenhour for helping me sort it out!

@loganharbour (Member)
Another git adventure brought to you by milljm.

Thank you @cticenhour for helping me sort it out!

[lizard-hehe GIF]

Jason M Miller and others added 12 commits January 31, 2024 06:30
Because of Zscaler, the recipe does download VTK. Need to push so
I can pull onto another machine... so lame.
Having a pin for libclang only for OSX targets was a mistake.

Remove petsc `make check` for now...

Add CURL_CA_BUNDLE
source:
  url: paths use `curl`, and therefore we need to pass CA certs.
And here are the Mac Intel dependency pins...
During configure of libMesh package, we need zlib
Add logic which allows the method to continue if the file(s) are not closed.
Python 3.11 now requires you to place your global regex flags at the beginning of the regular expression (see the snippet below).
As Casey joked:

"Don't use MPI, ESPECIALLY don't use MPI.......BUT maybe here, sure, use MPI."

I love it.
The main part of this commit was reworking the logic in SignalTester.send_signal.
Namely, the check to see if the moose binary has started running and writing to the
output file was modified from an approach that copied and read the output file to
a much cheaper approach that simply queried the current position in the output file.

Additionally, the 'signal' parameter previously did not do anything. This has been patched.

Finally, the 'sleep_time' parameter has been removed because it was not deemed necessary to
the testing process (there should be no such thing as a signal being sent too early after the
binary has started to run; the checkpoint will simply be output at the end of the next time step).
Update documentation template configuration with new versions.
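For reference, the regex-flag change mentioned in the commits above comes down to this (minimal illustration):

import re

# Deprecated since Python 3.6; on 3.11 this raises
# re.error: global flags not at the start of the expression
try:
    re.compile(r"foo(?i)bar")
except re.error as err:
    print(err)

re.compile(r"(?i)foobar")    # move the global flag to the beginning
re.compile(r"foo(?i:bar)")   # or use a scoped inline flag instead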
@milljm force-pushed the 26560-add-python311 branch from fb9c0e4 to c92c0dc on January 31, 2024 at 13:36
After rebasing on Next to get the latest stuff to fix Next, I find
that I need to bump rev-parse head, as my previous commit is failing
versioner unittests.
@moosebuild (Contributor)
Job Precheck on 1aa97f5 wanted to post the following:

The following file(s) are changed:

conda/libmesh/meta.yaml

The libmesh conda configuration has changed; ping @idaholab/moose-dev-ops

@milljm (Contributor Author) commented Feb 14, 2024

This is almost ready for 'next' scrutiny... I just need some help with the exodiff. @lindsayad, would you like to take a stab at it?

@lindsayad (Member)
Wow, didn't expect there to be so many!

@lindsayad (Member)
These failures are crazy. It's like MOOSE meshing has changed with this PR, but only on ARM, even though this doesn't touch PETSc/libMesh/MOOSE code. Tests are even failing in serial, so it's not the mpich bump either.

@roystgnr (Contributor)
I just looked at the gold/partial_circle_rad_in.e from the meshgenerators/circular_correction_generator.partial_curve_rad failure, and it's definitely not a robust test. The mesh is being generated via XYDelaunay, and the trouble with "generate a Delaunay triangulation of this boundary" in regression tests is that typically the result is not unique.

Looking at the bottom right corner of the mesh, where the exodiff complains, we can see nodes at (0,-0.8), (0,-0.9), (0,-1), (-0.1,-1), (-0.2,-1). The three bottom right nodes are going to form one triangle ... and then the remaining part of those nodes' convex hull is a perfectly symmetrical trapezoid. In the gold solution, the trapezoid is divided by the diagonal connecting (-0.2,-1) to (0,-0.9), and that's a Delaunay mesh, but based on the exodiff the failing solution divides that trapezoid by the diagonal connecting (-0.1,-1) to (0,-0.8), and technically that's still a success because that's also a Delaunay mesh. Yay symmetry.
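To make the symmetry concrete, a quick standalone check (not part of any test suite) shows those four trapezoid corners are exactly cocircular, so the empty-circumcircle criterion cannot prefer either diagonal:

# Standard incircle determinant: zero when d lies exactly on the circumcircle
# of triangle (a, b, c); the sign (for a counter-clockwise triangle) says
# whether d is inside or outside.
def incircle(a, b, c, d):
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    return ((ax * ax + ay * ay) * (bx * cy - by * cx)
            - (bx * bx + by * by) * (ax * cy - ay * cx)
            + (cx * cx + cy * cy) * (ax * by - ay * bx))

A, B, C, D = (0.0, -0.8), (0.0, -0.9), (-0.1, -1.0), (-0.2, -1.0)

# The gold file splits the trapezoid with diagonal B-D, the failing run with
# A-C; in both cases the fourth node sits on the circumcircle of the triangle.
print(incircle(A, B, D, C))   # ~0 up to floating-point roundoff
print(incircle(A, C, D, B))   # ~0 as well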

I'm not sure what the best solution is here. When I created my own XYDelaunay tests I tried to add asymmetries to the domain to prevent any ambiguity from showing up in them ... but Civet is saying that two of my own tests have failed too!?

@roystgnr (Contributor)
What the hell. I just looked at the diffs on this PR. I stand by my explanation of why those tests might get knocked over by a feather, but where is the feather? There should at least be something changing the executables, the underlying malloc/free, something that could get us iterating over elements in a different order.

@lindsayad (Member) commented Feb 15, 2024

That's what I'm saying. The failures on this PR are insane

@lindsayad (Member)
example stack traces from the parallel hangs:

process 0
    frame #14: 0x000000010196cfd4 libmpifort.12.dylib`mpi_allreduce_ + 144
    frame #15: 0x000000010af695b8 libpetsc.3.20.dylib`mumps_propinfo_ + 72
    frame #16: 0x000000010ae96b5c libpetsc.3.20.dylib`dmumps_solve_driver_ + 15164
    frame #17: 0x000000010aef86f4 libpetsc.3.20.dylib`dmumps_ + 1748
    frame #18: 0x000000010aefda2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
    frame #19: 0x000000010aef683c libpetsc.3.20.dylib`dmumps_c + 3552
    frame #20: 0x000000010a5e3ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
    frame #21: 0x000000010a60f3c4 libpetsc.3.20.dylib`MatSolve + 292
    frame #22: 0x000000010a9b5f6c libpetsc.3.20.dylib`PCApply_LU + 84
    frame #23: 0x000000010a99f96c libpetsc.3.20.dylib`PCApply + 204
    frame #24: 0x000000010a9a15ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
    frame #25: 0x000000010aba0af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
    frame #26: 0x000000010ab48aac libpetsc.3.20.dylib`KSPSolve_Private + 1056
    frame #27: 0x000000010ab48648 libpetsc.3.20.dylib`KSPSolve + 16
    frame #28: 0x000000010abda684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
    frame #29: 0x000000010ac18ff8 libpetsc.3.20.dylib`SNESSolve + 1372
process 1
    frame #14: 0x0000000104cb0fd4 libmpifort.12.dylib`mpi_allreduce_ + 144
    frame #15: 0x000000010e2b98b0 libpetsc.3.20.dylib`mumps_sol_rhsmapinfo_ + 160
    frame #16: 0x000000010e1debdc libpetsc.3.20.dylib`dmumps_solve_driver_ + 31676
    frame #17: 0x000000010e23c6f4 libpetsc.3.20.dylib`dmumps_ + 1748
    frame #18: 0x000000010e241a2c libpetsc.3.20.dylib`dmumps_f77_ + 6892
    frame #19: 0x000000010e23a83c libpetsc.3.20.dylib`dmumps_c + 3552
    frame #20: 0x000000010d927ff4 libpetsc.3.20.dylib`MatSolve_MUMPS + 688
    frame #21: 0x000000010d9533c4 libpetsc.3.20.dylib`MatSolve + 292
    frame #22: 0x000000010dcf9f6c libpetsc.3.20.dylib`PCApply_LU + 84
    frame #23: 0x000000010dce396c libpetsc.3.20.dylib`PCApply + 204
    frame #24: 0x000000010dce55ec libpetsc.3.20.dylib`PCApplyBAorAB + 632
    frame #25: 0x000000010dee4af0 libpetsc.3.20.dylib`KSPSolve_GMRES + 1112
    frame #26: 0x000000010de8caac libpetsc.3.20.dylib`KSPSolve_Private + 1056
    frame #27: 0x000000010de8c648 libpetsc.3.20.dylib`KSPSolve + 16
    frame #28: 0x000000010df1e684 libpetsc.3.20.dylib`SNESSolve_NEWTONLS + 1316
    frame #29: 0x000000010df5cff8 libpetsc.3.20.dylib`SNESSolve + 1372

@milljm (Contributor Author) commented Feb 15, 2024

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine. But I still get the same exodiff that Civet is reporting.

EDIT:
Just saying this so you don't spin your wheels too much on the hangs (unless you want to!). I have INL's IM staff looking into the hangs.

@lindsayad (Member) commented Feb 16, 2024

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine.

It sounds like most recently on your personal machine you do get the hang though, right? That was with your newest mpich stack?

@milljm (Contributor Author) commented Feb 16, 2024

I don't know much about troubleshooting the hangs, but what I can tell you is that non-INL-profiled machines (no JAMF) seem not to hang; my personal Apple Silicon machine is fine.

It sounds like most recently on your personal machine you do get the hang though, right? That was with your newest mpich stack?

Yeah, crazy! This is the first I've seen it hang on non-INL equipment. Simple hello-world examples run to completion just fine.

milljm pushed a commit to milljm/moose that referenced this pull request Feb 26, 2024
Add OpenMPI MOOSE Conda Packages variant.
Add Python 3.11

Closes idaholab#26839 idaholab#26566
milljm pushed a commit to milljm/moose that referenced this pull request Mar 5, 2024
Add OpenMPI MOOSE Conda Packages variant.
Add Python 3.11

Closes idaholab#26839 idaholab#26566
@milljm (Contributor Author) commented May 1, 2024

Closing in favor of #26839, which includes (will include) all the Python 3.11 fixes.

@milljm closed this May 1, 2024

Development

Successfully merging this pull request may close these issues: Add Python 3.11 support.

8 participants