Test failure: worker crashes during setup #86

crowogenesis · 2024-11-30T14:28:26Z

Hello,

I was following the installation guide at https://thousandbrainsproject.readme.io/docs/getting-started, and for some reason all the workers crashed.
I'm running a very basic Debian 12 environment.

maximum crashed workers reached: 8

=================================================================================== FAILURES ====================================================================================
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw1] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw1' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_run_eval_epoch'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw0] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw0' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_run_episode'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw3] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw3' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_save_and_load'
________________________________________________________________________ tests/unit/evidence_lm_test.py _________________________________________________________________________
[gw2] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw2' crashed while running 'tests/unit/evidence_lm_test.py::EvidenceLMTest::test_evidence_time_out'
_______________________________________________________________________ tests/unit/graph_learning_test.py _______________________________________________________________________
[gw4] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw4' crashed while running 'tests/unit/graph_learning_test.py::GraphLearningTest::test_reproduce_multiple_episodes'
___________________________________________________________________________ tests/unit/policy_test.py ___________________________________________________________________________
[gw5] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw5' crashed while running 'tests/unit/policy_test.py::PolicyTest::test_can_run_curv_informed_policy'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw6] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw6' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_logging_info_level'
_______________________________________________________________________ tests/unit/graph_learning_test.py _______________________________________________________________________
[gw7] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw7' crashed while running 'tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_eval_episode_with_surface_agent'
___________________________________________________________________________ tests/unit/policy_test.py ___________________________________________________________________________
[gw8] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw8' crashed while running 'tests/unit/policy_test.py::PolicyTest::test_surface_policy_moves_back_to_object'
________________________________________________________________________ tests/unit/evidence_lm_test.py _________________________________________________________________________
[gw9] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw9' crashed while running 'tests/unit/evidence_lm_test.py::EvidenceLMTest::test_moving_off_object_5lms'
=================================================================== xdist: maximum crashed workers reached: 8 ===================================================================
============================================================================ short test summary info ============================================================================
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_run_eval_epoch
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_run_episode
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_save_and_load
FAILED tests/unit/evidence_lm_test.py::EvidenceLMTest::test_evidence_time_out
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_reproduce_multiple_episodes
FAILED tests/unit/policy_test.py::PolicyTest::test_can_run_curv_informed_policy
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_logging_info_level
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_eval_episode_with_surface_agent
FAILED tests/unit/policy_test.py::PolicyTest::test_surface_policy_moves_back_to_object
FAILED tests/unit/evidence_lm_test.py::EvidenceLMTest::test_moving_off_object_5lms
======================================================================== 10 failed, 148 passed in 19.93s ========================================================================


tristanls · 2024-11-30T14:40:50Z

Hi @crowogenesis 👋.

A crashed worker likely points to a segfault in HabitatSim. If HabitatSim segfaults, it takes down Python, which in turn takes down the test worker, leaving no logs or other debugging information.

Running a single HabitatSim test should pinpoint whether this is the case:

pytest tests/unit/habitat_sim_test.py -k 'test_create_environment'

If the above fails, then to get at the logs, it may help to comment out all the other tests in tests/unit/habitat_sim_test.py and run:

python tests/unit/habitat_sim_test.py

At that point, you should see some Habitat-specific logs that may give a hint as to why HabitatSim is crashing.
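
To confirm that a run really died from a signal such as SIGSEGV rather than from an ordinary test failure, it can help to look at the exit status of the process. The following is a minimal sketch, not part of tbp.monty; the helper names `describe_exit` and `run_and_describe` are made up for illustration. It relies on the documented POSIX behavior of `subprocess`, where a negative return code means the child was terminated by that signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable description."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative return code.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd):
    """Run cmd, letting its output go to the terminal, and describe how it ended."""
    completed = subprocess.run(cmd)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Pass the command to check as arguments, for example the single-test
    # pytest invocation suggested in this thread.
    if len(sys.argv) < 2:
        sys.exit("usage: python check_exit.py <command> [args...]")
    print(run_and_describe(sys.argv[1:]))
```

Saved under a hypothetical name such as `check_exit.py`, it could be run as `python check_exit.py pytest tests/unit/habitat_sim_test.py -k test_create_environment`. A result of "killed by SIGSEGV" would support the segfault theory, while "exited with status 1" would point to a regular test failure instead.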

One workaround we have in place for some of these issues applies when running on machines without a display attached: we use xvfb. However, this workaround depends on the specific versions involved. You can see it in our GitHub Actions workflow specification:

tbp.monty/.github/workflows/monty.yml

Lines 371 to 391 in 97750b5

    
    - name: Install xvfb and mesa
      # Use X11 framebuffer and mesa opengl libraries for testing.
      # Not ideal but better than no tests.
      if: ${{ needs.should_run_monty.outputs.should_run_monty == 'true' }}
      run: |
        sudo apt update -y
        sudo apt install -y --no-install-recommends xvfb libegl1-mesa-dev
    - name: Run python tests
      # Using 8 CPUs corresponding to the large GithHub-hosted runner available in the "tbp" runner group.
      # See: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners#specifications-for-general-larger-runners
      #
      # NOTE: LD_PRELOAD is needed to provide what is discussed in https://stackoverflow.com/questions/71010343/cannot-load-swrast-and-iris-drivers-in-fedora-35/72200748#72200748
      if: ${{ needs.should_run_monty.outputs.should_run_monty == 'true' }}
      working-directory: tbp.monty
      run: |
        export PATH="$HOME/miniconda/bin:$PATH"
        source activate tbp.monty
        set -e
        mkdir -p test_results/pytest
        export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
        xvfb-run pytest -n 8 --cov --cov-report html --cov-report term --junitxml=test_results/pytest/results.xml --verbose
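
For local runs, the "use xvfb only when no display is attached" idea from the workflow can be sketched as a small wrapper. This is an illustration under assumptions, not something the project ships: the function name `wrap_for_headless` is hypothetical, and checking the `DISPLAY` environment variable is only a rough heuristic for whether an X display is available.

```python
import os
import shutil


def wrap_for_headless(cmd, environ=None, which=shutil.which):
    """Prefix cmd with xvfb-run when no X display appears to be available.

    environ and which are injectable so the decision can be checked without
    touching the real environment. The command is returned unchanged when
    DISPLAY is set or when xvfb-run is not installed.
    """
    environ = os.environ if environ is None else environ
    if environ.get("DISPLAY"):
        # A display is configured, so run the command as-is.
        return list(cmd)
    if which("xvfb-run") is None:
        # Assumption: without xvfb-run there is nothing to wrap with.
        return list(cmd)
    return ["xvfb-run", *cmd]
```

The returned list can be handed to `subprocess.run`. On a headless Debian machine with xvfb installed, `wrap_for_headless(["pytest", "-n", "8"])` would yield the same `xvfb-run pytest -n 8` shape the workflow uses.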


crowogenesis · 2024-11-30T16:07:16Z

I don't know if this is good practice in general, but I'll leave this here anyway: setting LD_PRELOAD after activating the conda environment fixed it for now. Thanks a ton, @tristanls!

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
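
Exporting `LD_PRELOAD` in the shell affects every later command in that session. A narrower option, sketched here as an assumption rather than a recommended fix, is to set it only for the test process. The helper names `env_with_preload` and `run_with_preload` are hypothetical; the library path is the one from the comment above and is specific to x86_64 Debian/Ubuntu layouts.

```python
import os
import subprocess
from pathlib import Path

# Path taken from the comment above; other distributions may place it elsewhere.
SYSTEM_LIBSTDCXX = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"


def env_with_preload(base_env, lib_path=SYSTEM_LIBSTDCXX):
    """Return a copy of base_env with lib_path prepended to LD_PRELOAD.

    Raises FileNotFoundError if lib_path is not an existing file, so a wrong
    path is caught here instead of producing loader warnings later.
    """
    if not Path(lib_path).is_file():
        raise FileNotFoundError(lib_path)
    env = dict(base_env)
    # The dynamic loader accepts entries separated by spaces or colons.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    if lib_path not in entries:
        entries.insert(0, lib_path)
    env["LD_PRELOAD"] = ":".join(entries)
    return env


def run_with_preload(cmd, lib_path=SYSTEM_LIBSTDCXX):
    """Run cmd with LD_PRELOAD set for that process only; return its exit code."""
    return subprocess.run(cmd, env=env_with_preload(os.environ, lib_path)).returncode
```

For example, `run_with_preload(["pytest", "tests/unit"])` would run the unit tests with the system libstdc++ preloaded while leaving the calling shell's environment untouched.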

OcWebb · 2024-12-01T03:25:20Z

I have been running into the same issue on a headless Debian 12 install. I also had about 30 errors. I added - aihabitat::headless to the end of the environment.yaml dependencies, and this resolved all but two of the errors.

=================================== short test summary info ====================================
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_5lm_feature_experiment - Ke...
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_disp_experiment
=================== 2 failed, 316 passed, 468 warnings in 394.36s (0:06:34) ====================

Upon re-running the tests in that file, it seems test_can_run_disp_experiment is resolved, leaving just test_5lm_feature_experiment failing/segfaulting. I will dig into this test further tomorrow; I just figured I'd share in case it's helpful.

crowogenesis added the "bug" label (Something isn't working) · Nov 30, 2024

tristanls added the "triaged" label (This issue or pull request was triaged) · Nov 30, 2024
