Test failure: worker crashes during setup #86

Open
crowogenesis opened this issue Nov 30, 2024 · 3 comments
Labels
bug (Something isn't working), triaged (This issue or pull request was triaged)

Comments

@crowogenesis

crowogenesis commented Nov 30, 2024

Hello,

I was following the installation guide at https://thousandbrainsproject.readme.io/docs/getting-started, and for some reason all the workers crashed.
I'm running a very basic Debian 12 environment.

maximum crashed workers reached: 8

=================================================================================== FAILURES ====================================================================================
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw1] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw1' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_run_eval_epoch'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw0] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw0' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_run_episode'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw3] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw3' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_can_save_and_load'
________________________________________________________________________ tests/unit/evidence_lm_test.py _________________________________________________________________________
[gw2] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw2' crashed while running 'tests/unit/evidence_lm_test.py::EvidenceLMTest::test_evidence_time_out'
_______________________________________________________________________ tests/unit/graph_learning_test.py _______________________________________________________________________
[gw4] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw4' crashed while running 'tests/unit/graph_learning_test.py::GraphLearningTest::test_reproduce_multiple_episodes'
___________________________________________________________________________ tests/unit/policy_test.py ___________________________________________________________________________
[gw5] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw5' crashed while running 'tests/unit/policy_test.py::PolicyTest::test_can_run_curv_informed_policy'
________________________________________________________________________ tests/unit/base_config_test.py _________________________________________________________________________
[gw6] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw6' crashed while running 'tests/unit/base_config_test.py::BaseConfigTest::test_logging_info_level'
_______________________________________________________________________ tests/unit/graph_learning_test.py _______________________________________________________________________
[gw7] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw7' crashed while running 'tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_eval_episode_with_surface_agent'
___________________________________________________________________________ tests/unit/policy_test.py ___________________________________________________________________________
[gw8] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw8' crashed while running 'tests/unit/policy_test.py::PolicyTest::test_surface_policy_moves_back_to_object'
________________________________________________________________________ tests/unit/evidence_lm_test.py _________________________________________________________________________
[gw9] linux -- Python 3.8.20 /home/defaultuser/miniconda3/envs/tbp.monty/bin/python
worker 'gw9' crashed while running 'tests/unit/evidence_lm_test.py::EvidenceLMTest::test_moving_off_object_5lms'
=================================================================== xdist: maximum crashed workers reached: 8 ===================================================================
============================================================================ short test summary info ============================================================================
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_run_eval_epoch
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_run_episode
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_can_save_and_load
FAILED tests/unit/evidence_lm_test.py::EvidenceLMTest::test_evidence_time_out
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_reproduce_multiple_episodes
FAILED tests/unit/policy_test.py::PolicyTest::test_can_run_curv_informed_policy
FAILED tests/unit/base_config_test.py::BaseConfigTest::test_logging_info_level
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_eval_episode_with_surface_agent
FAILED tests/unit/policy_test.py::PolicyTest::test_surface_policy_moves_back_to_object
FAILED tests/unit/evidence_lm_test.py::EvidenceLMTest::test_moving_off_object_5lms
======================================================================== 10 failed, 148 passed in 19.93s ========================================================================
@crowogenesis crowogenesis added the bug Something isn't working label Nov 30, 2024
@tristanls
Contributor

Hi @crowogenesis 👋.

A worker crash likely points to a segfault in HabitatSim. When HabitatSim segfaults, it takes down the Python process and, with it, the test worker, which leaves no logs or other debugging information.

Running a single HabitatSim test should pinpoint whether this is the case:

pytest tests/unit/habitat_sim_test.py -k 'test_create_environment'

If the above fails, then to get at the logs, it may help to comment out all the other tests in tests/unit/habitat_sim_test.py and run:

python tests/unit/habitat_sim_test.py

At that point, you should see some Habitat-specific logs that may give a hint as to why HabitatSim is crashing.
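
A complementary way to check the segfault hypothesis is to run the test in a child process and inspect how that process ended. The following is a minimal sketch, assuming a POSIX system where a negative return code means the child was killed by a signal. The helper names `describe_exit` and `run_isolated` are hypothetical and not part of tbp.monty. `-X faulthandler` is a standard CPython option that dumps the Python traceback when a fatal signal such as SIGSEGV arrives.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code " + str(returncode)


def run_isolated(pytest_args):
    """Run pytest in a child interpreter with faulthandler enabled."""
    cmd = [sys.executable, "-X", "faulthandler", "-m", "pytest"] + list(pytest_args)
    return describe_exit(subprocess.run(cmd).returncode)
```

For example, `run_isolated(["tests/unit/habitat_sim_test.py", "-k", "test_create_environment"])` would be expected to report `killed by SIGSEGV` if the simulator is segfaulting, and an ordinary exit code otherwise.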

One workaround we have in place for machines without a display attached is to run the tests under xvfb. However, this workaround depends on the specific versions involved. You can see it in our GitHub Actions workflow specification:

- name: Install xvfb and mesa
  # Use X11 framebuffer and mesa opengl libraries for testing.
  # Not ideal but better than no tests.
  if: ${{ needs.should_run_monty.outputs.should_run_monty == 'true' }}
  run: |
    sudo apt update -y
    sudo apt install -y --no-install-recommends xvfb libegl1-mesa-dev
- name: Run python tests
  # Using 8 CPUs corresponding to the large GitHub-hosted runner available in the "tbp" runner group.
  # See: https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/about-larger-runners#specifications-for-general-larger-runners
  #
  # NOTE: LD_PRELOAD is needed to provide what is discussed in https://stackoverflow.com/questions/71010343/cannot-load-swrast-and-iris-drivers-in-fedora-35/72200748#72200748
  if: ${{ needs.should_run_monty.outputs.should_run_monty == 'true' }}
  working-directory: tbp.monty
  run: |
    export PATH="$HOME/miniconda/bin:$PATH"
    source activate tbp.monty
    set -e
    mkdir -p test_results/pytest
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
    xvfb-run pytest -n 8 --cov --cov-report html --cov-report term --junitxml=test_results/pytest/results.xml --verbose
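
To reproduce the CI invocation locally, the choice of whether to wrap pytest in `xvfb-run` could be sketched as follows. This assumes that an unset or empty `DISPLAY` variable means no display is attached, which is a heuristic and not something the workflow itself checks. `build_test_command` is a hypothetical helper, not part of tbp.monty.

```python
import os
import shutil


def build_test_command(pytest_args, environ=None, which=shutil.which):
    """Prefix pytest with xvfb-run when no X display appears to be available."""
    environ = os.environ if environ is None else environ
    cmd = ["pytest"] + list(pytest_args)
    # Assumption: an empty or missing DISPLAY means the machine is headless.
    # xvfb-run is only used when it is actually installed and on PATH.
    if not environ.get("DISPLAY") and which("xvfb-run"):
        cmd = ["xvfb-run"] + cmd
    return cmd
```

The `environ` and `which` parameters exist only so the decision can be exercised without touching the real environment; in normal use, `build_test_command(["-n", "8"])` is enough.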

@tristanls tristanls added the triaged This issue or pull request was triaged label Nov 30, 2024
@crowogenesis
Author

crowogenesis commented Nov 30, 2024

I don't know if this is good practice in general, but I'll leave it here anyway: setting LD_PRELOAD after activating the conda environment fixed it for now. Thanks a ton, @tristanls!

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6
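
A slightly more defensive variant of the same fix sets `LD_PRELOAD` only when the library is actually present and keeps any value that is already set. This is a sketch, not a tbp.monty utility: `with_preload` is a hypothetical name, and the path is the Debian x86_64 location quoted above, which may differ on other distributions or architectures.

```python
import os

# Path taken from the thread; assumed to be valid only on Debian-like x86_64 systems.
LIBSTDCXX = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"


def with_preload(environ, lib=LIBSTDCXX, exists=os.path.exists):
    """Return a copy of environ with lib prepended to LD_PRELOAD if it exists."""
    env = dict(environ)
    if not exists(lib):
        # Leave the environment untouched rather than preload a missing file.
        return env
    current = env.get("LD_PRELOAD", "")
    # LD_PRELOAD entries may be separated by colons or spaces.
    entries = current.replace(" ", ":").split(":")
    if lib not in entries:
        env["LD_PRELOAD"] = lib + (":" + current if current else "")
    return env
```

The result can be passed as `env=` to `subprocess.run`, so the preload applies only to the test process rather than to the whole shell session.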

@OcWebb

OcWebb commented Dec 1, 2024

I have been running into the same issue on a headless Debian 12 install. I also had about 30 errors. Adding - aihabitat::headless to the end of the dependencies in environment.yaml resolved all but two of them.

=================================== short test summary info ====================================
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_5lm_feature_experiment - Ke...
FAILED tests/unit/graph_learning_test.py::GraphLearningTest::test_can_run_disp_experiment
=================== 2 failed, 316 passed, 468 warnings in 394.36s (0:06:34) ====================

Upon re-running the tests in that file, test_can_run_disp_experiment appears to be resolved, leaving just test_5lm_feature_experiment failing/segfaulting. I will dig into this test further tomorrow; I just figured I'd share in case it's helpful.
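
Re-running only the failures, one per process, can help separate flaky tests from ones that segfault consistently. As a sketch, the node IDs can be pulled out of a short test summary like the one above. `failed_test_ids` is a hypothetical helper, and it assumes node IDs contain no whitespace, which may not hold for some parametrized tests.

```python
import re

# Matches lines such as "FAILED path/to/test.py::Class::test_name - Ke..."
FAILED_LINE = re.compile(r"^FAILED\s+(\S+)")


def failed_test_ids(summary_text):
    """Extract pytest node IDs from the FAILED lines of a short test summary."""
    ids = []
    for line in summary_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            # Anything after the node ID (e.g. a truncated error message) is ignored.
            ids.append(match.group(1))
    return ids
```

Each returned ID can then be passed to pytest on its own. pytest's built-in `--last-failed` option covers the common case without any parsing, though it re-runs the failures together rather than in separate processes.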
