pexpect.exceptions.TIMEOUT due to child.expect(pexpect.EOF, timeout=5) #1343
Comments
Yep - I'm seeing the same problem. How was the magic number of 5 seconds arrived at?
Looking a little deeper at this, it seems that initialisation of the pexpect child uses a configurable timeout. Should the same be done on the new flush instruction, to prevent a Timeout exception being raised? https://github.com/ansible/ansible-runner/blob/devel/src/ansible_runner/runner.py#L350
The configurable timeout is only there to control how long it'll hang out looking for a prompt to respond to before rolling to the next loop iteration (and allowing other timeouts to be evaluated). The new final stream pump is only there to ensure we don't lose data that pexpect has already buffered from the already-exited child process (previously, any data that pexpect buffered between the last poll and reporting the child exit was dropped on the floor).

5s is already a relative eternity to wait for buffered data in that situation - the initial bug was 100% fixed in all our reproducers by setting the timeout to 0.01s. If it's still not there after 5s, it's likely either a bug in pexpect or the underlying platform stdio stream handling (I've personally encountered both) - waiting longer is unlikely to change anything. It'd probably make more sense to just eat the exception from that final buffer scrape and issue a warning that data could've been lost.

I'm curious how often you're encountering this, and if it's specific to any particular distro/container runtime/whatever.
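To illustrate that "eat the exception and warn" idea (this is not the actual ansible-runner code, just a minimal sketch with a hypothetical drain_final_output helper wrapped around the final pexpect read):

import logging
import pexpect

logger = logging.getLogger(__name__)

def drain_final_output(child, timeout=5):
    # Hypothetical helper: pump whatever pexpect has already buffered from the
    # exited child, but don't fail the whole run if EOF never shows up in time.
    try:
        child.expect(pexpect.EOF, timeout=timeout)
    except pexpect.exceptions.TIMEOUT:
        logger.warning("Timed out draining final child output; trailing data may have been lost.")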
@tchernomax If you've got a repro for a job that always fails, I'd love to see it - IIUC even a 100% empty stream should still yield EOF, but there are a lot of moving parts in there between the OS/console stream layers, Python, and pexpect, and this is definitely one of those areas where different kernels/distros/OSs/versions don't all behave in exactly the same way.
Thanks for replying. FWIW, this is in a container based on Alpine 3.18 with Python 3.11.4. The host OS is RHEL 8.
I'm having to spin-up on
I've got a repro for this issue - no container required. A combination of the number of hosts in the inventory (some of which are unreachable) and the time spent in the event handler triggers it.
Code
import ansible_runner
from time import sleep

def my_cancel_handler():
    print("CANCELLATION")
    return False

def my_artifacts_handler(artifacts_dir):
    print(artifacts_dir)

def my_status_handler(data, runner_config):
    print(f"STATUS - config: {runner_config} data: {data}")
def my_event_handler(data):
    print(f"EVENT - {data}")
    sleep(2)
    return True

def my_finished_handler(runner):
    print("FINISH")
    print(f"\tSTATS: {runner.stats}")

r = ansible_runner.run(
    private_data_dir="./private_data_dir",
    playbook="playbook.yml",
    cancel_callback=my_cancel_handler,
    artifacts_handler=my_artifacts_handler,
    status_handler=my_status_handler,
    event_handler=my_event_handler,
    finished_callback=my_finished_handler,
    settings={"suppress_ansible_output": False},
    timeout=1200,
    ignore_logging=False,
    debug=True)
Inventory
Playbook
---
- name: breaker
  hosts: all
  gather_facts: no
  tasks:
    - name: Ping stuff
      ansible.builtin.ping:
Error
Aha, a slow synchronous event handler blocking the stream consumption... Thanks for the repro - that clarified the issue nicely!

We're generally only using really lightweight event handlers that unconditionally persist to disk for later analysis, so even with the rare case of a lot of data in that final buffer, the overhead is pretty negligible. This is one of those unfortunate problems with having pexpect wrapped up in the stream handling even when we're 100% sure it's no longer needed. We've got some longer-term ideas to eliminate it entirely, but they rely on broader architectural changes to Ansible Core that are still in the formative stages.

IMO this problem was arguably worse before the new stream drain fix, since those final events were always silently lost - now at least you know there was a problem (where before you just got a truncated event stream). The simple exception catch + warning I proposed to "fix" this though is almost certainly not the right thing. We'll need to balance some accommodation for long-running/blocking event handlers with the need to respect other timeouts properly. A proper fix will probably loop-poll for EOF and consult the stream position to see if it moved between calls and/or track in-flight event handlers before deciding whether to give up or reissue another poll.

Meantime, just spitballing, but is there any reason your long-running event handler needs to be synchronous with the input pipeline (e.g., does the long-running part affect a filtering decision?), or could you offload it to a threadpool or other mechanism that wouldn't block the pump?
We've already moved the event handler logic to another thread and the problem has gone away. Nothing like a slow-consumer problem to remind you why you need to be quick in a callback. Happy days!
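For reference, a minimal sketch of that kind of offload, assuming a ThreadPoolExecutor and a hypothetical slow_processing function standing in for the real handler work:

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def slow_processing(data):
    # Expensive work (DB writes, HTTP calls, etc.) happens off the pump thread.
    pass

def my_event_handler(data):
    # Hand the event off and return immediately so the stream pump isn't blocked.
    executor.submit(slow_processing, data)
    return True  # keep the event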
Hello
We are running a lot of CI with ansible-runner, and since we moved to ansible_runner 2.3.5 some of the jobs experience this:
Some CI jobs work (with ansible_runner 2.3.5), some don't (also with ansible_runner 2.3.5), and those that don't always fail (a relaunch doesn't change anything).
I think the problem comes from the fact that child.expect(pexpect.EOF, timeout=5) never encounters pexpect.EOF in the child output. It might be because the pexpect.spawn fails, so child.expect(pexpect.EOF, timeout=5) has no output in which to find EOF (just guessing, I don't know much about pexpect).
Regards
Thanks
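For context, a tiny standalone pexpect example (using /bin/sh and sleep as stand-ins, not ansible-runner itself) showing how expect(pexpect.EOF, timeout=5) raises TIMEOUT when the child's output stream stays open past the timeout:

import pexpect

# A child that stays alive (and quiet) for longer than the expect timeout.
child = pexpect.spawn("/bin/sh", ["-c", "sleep 10"], encoding="utf-8")
try:
    child.expect(pexpect.EOF, timeout=5)  # EOF only arrives once the child exits
except pexpect.exceptions.TIMEOUT:
    print("TIMEOUT: no EOF within 5 seconds")
finally:
    child.close(force=True)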