Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CheckWhenObserve mode broken b/c ansible logs are not valid json #290

Open
d-honeybadger opened this issue Jan 17, 2024 · 7 comments
Open
Labels
bug Something isn't working

Comments

@d-honeybadger
Copy link
Collaborator

d-honeybadger commented Jan 17, 2024

What happened?

In CheckWhenObserve mode, reconciliation always fails with observe failed: Error decoding results\n\tinvalid character error. This error comes from provider-ansible attempting to parse ansible logs:

res, err := results.ParseJSONResultsStream(stdoutBuf)

The function results.ParseJSONResultsStream expects logs in json format (e.g. something you'd get using json callback. so, when trying to parse "classic" non-json ansible logs it, natually, fails.

How can we reproduce it?

Create any ansiblerun with CheckWhenObserve policy.
Here's my example:

apiVersion: ansible.crossplane.io/v1alpha1
kind: AnsibleRun
metadata:
  annotations:
    ansible.crossplane.io/runPolicy: CheckWhenObserve
  name: example-add-line
spec:
  forProvider:
    inventoryInline: |2
      cluster:
        hosts:
          default-0:
            ansible_host: <IP>
    playbookInline: |2
      - name: add-line-to-file
        hosts: cluster
        tasks:
        - name: Add a line to file
          ansible.builtin.lineinfile:
            path: /root/testfile
            line: added-line
            state: present
    vars:
      ansible_ssh_private_key_file: ./ssh_id
  providerConfigRef:
    name: example

Observe the following status:

status:
  atProvider: {}
  conditions:
  - lastTransitionTime: "2024-01-17T21:57:05Z"
    message: "observe failed: Error decoding results\n\tinvalid character 'P' looking
      for beginning of value"
    reason: ReconcileError
    status: "False"
    type: Synced

Note that the same playbook succeeds in ObserveAndDelete mode.

What environment did it happen in?

Crossplane version:
1.14.0

@d-honeybadger d-honeybadger added the bug Something isn't working label Jan 17, 2024
@d-honeybadger
Copy link
Collaborator Author

Was trying to figure out how to get those ansible logs parsed for another issue (extracting failure message for the Ready condition) and not seeing any quick and elegant solutions tbh. Here's what I explored:

  1. Use ansible's json callback to have it output logs in json format that can then be parsed by go-ansible package. Doesn't work cause ansible-runner always sets its own custom awx_display callback, and only one callback can be enabled.
  2. Same as above, but use go-ansible instead of ansible-runner, so that we can have json callback and integrate it nicely with go. Doesn't work cause go-ansible lacks some features that provider-ansible depends on, like executable inventory.
  3. Try to use awx_display output. It's also in json, but a completely different json structure than vanilla ansible json, so there's no package for parsing it, so would require defining all the custom types for parsing json events.

Basically, No.3 is the only option. It's a lot of work for a seemingly small thing, but there's nothing fundamentally wrong with it.

Lmk if I have a green light to work on it :)

@d-honeybadger
Copy link
Collaborator Author

@morningspace Since you were looking at my other issues and PRs (thank you!), maybe you could check out this one too?

@dfry
Copy link

dfry commented Jul 23, 2024

Hi @d-honeybadger , is there any progress on being able to use CheckWhenObserve policy? Cheers

@d-honeybadger
Copy link
Collaborator Author

d-honeybadger commented Jul 26, 2024

Hi @d-honeybadger , is there any progress on being able to use CheckWhenObserve policy? Cheers

There's some progress in parsing logs in the awx_display format (that's what ansible-runner produces by default), but it isn't comprehensive, it just extracts a few structs that are relevant for determining success/failure. The solution for this bug should probably build up on that parsing to also extract "changed" field from running tasks in check mode.

I can take a look in a bit now that there's interest in this ticket :) but also if you're at all interested in tackling it yourself I'll be happy to guide you through it

@dfry
Copy link

dfry commented Jul 26, 2024

Thanks for the update. I am pretty swamped right now. I am interested in figuring out how to construct my role in order to play nicely with CheckWhenObserve mode. There is only one task in my role that returns different results that I would want to use to trigger another run. I was hoping that there is someway with the current release that I could achieve that end goal.

cheers

@dfry
Copy link

dfry commented Oct 3, 2024

@d-honeybadger please let me know if you can give me some more guidance on fixing this, I want to see if I can take this on, or someone else on my team. thanks

@d-honeybadger
Copy link
Collaborator Author

d-honeybadger commented Oct 7, 2024

@dfry Thanks for considering it!

First, to be clear, I think this functionality has never worked in the first place, so look at it as feature work rather than bug fix.

Here's a walk-through:
For CheckWhenObserve to work, we need to

  1. run ansible in check mode
  2. collect logs from that run
  3. parse those logs to determine if there were any steps with changed=1

Provider-ansible is using ansible-runner under the hood, rather than plain ansible. This limits our logging options, b/c ansible-runner is hard-wired to use awx_display logging callback. The logs that are produced are, thankfully, in json, but they are NOT the same as as regular Ansible logs produced with the json callback.
So far I haven't found any libraries for parsing awx_display-structured logs, so I started some minimal parsing in this PR. The structure of awx_display logs is documented to some extent here https://github.com/ansible/awx/blob/devel/docs/job_events.md but not comprehensively so we'd need to just inspect a bunch of logs and deduce the structure from there. Or read awx code 🤷
Although not very well documented, those logs are quite comprehensive, so I'm pretty sure that the changed fields is available for each task, so we'll just need to aggregate all those to determine if a run as a whole leads to any changes.

To explore those logs, I'd just exec into the provider-ansible pod, where, under the ansibleDir directories, there is a directory for each AnsibleRun resource (name of subdirectory is the UID of the AnsibleRun), and under that, there's a directory for each invocation of that AnsibleRun, and under there there is a directory with so called "job_events" - those are the logs to be parsed.

Hope this was enough to give you an idea. I'd start just execing into provider-ansible and poking at ansibleDir. Once you make sense of those so-called "job_events" it should be straightforward to do a bit more parsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants