
Ferrum 0.15.0 causes crash in CI and test suite collapse #470

Open · pond opened this issue Jul 3, 2024 · 11 comments

@pond commented Jul 3, 2024

Describe the bug

I'm not sure how to supply a useful bug report here given the bizarre behaviour, but: we have a large RSpec test suite, including lots of headless Chrome tests, that runs on AWS CI (CodeBuild / CodePipeline) triggered off GitHub commits. Recently we updated our bundle, which took Cuprite from 0.15.0 to 0.15.1; this in turn requires Ferrum 0.15.0. Our test suite started failing spectacularly, but intermittently (maybe a 70% failure rate, at arbitrary points in the suite, apparently regardless of seed). I'll explain more elsewhere in the template since this section is meant to be "brief" - we see this:

#<Thread:0x00007f4e63f053a8 /usr/local/bundle/gems/ferrum-0.15/lib/ferrum/utils/thread.rb:9 run> terminated with exception (report_on_exception is true):
/usr/local/bundle/gems/ferrum-0.15/lib/ferrum/client.rb:90:in `send_message': Timed out waiting for response. It's possible that this happened because something took a very long time (for example a page load was slow). If so, setting the :timeout option to a higher value might help. (Ferrum::TimeoutError)
    from /usr/local/bundle/gems/ferrum-0.15/lib/ferrum/client.rb:22:in `command'

Part of Ferrum, or the Chrome instance, then appears to have crashed, because all subsequent tests fail with:

Ferrum::NoSuchTargetError:
  Ferrum::NoSuchTargetError

To Reproduce

This is the problem: it replicates easily in AWS CI, but we can't reproduce it locally. A TimeoutError comes from client.rb line 90, exactly as it does if I deliberately set the :timeout option to something very low. When I do that on a local machine, or in CI in an attempt to provoke a replication, I just see things failing "as expected" with:

Ferrum::TimeoutError:
  Timed out waiting for response. It's possible that this happened because something took a very long time (for example a page load was slow). If so, setting the :timeout option to a higher value might help.
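
For reference, forcing the low timeout is just a driver-registration option. A minimal sketch (the driver name here is made up for illustration; Cuprite forwards options such as :timeout through to Ferrum::Browser):

# Hypothetical driver registration used to provoke timeouts deliberately.
require "capybara/cuprite"

Capybara.register_driver(:cuprite_low_timeout) do |app|
  # :timeout is forwarded to Ferrum::Browser; 1 second is deliberately far too low.
  Capybara::Cuprite::Driver.new(app, timeout: 1)
end

Capybara.javascript_driver = :cuprite_low_timeout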

In case it is important: after a few hundred failures like this, the message suddenly changes and subsequent tests report:

 Errno::ENOMEM:
   Cannot allocate memory - /usr/bin/chromium

I do not know whether this is significant or just an unrelated minor bug (perhaps Ferrum does not close down old Chromium instances if its comms time out too soon), since it only occurred while we were trying, and failing, to replicate the nasty crash error.

We have yet to persuade a local machine to reproduce the unhandled exception that crashes things so badly; the "canary" error seen in CI is:

#<Thread:0x00007f4e63f053a8 /usr/local/bundle/gems/ferrum-0.15/lib/ferrum/utils/thread.rb:9 run> terminated with exception (report_on_exception is true):
/usr/local/bundle/gems/ferrum-0.15/lib/ferrum/client.rb:90:in `send_message': Timed out waiting for response. It's possible that this happened because something took a very long time (for example a page load was slow). If so, setting the :timeout option to a higher value might help. (Ferrum::TimeoutError)
    from /usr/local/bundle/gems/ferrum-0.15/lib/ferrum/client.rb:22:in `command'

Since this comes from the same piece of code, it's hard to see how it could arise in such different ways unless perhaps the code paths used to reach this part of client.rb are very different in each case and one is missing an exception handler.

Expected behavior

I would not expect timeouts at all. The suite should run normally. It does with Ferrum 0.14.0 and has for many years with that and prior versions. 0.15.0 introduces the new behaviour. We've pinned to Cuprite 0.15.0 / Ferrum 0.14.0 for now, and CI is working as usual.
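
For anyone needing the same stopgap, the pin is a one-line change each in the Gemfile (a sketch of the workaround described above; version pairing per the compatibility notes under "Additional context" below):

# Gemfile: pin the last known-good pairing until this is resolved.
gem "cuprite", "0.15.0" # works with Ferrum 0.14.0
gem "ferrum",  "0.14.0" # 0.15.0 introduces the behaviour described here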

If a real timeout did happen, I'd expect it to be handled in the usual way:

Ferrum::TimeoutError:
  Timed out waiting for response. It's possible that this happened because something took a very long time (for example a page load was slow). If so, setting the :timeout option to a higher value might help.

...rather than a thread termination.

Screenshots

It doesn't really help much, but to prove we're not making it up 😂 here's an AWS CI screenshot from the point where things break:

[Screenshot 2024-07-04 at 09.32.16: AWS CI log output at the point of failure]

Desktop (please complete the following information):

  • OS: CodeBuild on-demand, Dockerised
  • Browser: Chromium 120.0.6099.224-1~deb11u1 (Debian bullseye, amd64)
  • Version: Ferrum 0.15.0

Additional context

Note that we don't think this is Cuprite, but since Cuprite 0.15.0 only works with Ferrum 0.14.0 and not 0.15.0 and, conversely, Cuprite 0.15.1 only works with Ferrum 0.15.0 and not 0.14.0, we can only upgrade or downgrade the two gems in lockstep. I couldn't see anything in Cuprite's CHANGELOG.md that looked like a possible cause, but I quickly saw in Ferrum's (excellent, detailed) CHANGELOG.md some potential causes, though these could be red herrings. All of them are in one big PR, #432: a large change set related to comms, threading and exceptions.

There is also a WebSocket gem constraint change, but we were and still are on 0.7 (the latest), which works fine, so this does not appear to be involved.

@pond (Author) commented Jul 3, 2024

As an additional note, should it be useful, we can fork the gem, revert any changes that might be suspected of causing the problem and run CI on a branch of our main code base that's using our forked Ferrum copy.

@DuncSmith

We've had a very similar experience. We have a large test suite and use custom GitHub Actions runners on AWS EC2 instances to run headless tests. We see ~4/10 runs fail, typically with something like:

1.1) Failure/Error: page.driver.wait_for_network_idle
  
            Ferrum::NoSuchTargetError:
              Ferrum::NoSuchTargetError
            # /usr/local/bundle/bundler/gems/ferrum-19767d0885af/lib/ferrum/context.rb:51:in `create_target'
            # /usr/local/bundle/bundler/gems/ferrum-19767d0885af/lib/ferrum/context.rb:20:in `default_target'
            # /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/browser.rb:246:in `attach_page'
            # /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/browser.rb:33:in `page'
            # /usr/local/bundle/gems/cuprite-0.15.1/lib/capybara/cuprite/driver.rb:262:in `wait_for_network_idle'

We suspect something in Ferrum is causing the issue, but we're finding it difficult to replicate locally. We have also tried pinning Ferrum to the latest master commit and see the same failures.

@route (Member) commented Jul 6, 2024

Can you try setting flatten: false? That's the recent major change in the release.
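
For anyone following along, the option is passed at driver registration. A sketch, assuming Cuprite forwards the option through to Ferrum::Browser, with a made-up driver name:

require "capybara/cuprite"

Capybara.register_driver(:cuprite_no_flatten) do |app|
  # :flatten is forwarded to Ferrum::Browser; false opts out of the
  # flattened-protocol session handling introduced in 0.15.0.
  Capybara::Cuprite::Driver.new(app, flatten: false)
end

Capybara.javascript_driver = :cuprite_no_flatten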

@pond (Author) commented Jul 7, 2024

@route We've had two full successful CI runs on Cuprite 0.15.1 / Ferrum 0.15.0 using flatten: false. I'm re-running a couple more (since the failure rate wasn't 100% beforehand, I'd like a bit more confidence), but it's looking good so far.

This suggests that perhaps flatten: true has a more limited scope of use than its documentation implies (so maybe it should default to false), or that the new implementation has a bug.

@pond (Author) commented Jul 8, 2024

Four passes now, so I'm confident that flatten: true makes our CI crash while flatten: false resolves the problem.

@DuncSmith

We're also seeing a much better success rate with the flatten: false setting. Thank you.

Should this setting be recommended when running under certain conditions?

@route (Member) commented Jul 8, 2024

That's only a way to triage the issue for now. The two modes should work exactly the same way, so I'm not yet sure where the issue lies.

@route (Member) commented Jul 9, 2024

I'll be back at work next week; please bear with me. Meanwhile, any debug logs would be appreciated.

@dgholz commented Aug 12, 2024

@route just to confirm, setting the FERRUM_DEBUG environment variable to 1 (or any value) should be enough to enable debug logs? Where should I look for them to share with you?

@route (Member) commented Aug 13, 2024

@dgholz Yes, they'll be in the standard output.
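
Concretely, the variable only needs to be present in the environment before the browser boots; a sketch:

# e.g. at the top of spec_helper.rb, before any Cuprite/Ferrum browser starts:
ENV["FERRUM_DEBUG"] = "true" # any value enables CDP debug logging

# The ▶ (sent) / ◀ (received) CDP traffic then appears on standard output.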

@dmorgan-fa

Logs for comparison

@route, on behalf of @dgholz, here are the CDP logs.

It looks like we're failing on the first command in prepare_page:

ferrum/lib/ferrum/page.rb, lines 432–434 at 5ca5e9e:

def prepare_page
  command("Page.enable")
  command("Runtime.enable")

It keeps trying to get a response until it eventually hits the timeout:

CDP logs for a run where `prepare_page` fails:

▶ 8.290704247999997 {"method":"Target.setDiscoverTargets","params":{"discover":true},"id":1}
    ◀ 8.392297378999984 {"id":1,"result":{}}


▶ 8.39271608300001 {"method":"Target.createBrowserContext","params":{},"id":2}
    ◀ 8.394013621999989 {"id":2,"result":{"browserContextId":"693DAF0FB64DCF11113BFAA24D5F72D5"}}


▶ 8.394404509999987 {"method":"Target.createTarget","params":{"browserContextId":"693DAF0FB64DCF11113BFAA24D5F72D5","url":"about:blank"},"id":3}
    ◀ 8.454402695999988 {"method":"Target.targetCreated","params":{"targetInfo":{"targetId":"9C62F26633D6DED03E88E28EFD7BA9AC","type":"page","title":"","url":"","attached":false,"canAccessOpener":false,"browserContextId":"693DAF0FB64DCF11113BFAA24D5F72D5"}}}
    ◀ 8.63926736900001 {"id":3,"result":{"targetId":"9C62F26633D6DED03E88E28EFD7BA9AC"}}


▶ 8.640660305000011 {"method":"Page.enable","params":{},"id":1}
    ◀ 8.643854226999991 {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"9C62F26633D6DED03E88E28EFD7BA9AC","type":"page","title":"","url":"about:blank","attached":true,"canAccessOpener":false,"browserContextId":"693DAF0FB64DCF11113BFAA24D5F72D5"}}}
    ◀ 10.715372056000007 {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"9C62F26633D6DED03E88E28EFD7BA9AC","type":"page","title":"about:blank","url":"about:blank","attached":true,"canAccessOpener":false,"browserContextId":"693DAF0FB64DCF11113BFAA24D5F72D5"}}}
    ◀ 10.792934142999997 {"method":"Page.frameStoppedLoading","params":{"frameId":"9C62F26633D6DED03E88E28EFD7BA9AC"}}

▶ 23.641489156000006 {"method":"Page.enable","params":{},"id":2}


▶ 38.670252793 {"method":"Page.enable","params":{},"id":3}


▶ 53.67084481400002 {"method":"Page.enable","params":{},"id":4}


▶ 68.689855981 {"method":"Page.enable","params":{},"id":5}


▶ 83.69056183999999 {"method":"Page.enable","params":{},"id":6}


▶ 98.70807829299997 {"method":"Page.enable","params":{},"id":7}


▶ 113.70876533799998 {"method":"Page.enable","params":{},"id":8}


▶ 128.723454742 {"method":"Page.enable","params":{},"id":9}


▶ 143.72415697600002 {"method":"Page.enable","params":{},"id":10}


▶ 158.740797757 {"method":"Page.enable","params":{},"id":11}


▶ 173.74139239599998 {"method":"Page.enable","params":{},"id":12}

When it's successful, it gets a response quickly from the Page.enable call and continues with the next command in prepare_page, Runtime.enable:

CDP logs for a run where `prepare_page` succeeds:


▶ 0.1534243080000124 {"method":"Target.setDiscoverTargets","params":{"discover":true},"id":1}
    ◀ 0.20593191800000454 {"id":1,"result":{}}


▶ 0.20637974699999972 {"method":"Target.createBrowserContext","params":{},"id":2}
    ◀ 0.20729468999999767 {"id":2,"result":{"browserContextId":"5AF4DEE7FBDD6B226726A7BF8BEDF510"}}


▶ 0.2077075720000039 {"method":"Target.createTarget","params":{"browserContextId":"5AF4DEE7FBDD6B226726A7BF8BEDF510","url":"about:blank"},"id":3}
    ◀ 0.21541300900000238 {"method":"Target.targetCreated","params":{"targetInfo":{"targetId":"173C3D821079164F5689C7429487AB98","type":"page","title":"","url":"","attached":false,"canAccessOpener":false,"browserContextId":"5AF4DEE7FBDD6B226726A7BF8BEDF510"}}}
    ◀ 0.21854109400001676 {"id":3,"result":{"targetId":"173C3D821079164F5689C7429487AB98"}}


▶ 0.220262273000003 {"method":"Page.enable","params":{},"id":1}
    ◀ 0.22053473100001497 {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"173C3D821079164F5689C7429487AB98","type":"page","title":"","url":"about:blank","attached":true,"canAccessOpener":false,"browserContextId":"5AF4DEE7FBDD6B226726A7BF8BEDF510"}}}
    ◀ 0.237396044999997 {"method":"Target.targetInfoChanged","params":{"targetInfo":{"targetId":"173C3D821079164F5689C7429487AB98","type":"page","title":"about:blank","url":"about:blank","attached":true,"canAccessOpener":false,"browserContextId":"5AF4DEE7FBDD6B226726A7BF8BEDF510"}}}
    ◀ 0.2710266029999957 {"method":"Page.frameStoppedLoading","params":{"frameId":"173C3D821079164F5689C7429487AB98"}}
    ◀ 0.2710766719999924 {"id":1,"result":{}}


▶ 0.2714279319999946 {"method":"Runtime.enable","params":{},"id":2}
    ◀ 0.27341712300000154 {"method":"Runtime.executionContextCreated","params":{"context":{"id":1,"origin":"://","name":"","uniqueId":"-5443455750106636604.399399801198688419","auxData":{"isDefault":true,"type":"default","frameId":"173C3D821079164F5689C7429487AB98"}}}}
    ◀ 0.2736788459999957 {"id":2,"result":{}}


▶ 0.273712321000005 {"method":"Page.getFrameTree","params":{},"id":3}


▶ 0.2738927900000192 {"method":"DOM.enable","params":{},"id":4}
    ◀ 0.2740514879999978 {"id":3,"result":{"frameTree":{"frame":{"id":"173C3D821079164F5689C7429487AB98","loaderId":"AE26E3932520EEF3AF5A0DBC4A9258C7","url":"about:blank","domainAndRegistry":"","securityOrigin":"://","mimeType":"text/html","adFrameStatus":{"adFrameType":"none"},"secureContextType":"InsecureScheme","crossOriginIsolatedContextType":"NotIsolated","gatedAPIFeatures":[]}}}}
    ◀ 0.2742913430000158 {"id":4,"result":{}}

Also worth noting: when the error occurs, it hits the first feature spec the node runs, rather than a few specs passing and failures starting later.

We saw a huge spike in failures with the latest version of Chrome, v127, and subsequently had to roll back to v126 in our actions; otherwise we had a failure rate of about 25%.
On v126, we see intermittent failures maybe once a month across around 2.5–3.5k runs.

Please let me know if there's anything else we can share in terms of debugging, to make it clearer where the bug is.
Thanks!
