[Housekeeping] Flyte multi-cluster network issues on k8s submission - kube-api-client - allow retries #6605

@punkerpunker

Describe the issue

When running Flyte in a multi-cluster environment (separate flyte-controlplane and flyte-dataplane clusters), FlyteAdmin sometimes encounters errors while creating a FlyteWorkflow custom resource in a dataplane cluster, e.g.:

failed to create workflow in propeller Post "https://kube-apiserver.<domain>/apis/flyte.lyft.com/v1alpha1/namespaces/ft-production/flyteworkflows?timeout=30s": net/http: TLS handshake timeout

It happens when K8sWorkflowExecutor is sending a POST to the kube-apiserver of the dataplane cluster:

_, err = targetCluster.FlyteClient.FlyteworkflowV1alpha1().FlyteWorkflows(data.Namespace).Create(ctx, flyteWf, v1.CreateOptions{})
if err != nil {
	if !k8_api_err.IsAlreadyExists(err) {
		logger.Debugf(context.TODO(), "Failed to create execution [%+v] in cluster: %s", data.ExecutionID, targetCluster.ID)
		return interfaces.ExecutionResponse{}, errors.NewFlyteAdminErrorf(codes.Internal, "failed to create workflow in propeller %v", err)
	}
}

As a result, the workflow fails on the very first attempt, and there is no good mechanism to retry the submission (other than recovering the job manually).

I guess network issues are somewhat inevitable in a multi-cluster setup, so it would be good to have a retry mechanism here. I saw there is a very comprehensive retry handler for execution node failures, but this error happens before that point, where there is no retry machinery.

I am happy to implement this myself and open a PR; I just wanted to hear opinions on the matter first. Thanks, folks!

What if we do not do this?

Flyte multi-cluster setups will keep suffering badly from intermittent issues (especially network-related ones).

Related component(s)

flyteadmin
flytepropeller

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Labels: housekeeping (Issues that help maintain flyte and keep it tech-debt free)
