FLINK-5725: Add extra Flink details to paasta status #4063
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

This PR adds yelpsoa and srv links to the verbose paasta status output for Flink, to speed up gathering details about the cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.
I plan to add more in future PRs.

New paasta status output (run locally)

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
Traceback (most recent call last):
File "/nail/home/nathanleigh/source/paasta/.tox/py38-linux/bin/paasta
```

Output

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
sqlclient.excluded_assignments_log_counts_by_dimensions in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 2ba0c242
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c9d8cdc84/
Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/sqlclient
Srv configs: https://sourcegraph.yelpcorp.com/sysgit/srv-configs/-/tree/ecosystem/pnw-prod/sqlclient
State: Running
Pods: 14 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 0 cancelled
12 taskmanagers, 12/72 slots available
...
```
Testing Copilot review out of curiosity (haven't used it before)
Pull Request Overview
This PR enhances paasta status output for Flink deployments by adding Yelpsoa and Srv configuration links to facilitate quicker diagnostics. The changes propagate through both the CLI status commands and the corresponding tests.
- Introduces new output lines for Yelpsoa and Srv configs in status display.
- Updates tests to validate the new output format.
- Modifies function signatures in status commands to include an additional "cluster" parameter for proper URL generation.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/cli/test_cmds_status.py | Adds expected Yelpsoa and Srv configs to test output |
| paasta_tools/cli/cmds/status.py | Adds extra config links and updates function calls |
paasta_tools/cli/cmds/status.py
output.append(
    f" Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/{service}"
)
cluster_without_pnw_dash = cluster.replace("pnw-", "")
Using .replace("pnw-", "") may remove multiple occurrences of the substring instead of only the prefix. Consider using a conditional check or a method that only removes the prefix if present.
- cluster_without_pnw_dash = cluster.replace("pnw-", "")
+ cluster_without_pnw_dash = cluster.removeprefix("pnw-")
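The difference between the two calls can be seen in a small sketch. The cluster name below is made up so that `pnw-` appears twice, and `strip_prefix` is a hypothetical helper, not part of paasta. Note that `str.removeprefix` only exists on Python 3.9 and later, so code that still has to run on Python 3.8 (the traceback path in the PR description mentions a `py38` tox environment) would need a fallback along these lines.

```python
def strip_prefix(value: str, prefix: str) -> str:
    """Remove `prefix` only when `value` starts with it (works on Python 3.8)."""
    if prefix and value.startswith(prefix):
        return value[len(prefix):]
    return value


# Hypothetical cluster name, chosen so that "pnw-" occurs twice.
cluster = "pnw-prod-pnw-test"

replaced = cluster.replace("pnw-", "")    # removes every occurrence
stripped = strip_prefix(cluster, "pnw-")  # removes the leading occurrence only

print(replaced)
print(stripped)
```

For the cluster names that appear in this PR (`pnw-prod`, `pnw-devc`) both approaches give the same result; they only diverge when the substring appears somewhere other than the start.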
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

This commit adds git and sourcegraph links to the verbose paasta status output for Flink, to speed up gathering details about the cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.
I plan to add more in future PRs.

New paasta status output (run locally)

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
Traceback (most recent call last):
File "/nail/home/nathanleigh/source/paasta/.tox/py38-linux/bin/paasta
```

Output

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
sqlclient.excluded_assignments_log_counts_by_dimensions in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 2ba0c242
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c9d8cdc84/
Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/sqlclient
Srv configs: https://sourcegraph.yelpcorp.com/sysgit/srv-configs/-/tree/ecosystem/prod/sqlclient
State: Running
Pods: 14 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 0 cancelled
12 taskmanagers, 12/72 slots available
Jobs:
    Job Name  State  Job ID  Started
    excluded_assignments_log_counts_by_dimensions  Running  d6c20dd9e9741c9646a41daa2f5b88aa  2025-05-13 05:11:30 (4 hours ago)
    https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c9d8cdc84/#/jobs/d6c20dd9e9741c9646a41daa2f5b88aa
Pods:
    Pod Name  Host  Phase  Uptime
    sqlclient-c9d8cdc84-jobmanager-5ff9cd5ff5-469m2  ip-10-69-177-166.us-west-2.compute.internal  Running  0d4h59m1s
...
```
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

This commit adds Flink log commands to the verbose paasta status output for Flink, to speed up gathering details about the cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.

New paasta status output (run locally)

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
Traceback (most recent call last):
File "/nail/home/nathanleigh/source/paasta/.tox/py38-linux/bin/paasta
```

Output

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
sqlclient.excluded_assignments_log_counts_by_dimensions in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 2ba0c242
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c9d8cdc84/
Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/sqlclient
Srv configs: https://sourcegraph.yelpcorp.com/sysgit/srv-configs/-/tree/ecosystem/prod/sqlclient
Flink Log Commands:
    Service: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions
    Taskmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.TASKMANAGER
    Jobmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.JOBMANAGER
    Supervisor: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.SUPERVISOR
State: Running
Pods: 14 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 0 cancelled
12 taskmanagers, 12/72 slots available
...
```
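The four log commands in this output follow one pattern: the plain instance name for the service-level logs, plus a component suffix for each Flink component. A minimal sketch of how such lines could be generated is below; `flink_log_commands` and its `lookback` parameter are hypothetical names for illustration, not the code in this PR.

```python
from typing import List

# Labels and instance suffixes as they appear in the output above;
# the empty suffix corresponds to the service-level log command.
COMPONENTS = [
    ("Service", ""),
    ("Taskmanager", ".TASKMANAGER"),
    ("Jobmanager", ".JOBMANAGER"),
    ("Supervisor", ".SUPERVISOR"),
]


def flink_log_commands(
    service: str, instance: str, cluster: str, lookback: str = "1h"
) -> List[str]:
    """Return one 'Label: paasta logs ...' line per Flink component."""
    return [
        f"{label}: paasta logs -a {lookback} -c {cluster} -s {service} -i {instance}{suffix}"
        for label, suffix in COMPONENTS
    ]


for line in flink_log_commands("sqlclient", "my_instance", "pnw-prod"):
    print(line)
```

Keeping the suffixes in one table means a new component only needs one extra tuple rather than another hand-written f-string.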
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

This commit adds Flink Grafana monitoring links to the verbose paasta status output for Flink, to speed up gathering details about the cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.

New paasta status output (run locally)

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
Traceback (most recent call last):
File "/nail/home/nathanleigh/source/paasta/.tox/py38-linux/bin/paasta
```

Output

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-prod -vvv
sqlclient.excluded_assignments_log_counts_by_dimensions in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 2ba0c242
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c9d8cdc84/
Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/sqlclient
Srv configs: https://sourcegraph.yelpcorp.com/sysgit/srv-configs/-/tree/ecosystem/prod/sqlclient
Flink Log Commands:
    Service: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions
    Taskmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.TASKMANAGER
    Jobmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.JOBMANAGER
    Supervisor: paasta logs -a 1h -c pnw-prod -s sqlclient -i excluded_assignments_log_counts_by_dimensions.SUPERVISOR
Flink Monitoring:
    Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&var-job=All&from=now-24h&to=now
    Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&from=now-24h&to=now
    JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&from=now-24h&to=now
State: Running
Pods: 14 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 0 cancelled
...
```
paasta_tools/cli/cmds/status.py
if verbose:
    output.append(f" Flink Monitoring:")
    output.append(
        f" Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-{cluster_without_pnw_dash}&var-service={service}&var-instance={instance}&var-job=All&from=now-24h&to=now"
I like the idea in general; my only concern is that these are Yelp URLs while the changes are in a public repository. Is this repository in use outside of Yelp?
@anohovsky I'm not sure, but I imagine the chances of people using it AND Flink are very low
paasta-tools already has a lot of 'yelpy' stuff, such as yelpsoa parsing, and we already have yelpy URLs in the code
https://github.com/Yelp/paasta/blob/1fb40c62b5199f8da2305f98619a2a28d506e1fe/paasta_tools/cli/cmds/status.py#L816C1-L819C1
# Annotation "flink.yelp.com/dashboard_url" is populated by flink-operator
dashboard_url = metadata["annotations"].get("flink.yelp.com/dashboard_url")
output.append(f" URL: {dashboard_url}/")
@anohovsky I have asked compute infra(paasta owners) just to double check
#4063 (comment)
no one outside of yelp uses paasta (nor are we expecting anyone to): it's more of a "developed in the open" type of thing than anything else
we do normally try to limit some of the yelpiness when possible by making things configurable through SystemPaastaConfig - but that's not really a hard line
e.g., if we wanted to stop hardcoding some of these URLs in paasta, we could do something like store the base urls in system paasta config and then build/append the query string here
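A rough sketch of that idea: keep only the base URL in configuration and build the query string in code. The config key `flink_job_metrics_base_url`, the `example.com` host, and the function name are all assumptions made for illustration; the query parameters mirror the Job Metrics link in the diff above.

```python
from urllib.parse import urlencode

# Stand-in for values that would be read from system paasta config.
# The key name and host are hypothetical.
FAKE_SYSTEM_CONFIG = {
    "flink_job_metrics_base_url": "https://grafana.example.com/d/flink-metrics/flink-job-metrics",
}


def build_job_metrics_url(base_url: str, region: str, service: str, instance: str) -> str:
    """Append the dashboard query string to a configurable base URL."""
    params = {
        "orgId": 1,
        "var-datasource": "Prometheus-flink",
        "var-region": region,
        "var-service": service,
        "var-instance": instance,
        "var-job": "All",
        "from": "now-24h",
        "to": "now",
    }
    # urlencode keeps dict insertion order and escapes any unsafe characters.
    return f"{base_url}?{urlencode(params)}"


url = build_job_metrics_url(
    FAKE_SYSTEM_CONFIG["flink_job_metrics_base_url"],
    region="uswest2-prod",
    service="sqlclient",
    instance="main",
)
print(url)
```

Using `urlencode` instead of a hand-written f-string also protects against service or instance names that contain characters needing escaping.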
paasta_tools/cli/cmds/status.py
@@ -809,6 +817,47 @@ def _print_flink_status_from_job_manager(
dashboard_url = metadata["annotations"].get("flink.yelp.com/dashboard_url")
output.append(f" URL: {dashboard_url}/")

cluster_without_pnw_dash = cluster.replace("pnw-", "")
this is somewhat brittle and will not work for things like infrastage
i think we might want a mapping somewhere of cluster -> ecosystem + cluster -> region (perhaps system paasta config if there's not already something for this)
We have very few services running on non-pnw clusters, so I thought I would omit those cases. But I have now added handling so that it fully works for all mappings: 56cdcc5
paasta_tools/cli/cmds/status.py
)

# Print Flink Log Commands
if verbose:
we have a bunch of `if verbose:` blocks in a row - should these be a single block?
Possibly, it was more just for formatting/separating
Tbf it's getting to the point where we could maybe refactor the whole function; I will consider it in a new PR
we can separate with comments inside a single `if verbose:` block :)
that said ++ to more refactoring in another PR
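The suggestion to keep one `if verbose:` block and separate the sections with comments could look roughly like this. The function name and the placeholder strings are illustrative only, not the PR's actual code.

```python
from typing import List


def verbose_flink_lines(service: str, instance: str, cluster: str, verbose: int) -> List[str]:
    """Collect the extra Flink lines under a single verbosity check."""
    output: List[str] = []
    if verbose:
        # Config links (placeholder text, real links omitted)
        output.append(f"    Config links for {service}")

        # Flink log commands
        output.append("    Flink Log Commands:")
        output.append(
            f"      Service: paasta logs -a 1h -c {cluster} -s {service} -i {instance}"
        )

        # Flink monitoring (placeholder heading only)
        output.append("    Flink Monitoring:")
    return output


print("\n".join(verbose_flink_lines("sqlclient", "main", "pnw-devc", verbose=1)))
```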
paasta_tools/cli/cmds/status.py
    f" Yelpsoa configs: https://sourcegraph.yelpcorp.com/sysgit/yelpsoa-configs/-/tree/{service}"
)
output.append(
    f" Srv configs: https://sourcegraph.yelpcorp.com/sysgit/srv-configs/-/tree/ecosystem/"
i'd probably link to GH here since it's easier to go back to the PR that edited a line from there
i think we also never really type these as "srv configs" and "yelpsoa configs" on CI, but i guess whatever your users are expecting is fine 😔
Updated to GitHub links in 02c4363
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

To address review comment

```
this is somewhat brittle and will not work for things like infrastage
i think we might want a mapping somewhere of cluster -> ecosystem + cluster -> region (perhaps system paasta config if there's not already something for this)
```
Pull Request Overview
This PR enriches the paasta status output for Flink clusters by adding additional Flink resource links and log commands to assist in debugging and troubleshooting.
- Tests have been updated to reflect the new outputs including repo links, log commands, and monitoring URLs.
- A mapping has been added for determining the ecosystem based on the Flink cluster, and the status printing functions have been updated to utilize it.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/cli/test_cmds_status.py | Updated expected output strings for enhanced Flink details. |
| paasta_tools/utils.py | Added mapping SUPPERREGION_TO_ECOSYSTEM_MAPPINGS for ecosystem inference. |
| paasta_tools/cli/cmds/status.py | Adjusted status functions to include extra Flink details using the new mapping. |
Comments suppressed due to low confidence (1)
paasta_tools/utils.py:185
- [nitpick] The mapping variable name 'SUPPERREGION_TO_ECOSYSTEM_MAPPINGS' appears to be inconsistent with the comment (which references 'superregion'). Consider renaming it to 'SUPERREGION_TO_ECOSYSTEM_MAPPINGS' for clarity.
SUPPERREGION_TO_ECOSYSTEM_MAPPINGS = {
paasta_tools/utils.py
# https://github.yelpcorp.com/sysgit/srv-configs/tree/master/superregion
SUPPERREGION_TO_ECOSYSTEM_MAPPINGS = {
    "norcal-devc": "devc",
    "norcal-stagef": "stagef",
    "norcal-stageg": "stageg",
    "nova-prod": "prod",
    "pnw-devc": "devc",
    "pnw-prod": "prod",
}
this should probably be accessed through SystemPaastaConfig - i.e., live in puppet
also, this approach will still fail for infrastage - not all clusters are named after a superregion
@nemacysts we only have 1 service/instance in infrastage that we use for testing
https://sourcegraph.yelpcorp.com/search?q=repo:%5Esysgit/yelpsoa-configs%24+flinkeks-infrastage&patternType=keyword&sm=0
Infrastage would require a bit of extra logic (srv and soa links don't follow the pattern), so I decided to remove it
I am not sure what I would be accessing via SystemPaastaConfig
https://sourcegraph.yelpcorp.com/search?q=repo:%5EYelp/paasta%24+SystemPaastaConfigDict%28&patternType=keyword&sm=0
SystemPaastaConfigDict(
https://sourcegraph.yelpcorp.com/Yelp/paasta/-/blob/paasta_tools/utils.py?L1948-1950
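One way to read the reviewer's suggestion is that the cluster-to-ecosystem mapping would be data served by a config getter rather than a constant in `utils.py`. The sketch below uses a stand-in class, not the real `SystemPaastaConfig`; the key `cluster_to_ecosystem` and both function names are assumptions. Returning `None` for an unknown cluster lets the caller skip the link instead of failing for clusters that are not named after a superregion.

```python
from typing import Dict, Optional


class FakeSystemConfig:
    """Stand-in for a system-level config object (not the real SystemPaastaConfig)."""

    def __init__(self, config: Dict[str, Dict[str, str]]) -> None:
        self.config = config

    def get_cluster_to_ecosystem(self) -> Dict[str, str]:
        # Hypothetical key; in practice the data would be managed outside the code.
        return self.config.get("cluster_to_ecosystem", {})


def srv_configs_url(system_config: FakeSystemConfig, cluster: str, service: str) -> Optional[str]:
    """Return the srv-configs link, or None when the cluster has no known ecosystem."""
    ecosystem = system_config.get_cluster_to_ecosystem().get(cluster)
    if ecosystem is None:
        return None  # unknown cluster: the caller can simply omit the line
    return (
        "https://github.yelpcorp.com/sysgit/srv-configs/tree/master/"
        f"ecosystem/{ecosystem}/{service}"
    )


config = FakeSystemConfig({"cluster_to_ecosystem": {"pnw-prod": "prod", "pnw-devc": "devc"}})
print(srv_configs_url(config, "pnw-prod", "sqlclient"))
print(srv_configs_url(config, "some-unmapped-cluster", "sqlclient"))
```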
tests/cli/test_cmds_status.py
@@ -2702,7 +2702,7 @@ def test_output_stopping_jobmanager(
output = []
mock_flink_status["status"]["state"] = "Stoppingjobmanager"
print_flink_status(
-   cluster="fake_cluster",
+   cluster="pnw-devc",
once the mapping is moved to SPC, we can probably mock the getter and have a fake ecosystem for fake_cluster
@nleigh we still wanna do this: using a non-existent cluster ensures that if folks make a mistake wrt their mocks, we never actually hit a real paasta api
i guess for now this is still technically safe since we don't have any service called `fake_service` - but it would definitely make things safer (both for the existing tests and for folks copying this and potentially adding real service names/instances in new tests)
I see thanks
Updated in 2720594
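The pattern being asked for, mocking the lookup so a test can keep using a cluster that does not exist, might be sketched as follows. `Config`, `get_ecosystem_for_cluster`, and `ecosystem_line` are hypothetical names; only `unittest.mock.patch.object` is real library API.

```python
from unittest import mock


class Config:
    """Hypothetical config object whose real lookup must never run in unit tests."""

    def get_ecosystem_for_cluster(self, cluster: str) -> str:
        raise RuntimeError("real config lookup should not run in unit tests")


def ecosystem_line(config: Config, cluster: str) -> str:
    return f"ecosystem: {config.get_ecosystem_for_cluster(cluster)}"


def test_uses_fake_cluster() -> None:
    config = Config()
    # Patch the getter on this instance so the fake cluster maps to a fake ecosystem.
    with mock.patch.object(
        config, "get_ecosystem_for_cluster", return_value="fake_ecosystem"
    ) as getter:
        assert ecosystem_line(config, "fake_cluster") == "ecosystem: fake_ecosystem"
        getter.assert_called_once_with("fake_cluster")


test_uses_fake_cluster()
```

Because the cluster name is fake, a forgotten or misconfigured mock fails loudly instead of reaching a real cluster.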
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

To speed up gathering of details for a Flink cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.

Paasta status output

```
paasta status -s sqlclient -i excluded_assignments_log_counts_by_dimensions -c pnw-devc -vvv
sqlclient.excluded_assignments_log_counts_by_dimensions in pnw-devc (EKS)
Version: c74dd64a (desired)
Config SHA: c8ee0154
Owner: fxp
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: http://flink.eks.pnw-devc.paasta:31080/sqlclient-c9d8cdc84/
Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/sqlclient
Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/devc/sqlclient
Flink Log Commands:
    Service: paasta logs -a 1h -c pnw-devc -s sqlclient -i excluded_assignments_log_counts_by_dimensions
    Taskmanager: paasta logs -a 1h -c pnw-devc -s sqlclient -i excluded_assignments_log_counts_by_dimensions.TASKMANAGER
    Jobmanager: paasta logs -a 1h -c pnw-devc -s sqlclient -i excluded_assignments_log_counts_by_dimensions.JOBMANAGER
    Supervisor: paasta logs -a 1h -c pnw-devc -s sqlclient -i excluded_assignments_log_counts_by_dimensions.SUPERVISOR
Flink Monitoring:
    Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&var-job=All&from=now-24h&to=now
    Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&from=now-24h&to=now
    JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=sqlclient&var-instance=excluded_assignments_log_counts_by_dimensions&from=now-24h&to=now
State: Running
Pods: 7 running, 0 evicted, 0 other
Jobs: 0 running, 0 finished, 8 failed, 0 cancelled
4 taskmanagers, 4/4 slots available
```
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

To speed up gathering of details for a Flink cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.

Paasta status output

```
paasta status -s sqlclient -i ad_indexing_service_area_place_ids -c pnw-prod -vvv
sqlclient.ad_indexing_service_area_place_ids in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 3ac51a85
Flink Pool: flink
Owner: ranking_ingestion
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c6b979558/
Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/sqlclient
Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/prod/sqlclient
Flink Log Commands:
    Service: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids
    Taskmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.TASKMANAGER
    Jobmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.JOBMANAGER
    Supervisor: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.SUPERVISOR
Flink Monitoring:
    Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&var-job=All&from=now-24h&to=now
    Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&from=now-24h&to=now
    JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&from=now-24h&to=now
State: Running
Pods: 12 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 1 cancelled
```
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

To speed up gathering of details for a Flink cluster/job.
When running paasta status, you are normally debugging something (something is broken).
You may then need to check related resources for the Flink job, such as yelpsoa/srv.
This information is also useful to share in a Slack thread, reducing the time to spot the issue.

Paasta status output

```
paasta status -s sqlclient -i ad_indexing_service_area_place_ids -c pnw-prod -vvv
sqlclient.ad_indexing_service_area_place_ids in pnw-prod (EKS)
Version: c74dd64a (desired)
Config SHA: 3ac51a85
Flink Pool: flink
Owner: ranking_ingestion
Flink Runbook: y/rb-ring-sqlclient
Repo(git): https://github.yelpcorp.com/services/sqlclient
Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/sqlclient
Flink version: 1.17.2 c0027e5 @ 2023-11-09T13:24:38+01:00
URL: https://flink-eks-pnw-prod.yelpcorp.com/sqlclient-c6b979558/
Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/sqlclient
Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/prod/sqlclient
Flink Log Commands:
    Service: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids
    Taskmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.TASKMANAGER
    Jobmanager: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.JOBMANAGER
    Supervisor: paasta logs -a 1h -c pnw-prod -s sqlclient -i ad_indexing_service_area_place_ids.SUPERVISOR
Flink Monitoring:
    Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&var-job=All&from=now-24h&to=now
    Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&from=now-24h&to=now
    JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-prod&var-service=sqlclient&var-instance=ad_indexing_service_area_place_ids&from=now-24h&to=now
State: Running
Pods: 12 running, 0 evicted, 0 other
Jobs: 1 running, 0 finished, 0 failed, 1 cancelled
```
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39

To speed up gathering of details for a Flink cluster/job
When running paasta status, you are normally debugging something (something is broken)
You then may need to check related resources for the flink job such as yelpsoa/srv
Also, in a Slack thread this information can be useful to share, reducing the time to spot the issue
==========
Links to the paasta service utilisation dashboard
Adds divider to paasta status for easier readability
Renames of function

Paasta status output
```
paasta status -s acorn -i main -c pnw-devc -v
acorn.main in pnw-devc (EKS)
    Version: 6563d637 (desired)
    Config SHA: 00ce0453
    Flink Pool: flink-spot
    Owner: streaming-infrastructure
    Flink Runbook: y/rb-acorn
    Repo(git): https://github.yelpcorp.com/services/acorn
    Repo(sourcegraph): https://sourcegraph.yelpcorp.com/services/acorn
    Flink version: 1.13.5 0ff28a7 @ 2021-12-14T23:26:04+01:00
    URL: http://flink.eks.pnw-devc.paasta:31080/acorn-7f797b79f6/
    Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/tree/master/acorn
    Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/devc/acorn
    ==================================================================
    Flink Log Commands:
      Service: paasta logs -a 1h -c pnw-devc -s acorn -i main
      Taskmanager: paasta logs -a 1h -c pnw-devc -s acorn -i main.TASKMANAGER
      Jobmanager: paasta logs -a 1h -c pnw-devc -s acorn -i main.JOBMANAGER
      Supervisor: paasta logs -a 1h -c pnw-devc -s acorn -i main.SUPERVISOR
    ==================================================================
    Flink Monitoring:
      Job Metrics: https://grafana.yelpcorp.com/d/flink-metrics/flink-job-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=acorn&var-instance=main&var-job=All&from=now-24h&to=now
      Container Metrics: https://grafana.yelpcorp.com/d/flink-container-metrics/flink-container-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=acorn&var-instance=main&from=now-24h&to=now
      JVM Metrics: https://grafana.yelpcorp.com/d/flink-jvm-metrics/flink-jvm-metrics?orgId=1&var-datasource=Prometheus-flink&var-region=uswest2-devc&var-service=acorn&var-instance=main&from=now-24h&to=now
      Flink Cost: https://splunk.yelpcorp.com/en-US/app/yelp_computeinfra/paasta_service_utilization?form.service=acorn&form.field1.earliest=-30d%40d&form.field1.latest=now&form.instance=main&form.cluster=pnw-devc
    ==================================================================
    State: Running
    Pods: 5 running, 0 evicted, 0 other
    Jobs: 1 running, 0 finished, 0 failed, 0 cancelled
    3 taskmanagers, 8/30 slots available
    Jobs:
      Job Name                 State    Started
      heartbeat_s3_checkpoint  Running  2025-05-15 15:42:57 (11 hours ago)
    Pods:
      Pod Name                                       Host                                        Phase    Uptime
      acorn-7f797b79f6-jobmanager-5859456947-qv2ml   ip-10-81-17-237.us-west-2.compute.internal  Running  0d11h3m9s
      acorn-7f797b79f6-supervisor-p9l6n              ip-10-81-19-13.us-west-2.compute.internal   Running  0d17h11m23s
      acorn-7f797b79f6-taskmanager-5bf76bf45f-2js7q  ip-10-81-23-26.us-west-2.compute.internal   Running  0d2h54m15s
      acorn-7f797b79f6-taskmanager-5bf76bf45f-76x24  ip-10-81-23-26.us-west-2.compute.internal   Running  0d2h54m15s
      acorn-7f797b79f6-taskmanager-5bf76bf45f-rkh9w  ip-10-81-19-237.us-west-2.compute.internal  Running  0d7h42m18s
```
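The "Flink Log Commands" lines in the output above all follow one pattern: the plain instance for the service, and the instance plus an upper-case component suffix for each Flink component. A rough sketch of how such lines could be generated is below. The helper name `flink_log_commands` and its `since` parameter are made up for illustration and are not taken from the PR; only the command shape and the component names come from the output.

```python
from typing import Dict

# Component suffixes as they appear in the "Flink Log Commands" output.
FLINK_COMPONENTS = ("TASKMANAGER", "JOBMANAGER", "SUPERVISOR")


def flink_log_commands(
    service: str, instance: str, cluster: str, since: str = "1h"
) -> Dict[str, str]:
    """Hypothetical helper: map a display label to its `paasta logs` command."""
    base = f"paasta logs -a {since} -c {cluster} -s {service} -i {instance}"
    commands = {"Service": base}
    for component in FLINK_COMPONENTS:
        # "TASKMANAGER" -> label "Taskmanager", instance suffix ".TASKMANAGER"
        commands[component.capitalize()] = f"{base}.{component}"
    return commands


for label, command in flink_log_commands("acorn", "main", "pnw-devc").items():
    print(f"{label}: {command}")
```

Keeping the component list in one tuple means a new component only needs one extra entry.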
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39
To speed up gathering of details for a Flink cluster/job
When running paasta status, you are normally debugging something (something is broken)
You then may need to check related resources for the flink job such as yelpsoa/srv
Also, in a Slack thread this information can be useful to share, reducing the time to spot the issue
==========
paasta_tools/flink_tools.py:114: error: Function is missing a return type annotation
paasta_tools/flink_tools.py:125: error: TypedDict "FlinkDeploymentConfigDict" has no key 'spot'
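Both mypy errors above are usually fixed the same way: declare the key on the TypedDict and annotate the return type. The sketch below is a guess at the shape, not the code from this PR. The class bodies, the `"flink"` pool name and the default used when `spot` is missing are all assumptions; only `FlinkDeploymentConfigDict`, `get_pool`, `spot` and `flink-spot` appear in this thread.

```python
from typing import TypedDict


class FlinkDeploymentConfigDict(TypedDict, total=False):
    # Declaring the key addresses: TypedDict "..." has no key 'spot'
    spot: bool


class FlinkDeploymentConfig:
    """Hypothetical stand-in for the real config class."""

    def __init__(self, config_dict: FlinkDeploymentConfigDict) -> None:
        self.config_dict = config_dict

    # The explicit "-> str" addresses: Function is missing a return type annotation
    def get_pool(self) -> str:
        # Assumption: a missing `spot` key means the spot pool is used.
        spot = self.config_dict.get("spot", True)
        # Assumption: the non-spot pool is called "flink".
        return "flink-spot" if spot else "flink"


print(FlinkDeploymentConfig({"spot": False}).get_pool())
```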
Pull Request Overview
This PR enhances paasta status output for Flink clusters by adding extra resource links, log commands, and owner/runbook details. Key changes include:
- Updating tests to cover new Flink status details and configuration behaviors.
- Adding new utility methods in both paasta_tools/utils.py and paasta_tools/flink_tools.py.
- Updating the CLI status commands to include extra Flink resource links and formatted divider outputs.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_utils.py | Added tests for retrieving ecosystem and runbook details. |
| tests/test_flink_tools.py | Added tests for new Flink pool determination logic. |
| tests/cli/test_cmds_status.py | Updated tests to reflect new output format and resource links. |
| requirements-minimal.txt | Added dependency on the environment-tools package. |
| paasta_tools/utils.py | Added get_runbook and get_ecosystem_for_cluster methods. |
| paasta_tools/flink_tools.py | Added a new get_pool method that selects between two Flink pools. |
| paasta_tools/cli/cmds/status.py | Updated Flink status output logic to include extra resource links. |
Comments suppressed due to low confidence (1)
tests/cli/test_cmds_status.py:2470
- [nitpick] There is an inconsistency in the tests between using 'fake_cluster' and 'fake-cluster' for the cluster name. Consider standardizing the naming for clarity and consistency.
cluster="fake_cluster",
Co-authored-by: Copilot <[email protected]>
…DetailsToFlinkPaastaStatus' into u/nathanleigh/FLINK-5725/AddMoreDetailsToFlinkPaastaStatus
tests/test_utils.py
Outdated
@@ -591,6 +592,36 @@ def test_SystemPaastaConfig_get_cluster_fqdn_format():
    assert actual == expected


@patch("paasta_tools.utils.convert_location_type", autospec=True)
for newer tests, we tend to prefer using the context manager form of patching rather than the decorator form
Thanks, updated in e0388bb
tests/test_utils.py
Outdated
@@ -21,6 +21,7 @@
from typing import Any
from typing import Dict
from typing import List
from unittest.mock import patch
oh wow, we're behind on modernizing things - we're still using `mock` in this file/repo!
that said: can we do the "wrong" thing here and keep using `mock` so that we can swap this file to `unittest.mock` in one fell swoop later?
Sounds good, yes happy to modernise in another PR
tests/test_utils.py
Outdated
@patch("paasta_tools.utils.convert_location_type", autospec=True)
def test_SystemPaastaConfig_get_ecosystem_for_cluster(mock_convert_location_type):
    # Mock convert_location_type to return the expected ecosystem
    mock_convert_location_type.return_value = ["devc"]
iirc, you can also set `return_value` in the `patch()` call - but this is also fine as-is
Co-authored-by: Luis Pérez <[email protected]>
Address review comment
@patch("paasta_tools.utils.convert_location_type", autospec=True)
for newer tests, we tend to prefer using the context manager form of patching rather than the decorator form
Address review comment
```
FAILED tests/cli/test_cmds_status.py::TestPrintFlinkStatus::test_error_no_flink_config - AttributeError: 'NoneType' object has no attribute 'get_pool'
FAILED tests/cli/test_cmds_status.py::TestPrintFlinkStatus::test_error_no_flink_overview - AttributeError: 'NoneType' object has no attribute 'get_pool'
FAILED tests/cli/test_cmds_status.py::TestPrintFlinkStatus::test_successful_return_value - AttributeError: 'NoneType' object has no attribute 'get_pool'
```
Co-authored-by: Luis Pérez <[email protected]>
Address review comment
Address review comment
flink_instance_config should always be populated
Test failing with
```
> flink_pool = flink_instance_config.get_pool()
E AttributeError: 'NoneType' object has no attribute 'get_pool'
```
Fix by populating object
Remove duplication by moving object flink_instance_config to test fixture
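A rough sketch of the failure and the fix described in this commit, using stand-ins rather than the real test module: passing `None` as `flink_instance_config` reproduces the `AttributeError`, and a shared factory that returns a populated mock avoids both the error and the per-test duplication. `FlinkDeploymentConfig`, `print_flink_pool` and `make_flink_instance_config` are hypothetical names; in the real suite the factory would most likely be a pytest fixture.

```python
from unittest import mock


class FlinkDeploymentConfig:
    """Hypothetical stand-in for the real config class."""

    def get_pool(self) -> str:
        return "flink"


def print_flink_pool(flink_instance_config) -> str:
    # Mirrors the failing line from the traceback.
    flink_pool = flink_instance_config.get_pool()
    return f"Flink Pool: {flink_pool}"


def make_flink_instance_config(pool: str = "flink-spot"):
    # One shared place to build a populated config for every test.
    config = mock.create_autospec(FlinkDeploymentConfig, instance=True)
    config.get_pool.return_value = pool
    return config


# An unpopulated config reproduces the reported error.
try:
    print_flink_pool(None)
except AttributeError as exc:
    error = str(exc)

# A populated config makes the same call succeed.
line = print_flink_pool(make_flink_instance_config())
print(error)
print(line)
```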
May be easier to review first per commit
https://jira.yelpcorp.com/browse/FLINK-5725
HACKATHON-39
To speed up gathering of details for a Flink cluster/job, this PR adds extra Flink details (resource links, log commands, owner/runbook) to paasta status.
When running paasta status, you are normally debugging something (something is broken).
You then may need to check related resources for the Flink job, such as yelpsoa/srv configs.
Also, in a Slack thread this information can be useful to share, reducing the time to spot the issue.
Output Example
Prod
Full Output tests
May 29th output
paasta status -s acorn -i main -c pnw-devc -v
paasta status -s acorn -i main -c pnw-prod -v
paasta status -s acorn -i main_eks -c infrastage -v
paasta status -s acorn -i main_eks -c norcal-stageg -v
paasta status -s acorn -i main_eks -c norcal-stagef -v
https://fluffy.yelpcorp.com/i/TV5vKLJgjmHZpwqPcjFJ7frmCm9sf8mM.html