Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

* refactor(sdk): added option for custom metric collector for tune in… #2406

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

prakhar479
Copy link

@prakhar479 prakhar479 commented Aug 9, 2024

… katlib_client.py

Signed-off-by: prakhar479 [email protected]

added custom_collector field to metrics_collector_config in tune to allow for users to specify custom metrics collector for example prometheus

fixes #2402

Copy link

google-cla bot commented Aug 9, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@prakhar479
Copy link
Author

@Electronic-Waste can you review and let me know if any changes are required in this. Thanks a lot!

@Electronic-Waste
Copy link
Member

@prakhar479 Yes, of course. Thanks for your great effort to Katib!

I'll look into this PR in the next few days.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@andreyvelich
Copy link
Member

/rerun-all

@google-oss-prow google-oss-prow bot added size/S and removed size/XS labels Aug 13, 2024
@prakhar479
Copy link
Author

I have corrected some oversights from my side and need approval for testing. Thanks!

@Electronic-Waste
Copy link
Member

/rerun-all

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @prakhar479 , I left some comments for you!

Comment on lines 254 to 255
for using custom metric collectors use "custom_collector" key,
for example, `metrics_collector_config = {"custom_collector": "prometheus"}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess prometheus is not a custom collector, since we define it here:

PrometheusMetricCollector CollectorKind = "PrometheusMetric"
DefaultPrometheusPath string = "/metrics"
DefaultPrometheusPort int = 8080
CustomCollector CollectorKind = "Custom"

Here are some relevant resources which may be helpful for you:
https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/custom-metrics-collector.yaml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, custom_collector accepts V1Container as its input, not a str:

class V1beta1CollectorSpec(object):
"""NOTE: This class is auto generated by OpenAPI Generator.
Ref: https://openapi-generator.tech
Do not edit the class manually.
"""
"""
Attributes:
openapi_types (dict): The key is attribute name
and the value is attribute type.
attribute_map (dict): The key is attribute name
and the value is json key in definition.
"""
openapi_types = {
'custom_collector': 'V1Container',
'kind': 'str'
}
attribute_map = {
'custom_collector': 'customCollector',
'kind': 'kind'
}

Can you add some comments to remind users of this usage?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you add some e2e tests like this (an example for StdOut Collector)?

def run_e2e_experiment_create_by_tune(

Copy link
Author

@prakhar479 prakhar479 Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Electronic-Waste, I have made the appropriate changes in the function comments to reflect the correct usage

for e2e tests, should I modify the existing run-e2e-tune-api.py test to use metrics_collector_config or should I create a separate test script for it altogether

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prakhar479 I think modifying the existing run-e2e-tune-api.py test is better since it's hard to pass the collector config to the test script: https://github.com/kubeflow/katib/blob/master/test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh. WDYT👀 @andreyvelich @tenzen-y @johnugeorge

Also a small tip for reference: you can build images you need here

# Testing image for tune function
if "$TUNE_API"; then
echo -e "\nPulling and building testing image for tune function..."
_build_containers "suggestion-hyperopt" "$CMD_PREFIX/suggestion/hyperopt/$VERSION/Dockerfile"
fi

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Electronic-Waste for the helpful insights 😄

@Electronic-Waste
Copy link
Member

PTAL👀 @andreyvelich @tenzen-y @johnugeorge when you have time.

@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Aug 17, 2024
@prakhar479
Copy link
Author

prakhar479 commented Aug 17, 2024

I have modified the comment on usage of the custom metric param as well e2e test for tune Api.
For e2e test I have currently modified build-load.sh as suggested by @Electronic-Waste to build an image of custom metric using dummy-collector.py script and Dockerfile.dummy-collector file for building image for dummy collector container. Finally, I have modified run-e2e-tune-api.py adding the custom collector image as a V1 Container passed as a param to tune Api.

I was a bit confused with placement for these new files and have placed all of them in gh-action directory. Let me know for any modifications, changes and fixes I need to make further.

… dummy-collector image

Signed-off-by: Prakhar Singhal <[email protected]>
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prakhar479 Sorry for my late response. I'm busy with other affairs these two weeks.

I left a few comments for you. Thanks for your greate contribution!

@@ -251,8 +254,8 @@ def tune(
pip_index_url: The PyPI url from which to install Python packages.
metrics_collector_config: Specify the config of metrics collector,
for example, `metrics_collector_config = {"kind": "Push"}`.
Currently, we only support `StdOut` and `Push` metrics collector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need to tell users about the supported types of MC. So can you re-add this line?

Comment on lines 257 to 258
for using custom metric collectors use "custom_collector" key and provide instance of custom V1Container as value,
for example, `metrics_collector_config = {"kind" : "Custom", "custom_collector": <Instance of V1Container>}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can reorganize these comments and explain each field in metrics_collector_config? Like

kind: specify the kind of Metrics Collector (currently we support...)
custom_collector: ...

Comment on lines 39 to 50
# [3] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
# [3] Create a dummy metric collector (DOES NOT HAVE A IMAGE)
metric_collector = V1Container(
name="dummy-collector",
image="dummy-collector:latest",
command=["python", "/app/dummy-collector.py"],
args=["--metric-name=result"],
env=[
client.V1EnvVar(name="EXPERIMENT_NAME", value=exp_name),
client.V1EnvVar(name="EXPERIMENT_NAMESPACE", value=exp_namespace)
]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can create another function run_e2e_experiment_create_by_tune_custom to run e2e tests for custom collector, rather than delete original e2e test for StdOut collector. Then we can run these e2e tests together in this file:)

WDYT👀 @prakhar479 @andreyvelich @tenzen-y @johnugeorge

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I think its a good idea .Will make this change in a few days

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made corresponding changes to run both e2e tests (one with custom metrics collector and another with default metric collector [StdOut])

@Electronic-Waste
Copy link
Member

/rerun-all

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Aug 27, 2024
@prakhar479
Copy link
Author

I have made the neccesary changes that should also solve failing tests. Let me know about any further changes/suggestions @Electronic-Waste @andreyvelich @tenzen-y @johnugeorge.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @prakhar479 . I left some comments for you.

@@ -35,7 +96,7 @@ def objective(parameters):
"b": search.double(min=0.1, max=0.2)
}

# [3] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
# [4] Create Katib Experiment with 4 Trials and 2 CPUs per Trial.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why its numer is 4? I guess 3 is more suitable.

Comment on lines 145 to 154
try:
run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
exp_name = "tune-example-default-metrics-collector"
logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
run_e2e_experiment_create_by_tune_default_metrics_collector(katib_client, exp_name, exp_namespace)
logging.info("---------------------------------------------------------------")
logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")

exp_name = "tune-example-custom-metrics-collector"
logging.info(f"Runnning E2E for Experiment created by tune: {exp_namespace}/{exp_name}")
run_e2e_experiment_create_by_tune_custom_metrics_collector(katib_client, exp_name, exp_namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe splitting these two e2e tests into separate try-catch clauses is better to identify error :)

@@ -93,4 +162,4 @@ def objective(parameters):
# Delete the Experiment.
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
katib_client.delete_experiment(exp_name, exp_namespace)
katib_client.delete_experiment(exp_name, exp_namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to delete the former experiment before we run another experiment. Otherwise, we may run into xxx experiment alreay exists error.

@prakhar479
Copy link
Author

@Electronic-Waste I have fixed these issues lmk if anything else is needed. thanks!

@Electronic-Waste
Copy link
Member

/rerun-all

@Electronic-Waste
Copy link
Member

@prakhar479 Can you please fix the lint error and the error in tune API?

@Electronic-Waste
Copy link
Member

/rerun-all

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @prakhar479!
I left a few comments.
cc @kubeflow/wg-automl-leads

`metrics_collector_config`: Specify the configuration for the metrics collector with following keys:
- **kind**: Specify the kind of Metrics Collector. Currently supported values are:
- `StdOut`: Collects metrics from standard output.
- `None`: No metrics collection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not supported.

Copy link
Author

@prakhar479 prakhar479 Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. Can you lead me to where I can find all supported metric collector. For the current comment I had referenced https://github.com/kubeflow/katib/blob/master/pkg/ui/v1beta1/frontend/src/app/models/experiment.k8s.model.ts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, UI also needs to be updated with the latest changes cc @Electronic-Waste
Please ref the official CRDs APIs for Metrics Collector spec: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/common/v1beta1/common_types.go#L207-L227

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Andrey, I'll update UI with the latest changes.

Comment on lines 269 to 270
- **custom_collector**: If the `kind` is set to `Custom`, you must provide an instance of a custom `V1Container` as the value. For example:
`metrics_collector_config = {"kind" : "Custom", "custom_collector": <Instance of V1Container>}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add support for custom_collectors in the followup PRs when we get user requests ?
I feel that Data Scientists who will use tune API doesn't need such functionality.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but can you elaborate on this a bit I am not quite sure what this intends since custom metric is already supported by underlying framework for tune function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that until we find use-cases when tune function needs to be used with Custom metrics collector, we can introduce it.
For example, I can see the value for File metrics collector, since users can write metrics into specific File during their training script, similar as for TensorFlow events.
However, for custom metrics collector, I can't see a use-cases when it might be useful with tune function.

Any thoughts @shannonbradshaw @prakhar479 @johnugeorge @tenzen-y ?


RUN pip install kubernetes

CMD ["python", "dummy-collector.py"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this container ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This container is meant for serving as a dummy metric collector container intended for e2e test.

@@ -11,8 +12,68 @@
# The default logging config.
logging.basicConfig(level=logging.INFO)

def run_e2e_experiment_create_by_tune_custom_metrics_collector(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need another E2E just to test this functionality.
We already run E2Es using these YAML files for various metrics collectors: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/metrics-collector
I think for your feature, you just need to add unit tests for Katib Client to verify that Experiment has correct specification: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/katib_client_test.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! should I then remove the e2e test with custom metric collector from run-e2e-tune-api.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Prakhar Singhal <[email protected]>
Signed-off-by: Prakhar Singhal <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SDK] Support More MCKind in tune
3 participants