Add clean, deploy and port-forward scripts and values for Jaeger + OpenSearch + OTel Demo #7516
Conversation
…OTel Demo Signed-off-by: danish9039 <[email protected]>
```
## Automatic port-forward using scrpit
```
There's a small typo in the heading: `scrpit` should be `script` in "Automatic port-forward using script".
Suggested change:

```diff
- ## Automatic port-forward using scrpit
+ ## Automatic port-forward using script
```
examples/otel-demo/deploy-all.sh (Outdated)
```bash
  done
log "Cleanup complete"
```
The indentation of the `done` statement appears to be misaligned with its corresponding `for` loop. The current indentation suggests it's closing the inner loop, but based on the code structure, it should be aligned with the outer `for` loop. This could cause confusion during maintenance or potentially lead to unexpected behavior if modified later.
```bash
# Current:
for ns in jaeger otel-demo opensearch; do
  for i in {1..120}; do
    if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
      break
    fi
    sleep 2
  done
  done
log "Cleanup complete"
```

```bash
# Suggested:
for ns in jaeger otel-demo opensearch; do
  for i in {1..120}; do
    if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
      break
    fi
    sleep 2
  done
done
log "Cleanup complete"
```
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #7516   +/-   ##
=======================================
  Coverage   96.52%   96.52%
=======================================
  Files         385      385
  Lines       23316    23316
=======================================
  Hits        22506    22506
  Misses        623      623
  Partials      187      187
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Metrics Comparison Summary

Total changes across all snapshots: 53

Detailed changes per snapshot: summary_metrics_snapshot_cassandra — 📊 Metrics Diff Summary, Total Changes: 53

🆕 Added Metrics: new histogram bucket series for `http_server_request_body_size_bytes`, `http_server_request_duration_seconds`, and `http_server_response_body_size_bytes` from the `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp` instrumentation scope (GET requests answered with status 503 on `server_port="13133"`), one series per `le` bucket. (Full bucket-by-bucket diff samples truncated.)
minor fix Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com> Signed-off-by: hippie-danish <[email protected]>
```yaml
persistence:
  enabled: true
  size: "10Gi"
  storageClass: "oci-bv" # Using Oracle Cloud Block Volume storage class
```
Hardcoded cloud-specific storage class creates deployment failures. The storage class 'oci-bv' is specific to Oracle Cloud Infrastructure and will cause PVC creation failures on other Kubernetes platforms (AWS, GCP, Azure, minikube, etc.). Either remove this line to use the default storage class, or make it configurable via environment variable or parameter.
Suggested change:

```diff
- storageClass: "oci-bv" # Using Oracle Cloud Block Volume storage class
+ storageClass: "" # Set this to your cluster's storage class (e.g., "gp2" for AWS, "standard" for GCP, "oci-bv" for Oracle Cloud)
```
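One way to address this, as a sketch: pass the storage class at deploy time instead of hardcoding it. The `STORAGE_CLASS` variable is illustrative, and the `persistence.storageClass` path plus the empty-string fallback behavior are assumptions about the chart.

```bash
# Hypothetical wrapper: override the storage class per environment.
# An empty STORAGE_CLASS is assumed to leave the PVC on the cluster's
# default StorageClass.
STORAGE_CLASS="${STORAGE_CLASS:-}"
helm upgrade --install opensearch opensearch/opensearch \
  --namespace opensearch --create-namespace \
  --values examples/otel-demo/opensearch-values.yaml \
  --set persistence.storageClass="$STORAGE_CLASS"
```

Then `STORAGE_CLASS=oci-bv ./deploy-all.sh` would work on Oracle Cloud, while clusters with a sane default need no variable at all.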
```bash
while true; do
  sleep 10
done
```
Infinite loop without proper signal handling creates a resource leak. The `while true; do sleep 10; done` loop runs indefinitely and may not properly handle all termination signals, potentially leaving port-forward processes running as orphans. The trap on line 166 only handles the INT signal. Add proper cleanup for other termination signals (TERM, QUIT) and consider using `wait` instead of an infinite sleep to make the script more responsive to signals.
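A minimal sketch of the suggested pattern, with illustrative service names and ports (not taken from the PR):

```bash
#!/usr/bin/env bash
pids=()

cleanup() {
  echo "Stopping port-forwards..."
  for pid in "${pids[@]}"; do
    kill "$pid" 2>/dev/null || true
  done
  exit 0
}
# Clean up on Ctrl-C, kill, and quit -- not just INT
trap cleanup INT TERM QUIT

kubectl port-forward -n jaeger svc/jaeger-query 16686:16686 &
pids+=("$!")
kubectl port-forward -n opensearch svc/opensearch-dashboards 5601:5601 &
pids+=("$!")

# wait returns when a trapped signal arrives or all children exit,
# unlike an infinite sleep loop
wait
```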
Signed-off-by: danish9039 <[email protected]>
```bash
  for i in {1..120}; do
    if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
      break
    fi
    sleep 2
  done
done
```
Race condition in namespace deletion wait loop. The loop uses a fixed iteration count (120) with 2-second sleeps, but if namespace deletion takes longer than 240 seconds, the script continues without ensuring namespaces are fully deleted. This can cause subsequent deployments to fail with resource conflicts. Should add explicit timeout handling and error checking after the loop completes.
Suggested change:

```diff
- for i in {1..120}; do
-   if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
-     break
-   fi
-   sleep 2
- done
- done
+ # Wait up to 4 minutes for namespace deletion
+ timeout_seconds=240
+ start_time=$(date +%s)
+ while true; do
+   current_time=$(date +%s)
+   elapsed_time=$((current_time - start_time))
+   if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
+     echo "Namespace $ns successfully deleted."
+     break
+   fi
+   if [ $elapsed_time -ge $timeout_seconds ]; then
+     echo "ERROR: Timed out waiting for namespace $ns to be deleted after ${timeout_seconds} seconds."
+     echo "You may need to manually check and delete the namespace before retrying."
+     exit 1
+   fi
+   sleep 2
+ done
+ done
```
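Alternatively, kubectl can do the polling itself; a sketch (older kubectl versions error on `--for=delete` when the resource is already gone, hence the guard):

```bash
for ns in jaeger otel-demo opensearch; do
  # Skip namespaces that are already gone, then block until deletion
  # completes; kubectl wait exits non-zero if the timeout expires.
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    kubectl wait --for=delete "namespace/$ns" --timeout=240s || {
      echo "ERROR: namespace $ns was not deleted within 240s" >&2
      exit 1
    }
  fi
done
```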
echo "Deleting namespaces..." | ||
for ns in otel-demo jaeger opensearch; do | ||
if kubectl get namespace "$ns" > /dev/null 2>&1; then | ||
kubectl delete namespace "$ns" --force --grace-period=0 2>/dev/null || true |
Dangerous use of --force --grace-period=0 flags for namespace deletion. This forces immediate termination without allowing pods to shut down gracefully, which can lead to data corruption, incomplete cleanup of resources, and potential resource leaks. The --force flag should be removed to allow proper graceful shutdown, or used only as a last resort with additional safety checks.
Suggested change:

```diff
- kubectl delete namespace "$ns" --force --grace-period=0 2>/dev/null || true
+ kubectl delete namespace "$ns" 2>/dev/null || true
```
```bash
# Stop any existing port forwards first
echo " Stopping any existing port-forward processes..."
pkill -f "kubectl port-forward" 2>/dev/null || true
```
Overly broad process killing with 'pkill -f kubectl port-forward' will terminate ALL kubectl port-forward processes on the system, not just those started by this script. This can disrupt other users' port-forwards or other applications. Should track PIDs of started processes and kill only those specific processes, or use more specific process identification.
Suggested change:

```diff
- pkill -f "kubectl port-forward" 2>/dev/null || true
+ # Kill any port-forward processes started by this script
+ if [ -f ".port-forward-pids" ]; then
+   while read -r pid; do
+     kill "$pid" 2>/dev/null || true
+   done < ".port-forward-pids"
+   rm ".port-forward-pids"
+ fi
```
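For this suggestion to work, the script also needs to record each PID as it starts a forward — a sketch (the `.port-forward-pids` file name follows the suggestion above; service names are illustrative):

```bash
# Start a port-forward in the background and record its PID, so that
# cleanup touches only processes owned by this script.
start_forward() {
  kubectl port-forward "$@" &
  echo "$!" >> .port-forward-pids
}

start_forward -n jaeger svc/jaeger-query 16686:16686
start_forward -n opensearch svc/opensearch-dashboards 5601:5601
```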
Signed-off-by: danish9039 <[email protected]>
```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: http://jaeger-collector.jaeger.svc.cluster.local:4317
- name: OTEL_EXPORTER_OTLP_PROTOCOL
  value: grpc
```
For gRPC connections, the OTLP endpoint for the checkout service should not include the `http://` prefix. The correct format for gRPC endpoints is just the host and port without a protocol prefix. Please update to:

```yaml
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: jaeger-collector.jaeger.svc.cluster.local:4317
```

This ensures proper gRPC communication between the checkout service and the Jaeger collector.
Suggested change:

```diff
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
-   value: http://jaeger-collector.jaeger.svc.cluster.local:4317
+   value: jaeger-collector.jaeger.svc.cluster.local:4317
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
```
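A quick way to check what a pod actually receives after the change — a sketch; the deployment name `checkout` is an assumption about the demo chart:

```bash
# Print the OTLP-related environment of a running pod in the deployment
kubectl -n otel-demo exec deploy/checkout -- env | grep OTEL_EXPORTER_OTLP
```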
```yaml
- name: OTEL_TRACES_EXPORTER
  value: otlp

# Component-specific configurations using correct schema names
```
Why do we need to redefine all of this? It's going to be very fragile - if upstream demo changes our settings will become stale.
@yurishkuro The OTEL demo comes with a built-in OTEL Collector that receives telemetry data from each service and provides a connection point to access traces, metrics, and logs for other services.
Architecture of the OTEL demo out-of-the-box: https://opentelemetry.io/docs/demo/architecture/
In our earlier discussions about the architecture of the otel-demo-jaeger-opensearch stack (#7326 (comment)), we decided to drop the built-in OTEL Collector and instead use Jaeger v2, which also serves the purpose of a collector. Because of this, we had to wire each service in the OTEL demo manually to send traces directly to our Jaeger instance.
Had we chosen to keep the built-in OTEL Collector, we would only have needed to configure the collector with trace endpoints to forward traces to our Jaeger instance.
https://opentelemetry.io/docs/demo/kubernetes-deployment/#bring-your-own-backend
If having the collector results in less configuration I would keep it. Does it also provide some enrichment of telemetry based on k8s information?
@yurishkuro Sure, we can use the OTEL Collector. Regarding enrichment of telemetry: by default, the OTEL Collector in the demo chart provides basic resource attributes (docker, environment, etc.) using the resourcedetection processor (resourcedetection example in otel-demo config). We could also use something like the k8sattributes processor (k8sattributes processor documentation).
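For reference, a minimal sketch of what wiring in the k8sattributes processor might look like in the collector config (pipeline names and extracted fields are illustrative, not from this PR):

```yaml
processors:
  k8sattributes:
    extract:
      metadata:
        # Attach pod-level Kubernetes context to every span
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```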
Signed-off-by: danish9039 <[email protected]>
Signed-off-by: danish9039 <[email protected]>
```yaml
extensions:
  healthcheckv2:
    use_v2: true
```
`healthcheckv2` needs to be told to use v2?
```yaml
# official schema for otel demo is at https://raw.githubusercontent.com/open-telemetry/opentelemetry-helm-charts/main/charts/opentelemetry-demo/values.yaml

# Keep bundled infra disabled (we deploy Jaeger/OpenSearch separately)
```
can you elaborate why we want to deploy other services separately? I can understand Jaeger, but what are we doing differently about others?
> can you elaborate why we want to deploy other services separately? I can understand Jaeger, but what are we doing differently about others?
@yurishkuro There are several reasons that affect the intended functionality.

In our demo, we deploy OpenSearch and OpenSearch Dashboards separately alongside Jaeger. The default OpenTelemetry demo does not include the dashboard as a dependency.

Another reason for deploying OpenSearch separately is the limited customization available in the default OpenTelemetry demo Helm chart; we are constrained by the small number of parameters exposed for OpenSearch configuration. You can see the allowed parameters here: opentelemetry-demo values.yaml

While we could technically override Helm values for OpenSearch, this approach would quickly become complex and fragile, especially when dealing with nested configurations.

Additionally, namespace isolation is another key reason for deploying OpenSearch separately. The default OpenTelemetry demo deploys all components into the default namespace and wires them together accordingly (default architecture reference). In our setup, we use three separate namespaces to enable better isolation using NetworkPolicies, particularly for the OpenSearch namespace, which requires stronger security measures since the OpenSearch Dashboard will be exposed to the internet, as you suggested earlier.
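As an illustration of the isolation this enables, a hypothetical NetworkPolicy that only admits traffic into the opensearch namespace from the jaeger namespace (names and labels are assumptions, not from this PR):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: opensearch-allow-jaeger-only
  namespace: opensearch
spec:
  # Apply to every pod in the opensearch namespace
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on namespaces
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: jaeger
```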
Please move this explanation into comments in the config file
```yaml
opensearch:
  enabled: false

# Preserve default.env (to keep OTEL_SERVICE_NAME and OTEL_COLLECTOR_NAME) and override only what we need
```
from this point down, why do we need all these settings? doesn't OTEL Demo already provide them?
> from this point down, why do we need all these settings? doesn't OTEL Demo already provide them?
All these settings under the components section need to be added because we encountered the same issue we resolved earlier, related to the new CRI-O container runtime introduced with the updated v1.34 nodes (#7553 (comment)). Many images in the default OpenTelemetry demo chart (mostly init container images) do not use fully qualified image names, which causes pull failures under CRI-O. To fix this, I need to explicitly override those image references in the configuration.
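To make the point concrete, a sketch of the kind of override being described; the component name, init container, and command are assumptions about the chart schema, not copied from the PR:

```yaml
components:
  checkout:
    initContainers:
      - name: wait-for-kafka
        # Fully qualified reference, so CRI-O does not have to guess the registry
        image: docker.io/library/busybox:1.37
        command: ["sh", "-c", "until nc -z kafka 9092; do sleep 2; done"]
```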
Can we put a PR to otel demo to use qualified image names? Would we still need these settings after that?
```bash
fi
}

echo "Starting Port Forwarding for OpenSearch Observability Stack"
```
doesn't OTEL Demo already do these forwards?
No, the OTEL Demo doesn’t handle port forwards automatically. This script is meant specifically for local development, to allow us to test changes in the setup without having to configure an Ingress each time.
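So the local workflow would look roughly like this (script name and ports are illustrative):

```bash
# Deploy the stack, then open local tunnels to the UIs
./examples/otel-demo/deploy-all.sh
./examples/otel-demo/port-forward.sh
# Jaeger UI:              http://localhost:16686
# OpenSearch Dashboards:  http://localhost:5601
```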
Which problem is this PR solving?
part of mentorship work
Description of the changes
Summary
This PR introduces a complete, repeatable deployment for the OpenTelemetry Demo with Jaeger, the HotROD app, and OpenSearch (including Dashboards) under examples/otel-demo. It provides a single entrypoint script that supports both upgrade and clean-install modes, plus the necessary Helm values and a ClusterIP service for Jaeger Query.
Deployment script
• Added deploy-all.sh with modes: upgrade (default) and clean (uninstall + fresh install)
◦ Pre-flight checks for required CLIs (bash, git, curl, kubectl, helm) and cluster availability
◦ Validates presence of required values files
◦ Clones jaegertracing/helm-charts v2 and builds dependencies
◦ Deploys in order: OpenSearch -> OpenSearch Dashboards -> Jaeger (all-in-one) -> OTel Demo
◦ Waits for StatefulSets/Deployments to be ready with rollout status and timeouts
◦ Creates a dedicated ClusterIP service for Jaeger Query
• Added values files:
◦ opensearch-values.yaml
◦ opensearch-dashboard-values.yaml
◦ jaeger-values.yaml
◦ jaeger-config.yaml (userconfig for Jaeger)
◦ otel-demo-values.yaml
• Added Jaeger Query service:
◦ jaeger-query-service.yaml (ClusterIP service in jaeger namespace)
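Typical usage, as a sketch (assuming the mode is passed as the first argument, per the modes listed above):

```bash
# Default mode: upgrade existing releases in place
./examples/otel-demo/deploy-all.sh

# Clean mode: uninstall everything, then perform a fresh install
./examples/otel-demo/deploy-all.sh clean
```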
How was this change tested?
Checklist
- `jaeger`: `make lint test`
- `jaeger-ui`: `npm run lint` and `npm run test`