
Conversation

@tsloughter (Member)

Not ready yet because this still uses local changes in the deps, but I wanted to open it up for review while those changes are being merged and released.

@tsloughter (Member Author)

Forgot I needed to make meter a per-metric option. Though that isn't actually going to work, since the main purpose of this is supporting things like Phoenix, which isn't going to add a meter to the reporter opts of its telemetry events...

So I may just add a warning in the docs that the Scope is going to be wrong for all your Telemetry.Metrics.

@josevalim left a comment

😍

@AndrewDryga

Can I help you guys somehow? We have to submit metrics to Google Metrics for alerting, and since our metrics are not mission critical, we can try to pilot this PR in production.

@tsloughter (Member Author)

@AndrewDryga yes, definitely! Are you on Slack? I hit a bug in metrics that has kept me from having time to spend on docs, but I could help you get going on Slack.

@tsloughter (Member Author)

WHOOPS. I started a job shortly after this and forgot about this PR, wow. Trying to resurrect it.

@tsloughter marked this pull request as ready for review on September 30, 2024, 09:42
@tsloughter (Member Author)

Marked as ready for review, though I don't remember if it really is. But based on the tests it has, I guess it works, hehe.

@bryannaegele (Contributor) left a comment

We need to update the deps, but let's get this in the hands of folks.

@ethangunderson

I'm very interested in getting this merged. How can I help?

@tsloughter (Member Author)

@ethangunderson have you tried it, by chance? That'd make it easier to feel confident about merging.

@ethangunderson

@tsloughter I haven't yet, but I have the perfect app for it. I'll make time for it this week and report back.

@ethangunderson

Alright, a bit late, but I tried this out yesterday.

I have an existing application that uses the OTLP collectors from a Datadog agent. Traces work correctly, but I can't get metrics working. If I swap out the reader for otel_metric_exporter_pid, I do see the expected metrics in my console, so this is probably a DD config issue on my part. That being said, I'm not sure how to troubleshoot that any further. Do you have a metrics application that you use locally for verification of this stuff?

If the application does not have opentelemetry_experimental configured, the Telemetry handler is detached with the following error:

15:25:07.397 [error] Handler {OtelTelemetryMetrics, [:telemetry, :event], #PID<0.935.0>} has failed and has been detached. Class=:error
Reason=:undef
Stacktrace=[
  {:otel_meter_noop, :record,
   [
     %{
       {:otel_tracer, :span_ctx} => {:span_ctx,
        260427163911476008642819364899241647806, 15918744878643449198, 1,
        {:tracestate, []}, true, false, true,
        {:otel_span_ets,
         #Function<2.20827891/1 in :otel_tracer_server.on_end/1>}}
     },

@tsloughter (Member Author)

@ethangunderson are you sending the metrics to the collector, which then sends to Datadog? Can you verify whether they are getting to the cluster by adding the debug exporter to the collector metrics pipeline?

@zkayser commented Dec 26, 2024

@tsloughter I work with @ethangunderson and just got a little test using this working. We're actually sending metrics directly to a Datadog Agent over gRPC on port 4317 and got it working fine; just a few things tripped me up along the way. I'll describe the three main things I ran into (I hacked a couple of them into working just to prove this out as a PoC).

First, here's the config we're using, for reference:

config :opentelemetry_experimental,
  readers: [
    %{
      module: :otel_metric_reader,
      config: %{
        exporter:
          {:otel_exporter_metrics_otlp,
           %{
             endpoint: System.get_env("OTLP_ENDPOINT"),
             protocol: :grpc,
             compression: :gzip
           }}
      }
    }
  ]

We set up a little test with a single Telemetry.Metrics.counter/2 metric:

defmodule Vantage.Otel.Metrics do
  @moduledoc """
  Experimental usage of OtelTelemetryMetrics for a PoC
  """
  use Supervisor

  import Telemetry.Metrics

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl Supervisor
  def init(opts) do
    children = [
      {OtelTelemetryMetrics, metrics: Keyword.get(opts, :metrics, metrics())}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      counter("vantage.test.metric.count",
        # unit: :count, ====> Needed to set this explicitly to avoid an error in `otel_otlp_metrics.erl`
        event_name: [:vantage, :test, :metric],
        description: "just a test",
        measurement: 1
      )
    ]
  end
end
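
For reference, the event that drives this counter can be fired with something like the following (event name taken from the definition above; the measurement map is illustrative, since a counter only counts occurrences):

:telemetry.execute([:vantage, :test, :metric], %{count: 1}, %{})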

1) ArgumentError on :unicode.characters_to_binary/1 without explicitly passing a value to :unit on Telemetry.Metrics opts

Without explicitly setting a value for :unit in the Telemetry.Metrics options, we hit this exception:

10:17:01.357 [error] GenServer #PID<0.428.0> terminating
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not valid character data (an iodata term)

    (stdlib 6.1.2) unicode.erl:1219: :unicode.characters_to_binary(1)
    (opentelemetry_experimental 0.5.1) /Users/zkayser/dev/carburetor/deps/opentelemetry_experimental/apps/opentelemetry_experimental/src/otel_otlp_metrics.erl:72: :otel_otlp_metrics.to_proto/1

We were able to get around this by passing a value under the :unit option on the Telemetry.Metrics definition.
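
Concretely, the workaround is just the :unit option in the metric definition, something like this (a sketch based on the counter above; :count mirrors the commented-out line in the module):

counter("vantage.test.metric.count",
  unit: :count,
  event_name: [:vantage, :test, :metric],
  description: "just a test",
  measurement: 1
)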

2) Undefined instrumentation scope on default meter

This may be user error on my part, or we should have been using our own #meter{}, but with the default meter we ran into this exception:

10:29:09.651 [error] GenServer #PID<0.493.0> terminating
** (stop) {:badrecord, :undefined}
    (opentelemetry_experimental 0.5.1) /Users/zkayser/dev/carburetor/deps/opentelemetry_experimental/apps/opentelemetry_experimental/src/otel_otlp_metrics.erl:51: anonymous fn/3 in :otel_otlp_metrics.to_proto_by_scope/1

I hacked around this one a bit. It may not be an issue if there is something similar to create_application_tracers/1 but for meters instead. My hack was to add instrumentation_scope = opentelemetry:instrumentation_scope(<<>>, <<>>, <<>>) to the default meter in otel_meter_server.erl.

3) Undefined description on metric caused an error in the generated metric service protobuf module

The last error I ran into was that, without a #metric record having its description field set, it failed in opentelemetry_exporter_metrics_service_pb.erl due to a missing function clause error:

10:35:37.249 [error] GenServer #PID<0.468.0> terminating
** (FunctionClauseError) no function clause matching in :opentelemetry_exporter_metrics_service_pb.is_empty_string/1
    (opentelemetry_exporter 1.8.0) /Users/zkayser/dev/carburetor/deps/opentelemetry_exporter/src/opentelemetry_exporter_metrics_service_pb.erl:2171: :opentelemetry_exporter_metrics_service_pb.is_empty_string(:undefined)

There is not much in this feedback that pertains specifically to this PR, other than maybe getting tripped up on the :unit value. I admit that we are just now getting started with OTel Metrics (we've been reporting metrics over StatsD for years now) and that there may indeed be some user error on our part.

If there is anything at all I can do to help make this feedback more actionable, or if there are any issues I can work on, I would be happy to help out in whatever way we can!

@cheerfulstoic

I was able to get this working, though I had to make some changes 😅

Firstly, I'm currently using the experimental code from GitHub because not everything is released to Hex:

      {:opentelemetry_experimental,
       git: "https://github.com/cheerfulstoic/opentelemetry-erlang.git",
       sparse: "apps/opentelemetry_experimental",
       override: true},
      {:opentelemetry_api_experimental,
       git: "https://github.com/cheerfulstoic/opentelemetry-erlang.git",
       sparse: "apps/opentelemetry_api_experimental",
       override: true},

But I also needed to get the meter differently from how the code in this PR works. Instead of calling :opentelemetry_experimental.get_meter(), I had to use :opentelemetry_experimental.get_meter(:opentelemetry.get_application_scope(__MODULE__)), which I got from looking at the counter.ex implementation. Here are my changes:

defmodule OtelTelemetryMetrics do
  @moduledoc """
  BASED ON THIS PR:
  https://github.com/open-telemetry/opentelemetry-erlang-contrib/pull/303

  If we can get this generally working, it would be nice to contribute the changes back.

  `OtelTelemetryMetrics.start_link/1` creates OpenTelemetry Instruments for
  `Telemetry.Metric` metrics and records to them when their corresponding
  events are triggered.

      metrics = [
        last_value("vm.memory.binary", unit: :byte),
        counter("vm.memory.total"),
        counter("db.query.duration", tags: [:table, :operation]),
        summary("http.request.response_time",
          tag_values: fn
            %{foo: :bar} -> %{bar: :baz}
          end,
          tags: [:bar],
          drop: fn metadata ->
            metadata[:boom] == :pow
          end
        ),
        sum("telemetry.event_size.metadata",
          measurement: &__MODULE__.metadata_measurement/2
        ),
        distribution("phoenix.endpoint.stop.duration",
          measurement: &__MODULE__.measurement/1
        )
      ]

      {:ok, _} = OtelTelemetryMetrics.start_link([metrics: metrics])

  Then either in your Application code or a dependency, execute `telemetry`
  events containing the measurements. For example, an event that will result
  in the metrics `vm.memory.total` and `vm.memory.binary` being recorded to:

      :telemetry.execute([:vm, :memory], %{binary: 100, total: 200}, %{})

  OpenTelemetry does not support a `summary` type metric, so the `summary`
  `http.request.response_time` is recorded as a single-bucket histogram.

  In `Telemetry.Metrics` the `counter` type refers to counting the number of
  times an event is triggered. This is represented as a `sum` in OpenTelemetry,
  and when recording, the value is sent as `1` every time.

  Metrics of type `last_value` are ignored because `last_value` is not yet an
  aggregation supported on synchronous instruments in Erlang/Elixir
  OpenTelemetry. When it is added to the SDK this library will be updated to
  no longer ignore metrics of this type.
  """

  require Logger
  use GenServer

  @doc """
  """
  def start_link(options) do
    GenServer.start_link(__MODULE__, options, name: __MODULE__)
  end

  @impl true
  def init(options) do
    Process.flag(:trap_exit, true)

    meter = options[:meter] || get_meter()

    metrics = options[:metrics] || []

    handler_ids = create_instruments_and_attach(meter, metrics)

    {:ok, %{handler_ids: handler_ids}}
  end

  @impl true
  def terminate(_, %{handler_ids: handler_ids}) do
    Enum.each(handler_ids, fn id -> :telemetry.detach(id) end)
  end

  defp create_instruments_and_attach(meter, metrics) do
    metrics
    |> Enum.group_by(& &1.event_name)
    |> Enum.map(fn {event_name, metrics} ->
      metrics_by_measurement = Enum.group_by(metrics, &List.last(&1.name))

      for metric <- metrics do
        create_instrument(metric, meter, %{
          unit: unit(metric.unit),
          description: format_description(metric)
        })
      end

      handler_id = {__MODULE__, event_name, self()}

      :ok =
        :telemetry.attach(
          handler_id,
          event_name,
          &__MODULE__.handle_event/4,
          %{metrics_by_measurement: metrics_by_measurement}
        )

      handler_id
    end)
  end

  defp create_instrument(%Telemetry.Metrics.Counter{} = metric, meter, opts) do
    :otel_counter.create(meter, format_name(metric), opts)
  end

  # a summary is represented as an explicit histogram with a single bucket
  defp create_instrument(%Telemetry.Metrics.Summary{} = metric, meter, opts) do
    :otel_histogram.create(
      meter,
      format_name(metric),
      Map.put(opts, :advisory_params, %{explicit_bucket_boundaries: []})
    )
  end

  defp create_instrument(%Telemetry.Metrics.Distribution{} = metric, meter, opts) do
    :otel_histogram.create(meter, format_name(metric), opts)
  end

  defp create_instrument(%Telemetry.Metrics.Sum{} = metric, meter, opts) do
    :otel_counter.create(meter, format_name(metric), opts)
  end

  # waiting on last_value aggregation support for synchronous instruments in the SDK
  defp create_instrument(%Telemetry.Metrics.LastValue{} = metric, _meter, _) do
    Logger.info(
      "Ignoring metric #{inspect(metric.name)} because LastValue aggregation is not supported in this version of OpenTelemetry Elixir"
    )

    nil
  end

  defp unit(:unit), do: "1"
  defp unit(unit), do: "#{unit}"

  defp format_description(metric) do
    metric.description || "#{format_name(metric)}"
  end

  defp format_name(metric) do
    metric.name
    |> Enum.join(".")
    |> String.to_atom()
  end

  def handle_event(event_name, measurements, metadata, %{
        metrics_by_measurement: metrics_by_measurement
      }) do
    for {measurement, metrics} <- metrics_by_measurement,
        metric <- metrics do
      if value = keep?(metric, metadata) && extract_measurement(metric, measurements, metadata) do
        ctx = OpenTelemetry.Ctx.get_current()
        tags = extract_tags(metric, metadata)

        meter = get_meter()

        name =
          (event_name ++ [measurement])
          |> Enum.map_join(".", &to_string/1)
          |> String.to_atom()

        :ok = :otel_meter.record(ctx, meter, name, value, tags)
      end
    end
  end

  defp get_meter do
    :opentelemetry_experimental.get_meter(:opentelemetry.get_application_scope(__MODULE__))
  end

  defp keep?(%{keep: nil}, _metadata), do: true
  defp keep?(%{keep: keep}, metadata), do: keep.(metadata)

  defp extract_measurement(%Telemetry.Metrics.Counter{}, _measurements, _metadata) do
    1
  end

  defp extract_measurement(metric, measurements, metadata) do
    case metric.measurement do
      nil ->
        nil

      fun when is_function(fun, 1) ->
        fun.(measurements)

      fun when is_function(fun, 2) ->
        fun.(measurements, metadata)

      key ->
        measurements[key] || 1
    end
  end

  defp extract_tags(metric, metadata) do
    tag_values = metric.tag_values.(metadata)
    Map.take(tag_values, metric.tags)
  end
end
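
Since the module does `use GenServer`, it gets a default child_spec/1, so it can also just be dropped into a supervision tree rather than started manually, e.g. (a sketch; `metrics` is the list from the moduledoc example):

children = [
  {OtelTelemetryMetrics, metrics: metrics}
]

Supervisor.start_link(children, strategy: :one_for_one)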

@cheerfulstoic

Also important, here is my configuration! It took me quite a while to figure it out, so hopefully it's useful!

The "cumulative" settings under default_temporality_mapping are there because it seems like prometheus needs them. So anybody following along may not want those.

  config :opentelemetry,
    span_processor: :batch,
    traces_exporter: :otlp

  config :opentelemetry_exporter,
    otlp_protocol: :http_protobuf,
    otlp_endpoint: opentelemetry_endpoint

  config :opentelemetry_experimental,
    readers: [
      %{
        module: :otel_metric_reader,
        config: %{
          export_interval_ms: 1_000,
          exporter: {
            :otel_exporter_metrics_otlp,
            %{endpoints: [opentelemetry_endpoint], protocol: :http_protobuf, ssl_options: []}
          },
          # For Prometheus
          default_temporality_mapping: %{
            counter: :temporality_cumulative,
            observable_counter: :temporality_cumulative,
            updown_counter: :temporality_cumulative,
            observable_updowncounter: :temporality_cumulative,
            histogram: :temporality_cumulative,
            observable_gauge: :temporality_cumulative
          }
        }
      }
    ]
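
(For completeness: opentelemetry_endpoint in the snippet above is bound elsewhere in our config, something like the following, where the URL is purely illustrative and comes from the deployment environment in practice.)

# illustrative binding; the real endpoint comes from the environment
opentelemetry_endpoint = System.get_env("OTLP_ENDPOINT", "http://localhost:4318")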

@tsloughter (Member Author)

@cheerfulstoic thanks!

The default_temporality_mapping should only be needed if you need to change the temporality for the backend it is exported to. It looks like yours is sending to a backend that only supports cumulative?

The configuration can certainly be improved and hopefully will be as we adopt https://github.com/open-telemetry/opentelemetry-configuration/ -- I plan to support the OTel config format as Erlang/Elixir terms first (in our usual configuration files), and then it'll support converting from JSON to the internal configuration. Verifying configuration will be done through the tool in the opentelemetry-configuration repo, run as a release hook if the user wants.

@cheerfulstoic

👍 Cool

Yeah, I was using the grafana/otel-lgtm container to receive the metrics, and when I turned on ENABLE_LOGS_PROMETHEUS I was getting errors about the metrics until I changed them to cumulative. So I left them that way, because we use Prometheus in production as well, though we're still working to understand how the metrics come out in Grafana.

@cheerfulstoic

I was putting these changes into a couple of different projects and I didn't want to copy/paste, so I extracted them into a library. I don't want the library to be permanent, and I expect that whatever changes work should eventually end up here again. Just doing this in the meantime!

https://github.com/TV4/otel_telemetry_metrics

https://hex.pm/packages/otel_telemetry_metrics
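
So in the meantime it can be pulled in like any other Hex dependency, e.g. (the version requirement here is illustrative; check the package page for the current release):

{:otel_telemetry_metrics, "~> 0.1"}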
