@StacieClark-Elastic StacieClark-Elastic commented Oct 9, 2025

Proposed commit message

Added OTEL metrics to the CEL input to support collection of metrics per input periodic run in an agentless environment.

Produces HTTP and CEL input metrics using the OTEL SDK and pushes them to either a defined endpoint or the console at the end of each periodic run. No metrics are produced if no environment variables are set.
Produces a count for each defined metric for every periodic run. Each metric set covers a single periodic run.
Histograms are exported as Exponential Histograms.
If the environment variable OTEL_EXPORTER_OTLP_ENDPOINT is set, OTLP metrics are exported to that endpoint after each periodic run.

Each input has a unique resource attribute set. Any attributes set in the environment variable OTEL_RESOURCE_ATTRIBUTES are added to the input attribute set. Existing keys will not be overwritten.
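
The merge rule described above (environment attributes are added, existing keys win) can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the function name mergeResourceAttributes and the map-based representation are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeResourceAttributes adds key=value pairs from an
// OTEL_RESOURCE_ATTRIBUTES-style string to base without overwriting
// keys that base already defines. The name and the map representation
// are hypothetical; the PR itself works with attribute.KeyValue slices.
func mergeResourceAttributes(base map[string]string, env string) map[string]string {
	merged := make(map[string]string, len(base))
	for k, v := range base {
		merged[k] = v
	}
	for _, pair := range strings.Split(env, ",") {
		key, val, ok := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			continue // Skip malformed entries.
		}
		if _, exists := merged[key]; exists {
			continue // Existing keys are not overwritten.
		}
		merged[key] = strings.TrimSpace(val)
	}
	return merged
}

func main() {
	base := map[string]string{"service.instance.id": "cel-input-1"}
	env := "service.instance.id=ignored,deployment.environment=dev"
	fmt.Println(mergeResourceAttributes(base, env))
}
```

In this sketch the input's own service.instance.id survives and only deployment.environment is taken from the environment string.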

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

The default is to produce no metrics.
The only place where I made changes that could possibly affect behavior outside of this change is the HTTP transport, which we now wrap for HTTP metrics. We have many nested transport wrappers. I do not expect the other transport wrappers to be affected, but it's something to look out for.

Author's Checklist

How to test this PR locally

Reviewing this PR requires building beats, building elastic-agent, then running elastic-agent standalone against a cluster.
I used the serverless cluster on prod as this has a managed OTLP endpoint. You can also run this against a 9.3.0-SNAPSHOT using elastic-package. To use elastic-package you will need to enable the APM server. Create another profile with APM enabled. Note: if you are on macOS, you will need to change the Docker file to expose a port other than 8200 and rebuild elastic-package, because macOS uses that port for another service.

There are two ways to test this.

  • Replace the agentbeat in an elastic-agent distro with one built in beats.
  • Build elastic-agent so that it pulls from the beats repo. elastic-agent and beats need to be in the same directory.
  1. To build agentbeat: check out the branch, cd ../beats/x-pack/agentbeat and run
    DEV=true SNAPSHOT=true PLATFORMS=darwin/arm64 mage build
    Replace PLATFORMS with the correct platform for builds on non-macOS machines.
    This step is not required unless you are overwriting the agentbeat in an existing elastic-agent installation.
    If so, overwrite the agentbeat at /data/elastic-agent-/components

  2. To build elastic-agent: cd into the elastic-agent repo and build elastic-agent
    DEV=true EXTERNAL=false SNAPSHOT=true PLATFORMS=darwin/arm64 PACKAGES=tar.gz mage -v package
    Replace PLATFORMS with the correct platform for builds on non-macOS machines. Make sure that the beats repo is in the same directory as the elastic-agent repo, since EXTERNAL=false pulls beats code from the co-located beats repo instead of from GitHub.

  3. in elastic-agent repo
    cd ./build/distributions
    tar -xvzf <elastic-agent-.tar.gz>
    cd into untarred directory elastic-agent-
    .
    rm elastic-agent.yml (we will replace this before running the elastic-agent)

  4. The rest of the directions are for serverless. Create an observability serverless cluster

  5. Get environment variables for APM
    On Bottom left side click "Add Data"
    Choose "Application" from choices of "What do you want to monitor?"
    Choose "OpenTelemetry" from "Monitor your Application using:"
    Copy the 3 environment variables from section 2. Values from OTEL_RESOURCE_ATTRIBUTES are added to the resource object that each CEL input creates to identify itself. Its presence is required.
    In the OTEL_RESOURCE_ATTRIBUTES template, replace with elastic-agent, app-version with the version being used. You may choose to override deployment.environment.
    Each CEL input behaves like its own application. All CEL applications require OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS values as well.

  6. Add an integration. I have a simple CEL integration package that requires no configuration if you want an easy one to use.
    a. On the lower right, choose "Install Elastic Agent".
    b. In the first paragraph of the next page, click the link that says “standalone mode”.
    c. This takes you to the configuration page for the integration.
    d. After filling out the configuration, click “Save and Continue” on the lower right.
    e. On the Configure Agent page, create an API key.
    f. Download the policy.
    g. Do not install the agent. Leave the page.

  7. Copy the downloaded policy to elastic-agent.yml into ./build/distributions/elastic-agent--

  8. Start the agent in development mode. In ./build/distributions/elastic-agent*
    sudo OTEL_RESOURCE_ATTRIBUTES="<value>" OTEL_EXPORTER_OTLP_ENDPOINT="<value>" OTEL_EXPORTER_OTLP_HEADERS="<value>" ./elastic-agent run -e --develop &> output.txt

  9. Check for data in the cluster.
    On Left choose "Discover"
    In Data View, use the dropdown to select 'metrics-*'
    Filter by package and datastream name: package.datastream : "<package_name>.<datastream.name>"
    All the metrics for a periodic run will have the same timestamp. For any timestamp there will be 19 metrics:
    "input.cel.periodic.run"
    "input.cel.periodic.program.run.started"
    "input.cel.periodic.program.run.success"
    "input.cel.periodic.batch.generated"
    "input.cel.periodic.batch.published"
    "input.cel.periodic.event.generated"
    "input.cel.periodic.event.published"
    "input.cel.periodic.run.duration"
    "input.cel.periodic.cel.duration"
    "input.cel.periodic.event.publish.duration"
    "input.cel.program.batch.processed"
    "input.cel.program.batch.published"
    "input.cel.program.event.generated"
    "http.client.request.body.size"
    "http.client.request.duration"
    Verify that metrics exist for each of these names.

    Look for metrics beginning with
    input.cel.periodic.* (CEL processing metrics for each periodic run; all are counters)
    input.cel.program.* (CEL processing metrics for each program run; most are histograms across all the program runs for the periodic run)
    input.cel.* (all CEL periodic and program metrics)
    http.client.* (HTTP metrics that are generated from the SDK)

Other filtering options:
For instance if the id in the elastic-agent.yml is "- id: cel-cel_simple-d78ef7a8-0757-4606-902e-c6a7f9320013"
then you can filter by
resource.attributes.service.instance.id : "cel-cel_simple.fakedts-d78ef7a8-0757-4606-902e-c6a7f9320013"

Related issues

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 9, 2025
github-actions bot commented Oct 9, 2025

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify bot commented Oct 9, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @StacieClark-Elastic? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

mergify bot commented Oct 12, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b Add-metrics-CEL-609 upstream/Add-metrics-CEL-609
git merge upstream/main
git push upstream Add-metrics-CEL-609

@narph narph added the Team:Security-Service Integrations Security Service Integrations Team label Nov 17, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 17, 2025
mergify bot commented Nov 19, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b Add-metrics-CEL-609 upstream/Add-metrics-CEL-609
git merge upstream/main
git push upstream Add-metrics-CEL-609

@StacieClark-Elastic StacieClark-Elastic marked this pull request as ready for review November 26, 2025 22:15
@StacieClark-Elastic StacieClark-Elastic requested review from a team as code owners November 26, 2025 22:15
@elasticmachine

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@StacieClark-Elastic StacieClark-Elastic marked this pull request as draft November 26, 2025 23:42
… to a Sum metric so it can be visualized in APM
Added a check for an environment variable 'APM_OTLP'. If set, all metric histograms will be exported as Sum (Counter) type. This is to support sending metrics to both the APM OTLP endpoint and the managed OTLP endpoint.
Histogram defaults to exponential type. Can be changed to use regular histograms by setting environment variable USE_NON_EXPONENTIAL_HISTOGRAMS.
Removed flush and shutdown functions due to the exporter being shared.
Shortened metric names. Removed option to export as plain histograms. Cleaned up README. Added a PNG of where metrics are collected
@StacieClark-Elastic StacieClark-Elastic marked this pull request as ready for review December 2, 2025 22:01
// record and log execution coverage.
RecordCoverage bool `config:"record_coverage"`

Package map[string]string `config:"package"`
Contributor

I think this needs godoc.

Comment on lines +60 to +70
v2 "github.com/elastic/beats/v7/filebeat/input/v2"
inputcursor "github.com/elastic/beats/v7/filebeat/input/v2/input-cursor"
"github.com/elastic/beats/v7/libbeat/beat"
"github.com/elastic/beats/v7/libbeat/feature"
"github.com/elastic/beats/v7/libbeat/management/status"
"github.com/elastic/beats/v7/libbeat/statestore"
"github.com/elastic/beats/v7/libbeat/version"
"github.com/elastic/beats/v7/x-pack/filebeat/input/internal/httplog"
"github.com/elastic/beats/v7/x-pack/filebeat/input/internal/httpmon"
"github.com/elastic/beats/v7/x-pack/filebeat/otel"
"github.com/elastic/beats/v7/x-pack/libbeat/common/aws"
Contributor

Is there a reason that this was moved?

Member Author

bad IDE settings. I fixed that.

Comment on lines +109 to +112
srcp, ok := src.(*source)
if !ok {
	return fmt.Errorf("input type %T is not a source", src)
}
Contributor

Please don't do this. If this panics something has gone terribly wrong and the program should fail fatally. If absolutely necessary, you can add a //nolint:errcheck // If this assertion fails, the program is incorrect and should panic.
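
To illustrate the point, a small sketch of the single-value assertion form the comment is asking for. The Source interface, the source and other types, and the helper names are stand-ins invented for this example; only the nolint comment text is taken from the review.

```go
package main

import "fmt"

// Source stands in for inputcursor.Source (hypothetical, reduced to one method).
type Source interface{ Name() string }

type source struct{ name string }

func (s *source) Name() string { return s.name }

// other is a second implementation used to provoke the failing case.
type other struct{}

func (other) Name() string { return "other" }

// mustSource uses the single-value assertion: if src is not a *source the
// program is incorrect, so it panics instead of returning an error.
func mustSource(src Source) *source {
	return src.(*source) //nolint:errcheck // If this assertion fails, the program is incorrect and should panic.
}

// panics reports whether f panicked.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	fmt.Println(mustSource(&source{name: "cel"}).Name())
	fmt.Println("wrong type panics:", panics(func() { mustSource(other{}) }))
}
```

The comma-ok form turns a programming error into a runtime error value that callers then have to handle; the single-value form fails loudly at the point where the invariant is broken.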

Comment on lines +124 to +128
srcP, ok := src.(*source)
if !ok {
	return errors.New("inputcursor.Source is not a *source type")
}
dataStreamName := srcP.cfg.DataStream // May be empty.
Contributor

Same here.

Comment on lines +329 to 330
otelMetrics.AddProgramExecution(ctx, 1)
metrics.executions.Add(1)
Contributor

As a general pattern, it seems to me that if metrics held otelMetrics and maybe the context, the metrics notes could be rolled into methods on metrics that do both the current metrics publication and the OTel metrics publication. This would either be with an explicit context being passed in, or with the held context. I don't really mind which.

Member Author

I'm assuming that metrics are going to be deprecated.

Contributor

I think this is an even better argument for doing it that way, then, since it reduces diff churn in the body of the CEL input and places it in the metrics code.

Comment on lines +1009 to 1046
func GetResourceAttributes(env v2.Context, cfg config) []attribute.KeyValue {
	attrs := []attribute.KeyValue{semconv.ServiceInstanceID(env.IDWithoutName),
		attribute.String("package.name", cfg.GetPackageData("name")),
		attribute.String("package.version", cfg.GetPackageData("version")),
		attribute.String("package.data_stream", cfg.DataStream),
		attribute.String("agent.version", env.Agent.Version),
		attribute.String("agent.id", env.Agent.ID.String())}

	usedKeys := make(map[string]struct{})

	for _, attr := range attrs {
		// Access the Key field of the KeyValue struct
		usedKeys[string(attr.Key)] = struct{}{}
	}
	attributesStr, ok := os.LookupEnv("OTEL_RESOURCE_ATTRIBUTES")
	if ok && len(attributesStr) > 0 {
		attributes := make([]attribute.KeyValue, 0)
		pairs := strings.Split(attributesStr, ",")
		for _, pair := range pairs {
			kv := strings.SplitN(pair, "=", 2)
			if len(kv) == 2 {
				key := strings.TrimSpace(kv[0])
				value := strings.TrimSpace(kv[1])
				if key != "" {
					// don't overwrite existing keys
					_, used := usedKeys[key]
					if !used {
						attributes = append(attributes, attribute.String(key, value))
					}
				}
			}
		}
		attrs = append(attrs, attributes...)
	}

	return attrs

}
Contributor

Suggested change
func getResourceAttributes(env v2.Context, cfg config) []attribute.KeyValue {
	attrs := []attribute.KeyValue{
		semconv.ServiceInstanceID(env.IDWithoutName),
		attribute.String("package.name", cfg.GetPackageData("name")),
		attribute.String("package.version", cfg.GetPackageData("version")),
		attribute.String("package.data_stream", cfg.DataStream),
		attribute.String("agent.version", env.Agent.Version),
		attribute.String("agent.id", env.Agent.ID.String()),
	}
	attributes := os.Getenv("OTEL_RESOURCE_ATTRIBUTES")
	if attributes == "" {
		return attrs
	}
	seen := make(map[attribute.Key]bool)
	for _, attr := range attrs {
		seen[attr.Key] = true
	}
	pairs := strings.Split(attributes, ",")
	for _, pair := range pairs {
		key, val, ok := strings.Cut(pair, "=")
		if !ok || key == "" || seen[attribute.Key(key)] {
			continue
		}
		attrs = append(attrs, attribute.String(key, val))
	}
	return attrs
}

Comment on lines +988 to +1005
resource := resource.NewWithAttributes(
	semconv.SchemaURL, GetResourceAttributes(env, cfg)...,
)

log.Infof("created cel input resource %s", resource.String())
exporter, exporterType, err := otel.GetGlobalExporterFactory(log).GetExporter(ctx)
if err != nil {
	log.Errorw("failed to get exporter", "error", err)
}
if err != nil {
	log.Errorw("failed to get collection period", "error", err)
}
log.Infof("created OTEL cel input exporter %s for input %s", exporterType, env.IDWithoutName)
otelMetrics, otelTransport, err := otel.NewOTELCELMetrics(log, env.IDWithoutName, *resource, c.Transport, exporter)
if err != nil {
	return nil, nil, nil, err
}
c.Transport = otelTransport
Contributor

Can this be done in a helper outside newClient on its return? Then we don't need to return the otelMetrics value here.

func addOtelMetrics(ctx context.Context, cli *http.Client, cfg config, env v2.Context, log *logp.Logger) (*otel.OTELCELMetrics, error) {
	resource := resource.NewWithAttributes(semconv.SchemaURL, getResourceAttributes(env, cfg)...)

	log.Infow("created cel input resource", "resource", resource)
	exporter, typ, err := otel.GetGlobalExporterFactory(log).GetExporter(ctx)
	if err != nil {
		log.Errorw("failed to get exporter", "error", err)
	}
	if err != nil {
		log.Errorw("failed to get collection period", "error", err)
	}
	log.Infow("created OTEL cel input exporter", "type", typ, "id", env.IDWithoutName)
	metrics, transport, err := otel.NewOTELCELMetrics(log, env.IDWithoutName, *resource, cli.Transport, exporter)
	if err != nil {
		return nil, err
	}
	cli.Transport = transport
	return metrics, nil
}

with the call site looking like

	client, trace, err := newClient(ctx, cfg, log, reg)
	if err != nil {
		return err
	}
	otelMetrics, err := addOtelMetrics(ctx, client, cfg, env, log)
	if err != nil {
		return err
	}

This may not be completely possible; the helper is, but where it happens may need to still be in newClient. The existing metrics is added to the round-tripper chain before the retries, but my suggestion and the existing proposal add the OTel metrics after the retries. This means they are measuring different things.
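
The difference in what gets measured can be seen in a small standalone sketch. The retryTransport below is a simplified stand-in for a retrying round-tripper, not the one used by the CEL input, and all names are hypothetical. It only shows that a counter placed outside the retry layer sees one logical request while a counter placed inside it sees every attempt; it makes no claim about which position each metric occupies in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// roundTripFunc adapts a function to http.RoundTripper (illustrative helper).
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// countTransport counts every call that passes through it.
type countTransport struct {
	next  http.RoundTripper
	calls int
}

func (c *countTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	c.calls++
	return c.next.RoundTrip(req)
}

// retryTransport retries failed round trips up to attempts times.
type retryTransport struct {
	next     http.RoundTripper
	attempts int
}

func (r *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var err error
	for i := 0; i < r.attempts; i++ {
		var resp *http.Response
		if resp, err = r.next.RoundTrip(req); err == nil {
			return resp, nil
		}
	}
	return nil, err
}

// flaky returns a transport that fails the first n calls, then succeeds.
func flaky(failures int) http.RoundTripper {
	return roundTripFunc(func(*http.Request) (*http.Response, error) {
		if failures > 0 {
			failures--
			return nil, errors.New("temporary failure")
		}
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
	})
}

// observe sends one request through each ordering and returns the counts
// seen by a counter outside and a counter inside the retry layer.
func observe() (outside, inside int) {
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		panic(err)
	}

	out := &countTransport{next: &retryTransport{next: flaky(2), attempts: 3}}
	if resp, err := out.RoundTrip(req); err == nil {
		resp.Body.Close()
	}

	in := &countTransport{next: flaky(2)}
	rt := &retryTransport{next: in, attempts: 3}
	if resp, err := rt.RoundTrip(req); err == nil {
		resp.Body.Close()
	}
	return out.calls, in.calls
}

func main() {
	outside, inside := observe()
	fmt.Println("outside retries:", outside, "inside retries:", inside)
}
```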

Comment on lines +56 to +58
o.exportLock.Lock() // Acquire the lock
defer o.exportLock.Unlock()
o.started = true
Contributor

Suggested change
o.exportLock.Lock()
o.started = true
o.exportLock.Unlock()

programCelDurationHistogram: programCELDuration,
programEventPublishDurationHistogram: programPublishDuration,
}, transport, nil

Contributor

Suggested change

Comment on lines +28 to +29
exportLock sync.Mutex
started bool
Contributor

@efd6 efd6 Dec 3, 2025

The mutex is protecting only the boolean, correct? If that's the case, the mutex can go away and we can use an atomic.Bool. Like

diff --git a/x-pack/filebeat/otel/cel_metrics.go b/x-pack/filebeat/otel/cel_metrics.go
index dbb5b66847..f23abcff61 100644
--- a/x-pack/filebeat/otel/cel_metrics.go
+++ b/x-pack/filebeat/otel/cel_metrics.go
@@ -9,7 +9,7 @@ import (
        "encoding/json"
        "fmt"
        "net/http"
-       "sync"
+       "sync/atomic"
        "time"
 
        "github.com/elastic/elastic-agent-libs/logp"
@@ -25,8 +25,7 @@ import (
 type OTELCELMetrics struct {
        log                                  *logp.Logger
        manualExportFunc                     func(context.Context) error
-       exportLock                           sync.Mutex
-       started                              bool
+       started                              atomic.Bool
        periodicRunCount                     metric.Int64Counter
        periodicBatchGeneratedCount          metric.Int64Counter
        periodicBatchPublishedCount          metric.Int64Counter
@@ -53,27 +52,24 @@ type OTELCELMetrics struct {
 // running periodic runs. However, test environments with
 // small intervals could potentially cause this to happen.
 func (o *OTELCELMetrics) StartPeriodic() {
-       o.exportLock.Lock() // Acquire the lock
-       defer o.exportLock.Unlock()
-       o.started = true
+       o.started.Store(true)
 }
 
 // EndPeriodic ends the periodic metrics collection and manually exports metrics if a manual export function is set.
 func (o *OTELCELMetrics) EndPeriodic(ctx context.Context) {
-       o.exportLock.Lock() // Acquire the lock
-       defer o.exportLock.Unlock()
-       if o.started {
-               o.log.Debug("OTELCELMetrics EndPeriodic called")
-               o.started = false
-               if o.manualExportFunc != nil {
-                       o.log.Debug("OTELCELMetrics manual export started")
-                       err := o.manualExportFunc(ctx)
-                       if err != nil {
-                               o.log.Errorf("error exporting metrics: %v", err)
-                       }
-                       o.log.Debug("OTELCELMetrics manual export ended")
-               }
+       o.log.Debug("OTELCELMetrics EndPeriodic called")
+       if o.manualExportFunc == nil {
+               return
+       }
+       if !o.started.CompareAndSwap(true, false) {
+               return
+       }
+       o.log.Debug("OTELCELMetrics manual export started")
+       err := o.manualExportFunc(ctx)
+       if err != nil {
+               o.log.Errorf("error exporting metrics: %v", err)
        }
+       o.log.Debug("OTELCELMetrics manual export ended")
 }
 
 func (o *OTELCELMetrics) AddPeriodicRun(ctx context.Context, count int64) {

(note that the manualExportFunc field is never written to after construction, so it does not need protection and it can be used as a non-fenced check before the atomic operation — even if the atomic.Bool approach cannot be used, this movement can be done to reduce the locking costs in the case that there is no export function)

Member Author

I'm forcing serialization of access to the manualExportFunc due to the collect() function not being concurrent. The exporter can be run concurrently, but collect() cannot.

Contributor

Given that the o.manualExportFunc is constructed by NewOTELCELMetrics, it could close over a mutex and do the locking itself.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 3, 2025
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

Labels

enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Security-Service Integrations Security Service Integrations Team

5 participants