
S3 CRT client metrics are not properly reported every 5 seconds #1410

Open
@dannycjones

Description


Mountpoint for Amazon S3 version

mount-s3 1.16.2

AWS Region

us-east-1

Describe the running environment

Just running on EC2.

Mountpoint options

mount-s3 <BUCKET> <MNT> --log-metrics

What happened?

Mountpoint only polls S3 CRT client metrics at the start of each S3 meta request; those values are then emitted by our metrics sink on its next publish. However, no further client metrics are reported even when meta requests stay in flight for longer than 5 seconds.

We should move from polling client metrics at the start of each meta request to polling the client whenever metrics are published, as we already do for process-level metrics such as memory.

S3 CRT client metrics can be "polled" here:

fn poll_client_metrics(s3_client: &Client) {
    let metrics = s3_client.poll_client_metrics();
    metrics::gauge!("s3.client.num_requests_being_processed").set(metrics.num_requests_tracked_requests as f64);
    metrics::gauge!("s3.client.num_requests_being_prepared").set(metrics.num_requests_being_prepared as f64);
    metrics::gauge!("s3.client.request_queue_size").set(metrics.request_queue_size as f64);
    metrics::gauge!("s3.client.num_auto_default_network_io").set(metrics.num_auto_default_network_io as f64);
    metrics::gauge!("s3.client.num_auto_ranged_get_network_io").set(metrics.num_auto_ranged_get_network_io as f64);
    metrics::gauge!("s3.client.num_auto_ranged_put_network_io").set(metrics.num_auto_ranged_put_network_io as f64);
    metrics::gauge!("s3.client.num_auto_ranged_copy_network_io").set(metrics.num_auto_ranged_copy_network_io as f64);
    metrics::gauge!("s3.client.num_total_network_io").set(metrics.num_total_network_io() as f64);
    metrics::gauge!("s3.client.num_requests_stream_queued_waiting").set(metrics.num_requests_stream_queued_waiting as f64);
    metrics::gauge!("s3.client.num_requests_streaming_response").set(metrics.num_requests_streaming_response as f64);

    // Buffer pool metrics
    let start = Instant::now();
    let buffer_pool_stats = s3_client.poll_buffer_pool_usage_stats();
    metrics::histogram!("s3.client.buffer_pool.get_usage_latency_us").record(start.elapsed().as_micros() as f64);
    metrics::gauge!("s3.client.buffer_pool.mem_limit").set(buffer_pool_stats.mem_limit as f64);
    metrics::gauge!("s3.client.buffer_pool.primary_cutoff").set(buffer_pool_stats.primary_cutoff as f64);
    metrics::gauge!("s3.client.buffer_pool.primary_used").set(buffer_pool_stats.primary_used as f64);
    metrics::gauge!("s3.client.buffer_pool.primary_allocated").set(buffer_pool_stats.primary_allocated as f64);
    metrics::gauge!("s3.client.buffer_pool.primary_reserved").set(buffer_pool_stats.primary_reserved as f64);
    metrics::gauge!("s3.client.buffer_pool.primary_num_blocks").set(buffer_pool_stats.primary_num_blocks as f64);
    metrics::gauge!("s3.client.buffer_pool.secondary_reserved").set(buffer_pool_stats.secondary_reserved as f64);
    metrics::gauge!("s3.client.buffer_pool.secondary_used").set(buffer_pool_stats.secondary_used as f64);
    metrics::gauge!("s3.client.buffer_pool.forced_used").set(buffer_pool_stats.forced_used as f64);
}

We should perform the poll somewhere around here:

let publisher_thread = {
    let inner = Arc::clone(&sink);
    thread::spawn(move || {
        loop {
            match rx.recv_timeout(AGGREGATION_PERIOD) {
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
                Err(RecvTimeoutError::Timeout) => {
                    poll_process_metrics(&mut sys);
                    inner.publish()
                }
            }
        }
        // Drain metrics one more time before shutting down. This has a chance of missing
        // any new metrics data after the sink shuts down, but we assume a clean shutdown
        // stops generating new metrics before shutting down the sink.
        poll_process_metrics(&mut sys);
        inner.publish();
    })
};
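
One possible shape for that change (a minimal sketch only, not the actual Mountpoint code): have the publisher thread run a list of registered poller callbacks on every aggregation tick, and have mount setup register a closure that captures the S3 client and calls poll_client_metrics. The spawn_publisher_thread function, the Publish trait placeholder, and the MetricsPoller callback type below are illustrative assumptions; the real change would need some way to get the client handle (or a closure over it) into the metrics module.

use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

const AGGREGATION_PERIOD: Duration = Duration::from_secs(5);

/// Placeholder for the existing metrics sink's publish behaviour (assumption).
trait Publish {
    fn publish(&self);
}

/// Hypothetical hook: extra work to run on every aggregation tick, e.g. a
/// closure that calls `poll_client_metrics(&s3_client)`.
type MetricsPoller = Box<dyn Fn() + Send + 'static>;

fn spawn_publisher_thread<S: Publish + Send + Sync + 'static>(
    sink: Arc<S>,
    rx: Receiver<()>,
    pollers: Vec<MetricsPoller>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        loop {
            match rx.recv_timeout(AGGREGATION_PERIOD) {
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
                Err(RecvTimeoutError::Timeout) => {
                    // Refresh every registered gauge source (process metrics,
                    // CRT client metrics, buffer pool stats, ...) before each
                    // publish, so values update on every 5-second tick rather
                    // than only at meta request start.
                    for poll in &pollers {
                        poll();
                    }
                    sink.publish();
                }
            }
        }
        // Final drain before shutdown, mirroring the existing behaviour.
        for poll in &pollers {
            poll();
        }
        sink.publish();
    })
}

With that shape, mount setup could register something like Box::new(move || poll_client_metrics(&client)) once the client is constructed, so CRT client gauges would refresh alongside process metrics every AGGREGATION_PERIOD instead of only at meta request start.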

Relevant log output

Labels

bug
