
[RFC] Riak CS and Stanchion metrics 2.1

UENISHI Kota edited this page Jul 27, 2015 · 9 revisions

Developer docs

Summary: As of version 2.1, Riak CS has a new metrics system for monitoring the system more effectively and for diagnosing issues. Stanchion has also introduced a metrics system analogous to that of Riak CS, including the stanchion-admin command and the /stats HTTP endpoint.

Status of this document: Request for comments. Although most of the implementation is finished, it is still easy to add or remove items to cover minor improvements that reflect real operational needs. Suggestions for renaming items into clearer English would also be appreciated.

The new metrics system exposes more items than the previous one, covering:

  • Statistics of API requests - count, latency, success and errors
  • Statistics of Riak PB API performance - count, latency, success and errors
  • Statistics of accessing Stanchion
  • Waiting time and service time of Stanchion serialization queue
  • System metrics such as OTP and memory
  • Mochiweb metrics

This document describes the basic ideas behind the new metrics system and is intended to help with maintaining Riak CS and Stanchion 2.1 systems.

Usage and What's new

The interface for retrieving stats from Riak CS is still the riak-cs-admin status command and the HTTP/JSON API /riak-cs/stats; neither has changed. However, many new detailed items have been added, along with system metrics. See example results of riak-cs-admin status and HTTP/JSON API /riak-cs/stats.

For Stanchion, a corresponding new command and HTTP/JSON API endpoint have been added. They provide the same granularity of items as Riak CS. See example results of stanchion-admin status and HTTP/JSON API /stats.
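
As a rough illustration of polling the two HTTP/JSON endpoints named above, the following Python sketch builds the stats URL and decodes the response. Only the paths /riak-cs/stats and /stats come from this document. The base URL, the function names, the assumption that the body is a flat JSON object of item names to values, and the assumption that no request signing is required are all hypothetical and should be checked against the actual deployment.

```python
import json
import urllib.request

# Endpoint paths named in this document; host and port are deployment-specific.
STATS_PATHS = {"riak-cs": "/riak-cs/stats", "stanchion": "/stats"}


def stats_url(base_url, service):
    """Build the stats URL for 'riak-cs' or 'stanchion'."""
    return base_url.rstrip("/") + STATS_PATHS[service]


def parse_stats(body):
    """Decode a stats response, assumed here to be a flat JSON object."""
    stats = json.loads(body)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object of item name -> value")
    return stats


def fetch_stats(base_url, service, timeout=5.0):
    """Fetch and decode stats; assumes the endpoint needs no request signing."""
    url = stats_url(base_url, service)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_stats(resp.read().decode("utf-8"))
```

A call such as fetch_stats("http://127.0.0.1:8080", "riak-cs") would then return a dictionary of items, where the address is only a placeholder for the node's real listener.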

Terminology and categories

Terms:

  • in, out - for each API call, metrics taken when a request starts (in) and when it finishes (out).
  • error - if the term 'error' appears in a metric item name, the item counts failed requests. Successful requests are counted in the corresponding item whose name does not include 'error'. Note that most irregular responses with 50x status codes are counted in neither item.
  • one, total - the main suffixes for counters. 'one' refers to a time window determined by exometer, while 'total' is the value accumulated since the node started.
  • time - latency statistics, measured from when a request starts until it finishes. Always followed by one of the suffixes 95, 99, 100, mean or median.
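
To make the terms above concrete, here is a small Python sketch that interprets an item name using only those terms. It assumes item names are the terms joined by underscores, and the example names in the usage note are hypothetical; the real item names should be taken from the riak-cs-admin status output. The function name describe_item and the returned keys are illustrative only.

```python
LATENCY_SUFFIXES = {"95", "99", "100", "mean", "median"}
COUNT_SUFFIXES = {"one", "total"}


def describe_item(name):
    """Heuristically interpret a metric item name (assumed '_'-separated)."""
    tokens = name.split("_")
    last = tokens[-1]
    info = {
        "kind": "other",              # 'count', 'latency' or 'other'
        "stat": None,                 # 'one', 'total', '95', 'mean', ...
        "direction": None,            # 'in' or 'out' for counters, if present
        "failed": "error" in tokens,  # True when the name contains 'error'
    }
    if "time" in tokens[:-1] and last in LATENCY_SUFFIXES:
        info["kind"] = "latency"
        info["stat"] = last
    elif last in COUNT_SUFFIXES:
        info["kind"] = "count"
        info["stat"] = last
        for direction in ("in", "out"):
            if direction in tokens[:-1]:
                info["direction"] = direction
    return info
```

For example, under these assumptions describe_item("object_get_in_one") is read as a windowed counter of started requests, and describe_item("object_get_time_95") as a 95th percentile latency.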

Categories:

  • S3 API stats - items starting with the prefixes service, bucket, list, multiple_delete, object and multipart (names of S3 APIs) are stats for those APIs, typically followed by a term like put, get or delete.
  • Stanchion access stats - items starting with the prefix velvet give latency and counts for accessing the Stanchion process to create, update or delete buckets or to create users. They help determine whether most of the latency of slow requests lies in Stanchion.
  • riakc - items starting with the prefix riakc give latency and call counts for the Riak PB API. riakc is usually followed by an operation such as put or get and its target, such as manifests or blocks. They are also useful for finding where most of the latency comes from, such as getting a user record or a bucket record, or updating manifests.
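
The categories above can be sketched as a simple prefix lookup, for instance to group items on a dashboard. The category labels and the function name categorize are hypothetical. One caveat is handled explicitly: the object_web_* items listed later in this document are Mochiweb metrics, although they share the object prefix with the S3 API stats.

```python
S3_API_PREFIXES = ("service", "bucket", "list", "multiple_delete", "object", "multipart")


def _has_prefix(name, prefix):
    """True if name equals prefix or continues it after an underscore."""
    return name == prefix or name.startswith(prefix + "_")


def categorize(name):
    """Map an item name to a coarse category based on its prefix."""
    if _has_prefix(name, "object_web"):
        return "mochiweb"          # not an S3 API stat despite 'object'
    if _has_prefix(name, "velvet"):
        return "stanchion_access"
    if _has_prefix(name, "riakc"):
        return "riak_pb"
    if any(_has_prefix(name, p) for p in S3_API_PREFIXES):
        return "s3_api"
    return "other"                 # e.g. memory, sys, pool items
```

Grouping by category in this way makes it easier to see whether a slow S3 API call spends its time in Stanchion (velvet items) or in Riak (riakc items).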

Riak CS

As there are more than 1000 items, this section describes only the major prefixes.

  • service_get - GET Service

  • bucket_(put|head|delete) - PUT, HEAD, DELETE Bucket

  • bucket_acl_(get|put) - PUT, GET Bucket ACL

  • bucket_policy_(get|put|delete) - PUT, GET, DELETE Bucket Policy

  • bucket_location_get - GET Bucket Location

  • list_uploads - listing all multipart uploads

  • multiple_delete - Delete Multiple Objects

  • list_objects - listing all objects in a bucket, equivalent to GET Bucket

  • object_(get|put|delete|head) - GET, PUT, DELETE, HEAD Object

  • object_put_copy - PUT Copy Object

  • object_acl - GET, PUT Object ACL

  • multipart_post - Initiate a multipart upload

  • multipart_upload_put - PUT Multipart Upload, putting a part of an object by copying from an existing object

  • multipart_upload_post - complete a multipart upload

  • multipart_upload_delete - delete a part of a multipart upload

  • multipart_upload_get - get a list of parts in a multipart upload

  • velvet_create_user - request to Stanchion to create a user

  • velvet_update_user - request to Stanchion to update a user

  • velvet_create_bucket - request to Stanchion to create a bucket

  • velvet_delete_bucket - request to Stanchion to delete a bucket

  • velvet_set_bucket_acl - request to Stanchion to update a bucket ACL

  • velvet_set_bucket_policy - request to Stanchion to put a new bucket policy

  • velvet_delete_bucket_policy - request to Stanchion to delete the policy of a bucket

  • riakc_ping - pinging the PB API, invoked by /riak-cs/ping

  • riakc_get_cs_bucket - getting a bucket record

  • riakc_get_cs_user_strong - getting a user record with PR=all

  • riakc_get_cs_user - getting a user record with R=quorum and PR=one

  • riakc_put_cs_user - putting a user record after creating or deleting a bucket

  • riakc_get_manifest - getting a manifest

  • riakc_put_manifest - putting a manifest

  • riakc_delete_manifest - deleting a manifest (invoked via GC)

  • riakc_get_block_n_one - getting a block with N=1 without sloppy quorum

  • riakc_get_block_n_all - getting a block with N=3 after N=1 get failed

  • riakc_get_block_remote - getting a block after N=3 get resulted in not found

  • riakc_get_block_legacy - getting a block when N=1 get is turned off

  • riakc_put_block - putting a block

  • riakc_put_block_resolved - putting a block when block siblings resolution is invoked

  • riakc_head_block - heading a block, invoked via GC

  • riakc_delete_block_constrained - first attempt to delete a block, with PW=all

  • riakc_delete_block_secondary - second attempt to delete a block, with PW=quorum, after the PW=all attempt failed

  • riakc_(get|put)_gc_manifest_set - invoked when a manifest is being moved to GC bucket

  • riakc_(get|delete)_gc_manifest_set - invoked when manifests are being collected

  • riakc_(get|put)_access - getting access stats, putting access stats

  • riakc_(get|put)_storage - getting storage stats, putting storage stats

  • riakc_fold_manifest_objs - invoked inside GET Bucket (listing objects within a bucket)

  • riakc_mapred_storage - stats on each MapReduce job performance

  • riakc_list_all_user_keys - all users are listed out when starting storage calculation

  • riakc_list_all_manifest_keys - only used when deleting a bucket to verify it's empty

  • riakc_list_users_receive_chunk - listing users invoked via /riak-cs/users API.

  • riakc_get_uploads_by_index

  • riakc_get_user_by_index

  • riakc_get_gc_keys_by_index

  • riakc_get_cs_buckets_by_index

  • riakc_get_clusterid - invoked the first time a proxy_get is performed

  • manifest_siblings_bp_sleep

  • pool

  • object_web_active_sockets

  • object_web_waiting_acceptors - number of inactive acceptor processes in mochiweb

  • object_web_port

  • memory_* - memory stats, same as in Riak

  • nodename, connected_nodes - same as in Riak, but not useful in CS

  • sys_* - system stats, same as in Riak
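
The list above uses a compact notation such as bucket_(put|head|delete) to stand for several prefixes at once. As a convenience, the following Python sketch expands that notation into the individual prefixes. It handles only parenthesised alternatives; wildcard entries such as memory_* are returned unchanged. The function name expand is hypothetical.

```python
import itertools
import re

# Matches one parenthesised group of alternatives, e.g. "(get|put)".
_GROUP = re.compile(r"\(([^()]*)\)")


def expand(pattern):
    """Expand 'a_(x|y)_b' style notation into ['a_x_b', 'a_y_b']."""
    # re.split with a capturing group alternates literal text (even
    # indexes) and the captured alternatives (odd indexes).
    parts = _GROUP.split(pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

For instance, expand("riakc_(get|put)_gc_manifest_set") yields the two prefixes riakc_get_gc_manifest_set and riakc_put_gc_manifest_set.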

Stanchion

  • bucket_create
  • bucket_delete
  • bucket_put_acl
  • user_create
  • user_update
  • riakc_ping
  • riakc_get_cs_bucket
  • riakc_put_cs_bucket
  • riakc_delete_cs_bucket
  • riakc_get_cs_user_strong
  • riakc_get_cs_user
  • riakc_put_cs_user
  • riakc_get_manifest
  • riakc_list_all_user_keys
  • riakc_list_all_manifest_keys
  • riakc_list_users_receive_chunk
  • riakc_get_user_by_index
  • riakc_get_gc_keys_by_index
  • riakc_get_cs_buckets_by_index
  • stanchion_server_msgq_len
  • waiting_time

Related Issues

Riak CS

  • Enrich stats items [JIRA: RCS-217] #961
  • Introduce Exometer #1165
  • Add latency stats items to S3 API and velvet calls [JIRA: RCS-220] #1180
  • Add latency stats for riak pb client operations [JIRA: RCS-243] #1189
  • Add status around PB pools, memory, system and mochiweb [RCS-244] #1194
  • Add stanchion stats test for API and command #1199

Stanchion

  • Introduce metrics. #92

  • Add stats to Stanchion #98

  • Feature/stats2 #99

  • [http://docs.basho.com/riakcs/latest/cookbooks/Monitoring-and-Metrics/]
