[RFC] Riak CS and Stanchion metrics 2.1
Developer docs;
Summary: Riak CS 2.1 introduces a new metrics system for monitoring the system more closely and for diagnosing issues. Stanchion introduces an analogous metrics system, including the stanchion-admin command and the /stats HTTP endpoint.
Status of this document: Request for comments. Although most of the implementation is finished, it is still easy to add or remove items to cover minor improvements that reflect real operational needs. Suggestions for clearer English item names are also welcome.
The new metrics system exposes more items than the previous one:
- Statistics of API requests - count, latency, success and errors
- Statistics of Riak PB API performance - count, latency, success and errors
- Statistics of accessing Stanchion
- Waiting time and service time of Stanchion serialization queue
- System metrics such as OTP and memory
- Mochiweb metrics
This document describes the basic ideas of the new metrics system and aims to help with maintaining Riak CS and Stanchion 2.1 systems.
Stats have so far been retrieved from Riak CS with the command riak-cs-admin status and the HTTP/JSON API /riak-cs/stats; this interface is unchanged. Many new detailed items have been added, along with system metrics. See the example results of riak-cs-admin status and of the HTTP/JSON API /riak-cs/stats.
Stanchion gains a corresponding new command and HTTP/JSON API endpoint, with the same granularity of items as Riak CS. See the example results of stanchion-admin status and of the HTTP/JSON API /stats.
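As a rough illustration of consuming the HTTP/JSON endpoint, the sketch below fetches /riak-cs/stats and selects items by prefix. It rests on assumptions that are not stated in this document: the host and port in STATS_URL are placeholders, the endpoint is assumed to be reachable without request signing, and the response is assumed to be a flat JSON object mapping item names to values. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: placeholder host and port, not values taken from this document.
STATS_URL = "http://127.0.0.1:8080/riak-cs/stats"


def parse_stats(body):
    """Decode a stats response body, assumed to be a flat JSON object."""
    stats = json.loads(body)
    if not isinstance(stats, dict):
        raise ValueError("expected a JSON object of metric items")
    return stats


def fetch_stats(url=STATS_URL, timeout=5.0):
    """Fetch and decode stats from a node; raises on network or HTTP errors."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_stats(resp.read().decode("utf-8"))


def select_items(stats, prefix):
    """Return the items whose name starts with the given prefix, sorted by name."""
    return {name: stats[name] for name in sorted(stats) if name.startswith(prefix)}
```

Under the same assumptions, select_items(fetch_stats(), "riakc_") would narrow the output to the Riak PB API items, and pointing the URL at the Stanchion /stats endpoint should work the same way.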
Terms:
- in, out - for each API call, metrics taken when a request starts and when it finishes, respectively.
- error - if the term 'error' appears in a metric item name, the item covers failed requests. Successful requests are counted in the corresponding item that does not include 'error'. Note that most irregular responses with code 50x are counted in neither.
- one, total - the main suffixes for counts. 'one' stands for a time window that is decided by exometer, while 'total' stands for the value accumulated since the node started.
- time - latency stats, measured from when a request starts until it finishes. All are followed by one of the suffixes 95, 99, 100, mean and median.
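The terms above can be read off an item name mechanically. The following is a minimal sketch, assuming that item names are underscore-separated and end with one of the suffixes listed above; the exact position of in, out, error and time inside real item names is an assumption, and describe_item is a hypothetical helper, not part of Riak CS.

```python
COUNT_SUFFIXES = ("one", "total")
LATENCY_SUFFIXES = ("95", "99", "100", "mean", "median")


def describe_item(name):
    """Classify a metric item name using the terms in, out, error, one, total, time."""
    parts = name.split("_")
    body, suffix = parts[:-1], parts[-1]
    # 'in' or 'out' marks a counter taken at request start or finish.
    direction = next((t for t in ("in", "out") if t in body), None)
    if suffix in COUNT_SUFFIXES and direction is not None:
        kind = "count"
    elif suffix in LATENCY_SUFFIXES and "time" in body:
        kind = "latency"
    else:
        # System items such as memory or sys stats do not follow these terms.
        kind = "other"
    return {
        "kind": kind,
        "suffix": suffix if kind != "other" else None,
        "direction": direction,
        "error": "error" in parts,
    }
```

For a name shaped like object_get_in_one (used here only as an illustration), this reports a count over the exometer time window taken at request start; names that do not match the assumed layout fall into "other".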
Categories:
- S3 API stats - items starting with the prefix service, bucket, list, multiple_delete, object or multipart (names for S3 APIs) are stats for those APIs, typically followed by a term like put, get or delete.
- Stanchion access stats - items starting with the prefix velvet cover latency and counts of access to the Stanchion process for creating/updating/deleting buckets or creating users. They are useful for telling whether the major latency of slow requests lies in Stanchion.
- riakc - items starting with the prefix riakc cover latency and call counts of the Riak PB API. riakc is usually followed by an operation like put or get and its target like manifests or blocks. They are also useful for finding where major latency comes from, such as getting a user record, getting a bucket record, or updating manifests.
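The categories above amount to a prefix lookup, which can be sketched as follows. This is an illustration only: categorize and the category labels are hypothetical names, and the special case for object_web_ reflects the item list in this document, where those items are Mochiweb metrics rather than S3 API stats.

```python
S3_API_PREFIXES = ("service", "bucket", "list", "multiple_delete", "object", "multipart")


def categorize(name):
    """Map a metric item name to one of the categories described above."""
    if name.startswith("velvet_"):
        return "stanchion_access"
    if name.startswith("riakc_"):
        return "riakc"
    # object_web_* items are Mochiweb metrics, although they share the 'object' prefix.
    if name.startswith("object_web_"):
        return "other"
    for prefix in S3_API_PREFIXES:
        if name == prefix or name.startswith(prefix + "_"):
            return "s3_api"
    return "other"
```

Grouping a stats dump by categorize(name) gives a quick view of whether slow requests spend their time in Stanchion calls, in Riak PB API calls, or elsewhere.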
As there are more than 1000 items, this section describes only the major prefixes.
- service_get - GET Service
- bucket_(put|head|delete) - PUT, HEAD, DELETE Bucket
- bucket_acl_(get|put) - PUT, GET Bucket ACL
- bucket_policy_(get|put|delete) - PUT, GET, DELETE Bucket Policy
- bucket_location_get - GET Bucket Location
- list_uploads - listing all multipart uploads
- multiple_delete - Delete Multiple Objects
- list_objects - listing all objects in a bucket, equivalent to GET Bucket
- object_(get|put|delete) - GET, PUT, DELETE, HEAD Objects
- object_put_copy - PUT Copy Object
- object_acl - GET, PUT Object ACL
- multipart_post - initiate a multipart upload
- multipart_upload_put - PUT Multipart Upload, putting a part of an object by copying from an existing object
- multipart_upload_post - complete a multipart upload
- multipart_upload_delete - delete a part of a multipart upload
- multipart_upload_get - get a list of parts in a multipart upload
- velvet_create_user - requesting Stanchion to create a user
- velvet_update_user - requesting Stanchion to update a user
- velvet_create_bucket - requesting Stanchion to create a bucket
- velvet_delete_bucket - requesting Stanchion to delete a bucket
- velvet_set_bucket_acl - requesting Stanchion to update a bucket ACL
- velvet_set_bucket_policy - requesting Stanchion to put a new bucket policy
- velvet_delete_bucket_policy - requesting Stanchion to delete the policy of a bucket
- riakc_ping - ping to the PB API, invoked by /riak-cs/ping
- riakc_get_cs_bucket - getting a bucket record
- riakc_get_cs_user_strong - getting a user record with PR=all
- riakc_get_cs_user - getting a user record with R=quorum and PR=one
- riakc_put_cs_user - putting a user record after creating/deleting a bucket
- riakc_get_manifest - getting a manifest
- riakc_put_manifest - putting a manifest
- riakc_delete_manifest - deleting a manifest (invoked via GC)
- riakc_get_block_n_one - getting a block with N=1 without sloppy quorum
- riakc_get_block_n_all - getting a block with N=3 after the N=1 get failed
- riakc_get_block_remote - getting a block after the N=3 get resulted in not found
- riakc_get_block_legacy - getting a block when the N=1 get is turned off
- riakc_put_block - putting a block
- riakc_put_block_resolved - putting a block when block sibling resolution is invoked
- riakc_head_block - heading a block, invoked via GC
- riakc_delete_block_constrained - first attempt to delete a block with PW=all
- riakc_delete_block_secondary - second attempt to delete a block with PW=quorum, after PW=all failed
- riakc_(get|put)_gc_manifest_set - invoked when a manifest is being moved to the GC bucket
- riakc_(get|delete)_gc_manifest_set - invoked when manifests are being collected
- riakc_(get|put)_access - getting access stats, putting access stats
- riakc_(get|put)_storage - getting storage stats, putting storage stats
- riakc_fold_manifest_objs - invoked inside GET Bucket (listing objects within a bucket)
- riakc_mapred_storage - stats on the performance of each MapReduce job
- riakc_list_all_user_keys - all users are listed when starting storage calculation
- riakc_list_all_manifest_keys - only used when deleting a bucket to verify it is empty
- riakc_list_users_receive_chunk - listing users, invoked via the /riak-cs/users API
- riakc_get_uploads_by_index
- riakc_get_user_by_index
- riakc_get_gc_keys_by_index
- riakc_get_cs_buckets_by_index
- riakc_get_clusterid - invoked the first time a proxy_get is performed
manifest_siblings_bp_sleep
-
pool
-
object_web_active_sockets
-
object_web_waiting_acceptors - number of inactive acceptor processes in mochiweb
-
object_web_port
-
memory_* - memory stats same as Riak
-
nodename, connected_nodes - same as Riak, but useless in CS
-
sys_* - system stats same as Riak
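Several prefixes in the list above use a shorthand such as bucket_(put|head|delete), where each parenthesized group stands for one of its alternatives. A small sketch of expanding that shorthand into concrete prefixes is shown below; expand is a hypothetical helper written for this document's notation only, and it leaves wildcards such as memory_* untouched.

```python
import itertools
import re

# Matches one parenthesized group of alternatives, e.g. "(get|put)".
_GROUP = re.compile(r"\(([^()]*)\)")


def expand(pattern):
    """Expand a prefix pattern like 'bucket_(put|head|delete)' into concrete prefixes."""
    # With one capture group, re.split alternates literal text (even indices)
    # and group contents (odd indices).
    pieces = _GROUP.split(pattern)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(pieces)]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

For example, expand("riakc_(get|put)_access") yields riakc_get_access and riakc_put_access, which can then be used as prefixes when filtering the output of the status command or the stats endpoint.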
- bucket_create
- bucket_delete
- bucket_put_acl
- user_create
- user_update
- riakc_ping
- riakc_get_cs_bucket
- riakc_put_cs_bucket
- riakc_delete_cs_bucket
- riakc_get_cs_user_strong
- riakc_get_cs_user
- riakc_put_cs_user
- riakc_get_manifest
- riakc_list_all_user_keys
- riakc_list_all_manifest_keys
- riakc_list_users_receive_chunk
- riakc_get_user_by_index
- riakc_get_gc_keys_by_index
- riakc_get_cs_buckets_by_index
- stanchion_server_msgq_len
- waiting_time
Riak CS
- Enrich stats items [JIRA: RCS-217] #961
- Introduce Exometer #1165
- Add latency stats items to S3 API and velvet calls [JIRA: RCS-220] #1180
- Add latency stats for riak pb client operations [JIRA: RCS-243] #1189
- Add status around PB pools, memory, system and mochiweb [RCS-244] #1194
- Add stanchion stats test for API and command #1199
Stanchion
- Introduce metrics. #92
- Add stats to Stanchion #98
- Feature/stats2 #99
- [http://docs.basho.com/riakcs/latest/cookbooks/Monitoring-and-Metrics/]