-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collect height and state data from chain #96
Conversation
Example result:
|
981b6ca
to
51578be
Compare
Rebased |
Help: "Chain state hash in specific height", | ||
}, | ||
[]string{ | ||
"host", "hash", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know which block number this state corresponds to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of this metric is a block number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.464418e+06
neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.464418e+06
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure @532910 will be excited about it, but we can try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, we should set some float/int value for the metric value. It can be 1
, and we move the block number to the tags or use this number as a metric value. I consider block number as a value is a good choice, but I don't mind changing it, if required
} | ||
|
||
func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) { | ||
cli, err := rpcclient.New(ctx, endpoint, opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This context better be time-limited. But in order to make use of it you need to create a client for every set of requests (#3026). While it's suboptimal, it's still a good enough choice for the purpose, we query nodes every N seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context is still not limited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The context is limited here when Dialing
pkg/multinodepool/pool.go
Outdated
} | ||
|
||
mu.Lock() | ||
states = append(states, monitor.StateData{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe channels for this data? Locks are not really fun and you can avoid work groups as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
51578be
to
39f6a54
Compare
Help: "Chain state hash in specific height", | ||
}, | ||
[]string{ | ||
"host", "hash", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure @532910 will be excited about it, but we can try.
pkg/pool/pool.go
Outdated
if err != nil { | ||
return nil, fmt.Errorf("can't init neo-go client: %w", err) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're touch ordinary pool
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returned
pkg/multinodepool/pool.go
Outdated
func (c *Pool) FetchState(height uint32) []monitor.StateData { | ||
var ( | ||
states []monitor.StateData | ||
wg sync.WaitGroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wg
can be avoided, just read enough data from the channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried this way before and unfortunately, the channel will be closed faster than any data will be written to it. As a result panic with: write to the closed channel
pkg/multinodepool/pool.go
Outdated
wg sync.WaitGroup | ||
) | ||
|
||
stateChan := make(chan monitor.StateData, len(c.clients)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not in var
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
} | ||
|
||
func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) { | ||
cli, err := rpcclient.New(ctx, endpoint, opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context is still not limited.
7b812f6
to
ccae84b
Compare
heightData := m.heightFetcher.FetchHeight() | ||
|
||
for _, d := range heightData { | ||
chainHeight.WithLabelValues(d.Host).Set(float64(d.Value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why reset states but not reset the heights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neofs_net_monitor_chain_height{host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_height{host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06
The state contains a host and a hash and this hash every time is different. This fact leads us to the situation when Prometheus data grows for each request. In this case, eventually, the application will be terminated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but what if one host is down? for example, it can be down even forever but a monitor instance has not been restarted since. does it mean that the metrics will still have this host's height? and its height will be constant (and that will be not a true height cause there is no host and no height at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. If the host is down, then the data will be represented with the constant value. For the monitoring dashboard, it will be a straight line. In some ways, the situations are equivalent - the host has the wrong height.
There is question to the dashboard owner and viewer - should we reset data or left old values
@@ -204,6 +230,9 @@ func (m *Monitor) Job(ctx context.Context) { | |||
|
|||
m.processContainersNumber() | |||
|
|||
minHeight := m.processChainHeight() | |||
m.processChainState(minHeight) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what thoughts are behind getting the minimal height's state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nodes are not synchronized perfectly, we want to compare their states, some common index should be used for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, neo-go's understanding of "state" may differ from mine, ok
what if some node got stacked at some height forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think exactly this situation we want to detect. In this case, alerting should notify us about different states in the block or different heights on the nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case its height shouldn't be considered for the purpose of minHeight
calculation. But it can be the source of data for alerts (this difference in height).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just skip if get an error.
but it wont be an error i guess. just a node that always responds with the same height
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the wrong constant value may be a good point in this case. According to the newly added statistics, the state metric will stuck for all nodes in the same value (because minimal is a constant). This is a moment when something is already going wrong. Is a statistic about height from other nodes useful at this moment? I'm asking because not familiar with these moments of the system.
On the other hand, height metric will have differences and alert about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i asked, cause i think that the current state (a node has N height, and the state at the node is X) should be shown in the metrics. any logic about a state at the minHeight
is confusing, IMO. but i am not a person who looks at metrics (especially neo-go metrics). @roman-khimov is the right person to make a decision here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need something we can compare, therefore all of this logic. But nodes aren't perfectly synchornized, so exporting just the latest is a problem. Maybe we can export some set of values but that would mean a set of requests as well. Let's add an issue to improve this logic, we can add some filters for stale nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccae84b
to
75b8f8d
Compare
Closes #95. Signed-off-by: Evgenii Baidakov <[email protected]>
75b8f8d
to
ae56655
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rebased it, should be OK for the initial implementation.
Closes #95.