Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect height and state data from chain #96

Merged
merged 1 commit into from
Nov 14, 2023

Conversation

smallhive
Copy link
Contributor

Closes #95.

@smallhive
Copy link
Contributor Author

Example result:

# HELP neofs_net_monitor_chain_height Chain height in blocks
# TYPE neofs_net_monitor_chain_height gauge
neofs_net_monitor_chain_height{host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_height{host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

# HELP neofs_net_monitor_chain_state Chain state hash in specific height
# TYPE neofs_net_monitor_chain_state gauge
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

@smallhive smallhive force-pushed the 95-bring-state-height-and-root-metrics branch 2 times, most recently from 981b6ca to 51578be Compare October 18, 2023 07:13
@smallhive
Copy link
Contributor Author

smallhive commented Oct 18, 2023

Rebased

Help: "Chain state hash in specific height",
},
[]string{
"host", "hash",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know which block number this state corresponds to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of this metric is a block number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.464418e+06
neofs_net_monitor_chain_state{hash="0e194a031464d894a0857ab7699992d50bbc86e962513f0183a06c0c0f8cb64d",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.464418e+06

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure @532910 will be excited about it, but we can try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, we should set some float/int value for the metric value. It can be 1, and we move the block number to the tags or use this number as a metric value. I consider block number as a value is a good choice, but I don't mind changing it, if required

pkg/multinodepool/pool.go Show resolved Hide resolved
pkg/multinodepool/pool.go Outdated Show resolved Hide resolved
}

func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) {
cli, err := rpcclient.New(ctx, endpoint, opts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This context better be time-limited. But in order to make use of it you need to create a client for every set of requests (#3026). While it's suboptimal, it's still a good enough choice for the purpose, we query nodes every N seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context is still not limited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context is limited here when Dialing

pkg/multinodepool/pool.go Outdated Show resolved Hide resolved
}

mu.Lock()
states = append(states, monitor.StateData{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe channels for this data? Locks are not really fun and you can avoid work groups as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@smallhive smallhive force-pushed the 95-bring-state-height-and-root-metrics branch from 51578be to 39f6a54 Compare October 23, 2023 04:29
Help: "Chain state hash in specific height",
},
[]string{
"host", "hash",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure @532910 will be excited about it, but we can try.

pkg/multinodepool/pool.go Outdated Show resolved Hide resolved
pkg/pool/pool.go Outdated
if err != nil {
return nil, fmt.Errorf("can't init neo-go client: %w", err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're touch ordinary pool here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returned

func (c *Pool) FetchState(height uint32) []monitor.StateData {
var (
states []monitor.StateData
wg sync.WaitGroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wg can be avoided, just read enough data from the channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this way before and unfortunately, the channel will be closed faster than any data will be written to it. As a result panic with: write to the closed channel

wg sync.WaitGroup
)

stateChan := make(chan monitor.StateData, len(c.clients))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in var above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

}

func neoGoClient(ctx context.Context, endpoint string, opts rpcclient.Options) (*rpcclient.Client, error) {
cli, err := rpcclient.New(ctx, endpoint, opts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context is still not limited.

@smallhive smallhive force-pushed the 95-bring-state-height-and-root-metrics branch 2 times, most recently from 7b812f6 to ccae84b Compare October 24, 2023 05:21
cmd/neofs-net-monitor/monitor.go Outdated Show resolved Hide resolved
heightData := m.heightFetcher.FetchHeight()

for _, d := range heightData {
chainHeight.WithLabelValues(d.Host).Set(float64(d.Value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why reset states but not reset the heights?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neofs_net_monitor_chain_height{host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_height{host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc1.morph.t5.fs.neo.org:51331"} 2.399947e+06
neofs_net_monitor_chain_state{hash="9c112458c3b4731abccba255e4c80dbc288bf13567dca9f58d00a12909b8dd53",host="https://rpc2.morph.t5.fs.neo.org:51331"} 2.399947e+06

The state contains a host and a hash and this hash every time is different. This fact leads us to the situation when Prometheus data grows for each request. In this case, eventually, the application will be terminated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but what if one host is down? for example, it can be down even forever but a monitor instance has not been restarted since. does it mean that the metrics will still have this host's height? and its height will be constant (and that will be not a true height cause there is no host and no height at all)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. If the host is down, then the data will be represented with the constant value. For the monitoring dashboard, it will be a straight line. In some ways, the situations are equivalent - the host has the wrong height.
There is question to the dashboard owner and viewer - should we reset data or left old values

@@ -204,6 +230,9 @@ func (m *Monitor) Job(ctx context.Context) {

m.processContainersNumber()

minHeight := m.processChainHeight()
m.processChainState(minHeight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what thoughts are behind getting the minimal height's state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nodes are not synchronized perfectly, we want to compare their states, some common index should be used for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, neo-go's understanding of "state" may differ from mine, ok

what if some node got stacked at some height forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exactly this situation we want to detect. In this case, alerting should notify us about different states in the block or different heights on the nodes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case its height shouldn't be considered for the purpose of minHeight calculation. But it can be the source of data for alerts (this difference in height).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skip if get an error.

but it wont be an error i guess. just a node that always responds with the same height

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the wrong constant value may be a good point in this case. According to the newly added statistics, the state metric will stuck for all nodes in the same value (because minimal is a constant). This is a moment when something is already going wrong. Is a statistic about height from other nodes useful at this moment? I'm asking because not familiar with these moments of the system.

On the other hand, height metric will have differences and alert about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i asked, cause i think that the current state (a node has N height, and the state at the node is X) should be shown in the metrics. any logic about a state at the minHeight is confusing, IMO. but i am not a person who looks at metrics (especially neo-go metrics). @roman-khimov is the right person to make a decision here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need something we can compare, therefore all of this logic. But nodes aren't perfectly synchornized, so exporting just the latest is a problem. Maybe we can export some set of values but that would mean a set of requests as well. Let's add an issue to improve this logic, we can add some filters for stale nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smallhive smallhive force-pushed the 95-bring-state-height-and-root-metrics branch from ccae84b to 75b8f8d Compare October 25, 2023 03:53
Closes #95.

Signed-off-by: Evgenii Baidakov <[email protected]>
@roman-khimov roman-khimov force-pushed the 95-bring-state-height-and-root-metrics branch from 75b8f8d to ae56655 Compare November 14, 2023 19:55
Copy link
Member

@roman-khimov roman-khimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased it, should be OK for the initial implementation.

@roman-khimov roman-khimov merged commit bfb2f33 into master Nov 14, 2023
8 checks passed
@roman-khimov roman-khimov deleted the 95-bring-state-height-and-root-metrics branch November 14, 2023 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bring state height and root metrics into the project
3 participants