Hubspot cell balancer & normalizer #126

szabowexler · 2024-12-03T20:21:47Z

Overview

HubSpot is experimenting internally with a cellular architecture for some of our most critical HBase tables. At a high level, this involves using the first two bytes of a row key to partition the set of rows into n distinct partitions, or cells (n is fixed to 360 in this early draft). Two pieces make this work: invariants which establish that every region is entirely contained by a single cell (so no merging across cell lines), and an update to the balancer that makes region-to-server assignment cell aware. Together they allow us to trade off performance (maximized by a uniform distribution of data to regions to servers) against isolation (maximized by placing all cell data on a minimal set of servers with no overlap with other cells), with minimal (or even zero) impact on the size of the cluster required or on the operational burden of running it. There are two major parts to this PR, and I'll address them in separate sections. At a high level, here are the outcomes this PR enables:

- Every region contains the data for exactly one cell, regardless of how long the cluster runs and how many merge or split operations occur
- Subject to configuration, the number of regions per server will generally be uniform
- Subject to configuration, the number of cells per server will generally be uniform
- Subject to configuration, the number of cells per server will be capped, offering the ability to control the performance-to-isolation tradeoff dynamically
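To make the partitioning concrete, here is a minimal sketch of how a cell id could be read from a row key and how a region's key range could be checked for cell alignment. It assumes the two-byte prefix is an unsigned big-endian integer (the commit history mentions reverting to big endian, but the exact encoding is not spelled out here), and the function names `cell_of` and `region_is_single_cell` are hypothetical, not the names used in the PR. Regions with an empty or short boundary key are treated conservatively as not aligned.

```python
def cell_of(row_key: bytes) -> int:
    """Decode the first two bytes of a row key as a cell id.

    Assumption: the prefix is an unsigned big-endian 16-bit integer.
    """
    if len(row_key) < 2:
        raise ValueError("row key too short to carry a two-byte cell prefix")
    return int.from_bytes(row_key[:2], "big")


def region_is_single_cell(start_key: bytes, stop_key: bytes) -> bool:
    """Return True if the key range [start_key, stop_key) stays within one cell.

    Hypothetical helper: open-ended regions (empty keys) are reported as
    not aligned rather than guessed at.
    """
    if len(start_key) < 2 or len(stop_key) < 2:
        return False
    start_cell = cell_of(start_key)
    stop_cell = cell_of(stop_key)
    if start_cell == stop_cell:
        return True
    # The stop key is exclusive, so a stop key that is exactly the next
    # cell's bare two-byte prefix still excludes every row of that cell.
    return stop_cell == start_cell + 1 and len(stop_key) == 2
```

For example, `cell_of(b"\x00\x05user-42")` decodes to cell 5 under this assumed encoding.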

The cluster we use for testing has 550-600 region servers and 31,000-34,000 regions, and holds about 130 TB of compressed data.

Normalizer

We assume that the initial set of regions in the cluster is cell-aligned: all of the data within every region comes from a single cell, although a single cell may (and probably will) require more than one region to represent it. We then update the normalizer so that routine cluster operations (merging regions that are too small, or splitting regions that are too large) cannot break this alignment. Because every region starts contained within a cell, and a split only subdivides an existing region, we do not need to adjust the split logic. We do need to adjust merging. In particular, when computing runs of small regions that may be merged, cell-aware tables should end a run at a cell boundary, so that a run never includes regions from more than one cell. These changes are contained within the SimpleRegionNormalizer, and are relatively simple.
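The merge rule can be sketched as follows. This is a simplified illustration, not the SimpleRegionNormalizer code: the `Region` type, the `small_mb` threshold, and the `mergeable_runs` function are all hypothetical, and the real normalizer applies additional criteria when planning merges. The only point shown is that a run of small regions ends as soon as the next region belongs to a different cell.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str
    cell: int      # cell id derived from the region's start key
    size_mb: int


def mergeable_runs(regions: List[Region], small_mb: int) -> List[List[Region]]:
    """Group adjacent small regions into merge runs that never cross a cell.

    Assumes `regions` is sorted by start key, so list neighbours are
    key-range neighbours.
    """
    runs: List[List[Region]] = []
    current: List[Region] = []
    for region in regions:
        is_small = region.size_mb < small_mb
        same_cell = not current or current[-1].cell == region.cell
        if is_small and same_cell:
            current.append(region)
            continue
        # The run ends here: either the region is large or the cell changed.
        if len(current) >= 2:
            runs.append(current)
        current = [region] if is_small else []
    if len(current) >= 2:
        runs.append(current)
    return runs
```

A run of length one is dropped because there is nothing to merge it with inside its own cell.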

Balancer

The cell-aware balancer represents the bulk of the complexity for this change. First, a quick review of HBase's balancer. HBase models the balance of a cluster (or of a table on a cluster) as an objective function, which is a scaled sum of constituent cost functions. A static threshold determines when the table is unbalanced enough to require rebalancing. When a rebalance is required, a stochastic greedy algorithm is used, which approximates gradient descent: generator functions are randomly selected to propose iterative cluster state changes, which are accepted only if they improve the global balance state for the cluster. This differentiates the approach from, e.g., simulated annealing, which might permit a temporary worsening to avoid getting caught in a local optimum.
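As a rough illustration of that accept-only-if-better loop, here is a toy version. It is not the HBase balancer: the cluster state is reduced to a list of region counts per server, and `spread_cost`, `move_one_region`, and `greedy_balance` are hypothetical names chosen for this sketch.

```python
import random
from typing import Callable, List, Optional, Sequence

State = List[int]  # toy cluster state: region count per server
Generator = Callable[[State, random.Random], Optional[State]]


def spread_cost(state: State) -> float:
    """Toy cost: squared deviation of each server from the mean region count."""
    mean = sum(state) / len(state)
    return sum((n - mean) ** 2 for n in state)


def move_one_region(state: State, rng: random.Random) -> Optional[State]:
    """Toy generator: propose moving one region between two random servers."""
    src, dst = rng.sample(range(len(state)), 2)
    if state[src] == 0:
        return None  # nothing to move, so no proposal this round
    proposal = list(state)
    proposal[src] -= 1
    proposal[dst] += 1
    return proposal


def greedy_balance(state: State, cost: Callable[[State], float],
                   generators: Sequence[Generator], max_steps: int,
                   rng: random.Random) -> State:
    """Keep a randomly proposed change only if it strictly lowers the cost."""
    best, best_cost = list(state), cost(state)
    for _ in range(max_steps):
        candidate = rng.choice(generators)(best, rng)
        if candidate is None:
            continue
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:  # never accept a worsening
            best, best_cost = candidate, candidate_cost
    return best


balanced = greedy_balance([10, 0, 0, 2], spread_cost, [move_one_region],
                          500, random.Random(0))
```

Because a worsening is never accepted, the returned state can cost no more than the starting state, but the loop can stall in a local optimum, which is the tradeoff against simulated annealing noted above.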

Accordingly, this PR introduces a new cost function (HubSpotCellCostFunction) and a new candidate generator (HubSpotCellBasedCandidateGenerator) that proposes incremental changes to the cluster state.

Cell cost function

Our cost function captures two guiding principles for a cellular table, with s servers, r regions, and c cells:

1. The number of regions per server should be uniform (i.e. one of ⌊r/s⌋ or ⌊r/s⌋ + 1)
2. The number of cells per server should be maximized, but no greater than a configurable bound

Our cost function has two parts, one for each of these principles. To capture (1), it includes the raw count of servers whose region count is outside the two permissible values listed above. To capture (2), we consider the concept of a server-cell (i.e. the unique instance of a given cell on a given server), and a desired capacity defined as the maximum cells per server multiplied by s. In our cluster (with 585 servers), if the bound on cells per server were 36 (10% of the 360 cells), then the capacity would be 585 × 36 = 21,060 server-cells. The imbalance of the cluster (from a cell-to-server perspective) is the number of overloaded server-cells on the cluster, divided by that capacity. A server-cell is overloaded if its server already holds the configured bound of cells without counting it. As a concrete example: if a server has 50 regions, each for a different cell (so 50 distinct cells), and the bound is 36, then that server has 50 - 36 = 14 excess server-cells. The sum of excess server-cells over all servers, divided by the overall capacity of server-cells, provides a normalized metric (ranging from 0 to 1) describing imbalance.
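The two components can be sketched as below. This is an illustration under assumptions, not the HubSpotCellCostFunction itself: the function names are hypothetical, the way the two parts are weighted and combined is not shown, and the clamp to 1.0 is an assumption added here because the raw ratio could exceed 1 if servers held far more cells than the bound.

```python
from typing import List, Set


def region_count_violations(regions_per_server: List[int]) -> int:
    """Count servers whose region count is neither floor(r/s) nor floor(r/s) + 1."""
    floor = sum(regions_per_server) // len(regions_per_server)
    return sum(1 for n in regions_per_server if n not in (floor, floor + 1))


def excess_server_cells(cells_per_server: List[Set[int]], bound: int) -> int:
    """Sum, over all servers, of distinct cells held beyond the bound."""
    return sum(max(0, len(cells) - bound) for cells in cells_per_server)


def cell_overload_cost(cells_per_server: List[Set[int]], bound: int) -> float:
    """Excess server-cells divided by capacity (bound * s), clamped to [0, 1]."""
    capacity = bound * len(cells_per_server)
    if capacity == 0:
        return 0.0  # assumption: an empty or unbounded-at-zero setup is balanced
    return min(1.0, excess_server_cells(cells_per_server, bound) / capacity)


# Worked example from the text: 585 servers, bound of 36 cells per server,
# one server holding 50 distinct cells and the rest sitting at the bound.
cluster = [set(range(50))] + [set(range(36)) for _ in range(584)]
capacity = 36 * len(cluster)
excess = excess_server_cells(cluster, 36)
```

With these inputs the capacity is 21,060 server-cells and the single overloaded server contributes 14 excess server-cells, matching the numbers in the paragraph above.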

This construction relies on the step generator acting to maximize the count of distinct cells per server, and the stochastic harness to discard proposed solutions that might go over the configured limit.

Cell step generator

The step generator is by far the most complex component of this change. It is highly stochastic: wherever a step below says "pick", we use online reservoir sampling to make an efficient single-pass random selection (subject to some filtering or optimizing criterion). With that in mind, the step generator operates as follows.

Consider a cellular table, with s servers, r regions, and c cells:

1. If there are any "underloaded" servers (region count is less than ⌊r/s⌋):
   i. If there are any servers with excess supply (region count is more than ⌊r/s⌋), pick a server with the most regions, and move a region from it to the underloaded server
2. If there are any "overloaded" servers (region count is greater than ⌊r/s⌋ + 1):
   i. If there are any servers with excess capacity (region count is exactly ⌊r/s⌋), pick one such server and move a region from the overloaded server to it
3. If all servers have exactly the target cell count, return NOOP
4. If any servers have more than the target cell count:
   i. pick a server with the most cells
   ii. pick the least frequent cell (by representing region) on that server
   iii. pick another server and cell such that the other server's cell is present on the picked server from (i)
   iv. swap our selected server/cell pairs
5. If any servers have less than the target cell count:
   i. pick a server with the fewest cells
   ii. pick the most frequent cell (by representing region) on that server
   iii. pick another server and cell such that the other cell is not present on our selected server
   iv. swap our selected server/cell pairs

This approach is not recommended for target cells-per-server counts approaching 1 (anything below 4 is unlikely to converge), but it should work very well for bounds that represent 30-70% of a given server's region capacity.
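The steps above can be sketched in a few dozen lines. This is a simplified model, not the HubSpotCellBasedCandidateGenerator: each server is represented as a list of cell ids (one entry per hosted region), the action tuples and the names `pick` and `propose_step` are hypothetical, and the choice of which region to move in the first two steps is an arbitrary random one. In the step for servers above the target, the partner cell is also required to differ from the cell being given away, which is an assumption added so the swap is not a no-op.

```python
import random
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def pick(items: Iterable[T], accept: Callable[[T], bool],
         rng: random.Random) -> Optional[T]:
    """Single-pass reservoir sample (size 1) over the items passing `accept`."""
    chosen, seen = None, 0
    for item in items:
        if accept(item):
            seen += 1
            if rng.randrange(seen) == 0:  # keep the k-th match with probability 1/k
                chosen = item
    return chosen


def propose_step(servers: List[List[int]], target_cells: int,
                 rng: random.Random) -> Tuple:
    """Return ("move", src, cell, dst), ("swap", a, cell_a, b, cell_b) or ("noop",)."""
    ids = range(len(servers))
    counts = [len(regions) for regions in servers]
    floor = sum(counts) // len(servers)

    # Steps 1 and 2: even out region counts before touching cells.
    under = pick(ids, lambda i: counts[i] < floor, rng)
    if under is not None:
        donor = pick(ids, lambda i: counts[i] == max(counts) and counts[i] > floor, rng)
        if donor is not None:
            return ("move", donor, rng.choice(servers[donor]), under)
    over = pick(ids, lambda i: counts[i] > floor + 1, rng)
    if over is not None:
        receiver = pick(ids, lambda i: counts[i] == floor, rng)
        if receiver is not None:
            return ("move", over, rng.choice(servers[over]), receiver)

    # Step 3: nothing to do if every server holds exactly the target cell count.
    distinct = [len(set(regions)) for regions in servers]
    if all(d == target_cells for d in distinct):
        return ("noop",)

    if max(distinct) > target_cells:
        # Step 4: give away the rarest cell, take one the server already holds.
        a = pick(ids, lambda i: distinct[i] == max(distinct), rng)
        freq = Counter(servers[a])
        cell_a = min(freq, key=freq.get)
        wanted = lambda c: c != cell_a and c in freq
    else:
        # Step 5: give away the most common cell, take one not yet present.
        a = pick(ids, lambda i: distinct[i] == min(distinct), rng)
        freq = Counter(servers[a])
        if not freq:
            return ("noop",)  # an empty server has nothing to swap
        cell_a = max(freq, key=freq.get)
        wanted = lambda c: c not in freq

    pairs = ((b, c) for b in ids if b != a for c in sorted(set(servers[b])))
    partner = pick(pairs, lambda pair: wanted(pair[1]), rng)
    if partner is None:
        return ("noop",)
    return ("swap", a, cell_a, partner[0], partner[1])
```

Returning a description of the step rather than mutating the state keeps the sketch compatible with a harness that evaluates the proposal and discards it if the overall cost does not improve.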

Future work

- The cell candidate generator currently combines two distinct priorities: (1) that all servers have the same number of regions, and (2) that those regions obey certain properties. This overlaps with other balancer functions, and is not fully aligned with the intent of the design of the balancer. At the same time, it dramatically simplifies the logic of cell distribution to be able to assume that when the candidate generator begins shuffling cells it knows exactly how many regions exist per server. The only way to achieve this is for the generator to enforce that balance itself, so that it can cope with other action generators mutating the cluster state between rounds of step generation by the cell balancer.
- Cell-based balancing is really just balancing subject to specific constraints/groupings on the first n bytes of row keys. It should be possible to generalize this concept to effectively be a "prefix-based" balance, where enabling this mode, setting the number of bytes, and defining the grouping mode and constraints are passed via TableDescriptor. We are already starting to think about the concept of balancer conditionals (see initial work here), and a future implementation of this, well generalized, might build on that concept. In keeping with this concept, the normalizer currently offers restrictions on how regions can be split (with e.g. the DelimitedKeyPrefixRegionSplitRestriction), but there is no corresponding restriction on merges. A simple prefix-based merge restriction could also be implemented to simplify how we maintain groupings and isolation.
- When a given server has reached saturation (either at ⌊r/s⌋ cells or at the configured ceiling, whichever is lower), if it has excess capacity for regions the current logic randomizes the distribution of cells for those regions. In a fictitious case where a server has 50 regions and a cap of 36 cells, it's possible that we have one region for each of 35 cells, and 15 regions for the remaining cell. This may be highly undesirable from a performance standpoint (and we would be better served by a distribution with 2 regions for each of 14 cells, and 1 region for the remaining 22). Fixing this would require the logic for redistributing cell-regions to also minimize the number of regions per cell on each server, which it does not do today.
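The preferred distribution in that last example is just an even split of regions over cells, which can be checked with a few lines of arithmetic. The helper name `even_spread` is hypothetical and only illustrates the target shape, not how a generator would reach it.

```python
from typing import List


def even_spread(regions: int, cells: int) -> List[int]:
    """Spread `regions` over `cells` so per-cell counts differ by at most one."""
    if cells <= 0:
        raise ValueError("cells must be positive")
    base, extra = divmod(regions, cells)
    # `extra` cells carry one additional region; the rest carry `base`.
    return [base + 1] * extra + [base] * (cells - extra)


skewed = [15] + [1] * 35      # one region for each of 35 cells, 15 for the last
even = even_spread(50, 36)    # 2 regions for 14 cells, 1 region for the other 22
```

Both layouts place 50 regions across 36 cells, but the even one caps any single cell at 2 regions on the server instead of 15.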

Ray Mattingly and others added 30 commits October 16, 2024 14:09

HubSpotCellCostFunction

9c62045

Adjust to handle little endian cell encoding

e4f5a14

Mark as private

aad121f

Revert to big endian, simplify heuristics

9b5002b

Fix NPE, add logging, run spotless

9a954dd

Clean up

6271c26

Add init debug

995b8cb

Clarify expectations via preconditions

d94d862

Update debug and add guard for non default tables

8202674

Emit setup at info level to ensure we see it

f83dc2e

Add info state dump on every cost calc call

8c6c48c

Add some debug so we can see why regionlocation would be null

ab52ea6

emit if we disable locationfinder

0ac73bb

Ensure the region finder is set if the cell cost function exists

275ba6b

Emit the multiplier

1f58743

Missed one spot

57205da

Fix debug

a9e1547

skip any that snuck in, emit better logs, and fail more obviously here

b59c17c

include count w/o servers

840496d

Make it legible

d7081eb

list details of the unknown region

856e440

Emit which table

7df5fc2

Skip if empty region server mapping, assume it's empty for now

a1d849f

Emit the cells in the region here

45c8182

Tell us about which cells this region holds

1a67f81

Add emission for region size

5328064

Make clear if we skip any non-empty regions

4490e1b

Include this

580e31c

If the first two bytes of start/stop are the same, the region holds exactly one cell

ba831d9

Correct how we calculate the cells

cdd6e77

Elias Szabo and others added 30 commits December 16, 2024 11:23

Emit the cluster state at the end of balance

71ecc2b

Revert "Remove all the debugging changes, generally make ready for real review & merge"

da0834e

Merge pull request #133 from HubSpot/revert-129-clean-up-for-merge

59b41c2

Revert "Remove all the debugging changes, generally make ready for real review & merge"

Measure this distance by region count from balanced

25e0cf9

Set target to 20% of cells

c2fde1d

Split single function into two, and try simple random shuffling

9719fa7

Start by just testing live

282bbf2

Start at 50%

1f302b1

Merge pull request #138 from HubSpot/split-and-simplify-prefix-balance

6af20a2

Split single function into two, and try simple random shuffling

Add the ratio and costs

af8ebd5

Merge pull request #139 from HubSpot/split-and-simplify-prefix-balance

4722617

Add the ratio and costs

We were computing perf:iso instead of iso:perf

b0d303c

Add a simple debug

b19d456

Use dispersion to describe the concept

b8056a4

Fix the formula

8d6bc95

Better log

a5418a9

Allow to run using build, for now

de9ed00

Merge pull request #140 from HubSpot/swap-to-dispersion

70ad820

Use dispersion to describe the concept

Include table names

7fb24a4

Fix mismatch

08b6b76

Format that

943a423

Mark that we only need the cost function if all regions are not system table regions

661c8ff

Only needed if multipler is nonzero as well

d0641d8

Emit the initial cost

1b821c1

Avoid a divide by zero, just treat this server as balanced

5873254

Have to init this

982eba5

Make sure it's initialized

ce958fe

Only do full prep if needed

c23cf77

Include table names

7f9b302

Try setting to 0.5

6945130
