HBASE-29585 Add row-level cache for the get operation #7291
Conversation
This is a great idea, thanks for sharing it. I do have some comments, though:
- Can the RowCacheService be an implementation of BlockCache? Maybe a wrapper around the LRUBlockCache. I'm a bit worried about introducing a whole new layer that intercepts all read/write operations at the RPC service with cache-specific logic, when this class is not the cache implementation itself. It seems a bit confusing to have a completely separate entry point to the cache.
- Are we accepting having the same row data in multiple caches? In the current code, I haven't seen any checks to avoid that. Maybe if we implement RowCacheService as a block cache implementation, so that the cache operations happen in the inner layers of the read/write operations, it would be easier to avoid duplication.
- Why not simply evict the row that got mutated? I guess we cannot simply override it in the cache because a mutation can happen on individual cells.
- Are we accepting having data duplicated across separate caches? I don't see any logic to avoid caching a whole block containing the row for a Get in the L2 cache, while we'll still be caching the row in the row cache. Similarly, we might re-cache a row that's in the memstore in the row cache.
- One problem of adding such small units (a single row) to the cache is that we need to keep a map index for each entry. So the smaller the rows, the more rows fit in the cache, but the more key objects are retained in the map. In your tests, assuming the default block cache size of 40% of the heap, that would give a 12.8 GB block cache. Have you managed to measure the block cache usage by the row cache, in terms of number of rows in the cache, byte size of the L1 cache, and the total heap usage? Maybe worth collecting a heap dump to analyse the map index size in the heap.
RegionScannerImpl scanner = getScannerInternal(region, scan, results);

// The row cache is ineffective when the number of store files is small. If the number
Can you elaborate more on this? Is it really a matter of number of files or total store file size? For a single CF table, where a given region, after major compaction, has a 10GB store file, wouldn't this be more efficient?
Get performance is more affected by the number of StoreFiles than by their size. This is because a StoreFileScanner must be created for each StoreFile, and the process of aggregating their results into a single Get result becomes increasingly complex. However, in testing, I found that when there was only one StoreFile, the row cache provided almost no performance benefit. Therefore, I added this condition to prevent the row cache from unnecessarily occupying BlockCache space.
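For illustration only, here is a minimal sketch of the kind of store-file-count guard being discussed. The class, method, and threshold names are hypothetical, not the code in this PR:

```java
import java.util.List;
import org.apache.hadoop.hbase.regionserver.HRegion;
import org.apache.hadoop.hbase.regionserver.HStore;

// Hypothetical sketch of the guard discussed above; not the actual patch.
final class RowCacheGuardSketch {
  // Caching a merged row only pays off when a Get would otherwise have to open
  // and merge results from several StoreFileScanners.
  private static final int MIN_STORE_FILES_FOR_ROW_CACHE = 2;

  static boolean rowCacheWorthwhile(HRegion region) {
    List<HStore> stores = region.getStores();
    return stores.stream()
      .anyMatch(store -> store.getStorefilesCount() >= MIN_STORE_FILES_FOR_ROW_CACHE);
  }
}
```

As the follow-up below shows, this guard was eventually dropped, so the sketch only documents the trade-off being debated here.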
Ahh, right, so the main gain here comes from avoiding the merge of results from different store file scanners. I guess there could still be benefits in doing this row caching for gets only, even when there is only one store file. Say the L2 cache is already at capacity; long client scans could cause evictions of blocks used by gets for repeating keys.
Yes, I agree. I’ll remove the condition to cache only when the number of StoreFiles is above a threshold, and always cache the row.
Fixed in e33db29.
private boolean tryGetFromCache(HRegion region, RowCacheKey key, Get get, List<Cell> results) {
  RowCells row =
    (RowCells) region.getBlockCache().getBlock(key, get.getCacheBlocks(), false, true);
RowCacheKey uses the region encoded name for indexing, whilst BlockCacheKey uses (store file name + offset). If the given row is already cached in an L2 cache block, this call will fail to fetch it and we'll cache it in the L1 too.
I initially intended for the row cache to reside only in L1 and not be cached in L2, but I haven’t actually implemented that yet. I’ll give further thought to adding this.
Thank you for starting the PR review.
BlockCache operates at the HFile access layer, whereas the row cache needs to function at a higher layer that covers both MemStore and HFile. That’s why I implemented RowCacheService in the RPC service layer. That said, the row cache does not actually cache HFileBlocks, yet it currently relies on the BlockCache interface. I realize this might not be appropriate. I reused the BlockCache interface to reduce the overhead of creating a separate cache implementation solely for the row cache, but in hindsight, this might not have been the best approach. It may be better to build a dedicated cache implementation specifically for the row cache. What do you think?
What exactly does “multiple caches” refer to? Does it mean the L1 and L2 caches in the CombinedBlockCache? If so, I haven’t really considered that aspect yet, but I’ll start looking into it.
I didn’t fully understand the intention behind your question. Could you please explain it in more detail?
This is in the same L1/L2 context as your comment 2, correct? If so, I haven’t considered that aspect yet, but I’ll start thinking about how to handle it. Since the row cache is only enabled when there are at least two HFiles, rows that exist only in the MemStore are not cached. However, when there are two or more HFiles, rows in the MemStore are also added to the row cache. This is an intentional design choice, aimed at bypassing the process of generating results via SegmentScanner and StoreFileScanner, and instead serving Get requests directly from the cache.
I slightly modified the LruBlockCache code to record the row cache size and entry count. The row cache occupies 268.67MB with 338,602 entries. The average size of a single row cache entry is 830 bytes. Within the overall BlockCache, the row cache accounts for 45% by entry count and 2% by size.
Yeah, I had the same thought while going through the comments. Having a separate cache structure seems the best way to implement this.
Nevermind my previous comment. We should focus on the separate cache for rows.
Rather than blocking writes to the row cache during updates/bulkload, can we simply make the updates evict/override the row from the cache if it's already there? For puts, we shouldn't need to worry about barriers, if we make sure we don't cache the row when it's in the memstore only, but we should make sure to remove it from the row cache because the cache would now be stale. For bulkloads, I guess we only need to make sure to evict the rows for affected regions after the bulkload has been committed.
Per other comments, I agree it's fine to have the row in the row cache and its block also in the block cache. We need to decide if we want to add blocks to the block cache when doing a Get, or whether a Get should cache only in the row cache. Also, should we avoid caching if the row is in the memstore? Could be challenging in the current design of caching the whole row, because the memstore might have only updates for a few cells within a row.
What if more rows get cached, over time, as more gets for different rows are executed? It could lead to many rows in the cache, and many more objects in the map to index it. In the recent past, we've seen some heap issues when having a very large file-based bucket cache and small compressed blocks. I guess we could face similar problems here too.
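To make the eviction-on-write idea from the comment above concrete, here is a rough, hypothetical sketch; none of these names or key shapes come from the PR. After a put or a bulkload commits, the affected rows are simply invalidated so later Gets repopulate them:

```java
import java.util.Arrays;
import com.github.benmanes.caffeine.cache.Cache;

// Hypothetical illustration only; key shape and method names are not from the PR.
final class EvictOnWriteSketch {

  // After a single-row mutation commits, drop that row's cached merge.
  static void onMutationCommitted(Cache<String, Object> rowCache,
      String regionEncodedName, byte[] row) {
    rowCache.invalidate(rowKey(regionEncodedName, row));
  }

  // After a bulkload commits, drop every cached row of the affected region.
  static void onBulkLoadCommitted(Cache<String, Object> rowCache, String regionEncodedName) {
    String prefix = regionEncodedName + ":";
    rowCache.asMap().keySet().removeIf(key -> key.startsWith(prefix));
  }

  private static String rowKey(String regionEncodedName, byte[] row) {
    // Illustrative key: region encoded name plus the row bytes.
    return regionEncodedName + ":" + Arrays.toString(row);
  }
}
```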
The design doc looks good. I skimmed the code; it seems we put the row cache into the block cache? Mind explaining more on why we chose to use the block cache to implement the row cache? What is the benefit? Thanks.
When the data exists in both the MemStore and the StoreFiles, we need to store it in the row cache to avoid result merging. In that case, due to the following issues, a barrier was introduced.
It would be more efficient to do as you mentioned when doing a bulkload.
I already answered this in another comment, but I’ll respond here as well. I think it’s better to put it into the BlockCache when doing a Get, according to the BlockCache setting. It is more efficient not to create a row cache when the cells to be fetched exist only in the MemStore. However, if the cells to be fetched are in both the MemStore and the StoreFiles, then creating a row cache is efficient to avoid result merging. I’ll give some more thought on how we can achieve this.
Okay. Then I’ll take a heap dump and check the size of the map’s index.
I did it that way because the implementation was simpler. However, it causes confusion and makes it harder to have clear control over the row cache, so I’ve decided to create a separate RowCache implementation.
The TODOs are as follows, and I will proceed in order:
- Implement RowCache
  - Initially considered modifying LruBlockCache, but the required changes were extensive. Instead, implemented RowCache using Caffeine cache.
- Add row.cache.size configuration
  - Default is 0.0 (disabled); RowCache is enabled only if explicitly set to a value > 0.
  - The combined size of BlockCache + MemStore + RowCache must not exceed 80% of the heap.
- Add Row Cache tab to RegionServer Block Cache UI
  - RowCache is not a BlockCache, but added here since there is no better place.
- Add RowCache metrics
  - Metrics for size, count, eviction, hit, and miss are now exposed.
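For orientation, below is a minimal, illustrative sketch of what a Caffeine-backed, size-bounded row cache along these lines could look like. The class name, the weigher, and the byte-size estimate are assumptions for the sketch, not the PR's actual RowCache:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Illustrative sketch only; not the implementation in this PR.
public final class CaffeineRowCacheSketch<K, V> {

  private final Cache<K, V> cache;

  public CaffeineRowCacheSketch(long maxBytes) {
    this.cache = Caffeine.newBuilder()
      // Bound the cache by estimated heap bytes rather than entry count, which is
      // what a heap-fraction setting like row.cache.size implies.
      .maximumWeight(maxBytes)
      .weigher((K key, V row) -> estimateHeapSize(row))
      .recordStats()
      .build();
  }

  public void putRow(K key, V row) {
    cache.put(key, row);
  }

  public V getRow(K key) {
    return cache.getIfPresent(key);
  }

  public void evictRow(K key) {
    cache.invalidate(key);
  }

  public long rowCount() {
    return cache.estimatedSize();
  }

  private int estimateHeapSize(V row) {
    // Placeholder: a real cache would sum the heap sizes of the row's cells.
    return 1;
  }
}
```

The actual PR also ties the cache size to the overall heap check (BlockCache + MemStore + RowCache must stay under 80% of the heap), which this sketch does not attempt to model.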
I configured the RegionServer with a 4 GB heap, setting hfile.block.cache.size to 0.3 and row.cache.size to 0.1, then reran the same workload as before. Under these settings, the maximum RowCache capacity is approximately 400 MB. After the RowCache was fully populated, I generated and analyzed a heap dump.
@wchevreuil
And remove the condition that decides whether to put data into the row cache based on the number of StoreFiles
I’m currently trying to determine the appropriate size of the RowCache relative to the BlockCache.
Thanks! Please allow me a few days to review it.
Sorry for lagging on this, @EungsopYoo, I'm still going through the core of your implementation, but here are some minor "cosmetic" changes I think we could do here.
I may give another review by tomorrow EOD.
.addCounter(Interns.info(ROW_CACHE_EVICTED_ROW_COUNT, ""),
  rsWrap.getRowCacheEvictedRowCount())
.addGauge(Interns.info(ROW_CACHE_SIZE, ""), rsWrap.getRowCacheSize())
.addGauge(Interns.info(ROW_CACHE_COUNT, ""), rsWrap.getRowCacheCount())
Nice!
</div>
<div class="tab-pane" id="tab_row_cache" role="tabpanel">
  <& row_cache_stats; rowCache = rowCache &>
</div>
Nit: We may need to rename the labels here. Where we currently say "Block Cache" it should be only "Cache", and the L1/L2 tabs should be labeled "BlockCache L1"/"BlockCache L2".
RowCache rowCache;
</%args>
<%if rowCache == null %>
  <p>RowCache is null</p>
Should we rather say: "RowCache disabled"?
Per https://github.com/ben-manes/caffeine?tab=readme-ov-file#download, it is recommended to use 3.x for Java 11 or above.
🎊 +1 overall
This message was automatically generated.
private final LongAdder hitCount = new LongAdder();
private final LongAdder missCount = new LongAdder();
private final LongAdder evictedRowCount = new LongAdder();
Can't we use Cache.stats() for this?
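For reference, a small standalone example of the Caffeine stats API being suggested, assuming the cache is built with recordStats(); this is not the PR's code. One caveat: evictionCount() only counts policy-driven evictions (size/expiry), not explicit invalidations, so a separate counter may still be needed if explicit evictions should be reported.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.stats.CacheStats;

public final class CaffeineStatsExample {
  public static void main(String[] args) {
    // recordStats() must be set at build time, otherwise stats() returns all zeros.
    Cache<String, String> cache = Caffeine.newBuilder().recordStats().build();

    cache.put("row1", "v1");
    cache.getIfPresent("row1"); // counted as a hit
    cache.getIfPresent("row2"); // counted as a miss

    CacheStats stats = cache.stats();
    System.out.printf("hits=%d misses=%d evictions=%d%n",
      stats.hitCount(), stats.missCount(), stats.evictionCount());
  }
}
```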
void cacheBlock(RowCacheKey key, RowCells value) {
  cache.put(key, value);
}

public RowCells getBlock(RowCacheKey key, boolean caching) {
  if (!caching) {
    missCount.increment();
    return null;
  }

  RowCells value = cache.getIfPresent(key);
  if (value == null) {
    missCount.increment();
  } else {
    hitCount.increment();
  }
  return value;
}

void evictBlock(RowCacheKey key) {
  cache.asMap().remove(key);
}
We should rename all these methods, as we are not caching blocks, but rows.
  // After creating the barrier, evict the existing row cache for this row,
  // as it becomes invalid after the mutation
  evictRowCache(key);

  return execute(operation);
} finally {
  // Remove the barrier after mutation to allow the row cache to be populated again
  removeRowLevelBarrier(key);
Should (or could) we recache the mutated row?
  return operation.execute();
}

void evictRowCache(RowCacheKey key) {
nit: should be "evictRow".
// Row cache keys should not be evicted on close, since the cache may contain many entries and
// eviction would be slow. Instead, the region’s rowCacheSeqNum is used to generate new keys that
// ignore the existing cache when the region is reopened or bulk-loaded.
So when do stale rows in the row cache get evicted? And if we don't evict rows from the cache for a closed region, would these be wasting cache space until the cache is full and the LFU logic finally picks them for eviction?
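To illustrate the seqNum-based invalidation described in the quoted comment, here is a hypothetical key sketch (names and fields are assumptions, not the PR's RowCacheKey): the region's rowCacheSeqNum participates in key equality, so bumping it on reopen or bulkload makes old entries unreachable, and they are later reclaimed by normal size-based eviction rather than being scanned and removed eagerly.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical sketch of seqNum-scoped keys, not the PR's RowCacheKey.
final class SeqScopedRowKey {
  private final String regionEncodedName;
  private final long rowCacheSeqNum; // bumped on reopen/bulkload to invalidate old entries
  private final byte[] row;

  SeqScopedRowKey(String regionEncodedName, long rowCacheSeqNum, byte[] row) {
    this.regionEncodedName = regionEncodedName;
    this.rowCacheSeqNum = rowCacheSeqNum;
    this.row = row.clone();
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SeqScopedRowKey)) return false;
    SeqScopedRowKey other = (SeqScopedRowKey) o;
    // Old-seqNum keys never match new lookups, so stale rows are simply never hit again
    // and are eventually removed by the cache's size-based policy.
    return rowCacheSeqNum == other.rowCacheSeqNum
      && regionEncodedName.equals(other.regionEncodedName)
      && Arrays.equals(row, other.row);
  }

  @Override
  public int hashCode() {
    return Objects.hash(regionEncodedName, rowCacheSeqNum, Arrays.hashCode(row));
  }
}
```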
/**
 * This is used to invalidate the entire row cache after bulk loading.
 */
Is this comment correct? I thought we would be invalidating only the rows for the given regions. Rows from regions not touched by bulkload would stay valid.
private final Map<RowCacheKey, AtomicInteger> rowLevelBarrierMap = new ConcurrentHashMap<>();

private final boolean enabledByConf;
private final RowCache rowCache;
Should we consider defining an interface for RowCache and referring to the interface from here, so that we can accommodate future RowCache implementations beyond the Caffeine one currently provided as the reference implementation?
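For illustration, a rough sketch of such an interface; the method names follow the renaming suggested earlier in the review, and RowCacheKey and RowCells are the PR's types, but this interface itself is not code from the PR:

```java
// Hypothetical interface sketch; only RowCacheKey and RowCells come from the PR.
public interface RowCache {

  /** Cache the merged cells of a row under the given key. */
  void cacheRow(RowCacheKey key, RowCells row);

  /** Return the cached row, or null if absent. */
  RowCells getRow(RowCacheKey key);

  /** Drop a single row, e.g. after a mutation makes it stale. */
  void evictRow(RowCacheKey key);

  /** Number of cached rows, for metrics. */
  long getRowCount();

  /** Estimated heap size of the cached rows, for metrics. */
  long getSize();
}
```

The Caffeine-backed class added in this PR would then be just one implementation of the interface.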
The change is pretty large, so I suggest we start a feature branch and land it step by step. First we introduce the framework for integrating the row cache and add a flag to enable/disable it, even if the code path when the row cache is enabled is still empty. Then we can run some integration tests, like YCSB to verify performance and ITBLL to verify correctness; if all is good, we can merge the feature branch back. WDYT? Thanks.
@Apache9 @wchevreuil