dlmarion
Contributor

An on-demand table is like an offline table, except that its tablets will be hosted when a client wants to interact with them. A table can be put into an on-demand state using the ondemand shell command or the TableOperations.ondemand method. The accumulo.root and accumulo.metadata tables cannot be put into an on-demand state.

Tablets of an on-demand table are hosted when a client fails to find the tablet location of an on-demand table. When this occurs the client makes an RPC call that inserts an "ondemand" column into the tablet metadata. When a client operation needs an on-demand tablet to be hosted, the client waits until the tablet is hosted and its location is resolved before proceeding.
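
The client-side flow described above (lookup fails, request hosting once, then wait until the location resolves) might be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class name, the `waitForLocation` method, its parameters, and the fixed polling strategy are all assumptions, and the location lookup and hosting RPC are passed in as plain functional interfaces.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Accumulo code base.
final class OnDemandClientWaitSketch {

  private OnDemandClientWaitSketch() {}

  /**
   * Looks up a tablet location. If none is found, asks for the tablet to be hosted once and
   * then polls until the location resolves or the attempts run out.
   */
  static String waitForLocation(Supplier<Optional<String>> lookup, Runnable requestHosting,
      long pollMillis, int maxAttempts) {
    Optional<String> location = lookup.get();
    if (location.isPresent()) {
      return location.get(); // already hosted, nothing to wait for
    }
    // Stand-in for the RPC that inserts the "ondemand" column into the tablet metadata.
    requestHosting.run();
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for tablet", e);
      }
      location = lookup.get();
      if (location.isPresent()) {
        return location.get();
      }
    }
    throw new IllegalStateException("tablet not hosted after " + maxAttempts + " attempts");
  }
}
```

In a sketch like this, the time the client spends in the loop is bounded below by how quickly the hosting side reacts to the request, which is why the wait is sensitive to the Manager's scan interval.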

The Manager will assign tablets that have an "ondemand" column in their metadata to TabletServers and will unassign hosted on-demand tablets when the "ondemand" column does not exist. When setting an online table to an on-demand state, all of its tablets will be unloaded. The new Property MANAGER_TABLET_GROUP_WATCHER_INTERVAL specifies the interval at which the Manager will look for tablet state changes. Lower values for this property will reduce the wait time in the client for an on-demand tablet to be hosted.
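
The assignment rule in the paragraph above can be reduced to a single decision: a tablet should be hosted if its table is online, or if its table is on-demand and the tablet has the "ondemand" column. The sketch below illustrates only that rule. The class, enum, and method names are hypothetical and are not taken from the Manager code.

```java
// Hypothetical sketch of the hosting decision; names are illustrative only.
final class HostingDecisionSketch {

  enum TableState { ONLINE, ONDEMAND, OFFLINE }

  private HostingDecisionSketch() {}

  /**
   * Returns true when a tablet should be assigned to a TabletServer: always for online tables,
   * only when the "ondemand" column is present for on-demand tables, never for offline tables.
   */
  static boolean shouldBeHosted(TableState tableState, boolean hasOnDemandColumn) {
    switch (tableState) {
      case ONLINE:
        return true;
      case ONDEMAND:
        return hasOnDemandColumn;
      default:
        return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(shouldBeHosted(TableState.ONDEMAND, true)); // true
    System.out.println(shouldBeHosted(TableState.ONDEMAND, false)); // false
  }
}
```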

TabletServers keep track of the last access time for on-demand tablets. Periodically the TabletServer will evaluate which on-demand tablets should be unloaded. The interval for this evaluation is specified by the new Property TABLE_ONDEMAND_UNLOADER_INTERVAL. At this interval the TabletServer will call the OnDemandTabletUnloader class specified by the Property TABLE_ONDEMAND_UNLOADER, which will return the set of tablets to unload. Unloading is performed by removing the "ondemand" column from the tablet metadata, which will cause the Manager to unassign the tablets. If the TabletServer is running low on memory, it will not call the OnDemandTabletUnloader and will instead unload the tablet with the oldest access time.
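
To make the two unloading paths concrete, here is a minimal sketch of an inactivity-based selection and of the low-memory fallback that picks the tablet with the oldest access time. It does not implement the OnDemandTabletUnloader interface from this PR. The class name, method names, the use of String tablet identifiers, and the `>=` comparison against the threshold are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; this is not the OnDemandTabletUnloader API.
final class InactivityUnloaderSketch {

  private InactivityUnloaderSketch() {}

  /** Normal path: every tablet idle for at least the threshold is selected for unloading. */
  static Set<String> tabletsToUnload(Map<String,Long> lastAccessMillis, long nowMillis,
      long thresholdMillis) {
    Set<String> result = new HashSet<>();
    for (Map.Entry<String,Long> entry : lastAccessMillis.entrySet()) {
      if (nowMillis - entry.getValue() >= thresholdMillis) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  /** Low-memory path: select only the tablet with the oldest access time, if any. */
  static Optional<String> oldestTablet(Map<String,Long> lastAccessMillis) {
    return lastAccessMillis.entrySet().stream().min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }
}
```

In both paths the selected tablets would then have their "ondemand" column removed, leaving the actual unassignment to the Manager.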

New metrics are emitted for on-demand tablets. Specifically, the metric accumulo.tserver.tablets.ondemand.unloaded.lowmem is incremented when an on-demand tablet is unloaded for low memory in the TabletServer, and the metric accumulo.tserver.tablets.ondemand.online is modified when on-demand tablets are hosted or unloaded.

Finally, a new utility, ListOnlineOnDemandTablets, has been created to list on-demand tablets that are currently hosted.

Closes #3210 #3211 #3212

@dlmarion
Contributor Author

dlmarion commented Mar 22, 2023

If you want to test this yourself, you can:

  1. Build Accumulo from this branch
  2. Set the following properties in accumulo.properties:
manager.tablet.watcher.interval=15s
tserver.ondemand.tablet.unloader.interval=1m
table.custom.ondemand.unloader.inactivity.threshold.seconds=120
  3. Start up a local instance
  4. Log into the Accumulo shell and do the following:
createtable test

Note
notice test and 1 tablet in Monitor
notice "loc" column in tablet metadata

ondemand test

Note
notice 0 tablets in monitor
notice "loc" missing from tablet metadata

insert a b c d

Note
notice shell command wait for tablet to be hosted
notice 1 tablet in Monitor
notice "loc" column in tablet metadata
notice "ondemand" column in tablet metadata

addsplits m -t test

Note
notice 1 tablet in Monitor
notice "loc" in tablet metadata for source tablet only
notice "ondemand" column in tablet metadata for source tablet only

sleep 240

Note
table.custom.ondemand.unloader.inactivity.threshold.seconds is set to 120 (2m), so waiting 4m should unload the tablet
notice 0 tablets in monitor
notice "loc" column missing in tablet metadata
notice "ondemand" column missing in tablet metadata

insert a b c d

Note
notice shell command wait for tablet to be hosted
notice 1 tablet in Monitor
notice "loc" column in tablet metadata
notice "ondemand" column in tablet metadata

  5. Stop Accumulo
  6. Start Accumulo

Note
test tablet had "ondemand" column on shutdown, so it gets re-hosted
notice 1 tablet in Monitor
notice "loc" column in tablet metadata
notice "ondemand" column in tablet metadata

  7. Wait a few minutes for the tablet to be unloaded

Note
notice 0 tablets in monitor
notice "loc" column missing in tablet metadata
notice "ondemand" column missing in tablet metadata
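
As a rough check on the timings used in the steps above, the sketch below adds up the three configured intervals to estimate how long an idle tablet could stay hosted. It is a back-of-the-envelope model, not Accumulo code: the class and method names are hypothetical, and it assumes the worst case is simply the inactivity threshold plus one unloader interval plus one Manager watcher interval, ignoring scan and unload processing time.

```java
import java.time.Duration;

// Hypothetical timing model for the walkthrough; not part of Accumulo.
final class UnloadTimingSketch {

  private UnloadTimingSketch() {}

  /**
   * Estimates the longest time between the last access and the Manager noticing that the
   * "ondemand" column was removed, under the simple additive assumption described above.
   */
  static Duration worstCaseUnloadDelay(Duration inactivityThreshold, Duration unloaderInterval,
      Duration managerWatcherInterval) {
    return inactivityThreshold.plus(unloaderInterval).plus(managerWatcherInterval);
  }

  public static void main(String[] args) {
    Duration delay = worstCaseUnloadDelay(Duration.ofSeconds(120), Duration.ofMinutes(1),
        Duration.ofSeconds(15));
    System.out.println(delay.getSeconds()); // 195
  }
}
```

Under these assumptions the estimate is 195 seconds, which is consistent with the `sleep 240` step being long enough for the tablet to be unloaded.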

Contributor Author

@dlmarion dlmarion left a comment

Noticed some things that can be changed.


@Override
public Repo<Manager> call(final long tid, final Manager manager) {
// TODO: How are we treating ONDEMAND tables for BulkImport?
Contributor Author

Need to address this.

Contributor

In V2 bulk import when we switch it to use conditional mutations that may completely address this.

Contributor Author

Opened #3264 for this.

Some ITs were created before the tablet unloading mechanism was
developed. These tests used the offline command to unload the
tablets. I removed the calls to offline to ensure that the
ondemand call was unloading the tablets in the test.

Code introduced to bring ondemand tablets online was inadvertently
throwing a TableNotFoundException in some cases causing ITs to fail
because the exception was unexpected. Moved the code that could
raise this exception and suppressed it with a log statement.
@dlmarion
Contributor Author

Ran a full IT build and a single test failed, ConditionalWriterIT.testDeleteTable. It failed because a TableNotFoundException was thrown instead of a TableDeletedException. This was caused by an added call to context.getTableName(TableId) that checks whether the table is in an on-demand state. I moved this code in fa80774 and suppressed the exception.

Contributor

@keith-turner keith-turner left a comment

I am mostly finished reviewing this. The only thing I have left to look at is the changes around the tablet state iterator code; I am going to come back to that later today.

+ " a table to be offline before exporting.");
if (isOnline(tableName) || isOnDemand(tableName)) {
throw new IllegalStateException("The table " + tableName
+ " is not offline; exportTable requires a table to be offline before exporting.");
Contributor

This comment made me realize investigation may be needed. Created the following item in the elasticity project.

Accumulo: Elasticity (view)


private final AtomicLong syncCounter = new AtomicLong(0);

final OnlineTablets onlineTablets = new OnlineTablets();
private final Map<KeyExtent,AtomicLong> onDemandTabletAccessTimes =
Contributor

Wondering if it would simplify things if the last access time were stored in the tablet object. Then this map does not need to be maintained and have the possibility to drift from the set of online tablets. When this map is updated could instead call a method on the tablet.

Also wondering if it would be better for the tablet code to take ownership of updating the last access time rather than the tserver code. Tserver code could read it and tablet code update it as needed.
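
A minimal sketch of what this suggestion could look like, with the tablet object owning its last access time and the tserver side only reading it. This is an illustration of the idea, not the Tablet class: the class name, the injected clock, and the method names are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a tablet that tracks its own last access time.
final class TabletAccessSketch {

  private final LongSupplier clock; // injected so the sketch can be tested deterministically
  private final AtomicLong lastAccessMillis;

  TabletAccessSketch(LongSupplier clock) {
    this.clock = clock;
    this.lastAccessMillis = new AtomicLong(clock.getAsLong());
  }

  /** Called by the tablet's own read/write paths; keeps the largest timestamp seen so far. */
  void recordAccess() {
    lastAccessMillis.accumulateAndGet(clock.getAsLong(), Math::max);
  }

  /** Read-only view that the tserver-side unload evaluation could use. */
  long getLastAccessMillis() {
    return lastAccessMillis.get();
  }

  /** Milliseconds since the last recorded access, according to the injected clock. */
  long idleMillis() {
    return clock.getAsLong() - lastAccessMillis.get();
  }
}
```

With the timestamp stored on the tablet itself, there is no separate map keyed by extent that could drift from the set of online tablets.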

Contributor Author

I have no issue with this, and it is probably a better implementation. I wonder if this should be done as a follow-on commit because Tablet is going to change, maybe drastically. Thoughts?

Contributor

A follow on commit after merging SGTM. Could create an issue if needed.

Contributor Author

Opened #3263 for this

"1.3.5"),
MANAGER_TABLET_GROUP_WATCHER_INTERVAL("manager.tablet.watcher.interval", "60s",
PropertyType.TIMEDURATION,
"Time to wait between scanning tablet states to determine migrations, etc.", "3.1.0"),
Contributor

Could make a PR for the 3 branch to add this change. Could still include the change here, we can make the change in main branch and when we merge main into elastic branch just resolve any conflicts. If we get this change in main then it will be one less divergence to worry about as we merge main into elastic branch going forward.

Contributor

Opened #3256 for this.

Comment on lines +197 to +199
final boolean shouldBeOnline = onlineTables.contains(tls.extent.tableId());
final boolean isOnDemandTable = onDemandTables.contains(tls.extent.tableId());
final boolean onDemandTabletShouldBeHosted = tls.ondemand;
Contributor

May be able to compute everything here and then later in the switch stmt only use shouldBeOnline var.

Suggested change
- final boolean shouldBeOnline = onlineTables.contains(tls.extent.tableId());
- final boolean isOnDemandTable = onDemandTables.contains(tls.extent.tableId());
- final boolean onDemandTabletShouldBeHosted = tls.ondemand;
+ final boolean shouldBeOnline = onlineTables.contains(tls.extent.tableId()) || (onDemandTables.contains(tls.extent.tableId()) && tls.ondemand);

Contributor Author

Those variables are used in the debug log line that follows.

@dlmarion
Contributor Author

Kicked off full IT build

@dlmarion dlmarion self-assigned this Mar 24, 2023
@cshannon
Contributor

I tested this out quite a bit today using Uno and so far so good. The instructions need to be updated though with new property names. I couldn't figure out why the tables were not unloading and then I did some debugging and it looks like things got renamed.

The instructions say to set:

manager.tablet.watcher.interval=15s
table.ondemand.tablet.unloader.interval=1m
table.custom.ondemand.unloader.inactivity.threshold=120000

But this should now be:

manager.tablet.watcher.interval=15s
#table renamed to tserver
tserver.ondemand.tablet.unloader.interval=1m
#seconds appended to the end
table.custom.ondemand.unloader.inactivity.threshold.seconds=120

@cshannon
Contributor

My latest changes from #3257 caused merge conflicts in this branch, since I changed things in tests and especially in TabletLocationState. I went ahead and created a PR for this branch that merges in the elasticity branch and fixes all the conflicts: dlmarion#39

Contributor

@cshannon cshannon left a comment

Overall looks pretty good, I tested everything out like I said earlier and it went fine once I set the correct properties. I made a few comments but most of them are just stylistic or nits so could be ignored.

There is one logging change, though, that should be fixed, as there is a mismatched number of parameters in TabletLocatorImpl.


@Override
public void bringOnDemandTabletsOnline(TInfo tinfo, TCredentials credentials, String tableId,
List<TKeyExtent> extents) throws ThriftSecurityException, TException {
Contributor

Suggested change
- List<TKeyExtent> extents) throws ThriftSecurityException, TException {
+ List<TKeyExtent> extents) throws TException {

@dlmarion dlmarion merged commit 6a56f24 into apache:elasticity Mar 28, 2023
@dlmarion dlmarion deleted the elasticity-on-demand-tables branch October 24, 2023 20:36
@ctubbsii ctubbsii modified the milestones: 4.0.0, 3.1.0 Jul 12, 2024
@ctubbsii ctubbsii modified the milestones: 3.1.0, 4.0.0 Mar 14, 2025
