
feat: support non consecutive blocks ryhope #425

Open
wants to merge 8 commits into base: indexing-with-no-probvable-ext

Conversation

nicholas-mainardi (Contributor)

This PR mainly modifies the ryhope crate to support non-consecutive block numbers in the index tree. It introduces the concept of an epoch mapper, which maps user-defined logical epochs (which can correspond to arbitrary block numbers/timestamps) to internal incremental epochs used in the transactional storage implementations.
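
For illustration, a minimal sketch of the mapping concept, based on the UserEpoch/IncrementalEpoch aliases and the BTreeMap-backed InMemoryEpochMapper quoted later in this review (the EpochMap helper below is hypothetical, not the PR's API):

use std::collections::BTreeMap;

/// User-facing epoch: an arbitrary block number or timestamp.
pub type UserEpoch = i64;
/// Internal epoch: the dense 0, 1, 2, ... sequence used by the
/// transactional storage implementations.
pub type IncrementalEpoch = i64;

/// Hypothetical helper mapping sparse user epochs onto the dense sequence.
struct EpochMap(BTreeMap<UserEpoch, IncrementalEpoch>);

impl EpochMap {
    /// Register the next user epoch; internal epochs always grow by one.
    fn push(&mut self, user: UserEpoch) {
        let next = self.0.len() as IncrementalEpoch;
        self.0.insert(user, next);
    }

    fn to_incremental(&self, user: UserEpoch) -> Option<IncrementalEpoch> {
        self.0.get(&user).copied()
    }
}

fn main() {
    let mut m = EpochMap(BTreeMap::new());
    // Non-consecutive block numbers 100, 104, 109 map onto 0, 1, 2.
    for block in [100, 104, 109] {
        m.push(block);
    }
    assert_eq!(m.to_incremental(104), Some(1));
    assert_eq!(m.to_incremental(105), None);
}

Internally, the storages then only ever see the dense sequence.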

@delehef (Collaborator) left a comment


Outside of the minor comments, I have a major one. Although I agree that introducing a lookup epoch table is the only solution that makes sense for randomly indexed trees, it introduces (i) a performance hit for all the linearly-indexed trees, one that grows worse as the trees grow; (ii) consequently, a lot of tables that will boil down to [x, x + offset] – concretely, after a couple of months, we will be JOIN-ing tables of several hundred thousand rows for every single query.

Do you think it would make sense/be possible, similarly to what you do with IS_EXTERNAL_MAPPER, to add a third specialization that would fall back to the old behavior when the index progression is known to be linear?
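
For illustration, a hedged sketch of what such a specialization could look like, in the spirit of the existing const-generic flags such as IS_EXTERNAL_MAPPER (the LINEAR flag and offset field are hypothetical, not part of the PR):

pub type UserEpoch = i64;
pub type IncrementalEpoch = i64;

/// Hypothetical mapper that skips the lookup table when the index
/// progression is known to be linear.
struct Mapper<const LINEAR: bool> {
    /// Only meaningful when LINEAR: user_epoch = incremental_epoch + offset.
    offset: i64,
}

impl<const LINEAR: bool> Mapper<LINEAR> {
    fn to_incremental(&self, user: UserEpoch) -> IncrementalEpoch {
        if LINEAR {
            // Old behavior: pure arithmetic, no mapper table involved.
            user - self.offset
        } else {
            // New behavior: consult the epoch mapper table (elided).
            todo!("look up the epoch mapper table")
        }
    }
}

This would keep the mapper table out of the hot path for linearly-indexed tables.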

SELECT {execution_epoch}::BIGINT as {EPOCH},
{USER_EPOCH} as {KEY}
FROM {table_name} JOIN (
SELECT * FROM {mapper_table_name}

Collaborator:

Could we avoid the * and explicitly select columns?

{USER_EPOCH} as {KEY}
FROM {table_name} JOIN (
SELECT * FROM {mapper_table_name}
WHERE {USER_EPOCH} >= {}::BIGINT AND {USER_EPOCH} <= {}::BIGINT

Collaborator:

Could you please compute the arguments above and put them directly in the {} for readability's sake?


pub fn mapper_table_name(table_name: &str) -> String {
format!("{}_mapper", table_name)
}

Collaborator:

If you do that for the _mapper, could you please do it for the _meta as well, for the sake of homogeneity?

pub type Epoch = i64;
pub type UserEpoch = i64;

pub type IncrementalEpoch = i64;

Collaborator:

Please document the why of this type.
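
For instance, something along these lines, based on the PR description's definition of internal epochs (the wording is only a suggestion):

/// Internal, contiguous epoch counter (0, 1, 2, ...) used by the
/// transactional storage implementations; user-defined epochs such as
/// block numbers are mapped onto it by the epoch mapper.
pub type IncrementalEpoch = i64;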


/// A timestamp in a versioned storage. Using a signed type allows for easy
/// detection & debugging of erroneous subtractions.
pub type Epoch = i64;
pub type UserEpoch = i64;

Collaborator:

Please update the docstring to emphasize the fact that this is not the actual epoch, but the user-facing one.

Ok(())
}
}

impl EpochMapper for InMemoryEpochMapper {
async fn try_to_incremental_epoch(&self, epoch: UserEpoch) -> Option<IncrementalEpoch> {
self.try_to_incremental_epoch_inner(epoch)

Collaborator:

Could we put these function bodies right here, or is there a reason to implement them in duplicate, outside of the trait implementation?

/// Storage for tree state.
state: VersionedStorage<<T as TreeTopology>::State>,
/// Storage for topological data.
nodes: VersionedKvStorage<<T as TreeTopology>::Key, <T as TreeTopology>::Node>,
/// Storage for node-associated data.
data: VersionedKvStorage<<T as TreeTopology>::Key, V>,
epoch_mapper: SharedEpochMapper<InMemoryEpochMapper, READ_ONLY>,

Collaborator:

Please comment this field, especially in which direction it maps.
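
For instance, based on the PR description's definition of the mapper, the comment could read something like this (the direction being user-defined epochs to internal ones):

/// Maps user-defined epochs (e.g. block numbers or timestamps) to the
/// internal incremental epochs used by the versioned storages above.
epoch_mapper: SharedEpochMapper<InMemoryEpochMapper, READ_ONLY>,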

}

#[derive(Default)]
pub struct Tree;
impl Tree {
pub struct Tree<const IS_EPOCH_TREE: bool>;

Collaborator:

Could we replace IS_EPOCH_TREE with something more generic?

&format!(
"
CREATE VIEW {mapper_table_alias} AS
SELECT * FROM {mapper_table_name}"

Collaborator:

I don't think this should be *?

),
],
)
// Build the subquery that will be used as the source of epochs and block numbers

Collaborator:

Use /// here instead of //

@nicholas-mainardi (Contributor, Author) commented Jan 4, 2025

Outside of the minor comments, I have a major one. Although I agree that introducing a lookup epoch table is the only solution that makes sense for randomly indexed trees, it introduces (i) a performance hit for all the linearly-indexed trees, one that grows worse as the trees grow; (ii) consequently, a lot of tables that will boil down to [x, x + offset] – concretely, after a couple of months, we will be JOIN-ing tables of several hundred thousand rows for every single query.

I see, it's a concern I had as well, but I wasn't so sure that a JOIN would be much slower than a generate_series operation. Basically, the generate_series is also constructing multiple rows for each row of the original table, so the output size would roughly be the same, I think. But on the other hand it doesn't have to read any other table, so it could do all the operations in memory, I guess. Do you know whether a JOIN will actually be much slower than generate_series, or shall we do some benchmarks?

Do you think it would make sense/be possible, similarly to what you do with IS_EXTERNAL_MAPPER, to add a third specialization that would fall back to the old behavior when the index progression is known to be linear?

I think this would be possible, but maybe it will be more annoying on the API side, because the caller will need to specify an internal implementation detail (i.e., whether the table has incremental epochs or not). Furthermore, we will need to make this information also available to parsil, which might make the parsil API more cumbersome. But it should still be reasonable, so it could be a good compromise if the performance hit is too high. On this I would also like to hear @nikkolasg's opinion, since it will affect the API and DQ as well (as I expect any table in DQ will need to specify to ryhope how to handle epochs).

@delehef changed the title from "Feat/support non consecutive blocks ryhope" to "feat: support non consecutive blocks ryhope" on Jan 4, 2025
@delehef (Collaborator) commented Jan 4, 2025

but I wasn't so sure that a JOIN would be much slower than a generate_series operation.

IIRC, the generate_series only came into play after the block range had been filtered, right? So although I agree that the final row count is the same, this is not what is worrying me – rather, that we will JOIN a 10k's large tree node table with a 10k's large mapping table before any filtering took place.

Furthermore, we will need to make this information also available to parsil, which might make the parsil API more cumbersome.

Agreed on this, the parsil/ryhope interfacing is suboptimal. I already tried to split them better, but the best I could do is the MetaOperation thing, especially the KeySource associated type, which is far from a perfect solution :(

@nicholas-mainardi (Contributor, Author) commented Jan 6, 2025

IIRC, the generate_series only came into play after the block range had been filtered, right? So although I agree that the final row count is the same, this is not what is worrying me – rather, that we will JOIN a 10k's large tree node table with a 10k's large mapping table before any filtering took place.

Not sure actually; to me it looked like the generate_series was applied on every row of the table, as it will compare each value of __valid_from and __valid_until with the block range bounds and then generate a series for each row depending on the comparison, so the complexity should be roughly O(n*|block_range|), where |block_range| is the size of the block range. With the new query, the filtering on the epoch mapper table is still applied before the JOIN (it should be inside the SELECT sub-query), so the complexity should still be O(n*|block_range|). Did I misunderstand how the queries with generate_series were working?

@delehef (Collaborator) commented Jan 6, 2025

If we take this example, that went from:

SELECT
  {KEY}, generate_series(GREATEST({VALID_FROM}, $1), LEAST({VALID_UNTIL}, $2)) AS epoch, {PAYLOAD}
  FROM {table}
  WHERE NOT ({VALID_FROM} > $2 OR {VALID_UNTIL} < $1) AND {KEY} = ANY($3),

to:

SELECT {KEY}, {USER_EPOCH} AS epoch, {PAYLOAD} 
  FROM {table} JOIN (
    SELECT {USER_EPOCH}, {INCREMENTAL_EPOCH} FROM {mapper_table_name} WHERE {USER_EPOCH} >= $1 AND {USER_EPOCH} <= $2
  ) AS __mapper 
  ON {VALID_FROM} <= {INCREMENTAL_EPOCH} AND {VALID_UNTIL} >= {INCREMENTAL_EPOCH} 
  WHERE {KEY} = ANY($3)
            

the old version is one linear scan over the block range. The new one, however, is one linear scan plus a join over the filtered mapper, which we expect to grow linearly as a function of time.

So if b is the block range, m the mapper size, and n the vTable size, then to my understanding the old version is O(n), whereas the new one is O(n [WHERE clause on table] + m [WHERE clause on mapper] + b×b [JOIN table/mapper]). Does that seem correct to you?

@nikkolasg (Collaborator) left a comment

Nice job on a complex task! What I would really like to see more of, though, is comments explaining the architectural choices and the hidden assumptions (only one process must update the DB epoch table, etc.), and also a split into a few more files where that makes sense. The ryhope codebase is fairly dense.


/// The index tree when the primary index is the block number of a blockchain is a sbbst since it
/// is a highly optimized tree for monotonically increasing index. It produces very little
/// tree-manipulating operations on update, and therefore, requires the least amount of reproving
/// when adding a new index.
/// NOTE: when dealing with another type of index, i.e. a general index such as what can happen on
/// a result table, then this tree does not work anymore.
pub type BlockTree = sbbst::Tree;
pub type BlockTree = sbbst::EpochTree;

Collaborator:

Change the comments above, which are still very much linked to the sequential use case.

index_table_name: String,
row_table_name: String,
genesis_block: UserEpoch,
alpha: scapegoat::Alpha,

Collaborator:

lol we never really played with this alpha to find the best trade-off, something to keep in mind...

Comment on lines +298 to +301
// {table} JOIN (
// SELECT {USER_EPOCH}, {INCREMENTAL_EPOCH} FROM {mapper_table}
// WHERE {USER_EPOCH} >= $min_block AND {USER_EPOCH} <= $max_block
// ) ON {VALID_FROM} <= {INCREMENTAL_EPOCH} AND {VALID_UNTIL} >= {INCREMENTAL_EPOCH}

Collaborator:

I'm really not an expert on the parser framework, but isn't it simpler to write the query in a string and ask the framework to parse it instead of manually constructing it? Why is it not possible?
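
For illustration, a minimal sketch of what parsing from a string could look like, assuming parsil sits on top of the sqlparser crate; the table arguments are placeholders and the column names are taken from the queries discussed below:

use sqlparser::ast::Statement;
use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::parser::{Parser, ParserError};

/// Hypothetical helper: build the JOIN-against-the-mapper query by parsing
/// a SQL string instead of assembling AST nodes by hand.
fn parse_mapper_join(table: &str, mapper_table: &str) -> Result<Vec<Statement>, ParserError> {
    let sql = format!(
        "SELECT * FROM {table} JOIN (
            SELECT __user_epoch, __incremental_epoch FROM {mapper_table}
            WHERE __user_epoch >= $1 AND __user_epoch <= $2
         ) AS __mapper
         ON __valid_from <= __incremental_epoch AND __valid_until >= __incremental_epoch"
    );
    Parser::parse_sql(&PostgreSqlDialect {}, &sql)
}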

if self
.current_epoch()
.await
.ok()

Collaborator:

Maybe worth panicking on errors now, so that it's easy to remember and change into the error type later on?
I mean, I'll completely forget about the ok() after this review for sure.

}
}
#[derive(Clone, Debug)]
pub struct InMemoryEpochMapper(BTreeMap<UserEpoch, IncrementalEpoch>);

Collaborator:

Definitely not, if you ask me. To be honest, we don't use the memory backend in practice; maybe in a few tests here and there in ryhope, I believe, but nothing important. So clearly not something to spend time optimizing.


async fn fetch_at_inner(&self, epoch: IncrementalEpoch) -> T {
trace!("[{self}] fetching payload at {}", epoch);
let connection = self.db.get().await.unwrap();

Collaborator:

Can you put expect() instead?

pub(crate) const INITIAL_INCREMENTAL_EPOCH: IncrementalEpoch = 0;

#[derive(Clone, Debug)]
pub struct EpochMapperStorage {

Collaborator:

Comments please.

Collaborator:

And can we put this whole thing into another file? We're already at 1.5k lines.

in_tx: bool,
/// Set of `UserEpoch`s being updated in the cache since the last commit to the DB
dirty: BTreeSet<UserEpoch>,
pub(super) cache: Arc<RwLock<InMemoryEpochMapper>>,

Collaborator:

Can you write somewhere why

  1. we need a cache
  2. we need a shared cache

It wasn't clear to me why we needed that until I chatted with you.

Comment on lines +1700 to +1709
db_tx
.query(
&format!(
"INSERT INTO {} ({USER_EPOCH}, {INCREMENTAL_EPOCH})
VALUES ($1, $2)",
self.mapper_table_name()
),
&[&(user_epoch as UserEpoch), &incremental_epoch],
)
.await?;

Collaborator:

Could we do only one query across all epochs vs. doing one by one for each epoch?
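
For illustration, a minimal sketch of what a single batched insert could look like, using Postgres UNNEST; it assumes a tokio-postgres-style transaction like the db_tx in the quoted snippet, and the __user_epoch/__incremental_epoch column names used elsewhere in this thread:

/// Hypothetical batched variant of the quoted per-epoch insert.
async fn insert_epochs_batched(
    db_tx: &tokio_postgres::Transaction<'_>,
    mapper_table: &str,
    pairs: &[(i64, i64)], // (UserEpoch, IncrementalEpoch)
) -> Result<(), tokio_postgres::Error> {
    let users: Vec<i64> = pairs.iter().map(|(u, _)| *u).collect();
    let incrementals: Vec<i64> = pairs.iter().map(|(_, i)| *i).collect();
    let stmt = format!(
        "INSERT INTO {mapper_table} (__user_epoch, __incremental_epoch)
         SELECT * FROM UNNEST($1::BIGINT[], $2::BIGINT[])"
    );
    // One round-trip for the whole batch instead of one query per epoch.
    db_tx.execute(stmt.as_str(), &[&users, &incrementals]).await?;
    Ok(())
}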

assert!(s.store(11, 2).await.is_err()); // try insert key smaller than previous one
assert!(s.store(14, 2).await.is_ok());
assert!(s.store(15, 2).await.is_ok());
s.commit_transaction().await?;

Collaborator:

Can you check the values of the DB + cache afterwards?

@nicholas-mainardi (Contributor, Author):

Ah, now I see what you meant about a range being applied before generate_series; I didn't consider that in the queries we usually apply a range in the WHERE clause, sorry about that.
Btw, in the old version I was also counting the work done by generate_series for each of the rows of the vTable in the range, which should be O(b) for each matching row, as it might need to generate at most b incremental epochs. But maybe you were not counting it because it is done entirely in memory, so it should be much faster than table operations? So the complexity of the old version in your example query should be O(b*b) (assuming we have an index over __valid_from and __valid_until in the vTable) against O(n*b) for the new one, which is a really good point, I think.

I am trying to think if there is a way to re-write the queries for the new version without the expensive JOIN. I found a way for some types of queries with this solution:

SELECT __key, generate_series(GREATEST(__valid_from, min_epoch), LEAST(__valid_until, max_epoch))
  FROM {table} CROSS JOIN (
    SELECT MIN(__incremental_epoch) as min_epoch, MAX(__incremental_epoch) as max_epoch
    FROM {table_mapper}
    WHERE __user_epoch >= $1 AND __user_epoch <= $2
  ) as mapper_range
  WHERE __valid_from <= mapper_range.max_epoch AND __valid_until >= mapper_range.min_epoch;

This query should be more efficient, since we are JOIN-ing with a table with only 1 row, and the sub-query should be run only on a range of epochs, not on the entire mapper table.
The problem is that this would work only if we can return incremental epochs in the SELECT, but in some queries (e.g., the ones needed to compute the results in parsil::Executor), we need to have user-defined epochs, because there might be other predicates in the query referring to user epochs. I am not sure we can expose these user epochs without an actual JOIN. I will try to see if I can find a reasonably efficient solution; otherwise I think it would make sense performance-wise to keep the old version for tables with incremental epochs, as you suggested.

@nikkolasg (Collaborator):

As we discussed in the meeting, let's first do a benchmark and see what the consequences are.
