Skip to content

Optimise evaluation strategy in circumstances where segmentation is local to each MeasureProvider #2

Open
@matthewwardrop

Description

@matthewwardrop

Currently the evaluation strategy logic always joins measures in by joining MeasureProviders in by the nominal statistical unit type. In cases where segmentation can be done entirely local to all MeasureProviders, this is inefficient.

For example, if you have two measure providers:

And you run:
`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])

The current SQL generated will be something like:

SELECT users.ds, SUM(age), SUM(lead_time)
FROM (
    SELECT * FROM users
) users
LEFT JOIN (
    SELECT * FROM signups
) ON users.user = signups.user AND users.ds = signups.ds
GROUP BY users.ds

Especially if the cardinality of users is large, this can be needlessly computationally taxing.

Instead, we should add an optimisation when all requested segmentation can be performed locally by all measure providers such that the generated SQL looks something like:

SELECT a.ds, sum_age, sum_lead_time
FROM (
    SELECT ds, SUM(age) AS sum_age FROM users GROUP BY ds
) a
FULL OUTER JOIN (
    SELECT ds, SUM(lead_time) AS sum_lead_time FROM signups GROUP BY ds
) b ON a.ds = b.ds

Which potentially reduces the computational complexity of the join substantially in these simple evaluation scenarios.

Thanks for suggesting/flagging this @danfrankj

Metadata

Metadata

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions