Description
Currently the evaluation strategy logic always joins measures in by joining MeasureProviders in by the nominal statistical unit type. In cases where segmentation can be done entirely local to all MeasureProviders, this is inefficient.
For example, if you have two measure providers:
- users
- user [primary key]
- age [measure]
- ds [partition, see issue Add support for partition dimensions #1]
- signups
- user [foreign key]
- lead_time [measure]
- ds [partition, see issue Add support for partition dimensions #1]
And you run:
`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])
The current SQL generated will be something like:
SELECT users.ds, SUM(age), SUM(lead_time)
FROM (
SELECT * FROM users
) users
LEFT JOIN (
SELECT * FROM signups
) ON users.user = signups.user AND users.ds = signups.ds
GROUP BY users.ds
Especially if the cardinality of users is large, this can be needlessly computationally taxing.
Instead, we should add an optimisation when all requested segmentation can be performed locally by all measure providers such that the generated SQL looks something like:
SELECT a.ds, sum_age, sum_lead_time
FROM (
SELECT ds, SUM(age) AS sum_age FROM users GROUP BY ds
) a
FULL OUTER JOIN (
SELECT ds, SUM(lead_time) AS sum_lead_time FROM signups GROUP BY ds
) b ON a.ds = b.ds
Which potentially reduces the computational complexity of the join substantially in these simple evaluation scenarios.
Thanks for suggesting/flagging this @danfrankj