Dynamically sized temporal bins #128


Open · wants to merge 58 commits into deploy

Conversation

@Nate-Wessel (Contributor) commented Jul 9, 2024

This creates dynamic time bins by taking the preexisting definition of "enough data" for a bin (80% of the total corridor length has at least some data) and, instead of applying it as a criterion to predefined one-hour bins, applying it in a rolling window over a request's time range.

Data exist as 5-minute bins in the database, and we step through these one by one until we have enough of them in sequence to say there is enough data for the requested corridor. Once a bin reaches the threshold, we step to the next 5-minute bin and start accumulating data for the next observation.

The resulting time bins vary in length from 5 minutes (the minimum) to much longer (a maximum has yet to be defined), but all meet our criterion for having sufficient data.

The result is many more observations than the one-hour approach produces. As expected, these bins also exhibit greater variability, since there is less reversion to the mean within each bin, and the increased temporal resolution makes more of the temporal variation visible.
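
For illustration, the accumulation loop described above might look roughly like the following sketch. The names bin_ends, total_length, minimum_length, and tx come from the code under review (quoted later in this thread); the link_dir column and the exact dataframe shapes are assumptions, not the actual implementation.

    def assemble_bins(link_speeds_df, links_df):
        """Accumulate 5-minute bins (tx) until links with data cover at least
        80% of the corridor's total length, then close the bin and start fresh."""
        total_length = links_df['length'].sum()
        minimum_length = 0.8 * total_length
        bin_ends = []           # tx values at which each dynamic bin closes
        observed_links = set()  # link_dirs with at least one observation so far
        for tx in sorted(link_speeds_df.tx.unique()):
            observed_links |= set(link_speeds_df.loc[link_speeds_df.tx == tx, 'link_dir'])
            observed_length = links_df.loc[links_df['link_dir'].isin(observed_links), 'length'].sum()
            if observed_length >= minimum_length:
                bin_ends.append(tx)     # threshold met; the next tx starts a new bin
                observed_links = set()
        return bin_ends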

Base automatically changed from bypass-congestion-network to deploy August 2, 2024 16:23
@Nate-Wessel Nate-Wessel linked an issue Sep 23, 2024 that may be closed by this pull request
@Nate-Wessel Nate-Wessel marked this pull request as ready for review September 23, 2024 13:41
@Nate-Wessel (Contributor, Author)

I've done a bit of manual integration testing to make sure all these merges from deploy didn't break anything. They didn't, as far as I can tell. This should be good to go.

@Nate-Wessel (Contributor, Author)

It should be noted that this work is only scoped to demonstrate the promise of further subdividing what are currently uniform one-hour bins. It has been in some practical, applied use for several months without thorough review, and the cracks are starting to show in this hasty implementation. We may eventually implement a much more robust way of defining bins, but the point of this PR is to demonstrate that it can be done and is worth doing.

@gabrielwol left a comment


My main concern at this point is the lack of a constraint on bin size, as detailed in a comment below.

Also of note: I've implemented an equivalent dynamic binning function in SQL here: https://github.com/CityofToronto/bdit_data-sources/blob/1132-here-aggregation-proposal/here/traffic/sql/function-cache_tt_results.sql, drawing significant inspiration from your code!

bin_ends = list()                         # tx values where each dynamic bin closes
total_length = links_df['length'].sum()   # full corridor length
minimum_length = 0.8 * total_length       # the "enough data" threshold
for tx in link_speeds_df.tx.unique():     # step through the 5-minute bins


As I noted on Slack, there needs to be a constraint on max(tx) - min(tx) within an assembled bin. The example I caught assembles a bin from two successive Thursdays! I think constraining bins to the requested daily range (start_time to end_time) is appropriate; see the sketch after the parameters below.

[screenshot: a bin assembled across two successive Thursdays]

start_node=30439735
end_node=30440490
start_time=9
end_time=16
start_date='2023-12-01'
end_date='2024-01-01'
include_holidays=True
dow_list=[4]
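
A minimal sketch of the suggested check, assuming the tx values in a bin are pandas timestamps and that start_time/end_time are the integer hours shown in the parameters above (the bin_span_ok helper and its signature are hypothetical):

    import pandas as pd

    def bin_span_ok(bin_txs, start_time, end_time):
        """Reject an assembled bin whose 5-minute members span more than the
        requested daily window, e.g. one stitched from two successive Thursdays."""
        max_span = pd.Timedelta(hours=end_time - start_time)  # e.g. 9 to 16 -> 7 hours
        return max(bin_txs) - min(bin_txs) <= max_span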

@Nate-Wessel (Contributor, Author)

Good catch.

I think the constraint needs to be more complex than that, though. Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.

The code would also need some larger structural changes, since bins are currently only defined by their endpoints.

@gabrielwol

> Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.

I thought about this too, but I decided it added too much complexity (and it doesn't even add a new full segment observation; it just adds some link observations to the last bin in each period).

@Nate-Wessel (Contributor, Author)

Perhaps so.

What do you think of this possible approach? Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes), use a rolling window capped at one hour, where observations from the earliest 5-minute bins are discarded whenever the window exceeds one hour and still doesn't have sufficient link length.

This sidesteps a lot of possible complexity around time and date boundaries for the query, like how start/end times can wrap around midnight.
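
A hedged sketch of that idea, reusing the assumed dataframe shapes from the description above (rolling_window_bins and MAX_BINS are hypothetical names, and consecutive tx values are assumed; gaps in the data would need tx arithmetic instead of a simple count):

    from collections import deque

    MAX_BINS = 12  # 12 x 5-minute bins = a one-hour maximum window

    def rolling_window_bins(link_speeds_df, links_df):
        """Like the accumulation loop, but discard the oldest 5-minute bin
        whenever the window exceeds one hour without reaching the threshold."""
        minimum_length = 0.8 * links_df['length'].sum()
        window = deque()  # (tx, set of link_dirs observed in that 5-minute bin)
        bins = []         # (start_tx, end_tx) pairs that met the threshold
        for tx in sorted(link_speeds_df.tx.unique()):
            window.append((tx, set(link_speeds_df.loc[link_speeds_df.tx == tx, 'link_dir'])))
            if len(window) > MAX_BINS:
                window.popleft()  # drop the oldest bin's observations
            observed = set().union(*(links for _, links in window))
            if links_df.loc[links_df['link_dir'].isin(observed), 'length'].sum() >= minimum_length:
                bins.append((window[0][0], tx))
                window.clear()  # start accumulating the next observation fresh
        return bins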

@gabrielwol

> Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes)...

I would be reluctant to change this core part of the algorithm (for minimal gain) after spending some time implementing it 😅

RE:

> complexity around time and date boundaries for the query, like how start/end times can wrap around midnight

I didn't deal with periods overlapping midnight yet (probably won't be a standard aggregation?), but as long as you group by date + start_time + end_time before binning, it shouldn't be a problem. You need to make sure all the bins belong to the same window before assembling them!
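
As a rough illustration of that grouping (the column names dt, start_time, and end_time are assumptions, and assemble_bins is the sketch from the description above, not the actual function):

    # group 5-minute records by their window before assembling bins, so a
    # bin can never straddle two dates or two daily time ranges
    for (dt, start, end), window_df in link_speeds_df.groupby(['dt', 'start_time', 'end_time']):
        bin_ends = assemble_bins(window_df, links_df)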

@Nate-Wessel (Contributor, Author)

It might be an edge case given the new, higher sample size, especially on arterials, but I'm a bit worried about averaging together observations over a very long period. I'd like to be able to extend this to lower-volume local streets at some point too, so making it robust to lower volumes (sparse data) would be ideal.

From your analysis so far, do you have any sense of how often bins end up much longer than one hour?

@Nate-Wessel (Contributor, Author)

I'm working on a quick implementation of a 1-hour rolling window approach here.

The context for the sudden burst of energy here is that it would be really nice to have some solution to this for an urgent DR. The binning problem would be acute because they want things aggregated for each hour.

@gabrielwol

Had to rerun the congestion network aggregation overnight because it turns out I had the 1-hour bin restriction on. Across the congestion network, it looks like 3% of bins are > 1 hour for the midnight-6am time period, and 0-2% for other periods.

SELECT
    date_trunc('month', crs.dt) AS mnth,
    crs.time_grp,
    CASE WHEN date_part('isodow', crs.dt) < 6 THEN 'Weekday' ELSE 'Weekend' END AS isodow,
    COUNT(*) AS num_obs,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval) AS count_longer_1hr,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval)::numeric / COUNT(*) AS frac_longer_1_hr
FROM gwolofs.congestion_raw_segments AS crs
LEFT JOIN ref.holiday ON crs.dt = holiday.dt
WHERE holiday.holiday IS NULL
    --only relevant for multi-hour bins
    AND upper(time_grp) - lower(time_grp) > '1 hour'::interval
GROUP BY date_trunc('month', crs.dt), crs.time_grp, isodow
ORDER BY frac_longer_1_hr DESC

@Nate-Wessel (Contributor, Author)

What's the time-frame on that? All available time (since 2019), or just the last year or so since the sample increase?

@gabrielwol

That query includes both 2023-11 (pre-sample-size increase) and 2024-12. The ~2% values are mostly from 2023-11, and the ~0% values mostly from 2024-12.


Successfully merging this pull request may close these issues.

Subdivide one-hour bins