Dynamically sized temporal bins #128
base: deploy
Conversation
It's just a grouping var anyway, and this makes the intermediate Python output easier to read
no need to do calculations for data we'll throw out in the next step
does this actually save any bandwidth or space? I don't know but it feels like it could
incidentally, I did satisfy my curiosity and verified that the harmonic mean of the speeds is the same as the average of the travel times (see the sketch below)
I've done a bit of manual integration testing to make sure all these merges from deploy didn't break anything. They didn't, as far as I can tell. This should be good to go.
thanks to Gabe for spotting this!
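For context on that check: with a fixed corridor length L, the mean travel time mean(L / v_i) equals L / hmean(v_i), since hmean(v) = n / Σ(1/v_i). A minimal sketch with invented numbers (not data from this repo):

```python
# Check that averaging travel times matches the harmonic mean of speeds.
# The corridor length and speeds below are invented for illustration.
from math import isclose
from scipy.stats import hmean

length_km = 2.0
speeds_kmh = [30.0, 45.0, 60.0]

travel_times_h = [length_km / v for v in speeds_kmh]
avg_travel_time = sum(travel_times_h) / len(travel_times_h)

# mean(L / v_i) == L / hmean(v_i)
assert isclose(avg_travel_time, length_km / hmean(speeds_kmh))
```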
This work, it should be noted, is only scoped to demonstrate the promise of further subdividing what are currently uniform 1-hour bins. It has been in practical, applied use for several months without thorough review, and the cracks are starting to show in this hasty implementation. We may eventually implement a much more robust way of defining bins, but the point of this PR is to demonstrate that it can be done and is worth doing.
My main concern at this point is the lack of a constraint on bin size, as detailed in a comment below.
Also of note: I've implemented an equivalent dynamic binning function in SQL here: https://github.com/CityofToronto/bdit_data-sources/blob/1132-here-aggregation-proposal/here/traffic/sql/function-cache_tt_results.sql, drawing significant inspiration from your code!
backend/app/get_travel_time.py
```python
bin_ends = list()
total_length = links_df['length'].sum()
minimum_length = 0.8 * total_length
for tx in link_speeds_df.tx.unique():
```
As I noted on Slack, there needs to be a constraint on max(tx) - min(tx) within an assembled bin. The example I caught assembles a bin from two successive Thursdays! I think constraining the span to the requested window (start_time to end_time) is appropriate; see the guard sketched after the parameters below.
```python
start_node=30439735
end_node=30440490
start_time=9
end_time=16
start_date='2023-12-01'
end_date='2024-01-01'
include_holidays=True
dow_list=[4]
```
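A minimal sketch of such a guard, assuming timestamps are datetimes and the window bounds are whole hours as in the parameters above (the function name and signature are illustrative, not from get_travel_time.py):

```python
from datetime import timedelta

def bin_within_window(bin_txs, start_time, end_time):
    """Reject bins whose timestamps span more than the requested daily
    window (e.g. 9-16), which rules out bins assembled across two days."""
    span = max(bin_txs) - min(bin_txs)
    return span <= timedelta(hours=end_time - start_time)
```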
Good catch.
I think the constraint needs to be more complex than that, though. Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.
The code would also need some larger structural changes, since bins are currently only defined by their end-points.
> Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.
I thought about this too, but I decided it added too much complexity (and it doesn't even add a new full segment observation; it just adds some link observations to the last bin in each period).
Perhaps so.
What do you think of this possible approach? Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes), use a rolling window capped at one hour, where observations from earlier 5-minute bins are discarded if the rolling window exceeds one hour and still doesn't have sufficient link length. A sketch follows below.
This sidesteps a lot of possible complexity around time and date boundaries for the query, like how start/end times can wrap around midnight.
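A minimal sketch of that idea, assuming observations arrive as sorted 5-minute timestamps (covered_length is a hypothetical helper, not something in this repo):

```python
from collections import deque
from datetime import timedelta

MAX_WINDOW = timedelta(hours=1)  # assumed cap; not settled in this thread

def rolling_bins(txs, covered_length, minimum_length):
    """txs: sorted 5-minute bin timestamps. covered_length is a
    hypothetical helper returning the corridor length covered by the
    links observed within the given window of timestamps."""
    window, bins = deque(), []
    for tx in txs:
        window.append(tx)
        # drop the oldest 5-minute bins once the window exceeds one hour
        while window[-1] - window[0] > MAX_WINDOW:
            window.popleft()
        if covered_length(window) >= minimum_length:
            bins.append((window[0], window[-1]))
            window.clear()  # start accumulating the next bin
    return bins
```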
> Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes)...
I would be reluctant to change this core part of the algorithm (for minimal gain) after spending some time implementing it 😅
RE:
> complexity around time and date boundaries for the query, like how start/end times can wrap around midnight
I didn't deal with periods overlapping midnight yet (that probably won't be a standard aggregation?), but as long as you group by date + start_time + end_time before binning, it shouldn't be a problem. You need to make sure all the bins belong to the same window before assembling them! Something like the grouping sketched below.
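A hypothetical pandas sketch of that grouping (the column names and assemble_bins are illustrative, not from the repo):

```python
# Assemble bins separately per date and per requested window, so a
# dynamic bin can never straddle two dates or two windows.
for (dt, start_time, end_time), window_df in link_speeds_df.groupby(
    ['dt', 'start_time', 'end_time']
):
    assemble_bins(window_df)  # hypothetical per-window binning routine
```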
It might be an edge case given the new higher sample size, especially on arterials, but I'm a bit worried about averaging together observations over a very long period. I'd like to be able to extend this to lower-volume local streets as well at some point, so making it robust to lower volumes (sparse data) would be ideal.
From your analysis so far, do you have any sense of how often it happens that bins are much longer than 1 hour?
I'm working on a quick implementation of a 1-hour rolling window approach here.
The context for the sudden burst of energy here is that it would be really nice to have some solution to this for an urgent DR. The binning problem would be acute because they want things aggregated for each hour.
Had to rerun the congestion network aggregation overnight because it turns out I had the 1-hour bin restriction on. Across the congestion network, it looks like 3% of bins are > 1 hr for the midnight-6am time period, and 0-2% for other periods.
```sql
SELECT
    date_trunc('month', crs.dt) AS mnth,
    crs.time_grp,
    CASE WHEN date_part('isodow', crs.dt) < 6 THEN 'Weekday' ELSE 'Weekend' END AS isodow,
    COUNT(*) AS num_obs,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval) AS count_longer_1hr,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval)::numeric / COUNT(*) AS frac_longer_1_hr
FROM gwolofs.congestion_raw_segments AS crs
LEFT JOIN ref.holiday ON crs.dt = holiday.dt
WHERE
    holiday.holiday IS NULL
    --only relevant for multi-hour bins
    AND upper(time_grp) - lower(time_grp) > '1 hour'::interval
GROUP BY date_trunc('month', crs.dt), crs.time_grp, isodow
ORDER BY frac_longer_1_hr DESC;
```
What's the time-frame on that? All available time (since 2019), or just the last year or so since the sample increase?
That query includes 2023-11 (pre-sample size increase) and 2024-12. The ~2%s are mostly 2023-11 and the ~0%s are mostly 2024-12.
This creates dynamic time bins by taking the preexisting definition of "enough data" for a bin (80% of total corridor length has at least some data) and, instead of applying it as a criterion to predefined one-hour bins, applying it in a rolling window over a request's time range.
Data exist as 5-minute bins in the database, and we step through these one by one until we have enough of them in sequence to say there is enough data for the requested corridor. Once the accumulating bin reaches the threshold, we step to the next 5-minute bin and start accumulating data for the next observation; the core loop is sketched below.
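A sketch of that loop, keeping the first few variable names from the get_travel_time.py fragment quoted above; the link_dir column and the coverage bookkeeping are assumptions for illustration, not the exact implementation:

```python
bin_ends = list()
total_length = links_df['length'].sum()
minimum_length = 0.8 * total_length

# assumed lookup from link id to its length, derived from links_df
link_lengths = links_df.set_index('link_dir')['length']

observed_links = set()   # links seen so far in the bin being assembled
covered_length = 0.0
for tx in sorted(link_speeds_df.tx.unique()):
    # fold this 5-minute bin's links into the running coverage
    for link in link_speeds_df.loc[link_speeds_df.tx == tx, 'link_dir']:
        if link not in observed_links:
            observed_links.add(link)
            covered_length += link_lengths[link]
    if covered_length >= minimum_length:
        bin_ends.append(tx)       # bins are defined only by their end-points
        observed_links.clear()    # start accumulating the next observation
        covered_length = 0.0
```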
The resulting time bins vary in length from 5 minutes (the minimum) to much longer (a maximum has yet to be defined), but all meet our criterion for having sufficient data.
The result is many more observations than the one-hour approach yields. These bins also, as expected, exhibit greater variability: there is less reversion to the mean within bins, and greater temporal variability is visible at the increased temporal resolution.