Dynamically sized temporal bins #128
base: deploy
Conversation
It's just a grouping var anyway, and this makes the intermediate Python output easier to read
no need to do calculations for data we'll throw out in the next step
does this actually save any bandwidth or space? I don't know but it feels like it could
incidentally, I did satisfy my curiosity and verified that the harmonic mean of the speeds is the same as the average of the travel times (see the sketch below)
I've done a bit of manual integration testing to make sure all these merges from deploy didn't break anything. They didn't, as far as I can tell. This should be good to go.
thanks to Gabe for spotting this!
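For context on that check: with a fixed corridor length L, the mean travel time mean(L / v_i) equals L / hmean(v_i), since hmean(v) = n / Σ(1/v_i). A minimal sketch with invented numbers (not data from this repo):

```python
# Check that averaging travel times matches the harmonic mean of speeds.
# The corridor length and speeds below are invented for illustration.
from math import isclose
from scipy.stats import hmean

length_km = 2.0
speeds_kmh = [30.0, 45.0, 60.0]

travel_times_h = [length_km / v for v in speeds_kmh]
avg_travel_time = sum(travel_times_h) / len(travel_times_h)

# mean(L / v_i) == L / hmean(v_i)
assert isclose(avg_travel_time, length_km / hmean(speeds_kmh))
```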
This work, it should be noted, is only scoped to demonstrate the promise of further subdividing what are currently uniform 1-hour bins. It has been in practical, applied use for several months without thorough review, and the cracks are starting to show in this hasty implementation. We may eventually implement a much more robust way of defining bins, but the point of this PR is to demonstrate that it can be done and is worth doing.
My main concern at this point is the lack of a constraint on bin size, as detailed in a comment below.
Also of note: I've implemented an equivalent dynamic binning function in SQL here: https://github.com/CityofToronto/bdit_data-sources/blob/1132-here-aggregation-proposal/here/traffic/sql/function-cache_tt_results.sql, drawing significant inspiration from your code!
backend/app/get_travel_time.py
```python
bin_ends = list()
total_length = links_df['length'].sum()
minimum_length = 0.8 * total_length
for tx in link_speeds_df.tx.unique():
```
As I noted on Slack, there needs to be a constraint on max(tx) - min(tx) within an assembled bin. The example I caught assembles a bin from two successive Thursdays! I think constraining the span to the requested window (start_time to end_time) is appropriate; see the guard sketched after the parameters below.
```python
start_node=30439735
end_node=30440490
start_time=9
end_time=16
start_date='2023-12-01'
end_date='2024-01-01'
include_holidays=True
dow_list=[4]
```
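A minimal sketch of such a guard, assuming timestamps are datetimes and the window bounds are whole hours as in the parameters above (the function name and signature are illustrative, not from get_travel_time.py):

```python
from datetime import timedelta

def bin_within_window(bin_txs, start_time, end_time):
    """Reject bins whose timestamps span more than the requested daily
    window (e.g. 9-16), which rules out bins assembled across two days."""
    span = max(bin_txs) - min(bin_txs)
    return span <= timedelta(hours=end_time - start_time)
```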
Good catch.
I think the constraint needs to be more complex than that, though. Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.
The code would also need some larger structural changes, since bins are currently only defined by their end-points.
> Depending on where we draw the lines, we could lose data that might more profitably be lumped in with the previous bin.
I thought about this too, but I decided it added too much complexity (and it doesn't even add a new full segment observation; it just adds some link observations to the last bin in each period).
Perhaps so.
What do you think of this possible approach? Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes), use a rolling window capped at one hour, where observations from earlier 5-minute bins are discarded if the rolling window exceeds one hour and still doesn't have sufficient link length. A sketch follows below.
This sidesteps a lot of possible complexity around time and date boundaries for the query, like how start/end times can wrap around midnight.
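A minimal sketch of that idea, assuming observations arrive as sorted 5-minute timestamps (covered_length is a hypothetical helper, not something in this repo):

```python
from collections import deque
from datetime import timedelta

MAX_WINDOW = timedelta(hours=1)  # assumed cap; not settled in this thread

def rolling_bins(txs, covered_length, minimum_length):
    """txs: sorted 5-minute bin timestamps. covered_length is a
    hypothetical helper returning the corridor length covered by the
    links observed within the given window of timestamps."""
    window, bins = deque(), []
    for tx in txs:
        window.append(tx)
        # drop the oldest 5-minute bins once the window exceeds one hour
        while window[-1] - window[0] > MAX_WINDOW:
            window.popleft()
        if covered_length(window) >= minimum_length:
            bins.append((window[0], window[-1]))
            window.clear()  # start accumulating the next bin
    return bins
```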
> Instead of accumulating time bins until the length threshold is reached (regardless of how long the bin becomes)...
I would be reluctant to change this core part of the algorithm (for minimal gain) after spending some time implementing it 😅
RE:
> complexity around time and date boundaries for the query, like how start/end times can wrap around midnight
I didn't deal with periods overlapping midnight yet (that probably won't be a standard aggregation?), but as long as you group by date + start_time + end_time before binning, it shouldn't be a problem. You need to make sure all the bins belong to the same window before assembling them! Something like the grouping sketched below.
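A hypothetical pandas sketch of that grouping (the column names and assemble_bins are illustrative, not from the repo):

```python
# Assemble bins separately per date and per requested window, so a
# dynamic bin can never straddle two dates or two windows.
for (dt, start_time, end_time), window_df in link_speeds_df.groupby(
    ['dt', 'start_time', 'end_time']
):
    assemble_bins(window_df)  # hypothetical per-window binning routine
```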
It might be an edge case given the new higher sample size, especially on arterials, but I'm a bit worried about averaging together observations over a very long period. I'd like to be able to extend this to lower-volume local streets as well at some point, so making it robust to lower volumes (sparse data) would be ideal.
From your analysis so far, do you have any sense of how often it happens that bins are much longer than 1 hour?
I'm working on a quick implementation of a 1-hour rolling window approach here.
The context for the sudden burst of energy here is that it would be really nice to have some solution to this for an urgent DR. The binning problem would be acute because they want things aggregated for each hour.
Had to rerun the congestion network aggregation overnight because it turns out I had the 1-hour bin restriction on. Across the congestion network, it looks like 3% of bins are > 1 hr for the midnight-6am time period, and 0-2% for other periods.
```sql
SELECT
    date_trunc('month', crs.dt) AS mnth,
    crs.time_grp,
    CASE WHEN date_part('isodow', crs.dt) < 6 THEN 'Weekday' ELSE 'Weekend' END AS isodow,
    COUNT(*) AS num_obs,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval) AS count_longer_1hr,
    COUNT(*) FILTER (WHERE upper(bin_range) - lower(bin_range) > '1 hour'::interval)::numeric / COUNT(*) AS frac_longer_1_hr
FROM gwolofs.congestion_raw_segments AS crs
LEFT JOIN ref.holiday ON crs.dt = holiday.dt
WHERE
    holiday.holiday IS NULL
    --only relevant for multi-hour bins
    AND upper(time_grp) - lower(time_grp) > '1 hour'::interval
GROUP BY date_trunc('month', crs.dt), crs.time_grp, isodow
ORDER BY frac_longer_1_hr DESC;
```
What's the time-frame on that? All available time (since 2019), or just the last year or so since the sample increase?
That query includes 2023-11 (pre-sample size increase) and 2024-12. The ~2%s are mostly 2023-11 and the ~0%s are mostly 2024-12.
This creates dynamic time bins by taking the preexisting definition of "enough data" for a bin (80% of total corridor length has at least some data) and, instead of applying it as a criterion to predefined one-hour bins, applying it in a rolling window over a request's time range.
Data exist as 5-minute bins in the database, and we step through these one by one until we have enough of them in sequence to say there is enough data for the requested corridor. Once the accumulating bin reaches the threshold, we step to the next 5-minute bin and start accumulating data for the next observation; the core loop is sketched below.
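A sketch of that loop, keeping the first few variable names from the get_travel_time.py fragment quoted above; the link_dir column and the coverage bookkeeping are assumptions for illustration, not the exact implementation:

```python
bin_ends = list()
total_length = links_df['length'].sum()
minimum_length = 0.8 * total_length

# assumed lookup from link id to its length, derived from links_df
link_lengths = links_df.set_index('link_dir')['length']

observed_links = set()   # links seen so far in the bin being assembled
covered_length = 0.0
for tx in sorted(link_speeds_df.tx.unique()):
    # fold this 5-minute bin's links into the running coverage
    for link in link_speeds_df.loc[link_speeds_df.tx == tx, 'link_dir']:
        if link not in observed_links:
            observed_links.add(link)
            covered_length += link_lengths[link]
    if covered_length >= minimum_length:
        bin_ends.append(tx)       # bins are defined only by their end-points
        observed_links.clear()    # start accumulating the next observation
        covered_length = 0.0
```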
The resulting time bins vary in length from 5 minutes (the minimum) to much longer (a maximum has yet to be defined), but all meet our criterion for having sufficient data.
The result is many more observations than the one-hour approach yields. These bins also, as expected, exhibit greater variability: there is less reversion to the mean within bins, and greater temporal variability is visible at the increased temporal resolution.