-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frequencies_to_stop_times
slow execution speed in large datasets
#89
Comments
Ok -- In my use case, following the third option above (not exactly, I made a couple errors) sped execution time ~100x. It still drags toward the end bc it's accumulating to a progressively larger and larger data.table, but it's enough of a boost to make my use case work. My fork is here: https://github.com/cseveren/gtfs-router - Let me know if you'd like me to submit a pull request. NB: I also added display tracker/counter on original trips. |
Bringing the discussion that arose in r5r's repo to this issue: The development version of path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfstools_gtfs <- gtfstools::read_gtfs(path)
gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
microbenchmark::microbenchmark(
cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs),
cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs),
times = 5L
)
#> Unit: milliseconds
#> expr
#> cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
#> cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
#> min lq mean median uq max neval
#> 777.2107 793.785 875.153 925.58 937.2485 941.9406 5
#> 26631.9393 31469.455 31593.529 32363.29 32884.5130 34618.4487 5 Also, there seems to be a bug in cvt_gtfstools_gtfs$frequencies
#> NULL
cvt_gtfsrouter_gtfs$frequencies
#> trip_id start_time end_time headway_secs
#> 1: CPTM L07-0 14400 17940 720
#> 2: CPTM L07-0 18000 21540 360
#> 3: CPTM L07-0 21600 25140 360
#> 4: CPTM L07-0 25200 28740 360
#> 5: CPTM L07-0 28800 32340 360
#> ---
#> 700: 5290-10-1 79200 82740 1200
#> 701: 5290-10-1 82800 86340 1200
#> 702: 6450-51-0 18000 21540 3600
#> 703: 6450-51-0 21600 25140 3600
#> 704: 6450-51-0 25200 28740 3600 It would be interesting to hear from @cseveren if P.S.: I don't want to sound arrogant/snob coming to this issue and showing that "my function" is faster, or anything like that. Before implementing |
Much faster than my hack-y solution! Took only 20-30 seconds. |
Wow, thanks @cseveren for raising this important issue, and especially thanks to @dhersz for chiming in and offering a clearly superior solution, for which you definitely do not sound arrogant 😆 I was originally going to just ask for a PR of the My suggestion would then be for me to start doing that in a branch here which we can use to compare with the current |
Those commits finally finish converting the "frequencies_to_stop_times" routine to C++. It's slightly faster than @dhersz's @dhersz Our two packages do not give identical results, but largely because this procedure requires a few seemingly arbitrary decisions to be made. These notably include:
|
Re-opening because there is one pronounced negative effect on other routines. The routine implemented here appends |
That mention refers to several commits which included erroneous reference to that issue instead of this one |
TODO:
|
The above commit converts the C++ code to use vectors throughout instead of nested lists, and speeds up the sample feed by a factor of around 5, so was definitely worth doing. TODO:
|
The above commits should be enough to close this issue. @dhersz Here is a reprex applying the library (gtfsrouter)
packageVersion ("gtfsrouter")
#> [1] '0.0.5.123'
path <- "<path>/<to>/santiago-gtfs.zip"
gtfstools_gtfs <- gtfstools::read_gtfs(path)
#> Registered S3 method overwritten by 'gtfsio':
#> method from
#> summary.gtfs gtfsrouter
gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
#> Warning: This feed contains no transfers.txt
#> A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
system.time (
cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
)
#> user system elapsed
#> 65.897 0.434 61.148
nrow (cvt_gtfstools_gtfs$stop_times)
#> [1] 6636076
system.time (
cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
)
#> user system elapsed
#> 10.398 0.060 10.423
nrow (cvt_gtfsrouter_gtfs$stop_times)
#> [1] 6262567 Created on 2022-10-05 with reprex v2.0.2 And this new function is around 6 times faster than library (gtfsrouter)
path <- "<path>/<to>/santiago-gtfs.zip"
gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
gtfs$frequencies [, start_time := rcpp_time_to_seconds (start_time)]
gtfs$frequencies [, end_time := rcpp_time_to_seconds (end_time)]
gtfs$stop_times$timepoint <- 1L
freq_trips <- unique (gtfs$frequencies$trip_id)
gtfs$stop_times$timepoint [which (gtfs$stop_times$trip_id %in% freq_trips)] <- 0L
f_stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 0L, ]
gtfs$stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 1L, ]
freqs <- gtfs$frequencies
freqs$nseq <- ceiling ((freqs$end_time - freqs$start_time) / freqs$headway_secs)
n <- sum (freqs$nseq)
# plus total numbers of timetable entries:
trip_id_table <- table (f_stop_times$trip_id)
index <- match (freqs$trip_id, names (trip_id_table))
freqs$num_tt_entries <- trip_id_table [index]
num_tt_entries_exp <- sum (freqs$num_tt_entries * freqs$nseq)
num_tt_entries_exp + nrow (gtfs$stop_times)
#> [1] 6262567 Created on 2022-10-05 with reprex v2.0.2 I'll leave open for a few more days, in case anyone wants to try this out and report back, otherwise i'll close soon. Thanks for all the input! |
@mpadge thanks for looking into this issue so thoroughly and sorry for such a late reply - I had been working on another project the last few months and had a real hard time switching contexts. The new gtfsrouter function looks great and I'm glad it's so much faster than gtfstools' and gtfsrouter's previous one! It's great that our packages are pushing each other to be faster and more reliable. I'll try to play with Santiago's GTFS to see why there such a big discrepancy between the results of the two functions.
If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have a trip at 6:00, 6:07, 6:14, 6:21 amd 6:28. Your solution includes adding a trip departing at 6:30, is that correct? Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00? |
No worries @dhersz, any reply is greatly appreciated no matter how long it might take. So thanks!
Yep, that's is precisely what happens, and that is exactly the solution I've opted for.
Departure values for the first stops of any service described in frequency tables should all be "00:00:00", and generally are. The subsequent times then just specify the travel times from that first stop. And those trips should then have "exact_times = 0" values in the frequencies table, but sometimes have "exact_times = 1". That does not make any sense (excepting of course the extremely rare case where an trip does indeed start at exactly that time). So my proposal is simply to ignore the "exact_times" flags completely. The entries in the frequencies table can be used to populate timetables, and any other entries in "stop_times" that specify times outside the windows given in "frequencies" will still be picked up as effectively having "exact_times = 1". So that flag is really not necessary anyway. Does that make sense? |
Cool. I don't explicitly add trips at the end_time defined in the frequencies table as you do, but the end_time of a specific frequencies entry frequently is the start_time of the subsequent entry, in which I add a trip. So I don't think that explains such a large difference between the outputs.
I think we have different interpretations of what exact_times = 0/1 means. First, I didn't know that that the departure time at the first stop of a trip described in the frequencies table should be 00:00:00. I checked the specification and there doesn't seem to be any spec about that in it, but there's in fact a recommendation made at GTFS Best Practices, by MobilityData. Cool! To be honest, in my experience I've never seen feeds abiding to this practice, but it's good to know about it. In my understanding, the exact_times value controls whether the headway is precisely taken into account, not the travel times between stops. Again in my understanding, the travel time between stops should always be considered exactly as described (like your implementation does consider), and whether this departure and arrival times are in practice strictly adhered to is controlled by the timepoint field. Quoting the spec:
So, for example, a trip whose frequencies entry is start_time = 04:00:00, end_time = 05:00:00, headway = 12 minutes and exact_times = 1 should be the equivalent of trips departing at the first stop at 04:00, 04:12, 04:24, ..., 04:48, 05:00. If the same entry had exact_times = 0 the operators would not necessarily be able to maintain the 12 minutes headway exactly, so we could have trips departing at, let's say 04:00:15, 04:13:00, 04:24:30, ..., 04:47:00, 04:59:50. The gtfstools issue I linked above refers exactly to this "random" aspect of exact_times = 0, and possible strategies to use when this value occurs. Summarizing all of this, I'd like to re-state that this is my interpretation, and would like to know if you agree with it. Currently I think both our implementations should yield near identical results (with the exception of same entries whose subsequent frequencies entry does not start exactly when the previous one stops), so I'll investigate further the difference between their results. |
I haven't yet tested with Santiago's GTFS, but I'm playing with SPTrans' GTFS (shipped with gtfstools) and I'm already seeing some discrepancies between gtfsrouter and gtfstools function. Opposite to Santiago's case, however, in my tests gtfsrouter function is returning a much larger stop_times table. Some bugs/problems I'm currently seeing with the functions:
data_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- gtfsrouter::extract_gtfs(data_path)
#> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive
#> Warning: This feed contains no transfers.txt
#> A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds
converted <- gtfsrouter::frequencies_to_stop_times(gtfs)
head(converted$stop_times)
#> trip_id arrival_time departure_time stop_id stop_sequence
#> 1: CPTM L07-0_f0 28800 28800 18940 1
#> 2: CPTM L07-0_f0 29280 29280 18920 2
#> 3: CPTM L07-0_f0 29760 29760 18919 3
#> 4: CPTM L07-0_f0 30240 30240 18917 4
#> 5: CPTM L07-0_f0 30720 30720 18916 5
#> 6: CPTM L07-0_f0 31200 31200 18965 6
head(converted$frequencies)
#> trip_id start_time end_time headway_secs
#> 1: CPTM L07-0 14400 17940 720
#> 2: CPTM L07-0 18000 21540 360
#> 3: CPTM L07-0 21600 25140 360
#> 4: CPTM L07-0 25200 28740 360
#> 5: CPTM L07-0 28800 32340 360
#> 6: CPTM L07-0 32400 35940 480
converted$stop_times[trip_id == "CPTM L07-0"]
#> Empty data.table (0 rows and 5 cols): trip_id,arrival_time,departure_time,stop_id,stop_sequence
head(converted$stop_times, 36)
#> trip_id arrival_time departure_time stop_id stop_sequence
#> 1: CPTM L07-0_f0 28800 28800 18940 1
#> 2: CPTM L07-0_f0 29280 29280 18920 2
#> 3: CPTM L07-0_f0 29760 29760 18919 3
#> 4: CPTM L07-0_f0 30240 30240 18917 4
#> 5: CPTM L07-0_f0 30720 30720 18916 5
#> 6: CPTM L07-0_f0 31200 31200 18965 6
#> 7: CPTM L07-0_f0 31680 31680 18923 7
#> 8: CPTM L07-0_f0 32160 32160 18922 8
#> 9: CPTM L07-0_f0 32640 32640 4114459 9
#> 10: CPTM L07-0_f0 33120 33120 18921 10
#> 11: CPTM L07-0_f0 33600 33600 18924 11
#> 12: CPTM L07-0_f0 34080 34080 18925 12
#> 13: CPTM L07-0_f0 34560 34560 18926 13
#> 14: CPTM L07-0_f0 35040 35040 18971 14
#> 15: CPTM L07-0_f0 35520 35520 18972 15
#> 16: CPTM L07-0_f0 36000 36000 18973 16
#> 17: CPTM L07-0_f0 36480 36480 18974 17
#> 18: CPTM L07-0_f0 36960 36960 18975 18
#> 19: CPTM L07-0_f0 32400 32400 18940 1
#> 20: CPTM L07-0_f0 32880 32880 18920 2
#> 21: CPTM L07-0_f0 33360 33360 18919 3
#> 22: CPTM L07-0_f0 33840 33840 18917 4
#> 23: CPTM L07-0_f0 34320 34320 18916 5
#> 24: CPTM L07-0_f0 34800 34800 18965 6
#> 25: CPTM L07-0_f0 35280 35280 18923 7
#> 26: CPTM L07-0_f0 35760 35760 18922 8
#> 27: CPTM L07-0_f0 36240 36240 4114459 9
#> 28: CPTM L07-0_f0 36720 36720 18921 10
#> 29: CPTM L07-0_f0 37200 37200 18924 11
#> 30: CPTM L07-0_f0 37680 37680 18925 12
#> 31: CPTM L07-0_f0 38160 38160 18926 13
#> 32: CPTM L07-0_f0 38640 38640 18971 14
#> 33: CPTM L07-0_f0 39120 39120 18972 15
#> 34: CPTM L07-0_f0 39600 39600 18973 16
#> 35: CPTM L07-0_f0 40080 40080 18974 17
#> 36: CPTM L07-0_f0 40560 40560 18975 18
#> trip_id arrival_time departure_time stop_id stop_sequence
head(converted$trips)
#> route_id service_id trip_id trip_headsign direction_id shape_id
#> 1: CPTM L07 USD CPTM L07-0 JUNDIAI 0 17846
#> 2: CPTM L07 USD CPTM L07-1 LUZ 1 17847
#> 3: CPTM L08 USD CPTM L08-0 AMADOR BUENO 0 17848
#> 4: CPTM L08 USD CPTM L08-1 JULIO PRESTES 1 17849
#> 5: CPTM L09 USD CPTM L09-0 GRAJAU 0 17850
#> 6: CPTM L09 USD CPTM L09-1 OSASCO 1 17851 |
Thanks @dhersz, I'm addressing most of those issues now. Before i start ...
times <- as.integer (runif (1e6, 0, 3600 * 12 - 1))
b <- bench::mark (
hms::hms (times),
gtfstools:::seconds_to_string (times),
check = FALSE,
time_unit = "ms"
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
b [, 1:5]
#> # A tibble: 2 × 5
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt>
#> 1 hms::hms(times) 1.48 1.67 328. 8.32MB
#> 2 gtfstools:::seconds_to_string(times) 961. 961. 1.04 135.7MB
b$median [2] / b$median [1]
#> [1] 575.8945 Created on 2022-12-14 with reprex v2.0.2 I'd suggest doing the same. C++ string manipulation like you've got in |
TODO
@dhersz Note that the frequencies table is kept for the moment, because i need to figure out how it's going to be used to re-generate timetables based on random headways. |
Thanks for the awesome and helpful package! I get a lot of value out of
frequencies_to_stop_times
, but I've noticed that it's quite slow when working at scale (i.e., it takes several hours on my reasonably powerful desktop to go from 37k stop times and 1k trips to 4.4mil stop times and 140k trips for recent Mexico City freq-based GTFS feeds). This is almost assuredly due to the use ofrbind
here to append row by row onto the eventually quite hugestop_times
data.table which is used to store and eventually output results:I propose three alternatives, which I've tried to implement locally on my machine but I've not worked with
rcpp
before and so I couldn't make work. I can't quite tell ifstop_times_trip_i
is always 1 row long or not. If so:stop_times
with some upper bound number of rows (perhaps user set), and then simply replace the above with:Else (if it's number of rows vary), here are two other ideas:
stop_times
with some upper bound number of rows (perhaps user set), and replace a group of rows, as aboverbind
, there's an opportunity for a couple of levels of recursion. Specifically, thefor trips
block could be (sorry for the stupid names):This last option, while not totally efficient, will probably substantially increase performance without needing to explicitly deal with preallocation. Again, thanks for the great package!
The text was updated successfully, but these errors were encountered: