`frequencies_to_stop_times` slow execution speed in large datasets #89

cseveren · 2022-01-21T00:28:10Z

Thanks for the awesome and helpful package! I get a lot of value out of `frequencies_to_stop_times`, but I've noticed that it's quite slow at scale: it takes several hours on my reasonably powerful desktop to go from 37k stop times and 1k trips to 4.4 million stop times and 140k trips for recent frequency-based GTFS feeds for Mexico City. This is almost certainly due to the use of `rbind` to append row by row onto the eventually huge `stop_times` data.table, which is used to store and eventually output the results:

    stop_times <- rbind (stop_times, stop_times_trip_i)

I propose three alternatives. I tried to implement them locally, but I haven't worked with Rcpp before and couldn't get them to work. I can't quite tell whether `stop_times_trip_i` is always one row long. If it is:

Option 1: Preallocate `stop_times` with some upper bound on the number of rows (perhaps user-set), and then simply replace the line above with:
```
 stop_times[i, ] <- stop_times_trip_i
```

Otherwise (if its number of rows varies), here are two other ideas:

Option 2: Preallocate `stop_times` with some upper bound on the number of rows (perhaps user-set), and replace a group of rows at a time, as above.

Option 3: Accumulate results in smaller data.tables and then `rbind` those; there is an opportunity for a couple of levels of nesting. Specifically, the `for (trip in trips)` block could be (sorry for the stupid names):

    for (trip in trips)  {

        # n is to  be added to trip_id
        n <- 1

        # order frequencies table by trip_id and start_time
        frequencies_trip <-
            gtfs_cp$frequencies [order (trip_id, start_time)] [trip_id == trip]

        # in case end_time of the previous period and start_time of the next are
        # equal:
        frequencies_trip [end_time ==
                          data.table::shift (start_time, 1,
                                             type = "lead"),
                            end_time := end_time - 1]

        stop_times_trip <-
            gtfs_cp$stop_times [order (stop_sequence)] [trip_id == trip]

        # in case of the first arrival time > 0, then reset them:
        if (stop_times_trip [1] [["arrival_time"]] > 0)  {
            stop_times_trip [, c ("arrival_time", "departure_time") :=
                             list(
                arrival_time - stop_times_trip [1] [["arrival_time"]],
                departure_time - stop_times_trip [1] [["arrival_time"]]
                )]
        }

        start_t <- min (frequencies_trip$start_time)

        headway <- headway_old <- frequencies_trip [1] [["headway_secs"]]

        ## NEW LINE ##
        # empty accumulator (same columns as stop_times) for this trip
        stop_times_small <- stop_times [0]

        for (i in seq_len (nrow (frequencies_trip)))  {

            end_t <- frequencies_trip [i] [["end_time"]]
            headway <- frequencies_trip [i] [["headway_secs"]]

            # in order to ensure a 'smooth' transition between frequency periods
            ifelse (
                headway_old - start_t +
                    frequencies_trip [i] [["start_time"]] < headway,
                start_t <- start_t - headway_old + headway,
                start_t <- frequencies_trip [i] [["start_time"]]
            )

            ## NEW LINE ##
            # empty accumulator for this frequency period
            stop_times_small_small <- stop_times [0]

            # multiply stop_times for all trips based on a given frequency
            while (start_t < end_t)  {

                stop_times_trip_i <-
                    data.table::copy (stop_times_trip) [, c ("arrival_time",
                                                             "departure_time",
                                                             "trip_id_f") :=
                            list ((arrival_time + start_t),
                                  (departure_time + start_t),
                                  paste (trip_id, n, sep = "_"))]

                n <- n + 1

                # EDITED BELOW #
                stop_times_small_small <- rbind (stop_times_small_small, stop_times_trip_i)

                start_t <- start_t + headway
            }
            # EDITED BELOW #
            stop_times_small <- rbind (stop_times_small, stop_times_small_small)

            headway_old <- headway

        }
        # EDITED BELOW #
        stop_times <- rbind (stop_times, stop_times_small)

    }

This last option, while not totally efficient, will probably substantially increase performance without needing to explicitly deal with preallocation. Again, thanks for the great package!
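
A further variant of the accumulation idea can be sketched in base R: collect every generated trip in a pre-sized list and bind once at the end, so that no table is re-copied inside the loop. This is only an illustration under assumed toy inputs; the names `template`, `start_times`, and `expand_trip` are made up here and are not gtfsrouter code. With data.table, `data.table::rbindlist (pieces)` would take the place of `do.call (rbind, pieces)`.

```r
# Toy stop_times template for one frequency-based trip (times in seconds
# relative to the first stop). All values are invented for illustration.
template <- data.frame (
    trip_id = "A",
    arrival_time = c (0L, 300L, 720L),
    departure_time = c (0L, 330L, 720L),
    stop_sequence = 1:3
)

# Hypothetical helper: shift the template by each start time, store every
# copy in a pre-sized list, and bind only once at the end.
expand_trip <- function (template, start_times) {
    pieces <- vector ("list", length (start_times))
    for (n in seq_along (start_times)) {
        piece <- template
        piece$arrival_time <- piece$arrival_time + start_times [n]
        piece$departure_time <- piece$departure_time + start_times [n]
        piece$trip_id <- paste (template$trip_id, n, sep = "_")
        pieces [[n]] <- piece
    }
    do.call (rbind, pieces) # single bind instead of one rbind per iteration
}

# Departures every 10 minutes from 06:00 to 06:30 (made-up values)
start_times <- seq (6 * 3600, 6 * 3600 + 1800, by = 600)
res <- expand_trip (template, start_times)
nrow (res)
```

The cost of the final bind grows with the total number of rows only once, whereas repeated `rbind` calls copy the growing table on every iteration.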


cseveren · 2022-01-21T15:44:27Z

OK. In my use case, following the third option above (not exactly; I made a couple of errors) sped up execution ~100x. It still drags toward the end because it's accumulating into a progressively larger data.table, but it's enough of a boost to make my use case work.

My fork is here: https://github.com/cseveren/gtfs-router - Let me know if you'd like me to submit a pull request. NB: I also added display tracker/counter on original trips.

dhersz · 2022-01-26T21:04:21Z

Bringing the discussion that arose in r5r's repo to this issue:

The development version of gtfstools also includes a frequencies_to_stop_times() function. It's currently a bit faster than gtfsrouter's:

    path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")

    gtfstools_gtfs <- gtfstools::read_gtfs(path)
    gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)

    microbenchmark::microbenchmark(
      cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs),
      cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs),
      times = 5L
    )
    #> Unit: milliseconds
    #>                                                                           expr
    #>     cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
    #>  cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
    #>         min        lq      mean   median         uq        max neval
    #>    777.2107   793.785   875.153   925.58   937.2485   941.9406     5
    #>  26631.9393 31469.455 31593.529 32363.29 32884.5130 34618.4487     5

Also, there seems to be a bug in gtfsrouter's, since it doesn't update the frequencies table:

    cvt_gtfstools_gtfs$frequencies
    #> NULL

    cvt_gtfsrouter_gtfs$frequencies
    #>         trip_id start_time end_time headway_secs
    #>   1: CPTM L07-0      14400    17940          720
    #>   2: CPTM L07-0      18000    21540          360
    #>   3: CPTM L07-0      21600    25140          360
    #>   4: CPTM L07-0      25200    28740          360
    #>   5: CPTM L07-0      28800    32340          360
    #>  ---
    #> 700:  5290-10-1      79200    82740         1200
    #> 701:  5290-10-1      82800    86340         1200
    #> 702:  6450-51-0      18000    21540         3600
    #> 703:  6450-51-0      21600    25140         3600
    #> 704:  6450-51-0      25200    28740         3600

It would be interesting to hear from @cseveren if gtfstools' function works fine for his use case. I could also work on a PR if @mpadge is happy with my implementation and wants to use it.

P.S.: I don't want to sound arrogant or snobbish by coming to this issue and showing that "my function" is faster, or anything like that. Before implementing `frequencies_to_stop_times()` in gtfstools I admittedly looked at gtfsrouter's implementation and took a lot of inspiration from it. This conversation randomly came up in r5r as I was looking at some old open issues, and I decided that moving it to this issue would be beneficial: I'd be able to properly report what seems to me like a bug, and we would be able to discuss both implementations in further detail.

cseveren · 2022-01-26T21:49:47Z

Much faster than my hack-y solution! Took only 20-30 seconds.

mpadge · 2022-01-27T11:03:37Z

Wow, thanks @cseveren for raising this important issue, and especially thanks to @dhersz for chiming in and offering a clearly superior solution, for which you definitely do not sound arrogant 😆 I was originally going to just ask for a PR of the gtfstools function into here (mostly because I want to keep this package as light on dependencies as possible, and importing gtfstools would bring in the ultimately heavy dependency of sf, and a few others). But then I had a bit of a look and realised it might help all of us if I try to rewrite the function in C++. The version here was just assembled as a hack, and the slowness is precisely for the reasons you identified, @cseveren, but that's something that could be totally circumvented by porting the function to C++.

My suggestion would then be for me to start doing that in a branch here which we can use to compare with the current gtfstools implementation. I'll make sure I do it cpp11-style, rather than Rcpp, so it can easily be plugged straight in there if it turns out to offer significant speed gains. Only catch: I'm about to be away from normal routine for a few weeks, so may have little opportunity before March (or later) to get cracking on this. Hope that's okay!

mpadge · 2022-09-29T13:04:58Z

Those commits finally finish converting the "frequencies_to_stop_times" routine to C++. It's slightly faster than @dhersz's gtfstools version, although not by a huge amount (around 20%). I think the big advantage is that this operation is much clearer when coded in C++, and so actually takes much less code as well: around 70 lines of very clear C++ plus 78 lines of R, most of which are input checks. This compares with the gtfstools R function of 336 lines, and the former function here of 170 lines, which (i) was confusing and (ii) didn't work properly.

@dhersz Our two packages do not give identical results, but largely because this procedure requires a few seemingly arbitrary decisions to be made. These notably include:

1. What to do if "stop_time" for a "frequencies" entry is marginally less than some integral of "start_time + n * headway". This does happen, and I've opted to include an additional trip in those cases.
2. What to do when "stop_times" entries for "trip_id" values which are in the frequencies table have "arrival/departure_time" values of "0", yet the "trip_id" values are flagged in "frequencies" tables with "exact_times" values of 1. These latter values of 1 are supposed to indicate exact arrival/departure times, which should be used as given, but obviously arrival/departure time values of 0 are nonsense. I've opted to ignore "exact_times" flags in "frequencies" tables, because these seem to be frequently erroneous.

mpadge · 2022-09-29T13:10:20Z

Re-opening because there is one pronounced negative effect on other routines. The routine implemented here appends _[0-9]+ values to the end of trip_id entries in the "stop_times" table. The internal routine used to construct timetables relies on matching "trip_id" values, and was modified here to grep matches via an lapply call, because there are commonly too many trip_id values to form a single compound grep pattern. This lapply(ptn, grep) call is, however, really slow, so needs to be replaced with a better way of mapping potentially modified trip_id values back to original, unmodified forms.
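
One possible replacement for the per-pattern `grep` can be sketched as follows: strip the appended counter from all ids in a single vectorised `sub` call, then use `match` against the original ids. This is an assumption about how the mapping could be done, not a description of what gtfsrouter ended up implementing. The example ids are made up, and an original `trip_id` that itself ends in `_<digits>` would be over-stripped by this pattern.

```r
# Modified ids as described above: original trip_id plus "_<digits>".
# These example values are invented for illustration.
trip_ids_new <- c ("CPTM L07-0_1", "CPTM L07-0_2", "5290-10-1_17")

# One vectorised regex call removes the appended counter from every id.
trip_ids_orig <- sub ("_[0-9]+$", "", trip_ids_new)

# Position of each modified id in a (toy) table of original ids.
trips <- data.frame (trip_id = c ("5290-10-1", "CPTM L07-0"))
index <- match (trip_ids_orig, trips$trip_id)
index
```

Both `sub` and `match` operate on whole vectors, so the cost no longer scales with one regex search per trip.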

mpadge · 2022-09-30T09:50:21Z

That mention refers to several commits which included an erroneous reference to that issue instead of this one.

mpadge · 2022-09-30T09:51:44Z

TODO:

- Ensure appropriate processing of frequencies entries with repeated "trip_id" values but different "headway".
- Modify C++ code to construct and use single vectors throughout, instead of the current nested lists, in the hope that that will speed up the algorithm.

mpadge · 2022-10-05T12:08:46Z

The above commit converts the C++ code to use vectors throughout instead of nested lists, and speeds up the sample feed by a factor of around 5, so was definitely worth doing.

TODO:

- The routine now relies on hard-coded indexing into the output vectors, but leaves some gaps. Need to fix!

mpadge · 2022-10-05T12:55:44Z

The above commits should be enough to close this issue. @dhersz Here is a reprex applying the gtfstools routine, and this new C++ one, to the ultimate frequencies feed from Santiago, Chile (almost entirely frequency-based):

    library (gtfsrouter)
    packageVersion ("gtfsrouter")
    #> [1] '0.0.5.123'
    path <- "<path>/<to>/santiago-gtfs.zip"
    gtfstools_gtfs <- gtfstools::read_gtfs(path)
    #> Registered S3 method overwritten by 'gtfsio':
    #>   method       from
    #>   summary.gtfs gtfsrouter
    gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
    #> Warning: This feed contains no transfers.txt
    #>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function

    system.time (
        cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
        )
    #>    user  system elapsed
    #>  65.897   0.434  61.148
    nrow (cvt_gtfstools_gtfs$stop_times)
    #> [1] 6636076

    system.time (
        cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
    )
    #>    user  system elapsed
    #>  10.398   0.060  10.423
    nrow (cvt_gtfsrouter_gtfs$stop_times)
    #> [1] 6262567

Created on 2022-10-05 with reprex v2.0.2

So this new function is around 6 times faster than gtfstools. I've also included the sizes of the newly generated timetables, to indicate that there is something awry with the gtfstools size. The following code confirms that the gtfsrouter value is indeed what should be expected (this is taken from the R code here, which relies on using/abusing the permitted "timepoint" flag on GTFS timetables to indicate whether entries are in the "frequencies" table or not):

    library (gtfsrouter)
    path <- "<path>/<to>/santiago-gtfs.zip"
    gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
    gtfs$frequencies [, start_time := rcpp_time_to_seconds (start_time)]
    gtfs$frequencies [, end_time := rcpp_time_to_seconds (end_time)]
    gtfs$stop_times$timepoint <- 1L
    freq_trips <- unique (gtfs$frequencies$trip_id)
    gtfs$stop_times$timepoint [which (gtfs$stop_times$trip_id %in% freq_trips)] <- 0L

    f_stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 0L, ]
    gtfs$stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 1L, ]

    freqs <- gtfs$frequencies
    freqs$nseq <- ceiling ((freqs$end_time - freqs$start_time) / freqs$headway_secs)
    n <- sum (freqs$nseq)
    # plus total numbers of timetable entries:
    trip_id_table <- table (f_stop_times$trip_id)
    index <- match (freqs$trip_id, names (trip_id_table))
    freqs$num_tt_entries <- trip_id_table [index]

    num_tt_entries_exp <- sum (freqs$num_tt_entries * freqs$nseq)
    num_tt_entries_exp + nrow (gtfs$stop_times)
    #> [1] 6262567

Created on 2022-10-05 with reprex v2.0.2
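
To make the expected-count arithmetic above concrete, here is a toy version with invented numbers. The two frequencies entries and the `stops_per_trip` vector are assumptions for illustration only; the formula itself (`ceiling` of window length over headway, multiplied by the number of stops per template trip) is the one used in the code above.

```r
# Two made-up frequencies entries (times in seconds)
freqs <- data.frame (
    trip_id = c ("A", "B"),
    start_time = c (0, 3600),
    end_time = c (1000, 5400),
    headway_secs = c (300, 600)
)
# Assumed number of stop_times rows for each template trip
stops_per_trip <- c (A = 3L, B = 5L)

# Number of trips generated per entry
freqs$nseq <- ceiling ((freqs$end_time - freqs$start_time) / freqs$headway_secs)
# Timetable rows contributed by each generated trip
freqs$num_tt_entries <- unname (stops_per_trip [as.character (freqs$trip_id)])

expected <- sum (freqs$nseq * freqs$num_tt_entries)
expected
#> [1] 27
```

Entry A yields `ceiling (1000 / 300) = 4` trips of 3 stops, and entry B yields `ceiling (1800 / 600) = 3` trips of 5 stops, hence 12 + 15 = 27 rows.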

I'll leave this open for a few more days in case anyone wants to try this out and report back; otherwise I'll close it soon. Thanks for all the input!

dhersz · 2022-12-12T20:08:54Z

@mpadge thanks for looking into this issue so thoroughly, and sorry for such a late reply. I had been working on another project for the last few months and had a really hard time switching contexts. The new gtfsrouter function looks great and I'm glad it's so much faster than gtfstools' and gtfsrouter's previous one! It's great that our packages are pushing each other to be faster and more reliable. I'll try to play with Santiago's GTFS to see why there is such a big discrepancy between the results of the two functions.

> @dhersz Our two packages do not give identical results, but largely because this procedure requires a few seemingly arbitrary decisions to be made. These notably include:
>
> 1. What to do if "stop_time" for a "frequencies" entry is marginally less than some integral of "start_time + n * headway". This does happen, and I've opted to include an additional trip in those cases.
>
> 2. What to do when "stop_times" entries for "trip_id" values which are in the frequencies table have "arrival/departure_time" values of "0", yet the "trip_id" values are flagged in "frequencies" tables with "exact_times" values of 1. These latter values of 1 are supposed to indicate exact arrival/departure times, which should be used as given, but obviously arrival/departure time values of 0 are nonsense. I've opted to ignore "exact_times" flags in "frequencies" tables, because these seem to be frequently erroneous.

If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have trips at 6:00, 6:07, 6:14, 6:21 and 6:28. Your solution includes adding a trip departing at 6:30, is that correct?

Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?

mpadge · 2022-12-13T09:31:08Z

No worries @dhersz, any reply is greatly appreciated no matter how long it might take. So thanks!

> If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have trips at 6:00, 6:07, 6:14, 6:21 and 6:28. Your solution includes adding a trip departing at 6:30, is that correct?

Yep, that is precisely what happens, and that is exactly the solution I've opted for.

> Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?

Departure values for the first stops of any service described in frequency tables should all be "00:00:00", and generally are. The subsequent times then just specify the travel times from that first stop. Those trips should then have "exact_times = 0" values in the frequencies table, but sometimes have "exact_times = 1". That does not make any sense (excepting of course the extremely rare case where a trip does indeed start at exactly that time). So my proposal is simply to ignore the "exact_times" flags completely. The entries in the frequencies table can be used to populate timetables, and any other entries in "stop_times" that specify times outside the windows given in "frequencies" will still be picked up as effectively having "exact_times = 1". So that flag is really not necessary anyway. Does that make sense?

dhersz · 2022-12-13T14:48:49Z

> No worries @dhersz, any reply is greatly appreciated no matter how long it might take. So thanks!
>
> > If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have trips at 6:00, 6:07, 6:14, 6:21 and 6:28. Your solution includes adding a trip departing at 6:30, is that correct?
>
> Yep, that is precisely what happens, and that is exactly the solution I've opted for.

Cool. I don't explicitly add trips at the end_time defined in the frequencies table as you do, but the end_time of a specific frequencies entry frequently is the start_time of the subsequent entry, in which I add a trip. So I don't think that explains such a large difference between the outputs.

> > Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?
>
> Departure values for the first stops of any service described in frequency tables should all be "00:00:00", and generally are. The subsequent times then just specify the travel times from that first stop. Those trips should then have "exact_times = 0" values in the frequencies table, but sometimes have "exact_times = 1". That does not make any sense (excepting of course the extremely rare case where a trip does indeed start at exactly that time). So my proposal is simply to ignore the "exact_times" flags completely. The entries in the frequencies table can be used to populate timetables, and any other entries in "stop_times" that specify times outside the windows given in "frequencies" will still be picked up as effectively having "exact_times = 1". So that flag is really not necessary anyway. Does that make sense?

I think we have different interpretations of what exact_times = 0/1 means.

First, I didn't know that the departure time at the first stop of a trip described in the frequencies table should be 00:00:00. I checked the specification and it doesn't seem to say anything about that, but there is in fact a recommendation in the GTFS Best Practices, by MobilityData. Cool! To be honest, in my experience I've never seen feeds abiding by this practice, but it's good to know about it.

In my understanding, the exact_times value controls whether the headway is precisely taken into account, not the travel times between stops. Again in my understanding, the travel time between stops should always be considered exactly as described (as your implementation does), and whether these departure and arrival times are in practice strictly adhered to is controlled by the timepoint field.

Quoting the spec:

> - Frequency-based service (exact_times=0) in which service does not follow a fixed schedule throughout the day. Instead, operators attempt to strictly maintain predetermined headways for trips.
> - A compressed representation of schedule-based service (exact_times=1) that has the exact same headway for trips over specified time period(s). In schedule-based service operators try to strictly adhere to a schedule.

So, for example, a trip whose frequencies entry is start_time = 04:00:00, end_time = 05:00:00, headway = 12 minutes and exact_times = 1 should be the equivalent of trips departing at the first stop at 04:00, 04:12, 04:24, ..., 04:48, 05:00.

If the same entry had exact_times = 0 the operators would not necessarily be able to maintain the 12 minutes headway exactly, so we could have trips departing at, let's say 04:00:15, 04:13:00, 04:24:30, ..., 04:47:00, 04:59:50. The gtfstools issue I linked above refers exactly to this "random" aspect of exact_times = 0, and possible strategies to use when this value occurs.
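
The two cases in this interpretation can be sketched in a few lines of R. The grid for `exact_times = 1` follows the example above, including a departure at `end_time`. The random offsets for `exact_times = 0` are purely an assumed strategy for illustration (uniform noise of up to one minute), not something either package implements.

```r
start_time <- 4 * 3600 # 04:00:00 in seconds
end_time <- 5 * 3600   # 05:00:00
headway <- 12 * 60     # 12 minutes

# exact_times = 1: departures fall exactly on the headway grid
exact <- seq (start_time, end_time, by = headway)
(exact - start_time) / 60 # minutes after 04:00
#> [1]  0 12 24 36 48 60

# exact_times = 0: one assumed strategy is to perturb each departure by a
# small random offset, here at most 60 seconds either way
set.seed (1)
offsets <- round (runif (length (exact), min = -60, max = 60))
jittered <- exact + offsets
```

Because the offsets are much smaller than the headway, the perturbed departures keep their original order.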

To summarize, I'd like to restate that this is my interpretation, and I would like to know if you agree with it. Currently I think both our implementations should yield near-identical results (with the exception of some entries whose subsequent frequencies entry does not start exactly when the previous one stops), so I'll investigate the difference between their results further.

dhersz · 2022-12-13T16:04:09Z

I haven't yet tested with Santiago's GTFS, but I'm playing with SPTrans' GTFS (shipped with gtfstools) and I'm already seeing some discrepancies between the gtfsrouter and gtfstools functions. Unlike in Santiago's case, however, in my tests the gtfsrouter function is returning a much larger stop_times table.

Some bugs/problems I'm currently seeing with the functions:

gtfstools' function is spending more than half of the total processing time converting times in seconds to times in HH:MM:SS format (i.e., 0 -> "00:00:00", 3600 -> "01:00:00"). This conversion is really simple, so it should definitely be optimized.

gtfsrouter's function is not converting times in seconds back to HH:MM:SS format after conversion. See below:

    data_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")

    gtfs <- gtfsrouter::extract_gtfs(data_path)
    #> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive
    #> Warning: This feed contains no transfers.txt
    #>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
    #> ▶ Extracting GTFS feed✔ Extracted GTFS feed
    #> ▶ Converting stop times to seconds✔ Converted stop times to seconds

    converted <- gtfsrouter::frequencies_to_stop_times(gtfs)

    head(converted$stop_times)
    #>          trip_id arrival_time departure_time stop_id stop_sequence
    #> 1: CPTM L07-0_f0        28800          28800   18940             1
    #> 2: CPTM L07-0_f0        29280          29280   18920             2
    #> 3: CPTM L07-0_f0        29760          29760   18919             3
    #> 4: CPTM L07-0_f0        30240          30240   18917             4
    #> 5: CPTM L07-0_f0        30720          30720   18916             5
    #> 6: CPTM L07-0_f0        31200          31200   18965             6

gtfsrouter's is not removing the frequencies table after converting its entries. I think removing this table makes sense, as the trips described in this table were already converted and don't even exist in the converted GTFS. See below:

    head(converted$frequencies)
    #>       trip_id start_time end_time headway_secs
    #> 1: CPTM L07-0      14400    17940          720
    #> 2: CPTM L07-0      18000    21540          360
    #> 3: CPTM L07-0      21600    25140          360
    #> 4: CPTM L07-0      25200    28740          360
    #> 5: CPTM L07-0      28800    32340          360
    #> 6: CPTM L07-0      32400    35940          480

    converted$stop_times[trip_id == "CPTM L07-0"]
    #> Empty data.table (0 rows and 5 cols): trip_id,arrival_time,departure_time,stop_id,stop_sequence

gtfsrouter's doesn't seem to be creating correct new trip_ids. Many trips that should be different trips are currently described under the same id. See below, for example. These should be two different trips (probably "CPTM L07-0_f0" and "CPTM L07-0_f1", following the same naming convention you used).

    head(converted$stop_times, 36)
    #>           trip_id arrival_time departure_time stop_id stop_sequence
    #>  1: CPTM L07-0_f0        28800          28800   18940             1
    #>  2: CPTM L07-0_f0        29280          29280   18920             2
    #>  3: CPTM L07-0_f0        29760          29760   18919             3
    #>  4: CPTM L07-0_f0        30240          30240   18917             4
    #>  5: CPTM L07-0_f0        30720          30720   18916             5
    #>  6: CPTM L07-0_f0        31200          31200   18965             6
    #>  7: CPTM L07-0_f0        31680          31680   18923             7
    #>  8: CPTM L07-0_f0        32160          32160   18922             8
    #>  9: CPTM L07-0_f0        32640          32640 4114459             9
    #> 10: CPTM L07-0_f0        33120          33120   18921            10
    #> 11: CPTM L07-0_f0        33600          33600   18924            11
    #> 12: CPTM L07-0_f0        34080          34080   18925            12
    #> 13: CPTM L07-0_f0        34560          34560   18926            13
    #> 14: CPTM L07-0_f0        35040          35040   18971            14
    #> 15: CPTM L07-0_f0        35520          35520   18972            15
    #> 16: CPTM L07-0_f0        36000          36000   18973            16
    #> 17: CPTM L07-0_f0        36480          36480   18974            17
    #> 18: CPTM L07-0_f0        36960          36960   18975            18
    #> 19: CPTM L07-0_f0        32400          32400   18940             1
    #> 20: CPTM L07-0_f0        32880          32880   18920             2
    #> 21: CPTM L07-0_f0        33360          33360   18919             3
    #> 22: CPTM L07-0_f0        33840          33840   18917             4
    #> 23: CPTM L07-0_f0        34320          34320   18916             5
    #> 24: CPTM L07-0_f0        34800          34800   18965             6
    #> 25: CPTM L07-0_f0        35280          35280   18923             7
    #> 26: CPTM L07-0_f0        35760          35760   18922             8
    #> 27: CPTM L07-0_f0        36240          36240 4114459             9
    #> 28: CPTM L07-0_f0        36720          36720   18921            10
    #> 29: CPTM L07-0_f0        37200          37200   18924            11
    #> 30: CPTM L07-0_f0        37680          37680   18925            12
    #> 31: CPTM L07-0_f0        38160          38160   18926            13
    #> 32: CPTM L07-0_f0        38640          38640   18971            14
    #> 33: CPTM L07-0_f0        39120          39120   18972            15
    #> 34: CPTM L07-0_f0        39600          39600   18973            16
    #> 35: CPTM L07-0_f0        40080          40080   18974            17
    #> 36: CPTM L07-0_f0        40560          40560   18975            18
    #>           trip_id arrival_time departure_time stop_id stop_sequence
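
For illustration, distinct ids for each generated trip could be built in one vectorised step, as sketched below. The "_f" infix and the zero-based counter follow the naming seen in the output above; the variable names and the counts (18 stops, 2 generated trips) are assumptions taken from this example, not gtfsrouter internals.

```r
trip_id <- "CPTM L07-0"
n_stops <- 18L     # stop_times rows per template trip (as in the output above)
n_new_trips <- 2L  # number of trips generated from the frequencies entries

# One id per generated trip: "CPTM L07-0_f0", "CPTM L07-0_f1"
new_ids <- paste0 (trip_id, "_f", seq_len (n_new_trips) - 1L)

# Repeat each id once per stop so that it lines up with the stacked stop_times
stop_times_ids <- rep (new_ids, each = n_stops)
table (stop_times_ids)
```

With ids assigned this way, rows 1 to 18 carry the first id and rows 19 to 36 the second, so the two trips in the output above would no longer share an id.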

gtfsrouter's is not updating the trips table, see below. It should remove the old trip_ids and add the new ones.

    head(converted$trips)
    #>    route_id service_id    trip_id trip_headsign direction_id shape_id
    #> 1: CPTM L07        USD CPTM L07-0       JUNDIAI            0    17846
    #> 2: CPTM L07        USD CPTM L07-1           LUZ            1    17847
    #> 3: CPTM L08        USD CPTM L08-0  AMADOR BUENO            0    17848
    #> 4: CPTM L08        USD CPTM L08-1 JULIO PRESTES            1    17849
    #> 5: CPTM L09        USD CPTM L09-0        GRAJAU            0    17850
    #> 6: CPTM L09        USD CPTM L09-1        OSASCO            1    17851
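
A minimal base-R sketch of such a trips-table update follows. It assumes a lookup table from old to new ids is available (called `id_map` here, a hypothetical name), and uses a cut-down toy `trips` table; it is not the code that was later added to gtfsrouter.

```r
# Toy trips table with a subset of the columns shown above
trips <- data.frame (
    route_id = c ("CPTM L07", "CPTM L07"),
    trip_id = c ("CPTM L07-0", "CPTM L07-1"),
    trip_headsign = c ("JUNDIAI", "LUZ")
)

# Hypothetical lookup from each original trip_id to its generated ids
id_map <- data.frame (
    trip_id = "CPTM L07-0",
    trip_id_new = c ("CPTM L07-0_f0", "CPTM L07-0_f1")
)

# One row per generated trip, inheriting the attributes of the original trip
expanded <- merge (trips, id_map, by = "trip_id")
expanded$trip_id <- expanded$trip_id_new
expanded$trip_id_new <- NULL

# Keep trips that were not converted, drop the converted originals
kept <- trips [!trips$trip_id %in% id_map$trip_id, ]
trips_new <- rbind (kept, expanded [, names (kept)])
trips_new
```

The result contains the untouched "CPTM L07-1" plus the two generated trips, and no longer contains "CPTM L07-0".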

mpadge · 2022-12-14T15:28:54Z

Thanks @dhersz, I'm addressing most of those issues now. Before I start ...

> gtfstools' is spending more than half of the total processing time converting times in seconds to times in HH:MM:SS format (i.e., 0 -> "00:00:00", 3600 -> "01:00:00"). This function is really simple, so should definitely be optimized.

gtfsrouter just uses hms for that:

    times <- as.integer (runif (1e6, 0, 3600 * 12 - 1))
    b <- bench::mark (
        hms::hms (times),
        gtfstools:::seconds_to_string (times),
        check = FALSE,
        time_unit = "ms"
    )
    #> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    b [, 1:5]
    #> # A tibble: 2 × 5
    #>   expression                              min median `itr/sec` mem_alloc
    #>   <bch:expr>                            <dbl>  <dbl>     <dbl> <bch:byt>
    #> 1 hms::hms(times)                        1.48   1.67    328.      8.32MB
    #> 2 gtfstools:::seconds_to_string(times) 961.   961.        1.04   135.7MB
    b$median [2] / b$median [1]
    #> [1] 575.8945

Created on 2022-12-14 with reprex v2.0.2

I'd suggest doing the same. C++ string manipulation like you've got in gtfstools is not efficient at all - like 600 times less efficient than hms (which is pure R!)
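
For reference, a vectorised conversion to "HH:MM:SS" strings is also possible in base R with `sprintf`. The sketch below uses a made-up function name (`secs_to_hhmmss`), and its speed relative to the two functions benchmarked above has not been measured here.

```r
# Hypothetical helper: convert seconds since midnight to "HH:MM:SS" strings.
# Hours may exceed 23, as GTFS allows for trips running past midnight.
secs_to_hhmmss <- function (secs) {
    secs <- as.integer (secs)
    sprintf (
        "%02d:%02d:%02d",
        secs %/% 3600L,
        (secs %% 3600L) %/% 60L,
        secs %% 60L
    )
}

secs_to_hhmmss (c (0, 3600, 28800, 90061))
#> [1] "00:00:00" "01:00:00" "08:00:00" "25:01:01"
```

Unlike `hms::hms ()`, this returns plain character vectors, which is what a written-out stop_times.txt needs.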

mpadge · 2022-12-14T15:39:31Z

TODO:

- ~~Convert arrival/departure times in "stop_times" back to hms format.~~ No, gtfsrouter always represents times in seconds. This will change via #84 (Use gtfsio package to import feeds), but not (yet) via this issue.
- Fix trip_id entries in the "stop_times" table.
- Update the "trips" table with new "trip_id" values.
- Respect the exact_times parameter in the way described by @dhersz above.

@dhersz Note that the frequencies table is kept for the moment, because I need to figure out how it's going to be used to re-generate timetables based on random headways.

cseveren mentioned this issue Jan 26, 2022

use detailed_itineraries on Frequency-based GTFS feeds ipeaGIT/r5r#181

Closed

mpadge added bug Something isn't working enhancement New feature or request labels Jan 27, 2022

mpadge added a commit that referenced this issue Sep 29, 2022

add src/freq_to_stop_times.cpp for #89

774e82a

mpadge added a commit that referenced this issue Sep 29, 2022

Rcpp interrupt on freq_to_stop_times for #89

0d6da36

mpadge closed this as completed in 22a37c2 Sep 29, 2022

mpadge added a commit that referenced this issue Sep 29, 2022

fix timetable fns for trip_id values modifed in src for #89

761a07a

mpadge added a commit that referenced this issue Sep 29, 2022

fix test-frequencies for #89

24c735f

mpadge added a commit that referenced this issue Sep 29, 2022

fix grep for freq_to_stop_times stop_ids in make_timetable (#89)

235b738

mpadge reopened this Sep 29, 2022

mpadge mentioned this issue Sep 30, 2022

Extremely high RAM requirement for gtfs_transfer_table with large dataset #91

Closed

mpadge added a commit that referenced this issue Sep 30, 2022

enable freq_to_stop_times for same trip_id, different headway for #89

851826e

mpadge added a commit that referenced this issue Oct 5, 2022

more freq_to_stop_times tweaks towards #89

903092b

mpadge added a commit that referenced this issue Oct 5, 2022

fix matching number of timetable entries in freqs table for #89

a292613

mpadge added a commit that referenced this issue Oct 5, 2022

fix indexing in freqs_to_stop_times.cpp for #89

fe5fb08

mpadge added a commit that referenced this issue Dec 15, 2022

separate 'calc_num_new_timetables' fn for #89

ddb44bc

mpadge added a commit that referenced this issue Dec 15, 2022

add 'update_trips_table_with_freqs' for #89

40e715f

mpadge added a commit that referenced this issue Dec 15, 2022

fix 🐛 in update_trips_table_with_freqs for #89

eff916c

mpadge mentioned this issue Jan 2, 2023

frequencies.txt #13

Closed
