---
title: "Introduction to gtfstools"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to gtfstools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Sys.setenv(OMP_THREAD_LIMIT = 2)
```

The General Transit Feed Specification (GTFS) data format defines a common scheme for describing transit systems. It is widely used by transit agencies around the world and consumed by many software applications. The **gtfstools** package makes handling GTFS data in **R** easy and fast, offering many utility functions to read, manipulate, analyze and write transit feeds in this format.

# GTFS feeds

GTFS feeds exist in two main forms: GTFS *static* and GTFS *realtime*. This package allows you to manipulate GTFS *static* feeds, the most common variation. These feeds are collections of `csv`-like files (with a `.txt` extension) contained in a single `.zip` file. A GTFS `.zip` file is composed of at least five required files, but may also contain a few other conditionally required and optional files:

- Required: `agency.txt`, `stops.txt`, `routes.txt`, `trips.txt`, `stop_times.txt`
- Conditionally required: `calendar.txt`, `calendar_dates.txt`, `feed_info.txt`
- Optional: `fare_attributes.txt`, `fare_rules.txt`, `shapes.txt`, `frequencies.txt`, `transfers.txt`, `pathways.txt`, `levels.txt`, `translations.txt`, `attributions.txt`

Please check the official [GTFS reference](https://developers.google.com/transit/gtfs/reference) for more details on the specification.

# Basic usage

Before using **gtfstools** please make sure that you have it installed on your computer. You can download either the most stable version from CRAN...

```{r, eval = FALSE}
install.packages("gtfstools")
```

...or the development version from GitHub.
```{r, eval = FALSE}
install.packages("gtfstools", repos = "https://dhersz.r-universe.dev")

# or
# install.packages("remotes")
remotes::install_github("ipeaGIT/gtfstools")
```

Then attach it to the current R session:

```{r, message = FALSE}
library(gtfstools)
```

A few sample files are included in the package:

```{r}
data_path <- system.file("extdata", package = "gtfstools")
list.files(data_path)
```

- `ggl_gtfs.zip` has been manually built from the [example GTFS feed](https://developers.google.com/transit/gtfs/examples/gtfs-feed) provided by Google. The sample files are licensed under the [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/).
- `spo_gtfs.zip` is a subset of São Paulo's SPTrans feed, available [here](https://www.sptrans.com.br/desenvolvedores/).
- `ber_gtfs.zip` is a subset of Berlin's GTFS, available [here](https://daten.berlin.de/datensaetze/vbb-fahrplandaten-via-gtfs).
- `poa_gtfs.zip` is a subset of Porto Alegre's EPTC feed, available [here](https://dadosabertos.poa.br/dataset/gtfs).

Throughout this demonstration we will be using São Paulo's and Google's feeds.

## Read feeds

**gtfstools** reads feeds as a `list` of `data.table`s, a high-performance version of base **R**'s `data.frame`s. Thus, reading, writing and manipulating GTFS objects created by **gtfstools** is easy and fast even if some of your tables contain a few million rows.

To read a feed use the `read_gtfs()` function. By default the function reads all `.txt` files contained in the main `.zip` file. It may be useful, however, to read only a couple of specific files, especially if you're dealing with big data sets.
To do so, specify which files you want to read in the `files` argument (*without* the `.txt` extension):

```{r}
spo_path <- file.path(data_path, "spo_gtfs.zip")

# default behaviour
spo_gtfs <- read_gtfs(spo_path)
names(spo_gtfs)

# only reads the 'shapes.txt' and 'trips.txt' files
spo_shapes <- read_gtfs(spo_path, files = c("shapes", "trips"))
names(spo_shapes)
```

Please note that date fields are read as columns of class `Date`, instead of being kept as integers (as specified in the [official reference](https://developers.google.com/transit/gtfs/reference)), allowing for easier data manipulation. These columns are converted back to integers when [writing the GTFS objects](#write-feeds) to a `.zip` file, so GTFS files generated by the package always conform to the specification.

## Analyse feeds

**gtfstools** also includes a few functions to prevent you from getting stuck with repetitive tasks:

`get_trip_geometry()` returns the geometry of each trip in a GTFS object as an `sf` object (please check the [`{sf}` webpage](https://r-spatial.github.io/sf/) for more details). GTFS data allows you to generate geometries using two different methods: either converting the shapes described in the `shapes.txt` file to an `sf` object, or linking the consecutive stops of each trip, as described in the `stop_times.txt` file, along a straight line.
While the former tends to yield more reliable and higher resolution geometries, it may be useful to compare the results of both methods to check if the trips described in `stop_times` resemble their actual shapes:

```{r}
trip_geom <- get_trip_geometry(spo_gtfs, file = "shapes")
plot(trip_geom$geometry)

single_trip <- spo_gtfs$trips$trip_id[1]
single_trip

# 'file' argument defaults to c("shapes", "stop_times")
both_geom <- get_trip_geometry(spo_gtfs, trip_id = single_trip)
plot(both_geom["origin_file"])
```

`get_trip_duration()` returns the duration of each trip in a GTFS object, as specified in the `stop_times` file, in the time unit of your choice (either seconds, minutes, hours or days):

```{r}
trip_durtn <- get_trip_duration(spo_gtfs, unit = "s")
head(trip_durtn)

# 'unit' argument defaults to "min"
single_durtn <- get_trip_duration(spo_gtfs, trip_id = single_trip)
single_durtn
```

`get_trip_segment_duration()` is a similar function that takes the same arguments, but returns the duration of each trip *segment* (i.e. the time interval between two consecutive stops).

```{r}
trip_seg_durtn <- get_trip_segment_duration(spo_gtfs, unit = "s")
head(trip_seg_durtn)

single_seg_durtn <- get_trip_segment_duration(spo_gtfs, trip_id = single_trip)
head(single_seg_durtn)
```

The quick example above shows how this function may help you diagnose some problems in your GTFS data: apparently every single trip in `spo_gtfs` is composed of several equally long segments, which looks unreasonable.

Finally, `get_trip_speed()` is a helper around `get_trip_geometry()` and `get_trip_duration()` that returns the average speed of each trip in a GTFS object:

```{r}
trip_speed <- get_trip_speed(spo_gtfs, unit = "m/s")
head(trip_speed)

# 'unit' argument defaults to "km/h"
single_trip_speed <- get_trip_speed(spo_gtfs, trip_id = single_trip)
single_trip_speed
```

## Manipulate feeds

Each table inside a GTFS object can be easily manipulated using the usual `data.table` syntax.
`{data.table}` provides many useful features, such as updating columns by reference, fast binary search and efficient data aggregation, allowing you to deal with large data sets very efficiently. Please check its [official website](https://rdatatable.gitlab.io/data.table/index.html) for more details on syntax and usage.

Just remember that, since every GTFS object is a *`list`* of `data.table`s, you must refer to each table using the `$` operator. For example, this is how you'd remove the `headway_secs` column from the `frequencies` table and add it again afterwards:

```{r}
old_headway_secs <- spo_gtfs$frequencies$headway_secs

spo_gtfs$frequencies[, headway_secs := NULL]
head(spo_gtfs$frequencies)

spo_gtfs$frequencies[, headway_secs := old_headway_secs]
head(spo_gtfs$frequencies)
```

**gtfstools** also provides some functions that help you with common tasks.

`merge_gtfs()` takes multiple GTFS objects and combines them row-wise. By default the function binds every table inside the objects, but you can specify which tables you want to merge with the `files` argument:

```{r}
ggl_path <- file.path(data_path, "ggl_gtfs.zip")
ggl_gtfs <- read_gtfs(ggl_path)

names(spo_gtfs)
names(ggl_gtfs)

merged_gtfs <- merge_gtfs(spo_gtfs, ggl_gtfs)
names(merged_gtfs)

# only merges the 'shapes' and 'trips' tables
merged_files <- merge_gtfs(spo_gtfs, ggl_gtfs, files = c("shapes", "trips"))
names(merged_files)
```

`set_trip_speed()` sets the average speed of specified trips by adjusting the `arrival_time` and `departure_time` columns in the `stop_times` table. Average speed is calculated as the trip's length divided by the difference between the arrival time at the last stop and the departure time at the first stop. Please note that arrival and departure times at intermediate stops are set as `""`.
Some transport routing software, such as [OpenTripPlanner](https://www.opentripplanner.org/) and [R5](https://github.com/conveyal/r5), supports stop times specified this way, in which case it interpolates arrival/departure times at intermediate stops based on the trip's average speed and the Euclidean distance between stops.

```{r}
selected_trips <- c("2002-10-0", "CPTM L07-0")
get_trip_speed(spo_gtfs, selected_trips, unit = "km/h")

# 'speed' is recycled to all trips if only a single value is given
new_speed_gtfs <- set_trip_speed(spo_gtfs, selected_trips, 50)
get_trip_speed(new_speed_gtfs, selected_trips)

# but you can also specify different speeds for each trip
new_speed_gtfs <- set_trip_speed(spo_gtfs, selected_trips, c(30, 40))
get_trip_speed(new_speed_gtfs, selected_trips)
```

## Write feeds

Finally, `write_gtfs()` allows you to save your GTFS objects to disk. It defaults to writing every single table inside the object as a `.txt` file, but you can write only a subset of them with the `files` argument:

```{r}
temp_dir <- file.path(tempdir(), "gttools_vig")
dir.create(temp_dir)
list.files(temp_dir)

filename <- file.path(temp_dir, "spo_gtfs.zip")

write_gtfs(spo_gtfs, filename)
list.files(temp_dir)
zip::zip_list(filename)$filename

write_gtfs(spo_gtfs, filename, files = c("stop_times", "trips", "calendar"))
zip::zip_list(filename)$filename
```

`write_gtfs()` also converts `Date` columns back to integer, producing GTFS files that conform to the official specification.
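
The date conversion described above can be checked with a quick round trip. The chunk below is only a minimal sketch: it assumes the São Paulo sample feed contains a `calendar` table with a `start_date` column (as the GTFS specification defines it) and that base R's `unzip()` can read the archive written by `write_gtfs()`. The object names (`roundtrip_path`, `raw_calendar`, `roundtrip_gtfs`) are illustrative and not part of the package.

```{r}
library(gtfstools)

# read the sample feed shipped with the package
spo_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
spo_gtfs <- read_gtfs(spo_path)

# write the feed to a temporary file (hypothetical name, for illustration only)
roundtrip_path <- tempfile(fileext = ".zip")
write_gtfs(spo_gtfs, roundtrip_path)

# look at the raw text on disk: the date columns should hold plain
# YYYYMMDD values, as required by the specification
extracted <- unzip(roundtrip_path, files = "calendar.txt", exdir = tempdir())
raw_calendar <- readLines(extracted)
head(raw_calendar, 3)

# read the feed back: the same columns are parsed as 'Date' again
roundtrip_gtfs <- read_gtfs(roundtrip_path)
class(roundtrip_gtfs$calendar$start_date)
```

Inspecting the raw file and the re-read object side by side makes it easy to confirm that the `Date` class only exists inside the R session, while the file on disk stays specification-compliant.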