tdunn - Predicting bike ridership: getting the data

R setup

library(tidyverse)
library(httr)
library(lubridate)
library(gt)
library(glue)

library(dunnr)
extrafont::loadfonts(device = "win", quiet = TRUE)
theme_set(theme_td())
set_geom_fonts()
set_palette()

Introduction

In 2016, the city of Halifax installed its first cyclist tracker on Agricola Street. Last year, the city made bike counter data available on their open data platform. As a cyclist and Haligonian, this is of course interesting to me personally. As a data scientist, this seems like a nice opportunity to work through a machine learning project end-to-end: from retrieving, exploring, and processing the data, to building and evaluating models, to producing an end product. (A REST API? A Shiny app? TBD.)

In this post, I get and explore data from two sources: (1) the aforementioned bike counter data from city of Halifax, and (2) historical weather data from the government of Canada.

Getting bicycle count data

The bicycle counts were easy enough to find on Halifax’s platform: (https://catalogue-hrm.opendata.arcgis.com/datasets/45d4ecb0cb48469186e683ebc54eb188_0/explore?showTable=true). Each data set comes with a nice API explorer for constructing queries. I’ll use httr to GET the data with the basic query provided there:

query_url <- "https://services2.arcgis.com/11XBiaBYA9Ep0yNJ/arcgis/rest/services/Bicycle_Counts/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json"
resp <- httr::GET(query_url)
resp

Response [https://services2.arcgis.com/11XBiaBYA9Ep0yNJ/arcgis/rest/services/Bicycle_Counts/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json]
  Date: 2022-04-27 03:26
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 579 kB

The response code (200) indicates a successful connection. The data comes in JSON format, which I can parse to an R list with:

parsed_content <- content(resp, as = "parsed")
str(parsed_content, max.level = 1)

List of 6
 $ objectIdFieldName    : chr "ObjectId"
 $ uniqueIdField        :List of 2
 $ globalIdFieldName    : chr ""
 $ fields               :List of 11
 $ exceededTransferLimit: logi TRUE
 $ features             :List of 2000

This returned a list of 6 items. The fields item is a list of variables:

fields <- map_dfr(
  parsed_content$fields,
  # Drop NULL elements so I can convert to a tibble
  ~ discard(.x, is.null) %>% as_tibble()
)
gt(fields)

name	type	alias	sqlType	length
CHANNEL_ID	esriFieldTypeString	CHANNEL_ID	sqlTypeNVarchar	256
CHANNEL_NAME	esriFieldTypeString	CHANNEL_NAME	sqlTypeNVarchar	4000
SERIAL_NUMBER	esriFieldTypeString	SERIAL_NUMBER	sqlTypeNVarchar	4000
SITE_NAME	esriFieldTypeString	SITE_NAME	sqlTypeNVarchar	4000
LATITUDE	esriFieldTypeDouble	LATITUDE	sqlTypeFloat	NA
LONGITUDE	esriFieldTypeDouble	LONGITUDE	sqlTypeFloat	NA
INSTALLATION_DATE	esriFieldTypeDate	INSTALLATION_DATE	sqlTypeTimestamp2	8
COUNT_DATETIME	esriFieldTypeDate	COUNT_DATETIME	sqlTypeTimestamp2	8
COUNTER_TYPE	esriFieldTypeString	COUNTER_TYPE	sqlTypeNVarchar	4000
COUNTER_VALUE	esriFieldTypeInteger	COUNTER_VALUE	sqlTypeInteger	NA
ObjectId	esriFieldTypeOID	ObjectId	sqlTypeInteger	NA

The data is the features item, which itself is a list of length 2000. Here is the first element:

parsed_content$features[[1]]

$attributes
$attributes$CHANNEL_ID
[1] "100059339"

$attributes$CHANNEL_NAME
[1] "Hollis St"

$attributes$SERIAL_NUMBER
[1] "X2H20032465"

$attributes$SITE_NAME
[1] "Hollis St"

$attributes$LATITUDE
[1] 44.64799

$attributes$LONGITUDE
[1] -63.57352

$attributes$INSTALLATION_DATE
[1] 1.594166e+12

$attributes$COUNT_DATETIME
[1] 1.595966e+12

$attributes$COUNTER_TYPE
[1] "Bicycle"

$attributes$COUNTER_VALUE
[1] 2

$attributes$ObjectId
[1] 1

Looks like there is another level of nesting with attributes. Compile all of these elements into a single data frame:

bike_counts <- map_dfr(
  parsed_content$features,
  ~ as_tibble(.x$attributes)
)
glimpse(bike_counts)

Rows: 2,000
Columns: 11
$ CHANNEL_ID        <chr> "100059339", "100059339", "100059339", "100059339", …
$ CHANNEL_NAME      <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ SERIAL_NUMBER     <chr> "X2H20032465", "X2H20032465", "X2H20032465", "X2H200…
$ SITE_NAME         <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ LATITUDE          <dbl> 44.64799, 44.64799, 44.64799, 44.64799, 44.64799, 44…
$ LONGITUDE         <dbl> -63.57352, -63.57352, -63.57352, -63.57352, -63.5735…
$ INSTALLATION_DATE <dbl> 1.594166e+12, 1.594166e+12, 1.594166e+12, 1.594166e+…
$ COUNT_DATETIME    <dbl> 1.595966e+12, 1.595970e+12, 1.595974e+12, 1.595977e+…
$ COUNTER_TYPE      <chr> "Bicycle", "Bicycle", "Bicycle", "Bicycle", "Bicycle…
$ COUNTER_VALUE     <int> 2, 1, 2, 2, 0, 0, 0, 0, 0, 0, 6, 4, 13, 8, 5, 6, 8, …
$ ObjectId          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…

Note that just 2000 records were returned. The exceededTransferLimit = TRUE value tells us that this is the limit of the API. I can get the total count of records by altering the original query slightly:

n_records <- httr::GET(paste0(query_url, "&returnCountOnly=true")) %>%
  content(as = "parsed") %>%
  unlist(use.names = FALSE)
n_records

[1] 124356

So to get all of the data at 2000 records per request, I’ll have to make a minimum of 63 calls to the API. The API offers a “resultOffset” argument to get records in sequence. Make a function to get 2000 records for a given offset:

get_bike_data <- function(offset) {
  # Need to prevent scientific notation, e.g. "1e+05" instead of "100000"
  offset <- format(offset, scientific = FALSE)
  
  parsed_content <- httr::GET(paste0(query_url, "&resultOffset=", offset)) %>%
    content(as = "parsed")
  
  map_dfr(
    parsed_content$features,
    ~ as_tibble(.x$attributes)
  ) 
}

And combine it all into a single data frame:

bike_data <- map_dfr(
  seq(0, ceiling(n_records / 2000)),
  ~ get_bike_data(offset = .x * 2000)
)

glimpse(bike_data)

Rows: 124,356
Columns: 11
$ CHANNEL_ID        <chr> "100059339", "100059339", "100059339", "100059339", …
$ CHANNEL_NAME      <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ SERIAL_NUMBER     <chr> "X2H20032465", "X2H20032465", "X2H20032465", "X2H200…
$ SITE_NAME         <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ LATITUDE          <dbl> 44.64799, 44.64799, 44.64799, 44.64799, 44.64799, 44…
$ LONGITUDE         <dbl> -63.57352, -63.57352, -63.57352, -63.57352, -63.5735…
$ INSTALLATION_DATE <dbl> 1.594166e+12, 1.594166e+12, 1.594166e+12, 1.594166e+…
$ COUNT_DATETIME    <dbl> 1.595966e+12, 1.595970e+12, 1.595974e+12, 1.595977e+…
$ COUNTER_TYPE      <chr> "Bicycle", "Bicycle", "Bicycle", "Bicycle", "Bicycle…
$ COUNTER_VALUE     <int> 2, 1, 2, 2, 0, 0, 0, 0, 0, 0, 6, 4, 13, 8, 5, 6, 8, …
$ ObjectId          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…

This returned 124356 records, as expected. The ObjectId should be a unique sequential identifier from 1 to 124356, which I’ll check:

range(bike_data$ObjectId); n_distinct(bike_data$ObjectId)

[1]      1 124356

[1] 124356

EDA and cleaning

First thing I usually do with a new data set is clean the column names:

bike_data <- janitor::clean_names(bike_data)
glimpse(bike_data)

Rows: 124,356
Columns: 11
$ channel_id        <chr> "100059339", "100059339", "100059339", "100059339", …
$ channel_name      <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ serial_number     <chr> "X2H20032465", "X2H20032465", "X2H20032465", "X2H200…
$ site_name         <chr> "Hollis St", "Hollis St", "Hollis St", "Hollis St", …
$ latitude          <dbl> 44.64799, 44.64799, 44.64799, 44.64799, 44.64799, 44…
$ longitude         <dbl> -63.57352, -63.57352, -63.57352, -63.57352, -63.5735…
$ installation_date <dbl> 1.594166e+12, 1.594166e+12, 1.594166e+12, 1.594166e+…
$ count_datetime    <dbl> 1.595966e+12, 1.595970e+12, 1.595974e+12, 1.595977e+…
$ counter_type      <chr> "Bicycle", "Bicycle", "Bicycle", "Bicycle", "Bicycle…
$ counter_value     <int> 2, 1, 2, 2, 0, 0, 0, 0, 0, 0, 6, 4, 13, 8, 5, 6, 8, …
$ object_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…

Next I want to deal with the installation_date and count_datetime variables, which are very large integers. From fields above, the data type for these variables is esriFieldTypeDate. After some digging on Google, turns out this is Unix time (the number of milliseconds since January 1, 1970; also called epoch time). With as.POSIXct(), I can supply the number of seconds and set origin = "1970-01-01" to get back the correct datetime objects:

bike_data <- bike_data %>%
  mutate(
    across(c(installation_date, count_datetime),
           ~ as.POSIXct(.x / 1000, tz = "UTC", origin = "1970-01-01")),
    # These are just dates, the time of day doesn't matter
    installation_date = as.Date(installation_date),
    # I'll also want the date without time of day
    count_date = as.Date(count_datetime)
  )

These variables are unique to the sites:

bike_data %>%
  count(site_name, latitude, longitude, serial_number, installation_date,
        counter_type, name = "n_records") %>%
  gt()

site_name	latitude	longitude	serial_number	installation_date	counter_type	n_records
Dartmouth Harbourfront Greenway	44.66436	-63.55736	X2H20114473	2021-07-08	Bicycle	13976
Hollis St	44.64799	-63.57352	X2H20032465	2020-07-08	Bicycle	15748
South Park St	44.64194	-63.57972	X2H19070467	2019-09-01	Bicycle	46424
Vernon St	44.64292	-63.59154	X2H20114470	2020-12-09	Bicycle	24104
Windsor St	44.65466	-63.60368	X2H20114472	2020-12-09	Bicycle	24104

Drop serial_number and counter_type, which aren’t useful.

bike_data <- bike_data %>% select(-serial_number, -counter_type)

Sites can have multiple channels:

bike_data %>%
  count(site_name, channel_name, channel_id, name = "n_records") %>%
  gt()

site_name	channel_name	channel_id	n_records
Dartmouth Harbourfront Greenway	Dartmouth Harbourfront Greenway Northbound	353280085	6988
Dartmouth Harbourfront Greenway	Dartmouth Harbourfront Greenway Southbound	353280086	6988
Hollis St	Hollis St	100059339	15748
South Park St	South Park St Northbound	101054257	23212
South Park St	South Park St Southbound	102054257	23212
Vernon St	Vernon St Northbound	353252897	12052
Vernon St	Vernon St Southbound	353252898	12052
Windsor St	Windsor St Northbound	353252910	12052
Windsor St	Windsor St Southbound	353252909	12052

All but the Hollis St site has separate northbound and southbound channels.

For each site, check the installation_date relative to the range of count_date:

bike_data %>%
  group_by(site_name, installation_date) %>%
  summarise(min_count_date = min(count_date), max_count_date = max(count_date),
            .groups = "drop") %>%
  gt()

site_name	installation_date	min_count_date	max_count_date
Dartmouth Harbourfront Greenway	2021-07-08	2021-07-08	2022-04-25
Hollis St	2020-07-08	2020-07-08	2022-04-25
South Park St	2019-09-01	2019-09-01	2022-04-25
Vernon St	2020-12-09	2020-12-09	2022-04-25
Windsor St	2020-12-09	2020-12-09	2022-04-25

Everything is nicely aligned: the first data corresponds to the installation date, and the last data corresponds to the date the data were retrieved.

Plot the position of each of the counters using the given latitude and longitude, overlaid on a map of Halifax with the ggmap package:¹

library(ggmap)
site_locs <- bike_data %>%
  distinct(site_name, lat = latitude, lon = longitude)

mean_lat <- mean(site_locs$lat)
mean_lon <- mean(site_locs$lon)

halifax_map <- get_googlemap(c(mean_lon, mean_lat),
                             zoom = 14, maptype = "satellite")

ggmap(halifax_map) +
  geom_point(data = site_locs, size = 4,
             aes(fill = site_name), shape = 21, color = "white") +
  ggrepel::geom_label_repel(
    data = site_locs,
    aes(color = site_name, label = str_trunc(site_name, width = 25)),
    box.padding = 1.0
  ) +
  theme_void() +
  theme(legend.position = "none")

For each site and channel, get the time of day from count_datetime to determine the frequency of collection:

bike_data %>%
  mutate(time_of_day = format(count_datetime, "%H:%M:%S")) %>%
  count(site_name, time_of_day, name = "n_records") %>%
  ggplot(aes(y = time_of_day, x = n_records)) +
  geom_col() +
  facet_wrap(~ str_trunc(site_name, 15), nrow = 1) +
  scale_x_continuous(expand = c(0, 0), breaks = c(0, 500, 1000, 1500)) +
  dunnr::add_facet_borders()

Each counter reports observations at the hour mark (+1 second). There are some slight difference in the number of records due to the time of day I retrieved the data.

I made an assumption that the count_datetime variable was in UTC timezone. I can check this assumption by looking at average counts (over the entire data set).

bike_data_tod <- bike_data %>%
  mutate(
    time_of_day = format(count_datetime, "%H:%M:%S"),
    # Create a dummy variable with arbitrary date so I can plot time of day
    time_of_day = lubridate::ymd_hms(
      paste0("2022-04-22 ", time_of_day)
    )
  )
bike_data_tod %>%
  group_by(site_name, time_of_day) %>%
  summarise(
    n = n(), mean_count = mean(counter_value),
    .groups = "drop"
  ) %>%
  ggplot(aes(x = time_of_day, y = mean_count)) +
  geom_area(fill = td_colors$nice$mellow_yellow, color = "black") +
  facet_wrap(~ site_name, ncol = 1, scales = "free_y") +
  scale_x_datetime(date_breaks = "2 hours", date_labels = "%H") +
  scale_y_continuous(expand = c(0, 0)) +
  dunnr::add_facet_borders()

These peaks at around 8AM and 5PM tell me that the data is actually recorded in the local time zone (Atlantic), not UTC like I assumed. If they were in UTC time, the peaks would correspond to 11AM and 8PM locally, which would be odd times for peak cyclists.

Any interesting trends in different channels?

bike_data_tod %>%
  # Remove Hollis St, which does not have different channels
  filter(site_name != "Hollis St") %>%
  mutate(channel_direction = str_extract(channel_name, "(North|South)bound")) %>%
  group_by(site_name, channel_direction, time_of_day) %>%
  summarise(mean_count = mean(counter_value), .groups = "drop") %>%
  ggplot(aes(x = time_of_day, y = mean_count, color = channel_direction)) +
  geom_line() +
  facet_wrap(~ site_name, ncol = 1, scales = "free_y") +
  scale_x_datetime(date_breaks = "2 hours", date_labels = "%H") +
  theme(legend.position = "top")

Vernon St and Windsor St counters have higher traffic Southbound (heading downtown) at the start of the typical workday, and higher traffic Northbound (leaving downtown) at the end of the typical workday.

I am less interested in counts over the course of a day or by channel, and more interested in daily counts. Now that I know the count_date is correctly converted with the local time, get the sum at each site and each 24 hour day:

bike_data_daily_counts <- bike_data %>%
  group_by(site_name, installation_date, count_date) %>%
  summarise(
    n_records = n(), n_bikes = sum(counter_value), .groups = "drop"
  )

Now plot counts per day at each site:

bike_data_daily_counts %>%
  ggplot(aes(x = count_date, y = n_bikes)) +
  geom_line() +
  facet_wrap(~ site_name, ncol = 1, scales = "free_y") +
  dunnr::add_facet_borders()

The seasonal trends are very obvious from this plot. One thing that stood out to me is the big increase from 2020 to 2021 on South Park St. It may be representative of the start of the COVID pandemic, but I think it also has to do with the addition of protected bike lanes in December 2020. Before 2020, there appears to be a series of 0 counts on South Park St which may be artifacts:

bike_data_daily_counts %>%
  filter(site_name == "South Park St", count_date < "2020-01-01") %>%
  ggplot(aes(x = count_date, y = n_bikes)) +
  geom_line()

I’m almost certain this series of zeroes is not real, so I’ll remove it from the data. Find the date of the first non-zero n_bikes at this site, and filter out data before then:

south_park_min_date <- bike_data_daily_counts %>%
  filter(site_name == "South Park St", n_bikes > 0) %>%
  pull(count_date) %>%
  min()
south_park_min_date

[1] "2019-11-23"

bike_data_daily_counts <- bike_data_daily_counts %>%
  filter(!((site_name == "South Park St") & (count_date < south_park_min_date)))

Overlay counts by year for each site:

bike_data_daily_counts %>%
  mutate(count_year = year(count_date),
         # Replace year with 1970 so I can plot on the same scale
         count_date = as.Date(yday(count_date), origin = "1970-01-01")) %>%
  ggplot(aes(x = count_date, y = n_bikes, color = factor(count_year))) +
  geom_line(size = 1, alpha = 0.8) +
  facet_wrap(~ site_name, ncol = 1, scales = "free_y") +
  scale_x_date(date_labels = "%B") +
  dunnr::add_facet_borders() +
  theme(legend.position = "bottom") +
  labs(x = NULL, color = "Year") +
  scale_color_brewer(palette = "Set1")

I’m interested in day of the week effects as well:

bike_data_daily_counts %>%
  mutate(day_of_week = wday(count_date, label = TRUE)) %>%
  ggplot(aes(y = day_of_week, x = n_bikes)) +
  geom_boxplot()

Less activity on the weekends.

Getting weather data

Temporal data are probably the most important predictors of ridership, but I’m sure a close second is the day’s weather. I’ll get this with the API provided by the Meteorological Service of Canada. I can get a list of available data sets (which they call collections) as follows:²

base_url <- "https://api.weather.gc.ca/"
resp <- httr::GET(paste0(base_url, "collections?f=json"))

content_parsed <- content(resp, as = "parsed")
str(content_parsed, max.level = 1)

List of 2
 $ collections:List of 70
 $ links      :List of 3

The first element of collections:

collections <- content_parsed$collections
str(collections[[1]], max.level = 2)

List of 7
 $ id         : chr "hydrometric-stations"
 $ title      : chr "Hydrometric Monitoring Stations"
 $ description: chr "A station is a site on a river or lake where water quantity (water level and flow) are collected and recorded."
 $ keywords   :List of 2
  ..$ : chr "station"
  ..$ : chr "hydrometric station"
 $ links      :List of 13
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 5
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
  ..$ :List of 4
 $ extent     :List of 2
  ..$ spatial :List of 2
  ..$ temporal:List of 1
 $ itemType   : chr "feature"

Unlike the bicycle counts data, this nested format doesn’t lend itself well to direct conversion to a tibble:

as_tibble(collections[[1]])

Error:
! Tibble columns must have compatible sizes.
• Size 2: Columns `keywords` and `extent`.
• Size 13: Column `links`.
ℹ Only values of size one are recycled.

Instead, I can use enframe() to get a two-column data frame:

enframe(collections[[1]])

# A tibble: 7 × 2
  name        value           
  <chr>       <list>          
1 id          <chr [1]>       
2 title       <chr [1]>       
3 description <chr [1]>       
4 keywords    <list [2]>      
5 links       <list [13]>     
6 extent      <named list [2]>
7 itemType    <chr [1]>

Assuming every item in the collections list has the same structure, I’ll just extract the id, title, and description:

collections_df <-  map_dfr(
  collections,
  ~ enframe(.x) %>%
    filter(name %in% c("id", "title", "description")) %>%
    pivot_wider(names_from = name, values_from = value)
)
gt(collections_df) %>%
  tab_options(container.height = 300, container.overflow.y = TRUE)

id	title	description
hydrometric-stations	Hydrometric Monitoring Stations	A station is a site on a river or lake where water quantity (water level and flow) are collected and recorded.
hydrometric-daily-mean	Daily Mean of Water Level or Flow	The daily mean is the average of all unit values for a given day.
hydrometric-monthly-mean	Monthly Mean of Water Level or Flow	The monthly mean is the average of daily mean values for a given month.
hydrometric-annual-statistics	Annual Maximum and Minimum Daily Water Level or Flow	The annual maximum and minimum daily data are the maximum and minimum daily mean values for a given year.
hydrometric-annual-peaks	Annual Maximum and Minimum Instantaneous Water Level or Flow	The annual maximum and minimum instantaneous data are the maximum and minimum instantaneous values for a given year.
hydrometric-realtime	Real-time hydrometric data	Real-time water level and flow (discharge) data collected at over 2100 hydrometric stations across Canada (last 30 days).
climate-normals	1981-2010 Climate Normals	Climate Normals are used to summarize or describe the average climatic conditions of a particular location. At the completion of each decade, Environment and Climate Change Canada updates its climate normals for as many locations and as many climatic characteristics as possible. The climate normals offered here are based on Canadian climate stations with at least 15 years of data between 1981 to 2010.
climate-stations	Climate Stations	Climate observations are derived from two sources of data. The first are Daily Climate Stations producing one or two observations per day of temperature, precipitation. The second are hourly stations that typically produce more weather elements e.g. wind or snow on ground.
climate-monthly	Monthly Climate Observation Summaries	A cross-country summary of the averages and extremes for the month, including precipitation totals, max-min temperatures, and degree days. This data is available from stations that produce daily data.
climate-daily	Daily Climate Observations	Daily climate observations are derived from two sources of data. The first are Daily Climate Stations producing one or two observations per day of temperature, precipitation. The second are hourly stations that typically produce more weather elements e.g. wind or snow on ground. Only a subset of the total stations is shown due to size limitations. The criteria for station selection are listed as below. The priorities for inclusion are as follows: (1) Station is currently operational, (2) Stations with long periods of record, (3) Stations that are co-located with the categories above and supplement the period of record
climate-hourly	Hourly Climate Observations	Hourly climate observations are derived from the data source HLY01. These are stations that produce hourly meteorological observations, taken each hour of the day for the hours 00h-23h, for both daily (temperature, precipitation) and non-daily elements (station pressure, relative humidity, visibility). Only a subset of the total stations are shown due to size limitations. The stations were selected based on the following criteria: (1) Stations near cities with populations greater than 10,000, (2) Stations with long periods of record of at least 30 years of data, (3) Stations selected for the 1991-2020 WMO Normals, (4) Stations with RBCN designation not selected for 1991-2020 WMO Normals, and/or (5) Stations whose sum when joined with co-located/successor stations equates to a period of record of 30 years or greater.
ahccd-stations	Adjusted and Homogenized Canadian Climate Data (AHCCD) Stations	Climate station datasets that incorporate adjustments (derived from statistical procedures) to the original historical station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation.
ahccd-annual	Adjusted and Homogenized Canadian Climate Data (AHCCD) Annual	Adjusted and Homogenized Canadian Climate Data (AHCCD) are climate station datasets that incorporate adjustments (derived from statistical procedures) to the original historical station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. Data are provided for temperature, precipitation, pressure and wind speed. Station trend data are provided when available. Trends are calculated using the Theil-Sen method using the station's full period of available data. The availability of trends will vary by station; if more than 5 consecutive years are missing data or more than 10% of the data within the time series is missing, a trend was not calculated.
ahccd-seasonal	Adjusted and Homogenized Canadian Climate Data (AHCCD) Seasonal	Adjusted and Homogenized Canadian Climate Data (AHCCD) are climate station datasets that incorporate adjustments (derived from statistical procedures) to the original historical station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. Data are provided for temperature, precipitation, pressure and wind speed. Station trend data are provided when available. Trends are calculated using the Theil-Sen method using the station's full period of available data. The availability of trends will vary by station; if more than 5 consecutive years are missing data or more than 10% of the data within the time series is missing, a trend was not calculated.
ahccd-monthly	Adjusted and Homogenized Canadian Climate Data (AHCCD) Monthly	Adjusted and Homogenized Canadian Climate Data (AHCCD) are climate station datasets that incorporate adjustments (derived from statistical procedures) to the original historical station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. Data are provided for temperature, precipitation, pressure and wind speed. Station trend data are provided when available. Trends are calculated using the Theil-Sen method using the station's full period of available data. The availability of trends will vary by station; if more than 5 consecutive years are missing data or more than 10% of the data within the time series is missing, a trend was not calculated.
ahccd-trends	Adjusted and Homogenized Canadian Climate Data (AHCCD) Trends	Adjusted and Homogenized Canadian Climate Data (AHCCD) are climate station datasets that incorporate adjustments (derived from statistical procedures) to the original historical station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. Data are provided for temperature, precipitation, pressure and wind speed. Station trend data are provided when available. Trends are calculated using the Theil-Sen method using the station's full period of available data. The availability of trends will vary by station; if more than 5 consecutive years are missing data or more than 10% of the data within the time series is missing, a trend was not calculated.
swob-realtime	Surface Weather Observations	Surface Observations measured at the automatic and manual stations of the Environment and Climate Change Canada and partners networks, either for a single station, or for the stations of specific provinces and territories (last 30 days)
ltce-stations	Virtual Climate Stations (LTCE)	A Virtual Climate station is the result of threading together climate data from proximate current and historical stations to construct a long term threaded data set. For the purpose of identifying and tabulating daily extremes of record for temperature, precipitation and snowfall, the Meteorological Service of Canada has threaded or put together data from closely related stations to compile a long time series of data for about 750 locations in Canada to monitor for record-breaking weather. The length of the time series of virtual stations is often greater than 100 years. A Virtual Climate station is always named for an “Area” rather than a point, e.g. Winnipeg Area, to indicate that the data are drawn from that area(within a 20km radius from the urban center) rather than a single precise location.
ltce-temperature	Daily Extremes of Records (LTCE) – Temperature	Anomalous weather resulting in Temperature and Precipitation extremes occurs almost every day somewhere in Canada. For the purpose of identifying and tabulating daily extremes of record for temperature, precipitation and snowfall, the Meteorological Service of Canada has threaded or put together data from closely related stations to compile a long time series of data for about 750 locations in Canada to monitor for record-breaking weather. Virtual Climate stations correspond with the city pages of weather.gc.ca. This data provides the daily extremes of record for Temperature for each day of the year. Daily elements include: High Maximum, Low Maximum, High Minimum, Low Minimum.
ltce-precipitation	Daily Extremes of Records (LTCE) – Precipitation	Anomalous weather resulting in Temperature and Precipitation extremes occurs almost every day somewhere in Canada. For the purpose of identifying and tabulating daily extremes of record for temperature, precipitation and snowfall, the Meteorological Service of Canada has threaded or put together data from closely related stations to compile a long time series of data for about 750 locations in Canada to monitor for record-breaking weather. Virtual Climate stations correspond with the city pages of weather.gc.ca. This data provides the daily extremes of record for Precipitation for each day of the year. Daily elements include: Greatest Precipitation.
ltce-snowfall	Daily Extremes of Records (LTCE) – Snowfall	Anomalous weather resulting in Temperature and Precipitation extremes occurs almost every day somewhere in Canada. For the purpose of identifying and tabulating daily extremes of record for temperature, precipitation and snowfall, the Meteorological Service of Canada has threaded or put together data from closely related stations to compile a long time series of data for about 750 locations in Canada to monitor for record-breaking weather. Virtual Climate stations correspond with the city pages of weather.gc.ca. This data provides the daily extremes of record for Snowfall for each day of the year. Daily elements include: Greatest Snowfall.
aqhi-forecasts-realtime	Air Quality Health Index – Forecasts	The Air Quality Health Index (AQHI) is a scale designed to help quantify the quality of the air in a certain region on a scale from 1 to 10. When the amount of air pollution is very high, the number is reported as 10+. It also includes a category that describes the health risk associated with the index reading (e.g. Low, Moderate, High, or Very High Health Risk). The AQHI is calculated based on the relative risks of a combination of common air pollutants that are known to harm human health, including ground-level ozone, particulate matter, and nitrogen dioxide. The AQHI formulation captures only the short term or acute health risk (exposure of hour or days at a maximum). The formulation of the AQHI may change over time to reflect new understanding associated with air pollution health effects. The AQHI is calculated from data observed in real time, without being verified (quality control).
aqhi-observations-realtime	Air Quality Health Index – Observations	The Air Quality Health Index (AQHI) is a scale designed to help quantify the quality of the air in a certain region on a scale from 1 to 10. When the amount of air pollution is very high, the number is reported as 10+. It also includes a category that describes the health risk associated with the index reading (e.g. Low, Moderate, High, or Very High Health Risk). The AQHI is calculated based on the relative risks of a combination of common air pollutants that are known to harm human health, including ground-level ozone, particulate matter, and nitrogen dioxide. The AQHI formulation captures only the short term or acute health risk (exposure of hour or days at a maximum). The formulation of the AQHI may change over time to reflect new understanding associated with air pollution health effects. The AQHI is calculated from data observed in real time, without being verified (quality control).
bulletins-realtime	Real-time meteorological bulletins	Real-time meteorological bulletins (last 140 days)
climate:cmip5:projected:annual:anomaly	Projected annual anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:seasonal:anomaly	Projected seasonal anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:monthly:anomaly	Projected monthly anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:annual:absolute	Projected annual CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:seasonal:absolute	Projected seasonal CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:monthly:absolute	Projected monthly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:annual:P20Y-Avg	Projected annual anomaly for 20 years average CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:projected:seasonal:P20Y-Avg	Projected seasonal anomaly for 20 years average CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:annual:absolute	Historical annual CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:annual:anomaly	Historical annual anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:seasonal:absolute	Historical seasonal CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:seasonal:anomaly	Historical seasonal anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:monthly:absolute	Historical monthly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:cmip5:historical:monthly:anomaly	Historical monthly anomaly CMIP5	The Global climate model scenarios dataset is based on an ensemble of global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). Multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios at a 1x1 degree grid resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. A range of percentiles across the multi-model ensembles are available for download.
climate:dcs:projected:annual:anomaly	Projected annual anomaly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:seasonal:anomaly	Projected seasonal anomaly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:annual:absolute	Projected annual DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:seasonal:absolute	Projected seasonal DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:monthly:absolute	Projected monthly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:annual:P20Y-Avg	Projected annual anomaly for 20 years average DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:projected:seasonal:P20Y-Avg	Projected seasonal anomaly for 20 years average DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:historical:annual:absolute	Historical annual DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:historical:annual:anomaly	Historical annual anomaly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:historical:seasonal:absolute	Historical seasonal DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:historical:seasonal:anomaly	Historical seasonal anomaly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:dcs:historical:monthly:absolute	Historical monthly DCS	The statistically downscaled climate scenarios dataset provides multi-model ensembles of modelled output (actual value) and projected change (anomaly) are available for historical simulations and three emission scenarios, RCP2.6, RCP4.5, RCP8.5, at a 10km resolution. Projected changes are expressed as anomalies with respect to the reference period of 1986-2005. Downscaled data are based on global climate model projections from the Coupled Model Intercomparison Project Phase 5 (CMIP5). A range of percentiles across the multi-model ensemble are available for download.
climate:indices:historical	Historical indices	High-resolution statistically downscaled climate indices relevant to climate change impacts in Canada are available at a 10 km spatial resolution and an annual temporal resolution for 1951-2100. The climate indices are based on model projections from 24 global climate models (GCMs) that participated in the Coupled Model Intercomparison Project Phase 5 (CMIP5).
climate:indices:projected	Projected indices	High-resolution statistically downscaled climate indices relevant to climate change impacts in Canada are available at a 10 km spatial resolution and an annual temporal resolution for 1951-2100. The climate indices are based on model projections from 24 global climate models (GCMs) that participated in the Coupled Model Intercomparison Project Phase 5 (CMIP5).
climate:spei-1:historical	Historical SPEI-1	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:spei-3:historical	Historical SPEI-3	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:spei-12:historical	Historical SPEI-12	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:spei-1:projected	Projected SPEI-1	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:spei-3:projected	Projected SPEI-3	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:spei-12:projected	Projected SPEI-12	The Standardized Precipitation Evapotranspiration Index (SPEI) is computed similarly to the SPI. The main difference is that SPI assesses precipitation variance, while SPEI also considers demand from evapotranspiration which is subtracted from any precipitation accumulation prior to assessment. Unlike the SPI, the SPEI captures the main impact of increased temperatures on water demand.
climate:cangrd:historical:annual:trend	CanGRD historical annual trend	CANGRD data are interpolated from adjusted and homogenized climate station data (i.e., AHCCD datasets). Homogenized climate data incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. Annual trends of relative total precipitation change (%) for 1948-2012 based on Canadian gridded data (CANGRD) are available, at a 50km resolution across Canada. The relative trends reflect the percent change in total precipitation over a period from the baseline value (defined as the average over 1961-1990 as the reference period). Annual trends of mean surface air temperature change (degrees Celsius) for 1948-2016 based on Canadian gridded data (CANGRD) are available at a 50km resolution across Canada. Temperature trends represent the departure from a mean reference period (1961-1990).
climate:cangrd:historical:annual:anomaly	CanGRD historical annual anomaly	Gridded annual mean temperature anomalies derived from daily minimum, maximum and mean surface air temperatures (degrees Celsius) and anomalies derived from daily total precipitation is available at a 50km resolution across Canada. The Canadian gridded data (CANGRD) are interpolated from homogenized temperature (i.e., AHCCD datasets). Homogenized temperatures incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the difference between the temperature for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal temperature anomalies were computed for the years 1948 to 2017. The data will continue to be updated every year. For precipitation, the Canadian gridded data (CANGRD) are interpolated from adjusted precipitation (i.e., AHCCD datasets). Adjusted precipitation data incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the percentage difference between the value for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal relative precipitation anomalies were computed for the years 1948 to 2014. The data will be updated as time permits.
climate:cangrd:historical:monthly:anomaly	CanGRD historical monthly anomaly	Gridded monthly mean temperature anomalies derived from daily minimum, maximum and mean surface air temperatures (degrees Celsius) and anomalies derived from daily total precipitation is available at a 50km resolution across Canada. The Canadian gridded data (CANGRD) are interpolated from homogenized temperature (i.e., AHCCD datasets). Homogenized temperatures incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the difference between the temperature for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal temperature anomalies were computed for the years 1948 to 2017. The data will continue to be updated every year. For precipitation, the Canadian gridded data (CANGRD) are interpolated from adjusted precipitation (i.e., AHCCD datasets). Adjusted precipitation data incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the percentage difference between the value for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal relative precipitation anomalies were computed for the years 1948 to 2014. The data will be updated as time permits.
climate:cangrd:historical:seasonal:trend	CanGRD historical seasonal trend	CANGRD data are interpolated from adjusted and homogenized climate station data (i.e., AHCCD datasets). Homogenized climate data incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation.Seasonal trends of relative total precipitation change (%) for 1948-2012 based on Canadian gridded data (CANGRD) are available, at a 50km resolution across Canada. The relative trends reflect the percent change in total precipitation over a period from the baseline value (defined as the average over 1961-1990 as the reference period). Seasonal trends of mean surface air temperature change (degrees Celsius) for 1948-2016 based on Canadian gridded data (CANGRD) are available at a 50km resolution across Canada. Temperature trends represent the departure from a mean reference period (1961-1990).
climate:cangrd:historical:seasonal:anomaly	CanGRD historical seasonal anomaly	Gridded seasonal mean temperature anomalies derived from daily minimum, maximum and mean surface air temperatures (degrees Celsius) and anomalies derived from daily total precipitation is available at a 50km resolution across Canada. The Canadian gridded data (CANGRD) are interpolated from homogenized temperature (i.e., AHCCD datasets). Homogenized temperatures incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the difference between the temperature for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal temperature anomalies were computed for the years 1948 to 2017. The data will continue to be updated every year. For precipitation, the Canadian gridded data (CANGRD) are interpolated from adjusted precipitation (i.e., AHCCD datasets). Adjusted precipitation data incorporate adjustments to the original station data to account for discontinuities from non-climatic factors, such as instrument changes or station relocation. The anomalies are the percentage difference between the value for a given year or season and a baseline value (defined as the average over 1961-1990 as the reference period). The yearly and seasonal relative precipitation anomalies were computed for the years 1948 to 2014. The data will be updated as time permits.
weather:rdpa:15km:24f	Regional Deterministic Precipitation Analysis (RDPA) 24 hours accumulation at 15km	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 24 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 15 km. Data is only available for the surface level. Analysis data is made available once a day for the 24h intervals. A preliminary estimate is available approximately 1h after the end of the accumulation period, and revised 6h after in order to assimilate gauge data arriving later.
weather:rdpa:15km:6f	Regional Deterministic Precipitation Analysis (RDPA) 6 hours accumulation at 15 km	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 6 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 15 km. Data is only available for the surface level. Analysis data is made available four times a day for the 6h intervals. A preliminary estimate is available approximately 1h after the end of the accumulation period, and revised 6h after in order to assimilate gauge data arriving later.
weather:rdpa:10km:24f	Regional Deterministic Precipitation Analysis (RDPA) 24 hours accumulation	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 24 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 10 km. Data is only available for the surface level. Analysis data is made available once a day for the 24h intervals. A preliminary estimate is available approximately 1h after the end of the accumulation period, and revised 6h after in order to assimilate gauge data arriving later.
weather:rdpa:10km:6f	Regional Deterministic Precipitation Analysis (RDPA) 6 hours accumulation	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 6 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 10 km. Data is only available for the surface level. Analysis data is made available four times a day for the 6h intervals. A preliminary estimate is available approximately 1h after the end of the accumulation period, and revised 6h after in order to assimilate gauge data arriving later.
weather:rdpa:10km:24p	Regional Deterministic Precipitation Analysis (RDPA) 24 hours accumulation (preliminary)	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 24 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 10 km. Data is only available for the surface level. Analysis data is made available once a day for the 24h intervals. The preliminary estimate is available approximately 1h after the end of the accumulation period.
weather:rdpa:10km:6p	Regional Deterministic Precipitation Analysis (RDPA) 6 hours accumulation (preliminary)	The Regional Deterministic Precipitation Analysis (RDPA) produces a best estimate of the amount of precipitation that occurred over recent past periods of 6 hours. The estimate integrates data from in situ precipitation gauge measurements, weather radar and numerical weather prediction models. Geographic coverage is North America (Canada, United States and Mexico). Data is available at horizontal resolution of 10 km. Data is only available for the surface level. Analysis data is made available four times a day for 6h intervals. The preliminary estimate is available approximately 1h after the end of the accumulation period.
weather:cansips:250km:forecast:members	Canadian Seasonal to Inter-annual Prediction System	The Canadian Seasonal to Inter-annual Prediction System (CanSIPS) carries out physics calculations to arrive at probabilistic predictions of atmospheric elements from the beginning of a month out to up to 12 months into the future. Atmospheric elements include temperature, precipitation, wind speed and direction and others. This product contains raw numerical results of these calculations. Geographical coverage is global. Data is available on a grid at a horizontal resolution of 2.5 degrees and for a few selected vertical levels. Predictions and corresponding hindcast are made available monthly.

The collections I want for this project are climate-stations (to find the appropriate Halifax station) and climate-daily to get daily measurements at that station. Get climate-stations:

resp <- httr::GET(paste0(base_url, "collections/climate-stations/items?f=json"))

content_parsed <- content(resp, as = "parsed")
str(content_parsed, max.level = 1)

List of 6
 $ type          : chr "FeatureCollection"
 $ features      :List of 500
 $ numberMatched : int 8552
 $ numberReturned: int 500
 $ links         :List of 5
 $ timeStamp     : chr "2022-04-27T03:27:26.475648Z"

Before looking closer at the data, I can already tell I’ll want to increase the limit of returned entries. From the API documentation, the maximum number is 10000, so I can get all 8552 records in one API call:

resp <- GET(paste0(base_url,
                   "collections/climate-stations/items?f=json&limit=10000"))

content_parsed <- content(resp, as = "parsed")
str(content_parsed, max.level = 1)

List of 6
 $ type          : chr "FeatureCollection"
 $ features      :List of 8552
 $ numberMatched : int 8552
 $ numberReturned: int 8552
 $ links         :List of 4
 $ timeStamp     : chr "2022-04-27T03:27:27.265491Z"

The data is contained in the features list:

climate_stations <- content_parsed$features
str(climate_stations[[1]], max.level = 3)

List of 4
 $ type      : chr "Feature"
 $ properties:List of 32
  ..$ STN_ID                  : int 8496
  ..$ STATION_NAME            : chr "CARLETON SUR MER"
  ..$ PROV_STATE_TERR_CODE    : chr "QC"
  ..$ ENG_PROV_NAME           : chr "QUEBEC"
  ..$ FRE_PROV_NAME           : chr "QUÉBEC"
  ..$ COUNTRY                 : chr "CAN"
  ..$ LATITUDE                : int 480800000
  ..$ LONGITUDE               : int -660700000
  ..$ TIMEZONE                : chr "EST"
  ..$ ELEVATION               : chr "541.00"
  ..$ CLIMATE_IDENTIFIER      : chr "705AA86"
  ..$ TC_IDENTIFIER           : NULL
  ..$ WMO_IDENTIFIER          : NULL
  ..$ STATION_TYPE            : chr "N/A"
  ..$ NORMAL_CODE             : NULL
  ..$ PUBLICATION_CODE        : int 1
  ..$ DISPLAY_CODE            : int 9
  ..$ ENG_STN_OPERATOR_ACRONYM: NULL
  ..$ FRE_STN_OPERATOR_ACRONYM: NULL
  ..$ ENG_STN_OPERATOR_NAME   : NULL
  ..$ FRE_STN_OPERATOR_NAME   : NULL
  ..$ FIRST_DATE              : chr "1968-10-01 00:00:00"
  ..$ LAST_DATE               : chr "1968-10-31 00:00:00"
  ..$ HLY_FIRST_DATE          : NULL
  ..$ HLY_LAST_DATE           : NULL
  ..$ DLY_FIRST_DATE          : chr "1968-10-01 00:00:00"
  ..$ DLY_LAST_DATE           : chr "1968-10-31 00:00:00"
  ..$ MLY_FIRST_DATE          : NULL
  ..$ MLY_LAST_DATE           : NULL
  ..$ HAS_MONTHLY_SUMMARY     : chr "Y"
  ..$ HAS_NORMALS_DATA        : chr "N"
  ..$ HAS_HOURLY_DATA         : chr "N"
 $ geometry  :List of 2
  ..$ type       : chr "Point"
  ..$ coordinates:List of 2
  .. ..$ : num -66.1
  .. ..$ : num 48.1
 $ id        : chr "705AA86"

After some frustration, I found that the geometry$coordinates are the correct latitude/longitude – those in the properties list are slightly off for some reason. Extract the data:

climate_stations <- map_dfr(
  climate_stations,
  ~ discard(.x$properties, is.null) %>% as_tibble() %>%
    mutate(lat = .x$geometry$coordinates[[2]],
           lon = .x$geometry$coordinates[[1]])
) %>%
  janitor::clean_names() %>%
  # Drop the incorrect latitude and longitude
  select(-latitude, -longitude)

glimpse(climate_stations)

Rows: 8,552
Columns: 32
$ stn_id                   <int> 8496, 9005, 10205, 6149, 6154, 6174, 6177, 61…
$ station_name             <chr> "CARLETON SUR MER", "PORT COLBORNE (AUT)", "K…
$ prov_state_terr_code     <chr> "QC", "ON", "QC", "NB", "NB", "NB", "NB", "NB…
$ eng_prov_name            <chr> "QUEBEC", "ONTARIO", "QUEBEC", "NEW BRUNSWICK…
$ fre_prov_name            <chr> "QUÉBEC", "ONTARIO", "QUÉBEC", "NOUVEAU-BRUNS…
$ country                  <chr> "CAN", "CAN", "CAN", "CAN", "CAN", "CAN", "CA…
$ timezone                 <chr> "EST", "EST", "EST", "AST", "AST", "AST", "AS…
$ elevation                <chr> "541.00", "183.50", "123.80", "152.40", "173.…
$ climate_identifier       <chr> "705AA86", "613F606", "7113382", "8101178", "…
$ station_type             <chr> "N/A", "Climate-Auto", "N/A", "N/A", "N/A", "…
$ publication_code         <int> 1, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ display_code             <int> 9, NA, NA, 7, 9, 5, 9, 7, 7, 9, 9, 8, 9, 9, 1…
$ first_date               <chr> "1968-10-01 00:00:00", "1992-12-02 00:00:00",…
$ last_date                <chr> "1968-10-31 00:00:00", "2022-04-24 12:30:02",…
$ dly_first_date           <chr> "1968-10-01 00:00:00", "1992-12-02 00:00:00",…
$ dly_last_date            <chr> "1968-10-31 00:00:00", "2022-04-24 00:00:00",…
$ has_monthly_summary      <chr> "Y", "Y", "N", "Y", "Y", "Y", "Y", "Y", "Y", …
$ has_normals_data         <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", …
$ has_hourly_data          <chr> "N", "Y", "Y", "N", "N", "N", "N", "N", "N", …
$ lat                      <dbl> 48.13333, 42.86667, 60.02306, 45.93333, 47.36…
$ lon                      <dbl> -66.11667, -79.25000, -70.00361, -64.78333, -…
$ tc_identifier            <chr> NA, "WPC", "YAS", NA, NA, NA, NA, NA, NA, NA,…
$ wmo_identifier           <chr> NA, "71463", NA, NA, NA, NA, NA, NA, NA, NA, …
$ eng_stn_operator_acronym <chr> NA, "ECCC - MSC", "DND", NA, NA, NA, NA, NA, …
$ fre_stn_operator_acronym <chr> NA, "ECCC - SMC", "MDN", NA, NA, NA, NA, NA, …
$ eng_stn_operator_name    <chr> NA, "Environment and Climate Change Canada - …
$ fre_stn_operator_name    <chr> NA, "Environnement et Changement climatique C…
$ hly_first_date           <chr> NA, "1994-02-01 02:00:00", "1992-10-21 07:00:…
$ hly_last_date            <chr> NA, "2022-04-24 12:30:02", "2015-09-10 13:00:…
$ mly_first_date           <chr> NA, "2006-04-01 00:00:00", NA, "1964-01-01 00…
$ mly_last_date            <chr> NA, "2006-12-01 00:00:00", NA, "1979-12-01 00…
$ normal_code              <chr> NA, NA, NA, "F", NA, "D", NA, "F", "F", NA, N…

Now I’ll filter this list down to those in Halifax, NS using distance to the bike counter latitude/longitude means:

climate_stations_halifax <- climate_stations %>%
  filter(prov_state_terr_code == "NS") %>%
  mutate(
    # Compare to the mean lat/lon from the bike counters
    diff_lat = abs(lat - mean_lat), diff_lon = abs(lon - mean_lon),
    # Use squared distance to determine the closest points
    diff2 = diff_lat^2 + diff_lon^2
  ) %>%
  # Look at the top 5 for now
  slice_min(diff2, n = 5)

climate_stations_halifax %>% rmarkdown::paged_table()

ABCDEFGHIJ0123456789

stn_id <int>	station_name <chr>	prov_state_terr_code <chr>	eng_prov_name <chr>
6357	HALIFAX CITADEL	NS	NOVA SCOTIA
43405	HALIFAX DOCKYARD	NS	NOVA SCOTIA
6356	HALIFAX	NS	NOVA SCOTIA
6355	HALIFAX	NS	NOVA SCOTIA
6332	DARTMOUTH	NS	NOVA SCOTIA

Visualize the locations of the stations and bike counters:

d <- bind_rows(
  site_locs %>% mutate(group = "bike counters", label = site_name),
  climate_stations_halifax %>%
    transmute(label = glue("{station_name} ({stn_id})"),
              lat, lon, diff2, group = "climate stations")
)
  
ggmap(halifax_map) +
  geom_point(data = d, size = 4,
             aes(fill = group), shape = 21, color = "white") +
  ggrepel::geom_label_repel(
    data = d,
    aes(color = group, label = str_trunc(label, width = 25)),
    box.padding = 2
  ) +
  theme_void() +
  theme(legend.position = "none")

Halifax Citadel is the closest to the center, but last_date is 2002-01-31 for this station, so it hasn’t been active for the past two decades. The next closest is the dockyard, which is actively being updated (last_date is 2022-04-24).

Now with the station name (“HALIFAX DOCKYARD”), I can request the daily climate reports:

resp <- GET(
  paste0(
    base_url,
    "collections/climate-daily/items?f=json&limit=10000&STATION_NAME=HALIFAX%20DOCKYARD"
  )
)

content_parsed <- content(resp, as = "parsed")
str(content_parsed, max.level = 1)

List of 6
 $ type          : chr "FeatureCollection"
 $ features      :List of 1399
 $ numberMatched : int 1399
 $ numberReturned: int 1399
 $ links         :List of 4
 $ timeStamp     : chr "2022-04-27T03:27:56.527206Z"

The features data:

daily_climate <- content_parsed$features
str(daily_climate[[1]])

List of 4
 $ id        : chr "8202240.2021.2.11"
 $ type      : chr "Feature"
 $ geometry  :List of 2
  ..$ coordinates:List of 2
  .. ..$ : num -63.6
  .. ..$ : num 44.7
  ..$ type       : chr "Point"
 $ properties:List of 34
  ..$ STATION_NAME            : chr "HALIFAX DOCKYARD"
  ..$ CLIMATE_IDENTIFIER      : chr "8202240"
  ..$ ID                      : chr "8202240.2021.2.11"
  ..$ LOCAL_DATE              : chr "2021-02-11 00:00:00"
  ..$ PROVINCE_CODE           : chr "NS"
  ..$ LOCAL_YEAR              : int 2021
  ..$ LOCAL_MONTH             : int 2
  ..$ LOCAL_DAY               : int 11
  ..$ MEAN_TEMPERATURE        : num -7.5
  ..$ MEAN_TEMPERATURE_FLAG   : NULL
  ..$ MIN_TEMPERATURE         : num -9.9
  ..$ MIN_TEMPERATURE_FLAG    : NULL
  ..$ MAX_TEMPERATURE         : num -5.1
  ..$ MAX_TEMPERATURE_FLAG    : NULL
  ..$ TOTAL_PRECIPITATION     : NULL
  ..$ TOTAL_PRECIPITATION_FLAG: NULL
  ..$ TOTAL_RAIN              : NULL
  ..$ TOTAL_RAIN_FLAG         : NULL
  ..$ TOTAL_SNOW              : NULL
  ..$ TOTAL_SNOW_FLAG         : NULL
  ..$ SNOW_ON_GROUND          : NULL
  ..$ SNOW_ON_GROUND_FLAG     : NULL
  ..$ DIRECTION_MAX_GUST      : int 28
  ..$ DIRECTION_MAX_GUST_FLAG : NULL
  ..$ SPEED_MAX_GUST          : int 47
  ..$ SPEED_MAX_GUST_FLAG     : NULL
  ..$ COOLING_DEGREE_DAYS     : int 0
  ..$ COOLING_DEGREE_DAYS_FLAG: NULL
  ..$ HEATING_DEGREE_DAYS     : num 25.5
  ..$ HEATING_DEGREE_DAYS_FLAG: NULL
  ..$ MIN_REL_HUMIDITY        : int 41
  ..$ MIN_REL_HUMIDITY_FLAG   : NULL
  ..$ MAX_REL_HUMIDITY        : int 66
  ..$ MAX_REL_HUMIDITY_FLAG   : NULL

Unfortunately, this station does not report some helpful measurements, like precipitation and snowfall. I might have to expand my search to find a more informative station:

climate_stations_halifax <- climate_stations %>%
  filter(prov_state_terr_code == "NS",
         # Only include stations with recent data
         last_date > "2022-04-21") %>%
  mutate(
    diff_lat = abs(lat - mean_lat), diff_lon = abs(lon - mean_lon),
    diff2 = diff_lat^2 + diff_lon^2
  ) %>%
  slice_min(diff2, n = 5)

climate_stations_halifax %>% rmarkdown::paged_table()

ABCDEFGHIJ0123456789

stn_id <int>	station_name <chr>	prov_state_terr_code <chr>	eng_prov_name <chr>
43405	HALIFAX DOCKYARD	NS	NOVA SCOTIA
43403	HALIFAX WINDSOR PARK	NS	NOVA SCOTIA
10945	SHEARWATER JETTY	NS	NOVA SCOTIA
7169	MCNABS ISLAND (AUT)	NS	NOVA SCOTIA
43124	HALIFAX KOOTENAY	NS	NOVA SCOTIA

Visualize these station locations in a zoomed out map:

hrm_map <- get_googlemap(c(mean_lon, mean_lat),
                         zoom = 12, maptype = "satellite")

d <- bind_rows(
  site_locs %>% mutate(group = "bike counters", label = site_name),
  climate_stations_halifax %>%
    transmute(label = glue("{station_name} ({stn_id})"),
              lat, lon, diff2, group = "climate stations")
)
  
ggmap(hrm_map) +
  geom_point(data = d, size = 4,
             aes(fill = group), shape = 21, color = "white") +
  ggrepel::geom_label_repel(
    data = d,
    aes(color = group, label = str_trunc(label, width = 25)),
    box.padding = 0.5, force = 1.5
  ) +
  theme_void() +
  theme(legend.position = "none")

Exploring the data from these stations a bit (not shown), Halifax Windsor Park seems a reasonable choice in terms of available data.

resp <- GET(
  paste0(base_url,
         "collections/climate-daily/items?f=json&limit=10000&STATION_NAME=",
         URLencode("HALIFAX WINDSOR PARK"))
)

content_parsed <- content(resp, as = "parsed")

daily_climate <- map_dfr(
  content_parsed$features,
  ~ discard(.x$properties, is.null) %>% as_tibble()
) %>%
  janitor::clean_names()

glimpse(daily_climate)

Rows: 1,431
Columns: 29
$ station_name             <chr> "HALIFAX WINDSOR PARK", "HALIFAX WINDSOR PARK…
$ climate_identifier       <chr> "8202255", "8202255", "8202255", "8202255", "…
$ id                       <chr> "8202255.2021.2.11", "8202255.2021.2.12", "82…
$ local_date               <chr> "2021-02-11 00:00:00", "2021-02-12 00:00:00",…
$ province_code            <chr> "NS", "NS", "NS", "NS", "NS", "NS", "NS", "NS…
$ local_year               <int> 2021, 2021, 2021, 2021, 2021, 2018, 2018, 201…
$ local_month              <int> 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ local_day                <int> 11, 12, 13, 14, 15, 14, 15, 16, 17, 18, 19, 2…
$ mean_temperature         <dbl> -8.1, -8.1, -5.2, -5.0, -1.5, 10.8, 12.1, 9.0…
$ min_temperature          <dbl> -11.1, -13.0, -8.0, -8.9, -5.4, 5.0, 9.1, 5.4…
$ max_temperature          <dbl> -5.2, -3.2, -2.3, -1.2, 2.4, 16.5, 15.1, 12.5…
$ snow_on_ground           <int> 13, 12, 11, 11, 11, NA, NA, NA, NA, NA, NA, N…
$ direction_max_gust       <int> 26, 30, NA, NA, NA, 22, 21, 33, NA, NA, NA, 2…
$ speed_max_gust           <int> 42, 32, NA, NA, NA, 38, 48, 32, NA, NA, NA, 4…
$ cooling_degree_days      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
$ heating_degree_days      <dbl> 26.1, 26.1, 23.2, 23.0, 19.5, 7.2, 5.9, 9.0, …
$ min_rel_humidity         <int> 39, 66, 52, 67, 40, 37, 68, 40, 38, 21, 30, 8…
$ max_rel_humidity         <int> 68, 87, 90, 90, 83, 90, 96, 87, 97, 89, 96, 9…
$ total_precipitation_flag <chr> NA, NA, NA, NA, "M", NA, NA, NA, NA, NA, NA, …
$ total_precipitation      <dbl> NA, NA, NA, NA, NA, 0.0, 0.6, 0.0, 0.0, 0.0, …
$ mean_temperature_flag    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ min_temperature_flag     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ max_temperature_flag     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ direction_max_gust_flag  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ speed_max_gust_flag      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ cooling_degree_days_flag <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ heating_degree_days_flag <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ min_rel_humidity_flag    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ max_rel_humidity_flag    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

EDA and cleaning

Variable summaries:

skimr::skim(daily_climate)

Data summary
Name	daily_climate
Number of rows	1431
Number of columns	29
_______________________
Column type frequency:
character	15
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
station_name	0	1.00	20	20	1
climate_identifier	0	1.00	7	7	1
id	0	1.00	16	18	1431
local_date	0	1.00	19	19	1431
province_code	0	1.00	2	2	1
total_precipitation_flag	1098	0.23	1	1	1
mean_temperature_flag	1412	0.01	1	1	1
min_temperature_flag	1412	0.01	1	1	1
max_temperature_flag	1412	0.01	1	1	1
direction_max_gust_flag	1412	0.01	1	1	1
speed_max_gust_flag	1412	0.01	1	1	1
cooling_degree_days_flag	1412	0.01	1	1	1
heating_degree_days_flag	1412	0.01	1	1	1
min_rel_humidity_flag	1412	0.01	1	1	1
max_rel_humidity_flag	1409	0.02	1	1	1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
local_year	0	1.00	2019.83	1.20	2018.0	2019.0	2020.0	2021.00	2022.0	▅▇▇▇▂
local_month	0	1.00	6.59	3.44	1.0	4.0	7.0	10.00	12.0	▇▅▅▅▇
local_day	0	1.00	15.85	8.76	1.0	8.0	16.0	23.00	31.0	▇▇▇▇▆
mean_temperature	19	0.99	8.51	9.16	-13.5	1.3	8.2	16.30	26.8	▂▆▇▇▅
min_temperature	19	0.99	4.12	8.95	-17.3	-2.1	3.9	11.93	20.7	▂▅▇▆▆
max_temperature	19	0.99	12.90	9.66	-11.0	5.0	12.9	21.20	34.1	▂▇▇▇▃
snow_on_ground	1061	0.26	2.82	4.07	0.0	0.0	1.0	3.00	24.0	▇▁▁▁▁
direction_max_gust	554	0.61	23.77	8.16	1.0	21.0	24.0	30.00	36.0	▂▃▅▇▆
speed_max_gust	554	0.61	42.48	10.34	31.0	35.0	40.0	47.00	96.0	▇▂▁▁▁
cooling_degree_days	19	0.99	0.61	1.48	0.0	0.0	0.0	0.00	8.8	▇▁▁▁▁
heating_degree_days	19	0.99	10.10	8.32	0.0	1.7	9.8	16.70	31.5	▇▅▅▃▁
min_rel_humidity	19	0.99	56.63	18.72	15.0	42.0	56.0	70.00	100.0	▂▇▇▆▂
max_rel_humidity	22	0.98	92.31	9.40	47.0	88.0	96.0	100.00	100.0	▁▁▁▂▇
total_precipitation	554	0.61	4.16	9.75	0.0	0.0	0.0	3.00	102.4	▇▁▁▁▁

Drop some un-needed variables:

daily_climate <- daily_climate %>%
  select(-station_name, -climate_identifier, -id, -province_code)

Process the date variable:

daily_climate <- daily_climate %>%
  mutate(report_date = as.POSIXct(local_date) %>% as.Date()) %>%
  # Can drop these now
  select(-local_date, -local_year, -local_month, -local_day)

There happens to be some missing days:

tibble(report_date = seq.Date(as.Date("2018-05-14"), as.Date("2022-04-22"),
                              by = "days")) %>%
  anti_join(daily_climate, by = "report_date") %>%
  pull(report_date)

 [1] "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05" "2020-01-06"
 [6] "2021-01-03" "2021-01-04" "2021-01-05" "2021-01-06" "2021-01-07"
[11] "2022-01-03"

Seems odd that all of the missing days are in January of different years.

There are also some missing temperature values:

daily_climate %>%
  filter(
    is.na(mean_temperature) | is.na(min_temperature) | is.na(max_temperature)
  ) %>%
  select(report_date, contains("_temperature")) %>%
  rmarkdown::paged_table()

ABCDEFGHIJ0123456789

report_date <date>	mean_temperature <dbl>	min_temperature <dbl>	max_temperature <dbl>
2018-06-04	NA	NA	NA
2018-11-21	NA	NA	NA
2018-11-25	NA	NA	NA
2019-03-15	NA	NA	NA
2019-07-24	NA	NA	NA
2020-01-01	NA	NA	NA
2020-01-07	NA	NA	NA
2020-02-11	NA	NA	NA
2020-05-14	NA	NA	NA
2020-09-02	NA	NA	NA

The report_dates range from 2018 to 2022. The flag values (*_temperature_flag) are all “M”, telling us what we already know: the data is missing.

For non-missing values, here is the trend over time:

daily_climate %>%
  filter(!is.na(mean_temperature)) %>%
  ggplot(aes(x = report_date)) +
  geom_line(aes(y = mean_temperature), color = td_colors$nice$ruby_red)

The total_precipitation values:

daily_climate %>%
  count(total_precipitation, total_precipitation_flag) %>%
  arrange(desc(is.na(total_precipitation))) %>%
  rmarkdown::paged_table()

ABCDEFGHIJ0123456789

total_precipitation <dbl>	total_precipitation_flag <chr>	n <int>
NA	M	333
NA	NA	221
0.0	NA	472
0.2	NA	69
0.4	NA	26
0.6	NA	12
0.8	NA	9
1.0	NA	6
1.2	NA	6
1.4	NA	16

There are missing total_precipitation values with NA total_precipitation_flag, which makes me think that the flag variables are not going to be useful/reliable.

Visualize the non-missing:

daily_climate %>%
  filter(!is.na(total_precipitation)) %>%
  ggplot(aes(x = report_date)) +
  geom_point(aes(y = total_precipitation), color = td_colors$nice$spanish_blue)

The snow_on_ground values:

daily_climate %>%
  count(snow_on_ground) %>%
  arrange(desc(is.na(snow_on_ground))) %>%
  rmarkdown::paged_table()

ABCDEFGHIJ0123456789

snow_on_ground <int>	n <int>
NA	1061
0	100
1	109
2	59
3	18
4	9
5	15
6	5
7	6
8	3

daily_climate %>%
  filter(!is.na(snow_on_ground)) %>%
  ggplot(aes(x = report_date)) +
  geom_point(aes(y = snow_on_ground), color = td_colors$nice$spanish_blue)

The speed_max_gust values (in km/h):

daily_climate %>%
  count(speed_max_gust, speed_max_gust_flag) %>%
  arrange(desc(is.na(speed_max_gust))) %>%
  rmarkdown::paged_table()

ABCDEFGHIJ0123456789

speed_max_gust <int>	speed_max_gust_flag <chr>	n <int>
NA	M	19
NA	NA	535
31	NA	42
32	NA	58
33	NA	56
34	NA	63
35	NA	45
36	NA	52
37	NA	41
38	NA	38

daily_climate %>%
  filter(!is.na(speed_max_gust)) %>%
  ggplot(aes(x = report_date)) +
  geom_point(aes(y = speed_max_gust), color = td_colors$nice$emerald)

Combining the data

Now I’ll combine the two data sets (joining on the date), only taking the most useful variables from the climate report (temperature, precipitation, wind speed, snow):

bike_counts_climate <- bike_data_daily_counts %>%
  left_join(
    daily_climate %>%
      select(report_date, mean_temperature, total_precipitation,
             speed_max_gust, snow_on_ground),
    by = c("count_date" = "report_date")
  )
glimpse(bike_counts_climate)

Rows: 2,840
Columns: 9
$ site_name           <chr> "Dartmouth Harbourfront Greenway", "Dartmouth Harb…
$ installation_date   <date> 2021-07-08, 2021-07-08, 2021-07-08, 2021-07-08, 2…
$ count_date          <date> 2021-07-08, 2021-07-09, 2021-07-10, 2021-07-11, 2…
$ n_records           <int> 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48…
$ n_bikes             <int> 130, 54, 180, 245, 208, 250, 182, 106, 152, 257, 1…
$ mean_temperature    <dbl> 18.2, 17.6, 21.0, 21.0, 20.6, 18.6, 17.7, 20.2, 20…
$ total_precipitation <dbl> 0.6, 10.0, 0.4, 0.0, 0.0, 0.0, 0.0, 11.6, 0.2, 0.4…
$ speed_max_gust      <int> NA, 54, 56, NA, NA, NA, NA, 32, 37, NA, NA, NA, NA…
$ snow_on_ground      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

Visualize the missing climate data:

bike_counts_climate %>%
  distinct(count_date, mean_temperature, total_precipitation,
           speed_max_gust, snow_on_ground) %>%
  mutate(across(where(is.numeric), is.na)) %>%
  pivot_longer(cols = -count_date) %>%
  ggplot(aes(x = count_date, y = name)) +
  geom_tile(aes(fill = value)) +
  labs(y = NULL, x = NULL, fill = "Missing") +
  scale_fill_manual(values = c(td_colors$nice$indigo_blue, "gray80")) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  theme(legend.position = "top")

Quite a bit of missing data, but we should have enough to make for an interesting analysis. Save the data:

write_rds(bike_counts_climate, "bike-ridership-data.rds")

In my next post, I will use this data to develop and evaluate various prediction models.

Reproducibility

Session info

 setting  value
 version  R version 4.2.1 (2022-06-23 ucrt)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Canada.utf8
 ctype    English_Canada.utf8
 tz       America/Curacao
 date     2022-10-27
 pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)

ABCDEFGHIJ0123456789

	package <chr>	ondiskversion <chr>	loadedversion <chr>
assertthat	assertthat	0.2.1	0.2.1
backports	backports	1.4.1	1.4.1
base64enc	base64enc	0.1.3	0.1-3
bitops	bitops	1.0.7	1.0-7
broom	broom	1.0.0	1.0.0
cachem	cachem	1.0.6	1.0.6
callr	callr	3.7.1	3.7.1
cellranger	cellranger	1.1.0	1.1.0
checkmate	checkmate	2.1.0	2.1.0
cli	cli	3.3.0	3.4.1

Git repository

Local:    main C:/Users/tdunn/Documents/tdunn-quarto
Remote:   main @ origin (https://github.com/taylordunn/tdunn-quarto.git)
Head:     [4eb5bf2] 2022-10-26: Added font import to style sheet

Source code, R environment

Footnotes

Went through some trial and error to get the get_googlemap() function working here. In brief, I (1) downloaded the development version of ggmap from GitHub (remotes::install_github("dkahle/ggmap")) (2) created a new project on my Google Cloud Platform (GCP) account, (3) added an API key with access to the Google Static Maps API and registered it with register_google(), and (4) had to enable billing (because my free trial had been used).↩︎
To build these API queries, I found this documentation to be very helpful.↩︎

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{dunn2022,
  author = {Dunn, Taylor},
  title = {Predicting Bike Ridership: Getting the Data},
  date = {2022-04-27},
  url = {https://tdunn.ca/posts/2022-04-27-predicting-bike-ridership-getting-the-data},
  langid = {en}
}

For attribution, please cite this work as:

Dunn, Taylor. 2022. “Predicting Bike Ridership: Getting the Data.” April 27, 2022. https://tdunn.ca/posts/2022-04-27-predicting-bike-ridership-getting-the-data.