We wanted to analyze how demographic patterns correlate with public transportation usage trends because we believe that there might be areas or demographics with higher demands but less accessibility to public transit or alternately, areas with access but people are not using it as much. From here, we hope that some of our findings will help maximize efficiency in terms of reallocating resources to areas with more traffic.
As we proceeded with our research, these were the questions that helped us form a cohesive path in terms of data gathering and visualization:
What is the correlation between income and ridership in the Twin Cities?
How do college students influence ridership?
library(urltools)
library(jsonlite)
library(readr)
library(tidyverse)
library(ggmap)
library(ggplot2)
library(lubridate)
library(geosphere)
library(tigris)
library(leaflet)
library(htmltools)
library(reshape2)
options(tigris_use_cache = TRUE)
# loading the data on daily boarding and bus stops
stop_usage <- read_csv("daily_boarding.csv")
stops <- read_csv("bus_stops.csv")
First, we wanted to examine the trend in public transportation usage to determine whether public transit is a popular means of transportation among the population. We then gathered this data from the Minnesota Compass on the average number of jobs reachable within 30 minutes by public transit, walking, or cycling in the Twin Cities seven-county area, and we examined it to see how it could be incorporated into our research.
# getting the data
distance_county <- read.csv("http://api.mncompass.org/embed/b12115-2.csv",
header=TRUE, sep=",", row.names=NULL)
# cleaning the data
shift <- function(x, n){
c(x[-(seq(n))], rep(NA, n))}
distance_county <- transform(distance_county, Average.number.of.jobs.reachable..br...within.30.minutes.by.public.transit.or.walking = shift(distance_county$Average.number.of.jobs.reachable..br...within.30.minutes.by.public.transit.or.walking, 1))
distance_county <- distance_county %>%
filter(distance_county$row.names != "Average number of jobs that can be reached from each block within 30 minutes")
distance_county <- distance_county[-c(1,2,10, 11,12,13,14,15,16,17,18,19,20,21),]
names(distance_county) <- c("county", "within30")
cleanwithin30 <- c(4441, 1908, 10897, 51618, 47213, 469, 1703)
distance_county <-
distance_county %>%
mutate(within30=cleanwithin30)
distance_county
## county within30
## 1 Anoka 4441
## 2 Carver 1908
## 3 Dakota 10897
## 4 Hennepin 51618
## 5 Ramsey 47213
## 6 Scott 469
## 7 Washington 1703
With the above figures, we collected data from the Census for the corresponding regions to see how many people actually commuted to work via public transit, walking, or cycling.
# constructing a data frame
# the county names
county <- c('Anoka', 'Carver', 'Dakota', 'Hennepin', 'Ramsey', 'Scott', 'Washington')
# the number of people in each county who uses public transportation to work
people <- c(5858, 1103, 7465, 49100, 19162, 1198, 3273)
# adding the average number of jobs that can be reached from each block within 30 minutes
commute_county <- data.frame(county, people, cleanwithin30)
names(commute_county) <- c("County", "Commute by Public Transport/Walking/Biking", "Has a Job Accessible within 30 Minutes by Public Transport/Walking/Biking")
commute_county
## County Commute by Public Transport/Walking/Biking
## 1 Anoka 5858
## 2 Carver 1103
## 3 Dakota 7465
## 4 Hennepin 49100
## 5 Ramsey 19162
## 6 Scott 1198
## 7 Washington 3273
## Has a Job Accessible within 30 Minutes by Public Transport/Walking/Biking
## 1 4441
## 2 1908
## 3 10897
## 4 51618
## 5 47213
## 6 469
## 7 1703
Then, we compared these two figures for each county to see whether people were commuting to work by public transit, walking, or cycling when their workplace was accessible.
# constructing the visualization
commute.long<-melt(commute_county)
ggplot(commute.long, aes(County, value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
labs(x="County", y="Number of People", fill="Variable") +
theme(axis.title=element_text(size=10)) +
theme(axis.text.x=element_text(size=7)) +
theme(axis.text.y=element_text(size=8)) +
theme(legend.title=element_text(size=8)) +
theme(legend.text=element_text(size=7))
Here, we observed that:
With the exception of Ramsey county, these two numbers for each county did not differ too much from each other.
Three out of four counties, Dakota, Hennepin, and Ramsey, had more people with an accessible workplace than the number of people actually commuting by these methods.
Hennepin and Ransey county had significantly more traffic than the other five counties.
Consequently, we narrowed down our research to Hennepin and Ramsey county, specifically Minneapolis and Saint Paul, because with more data, we believed that we could better analyze how the population here interacts with MetroTransit. At the same time, as jobs and income are closely related, we began to investigate whether differences in incomes between areas would lead to a change in ridership.
# gathering and constructing the necessary data
# getting data on all zip codes and filter those for Minnesota
zip_code_database <- read_csv("free-zipcode-database.csv")
zip_code_database_mn <-
zip_code_database %>%
filter(State == "MN")
# filtering the zip codes in Saint Paul and Minneapolis and cleaning up the data
zip_code_saint_paul <-
zip_code_database_mn %>%
filter(City %in% c("SAINT PAUL", "W SAINT PAUL", "WEST SAINT PAUL", "N SAINT PAUL", "NORTH SAINT PAUL", "SO SAINT PAUL"), ZipCodeType == "STANDARD")
zip_code_minneapolis <-
zip_code_database_mn %>%
filter(City == "MINNEAPOLIS", ZipCodeType == "STANDARD")
# constructing a vector for the median income per household by zip codes in Saint Paul and Minneapolis
income_saint_paul <- c(31155, 34756, 29558, 38237, 53879, 36560, 39552, 39552, 39552, 43277, 50614, 50614, 50614, 59373, NA, 54365, 50616, 26895, 72877, 46863, 40132, 50325, 50325, 50325, 45666, 73901, 54358, 60298, 82004, 70253, 73754, 68941, 75091, 55632, 93218, NA, NA)
income_minneapolis <- c(40716, 30391, 30702, 20923, 40368, 40867, 37462, 34216, 55424,64084, 28434, 38818, 33774, 28426, 30156, 55252, 52127, 41344, 63513,47295,39526, 51885, 45598, 93481, 44538, 47145, 55685, 43970, 43511, 39620, 57197, 48370, 52128, 54887, 42487, 72361, 58483, 70011, 87302, 63068, 84451, 62347, 65043, 63662, 91431, 75029, 60226, 67617, NA, 14360, 66250)
# adding the vectors to the datasets and merging the datasets into a single data frame
zip_code_saint_paul <-
zip_code_saint_paul %>%
mutate(average_income_per_household=income_saint_paul)
zip_code_minneapolis <-
zip_code_minneapolis %>%
mutate(average_income_per_household=income_minneapolis)
income_data <- rbind(zip_code_minneapolis, zip_code_saint_paul)
head(income_data)
## # A tibble: 6 x 21
## RecordNumber Zipcode ZipCodeType City State LocationType Lat Long
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 29401 55401 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## 2 29402 55402 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## 3 29403 55403 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## 4 29404 55404 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## 5 29405 55405 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## 6 29406 55406 STANDARD MINN~ MN PRIMARY 45.0 -93.3
## # ... with 13 more variables: Xaxis <dbl>, Yaxis <dbl>, Zaxis <dbl>,
## # WorldRegion <lgl>, Country <chr>, LocationText <chr>, Location <chr>,
## # Decommisioned <lgl>, TaxReturnsFiled <dbl>, EstimatedPopulation <dbl>,
## # TotalWages <dbl>, Notes <chr>, average_income_per_household <dbl>
# getting the shapes for each zip code in Minnesota
suppressMessages(mn_zips<- zctas(state="mn"))
# joining the shapes to each zip code from the income data
mn_zips_joined <-
geo_join(mn_zips, income_data, by_sp="ZCTA5CE10", by_df="Zipcode", how="inner")
As MetroTransit provides coverage throughout the entire area, we directed our research towards areas with bus stops that are less frequently used than others.
# filtering the stops that had less than 10 boardings in 2016
under_used <- stop_usage %>%
filter(calendar_date>=as.Date("2016-01-01")&calendar_date<as.Date("2016-12-31")) %>%
group_by(site_id) %>% summarise(count=sum(daily_boardings)) %>%
filter(count<10) %>%
inner_join(stops,by=c("site_id")) %>% select(site_id,site_latitude,site_longitude,count)
# constructing the visualization
leaflet(under_used) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addCircleMarkers(data=under_used, ~site_longitude, ~site_latitude, stroke=FALSE, radius=5)
# constructing the visualization
pal <- colorBin("OrRd", domain=mn_zips_joined$average_income_per_household, 4)
leaflet(mn_zips_joined) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addPolygons(fillColor = ~pal(average_income_per_household),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7, highlight = highlightOptions(weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE)) %>%
addLegend(pal = pal,
values = ~average_income_per_household,
opacity = 0.7,
position = "bottomright",
title="Income",
labFormat = labelFormat(prefix = "$"))
When we visualized the two datasets, we observed a similar spatial pattern so we decided to combine the bus stops and the income maps to better compare the trend.
# constructing the visualization
pal <- colorBin("OrRd", domain = mn_zips_joined$average_income_per_household, 4)
leaflet(mn_zips_joined) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addPolygons(fillColor = ~pal(average_income_per_household),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7, highlight = highlightOptions(weight = 5,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE)) %>%
addLegend(pal = pal,
values = ~average_income_per_household,
opacity = 0.7,
position = "bottomright",
title="Income",
labFormat = labelFormat(prefix = "$")) %>%
addMarkers(data=under_used,~site_longitude,~site_latitude,clusterOptions=markerClusterOptions())
We observed that:
Regarding median income, the figure is lower for areas in the center of each cities and it is higher as we move further away from the center.
Similarly, regarding bus stop boardings, the center of both cities had very little stops that were frequented less than 10 times in 2016. However, the outskirts of the cities had a lot of stops that were underused.
In order to begin our research, we made use of the Google Places API to get the locations of several colleges in the Twin Cities area.
# importing retrieved API information to a dataset
# Note, this was originally constructed using Google Places API, which has since changed # its terms of use and was no longer feasible to use for this project.
college_locations <- data.frame(
place = c("University of Northwester St. Paul","Hamline University","Macalester College","Bell Museum of Natural History", "Walden University"),
coords.lon = c(-93.16857,-93.16481,-93.16873,-93.23340,-93.26569),
coords.lat = c(45.03156,44.96597,44.93976,44.97783,44.98129)
)
head(college_locations)
## place coords.lon coords.lat
## 1 University of Northwester St. Paul -93.16857 45.03156
## 2 Hamline University -93.16481 44.96597
## 3 Macalester College -93.16873 44.93976
## 4 Bell Museum of Natural History -93.23340 44.97783
## 5 Walden University -93.26569 44.98129
Using the coordinates we retrieved from Google Places, we created a visualization of the average daily ridership over the course of 2016 in two areas - one around Macalester, and one around West Saint Paul. We chose these areas because the Macalester area has colleges around, whereas West Saint Paul does not. Both areas are of similar income levels.
# getting the average daily ridership around Macalester for every day in the year 2016
macalester_ridership <-
stop_usage %>%
filter(year(calendar_date)==2016) %>%
inner_join(stops,by="site_id") %>%
select(site_id, calendar_date, site_latitude, site_longitude, daily_boardings) %>%
filter(site_latitude >= 44.92 &
site_latitude <= 44.96 &
site_longitude >= -93.19 &
site_longitude <= -93.15) %>%
group_by(calendar_date) %>%
summarise(ridership=mean(daily_boardings)) %>%
mutate(around_mac=TRUE)
# getting the average daily ridership for an area of West Saint Paul for every day in the year 2016
w_st_paul_ridership <-
stop_usage %>%
filter(year(calendar_date)==2016) %>%
inner_join(stops,by="site_id") %>%
select(site_id, calendar_date, site_latitude, site_longitude, daily_boardings) %>%
filter(site_latitude >= 44.9 &
site_latitude <= 44.93 &
site_longitude >= -93.12 &
site_longitude <= -93.08) %>%
group_by(calendar_date) %>%
summarise(ridership=mean(daily_boardings)) %>%
mutate(around_mac=FALSE)
# combining the data sets - around_mac=TRUE indicates ridership at stops near Macalester
combined_ridership <-
rbind(macalester_ridership, w_st_paul_ridership)
head(combined_ridership)
## # A tibble: 6 x 3
## calendar_date ridership around_mac
## <dttm> <dbl> <lgl>
## 1 2016-01-01 06:00:00 9.99 TRUE
## 2 2016-01-02 06:00:00 14.3 TRUE
## 3 2016-01-03 06:00:00 9.82 TRUE
## 4 2016-01-04 06:00:00 16.7 TRUE
## 5 2016-01-05 06:00:00 15.4 TRUE
## 6 2016-01-06 06:00:00 14.1 TRUE
# creating a scatterplot on the average daily ridership over time - colored by location (around_mac)
ggplot(combined_ridership,aes(x=calendar_date, y=ridership, color=around_mac)) +
geom_point() +
geom_smooth() +
labs(x="Time", y="Ridership", color="Area around Macalester College")
We observed that what is important here is not so much the overall ridership numbers as the curvature of the trend lines:
There is a slight dip during the summer near Macalester, then ridership began to increase starting from the State Fair, going all the way up towards the time around winter break.
This pattern is much less pronounced in West Saint Paul, where there are no colleges nearby.
Next, we decided to take a different approach to visualizing this trend - a map. We processed the data to get the average school-year (winter) ridership and average summer ridership at each stop. We then took the ratio of summer to winter ridership, and colored each stop on the map accordingly. Purple colors on the map indicate a smaller ratio and therefore more wintertime ridership, whereas green represents a larger ratio and therefore higher summer ridership.
# creating dataset of average summer and winter ridership for each stop around Macalester
mac_seasonal_ridership <-
stop_usage %>%
filter(year(calendar_date)==2016) %>%
mutate(summer=month(calendar_date) %in% c(6,7,8,9)) %>%
group_by(site_id, summer) %>%
summarise(boardings=mean(daily_boardings)) %>%
inner_join(stops, by="site_id") %>%
select(site_id, summer, boardings, site_latitude, site_longitude) %>%
filter(site_latitude >= 44.92 &
site_latitude <= 44.96 &
site_longitude >= -93.19 &
site_longitude <= -93.15) %>%
spread(summer,boardings) %>%
mutate(winter=`FALSE`, summer=`TRUE`) %>%
select(site_id, winter, summer, site_latitude, site_longitude) %>%
mutate(diff=summer/winter)
head(mac_seasonal_ridership)
## # A tibble: 6 x 6
## # Groups: site_id [6]
## site_id winter summer site_latitude site_longitude diff
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3083 0.861 0.869 44.9 -93.2 1.01
## 2 3084 13.2 11.8 44.9 -93.2 0.892
## 3 3085 10.8 10.1 44.9 -93.2 0.932
## 4 3086 2.20 1.25 44.9 -93.2 0.571
## 5 3087 2.45 2.23 44.9 -93.2 0.911
## 6 3088 1.43 0.836 44.9 -93.2 0.586
# defining a color bin to color the map by
pal2 <- colorBin("PRGn", domain=mac_seasonal_ridership$diff, 6)
# creating the visualization
leaflet(mac_seasonal_ridership) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addCircleMarkers(~site_longitude, ~site_latitude, color= ~pal2(diff)) %>%
addMarkers(data=college_locations, ~coords.lon, ~coords.lat, popup = ~htmlEscape(place)) %>%
addLegend(pal=pal2,
values=~diff,
opacity=0.7,
title="Purple=Winter, Green=Summer",
position="bottomright") %>%
setView(-93.16873,44.93976,13)
We noticed that:
There is a cluster of purple circles around the colleges (especially Macalester College) and throughout Snelling Avenue, suggesting an increase in the usage of public transit during school time, which we believe might be influenced by college students.
In the northern area around Snelling and near University Avenue, the same phenomenon is observed. In this area there are strip malls with a variety of retail and fast food chains as well as access to the light rail and University Avenue which makes us believe that the purple circles could again be caused by college students going to these locations when school is in session to shop, eat or access the light rail.
This image of the Macalester area does not mean much unless we compare it to other regions. First, we looked at West Saint Paul, an area where there are not any colleges around.
# creating a dataset for the average summer and winter ridership for each stop in West Saint Paul
w_st_paul_seasonal_ridership <-
stop_usage %>%
filter(year(calendar_date) == 2016) %>%
mutate(summer=month(calendar_date) %in% c(6, 7, 8, 9)) %>%
group_by(site_id, summer) %>%
summarise(boardings=mean(daily_boardings)) %>%
inner_join(stops, by="site_id") %>%
select(site_id, summer, boardings, site_latitude, site_longitude) %>%
filter(site_latitude >= 44.9 &
site_latitude <= 44.93 &
site_longitude >= -93.12 &
site_longitude <= -93.08) %>%
spread(summer,boardings) %>%
mutate(winter=`FALSE`, summer=`TRUE`) %>%
select(site_id, winter, summer, site_latitude, site_longitude) %>%
mutate(diff=summer/winter)
#creating the visualization
pal4 <- colorBin("PRGn", domain = w_st_paul_seasonal_ridership$diff, 5)
leaflet(w_st_paul_seasonal_ridership) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addCircleMarkers(~site_longitude, ~site_latitude, color= ~pal4(diff)) %>%
addMarkers(data=college_locations, ~coords.lon, ~coords.lat, popup = ~htmlEscape(place)) %>%
addLegend(pal=pal4,
values=~diff,
opacity=0.7,
title="Purple=Winter, Green=Summer",
position="bottomright") %>%
setView(-93.1016,44.9161,13)
The overall trend here is that ridership is skewed towards the school year, but compared to the area where there are colleges and univerities, we are seeing much lighter colors and even some green ones indicative of higher summer ridership.
Next we looked at Roseville, in the region south of the University of Northwestern Saint Paul.
# creating the dataset for the average summer and winter ridership for each stop in Roseville
roseville_seasonal_ridership <-
stop_usage %>%
filter(year(calendar_date)==2016) %>%
mutate(summer=month(calendar_date) %in% c(6, 7, 8, 9)) %>%
group_by(site_id, summer) %>%
summarise(boardings=mean(daily_boardings)) %>%
inner_join(stops, by="site_id") %>%
select(site_id, summer, boardings, site_latitude, site_longitude) %>%
filter(site_latitude >= 44.99 &
site_latitude <= 45.03 &
site_longitude >= -93.18 &
site_longitude <= -93.13) %>%
spread(summer, boardings) %>%
mutate(winter=`FALSE`, summer=`TRUE`) %>%
select(site_id, winter, summer, site_latitude, site_longitude) %>%
mutate(diff=summer/winter)
# creating the visualization
pal3 <- colorBin("PRGn", domain=roseville_seasonal_ridership$diff, 5)
leaflet(roseville_seasonal_ridership) %>%
addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png') %>%
addProviderTiles(provider=providers$CartoDB.Positron) %>%
addCircleMarkers(~site_longitude, ~site_latitude, color= ~pal3(diff)) %>%
addMarkers(data=college_locations, ~coords.lon, ~coords.lat, popup=~htmlEscape(place)) %>%
addLegend(pal=pal3,
values=~diff,
opacity=0.7,
title="Purple=Winter,Green=Summer",
position="bottomright") %>%
setView(-93.1566,45.0061,13)
We can see that the stops closest to the university are used a lot during the school year, while others have a more even usage pattern. It is notable that the University of Northwestern Saint Paul’s closest bus stop is relatively far from the campus even though they are likely users of the transit system. This might be one of the few areas the MetroTransit needs to add a stop to.
In the course of our research, we kept in mind several factors that might be external influences to how people use public transit but some of the observations that we drew from our research are:
Higher income areas tend to use public transport less than lower income areas.
There is an increase in bus stop boardings surrounding colleges during the school period.