The Marketing Department of Cyclistic has asked the Analytics team to deliver actionable insights into the usage characteristics of Cyclistic’s customer base to determine how to convert existing Casual users into Membership holders, which represents a more profitable customer demographic.
library(tidyverse)
library(lubridate)
The data for this product was imported from the company’s records in the form of 12 zip compressed CSV files. These were imported from the server using a bash script to gather all files pertaining to calendar year 2022 and stored locally in an rstudio project folder and managed on the teams local server as a Git repository. Only members of the team were given access to the source data. The CSV were created as read-only to maintain the integrity of the source data. After the csv files were reviewed the Zip files were removed from the repository.
#Import Data from source
#Source URL: https://divvy-tripdata.s3.amazonaws.com/index.html
#!/bin/bash
# Set the base URL for the zip files
base_url="https://divvy-tripdata.s3.amazonaws.com/"
# Iterate through months 1 to 12
for month in {1..12}; do
# Add leading zero if the month is single digit
if ((month < 10)); then
month="0${month}"
fi
# Construct the file URL
file_url="${base_url}2022${month}-divvy-tripdata.zip"
# Download the file using curl
curl -O "$file_url"
done
The individual datasets were concatenated using the rbind function with the following script:
# Set the directory path where the CSV files are located
directory <- "~/Desktop/gdac_cs_1/source"
# Get a list of all CSV files in the directory
csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty data frame to store the concatenated data
combined_data <- data.frame()
# Loop through each CSV file and concatenate the data
for (file in csv_files) {
# Read the CSV file
data <- read.csv(file)
}
# Concatenate the data using rbind
combined_data <- rbind(combined_data, data)
The complete dataset was was wriiten to the file 2202-all-divvy-tripdata.csv. The dataset was then assigned to the object df.
df0 <- read.csv("~/R/gdac_cs_1/source/2022-all-divvy-tripdata.csv")
names(df0)
## [1] "X" "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name"
## [7] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" "end_lat"
## [13] "end_lng" "member_casual"
Upon inspection of the data the naming conventions were found to be
acceptable for the analysis and it was determined that there was no
obvious concern regarding bias in the population. The team chose to
create new variables to gain clearer insight into the dataset such as
trip_duration, day_of_week, day_of_year, and month_of_year. It was
determined that observations of docked_bike
were in need of
cleaning as this value was not pertinent to the business question.
unique(df0$member_casual)
## [1] "casual" "member"
unique(df0$rideable_type)
## [1] "electric_bike" "classic_bike" "docked_bike"
colnames(df)[1] <- "id"
df$trip_duration <- as.numeric(difftime(df$ended_at,
df$started_at,
units = "hours"))
df$day_of_week <- wday(df$started_at)
df$day_of_year <- yday(df$started_at)
df$month_of_year <- month(df$started_at)
df$week_of_year <- week(df$started_at)
df$hour_of_day <- hour(df$started_at)
names(df)
## [1] "rideable_type" "started_at" "ended_at" "start_station_name" "start_station_id" "end_station_name"
## [7] "end_station_id" "start_lat" "start_lng" "end_lat" "end_lng" "member_casual"
## [13] "trip_duration" "day_of_week" "day_of_year" "month_of_year" "hour_of_day"
round(prop.table(table(df$member_casual)), 2) * 100
##
## casual member
## 41 59
round(prop.table(table(df$rideable_type)), 2) * 100
##
## classic_bike docked_bike electric_bike
## 46 3 51
round(prop.table(table(df$member_casual, df$rideable_type)), 2) * 100
##
## classic_bike docked_bike electric_bike
## casual 16 3 22
## member 30 0 29
An initial summary was run on the numeric variable
trip_duration
to determine the distribution of data across
the dataset and generated the following boxplot.
summary(df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -172.5558 0.0969 0.1714 0.3241 0.3078 689.7875
boxplot(df$trip_duration)
This showed notable outliers above the third Quartile and there should have been no values equal to or less than Zero hours. The following Histogram gave further insight of the trends regarding trip duration.
df %>%
ggplot(aes(x = trip_duration)) +
geom_histogram(bins = 500) +
labs(x = "Ride Duration in Hours",
y = NULL, title = "Histogram of Ride Duration")
Since the vast majority of rides had a ride limit under 1 hour the data was filtered to more accuratly reflect trends including any negative trip value.
boxplot(df$trip_duration[df$trip_duration <= 1 & df$trip_duration > 0])
df %>%
filter(trip_duration > 0,
trip_duration <= 1) %>%
ggplot(aes(x = trip_duration)) +
geom_histogram(bins = 10) +
labs(x = "Ride Duration in Hours",
y = NULL,
title = "Histogram of Ride Duration")
Due to the large nature of the dataset a sample was generated
representing approx. .003% of the dataset. This sample represents a 99%
Confidence Level and a 1% Margin of Error. docked_bikes
were filtered out of the dataset as they only represented 3% of
preference by Casual Riders and had a 0% representaion of Member
riders.
df <- df %>%
filter(rideable_type != "docked_bike",
trip_duration > 0,
trip_duration <= 1) %>%
sample_frac(.003)
Again running a distribution summary of the
trip_duration
variable on the sample_data
reveals that the sample is statistically significant enough to use in
place of the full population going forward with the analysis.
summary(df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002778 0.0941667 0.1625000 0.2144061 0.2816667 0.9972222
Here is the output of the same plot. The number and degree of outliers are significantly reduced, potentially eliminating the need to filter the data further.
df %>%
ggplot(aes(x = trip_duration)) +
geom_histogram(bins = 10) +
labs(x = "Ride Duration in Hours",
y = NULL, title = "Histogram of Ride Duration")
The sample data is saved in order to recall it for future observation.
output_sample_file <- "~/R/gdac_cs_1/source/2022-sample-divvy-tripdata.csv"
write.csv(df_frac, file = output_sample_file, row.names = TRUE)
Further investigation shows that the rideable_type
variable docked_bike
is populated only my “Casual” users,
also the duration of the rides mostly greatly exceed the mean ride
duration so the docked_bike
observations are filtered out
of the dataset.
unique(df$rideable_type)
## [1] "electric_bike" "classic_bike"
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = sum)
## Group.1 x
## 1 classic_bike 1752.03
## 2 electric_bike 1695.62
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = max)
## Group.1 x
## 1 classic_bike 0.9963889
## 2 electric_bike 0.9972222
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = min)
## Group.1 x
## 1 classic_bike 0.0002777778
## 2 electric_bike 0.0002777778
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = mean)
## Group.1 x
## 1 classic_bike 0.2301970
## 2 electric_bike 0.2002149
hist(df$trip_duration)
df %>%
ggplot(aes(rideable_type, fill = rideable_type)) +
geom_bar()
df %>%
ggplot(aes(rideable_type, trip_duration, fill = rideable_type)) +
geom_col()
Assesments are made comparing multiple time variables to
member_casual
and rideable_type
observations.
#Average trip duration v. user type
df %>%
group_by(member_casual) %>%
summarise(mean_trip_dur = mean(trip_duration) * 60) %>%
ggplot(aes(member_casual, mean_trip_dur, fill = member_casual)) +
geom_col() +
labs(x = "Subscriber Type",
y = "Minutes",
title = "User Type v. Mean Trip Duration in Minutes",
fill = "Subscriber Type")
# Median trip duration v. member type
df %>%
group_by(member_casual) %>%
summarise(median_trip_dur = median(trip_duration) * 60) %>%
ggplot(aes(member_casual,
median_trip_dur,
fill = member_casual)) +
geom_col() +
labs(x = "Subscriber Type",
y = "Minutes",
title = "User Type v. Median Trip Duration in Minutes",
fill = "Subscriber Type")
# Plots of resulting tables
df %>%
count(member_casual, rideable_type) %>%
group_by(rideable_type) %>%
ggplot(aes(rideable_type, n, fill = member_casual)) +
geom_col() +
labs(x = "Subscriber Type",
y = "Total Rides",
title = "Total Number of Rides in 2022 per Bicyle and Subscriber Types",
fill = "Subsciber")
df %>%
filter(!rideable_type == "docked_bike") %>%
count(member_casual, rideable_type) %>%
group_by(rideable_type) %>%
mutate(prop = n / (sum(n)) * 100) %>%
ggplot(aes(rideable_type, prop, fill = member_casual)) +
geom_col() +
labs(title = "Subscriber Preference of Bicycle Type as Percentage",
x = "Bicycle Type",
y = "Percentage",
fill = "Subscriber Type")
# Plots of resulting tables
df %>%
count(member_casual, rideable_type) %>%
group_by(member_casual) %>%
mutate(prop = n / (sum(n)) * 100) %>%
ggplot(aes(member_casual, prop, fill = rideable_type)) +
geom_col() +
labs(x = "Subscriber Type",
y = "Percentage ofTotal Rides",
title = "Percent Total # of Rides in 2022 v. Subscriber Type",
fill = "Bicycle Type")
# Plot of Total Annual Rides per Day of the Week
df %>%
count(member_casual, day_of_week) %>%
group_by(day_of_week) %>%
ggplot(aes(day_of_week, n, fill = member_casual)) +
geom_col() +
labs(x = "Day of the Week",
y = "Total Rides",
title = "Total Annual Rides per Day of the Week",
fill = "Subscriber Type")
# Plots of Proportion of Subscriber Type Usage v. Day of the Week
df %>%
count(member_casual, day_of_week) %>%
group_by(day_of_week) %>%
mutate(prop = n / (sum(n)) * 100) %>%
ggplot(aes(day_of_week, prop, fill = member_casual)) +
geom_col() +
labs(x = "Day of the Week",
y = "Percentage",
title = "Proportion of Subscriber Type Usage v. Day of the Week",
fill = "Subscriber Type")
df %>%
ggplot(aes(month_of_year, fill = member_casual)) +
geom_bar() +
labs(title = "Member Type v. Month of Year",
x = "Month of Year",
y = "Number of Rides")
df %>%
ggplot(aes(day_of_week, fill = member_casual)) +
geom_bar() +
labs(title = "Member Type v. Day of Week",
x = "Day of Week",
y = "Number of Rides")
df %>%
ggplot(aes(month_of_year, fill = rideable_type)) +
geom_bar() +
facet_wrap(~member_casual) +
labs(title = "Member Type v. Month of Year",
x = "Month of Year",
y = "Number of Rides")
df %>%
ggplot(aes(hour_of_day, fill = rideable_type)) +
geom_bar() +
facet_wrap(~member_casual) +
labs(title = "Member Type v. Hour of Day",
x = "Day of Week",
y = "Number of Rides")
The Analysis team has determined that there are clearly defined differences in the way that the two subscriber groups utilize the services of Cyclistic.
Since the goal of this project was to determine how Casual Subscribers could be encouraged to become more profitable Member subscribers key characteristics of their behaviors should be considered. To a greater degree Casual subscribers tend to be positively effected by the onset of moderate weather so a marketing campaign to existing Casual members in late Winter to early spring would be most effective as they are preparing for the time of year they are anticipating getting out and more actively utilizing Cyclistic’s services, whereas existing members appear to be less effected by weather changes as they tend to rely on the service for their daily commute.
The campaign should point out the likely cost savings of a Member subscription as Casual riders actually trend towards slightly longer ride durations. If they will save money per ride by committing to a Member subscriptions and not having to be as concerned about the length of their ride duration they may find the new arrangement more appealing and thus hold more personal value . Casual subscribers also tend to prefer the electric bikes even though they cost more per minute than the classic bikes. Offering a 30 day discount on electric bike shares for new Member subscribers would likely be an incentive to sign up for an Annual Member subscription.