The Marketing Department of Cyclistic has asked the Analytics team to deliver actionable insights into the usage characteristics of Cyclistic’s customer base to determine how to convert existing Casual users into Membership holders, who represent a more profitable customer demographic.
library(tidyverse)
library(lubridate)
The data for this project was imported from the company’s records in the form of 12 zip-compressed CSV files. These were downloaded from the server using a bash script that gathers all files pertaining to calendar year 2022, stored locally in an RStudio project folder, and managed on the team’s local server as a Git repository. Only members of the team were given access to the source data. The CSV files were made read-only to maintain the integrity of the source data. After the CSV files were reviewed, the zip files were removed from the repository.
#!/bin/bash
# Import data from source
# Source URL: https://divvy-tripdata.s3.amazonaws.com/index.html
# Set the base URL for the zip files
base_url="https://divvy-tripdata.s3.amazonaws.com/"
# Iterate through months 1 to 12
for month in {1..12}; do
  # Add leading zero if the month is single digit
  if ((month < 10)); then
    month="0${month}"
  fi
  # Construct the file URL
  file_url="${base_url}2022${month}-divvy-tripdata.zip"
  # Download the file using curl
  curl -O "$file_url"
done
The individual datasets were concatenated using the rbind function with the following script:
# Set the directory path where the CSV files are located
directory <- "~/Desktop/gdac_cs_1/source"
# Get a list of all CSV files in the directory
csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty data frame to store the concatenated data
combined_data <- data.frame()
# Loop through each CSV file and concatenate the data
for (file in csv_files) {
  # Read the CSV file
  data <- read.csv(file)
  # Append it to the running result; rbind must run inside the loop,
  # otherwise only the last file read is kept
  combined_data <- rbind(combined_data, data)
}
The complete dataset was written to the file 2022-all-divvy-tripdata.csv. The dataset was then read back in and assigned to the object df0.
df0 <- read.csv("~/R/gdac_cs_1/source/2022-all-divvy-tripdata.csv")
names(df0)
## [1] "X" "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name"
## [7] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" "end_lat"
## [13] "end_lng" "member_casual"
Upon inspection of the data, the naming conventions were found to be
acceptable for the analysis, and it was determined that there was no
obvious concern regarding bias in the population. The team chose to
create new variables to gain clearer insight into the dataset, such as
trip_duration, day_of_week, day_of_year, and month_of_year. It was
determined that observations of docked_bike needed to be cleaned out,
as this value was not pertinent to the business question.
unique(df0$member_casual)
## [1] "casual" "member"
unique(df0$rideable_type)
## [1] "electric_bike" "classic_bike" "docked_bike"
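# Work on a copy so the imported data in df0 stays untouched
# (assumption: df starts as a copy of df0; this step is not shown in the source)
df <- df0
colnames(df)[1] <- "id"
df$trip_duration <- as.numeric(difftime(df$ended_at,
                                        df$started_at,
                                        units = "hours"))
df$day_of_week <- wday(df$started_at)
df$day_of_year <- yday(df$started_at)
df$month_of_year <- month(df$started_at)
df$week_of_year <- week(df$started_at)
df$hour_of_day <- hour(df$started_at)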
names(df)
## [1] "rideable_type" "started_at" "ended_at" "start_station_name" "start_station_id" "end_station_name"
## [7] "end_station_id" "start_lat" "start_lng" "end_lat" "end_lng" "member_casual"
## [13] "trip_duration" "day_of_week" "day_of_year" "month_of_year" "hour_of_day"
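The derived time variables can be sanity-checked on a single hand-made ride. The following is a minimal sketch in base R; the two timestamps are illustrative values, not rows from the Cyclistic data, and the `format()` calls are assumed stand-ins for the lubridate helpers used above.

```r
# Illustrative timestamps only; not rows from the Cyclistic dataset.
started_at <- as.POSIXct("2022-07-04 17:45:00", tz = "UTC")
ended_at   <- as.POSIXct("2022-07-04 18:15:00", tz = "UTC")

# Trip duration in hours, matching the units used in the report.
trip_hours <- as.numeric(difftime(ended_at, started_at, units = "hours"))
print(trip_hours)

# Base-R equivalents of the lubridate helpers, for cross-checking.
hour_of_day   <- as.integer(format(started_at, "%H"))  # like hour()
month_of_year <- as.integer(format(started_at, "%m"))  # like month()
day_of_year   <- as.integer(format(started_at, "%j"))  # like yday()
print(c(hour = hour_of_day, month = month_of_year, yday = day_of_year))
```

A 30-minute ride should come out as 0.5 hours, which confirms that trip_duration is expressed in hours rather than minutes or seconds.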
round(prop.table(table(df$member_casual)), 2) * 100
##
## casual member
## 41 59
round(prop.table(table(df$rideable_type)), 2) * 100
##
## classic_bike docked_bike electric_bike
## 46 3 51
round(prop.table(table(df$member_casual, df$rideable_type)), 2) * 100
##
## classic_bike docked_bike electric_bike
## casual 16 3 22
## member 30 0 29
An initial summary was run on the numeric variable trip_duration to
determine the distribution of data across the dataset, and the
following boxplot was generated.
summary(df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -172.5558 0.0969 0.1714 0.3241 0.3078 689.7875
boxplot(df$trip_duration)
This showed notable outliers above the third quartile, as well as durations of zero hours or less, which should not occur. The following histogram gave further insight into the distribution of trip duration.
df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 500) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")
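The out-of-range durations seen in the summary can be counted before any filtering is applied. The sketch below is one possible way to do that; `duration_flags()` is a hypothetical helper and `toy_duration` is made-up data, neither of which appears in the original analysis.

```r
# Hypothetical helper: count durations (in hours) by validity bucket.
duration_flags <- function(hours, upper = 1) {
  c(
    non_positive = sum(hours <= 0),
    in_range     = sum(hours > 0 & hours <= upper),
    over_upper   = sum(hours > upper)
  )
}

# Toy data for illustration only; not taken from the Cyclistic dataset.
toy_duration <- c(-2.5, 0, 0.1, 0.25, 0.9, 1.5, 24)
flags <- duration_flags(toy_duration)
print(flags)
```

Applied to the real data as `duration_flags(df$trip_duration)`, the three counts would show how many rides a filter on the range (0, 1] hours removes at each end.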
Since the vast majority of rides lasted under 1 hour, the data was filtered to rides longer than 0 hours and no longer than 1 hour, which also removes the zero and negative trip values, to reflect trends more accurately.
boxplot(df$trip_duration[df$trip_duration <= 1 & df$trip_duration > 0])
df %>%
  filter(trip_duration > 0,
         trip_duration <= 1) %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL,
       title = "Histogram of Ride Duration")
Because the dataset is large, a random sample of approximately 0.3% of
the filtered rides was drawn. The sample size targets a 99% confidence
level and a 1% margin of error. docked_bike observations were filtered
out of the dataset because they accounted for only 3% of rides, all of
them by Casual riders, with 0% representation among Member riders.
df <- df %>%
  filter(rideable_type != "docked_bike",
         trip_duration > 0,
         trip_duration <= 1) %>%
  sample_frac(.003)
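The stated confidence level and margin of error can be related to a minimum sample size. The sketch below uses Cochran's formula for a proportion with an optional finite population correction; this is an assumed method, since the report does not say how the sample size was derived, and `required_sample_size()` is a hypothetical helper.

```r
# Hypothetical helper: minimum sample size for estimating a proportion.
# conf = confidence level, moe = margin of error, p = assumed proportion
# (0.5 is the most conservative choice), N = population size (optional).
required_sample_size <- function(conf = 0.99, moe = 0.01, p = 0.5, N = Inf) {
  z  <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / moe^2   # Cochran's formula, infinite population
  # Finite population correction, applied only when N is given
  n  <- if (is.finite(N)) n0 / (1 + (n0 - 1) / N) else n0
  ceiling(n)
}

n_needed <- required_sample_size()
print(n_needed)
```

With the defaults this comes to roughly 16,600 rows, which can be compared against `nrow(df)` after sampling. Calling `set.seed()` with a fixed value before `sample_frac()` would make the draw reproducible.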
Running a distribution summary of the trip_duration variable on the
sample again suggests that the sample is representative enough to use
in place of the full population for the rest of the analysis.
summary(df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002778 0.0941667 0.1625000 0.2144061 0.2816667 0.9972222
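One way to back up the representativeness of a sample is to put its quartiles next to those of the population it was drawn from. The sketch below does this on simulated durations; the uniform `population` vector and the seed are assumptions for illustration and do not reflect the shape of the real trip data.

```r
set.seed(42)  # arbitrary seed so the toy example is repeatable

# Simulated stand-in for the filtered population of trip durations (hours).
population <- runif(100000, min = 0, max = 1)

# Draw the same 0.3% fraction used in the report.
sample_idx <- sample(length(population),
                     size = round(0.003 * length(population)))
sample_dur <- population[sample_idx]

# Side-by-side quartiles of population and sample.
comparison <- rbind(
  population = quantile(population, c(0.25, 0.5, 0.75)),
  sample     = quantile(sample_dur, c(0.25, 0.5, 0.75))
)
print(round(comparison, 3))
```

On the real data the same comparison would use the filtered full dataset as the population and the sampled df as the sample; rows that sit close together support using the sample going forward.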
The same histogram was generated for the sample. The number and degree of outliers are substantially reduced, potentially eliminating the need to filter the data further.
df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")
The sample data is saved in order to recall it for future observation.
output_sample_file <- "~/R/gdac_cs_1/source/2022-sample-divvy-tripdata.csv"
# The sample is stored in df; df_frac is never defined in this analysis
write.csv(df, file = output_sample_file, row.names = TRUE)
Further investigation shows that the docked_bike value of the
rideable_type variable is populated only by “Casual” users, and the
durations of those rides mostly far exceed the mean ride duration, so
the docked_bike observations are filtered out of the dataset.
unique(df$rideable_type)
## [1] "electric_bike" "classic_bike"
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = sum)
## Group.1 x
## 1 classic_bike 1752.03
## 2 electric_bike 1695.62
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = max)
## Group.1 x
## 1 classic_bike 0.9963889
## 2 electric_bike 0.9972222
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = min)
## Group.1 x
## 1 classic_bike 0.0002777778
## 2 electric_bike 0.0002777778
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = mean)
## Group.1 x
## 1 classic_bike 0.2301970
## 2 electric_bike 0.2002149
hist(df$trip_duration)
df %>%
  ggplot(aes(rideable_type, fill = rideable_type)) +
  geom_bar()
df %>%
  ggplot(aes(rideable_type, trip_duration, fill = rideable_type)) +
  geom_col()
Assessments are made comparing multiple time variables to
member_casual and rideable_type observations.
# Average trip duration v. user type
df %>%
  group_by(member_casual) %>%
  summarise(mean_trip_dur = mean(trip_duration) * 60) %>%
  ggplot(aes(member_casual, mean_trip_dur, fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Mean Trip Duration in Minutes",
       fill = "Subscriber Type")
# Median trip duration v. member type
df %>%
  group_by(member_casual) %>%
  summarise(median_trip_dur = median(trip_duration) * 60) %>%
  ggplot(aes(member_casual,
             median_trip_dur,
             fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Median Trip Duration in Minutes",
       fill = "Subscriber Type")
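The comparison of time variables against user type can be extended beyond mean and median duration, for example to the day of the week. The following is a minimal base-R sketch on made-up rides; the `toy` data frame is hypothetical and only mirrors the column names used in this analysis.

```r
# Toy rides for illustration only; not taken from the Cyclistic dataset.
toy <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member", "casual"),
  day_of_week   = c(1, 7, 2, 3, 2, 7),  # lubridate::wday() default: 1 = Sunday
  trip_duration = c(0.50, 0.40, 0.15, 0.20, 0.10, 0.60)  # hours
)

# Ride counts per user type and day of week.
ride_counts <- table(toy$member_casual, toy$day_of_week)
print(ride_counts)

# Mean trip duration in minutes per user type and day of week.
mean_minutes <- aggregate(trip_duration ~ member_casual + day_of_week,
                          data = toy, FUN = function(x) mean(x) * 60)
print(mean_minutes)
```

Swapping `toy` for the sampled df would give the same two tables for the real rides, which can then be plotted with the geom_col() pattern used above.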