Ask

The Marketing Department of Cyclistic has asked the Analytics team to deliver actionable insights into the usage characteristics of Cyclistic’s customer base to determine how to convert existing Casual users into Membership holders, which represents a more profitable customer demographic.

Setup

library(tidyverse)
library(lubridate)

Prepare

The data for this product was imported from the company’s records in the form of 12 zip compressed CSV files. These were imported from the server using a bash script to gather all files pertaining to calendar year 2022 and stored locally in an rstudio project folder and managed on the teams local server as a Git repository. Only members of the team were given access to the source data. The CSV were created as read-only to maintain the integrity of the source data. After the csv files were reviewed the Zip files were removed from the repository.

#Import Data from source
#Source URL: https://divvy-tripdata.s3.amazonaws.com/index.html

#!/bin/bash

# Set the base URL for the zip files
base_url="https://divvy-tripdata.s3.amazonaws.com/"


# Iterate through months 1 to 12
for month in {1..12}; do
    # Add leading zero if the month is single digit
    if ((month < 10)); then
        month="0${month}"
    fi

    # Construct the file URL
    file_url="${base_url}2022${month}-divvy-tripdata.zip"

    # Download the file using curl
    curl -O "$file_url"
done

The individual datasets were concatenated using the rbind function with the following script:

# Set the directory path where the CSV files are located
directory <- "~/Desktop/gdac_cs_1/source"

# Get a list of all CSV files in the directory
csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

# Initialize an empty data frame to store the concatenated data
combined_data <- data.frame()

# Loop through each CSV file and concatenate the data
for (file in csv_files) {
  # Read the CSV file
  data <- read.csv(file)
}

# Concatenate the data using rbind
combined_data <- rbind(combined_data, data)

The complete dataset was was wriiten to the file 2202-all-divvy-tripdata.csv. The dataset was then assigned to the object df.

df0 <- read.csv("~/R/gdac_cs_1/source/2022-all-divvy-tripdata.csv")

List of Variable Names

names(df0)
##  [1] "X"                  "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name"
##  [7] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"            "member_casual"

Process

Upon inspection of the data the naming conventions were found to be acceptable for the analysis and it was determined that there was no obvious concern regarding bias in the population. The team chose to create new variables to gain clearer insight into the dataset such as trip_duration, day_of_week, day_of_year, and month_of_year. It was determined that observations of docked_bike were in need of cleaning as this value was not pertinent to the business question.

unique(df0$member_casual)
## [1] "casual" "member"
unique(df0$rideable_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"
colnames(df)[1] <- "id"
df$trip_duration <- as.numeric(difftime(df$ended_at,
                                        df$started_at,
                                        units = "hours"))
df$day_of_week <- wday(df$started_at)
df$day_of_year <- yday(df$started_at)
df$month_of_year <- month(df$started_at)
df$week_of_year <- week(df$started_at)
df$hour_of_day <- hour(df$started_at)
names(df)
##  [1] "rideable_type"      "started_at"         "ended_at"           "start_station_name" "start_station_id"   "end_station_name"  
##  [7] "end_station_id"     "start_lat"          "start_lng"          "end_lat"            "end_lng"            "member_casual"     
## [13] "trip_duration"      "day_of_week"        "day_of_year"        "month_of_year"      "hour_of_day"

Proportion Tables

round(prop.table(table(df$member_casual)), 2) * 100
## 
## casual member 
##     41     59
round(prop.table(table(df$rideable_type)), 2) * 100
## 
##  classic_bike   docked_bike electric_bike 
##            46             3            51
round(prop.table(table(df$member_casual, df$rideable_type)), 2) * 100
##         
##          classic_bike docked_bike electric_bike
##   casual           16           3            22
##   member           30           0            29

An initial summary was run on the numeric variable trip_duration to determine the distribution of data across the dataset and generated the following boxplot.

summary(df$trip_duration)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -172.5558    0.0969    0.1714    0.3241    0.3078  689.7875
boxplot(df$trip_duration)

This showed notable outliers above the third Quartile and there should have been no values equal to or less than Zero hours. The following Histogram gave further insight of the trends regarding trip duration.

df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 500) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")

Since the vast majority of rides had a ride limit under 1 hour the data was filtered to more accuratly reflect trends including any negative trip value.

boxplot(df$trip_duration[df$trip_duration <= 1 & df$trip_duration > 0])

df %>%
  filter(trip_duration > 0,
         trip_duration <= 1) %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL,
       title = "Histogram of Ride Duration")

Sample

Due to the large nature of the dataset a sample was generated representing approx. .003% of the dataset. This sample represents a 99% Confidence Level and a 1% Margin of Error. docked_bikes were filtered out of the dataset as they only represented 3% of preference by Casual Riders and had a 0% representaion of Member riders.

df  <- df %>%
  filter(rideable_type != "docked_bike",
         trip_duration > 0,
         trip_duration <= 1) %>%
  sample_frac(.003)

Again running a distribution summary of the trip_duration variable on the sample_data reveals that the sample is statistically significant enough to use in place of the full population going forward with the analysis.

summary(df$trip_duration)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0002778 0.0941667 0.1625000 0.2144061 0.2816667 0.9972222

Here is the output of the same plot. The number and degree of outliers are significantly reduced, potentially eliminating the need to filter the data further.

df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")

Export

The sample data is saved in order to recall it for future observation.

output_sample_file <- "~/R/gdac_cs_1/source/2022-sample-divvy-tripdata.csv"
write.csv(df_frac, file = output_sample_file, row.names = TRUE)

Analysis

Further investigation shows that the rideable_type variable docked_bike is populated only my “Casual” users, also the duration of the rides mostly greatly exceed the mean ride duration so the docked_bike observations are filtered out of the dataset.

unique(df$rideable_type)
## [1] "electric_bike" "classic_bike"
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = sum)
##         Group.1       x
## 1  classic_bike 1752.03
## 2 electric_bike 1695.62
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = max)
##         Group.1         x
## 1  classic_bike 0.9963889
## 2 electric_bike 0.9972222
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = min)
##         Group.1            x
## 1  classic_bike 0.0002777778
## 2 electric_bike 0.0002777778
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = mean)
##         Group.1         x
## 1  classic_bike 0.2301970
## 2 electric_bike 0.2002149
hist(df$trip_duration)

df %>%
  ggplot(aes(rideable_type, fill = rideable_type)) +
  geom_bar()

df %>%
  ggplot(aes(rideable_type, trip_duration, fill =  rideable_type)) +
  geom_col()

Visualization

Assesments are made comparing multiple time variables to member_casual and rideable_type observations.

#Average trip duration v. user type
df %>%
  group_by(member_casual) %>%
  summarise(mean_trip_dur = mean(trip_duration) * 60) %>%
  ggplot(aes(member_casual, mean_trip_dur, fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Mean Trip Duration in Minutes",
       fill = "Subscriber Type")

# Median trip duration v. member type
df %>%
  group_by(member_casual) %>%
  summarise(median_trip_dur = median(trip_duration) * 60) %>%
  ggplot(aes(member_casual,
             median_trip_dur,
             fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Median Trip Duration in Minutes",
       fill = "Subscriber Type")

# Plots of resulting tables
df %>%
  count(member_casual, rideable_type) %>%
  group_by(rideable_type) %>%
  ggplot(aes(rideable_type, n, fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Total Rides",
       title = "Total Number of Rides in 2022 per Bicyle and Subscriber Types",
       fill = "Subsciber")

df %>%
  filter(!rideable_type == "docked_bike") %>%
  count(member_casual, rideable_type) %>%
  group_by(rideable_type) %>%
  mutate(prop = n / (sum(n)) * 100) %>%
  ggplot(aes(rideable_type, prop, fill = member_casual)) +
  geom_col() +
  labs(title = "Subscriber Preference of Bicycle Type as Percentage",
       x = "Bicycle Type",
       y = "Percentage",
       fill = "Subscriber Type")

# Plots of resulting tables
df %>%
  count(member_casual, rideable_type) %>%
  group_by(member_casual) %>%
  mutate(prop = n / (sum(n)) * 100) %>%
  ggplot(aes(member_casual, prop, fill = rideable_type)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Percentage ofTotal Rides",
       title = "Percent Total # of Rides in 2022 v. Subscriber Type",
       fill = "Bicycle Type")

# Plot of Total Annual Rides per Day of the Week
df %>%
  count(member_casual, day_of_week) %>%
  group_by(day_of_week) %>%
  ggplot(aes(day_of_week, n, fill = member_casual)) +
  geom_col() +
  labs(x = "Day of the Week",
       y = "Total Rides",
       title = "Total Annual Rides per Day of the Week",
       fill = "Subscriber Type")

# Plots of Proportion of Subscriber Type Usage v. Day of the Week
df %>%
  count(member_casual, day_of_week) %>%
  group_by(day_of_week) %>%
  mutate(prop = n / (sum(n)) * 100) %>%
  ggplot(aes(day_of_week, prop, fill = member_casual)) +
  geom_col() +
  labs(x = "Day of the Week",
       y = "Percentage",
       title = "Proportion of Subscriber Type Usage v. Day of the Week",
       fill = "Subscriber Type")

df %>%
  ggplot(aes(month_of_year, fill = member_casual)) +
  geom_bar() +
  labs(title = "Member Type v. Month of Year",
       x = "Month of Year",
       y = "Number of Rides")

df %>%
  ggplot(aes(day_of_week, fill = member_casual)) +
  geom_bar() +
  labs(title = "Member Type v. Day of Week",
       x = "Day of Week",
       y = "Number of Rides")

df %>%
  ggplot(aes(month_of_year, fill = rideable_type)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  labs(title = "Member Type v. Month of Year",
       x = "Month of Year",
       y = "Number of Rides")

df %>%
  ggplot(aes(hour_of_day, fill = rideable_type)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  labs(title = "Member Type v. Hour of Day",
       x = "Day of Week",
       y = "Number of Rides")

Conclusion

The Analysis team has determined that there are clearly defined differences in the way that the two subscriber groups utilize the services of Cyclistic.

Since the goal of this project was to determine how Casual Subscribers could be encouraged to become more profitable Member subscribers key characteristics of their behaviors should be considered. To a greater degree Casual subscribers tend to be positively effected by the onset of moderate weather so a marketing campaign to existing Casual members in late Winter to early spring would be most effective as they are preparing for the time of year they are anticipating getting out and more actively utilizing Cyclistic’s services, whereas existing members appear to be less effected by weather changes as they tend to rely on the service for their daily commute.

The campaign should point out the likely cost savings of a Member subscription as Casual riders actually trend towards slightly longer ride durations. If they will save money per ride by committing to a Member subscriptions and not having to be as concerned about the length of their ride duration they may find the new arrangement more appealing and thus hold more personal value . Casual subscribers also tend to prefer the electric bikes even though they cost more per minute than the classic bikes. Offering a 30 day discount on electric bike shares for new Member subscribers would likely be an incentive to sign up for an Annual Member subscription.