Ask

The Marketing Department of Cyclistic has asked the Analytics team to deliver actionable insights into how Cyclistic’s customers use the service, in order to determine how to convert existing Casual users into Membership holders, who represent a more profitable customer segment.

Setup

library(tidyverse)
library(lubridate)

Prepare

The data for this project was imported from the company’s records in the form of 12 zip-compressed CSV files. A bash script downloaded all files pertaining to calendar year 2022 from the server; they were stored locally in an RStudio project folder and managed on the team’s local server as a Git repository. Only members of the team were given access to the source data. The CSV files were made read-only to maintain the integrity of the source data. After the CSV files were reviewed, the zip files were removed from the repository.

#!/bin/bash

# Import data from source
# Source URL: https://divvy-tripdata.s3.amazonaws.com/index.html

# Set the base URL for the zip files
base_url="https://divvy-tripdata.s3.amazonaws.com/"


# Iterate through months 1 to 12
for month in {1..12}; do
    # Add leading zero if the month is single digit
    if ((month < 10)); then
        month="0${month}"
    fi

    # Construct the file URL
    file_url="${base_url}2022${month}-divvy-tripdata.zip"

    # Download the file using curl
    curl -O "$file_url"
done

The individual datasets were concatenated using the rbind function with the following script:

# Set the directory path where the CSV files are located
directory <- "~/Desktop/gdac_cs_1/source"

# Get a list of all CSV files in the directory
csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

# Initialize an empty data frame to store the concatenated data
combined_data <- data.frame()

# Loop through each CSV file and concatenate the data
for (file in csv_files) {
  # Read the CSV file
  data <- read.csv(file)

  # Append it to the combined data using rbind
  combined_data <- rbind(combined_data, data)
}
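As an alternative to growing a data frame inside a loop, the files can be read into a list and bound once. The following is a minimal sketch only: the folder `divvy_demo`, the two demo files, and their contents are hypothetical stand-ins for the real monthly CSV files.

```r
# Hypothetical demo data: two tiny CSV files in a temporary folder
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a", "b"),
                     member_casual = c("member", "casual")),
          file.path(demo_dir, "202201-demo.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "c", member_casual = "member"),
          file.path(demo_dir, "202202-demo.csv"), row.names = FALSE)

# Read every CSV into a list, then bind the list in a single call
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
demo_list <- lapply(demo_files, read.csv)
demo_combined <- do.call(rbind, demo_list)

nrow(demo_combined)  # 3 rows: 2 from the first file plus 1 from the second
```

Binding once avoids copying the accumulated data frame on every iteration, which can matter with twelve large monthly files.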

The complete dataset was written to the file 2022-all-divvy-tripdata.csv. The dataset was then read into the object df0.

df0 <- read.csv("~/R/gdac_cs_1/source/2022-all-divvy-tripdata.csv")

List of Variable Names

names(df0)
##  [1] "X"                  "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name"
##  [7] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"          "end_lat"           
## [13] "end_lng"            "member_casual"

Process

Upon inspection of the data, the naming conventions were found to be acceptable for the analysis, and no obvious concern regarding bias in the population was identified. The team created new variables to gain clearer insight into the dataset, such as trip_duration, day_of_week, day_of_year, month_of_year, week_of_year, and hour_of_day. Observations with the rideable_type value docked_bike were flagged for removal, as this value was not pertinent to the business question.

unique(df0$member_casual)
## [1] "casual" "member"
unique(df0$rideable_type)
## [1] "electric_bike" "classic_bike"  "docked_bike"
# Work on a copy so df0 keeps the data as imported
df <- df0
colnames(df)[1] <- "id"
df$trip_duration <- as.numeric(difftime(df$ended_at,
                                        df$started_at,
                                        units = "hours"))
df$day_of_week <- wday(df$started_at)
df$day_of_year <- yday(df$started_at)
df$month_of_year <- month(df$started_at)
df$week_of_year <- week(df$started_at)
df$hour_of_day <- hour(df$started_at)
names(df)
##  [1] "rideable_type"      "started_at"         "ended_at"           "start_station_name" "start_station_id"   "end_station_name"  
##  [7] "end_station_id"     "start_lat"          "start_lng"          "end_lat"            "end_lng"            "member_casual"     
## [13] "trip_duration"      "day_of_week"        "day_of_year"        "month_of_year"      "hour_of_day"
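Because read.csv returns the timestamps as text, it can be safer to parse them explicitly before deriving the time variables. A minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS" format and a UTC time zone (both are assumptions; the two demo rows are hypothetical):

```r
library(lubridate)

# Hypothetical two-row stand-in for the trip data; timestamps are text
demo <- data.frame(
  started_at = c("2022-01-13 11:59:47", "2022-07-04 23:50:00"),
  ended_at   = c("2022-01-13 12:29:47", "2022-07-05 00:05:00")
)

# Parse once; the "UTC" time zone is an assumption
demo$started_at <- ymd_hms(demo$started_at, tz = "UTC")
demo$ended_at   <- ymd_hms(demo$ended_at, tz = "UTC")

# Derive the same kind of variables from the parsed date-times
demo$trip_duration <- as.numeric(difftime(demo$ended_at,
                                          demo$started_at,
                                          units = "hours"))
demo$day_of_week <- wday(demo$started_at)
demo$hour_of_day <- hour(demo$started_at)

demo$trip_duration  # 0.50 and 0.25 hours
```

Parsing up front also surfaces malformed timestamps early, since ymd_hms returns NA with a warning for values it cannot read.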

Proportion Tables

round(prop.table(table(df$member_casual)), 2) * 100
## 
## casual member 
##     41     59
round(prop.table(table(df$rideable_type)), 2) * 100
## 
##  classic_bike   docked_bike electric_bike 
##            46             3            51
round(prop.table(table(df$member_casual, df$rideable_type)), 2) * 100
##         
##          classic_bike docked_bike electric_bike
##   casual           16           3            22
##   member           30           0            29
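The two-way table above shows joint proportions, so each cell is a share of all rides. To read bike preference within each user type, the proportions can be computed per row with margin = 1. A minimal sketch with hypothetical counts chosen only to illustrate the calculation:

```r
# Hypothetical data: 10 casual rides and 10 member rides
demo <- data.frame(
  member_casual = rep(c("casual", "member"), times = c(10, 10)),
  rideable_type = c(rep("classic_bike", 4), rep("docked_bike", 1),
                    rep("electric_bike", 5),
                    rep("classic_bike", 5), rep("electric_bike", 5))
)

# margin = 1 makes each row (user type) sum to 100
by_user <- round(prop.table(table(demo$member_casual,
                                  demo$rideable_type),
                            margin = 1), 2) * 100
by_user
```

Applied to the rounded figures above, docked bikes account for 3 of the 41 percentage points that belong to casual riders, or roughly 7% of casual rides.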

An initial summary was run on the numeric variable trip_duration to examine its distribution across the dataset, followed by a boxplot.

summary(df$trip_duration)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -172.5558    0.0969    0.1714    0.3241    0.3078  689.7875
boxplot(df$trip_duration)

This showed notable outliers above the third quartile, as well as negative values; no ride should have a duration of zero hours or less. The following histogram gave further insight into the distribution of trip duration.

df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 500) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")

Since the vast majority of rides lasted less than 1 hour, the data was filtered to rides longer than 0 hours and no longer than 1 hour to reflect trends more accurately. This also removes the zero and negative trip values.

boxplot(df$trip_duration[df$trip_duration <= 1 & df$trip_duration > 0])

df %>%
  filter(trip_duration > 0,
         trip_duration <= 1) %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL,
       title = "Histogram of Ride Duration")
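Before applying a filter like the one above, it can help to count how many rides fall into each duration band. A minimal sketch with hypothetical durations; the band labels are illustrative, and the breaks mirror the filter (greater than 0, at most 1 hour):

```r
# Hypothetical durations in hours, including invalid and extreme values
trip_duration <- c(-172.5, 0, 0.1, 0.25, 0.9, 1.5, 24)

# cut() uses right-closed intervals by default: (-Inf, 0], (0, 1], (1, Inf)
duration_band <- cut(trip_duration,
                     breaks = c(-Inf, 0, 1, Inf),
                     labels = c("zero or negative",
                                "0 to 1 hour",
                                "over 1 hour"))

# Counts and percentage share per band
table(duration_band)
round(prop.table(table(duration_band)), 2) * 100
```

The share in the "0 to 1 hour" band shows how much of the data the filter keeps.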

Sample

Because the dataset is large, a random sample of approximately 0.3% of the filtered rows was drawn. The sample is intended to support a 99% confidence level with a 1% margin of error. docked_bike rides were filtered out because they accounted for only 3% of all rides, all of them by casual riders, and 0% of member rides.

df  <- df %>%
  filter(rideable_type != "docked_bike",
         trip_duration > 0,
         trip_duration <= 1) %>%
  sample_frac(.003)
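The sample size needed for a given confidence level and margin of error can be checked with the standard formula for estimating a proportion, n = z^2 * p * (1 - p) / e^2. A minimal sketch, assuming the most conservative proportion p = 0.5 and ignoring any finite population correction (both are assumptions, not stated in the analysis):

```r
# Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2
confidence <- 0.99
margin_of_error <- 0.01
p <- 0.5  # assumption: most conservative proportion

# Two-sided critical value of the standard normal, about 2.576
z <- qnorm(1 - (1 - confidence) / 2)

n_required <- ceiling(z^2 * p * (1 - p) / margin_of_error^2)
n_required  # 16588
```

A fraction of 0.003 reaches this size when the filtered data has roughly 5.5 million rows. Calling set.seed() with a fixed value before sample_frac() would make the draw reproducible.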

Running the distribution summary of trip_duration again on the sample shows a median of 0.16 hours, close to the 0.17 hours of the full dataset, which suggests the sample is representative enough to use in place of the full population going forward with the analysis.

summary(df$trip_duration)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0002778 0.0941667 0.1625000 0.2144061 0.2816667 0.9972222
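One way to check how well a sample tracks its population is to compare quantiles side by side. A minimal sketch on simulated data: the beta-distributed population, the seed, and the sample size of 300 are all hypothetical and are not the Cyclistic data.

```r
set.seed(2022)  # arbitrary seed so the illustration is repeatable

# Hypothetical population of durations between 0 and 1 hour
population <- rbeta(100000, shape1 = 2, shape2 = 8)

# Draw 0.3% of the population (300 of 100,000) without replacement
sampled <- population[sample(length(population), size = 300)]

# Quartiles of the population and the sample, one row each
probs <- c(0.25, 0.5, 0.75)
comparison <- rbind(population = quantile(population, probs),
                    sample = quantile(sampled, probs))
round(comparison, 3)
```

Rows that agree closely suggest the sample reproduces the shape of the population; large gaps would argue for a bigger sample.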

The same histogram, drawn from the sample, is shown below. The number and magnitude of outliers are greatly reduced, potentially eliminating the need to filter the data further.

df %>%
  ggplot(aes(x = trip_duration)) +
  geom_histogram(bins = 10) +
  labs(x = "Ride Duration in Hours",
       y = NULL, title = "Histogram of Ride Duration")

Export

The sample data is saved so that it can be reloaded for future analysis.

output_sample_file <- "~/R/gdac_cs_1/source/2022-sample-divvy-tripdata.csv"
write.csv(df, file = output_sample_file, row.names = TRUE)

Analysis

Further investigation shows that the rideable_type value docked_bike occurs only for casual users, and the duration of most of those rides greatly exceeds the mean ride duration, so the docked_bike observations were filtered out of the dataset.

unique(df$rideable_type)
## [1] "electric_bike" "classic_bike"
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = sum)
##         Group.1       x
## 1  classic_bike 1752.03
## 2 electric_bike 1695.62
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = max)
##         Group.1         x
## 1  classic_bike 0.9963889
## 2 electric_bike 0.9972222
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = min)
##         Group.1            x
## 1  classic_bike 0.0002777778
## 2 electric_bike 0.0002777778
aggregate(df$trip_duration, by = list(df$rideable_type), FUN = mean)
##         Group.1         x
## 1  classic_bike 0.2301970
## 2 electric_bike 0.2002149
hist(df$trip_duration)
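The per-type statistics above can also be gathered in a single table with dplyr. A minimal sketch on a hypothetical four-row data frame; the summary column names (rides, total_hours, and so on) are illustrative:

```r
library(dplyr)

# Hypothetical stand-in for the sampled trip data
demo <- data.frame(
  rideable_type = c("classic_bike", "classic_bike",
                    "electric_bike", "electric_bike"),
  trip_duration = c(0.2, 0.4, 0.1, 0.3)
)

# One row per bike type with count, sum, min, mean, and max
demo_stats <- demo %>%
  group_by(rideable_type) %>%
  summarise(rides = n(),
            total_hours = sum(trip_duration),
            min_hours = min(trip_duration),
            mean_hours = mean(trip_duration),
            max_hours = max(trip_duration))
demo_stats
```

Including the ride count next to the totals makes it easier to tell whether a larger sum reflects more rides or longer rides.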

df %>%
  ggplot(aes(rideable_type, fill = rideable_type)) +
  geom_bar()

df %>%
  ggplot(aes(rideable_type, trip_duration, fill =  rideable_type)) +
  geom_col()

Visualization

Assessments are made by comparing multiple time variables against the member_casual and rideable_type observations.

# Mean trip duration v. user type
df %>%
  group_by(member_casual) %>%
  summarise(mean_trip_dur = mean(trip_duration) * 60) %>%
  ggplot(aes(member_casual, mean_trip_dur, fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Mean Trip Duration in Minutes",
       fill = "Subscriber Type")

# Median trip duration v. member type
df %>%
  group_by(member_casual) %>%
  summarise(median_trip_dur = median(trip_duration) * 60) %>%
  ggplot(aes(member_casual,
             median_trip_dur,
             fill = member_casual)) +
  geom_col() +
  labs(x = "Subscriber Type",
       y = "Minutes",
       title = "User Type v. Median Trip Duration in Minutes",
       fill = "Subscriber Type")
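The same approach extends to the other time variables. A minimal sketch of ride counts by day of week and user type, using a hypothetical six-row data frame and assuming lubridate's default numbering for day_of_week (1 = Sunday):

```r
library(ggplot2)

# Hypothetical stand-in for the sampled trip data
demo <- data.frame(
  member_casual = c("casual", "casual", "member",
                    "member", "member", "casual"),
  day_of_week   = c(1, 7, 2, 3, 2, 7)
)

# Label the day numbers; assumes 1 = Sunday
demo$day_of_week <- factor(demo$day_of_week, levels = 1:7,
                           labels = c("Sun", "Mon", "Tue", "Wed",
                                      "Thu", "Fri", "Sat"))

# Side-by-side bars of ride counts per day, split by user type
day_plot <- ggplot(demo, aes(day_of_week, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(drop = FALSE) +  # keep days with no rides on the axis
  labs(x = "Day of Week",
       y = "Rides",
       title = "Rides by Day of Week and User Type",
       fill = "Subscriber Type")
day_plot
```

Swapping day_of_week for hour_of_day or month_of_year would give the corresponding view for those variables.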