Data Analytics Case Study: Cyclistic

Avatar of Chen Jen-Shuo.
Avatar of Chen Jen-Shuo.

Data Analytics Case Study: Cyclistic

Music / Data
Taipei City, Taiwan

Case Study: Cyclistic (Tool: SQL, Tableau)

Kaggle好讀版連結: https://www.kaggle.com/code/jenshuochen/case-study-cyclistic

About

This is a mock case study of a virtual bike-share company, Cyclistic.

I will be following the data analytics process outlined in the Google Data Analytics Certificate Program:

Ask → Prepare → Process → Analyze → Share → Act

The tools I will use for this analysis are:

  1. Excel: To prepare and clean the data.
  2. SQL (Google BigQuery): For data cleaning and analysis.
  3. Tableau: To create data visualizations.

In the end, I will provide answers to specific questions and offer recommendations based on the insights gained from the data analysis.

Scenario

Cyclistic is a Chicago based bike-share company with 2 types of customers.

Customers with a single-ride passes or full-day passes are considered casual customer.

Customers with annual memberships are considered Cyclistic member.

Now, the company has found that annual membership is more profitable than casual customers. They are seeking to understand the differences between these two types of customers and are working on developing a new marketing strategy to convert casual customers into annual members.

Ask

Problem

There are 3 main questions to lead the marketing strategy:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

My supervisor, Moreno, has assigned me the first question. Through these three questions, we can identify the profile of each type of customer and understand how a casual rider transitions to an annual membership. This information will assist in refining the marketing strategy to target the goal more accurately.

Business task

Identify how the casual riders and annual members use the Cyclistic service differently.

Prepare

Data sources

The dataset is in this link. The data has been made available by Motivate International Inc. under this license.

First, I download the data from the past 12 months.

In [1]:

202212-divvy-tripdata.csv
202301-divvy-tripdata.csv
202302-divvy-tripdata.csv
202303-divvy-tripdata.csv
202304-divvy-tripdata.csv
202305-divvy-tripdata.csv
202306-divvy-tripdata.csv
202307-divvy-tripdata.csv
202308-divvy-tripdata.csv
202309-divvy-tripdata.csv
202310-divvy-tripdata.csv
202311-divvy-tripdata.csv

The datasets are organized in rows and columns, each row represents a trip. In the column, we have:

* ride_id               
* rideable_type         
* started_at            #day and time
* ended_at              #day and time
* start_station_name    
* start_station_id      
* end_station_name      
* end_station_id        
* start_lat             #Trip start latitude  
* start_lng             #Trip start longitute   
* end_lat               #Trip end latitude  
* end_lat               #Trip end longitute   
* member_casual         #Rider type

Credibility

ROCCC test

  1. Reliable: This is a public data with complete, accurate, and unbiased information.
  2. Original: The data is made from Cyclistic’s bike activities.
  3. Comprehensive: The data includes a large amount of activities from different types of customers.
  4. Current: The data spans the past 12 months, making it current and relevant to our business task.
  5. Cited: The data is public available and provided by Cyclistic.


In Excel

First, I copy the original files to a subfolder for safety. Then, I import the .csv files into Excel and save them as .XLS for cleaning purposes.

For each .XLS file, I follow these steps to clean the data:

Remove Duplicates

  • Click the "Remove Duplicates" button to ensure there are no duplicates in the data.

Add Ride Length Column

  • Use the formula = ended_at - started_at to calculate the ride length of each ride.
  • Format the result in h:mm:ss.

Create Ride Date & Ride Time Columns

  • Utilize =INT and =MOD functions to extract the date and time.

Create Ride Year Column

  • Apply the =YEAR() function and format it as general to extract the year.

Create Ride Month Column

  • Use =TEXT(target, "mm") and format it as a number.

Create Ended Date & Ended Time Columns

  • Similar to ride_date and time, use INT and MOD to extract the date and time.

Create Weekday Column

  • Implement the =WEEKDAY() function to extract the weekday, then format the weekday as a number (Sunday = 1 to Saturday).
Repeat the process for each .XLS file and save it as a new .csv file.
Additionally, I break down the large .XLS files into smaller files to comply with the BigQuery 100Mb upload limit.


Process

Cleaning in BigQuery

1. Union to combine the files into a table

SELECT ride_id, rideable_type, started_at, ended_at, week_day, ride_length, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual
FROM loyal-lore-407211.Cyclist_dataset.trip202305
UNION ALL
SELECT ride_id, rideable_type, started_at, ended_at, week_day, ride_length,start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual
FROM loyal-lore-407211.Cyclist_dataset.trip202306

-- Then repeat the process for each table

Then export the one full table as “trip_combined”.

2. Use DISTINCT to check the STRING columns

See if there’s duplicate or misspelling.

SELECT DISTINCT [the columns]
FROM loyal-lore-407211.Cyclist_dataset.trip_combined

In the end_station_id column, there's an unusual entry named "chargingstx07," so I decided to investigate.

SELECT DISTINCT end_station_id
FROM loyal-lore-407211.Cyclist_dataset.trip_combined

SELECT end_station_name, end_station_id
FROM loyal-lore-407211.Cyclist_dataset.trip_combined
WHERE end_station_id = "chargingstx07"

However, it appears that this is not an incorrect input. The other columns seem to be in good order.

3. Check if there’s NULL

SELECT
 *
FROM
 loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns
WHERE
 start_station_name IS NULL

About 150,000 start_station_name, and about 130,000 end_station_name are NULL.

Also, there are 838 end_lat, and end_lng are NULL.

4. Create ride_length, day_of_week, month column in SQL

Even though I have already done this step in Excel, but here’s how to do it in BigQuery.

SELECT ride_id,
            rideable_type,
            started_at,
            ended_at,
            ROUND(TIMESTAMP_DIFF(ended_at, started_at, second)/60, 1) AS ride_length_minutes,
            EXTRACT(DAYOFWEEK FROM started_at) AS day_of_week,
            EXTRACT(MONTH FROM started_at) AS month,
            start_station_name,
            start_station_id,
            end_station_name,
            end_station_id,
            start_lat,
            start_lng,
            end_lat,
            end_lng,
            member_casual
FROM loyal-lore-407211.Cyclist_dataset.trip_combined


Analyze

Ride length


AVG, MAX, MIN of ride_length_minute

First I want to know the average, max, min of ride_length_minute.

SELECT 
 AVG(ride_length_minutes) AS avg_ride,
 MIN(ride_length_minutes) AS min_ride,
 MAX(ride_length_minutes) AS max_ride
FROM
 loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns

And here’s the return:

avg_ridemin_ridemax_ride15.466-16657.029210.0

It is unusual that the min_ride is negative, prompting me to run the following query to determine how many rides have negative values.

SELECT
 COUNT(ride_length_minutes)
FROM
 loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns
WHERE
 ride_length_minutes<0

There are 73 rides in negative ride length, I believe these are errors. Therefore, I am going to filter out those rides and export the results as a new table.

SELECT *
 FROM loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns
 WHERE ride_length_minutes > 0

Now let’s calculate the average, min, and max again.

avg_ridemin_ridemax_ride15.7331.029210.0


Mode of day_of_week

Now I want to know the mode(the most commonly occurring value) in day_of_week.

(1=Sunday, 7=Saturday)

Mode of day_of_week member

SELECT
 day_of_week,
 COUNT(*) AS count
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
GROUP BY
 day_of_week
ORDER BY
 count DESC

Rowday_of_weekcount15158419241562753715404643147411561436656113382972133270

Mode is 5, which is Thursday.


Average ride length of casual riders and members

  • Average ride length for casual riders:
SELECT
 AVG(ride_length_minutes) AS avg
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = "casual"

AVG = 22.97 minutes

  • Average ride length for members:
SELECT
 AVG(ride_length_minutes) AS avg
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = "member"

AVG = 12.25 minutes


Average ride length by day of week

1=Sunday, 7=Saturday

SELECT
 day_of_week,
 AVG(ride_length_minutes) AS avg
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
GROUP BY
 day_of_week
ORDER BY
 day_of_week

day_of_weekavg119.709233424743559214.511180310647507313.599331121829396413.612337226043794514.289857908457964615.545588695924604719.190579437310

The highest average ride length is on Sunday.

The lowest average ride length is on Tuesday.


Average length for casual riders by day of week

SELECT
 day_of_week,
 AVG(ride_length_minutes) AS avg
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual='casual'
GROUP BY
 day_of_week
ORDER BY
 day_of_week

day_of_weekavg127.419156795754471221.954451885243767319.143364217669383418.74479269011594520.635555857016939622.647593729560946726.266489027274


Average length for member by day of week

SELECT
 day_of_week,
 AVG(ride_length_minutes) AS avg
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual='member'
GROUP BY
 day_of_week
ORDER BY
 day_of_week

day_of_weekavg113.872636802856835211.500943167567685311.652242062037111411.80421069027282511.831990542078943612.048957078156796713.7726147356838


Number of rides

Number of rides for members and casual riders

SELECT
 member_casual,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
GROUP BY
 member_casual

Member = 693422

Casual = 333493


Number of ride for casual rider in day of week

SELECT
 day_of_week,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = 'casual'
GROUP BY
 day_of_week
ORDER BY
 day_of_week

day_of_weekride_number157661238377338315440712544229647397766802


Number of ride for member in day of week

SELECT
 day_of_week,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = 'member'
GROUP BY
 day_of_week
ORDER BY
 day_of_week

day_of_weekride_number176168294893310909641155635114190696268787244


Number of ride for member in hours of the day

Then I want to know how 2 kinds of riders different in each hour of the day. To avoid confusion, I turn the day_of_week number to actual day of week.

SELECT CASE
            WHEN day_of_week = 1 THEN 'Sunday'
            WHEN day_of_week = 2 THEN 'Monday'
            WHEN day_of_week = 3 THEN 'Tuesday'
            WHEN day_of_week = 4 THEN 'Wednesday'
            WHEN day_of_week = 5 THEN 'Thursday'
            WHEN day_of_week = 6 THEN 'Friday'
            WHEN day_of_week = 7 THEN 'Saturday' END AS day_of_the_week,
            EXTRACT(hour FROM started_at) AS hour,
            COUNT(*) AS rides_taken
FROM loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE member_casual = 'member'
GROUP BY day_of_week, hour
ORDER BY day_of_week, hour

Here’s a part of the return

day_of_the_weekhourrides_takenSunday01735Sunday11262Sunday2762Sunday3411Sunday4304Sunday5356Sunday6950Sunday71394Sunday82370Sunday93766Sunday105046


Number of ride for casual rider in hours of the day

SELECT CASE
            WHEN day_of_week = 1 THEN 'Sunday'
            WHEN day_of_week = 2 THEN 'Monday'
            WHEN day_of_week = 3 THEN 'Tuesday'
            WHEN day_of_week = 4 THEN 'Wednesday'
            WHEN day_of_week = 5 THEN 'Thursday'
            WHEN day_of_week = 6 THEN 'Friday'
            WHEN day_of_week = 7 THEN 'Saturday' END AS day_of_the_week,
            EXTRACT(hour FROM started_at) AS hour,
            COUNT(*) AS rides_taken
FROM loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE member_casual = 'casual'
GROUP BY day_of_week, hour
ORDER BY day_of_week, hour

Part of the return

day_of_the_weekhourrides_takenSunday11310Sunday2839Sunday3443Sunday4317Sunday5285Sunday6501Sunday7882Sunday81451Sunday92576Sunday103823Sunday114924Sunday124960


Rideable type

Then I want to see how 2 types of rider choose the bikes differently.

Rideable type for casual rider

First let’s see how the casual riders choose the bike type.

SELECT
 rideable_type,
 COUNT(rideable_type) AS bike_type
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = 'casual'
GROUP BY
 rideable_type

rideable_typebike_typeclassic_bike163567electric_bike168251docked_bike1675


Rideable type for member

Then let’s see how the members choose the bike type.

SELECT
 rideable_type,
 COUNT(rideable_type) AS bike_type
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 member_casual = 'member'
GROUP BY
 rideable_type

rideable_typebike_typeclassic_bike367533electric_bike325889


Locations

Most 100 popular start stations for casual riders

Include the latitude and longitude of the start station

SELECT
 start_station_name,
 start_lat,
 start_lng,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 start_station_name !='null'
 AND member_casual = 'casual'
GROUP BY
 start_station_name, start_lat, start_lng
ORDER BY
 ride_number DESC
LIMIT
 100

The most popular start station for casual rider is: Streeter Dr & Grand Ave


Most popular start station for member

SELECT
 start_station_name,
 start_lat,
 start_lng,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 start_station_name !='null'
 AND member_casual = 'member'
GROUP BY
 start_station_name, start_lat, start_lng
ORDER BY
 ride_number DESC
LIMIT
 30

The most popular start station for casual rider is: Ellis Ave & 60th St.


Most popular end station for casual rider

SELECT
 end_station_name,
 end_lat,
 end_lng,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 start_station_name !='null'
 AND member_casual = 'casual'
GROUP BY
 end_station_name, end_lat, end_lng
ORDER BY
 ride_number DESC
LIMIT
 30


Most popular end station for member

SELECT
 end_station_name,
 end_lat,
 end_lng,
 COUNT(*) AS ride_number
FROM
 loyal-lore-407211.Cyclist_dataset.removed_errors
WHERE
 start_station_name !='null'
 AND member_casual = 'member'
GROUP BY
 end_station_name, end_lat, end_lng
ORDER BY
 ride_number DESC
LIMIT
 30


Most total visited stations by member

Then I want to know where is the most visited station by members.

So I use JOIN to combine the results of 2 tables: “member start station count” and “member end station count”, to calculate the total visit

SELECT start_station_name AS station_name,
start_station_count_member.ride_number +
end_station_count_member.ride_number AS total_visit
FROM
 loyal-lore-407211.Cyclist_dataset.start_station_count_member
 JOIN loyal-lore-407211.Cyclist_dataset.end_station_count_member
 ON start_station_name = end_station_name
ORDER BY
 total_visit DESC


Most total visited stations by casual riders

Same thing applies to casual riders.

SELECT start_station_name AS station_name,
start_station_count_casual.ride_number +
end_station_count_casual.ride_number AS total_visit
FROM
 loyal-lore-407211.Cyclist_dataset.start_station_count_casual
 JOIN loyal-lore-407211.Cyclist_dataset.end_station_count_casual
 ON start_station_name = end_station_name
ORDER BY
 total_visit DESC


Share

After running the statistic queries above, I saved and exported the result as .csv file, and import into Tableau.

https://public.tableau.com/shared/SGK8HKRHS?:display_count=n&:origin=viz_share_link


Average ride length in day of week

Average Ride Length

  • Casual riders, on average, ride longer than members.

Ride Lengths by Day of the Week

  • Casual riders tend to have longer rides on weekends and shorter rides on weekdays. Specifically, Wednesday has the lowest average ride length for casual riders.
  • Members don't exhibit significant differences in ride length between weekends and weekdays.


Number of ride in day of week

Number of Rides

  • Interestingly, despite casual riders having longer rides than members, members boast a significantly higher number of rides.
  • This discrepancy may be attributed to the fact that members pay a one-time fee for unlimited bike usage, while casual riders incur charges per ride, resulting in a lower ride count despite longer ride lengths.

Number of Rides by Day of the Week

  • Members demonstrate a higher ride frequency on weekdays, reaching a peak on Wednesdays, and a lower number of rides on weekends.
  • Conversely, casual riders exhibit a lower number of rides on weekdays and a higher count on weekends.
  • Suggestion: We can assume that members primarily use the rides for commuting, whereas casual riders opt for recreational rides, particularly on holidays.


Number of ride in hour

Weekend

  • On weekends, both members and casual riders exhibit the same pattern.
  • The peak ride number occurs from 10:00 to 18:00, aligning with the time when people typically engage in outdoor activities the most. 

Weekdays

  • Interestingly, there's a notable difference on weekdays for members. They show a distinct peak in bike usage at 7-8 am and another peak at 5 pm.
  • Both of these peak times coincide with rush hours on weekdays when people are commuting to and leaving the office. This observation further supports the assumption that members primarily use the bikes for commuting purposes.


Bike choice

  • Members show a preference for classic bikes over electric bikes.
  • Casual riders, on the other hand, slightly favor electric bikes over classic bikes.
  • Integrating this information with the average ride length, it appears that members use bikes for shorter distances, providing little incentive to choose electric bikes.
  • Conversely, given the assumption that casual riders engage in longer and recreational bike rides, this might explain their slight preference for electric bikes.


Location

No Clear Station Relationship

  • There is no overlap of stations between casual riders and members.

Near the Water

  • According to the map, casual riders tend to use bikes near the water.
  • The most popular end and start stations, such as Streeter Dr. & Grand Ave., DuSable Lake Shore Dr. & North Blvd, are stations near the water.
  • Members use bikes further inland, with the most popular stations for members located farther from the water than those for casual riders.
  • This observation further supports the assumption that casual riders prefer leisurely bike use, while members tend to use bikes for commuting purposes.


Act

Key Takeaways

After analyzing the disparities in activities between members and casual riders, the following key takeaways have emerged, effectively addressing the primary question of this analysis:

How do annual members and casual riders use Cyclistic bikes differently?

  1. Members predominantly use bikes on weekdays, indicating a focus on commuting to and from work.
  2. Members tend to have shorter ride lengths but significantly higher ride numbers, suggesting a preference for short-distance commuting.
  3. Given members' emphasis on shorter commutes, the choice of bike type appears to be less critical.
  4. Members show lower usage of bikes in recreational or leisure areas compared to casual riders.

Recommendations

The identified characteristics of our target customers, those with a higher likelihood of becoming members, suggest the following recommendations for the marketing team:

  1. Concentrate marketing efforts on individuals with short-distance, regular, and consistent commuting needs.
  2. Target residents rather than tourists, focusing on individuals living in the city.
  3. Prioritize areas where bike stations are within a 13-minute biking radius of work or residence.
  4. Direct marketing strategies toward business areas of the city rather than leisure spots, such as locations near the water.


這是一個芝加哥虛擬腳踏車租借公司Cyclistic的Case study。 內容包含我使用Excel、SQL來清理、分析腳踏車使用資料,並在最後用Tableau呈現整個Data story,並作出分析建議。
Avatar of the user.
Please login to comment.

Published: Feb 2nd 2024
27
6
0

Tools

mysql
MySQL

case study
data science
data cleaning
data analytics

Share