Kaggle好讀版連結: https://www.kaggle.com/code/jenshuochen/case-study-cyclistic
This is a mock case study of a virtual bike-share company, Cyclistic.
I will be following the data analytics process outlined in the Google Data Analytics Certificate Program:
Ask → Prepare → Process → Analyze → Share → Act
The tools I will use for this analysis are:
In the end, I will provide answers to specific questions and offer recommendations based on the insights gained from the data analysis.
Cyclistic is a Chicago based bike-share company with 2 types of customers.
Customers with a single-ride passes or full-day passes are considered casual customer.
Customers with annual memberships are considered Cyclistic member.
Now, the company has found that annual membership is more profitable than casual customers. They are seeking to understand the differences between these two types of customers and are working on developing a new marketing strategy to convert casual customers into annual members.
There are 3 main questions to lead the marketing strategy:
My supervisor, Moreno, has assigned me the first question. Through these three questions, we can identify the profile of each type of customer and understand how a casual rider transitions to an annual membership. This information will assist in refining the marketing strategy to target the goal more accurately.
Identify how the casual riders and annual members use the Cyclistic service differently.
The dataset is in this link. The data has been made available by Motivate International Inc. under this license.
First, I download the data from the past 12 months.
In [1]:
202212-divvy-tripdata.csv 202301-divvy-tripdata.csv 202302-divvy-tripdata.csv 202303-divvy-tripdata.csv 202304-divvy-tripdata.csv 202305-divvy-tripdata.csv 202306-divvy-tripdata.csv 202307-divvy-tripdata.csv 202308-divvy-tripdata.csv 202309-divvy-tripdata.csv 202310-divvy-tripdata.csv 202311-divvy-tripdata.csv
The datasets are organized in rows and columns, each row represents a trip. In the column, we have:
* ride_id * rideable_type * started_at #day and time * ended_at #day and time * start_station_name * start_station_id * end_station_name * end_station_id * start_lat #Trip start latitude * start_lng #Trip start longitute * end_lat #Trip end latitude * end_lat #Trip end longitute * member_casual #Rider type
ROCCC test
First, I copy the original files to a subfolder for safety. Then, I import the .csv files into Excel and save them as .XLS for cleaning purposes.
For each .XLS file, I follow these steps to clean the data:
Repeat the process for each .XLS file and save it as a new .csv file.
Additionally, I break down the large .XLS files into smaller files to comply with the BigQuery 100Mb upload limit.
SELECT ride_id, rideable_type, started_at, ended_at, week_day, ride_length, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual FROM loyal-lore-407211.Cyclist_dataset.trip202305 UNION ALL SELECT ride_id, rideable_type, started_at, ended_at, week_day, ride_length,start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual FROM loyal-lore-407211.Cyclist_dataset.trip202306 -- Then repeat the process for each table
Then export the one full table as “trip_combined”.
See if there’s duplicate or misspelling.
SELECT DISTINCT [the columns] FROM loyal-lore-407211.Cyclist_dataset.trip_combined
In the end_station_id
column, there's an unusual entry named "chargingstx07," so I decided to investigate.
SELECT DISTINCT end_station_id FROM loyal-lore-407211.Cyclist_dataset.trip_combined SELECT end_station_name, end_station_id FROM loyal-lore-407211.Cyclist_dataset.trip_combined WHERE end_station_id = "chargingstx07"
However, it appears that this is not an incorrect input. The other columns seem to be in good order.
SELECT * FROM loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns WHERE start_station_name IS NULL
About 150,000 start_station_name, and about 130,000 end_station_name are NULL.
Also, there are 838 end_lat, and end_lng are NULL.
Even though I have already done this step in Excel, but here’s how to do it in BigQuery.
SELECT ride_id, rideable_type, started_at, ended_at, ROUND(TIMESTAMP_DIFF(ended_at, started_at, second)/60, 1) AS ride_length_minutes, EXTRACT(DAYOFWEEK FROM started_at) AS day_of_week, EXTRACT(MONTH FROM started_at) AS month, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual FROM loyal-lore-407211.Cyclist_dataset.trip_combined
First I want to know the average, max, min of ride_length_minute.
SELECT AVG(ride_length_minutes) AS avg_ride, MIN(ride_length_minutes) AS min_ride, MAX(ride_length_minutes) AS max_ride FROM loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns
And here’s the return:
avg_ridemin_ridemax_ride15.466-16657.029210.0
It is unusual that the min_ride
is negative, prompting me to run the following query to determine how many rides have negative values.
SELECT COUNT(ride_length_minutes) FROM loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns WHERE ride_length_minutes<0
There are 73 rides in negative ride length, I believe these are errors. Therefore, I am going to filter out those rides and export the results as a new table.
SELECT * FROM loyal-lore-407211.Cyclist_dataset.trip_combined_add_columns WHERE ride_length_minutes > 0
Now let’s calculate the average, min, and max again.
avg_ridemin_ridemax_ride15.7331.029210.0
Now I want to know the mode(the most commonly occurring value) in day_of_week.
(1=Sunday, 7=Saturday)
Mode of day_of_week member
SELECT day_of_week, COUNT(*) AS count FROM loyal-lore-407211.Cyclist_dataset.removed_errors GROUP BY day_of_week ORDER BY count DESC
Rowday_of_weekcount15158419241562753715404643147411561436656113382972133270
Mode is 5, which is Thursday.
SELECT AVG(ride_length_minutes) AS avg FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = "casual"
AVG = 22.97 minutes
SELECT AVG(ride_length_minutes) AS avg FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = "member"
AVG = 12.25 minutes
1=Sunday, 7=Saturday
SELECT day_of_week, AVG(ride_length_minutes) AS avg FROM loyal-lore-407211.Cyclist_dataset.removed_errors GROUP BY day_of_week ORDER BY day_of_week
day_of_weekavg119.709233424743559214.511180310647507313.599331121829396413.612337226043794514.289857908457964615.545588695924604719.190579437310
The highest average ride length is on Sunday.
The lowest average ride length is on Tuesday.
SELECT day_of_week, AVG(ride_length_minutes) AS avg FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual='casual' GROUP BY day_of_week ORDER BY day_of_week
day_of_weekavg127.419156795754471221.954451885243767319.143364217669383418.74479269011594520.635555857016939622.647593729560946726.266489027274
SELECT day_of_week, AVG(ride_length_minutes) AS avg FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual='member' GROUP BY day_of_week ORDER BY day_of_week
day_of_weekavg113.872636802856835211.500943167567685311.652242062037111411.80421069027282511.831990542078943612.048957078156796713.7726147356838
SELECT member_casual, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors GROUP BY member_casual
Member = 693422
Casual = 333493
SELECT day_of_week, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'casual' GROUP BY day_of_week ORDER BY day_of_week
day_of_weekride_number157661238377338315440712544229647397766802
SELECT day_of_week, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'member' GROUP BY day_of_week ORDER BY day_of_week
day_of_weekride_number176168294893310909641155635114190696268787244
Then I want to know how 2 kinds of riders different in each hour of the day. To avoid confusion, I turn the day_of_week number to actual day of week.
SELECT CASE WHEN day_of_week = 1 THEN 'Sunday' WHEN day_of_week = 2 THEN 'Monday' WHEN day_of_week = 3 THEN 'Tuesday' WHEN day_of_week = 4 THEN 'Wednesday' WHEN day_of_week = 5 THEN 'Thursday' WHEN day_of_week = 6 THEN 'Friday' WHEN day_of_week = 7 THEN 'Saturday' END AS day_of_the_week, EXTRACT(hour FROM started_at) AS hour, COUNT(*) AS rides_taken FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'member' GROUP BY day_of_week, hour ORDER BY day_of_week, hour
Here’s a part of the return
day_of_the_weekhourrides_takenSunday01735Sunday11262Sunday2762Sunday3411Sunday4304Sunday5356Sunday6950Sunday71394Sunday82370Sunday93766Sunday105046
SELECT CASE WHEN day_of_week = 1 THEN 'Sunday' WHEN day_of_week = 2 THEN 'Monday' WHEN day_of_week = 3 THEN 'Tuesday' WHEN day_of_week = 4 THEN 'Wednesday' WHEN day_of_week = 5 THEN 'Thursday' WHEN day_of_week = 6 THEN 'Friday' WHEN day_of_week = 7 THEN 'Saturday' END AS day_of_the_week, EXTRACT(hour FROM started_at) AS hour, COUNT(*) AS rides_taken FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'casual' GROUP BY day_of_week, hour ORDER BY day_of_week, hour
Part of the return
day_of_the_weekhourrides_takenSunday11310Sunday2839Sunday3443Sunday4317Sunday5285Sunday6501Sunday7882Sunday81451Sunday92576Sunday103823Sunday114924Sunday124960
Then I want to see how 2 types of rider choose the bikes differently.
First let’s see how the casual riders choose the bike type.
SELECT rideable_type, COUNT(rideable_type) AS bike_type FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'casual' GROUP BY rideable_type
rideable_typebike_typeclassic_bike163567electric_bike168251docked_bike1675
Then let’s see how the members choose the bike type.
SELECT rideable_type, COUNT(rideable_type) AS bike_type FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE member_casual = 'member' GROUP BY rideable_type
rideable_typebike_typeclassic_bike367533electric_bike325889
Include the latitude and longitude of the start station
SELECT start_station_name, start_lat, start_lng, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE start_station_name !='null' AND member_casual = 'casual' GROUP BY start_station_name, start_lat, start_lng ORDER BY ride_number DESC LIMIT 100
The most popular start station for casual rider is: Streeter Dr & Grand Ave
SELECT start_station_name, start_lat, start_lng, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE start_station_name !='null' AND member_casual = 'member' GROUP BY start_station_name, start_lat, start_lng ORDER BY ride_number DESC LIMIT 30
The most popular start station for casual rider is: Ellis Ave & 60th St.
SELECT end_station_name, end_lat, end_lng, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE start_station_name !='null' AND member_casual = 'casual' GROUP BY end_station_name, end_lat, end_lng ORDER BY ride_number DESC LIMIT 30
SELECT end_station_name, end_lat, end_lng, COUNT(*) AS ride_number FROM loyal-lore-407211.Cyclist_dataset.removed_errors WHERE start_station_name !='null' AND member_casual = 'member' GROUP BY end_station_name, end_lat, end_lng ORDER BY ride_number DESC LIMIT 30
Then I want to know where is the most visited station by members.
So I use JOIN to combine the results of 2 tables: “member start station count” and “member end station count”, to calculate the total visit
SELECT start_station_name AS station_name, start_station_count_member.ride_number + end_station_count_member.ride_number AS total_visit FROM loyal-lore-407211.Cyclist_dataset.start_station_count_member JOIN loyal-lore-407211.Cyclist_dataset.end_station_count_member ON start_station_name = end_station_name ORDER BY total_visit DESC
Same thing applies to casual riders.
SELECT start_station_name AS station_name, start_station_count_casual.ride_number + end_station_count_casual.ride_number AS total_visit FROM loyal-lore-407211.Cyclist_dataset.start_station_count_casual JOIN loyal-lore-407211.Cyclist_dataset.end_station_count_casual ON start_station_name = end_station_name ORDER BY total_visit DESC
After running the statistic queries above, I saved and exported the result as .csv file, and import into Tableau.
https://public.tableau.com/shared/SGK8HKRHS?:display_count=n&:origin=viz_share_link
After analyzing the disparities in activities between members and casual riders, the following key takeaways have emerged, effectively addressing the primary question of this analysis:
How do annual members and casual riders use Cyclistic bikes differently?
The identified characteristics of our target customers, those with a higher likelihood of becoming members, suggest the following recommendations for the marketing team: