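## to prepare data
# load packages: tidyverse (tibble, dplyr, ggplot2) and scales (comma labels)
library(tidyverse)
library(scales)
# assumes `Fraud` is a data frame that has already been loaded
df <- as_tibble(Fraud)
df$type <- factor(df$type)
df$isFraud <- factor(df$isFraud)
df$isFlaggedFraud <- factor(df$isFlaggedFraud)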
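## to create transaction types ratio
x_pie <- table(df$type)
# take the labels from the table itself so they match the slice order;
# unique() returns first-appearance order, which can differ from level order
y_pie <- names(x_pie)
per_pie <- round(x_pie / sum(x_pie) * 100, 2)
pie(x_pie,
    labels = paste0(y_pie, " ", per_pie, "%"),
    main = "Transaction Type")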
After reviewing the transactions, I wanted to find out which transaction types customers use most in the bank's dataset. I built a pie chart in RStudio and found that transfer and debit transactions together account for 68.98% of all transactions, in roughly equal shares. Payment services come third, at 21.99% of all transactions. In my opinion, the spread of contactless payment and improvements in technology have reduced traditional withdrawal transactions, so customers are using digital transfers and payments instead.
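One detail worth checking when labelling a pie chart built from `table()`: the counts come back in factor-level order, whereas `unique()` returns values in order of first appearance, so the two orders can disagree. The following is a minimal sketch with made-up data (the `type` vector is hypothetical and not taken from the bank dataset) that illustrates the difference:

```r
# Hypothetical toy data, not the bank dataset
type <- factor(c("PAYMENT", "TRANSFER", "PAYMENT",
                 "CASH_OUT", "PAYMENT", "TRANSFER"))

counts <- table(type)                       # counts in factor-level order
shares <- round(counts / sum(counts) * 100, 2)

level_order <- names(counts)                # same order as the counts
seen_order  <- as.character(unique(type))   # order of first appearance

print(level_order)
print(seen_order)

# Labels taken from the table itself always line up with the slices
labels_ok <- paste0(names(shares), " ", shares, "%")
print(labels_ok)
```

Taking the labels from `names()` of the table keeps each percentage attached to the slice it describes, whatever order the rows happen to appear in.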
## to create processed amount per transaction histogram plot
# to clean data and calculate mean and standard deviation
amount <- df %>%
  filter(oldbalanceOrg != 0, newbalanceOrig != 0,
         oldbalanceDest != 0, newbalanceDest != 0) %>%
  mutate(amounts = abs(oldbalanceOrg - newbalanceOrig)) %>%
  filter(amounts >= 1000 & !is.na(amounts))
m_amount <- round(mean(amount$amounts), 2)
sd_amount <- round(sd(amount$amounts), 2)
# to generate random sample
set.seed(14)
random_amount <- tibble(V1 = rnorm(
  n = 1338561,
  mean = m_amount, sd = sd_amount)) %>%
  filter(V1 > 1000)
# to calculate quantile ranges for the random sample
hist_list <- list()
for (name in colnames(random_amount)) {
  print(paste("Calculating range for:", name))
  hist_list[[name]] <- quantile(random_amount[[name]],
                                probs = c(0, 0.2, 0.4, 0.6, 0.8, 1))
}
# to calculate mode value (peak of the kernel density estimate)
dens_amount <- density(random_amount$V1)
mode_amount <- round(dens_amount$x[which.max(dens_amount$y)], 2)
# to plot histogram
ggplot(random_amount, aes(x = V1)) +
  # include.lowest = TRUE keeps the minimum value inside the first bin
  geom_histogram(aes(fill = cut(V1, breaks = hist_list$V1,
                                include.lowest = TRUE))) +
  ggtitle("Distribution of Processed Amounts per Transaction") +
  xlab("Amount") +
  ylab("Count") +
  scale_fill_manual(
    values = c("#F8766D", "#F9A41B", "#D2D611", "#8FD944", "#00BFC4"),
    labels = c(
      "1k - 81.7k", "81.7k - 140k", "140k - 195k",
      "195k - 263k", "263k - 726k"
    )
  ) +
  scale_x_continuous(labels = comma_format()) +
  theme(axis.text.x = element_text(size = 8)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_vline(xintercept = mode_amount, color = "blue",
             linetype = "dashed", linewidth = 1) +
  labs(fill = "Amount") +
  annotate("text", x = mode_amount, y = 0,
           label = paste0("Mode: ", comma(mode_amount)))
Once I knew which transaction types customers used most, I wanted to know how much money was processed and which amounts were most common. The dataset has about six million transactions, which is too many for my laptop, so I kept only transactions of $1,000 or more. That left approximately 1.3 million transactions. I then simulated the same number of amounts by drawing from a normal distribution with the mean and standard deviation of the filtered data. Finally, I computed quantile ranges for the simulated amounts to define the bins in the histogram and found that about 40% of transactions fall between $81,700 and $195,000, with the most common amount (the mode) being $154,956.
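The binning step can be sketched on its own to show where the 40% figure comes from: quintile break points split the data into five bins of about 20% each, so any two adjacent bins cover about 40%. The sketch below uses made-up inputs (the sample size, mean, and standard deviation are assumptions for illustration, not the values from the bank dataset) and a hypothetical helper `fmt()` that builds the legend labels from the break points rather than typing them by hand:

```r
set.seed(14)

# Hypothetical stand-in for the simulated amounts (parameters are made up)
sim <- rnorm(n = 10000, mean = 170000, sd = 110000)
sim <- sim[sim > 1000]                      # keep amounts above $1,000

# Quintile break points: 0%, 20%, 40%, 60%, 80%, 100%
breaks <- quantile(sim, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(sim, breaks = breaks, include.lowest = TRUE)

# Share of observations per bin, in percent (about 20% each by construction)
bin_share <- round(prop.table(table(bins)) * 100, 1)
print(bin_share)

# Hypothetical helper: format a dollar amount in thousands, e.g. "81.7k"
fmt <- function(x) paste0(round(x / 1000, 1), "k")

# One label per bin, built from consecutive break points
bin_labels <- paste(fmt(head(breaks, -1)), "-", fmt(tail(breaks, -1)))
print(bin_labels)
```

Labels built this way could be passed to `scale_fill_manual(labels = ...)`, so the legend stays in step with the data if the filter or the seed changes.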