Data engineer and data scientist with over four and a half years of experience. Proven success in processing large volumes of data (6 TB per day) with Spark in Scala and MPI in R and Python, developing a machine learning model with Spark in Scala on 30 billion records for IoT device recognition, and developing algorithms that classify unlabeled customer network behaviors to protect devices from compromise. Skilled in programming, machine learning, cross-functional communication, and creative problem solving. Objective: to become an expert in the data science field and make people’s lives better with statistics and machine learning.
♦ R / Python / SQL
♦ Machine Learning
♦ Spark / Parallel Computing (MPI)
♦ Hive / HDFS / MongoDB
♦ Scala / C++ / Shell Script
♦ Data Visualization / Tableau
♦ AWS / Azure
** Quality Management System – Data Engineer
• Collect URD from quality engineers to develop systems for tracking quality issues.
• Design a QMS for quality engineers to trace PRR and VLRR and perform FA on Azure hardware such as servers, CPUs, memory, SSDs, HDDs, and motherboards.
** Home Network Security – Data Engineer
• Reduced reporting time by 90% for reports built from 1B security events per day, helping marketing and sales teams in Japan, Singapore, and Australia find opportunities to improve business.
• Visualized the relationships between security events for threat experts using word2vec and t-SNE.
** Network Behavior Analysis Project – Data Scientist
• Developed a machine learning model to recognize IoT devices based on 30 billion netflow records, using Spark in Scala and Python.
• Reached a 90% accuracy rate in identifying periodic network behaviors of IoT devices with a statistical model.
** Yield Improvement Project – Data Engineer and Data Scientist
• Processed large volumes of data (6 TB per day) to maintain a data warehouse for machine learning projects.
• Reduced the out-of-control rate by 30% via a statistical model.
• Reduced the scrap rate by 80% with custom-built anomaly detection algorithms.
• Reduced the time needed to find key factors of yield rates by 80% via data visualization and statistics.
** Big Data Solutions – Data Engineer
• Ingested 6 TB of data per day by building an on-premises big data solution with Scala, Spark, and Hive.
• Reduced the implementation time of machine learning algorithms by 95% via R, MPI, Hive, and Spark.
** Weekly Productivity Improvement Program – Leader
• Developed R packages to avoid reinventing the wheel and increase productivity.
• Taught data scientists and data engineers to write clean, performant code.
• Organized study groups to share knowledge of machine learning and statistics with colleagues.
** Main role
• Decreased data processing time by 80% using R and MongoDB to process millions of records per day.
• Achieved a 40% lower RMSE than other methods when imputing missing values with a custom-built machine learning approach.
== Achievements ==
• Completed a master’s thesis entitled “A Classification Approach Based on Density Ratio Estimation with Subspace Projection.” Advisor: Ray-Bing Chen.
• Earned a grade of 95% in my statistical methods, generalized linear models, and statistical data mining classes, and 92% in my linear models class. I am thus confident in building models and drawing inferences from them.
• Completed an advanced probability theory class designed for Ph.D. students.
With advance planning and hard work, I earned 175 credits across 2 majors within 4 years.