• Implemented Airflow ETL pipelines using Docker and integrated multiple databases, including SQL Server (MSSQL), Vertica, and
MongoDB, enhancing pipeline reliability and accessibility and improving data processing efficiency by 10x
• Designed a data alert system in Python that monitors hundreds of ETL processes, provides minute-by-minute updates on the latest
data status, and delivers alerts to messaging services or email, resolving stale-data and ETL-failure issues
• Developed Python tools, including utilities for encrypting and decrypting sensitive data, and real-time operational reporting systems
using Kafka and MongoDB, enabling stakeholders to access up-to-date information for performance monitoring and reporting
January 2023 - May 2023
Designed and developed ETL pipelines
• Implemented and maintained ETL pipelines integrated with Jenkins on cloud services (GCP and AWS), providing analysts
with reliable data and saving 50% of effort (Groovy, Dataproc, Spark, BigQuery, EMR, Hive, Jenkins)
• Designed an ETL pipeline framework, implemented it on cloud services, and held user training sessions for engineers and analysts.
By leveraging Airflow and Kubernetes, the project reduced maintenance effort by 40% and increased deployment efficiency by 60%
(GitLab, Airflow, Python, K8s)
Developed and deployed an API to retrieve data in the cloud environment
• Developed an API to retrieve data, including geo-location data, from BigQuery and deployed it in the GCP environment.
The API reduced data-fetching time by 80% (Cloud Run, IAM, BigQuery)
October 2019 - July 2021
Maintained distributed systems and databases
• Constructed and managed the Hadoop ecosystem with Ambari. Built ETL pipelines to query multi-source databases,
processing more than three terabytes (TB) of data and covering 90% of analysis needs (Hive, HBase, Python, ELK, MySQL)
• Established a data collection and analysis workflow, saving data scientists 30% of the time needed to analyze and build machine
learning models with the collected data (Elasticsearch, PySpark, Airflow)
Constructed backend systems and APIs
• Researched webpage user preferences and behavior, and modified the advertising performance evaluation system to enable
precision marketing, increasing advertising-targeting accuracy by 300%
• Constructed an article-classification API and embedded machine learning models (linear regression, random forest, XGBoost) to
categorize articles. The tool was shipped as part of the product and processes 90% of articles every day (Flask)
• Upgraded an advertising API and deployed it on a cloud service (GCP), increasing total monthly revenue by 33% after
implementing the new API (Python, Docker, Nginx, Celery, Redis, load balancing, MySQL, HBase)
February 2019 - September 2019
Constructed data pipelines and performed data analysis
• Developed a bioinformatics pipeline that saved non-technical scientists 80% of the effort required to analyze and visualize genome
sequencing data. Published the research paper and software in the Frontiers journal as first author (Python, R, Linux)
October 2017 - April 2018
2021 - 2023
2014 - 2016
Expires November 2025