Junyan Cloud Related Works


Expertise & Innovation Lead, Cloud
Taipei City, Taiwan

Cloud Architecture Solution Showcases

Welcome to my portfolio, showcasing my diverse range of technical expertise and experience in data engineering, serverless architectures, deep learning, and IoT systems. Here you'll find real-world applications of tools like Google Cloud Platform, AWS, BigQuery, and Python, demonstrating my unique problem-solving abilities. From designing complex data pipelines to improving IoT system architectures, this portfolio reflects my commitment to creating efficient, scalable solutions and my continuous drive to learn and adapt. Enjoy exploring!

Cloud Data Warehouse System


In my current role, I'm responsible for managing and operating a government subsidy disbursement platform. Given the substantial budget and the size of the project team, there was significant demand for data analysis, and the original outsourced dashboard vendor could not keep up with the client's needs. I therefore proposed establishing our own data analysis system within the team and took full responsibility for it.

Data Sources

  • Our data sources were diverse, including transaction databases, GA4 logs, PageSpeed Insights, Search Console data, and various other public datasets.

Data Extraction and Storage

  • Google Compute Engine was used for data extraction. After processing the extracted data with Python, it was sent directly to BigQuery via the client API.
  • To preserve critical data, raw extracts were also backed up to Cloud Storage. A minimal sketch of this flow follows below.
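Here is a minimal sketch of the extract-and-load step, assuming hypothetical project, dataset, table, and bucket names; the real processing logic on the VM is omitted.

```python
import pandas as pd
from google.cloud import bigquery, storage

# All resource names below are placeholders for illustration.
LOCAL_FILE = "daily_extract.csv"               # output of the extraction step on the Compute Engine VM
BQ_TABLE = "my-project.analytics.daily_extract"
BACKUP_BUCKET = "analytics-raw-backup"

# Process the extracted data with pandas before loading.
df = pd.read_csv(LOCAL_FILE)
df["load_ts"] = pd.Timestamp.now(tz="UTC")

# Send the processed data directly to BigQuery via the client API.
bq = bigquery.Client()
bq.load_table_from_dataframe(df, BQ_TABLE).result()

# Preserve the raw extract in Cloud Storage as a backup.
storage.Client().bucket(BACKUP_BUCKET).blob(f"raw/{LOCAL_FILE}").upload_from_filename(LOCAL_FILE)
```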

Front-End Plan Count Monitoring

  • To maintain a pulse on the platform's user-facing offerings, I devised a monitoring job for the number of publicly visible plans. It makes a GraphQL call every 15 minutes and logs the current count (a minimal sketch follows below).
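A minimal sketch of one polling cycle is shown below; the GraphQL endpoint, query, and destination table are placeholders, since the real schema is project-specific.

```python
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

# Placeholder endpoint, query, and table for illustration only.
GRAPHQL_URL = "https://example.com/graphql"
QUERY = "{ plans(visible: true) { id } }"
DEST_TABLE = "my-project.monitoring.visible_plan_counts"

resp = requests.post(GRAPHQL_URL, json={"query": QUERY}, timeout=30)
resp.raise_for_status()
plan_count = len(resp.json()["data"]["plans"])

# Log the current count; the script itself is scheduled every 15 minutes (e.g., via cron).
errors = bigquery.Client().insert_rows_json(
    DEST_TABLE,
    [{"checked_at": datetime.now(timezone.utc).isoformat(), "plan_count": plan_count}],
)
assert not errors, errors
```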

Cloud Storage

  • Apart from serving as a repository for raw file backups, Cloud Storage also stores performance reports. These reports are generated through weekly checks via PageSpeed Insights.
  • For long-term tracking purposes, we extract certain data points from these performance reports and send them to BigQuery.

Google Analytics 4 Integration

  • Data from Google Analytics 4 was directly imported into BigQuery through preset configurations.

Alert System

  • Three alerts were set up in Cloud Monitoring to keep the system running smoothly. These alerts notify me via email when triggered and monitor the following (a policy sketch follows this list):
  • The health status of the virtual machine.
  • The request count for recent batch data stored in Cloud Storage, to verify the ETL process is functioning as expected.
  • The number of rows written into specific BigQuery tables.
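As a rough illustration, the sketch below creates one such alert policy programmatically with the Cloud Monitoring client library. The metric filter, threshold, and notification channel are assumptions and will differ from the policies actually configured in the console.

```python
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"                                        # placeholder
EMAIL_CHANNEL = "projects/my-project/notificationChannels/123"   # pre-created email channel (placeholder)

client = monitoring_v3.AlertPolicyServiceClient()

# Illustrative condition: alert when a GCE instance stops reporting uptime for 5 minutes.
policy = monitoring_v3.AlertPolicy(
    display_name="VM health alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[{
        "display_name": "Instance uptime below threshold",
        "condition_threshold": {
            "filter": 'metric.type="compute.googleapis.com/instance/uptime" '
                      'AND resource.type="gce_instance"',
            "comparison": monitoring_v3.ComparisonType.COMPARISON_LT,
            "threshold_value": 1,
            "duration": {"seconds": 300},
            "aggregations": [{
                "alignment_period": {"seconds": 300},
                "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            }],
        },
    }],
    notification_channels=[EMAIL_CHANNEL],
)

created = client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
print("Created:", created.name)
```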

Future Planning

  • Moving forward, I'm considering adopting a data lake strategy: backing up GA4 logs and PageSpeed Insights reports in Cloud Storage, primarily to reduce BigQuery scan costs.

Website Traffic and User Behavior Tracking


In the early stages of my career, I developed skills in digital advertising projects, which fell primarily into three categories:

  1. Website traffic and user behavior tracking
  2. Performance-based digital advertising delivery and optimization, which feeds off data gathered from the first point
  3. Search Engine Optimization (SEO)

Across various advertising projects, I gained digital marketing experience in numerous industries, including securities and finance, government websites, and large-scale consumer electronics (3C) e-commerce, handling traffic tracking for over 15 websites. While my current focus lies in the first category, I continue to collect data relevant to all three. For the website I am currently responsible for, for example, I manage the system that collects data affecting traffic and conversion rates, such as keyword rankings and marketing campaign performance.

Integrating Heterogeneous Marketing Data: Building a Data Lake with Databricks

Digital marketers grapple daily with the challenge of retrieving and consolidating data from diverse sources, often encountering unexpected formats and locations. To address this predicament, we turned to Databricks as a primary tool, harnessing its robust data processing prowess coupled with the concepts of Data Lakes and Delta Lake. Data lakes facilitate centralized storage of data in varied formats, while Delta Lake offers data version control and efficient querying. This piece delves into the intricacies of data integration, presenting a viable solution through Databricks and affiliated technologies, empowering marketers to optimize their data assets.

This solution architecture diagram provides a panoramic view of a comprehensive data workflow, spanning from data acquisition, transformation, to the final applications and presentations. Here are the pivotal phases and their corresponding actions:

  • Data Collection Phase: We leverage various compute utilities to either make API calls or fetch data directly from BigQuery. The retrieved data is then stored in cloud storage buckets such as S3, Google Cloud Storage, or Azure Blob Storage.
  • Databricks Processing:
  • Data Transformation: Through the Databricks platform, we convert the original data formats (e.g., CSV, Parquet, JSON) into Delta format for storage within Delta Lake (a minimal PySpark sketch follows this list).
  • Data Storage: Once data becomes a Delta Table, it resides in DBFS (Databricks File System), a large-scale distributed filesystem tailored for Databricks.
  • Data Consolidation: Using Databricks notebooks, we perform further computation and consolidation, ensuring the data meets the requirements of downstream applications.
  • Data Offloading & Presentation: After integration, data can be offloaded to other database systems or consumed directly by BI tools for visualization and analysis, giving users clear insights.
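For the transformation step, the sketch below shows the core pattern inside a Databricks notebook, assuming an illustrative S3 path, DBFS location, and table name; `spark` is the session Databricks provides.

```python
# Runs in a Databricks notebook, where `spark` is already provided.
# The bucket path, DBFS location, and table name are placeholders.

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://marketing-raw/ga4/2023-08/*.csv")
)

# Convert the original CSV data into Delta format inside DBFS.
(
    raw.write
    .format("delta")
    .mode("overwrite")
    .save("dbfs:/delta/marketing/ga4_events")
)

# Register the Delta files as a table so later notebooks can query and join it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS marketing.ga4_events
    USING DELTA
    LOCATION 'dbfs:/delta/marketing/ga4_events'
""")
```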

MLOps with MLflow + SHAP

In the evolving landscape of AI and machine learning, our solution seamlessly integrates MLflow and SHAP for comprehensive MLOps. From structured model training, logging, and versioning to insightful model interpretation, our approach enhances transparency. Leveraging SHAP values, we offer both global and local model insights. Further, through MLflow's capabilities, models are effectively deployed, monitored, and updated. This solution not only ensures optimal model performance but also prioritizes clarity and trust, making complex AI models more understandable and reliable.

Model Training and Logging:

  • Model Training: Train models using appropriate machine learning frameworks such as Scikit-learn, TensorFlow, PyTorch, etc.
  • Model Versioning: Utilize MLflow's tracking feature to log and version each model training iteration, including parameters, metrics, and model artifacts.

Model Interpretation:

  • Compute SHAP Values: Employ SHAP to interpret the model and calculate SHAP values for each feature.
  • Global Interpretation: Use SHAP's summary_plot to swiftly understand the global impact of each feature on the model.
  • Local Interpretation: Delve deeper into individual predictions using SHAP's force_plot (a combined MLflow + SHAP sketch follows this list).
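The sketch below strings the training, logging, and interpretation steps together on a public scikit-learn dataset. It is a minimal illustration with assumed parameters, not the production training code.

```python
import mlflow
import mlflow.sklearn
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 3}   # illustrative hyperparameters

with mlflow.start_run():
    model = GradientBoostingClassifier(**params).fit(X_train, y_train)

    # Version this training iteration: parameters, metric, and the model artifact.
    mlflow.log_params(params)
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    mlflow.sklearn.log_model(model, artifact_path="model")

    # Global interpretation: overall impact of each feature on the model.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test, show=False)

    # Local interpretation: why the model scored the first test sample the way it did.
    shap.force_plot(
        explainer.expected_value, shap_values[0, :], X_test.iloc[0, :],
        matplotlib=True, show=False,
    )
```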

Model Refinement and Retraining:

  • Feature Selection: Based on SHAP values, determine which features have the most significant impact on model predictions and optimize feature selection.
  • Parameter Tuning: Track model performance metrics such as ROC AUC or MSE using MLflow logs, aiding in parameter adjustments.
  • Model Retraining: Retrain the model using the refined features and parameters.

Model Deployment:

  • Model Packaging: Register the model in MLflow's Model Registry, then package it (e.g., as a container) and expose it behind an API interface.
  • Model Launch: Deploy the packaged model to a suitable environment, such as Kubernetes, AWS Lambda, or Azure ML (a registry sketch follows this list).
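A rough sketch of the registry workflow, assuming a hypothetical model name and run ID; stage transitions are used here, although newer MLflow versions favor aliases.

```python
import mlflow
from mlflow.tracking import MlflowClient

MODEL_NAME = "demand_forecaster"          # placeholder registered-model name
RUN_ID = "<run_id_from_training>"         # placeholder

# Register the model logged by a previous training run.
version = mlflow.register_model(f"runs:/{RUN_ID}/model", MODEL_NAME)

# Promote the new version so downstream services pick it up.
MlflowClient().transition_model_version_stage(
    name=MODEL_NAME, version=version.version, stage="Production"
)

# A serving layer (container, Kubernetes pod, serverless function, ...) can then load it:
model = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}/Production")
# predictions = model.predict(batch_of_features)
```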

Model Monitoring and Feedback:

  • Performance Monitoring: Continuously monitor the model's real-world performance using MLflow and other monitoring tools.
  • Immediate Adjustments: Based on the model's actual performance and SHAP's interpretation, promptly adjust the model or data to ensure continuous optimization.

Model Retraining and Updating:

  • Data Updates: As new data emerges, use MLflow for incremental model training or retraining.
  • Model Updates: Use MLflow's Model Registry to deploy the new model into production and manage model versions.

Leveraging LLM Bard for Natural Language Generation Tasks in BigQuery

Situation:

As BigQuery continuously integrates robust natural language processing (NLP) capabilities, developers are presented with increasingly convenient tools. The recent inclusion of text-embedding features in BigQuery amplifies the diversity and potency of NLP applications. However, while numerous overviews touch upon these new features, the practical application and operational steps are often overlooked.

Task:

This article aims to fill that gap by offering a step-by-step guide on harnessing these advanced NLP tools within BigQuery. A highlight includes the integration of LLM Bard, a recently added feature allowing users to directly conduct AI operations within the data warehouse.

Action:

  1. Integration of BigQuery and LLM Bard: Google's recent update enables the execution of LLM Bard directly within the BigQuery data warehouse. This eliminates the need for HTTP clients and bypasses various operational constraints. It's vital to note that BigQuery's underlying technology is not exclusively Bard, but rather based on PaLM 2, known as text-bison in Vertex AI.
  2. Dataset Selection: For demonstration purposes, the "smsspamcollection" dataset was chosen, comprising 5,572 messages categorized as spam or legitimate. Its simple yet representative structure makes it an ideal choice for demonstrating the LLM Bard capability.
  3. BigQuery Setup: Instructions from Faraday AI's June 2023 article titled "How to use Google Bard in BigQuery (quickstart)" were employed. The steps involved:
  • Creating a BigQuery Dataset based in the US.
  • Establishing connections within BigQuery.
  • Associating service accounts with the connection.
  • Granting necessary permissions.
  • Creating remote models within BigQuery.
  4. Hands-on Implementation: Three distinct queries were constructed using the configured language model to analyze spam messages (an example query follows this list). They included:
  • Sentiment analysis of specific spam messages.
  • Distinguishing between legitimate and spam messages.
  • Identifying common patterns or features in spam messages.
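For reference, the sketch below shows the shape of one such query run through the Python client, assuming a remote model created per the setup steps above and illustrative dataset, model, and table names; the `prompt` column and the `ML.GENERATE_TEXT` options follow BigQuery ML's documented pattern.

```python
from google.cloud import bigquery

# Dataset, model, and table names are placeholders.
SQL = """
SELECT
  ml_generate_text_result AS model_output,
  text AS original_message
FROM ML.GENERATE_TEXT(
  MODEL `my_project.sms.bard_remote_model`,
  (
    SELECT CONCAT('Classify this SMS as spam or legitimate and explain why: ', text) AS prompt,
           text
    FROM `my_project.sms.smsspamcollection`
    LIMIT 5
  ),
  STRUCT(0.2 AS temperature, 256 AS max_output_tokens)
)
"""

for row in bigquery.Client().query(SQL).result():
    print(row.original_message[:60], "->", row.model_output)
```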

Result:

The integration of LLM Bard into BigQuery presents a groundbreaking approach to conducting AI operations directly within the data warehouse. The sentiment analysis query, for example, deduced that a message advertising free entry to an FA Cup final was deemed "positive". The distinction between legitimate and spam messages was clearly identified, emphasizing that while legitimate messages have a clear purpose and relevance to the recipient, spam messages are unsolicited and sent in bulk.

Deep Learning-Based Recommendation System (Ongoing)

  1. I use Vertex AI Notebooks to train the model, with BigQuery as the data source. Model registration and management are handled within Vertex AI, which either produces an endpoint or deploys the model to a VM for prediction.
  2. The VM on Compute Engine ingests user data coming from the front end, then either calls the endpoint or runs the model computation locally, returning the results to the front end (a minimal endpoint-call sketch follows this list). Computed results are also written back to the database.
  3. Load Balancing and Auto-scaling Groups handle VM scaling; another crucial function is spinning up a new VM to replace the current one if it becomes unhealthy.
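A minimal sketch of the endpoint call made from the Compute Engine VM, assuming placeholder project, region, endpoint ID, and input schema.

```python
from google.cloud import aiplatform

# Project, region, endpoint ID, and instance schema are placeholders.
aiplatform.init(project="my-project", location="asia-east1")
endpoint = aiplatform.Endpoint("projects/my-project/locations/asia-east1/endpoints/1234567890")

# User data arriving from the front end, shaped to match the deployed model's input schema.
instances = [{"user_id": "u_001", "recent_item_ids": ["p_12", "p_34"], "device": "mobile"}]

response = endpoint.predict(instances=instances)
recommendations = response.predictions   # returned to the front end and also written back to the DB
print(recommendations)
```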

Serverless-Based Budget Alert Module

This module checks if the issuance of points on the Taiwan Cloud Market platform exceeds the budget. It queries the budget data in a BigQuery table, and if the remaining budget falls below a threshold, it sends an email to notify relevant personnel.

System Architecture

  • Cloud Scheduler: Triggers a Cloud Function that queries the database every 2 hours.
  • Cloud Function: Comprises two functions (a sketch of the first follows this list):
  • The first function queries BigQuery and applies logic to the returned data; if the remaining budget is below the alert threshold, it publishes a message to a Pub/Sub topic.
  • The second function sends an alert email that includes the queried budget data.
  • Pub/Sub: Decouples the logic of the two functions and triggers the second Cloud Function.
  • IAM: The module runs under an independent service account with only the corresponding permissions.
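A minimal sketch of the first Cloud Function, assuming placeholder table, topic, and threshold values.

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                      # placeholders throughout
BUDGET_TABLE = "my-project.points.budget_status"
ALERT_TOPIC = "budget-alerts"
THRESHOLD = 100_000


def check_budget(request):
    """HTTP-triggered entry point, called by Cloud Scheduler every 2 hours."""
    row = next(iter(bigquery.Client().query(
        f"SELECT remaining_budget FROM `{BUDGET_TABLE}` ORDER BY snapshot_time DESC LIMIT 1"
    ).result()))

    if row.remaining_budget < THRESHOLD:
        publisher = pubsub_v1.PublisherClient()
        topic_path = publisher.topic_path(PROJECT_ID, ALERT_TOPIC)
        # The second function subscribes to this topic and sends the alert email.
        publisher.publish(
            topic_path, json.dumps({"remaining": row.remaining_budget}).encode("utf-8")
        ).result()

    return "ok"
```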

Optimizing Data Analysis and Quality through Dataplex

In today's data-driven landscape, the need to harness data efficiently and ensure its quality has never been more critical. Dataplex, an innovative tool in the world of data management, stands at the forefront of this evolution. It promises not just enhanced data analytics but also an assurance of impeccable data quality.

Situation:

In the digital age, ensuring data quality has become pivotal for business success.

Task:

To implement a robust system that monitors and enhances data quality automatically.

Action:

  1. Defined clear data quality standards leveraging Dataplex's automated data quality features, offering both pre-set rules and advanced customizations in the Google Cloud console.
  2. Conducted automatic data scans with Dataplex, verifying against previously established data quality standards. Any deviations triggered automatic alerts.
  3. Integrated Dataplex with Cloud Logging for real-time monitoring and alerting capabilities, enabling timely notifications upon detecting any data quality issues.
  4. Visualized data quality results by exporting scan outcomes directly to BigQuery and using BI tools like Looker for an intuitive representation (a sample query over such an export is sketched after this list).
  5. Adopted a continuous improvement approach, adjusting data quality rules based on the visualized reports and received alerts to ensure ongoing monitoring and enhancement of data quality.
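As an illustration of step 4, the query below reads a hypothetical BigQuery table that a Dataplex data-quality scan exports results into; the table and column names are assumptions modeled on a typical export, not the exact Dataplex schema.

```python
from google.cloud import bigquery

# Table and column names are illustrative assumptions.
SQL = """
SELECT
  rule_name,
  rule_dimension,
  SAFE_DIVIDE(rows_passed, rows_evaluated) AS pass_rate
FROM `my_project.dq_exports.dataplex_scan_results`
WHERE scan_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY pass_rate
"""

for row in bigquery.Client().query(SQL).result():
    print(f"{row.rule_name} ({row.rule_dimension}): {row.pass_rate:.1%} passing")
```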

Result:

This strategy not only safeguarded high-quality data but also increased transparency in data quality management. The visualizations in Looker combined with the real-time alerting capabilities of Cloud Logging provided a more immediate and clear view of data quality, ensuring smooth business operations.


More details can be found in my Medium blog: 透過 Dataplex 優化資料分析和資料品質 (Optimizing Data Analysis and Data Quality with Dataplex).


Event-Triggered AWS Glue: Processing GA4 Log Files


This solution is designed for effective management and analysis of Google Analytics 4 (GA4) log data. Given the substantial volume of GA4 logs, we built a data flow involving Google Cloud Functions, Amazon S3, AWS Glue ETL, and Amazon Athena for efficient processing.

System Architecture

  • BigQuery Query & Cloud Function: We use Google Cloud Function to periodically trigger BigQuery queries, organize daily GA4 log data, and convert them into CSV files.
  • Storage to S3: The converted CSV files are stored in Amazon S3 for subsequent data processing.
  • AWS Lambda & Glue Crawler: Once S3 receives new CSV files, it triggers an AWS Lambda function. This function initiates an AWS Glue Crawler to further analyze the new data files.
  • Glue ETL: When CloudWatch observes that the Crawler has finished exploring the data structure, we use Lambda to trigger the AWS Glue ETL job, which performs the transformations we need (a minimal Lambda sketch follows this list).
  • Storage & Analysis: The transformed data is stored back into another S3 Bucket. Finally, we use Amazon Athena for data reading and analysis.
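The sketch below shows the shape of the Lambda that reacts to a new CSV landing in S3 and kicks off the crawler; the crawler and job names are placeholders, and the second Lambda (fired when the crawler-completion event is observed) would call `start_job_run` the same way.

```python
import boto3

glue = boto3.client("glue")

CRAWLER_NAME = "ga4-raw-crawler"   # placeholder names
ETL_JOB_NAME = "ga4-etl-job"


def lambda_handler(event, context):
    # Triggered by the S3 "ObjectCreated" notification for new GA4 CSV files.
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]
    print(f"New file landed: s3://{record['bucket']['name']}/{key}")

    # Let the crawler rediscover the schema of the newly arrived data.
    glue.start_crawler(Name=CRAWLER_NAME)

    # A second Lambda, triggered on crawler completion, would start the ETL job, e.g.:
    # glue.start_job_run(JobName=ETL_JOB_NAME, Arguments={"--input_key": key})
    return {"status": "crawler started", "key": key}
```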


More details can be found in my Medium blog: 基於事件觸發的AWS Glue:實作處理GA4日誌檔案 (Event-Triggered AWS Glue: Processing GA4 Log Files).

Implementation of Google Cloud Functions and Pub/Sub Integration

This project involves the utilization of Google Cloud Functions and Pub/Sub for the analysis and communication of stock data.

  1. Acquiring and Analyzing Stock Data: We use a Cloud Function to retrieve and analyze stock data.
  2. Publishing Analysis Results: The results from the analysis are published to a Pub/Sub topic.
  3. Subscribing and Communication: Another Cloud Function subscribes to the topic and, upon receiving new messages, sends the results via email (a minimal subscriber sketch follows this list).
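A minimal sketch of the subscribing function (step 3), written as a Pub/Sub-triggered background Cloud Function; the SMTP settings and recipients are placeholders.

```python
import base64
import json
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.com"     # placeholders
SENDER = "alerts@example.com"
RECIPIENT = "me@example.com"


def send_stock_report(event, context):
    """Entry point for a Pub/Sub-triggered background Cloud Function."""
    analysis = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    msg = EmailMessage()
    msg["Subject"] = f"Stock analysis: {analysis.get('ticker', 'N/A')}"
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content(json.dumps(analysis, indent=2, ensure_ascii=False))

    with smtplib.SMTP(SMTP_HOST, 587) as server:
        server.starttls()
        # server.login(user, password)  # credentials would come from Secret Manager
        server.send_message(msg)
```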


For a detailed walkthrough of the process, please refer to my article: 利用 Google Cloud Functions 和 Pub/Sub 串接實作 (Implementing Google Cloud Functions and Pub/Sub Integration).

Architectural Modification Project for an IoT-Based Aquaculture Data Collection System

This project involved consultation and suggestions for architectural modifications on a system designed to collect data from IoT devices in aquaculture settings. Though I didn't implement the changes, my role was instrumental in designing the improved system's architecture.

The original system structure worked as follows:

  1. A local server collected data from IoT devices, aquaculture equipment, and cameras. Numeric data was stored directly in MySQL, while image data was stored in Blob storage.
  2. Another virtual machine (VM) pulled images for training AI models on Azure ML. These models were then deployed on the same machine.
  3. The front-end, back-end, and database were all housed on the same VM.

This project served as a Proof of Concept (PoC) for a digital twin system. In the next section, I will provide the proposed architectural changes.

The main modifications to the IoT-Based Aquaculture Data Collection System architecture are as follows:

  1. Separating the database layer from the front-end and back-end VMs, and switching to the managed Azure Database for MySQL service.
  2. Changing IoT transmissions to go through IoT Hub, with Azure Functions triggering regular data transfers to the DB (a minimal trigger sketch follows this list).
  3. Adding Container Registries for managing deployed containers.
  4. Employing Azure Monitor to oversee the VM health status, IoT Hub and Blob data reception, DB write operations, and the execution of Functions.
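To make item 2 concrete, here is a minimal sketch of an Azure Function bound to the IoT Hub's Event Hub-compatible endpoint (the binding itself lives in function.json); the payload fields and the MySQL write are illustrative assumptions only.

```python
import json
import logging

import azure.functions as func


def main(event: func.EventHubEvent):
    """Triggered by messages arriving at the IoT Hub's built-in Event Hub endpoint."""
    reading = json.loads(event.get_body().decode("utf-8"))
    logging.info("Telemetry from pond sensor: %s", reading)

    # Illustrative fields; real devices define their own schema.
    row = (reading.get("device_id"), reading.get("temperature"), reading.get("dissolved_oxygen"))

    # Insert into Azure Database for MySQL here, e.g. with mysql-connector-python:
    #   conn = mysql.connector.connect(host=..., user=..., password=..., database="aquaculture")
    #   conn.cursor().execute("INSERT INTO sensor_readings VALUES (%s, %s, %s)", row)
    #   conn.commit()
    logging.info("Prepared row for DB insert: %s", row)
```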

Expected Benefits:

  1. Improved Reliability and Management: Separating the database from the application VMs improves reliability and makes the system easier to manage.
  2. Enhanced Scalability: Managed services like Azure Database for MySQL adapt to fluctuations in load.
  3. Efficient Data Transfers: IoT Hub and Azure Functions streamline data transfers and reduce data loss.
  4. Better Deployment Management: Container Registries improve efficiency and reliability of deployments.
  5. Proactive Monitoring: Azure Monitor allows early identification and resolution of issues.
  6. Cost Savings: Managed services, efficient deployments, and proactive monitoring contribute to long-term cost savings.



Dive into this portfolio to explore data engineering, serverless solutions, deep learning, and IoT innovations, all under one roof. Real-world applications, ingenious problem-solving, and continuous learning await you!
Published: Jul 29th 2023

Tools

  • TensorFlow
  • Docker
  • Python
  • Azure
  • Amazon Web Services

GA4
Azure
Cloud Architecture
GCP
AWS
