
Connor Hsu (leafwind)

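Highly curious about data, I enjoy studying topics such as information gaps, data manipulation, and data analysis. I currently work as a software engineer at a startup, caught every day in the tug-of-war between artificial intelligence and real-world data, and I run the blog "all about data" in my spare time.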

Experience

Software Engineer, Appier, Jun 2014 - Present

Cross-functional engineer processing data with machine learning, OLAP, ETL, RESTful APIs, backend services, and automation.

Build RTB bidding algorithms in a fast-growing business environment.

Conduct experiments to make daily improvements.

Infer root causes in an uncontrolled, time-sensitive environment to solve critical issues.

Design/consolidate pipelines to enable AI/ML products from hundreds of terabytes of data.

Research Assistant, CSIE, NTU, Oct 2012 - May 2014

Build an online video retrieval system by linking algorithms with open-source software.

Co-develop retrieval algorithms (feature selection, ranking and indexing).

Collect video segments from digital TV signals and build a terabyte-scale database as the experimental dataset.

Education

National Taiwan University, Taiwan, Sep 2009 - Jun 2011

M.S., Department of Computer Science and Information Engineering

  • Me-link: Link me to the media - fusing audio and visual cues for robust and efficient mobile media interaction (WWW 2014)
  • Comp2Watch: enhancing the mobile video browsing experience (IMMPD 2011)
  • Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading (MMM 2011)

    National Chiao Tung University, Taiwan, Sep 2005 - Jun 2009

    B.S., Department of Computer Science

    GPA: 3.88/4.0

    Selected Projects


    Continuously Improve ETL Processes

  • Gain 300% update frequency with 33% of the cores by reusing data flows and rescheduling jobs according to their data recency. (2015 Q3)
  • Further achieve 200% update frequency with fewer resources through pipeline refactoring.

    Data Cleansing & Data Governance

  • Survey over 250 (undocumented) major Spark table fields; deprecate or annotate 80+ and correct 20+ of them within one week. (2016 Q3)
  • Build a monitoring dashboard for major columns that previously had issues.
  • Trace complex (RDB, Spark, static csv/json/parquet, memory, API services) and undocumented data pipelines produced by different owners on a daily basis.

    Improve ML Model Performance

    Improve high-quality inventory prediction by adopting a suitable ML model (2016 Q2)

  • precision 5.3% -> 79%
  • volume increased to 1280%

    Extend the CPA model to different scenarios

  • CPA reduced to 68%
  • volume increased to 240%

    MeLink: An online multimedia retrieval system, NTU, 2014

    Scenario Demo: http://vimeo.com/leafwind/melink-scenario

    Use audio-visual signatures captured by a mobile device to retrieve multimedia objects (image/video/audio) from a million-scale dataset within a few seconds.

    Design/maintain index structures for different features.

    C, C++, Python, iOS app, MPlayer, Echoprint, OpenCV, Solr, TokyoTyrant

    Articles


    My Machine Learning Engineering in Appier

    I’ve been at Appier for more than 2.5 years. For the first 2 years, my job was to do anything that helped AI work with the growing business, as well as with the data platform/systems, which were getting more and more complicated (looped and twisted dependencies, propagated errors without monitoring, and no documentation). Most of it was engineering work, but not always...

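    Software and Data Culture Come First, Then AI (先有軟體與數據文化,才有AI)

    The illusion that these steps "can be skipped" may come from assuming that software and data are hard technologies, and "rote-learned" ones at that: something that can be done by the book, with a fixed process to follow. In fact they are not; they are soft culture. Although artificial intelligence can be combined with hardware, in essence it cannot escape the realm of software services. A "software service" is made up of "software" whose physical form cannot be seen and "service" that has no physical form, and these are exactly the culture Taiwan has long neglected: we care about the bun we can see and the price we can see, yet rarely make a serious assessment of far-reaching impact and intrinsic value...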

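    "Rules of Machine Learning", Traditional Chinese Edition (1): Before Machine Learning (《機器學習法則》繁中版(一)在機器學習之前)

    The original text for this series is "Rules of Machine Learning" by Martin Zinkevich (Research Scientist @ Google). After reading part of it, I found that it really resonated with me (especially the point that "most of the problems you will face are engineering problems"). As an engineer who has taken part in some machine learning development, I very much hope that team members can learn these useful "truisms" before any product is developed, and so avoid many needless detours in building systems. So I wanted to read it closely, take some notes, and promote it to the Traditional Chinese community. I have therefore tried to translate and typeset it while staying as faithful to the original meaning as possible, and to add detailed explanations (inferred from personal experience) to some of the terser passages in the original...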

    Data / ETL


    Production

    Spark, HiveQL, MySQL


    Toy / scripts

    Sqlite3, LMDB, Pandas, InfluxDB


    ML Library

    word2vec, janome, jieba

    Services


    AWS S3, VPS (Linode / DigitalOcean)

    Linux: shell script

    Nginx + uWSGI + Flask


    Python API

    gspread, boto, scrapy

    CI Toolchain


    Jenkins

    Remote trigger, Parameterized builds

    Job dependency, Retry

    Python

    Pyflakes, Pylint, Nose, Vulture, Coverage

    Other Services

    Travis CI, Coveralls

    Bots


    Slack pokemon RPG / Tarot bot

    Line weather bot

    Twitch engagement bot

    Web & Visualization


    d3.js, bokeh (toy)

    jQuery

    Grafana + InfluxDB

    Software Development


    Git

    JIRA

    Scrum, Kanban and Scrumban


    Photography

    Although I am not a professional photographer, I like to take pictures with my phone.
    These photos were taken with an LG G4.
    sky @ Ming Chuan University, Taipei
    sunset @ Guam
    Powered by CakeResume