Imporment projects / design
- IaC decouple:
* Break down IaC project to from mono-service to cohesion service
* Reduce IaC change task reaction time: 30 mins per plan / apply to <5 mins
* Migrate IaC tool form Terragrunt to Pulumi
- Log flow refactor(in design):
* Add Logstash as a middleware to filter log and handle alert which trigger by log
* Separate Elasticsearch log index form different Team’s service and setting index expired day
- Zero downtime observability:
* Build up monitoring standards for key services with Developer, Customer Success and PM teams
* Setting monitor and alert for key service
Maintains cloud infrastructure and network routes between two
cloud provider:
- AWS, Alicloud
- special network resource cross cloud: AWS direct connect,
transit gateway, Alicloud VBR, CEN
- Kubernetes platform
Maintains observability components
- ELK, fluent-bit
- Grafana, Prometheus, alert manager
- Jaeger(Tracing)
- Cloud resource log: S3 log, etc.
Handling production troubleshooting, alert and emergency issue
Review Cloud’s bill and resource cost, develop cost down plans