What should we learn from Figma's strategic shift to Kubernetes?
As an experienced cloud native architect at Taikun, I've seen firsthand how Kubernetes can revolutionize a company's infrastructure. Figma's recent migration to Amazon EKS (Elastic Kubernetes Service) is a prime example of the transformative power of Kubernetes for scaling cloud-native applications. This transition was completed in less than 12 months and involved several key considerations and steps.
Why did Figma move to Kubernetes?
Figma's decision to move to EKS was influenced by multiple factors:
- Limitations of ECS and ECS Fargate:
  - ECS is AWS-specific, limiting Figma's ability to adopt multi-cloud strategies.
  - ECS Fargate, while serverless, lacks fine-grained control over infrastructure.
  - Both have a more limited ecosystem compared to Kubernetes.
- Orchestration:
  - Kubernetes offers sophisticated scheduling and bin-packing algorithms.
  - Better support for stateful applications and persistent storage, crucial for running services like etcd (a term that anyone running Kubernetes comes across daily).
- Scalability and Performance:
  - Kubernetes' horizontal pod autoscaling is more flexible than ECS's service auto-scaling.
  - Kubernetes allows for more efficient resource utilization across nodes.
- Rich Ecosystem and Tooling:
  - Access to CNCF ecosystem tools like KEDA for auto-scaling and Envoy for service mesh capabilities.
  - Easier adoption of open-source software packaged as Helm charts.
- Future-Proofing:
  - Kubernetes positions Figma to easily adopt emerging cloud-native technologies.
  - Vendor-neutral nature allows for potential multi-cloud or hybrid cloud strategies.
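To make the autoscaling point more concrete, here is a minimal Python sketch of the core rule the Kubernetes documentation gives for the Horizontal Pod Autoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The function name, parameter names, and replica bounds are illustrative assumptions, and the real controller also handles stabilization windows, missing metrics, and pod readiness, none of which is modeled here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults in this sketch).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Because the rule works on any metric ratio, the same calculation applies whether the metric is CPU, queue depth, or requests per second, which is part of what makes this style of autoscaling flexible.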
Technical challenges and solutions
- Running etcd:
  - ECS has no equivalent of Kubernetes StatefulSets, so running etcd there required complex workarounds.
  - Kubernetes natively supports StatefulSets, simplifying etcd deployment.
- Auto-scaling:
  - Figma wasn't utilizing ECS's auto-scaling, leading to over-provisioning.
  - Implemented Karpenter on EKS for dynamic node scaling based on demand.
- Service Mesh:
  - ECS relied on AWS ALBs and NLBs, causing slow target registration/deregistration.
  - Kubernetes opens the door for service mesh solutions like Istio.
- Logging Pipeline:
  - Complex pipeline using CloudWatch, Lambda, Datadog, and Snowflake.
  - Planned future implementation of Vector for more efficient log processing.
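Why StatefulSets matter for etcd becomes clearer with a small example. A StatefulSet gives each pod a stable name (`<statefulset>-<ordinal>`) and, with a headless Service, a stable DNS name, so every etcd member can be told where its peers live before they start. The Python sketch below builds the value of etcd's `--initial-cluster` flag from those names. It assumes the default `cluster.local` cluster domain and plain HTTP peer URLs for brevity, and the StatefulSet, Service, and namespace names are hypothetical, not Figma's.

```python
def etcd_initial_cluster(statefulset: str, service: str, namespace: str,
                         replicas: int, peer_port: int = 2380) -> str:
    """Build an etcd --initial-cluster value from StatefulSet pod identities."""
    members = []
    for ordinal in range(replicas):
        # StatefulSet pods are named <statefulset>-<ordinal>.
        pod = f"{statefulset}-{ordinal}"
        # A headless Service gives each pod a stable DNS record
        # (assumes the default cluster domain, cluster.local).
        host = f"{pod}.{service}.{namespace}.svc.cluster.local"
        members.append(f"{pod}=http://{host}:{peer_port}")
    return ",".join(members)


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    print(etcd_initial_cluster("etcd", "etcd-headless", "infra", 3))
```

On a platform without stable per-instance identities, this peer list has to be discovered or patched in at runtime, which is the kind of workaround the first bullet refers to.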
Figma's migration strategy and execution
- Careful Scoping:
  - Focused on core system swap (ECS to EKS) without changing abstractions for users.
  - Identified key wins: improved developer experience, reliability, and cost efficiency.
- Developer Experience Improvements:
  - Moved from Terraform-based service definitions to Bazel configuration files.
  - Automated YAML generation for services and ingress objects.
- Reliability Enhancements:
  - Implemented three separate EKS clusters per service for improved fault tolerance.
  - Reduced the blast radius of a cluster-level failure from a full outage to roughly one-third of a service.
- Incremental Rollout:
  - Used weighted DNS entries for gradual traffic shifting between ECS and EKS.
  - Allowed for quick rollbacks and minimal user impact.
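The incremental rollout can be pictured with a small simulation. The Python sketch below models weighted DNS as a weighted random choice between two backends and counts where lookups land at a given rollout percentage. The backend labels, function names, and the seed are assumptions made for illustration; real weighted DNS also involves resolver caching and TTLs, which this sketch ignores.

```python
import random
from collections import Counter
from typing import Dict


def rollout_weights(eks_percent: int) -> Dict[str, int]:
    """Split 100 weight units between the old and new platform."""
    if not 0 <= eks_percent <= 100:
        raise ValueError("eks_percent must be between 0 and 100")
    return {"ecs": 100 - eks_percent, "eks": eks_percent}


def resolve(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick one backend with probability proportional to its weight."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]


def simulate(eks_percent: int, lookups: int, seed: int = 0) -> Counter:
    """Count which backend each simulated lookup resolves to."""
    rng = random.Random(seed)
    weights = rollout_weights(eks_percent)
    return Counter(resolve(weights, rng) for _ in range(lookups))


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        print(percent, dict(simulate(percent, 10_000)))
```

Rolling back is the same operation in reverse: setting the new platform's weight to zero sends all new lookups to the old one, which is why this pattern keeps the blast radius of a bad step small.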
What can we learn from Figma's journey?
Looking at the technical details, Figma's migration reveals some fascinating points. They were running etcd inside their clusters, which ECS could not support natively because it has no equivalent of Kubernetes StatefulSets, forcing them to create workarounds. Their reliance on off-the-shelf OSS software, often packaged as Helm charts, was becoming a burden because they had to rewrite each chart for ECS using Terraform. Surprisingly, they weren't using ECS's autoscaling capabilities, which likely led to significant over-provisioning and unnecessary costs.
They also hit performance bottlenecks with AWS load balancers (ALB/NLB) during deployments, where target registration and deregistration were slow. In their new Kubernetes setup, they've made some interesting choices, like running three clusters per service (possibly for AZ isolation) and using Karpenter for node autoscaling. Their logging pipeline, which routes data through CloudWatch Logs and Lambda to Datadog and Snowflake, is a complex but common pattern in large-scale deployments.
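Node autoscaling of the kind mentioned above can be thought of as a packing problem: given the pods waiting to be scheduled, how many nodes are needed to fit them? The Python sketch below estimates that with a first-fit-decreasing heuristic over CPU requests only. This is an illustration of the idea and not Karpenter's actual algorithm, which also weighs memory, instance types, pricing, and scheduling constraints; the function name and sizes are hypothetical.

```python
from typing import List


def nodes_needed(pod_cpu_millicores: List[int], node_cpu_millicores: int) -> int:
    """Estimate node count with first-fit decreasing on CPU requests."""
    free: List[int] = []  # remaining CPU on each node opened so far
    for cpu in sorted(pod_cpu_millicores, reverse=True):
        if cpu > node_cpu_millicores:
            raise ValueError("a pod requests more CPU than an empty node offers")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                # Place the pod on the first node with enough room.
                free[i] = remaining - cpu
                break
        else:
            # No existing node fits, so open a new one.
            free.append(node_cpu_millicores - cpu)
    return len(free)


if __name__ == "__main__":
    # Five pending pods on hypothetical 4-vCPU nodes.
    print(nodes_needed([500, 1500, 1000, 2000, 250], 4000))
```

Sizing nodes from what is actually pending, rather than from a fixed fleet, is what addresses the over-provisioning described earlier.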
Lessons from Figma's Migration
- Gradual Rollout: Figma's approach of incrementally shifting traffic using weighted DNS entries is a best practice we often recommend. It allows for careful monitoring and quick rollbacks if issues arise.
- Focus on Developer Experience: By abstracting away raw Kubernetes YAML from developers, Figma has created a more user-friendly platform. This is crucial for adoption and productivity.
- Run Real Workloads Early: Migrating a service before completing the staging environment provided valuable insights.
- Embracing Cloud-Native Patterns: Moving to EKS allowed Figma to adopt cloud-native patterns like sidecar containers for logging, which can significantly simplify and optimize data pipelines.
- Tooling Refinement:
  - Simplified debug tooling to abstract cluster complexity from users.
  - Improved RBAC integration for easier permission management.
- Ongoing Optimizations:
  - Simplifying the logging pipeline design and horizontal pod auto-scaling with KEDA.
  - Migrating services to Graviton processors for cost savings.
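The developer-experience lesson above, hiding raw Kubernetes YAML behind a smaller service definition, can be sketched in a few lines. The Python example below turns a compact, hypothetical `ServiceSpec` into a Kubernetes `Service` manifest. It is a stand-in for the idea only: Figma's tooling is based on Bazel configuration files, and the class, function, and service names here are assumptions. The output is printed as JSON, which is also valid YAML and is accepted by `kubectl`.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ServiceSpec:
    """The only fields a developer fills in (hypothetical schema)."""
    name: str
    port: int
    target_port: int


def render_service(spec: ServiceSpec, namespace: str = "default") -> Dict[str, Any]:
    """Expand a compact spec into a full Kubernetes Service manifest."""
    labels = {"app": spec.name}
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": spec.name, "namespace": namespace, "labels": labels},
        "spec": {
            # Route traffic to pods carrying the same app label.
            "selector": labels,
            "ports": [{"port": spec.port, "targetPort": spec.target_port}],
        },
    }


if __name__ == "__main__":
    # "thumbnailer" is a made-up service name.
    print(json.dumps(render_service(ServiceSpec("thumbnailer", 80, 8080)), indent=2))
```

Keeping the generator in one place means platform-wide conventions, such as labels or namespaces, can change without every service team editing manifests by hand.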
Figma's journey demonstrates the transformative power of Kubernetes in enabling rapid growth and innovation. Their successful migration, completed in less than a year, showcases the importance of careful planning, incremental rollout, and close collaboration with stakeholders. As cloud-native technologies continue to evolve, Kubernetes remains at the forefront, providing a robust foundation for companies to build and scale their applications with confidence.
At Taikun, we've seen similar transformations in our clients' infrastructures after adopting Kubernetes. Figma's experience reinforces that it's not just about container orchestration; it's about embracing a new paradigm of cloud-native development and operations that can scale with your business needs.