At Notion, we use Spark for many types of production workloads that process large amounts of data. Some jobs are short, while others run for hours, storing and moving massive datasets between hundreds of workers.
As our workloads grew, it became important to reduce infrastructure costs without sacrificing reliability. We built a cost-efficient Spark setup on Kubernetes, but pushing the savings further led to failures that Spark could not handle well.
To solve this, we created Spot Balancer. This tool lets Spark jobs choose between cost and stability. Along with better resource use from our Kubernetes-based infrastructure, it has helped us save 60-90 percent on costs across our workloads.
Spark is designed to process data across many machines. A driver manages the work across executors, which can be added or removed as the job runs. Executors exchange data through shuffle operations. Spark handles single-executor failures well by retrying tasks, recomputing data, and typically keeping the job moving forward.
We used to run Spark on Amazon EMR with fixed EC2 instances. As our usage increased, this setup became limiting and created significant maintenance overhead. Clusters were often overprovisioned, underutilized, or required tuning for each job.
We switched to EMR on EKS, keeping EMR’s APIs and tools but running jobs on a shared Kubernetes platform. This gave us centralized infrastructure and enabled cost optimization across all jobs simultaneously.
On Kubernetes, we use Karpenter with EKS Auto Mode for node management.
Using this setup, Spark jobs do not need to set instance types or cluster sizes. They state their CPU and memory requirements, and Karpenter picks the optimal capacity, starts nodes as needed, and removes them when they are no longer needed. This dynamic provisioning both simplified and optimized our node management.
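As a rough illustration of what "stating CPU and memory requirements" can look like, the sketch below builds a set of Spark properties that declare per-executor resources without naming an instance type or cluster size. The property keys are standard Spark settings; the helper names (`ExecutorRequest`, `pod_memory_request_mib`, `spark_conf`) and all values are hypothetical and are not taken from our production configuration. The memory overhead rule follows Spark's documented default for JVM jobs (the larger of 384 MiB or 10 percent of executor memory).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorRequest:
    """Per-executor resources a job asks for (illustrative values only)."""
    cores: int
    memory_mib: int


def pod_memory_request_mib(req: ExecutorRequest, overhead_factor: float = 0.10) -> int:
    """Approximate the executor pod's memory request: heap plus overhead.

    Assumes Spark's default overhead rule for JVM jobs:
    max(384 MiB, overhead_factor * executor memory).
    """
    overhead = max(384, int(req.memory_mib * overhead_factor))
    return req.memory_mib + overhead


def spark_conf(req: ExecutorRequest, max_executors: int) -> dict[str, str]:
    """Build Spark properties that declare resources but no instance type."""
    return {
        "spark.executor.cores": str(req.cores),
        "spark.executor.memory": f"{req.memory_mib}m",
        # CPU request placed on the executor pod spec.
        "spark.kubernetes.executor.request.cores": str(req.cores),
        # Let Spark add and remove executors as the job runs.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


if __name__ == "__main__":
    request = ExecutorRequest(cores=4, memory_mib=8192)
    for key, value in spark_conf(request, max_executors=200).items():
        print(f"--conf {key}={value}")
    print("approx. pod memory request (MiB):", pod_memory_request_mib(request))
```

The node provisioner only ever sees the resulting pod requests, which is what lets it choose capacity on the job's behalf.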
We added a scheduler that uses the MostAllocated scoring strategy to place executors from different jobs on the same nodes, prioritizing existing capacity. This approach, known as bin-packing, fills existing nodes before adding new ones. This way, we can run more work on the same hardware and pay less for underutilized nodes.
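To make the bin-packing idea concrete, here is a simplified sketch of MostAllocated-style scoring. It is a toy model, not the Kubernetes scheduler's implementation: it considers only CPU and memory with equal weights, and the `Node` class and `place` function are hypothetical names introduced for illustration. Among the nodes where the executor fits, it picks the one that would be most utilized after placement.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_capacity: float  # cores
    mem_capacity: float  # GiB
    cpu_used: float = 0.0
    mem_used: float = 0.0

    def fits(self, cpu: float, mem: float) -> bool:
        """True if the request fits in the node's remaining capacity."""
        return (self.cpu_used + cpu <= self.cpu_capacity
                and self.mem_used + mem <= self.mem_capacity)

    def score_if_placed(self, cpu: float, mem: float) -> float:
        """Utilization (0-100) after placement; higher means fuller."""
        cpu_frac = (self.cpu_used + cpu) / self.cpu_capacity
        mem_frac = (self.mem_used + mem) / self.mem_capacity
        return 100 * (cpu_frac + mem_frac) / 2


def place(nodes: List[Node], cpu: float, mem: float) -> Optional[Node]:
    """Place a request on the fullest node that can still hold it.

    Returns None when nothing fits; in a real cluster that is the point
    at which a node provisioner would start a new node.
    """
    candidates = [n for n in nodes if n.fits(cpu, mem)]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: n.score_if_placed(cpu, mem))
    best.cpu_used += cpu
    best.mem_used += mem
    return best


if __name__ == "__main__":
    cluster = [
        Node("busy", cpu_capacity=16, mem_capacity=64, cpu_used=8, mem_used=32),
        Node("idle", cpu_capacity=16, mem_capacity=64),
    ]
    for _ in range(3):
        chosen = place(cluster, cpu=4, mem=16)
        print(chosen.name if chosen else "no node fits")
```

In this toy cluster the half-full node keeps receiving executors until it is full, and only then does the empty node get used. A spread-oriented strategy would do the opposite and leave both nodes partially idle.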