AWS Big Data Blog

Amazon EMR introduces EMR runtime for Apache Spark

Amazon EMR is happy to announce Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. On individual queries, EMR runtime for Spark is up to 32 times faster than EMR 5.16, with 100% API compatibility with open-source Spark. This means that your workloads run faster, saving you compute costs without requiring any changes to your applications.

Amazon EMR has been adding Spark runtime improvements since EMR 5.24, and discussed them in Optimizing Spark Performance. EMR 5.28 features several new improvements.

To measure these improvements, we compared EMR 5.16 (with open-source Apache Spark version 2.4) with EMR 5.28 (with EMR runtime for Apache Spark, compatible with Apache Spark version 2.4). We used TPC-DS benchmark queries at 3 TB scale and ran them on a six-node c4.8xlarge EMR cluster with data in Amazon S3. We measured performance improvements in two ways: as the geometric mean of the per-query improvement in execution time, and as the improvement in total execution time across all queries. The results showed considerable improvement: the geometric mean was 2.4 times faster and the total query runtime was 3.2 times faster.
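To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from paired per-query runtimes. The function name `speedup_summary` and the sample runtimes are hypothetical and are not taken from the benchmark; the sketch only assumes that each query has one runtime on the baseline and one on the candidate release.

```python
import math


def speedup_summary(baseline_secs, candidate_secs):
    """Return (geomean_speedup, total_speedup) for paired per-query runtimes.

    Both arguments are lists of runtimes in seconds, in the same query order.
    """
    if len(baseline_secs) != len(candidate_secs) or not baseline_secs:
        raise ValueError("need two non-empty lists of equal length")

    # Per-query speedup: baseline runtime divided by candidate runtime.
    ratios = [b / c for b, c in zip(baseline_secs, candidate_secs)]

    # Geometric mean, computed in log space to avoid overflow on long lists.
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    # Total-runtime speedup: dominated by the longest-running queries.
    total = sum(baseline_secs) / sum(candidate_secs)
    return geomean, total


# Hypothetical runtimes in seconds, NOT measurements from the benchmark.
baseline = [100.0, 40.0, 900.0]
candidate = [50.0, 40.0, 100.0]

geomean, total = speedup_summary(baseline, candidate)
print(f"geomean speedup: {geomean:.2f}x, total speedup: {total:.2f}x")
```

The geometric mean of the per-query ratios equals the ratio of the two geometric means, so either reading of the metric gives the same number. The total-runtime figure weights long queries more heavily, which is why the two metrics can differ.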

The following graph shows performance improvements measured as total runtime for 104 TPC-DS queries. EMR 5.28 has the better (lower) runtime.

The following graph shows performance improvements measured as the geometric mean for 104 TPC-DS queries. EMR 5.28 has the better (lower) geomean.

In breaking down the per-query improvements, you can observe the highest performance gains in long-running queries.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for long-running queries (running for more than 130 seconds in EMR 5.16). In this comparison, the higher numbers are better.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for short-running queries (running for less than 130 seconds). Again, the higher numbers are better.

Queries running for more than 130 seconds are up to 32 times faster, as seen in query 72. Queries running for less than 130 seconds are up to 6 times faster, with an average improvement of 2 times across the board.
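The long-running and short-running split can be reproduced on your own measurements with a small helper. The following Python sketch is illustrative only: the function name, query labels, and runtimes are hypothetical, and placing a query that takes exactly 130 seconds in the short-running group is an assumption, since the post does not say which group it belongs to.

```python
LONG_RUNNING_THRESHOLD_SECS = 130.0  # cutoff applied to the baseline runtime


def split_speedups(baseline_secs, candidate_secs,
                   threshold=LONG_RUNNING_THRESHOLD_SECS):
    """Return (long_running, short_running) dicts of query name -> speedup.

    A query is long-running when its baseline runtime exceeds the threshold.
    """
    long_running, short_running = {}, {}
    for query, base in baseline_secs.items():
        speedup = base / candidate_secs[query]
        # Assumption: exactly `threshold` seconds counts as short-running.
        bucket = long_running if base > threshold else short_running
        bucket[query] = speedup
    return long_running, short_running


# Hypothetical runtimes in seconds, NOT measurements from the benchmark.
baseline = {"q_a": 640.0, "q_b": 200.0, "q_c": 60.0, "q_d": 12.0}
candidate = {"q_a": 20.0, "q_b": 100.0, "q_c": 10.0, "q_d": 12.0}

long_q, short_q = split_speedups(baseline, candidate)
best_long = max(long_q, key=long_q.get)
print(f"largest long-running speedup: {best_long} at {long_q[best_long]:.0f}x")
```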

Customers use Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. They choose to run Spark on EMR because EMR provides the latest, stable, open-source community innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Auto Scaling. It also provides ease of use with managed EMR Notebooks, notebook-scoped libraries, Git integration, and easy debugging and monitoring with off-cluster Spark History Services. Combined with the runtime improvements, and fine-grained access control using AWS Lake Formation, Amazon EMR presents an excellent choice for customers running Apache Spark.

With each of these performance optimizations to Apache Spark, you benefit from better query performance. Stay tuned for additional updates that improve Apache Spark performance in Amazon EMR. To keep up to date, subscribe to the Big Data blog’s RSS feed for more Apache Spark optimizations, configuration best practices, and tuning advice.

About the Authors

Joseph Marques is a principal engineer for EMR at Amazon Web Services.

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.