There has been significant effort dedicated to improving the performance of data analytics frameworks, but comparatively little effort has been spent systematically identifying the performance bottlenecks of these systems. We set out to make data analytics framework performance easier to understand, both by quantifying the performance bottlenecks of distributed computation frameworks on a few benchmarks, and by providing tools that others can use to understand the performance of the workloads we ran and of their own workloads.
This project is summarized in a paper that will appear at USENIX NSDI 2015.
We collected JSON logs from running the big data benchmark and TPC-DS benchmark on clusters of 5-60 machines. All of the traces are licensed under a Creative Commons Attribution 4.0 International License.
The JSON logs linked below were generated by the Spark driver (version 1.2.1) while running various benchmarks on clusters of Amazon EC2 machines. All of the below traces were generated from clusters of m2.4xlarge machines, each of which has 68.4GiB of memory, two 840GB disks, 8 cores, and a 1Gbps network link.
These logs were enabled by setting spark.eventLog.enabled to true in the Spark configuration when running benchmark queries (see the Spark documentation for details).
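For example, here is a minimal PySpark sketch of that configuration; the application name and event-log directory are illustrative placeholders, not the values used for these traces:

```python
# Minimal sketch: configure a Spark (1.2.1-era) application so the driver
# writes JSON event logs like the traces linked below.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("benchmark-query")                    # placeholder app name
        .set("spark.eventLog.enabled", "true")            # emit JSON event logs
        .set("spark.eventLog.dir", "/tmp/spark-events"))  # placeholder log directory
sc = SparkContext(conf=conf)
```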
The traces are formatted as JSON data; if you’re interested in the nitty gritty, this file in the Spark code base generates the data. The trace includes a JSON object for each time a job starts and ends, each time a stage starts and ends, and each time a task starts and ends. The most useful information is the per-task information, which logs detailed information about the execution of each task (e.g., how long the task spent blocked on the network, blocked on disk I/O, performing Java garbage collection, and more).
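As an illustration, here is a hedged sketch of pulling the per-task information out of a trace. The field names below follow Spark 1.x's JsonProtocol, but verify them against your own trace; the file path is a placeholder:

```python
# Sketch: print per-task timing details from each SparkListenerTaskEnd event.
import json

with open("trace.json") as f:  # placeholder path to a downloaded trace
    for line in f:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics", {})
        shuffle_read = metrics.get("Shuffle Read Metrics", {})
        print("task", event["Task Info"]["Task ID"],
              "run time:", metrics.get("Executor Run Time", 0), "ms,",
              "GC:", metrics.get("JVM GC Time", 0), "ms,",
              "blocked on network:", shuffle_read.get("Fetch Wait Time", 0), "ms")
```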
The traces labeled TPC-DS ran a modified version of the TPC-DS benchmark. The modified version runs a subset of 20 of the TPC-DS queries that were selected for an industry benchmark.
| Link | Benchmark | Number of machines | Data format | Data size | Concurrent users | Notes |
|---|---|---|---|---|---|---|
| Trace | Big data benchmark | 5 | Compressed, on disk | 60GB | 1 | This trace includes 6 trials; in each trial, the 10 queries were executed in random order. |
| Trace | Big data benchmark | 5 | Spark SQL columnar cache | 60GB | 1 | This trace includes 6 trials; in each trial, the 10 queries were executed in random order. |
| Trace | TPC-DS | 20 | Parquet, on disk | 850GB | 13 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 13 users run in parallel; each user runs the 20 queries in series, in a random order. |
| Trace | TPC-DS | 20 | Parquet, on disk | 850GB | 1 | This trace includes 6 trials; in each trial, the 20 queries are executed in series. This trace is useful for understanding the total CPU, network, and disk resources consumed by each query (per-query resource consumption cannot be determined from the multi-user experiments, because the CPU, network, and disk counters reflect all tasks running on each machine, so when one machine executes multiple queries there is no way to attribute resource use to a particular query). See the aggregation sketch below the table. |
| Trace | TPC-DS | 20 | Spark SQL columnar cache | 17GB as on-disk Parquet files; 200GB in memory | 7 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 7 users run in parallel; each user runs the 20 queries in series, in a random order. |
| Trace | TPC-DS | 20 | Spark SQL columnar cache | 17GB as on-disk Parquet files; 200GB in memory | 1 | This trace includes 6 trials; in each trial, the 20 queries are executed in series. As with the single-user Parquet trace above, this trace is useful for understanding the total CPU, network, and disk resources consumed by each query. |
| Trace | TPC-DS | 60 | Parquet, on disk | 2500GB | 15 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 15 users run in parallel; each user runs the 20 queries in series, in a random order. |
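As referenced in the single-user trace notes above, here is a hedged sketch of aggregating per-query resource use from a single-user trace. Because the single-user traces run queries serially, each task can be attributed to the most recently started job; note that one SQL query may launch several jobs, and the field names again follow Spark 1.x's JsonProtocol:

```python
# Sketch: sum executor run time per job in a single-user (serial) trace,
# as a rough proxy for per-query CPU consumption. This attribution is only
# valid when jobs do not overlap, which holds for the single-user traces.
import json
from collections import defaultdict

run_time_ms = defaultdict(int)
current_job = None

with open("trace.json") as f:  # placeholder path to a downloaded trace
    for line in f:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerJobStart":
            current_job = event["Job ID"]
        elif kind == "SparkListenerTaskEnd" and current_job is not None:
            metrics = event.get("Task Metrics", {})
            run_time_ms[current_job] += metrics.get("Executor Run Time", 0)

for job, ms in sorted(run_time_ms.items()):
    print("job", job, "total executor run time:", ms, "ms")
```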
To understand Apache Spark's performance, I wrote a suite of visualization tools. Those tools are now deprecated, because the visualization is now part of Spark's UI. To visualize how the tasks in a stage spent their time, open the detail page for a particular stage and click the "Event Timeline" link. The UI will display a timeline (a screenshot is available on the Databricks Blog) that shows the breakdown of time for each task.
If you have questions about this project, the performance analysis tools, or the traces, contact Kay Ousterhout.