There has been significant effort dedicated to improving the performance of data analytics frameworks, but comparatively little effort has been spent systematically identifying the performance bottlenecks of these systems. We set out to make data analytics framework performance easier to understand, both by quantifying the performance bottlenecks of distributed computation frameworks on a few benchmarks, and by providing tools that others can use to understand the performance of the workloads we ran and of their own workloads.
This project is summarized in a paper that will appear at USENIX NSDI 2015.
We collected JSON logs from running the big data benchmark and TPC-DS benchmark on clusters of 5-60 machines. All of the traces are licensed under a Creative Commons Attribution 4.0 International License.
The JSON logs linked below were generated by the Spark driver (version 1.2.1) while running various benchmarks on clusters of Amazon EC2 machines. All of the below traces were generated from clusters of m2.4xlarge machines, each of which has 68.4GiB of memory, two 840GB disks, 8 cores, and a 1Gbps network link.
These logs were enabled by setting spark.eventLog.enabled to true in the Spark configuration when running benchmark queries (see the Spark documentation for details).
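For example, here is a minimal PySpark sketch of that configuration; the application name and event-log directory are illustrative placeholders, not the values used for these traces:

```python
# Minimal sketch: configure a Spark (1.2.1-era) application so the driver
# writes JSON event logs like the traces linked below.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("benchmark-query")                    # placeholder app name
        .set("spark.eventLog.enabled", "true")            # emit JSON event logs
        .set("spark.eventLog.dir", "/tmp/spark-events"))  # placeholder log directory
sc = SparkContext(conf=conf)
```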
The traces are formatted as JSON data; if you’re interested in the nitty gritty, this file in the Spark code base generates the data. The trace includes a JSON object for each time a job starts and ends, each time a stage starts and ends, and each time a task starts and ends. The most useful information is the per-task information, which logs detailed information about the execution of each task (e.g., how long the task spent blocked on the network, blocked on disk I/O, performing Java garbage collection, and more).
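As an illustration, here is a hedged sketch of pulling the per-task information out of a trace. The field names below follow Spark 1.x's JsonProtocol, but verify them against your own trace; the file path is a placeholder:

```python
# Sketch: print per-task timing details from each SparkListenerTaskEnd event.
import json

with open("trace.json") as f:  # placeholder path to a downloaded trace
    for line in f:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics", {})
        shuffle_read = metrics.get("Shuffle Read Metrics", {})
        print("task", event["Task Info"]["Task ID"],
              "run time:", metrics.get("Executor Run Time", 0), "ms,",
              "GC:", metrics.get("JVM GC Time", 0), "ms,",
              "blocked on network:", shuffle_read.get("Fetch Wait Time", 0), "ms")
```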
The traces labeled TPC-DS ran a modified version of the TPC-DS benchmark. The modified version runs a subset of 20 of the TPC-DS queries that were selected for an industry benchmark.
| Link | Benchmark | Number of machines | Data format | Data size | Concurrent users | Notes |
|---|---|---|---|---|---|---|
| Trace | Big data benchmark | 5 | Compressed, on disk | 60GB | 1 | This trace includes 6 trials; in each trial, the 10 queries were executed in random order. |
| Trace | Big data benchmark | 5 | Spark SQL columnar cache | 60GB | 1 | This trace includes 6 trials; in each trial, the 10 queries were executed in random order. |
| Trace | TPC-DS | 20 | Parquet, on disk | 850GB | 13 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 13 users run in parallel; each user runs the 20 queries in series, in a random order. |
| Trace | TPC-DS | 20 | Parquet, on disk | 850GB | 1 | This trace includes 6 trials; in each trial, the 20 queries are executed in series. This trace is useful for understanding the total CPU, network, and disk resources consumed by each query (per-query resource consumption cannot be determined from the multi-user experiments, because the CPU, network, and disk counters reflect all tasks running on each machine, so when one machine executes multiple queries there is no way to attribute resource use to a particular query). See the aggregation sketch below the table. |
| Trace | TPC-DS | 20 | Spark SQL columnar cache | 17GB as on-disk Parquet files; 200GB in memory | 7 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 7 users run in parallel; each user runs the 20 queries in series, in a random order. |
| Trace | TPC-DS | 20 | Spark SQL columnar cache | 17GB as on-disk Parquet files; 200GB in memory | 1 | This trace includes 6 trials; in each trial, the 20 queries are executed in series. As with the single-user Parquet trace above, this trace is useful for understanding the total CPU, network, and disk resources consumed by each query. |
| Trace | TPC-DS | 60 | Parquet, on disk | 2500GB | 15 | This trace includes a warmup period in which each query is run once. After the warmup period completes, the 15 users run in parallel; each user runs the 20 queries in series, in a random order. |
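As referenced in the single-user trace notes above, here is a hedged sketch of aggregating per-query resource use from a single-user trace. Because the single-user traces run queries serially, each task can be attributed to the most recently started job; note that one SQL query may launch several jobs, and the field names again follow Spark 1.x's JsonProtocol:

```python
# Sketch: sum executor run time per job in a single-user (serial) trace,
# as a rough proxy for per-query CPU consumption. This attribution is only
# valid when jobs do not overlap, which holds for the single-user traces.
import json
from collections import defaultdict

run_time_ms = defaultdict(int)
current_job = None

with open("trace.json") as f:  # placeholder path to a downloaded trace
    for line in f:
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerJobStart":
            current_job = event["Job ID"]
        elif kind == "SparkListenerTaskEnd" and current_job is not None:
            metrics = event.get("Task Metrics", {})
            run_time_ms[current_job] += metrics.get("Executor Run Time", 0)

for job, ms in sorted(run_time_ms.items()):
    print("job", job, "total executor run time:", ms, "ms")
```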
To understand Apache Spark's performance, I wrote a suite of visualization tools. Those tools are now deprecated, because the visualization is now part of Spark's UI. To visualize how the tasks in a stage spent their time, open the detail page for a particular stage and click the "Event Timeline" link. The UI will display a timeline (a screenshot is available on the Databricks Blog) that shows the breakdown of time for each task.
If you have questions about this project, the performance analysis tools, or the traces, contact Kay Ousterhout.