
Driven User Guide

version 1.2

Tuning Your Application

The performance view helps answer many questions about an application. Typical questions include:

  • How does the application decompose to MapReduce tasks?

  • Is there a particular cause of performance degradation, such as data skew, a network storm, poor application logic, or inadequate cluster resource provisioning?

You are in the performance view when the Slice Performance dashboard appears below a directed acyclic graph (DAG).

The Cascading Query Planner

A key component of a Cascading application is the query planner. When the application executes, the query planner compiles all of the data-processing steps, analyzes the dependencies between them, and builds a DAG for the application.

Figure 1: DAG rendering as compiled by the query planner. Options that you can click for additional information are noted at the top.
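
To make the idea of a dependency graph concrete, the following minimal Python sketch orders a set of steps so that each step runs only after the steps it depends on. It is an illustration only and is not how the Cascading query planner is implemented; the step names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps: each key maps to the set of steps it depends on.
step_dependencies = {
    "read_logs": set(),
    "read_users": set(),
    "parse_logs": {"read_logs"},
    "join_users": {"parse_logs", "read_users"},
    "count_by_region": {"join_users"},
    "write_report": {"count_by_region"},
}


def execution_order(dependencies):
    """Return one valid order in which every step follows its dependencies.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    that is, if the dependencies do not form a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(step_dependencies)
print(order)
```

Several orders can be valid for the same graph (for example, `read_logs` and `read_users` are independent of each other), which is why the sketch does not show a fixed expected output.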

The DAG is a dependency graph of the higher-level Cascading steps. The Cascading query planner iterates through the DAG, breaking it into smaller and smaller graphs (called expression graphs) until each graph matches a pattern associated with a unit of work, such as a mapper or a reducer.

Figure 2: Steps associated with their mappers and reducers, as well as their expression graphs

By decoupling the DAG from the units of work on the computation fabric, Cascading can run your application on many Big Data fabrics (such as MapReduce) without requiring you to rewrite or change your code.

You can add more granular metrics at the slice level of your application by adding counters. Click the Add counters button to display the available counters, then select the checkbox next to each counter that you want.

Figure 3: Adding counters to the slice performance dashboard

Understanding Bottlenecks in Your Application

In the slice performance dashboard, you can view information about slices (units of work such as map or reduce tasks) either individually or in aggregate.

Figure 4: This example shows skewed data at the slice level

Check whether any of your slices are skewed. In a MapReduce application, the data is divided and processed in roughly equal-sized chunks. If certain slices take noticeably longer than others to finish the same type of task on (presumably) similarly sized data, that is an anomaly and could indicate a problem with application execution.

Often, such skew indicates that the application is processing a large number of small files, which usually means that you need to optimize the environment. In other cases, depending on the skew dimension, it could indicate a network issue, which can delay the shuffle-sort operations in MapReduce.
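
The reasoning above can be sketched in a few lines of Python: compare each slice's duration with the median duration and flag the outliers. This is a hedged illustration, not a Driven feature or API; the slice IDs, the durations, the `find_skewed_slices` helper, and the threshold of twice the median are all assumptions made for the example.

```python
from statistics import median


def find_skewed_slices(durations, factor=2.0):
    """Return the IDs of slices that ran longer than factor * median duration.

    durations maps a slice ID to its run time in seconds. The default
    factor of 2.0 is an arbitrary assumption; tune it for your workload.
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(
        slice_id
        for slice_id, seconds in durations.items()
        if seconds > factor * typical
    )


# Made-up durations (in seconds) for five map slices of the same step.
slice_durations = {
    "map_0": 41.0,
    "map_1": 39.5,
    "map_2": 44.2,
    "map_3": 180.3,
    "map_4": 40.8,
}

skewed = find_skewed_slices(slice_durations)
print(skewed)  # ['map_3']
```

The median is used instead of the mean so that a single very slow slice does not raise the baseline it is being compared against.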

Viewing the Hadoop Dashboard

If there is a Hadoop dashboard for a step, the row for the step has a Job Tracker hyperlink.

Figure 5: Link to Hadoop dashboard
