Driven User Guide

version 1.1.1

Tuning your application with Performance View

One of the most important views into your Cascading application is the Performance view. This screen helps you answer many questions about your application:

How is the application decomposed into MapReduce tasks?
Is there a particular cause for performance degradation: data skew, network storm, poor application logic, or inadequate cluster resource provisioning?

Cascading Query Planner

Underneath the Cascading application is the Query Planner. When the Cascading application executes, the Query Planner compiles all the data-processing steps, analyzes their dependencies, and develops a Directed Acyclic Graph (DAG) for the application.

Operation_Dag

Figure 1: This is a sample of the Operation Directed Acyclic Graph (DAG) rendering compiled by the Query Planner.
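
To make the idea of a dependency graph concrete, here is a minimal sketch of how processing steps and their dependencies can be ordered. It is not the Cascading Query Planner and does not use the Cascading API; the class name StepDag, its methods, and the step names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy dependency graph. NOT the Cascading Query Planner; illustration only. */
final class StepDag {
    // step name -> names of the steps it depends on
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    /** Registers a step and the steps that must finish before it can start. */
    void addStep(String name, String... dependsOn) {
        dependencies.put(name, List.of(dependsOn));
        for (String dep : dependsOn) {
            // Make sure every referenced step exists, even if it is declared later.
            dependencies.putIfAbsent(dep, List.of());
        }
    }

    /** Returns the steps in an order that respects every dependency (Kahn's algorithm). */
    List<String> executionOrder() {
        Map<String, Integer> unmet = new LinkedHashMap<>();          // unmet dependency count
        Map<String, List<String>> dependents = new LinkedHashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : dependencies.entrySet()) {
            unmet.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue()) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String next : dependents.getOrDefault(step, List.of())) {
                // A step becomes ready once all of its dependencies have been emitted.
                if (unmet.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != dependencies.size()) {
            throw new IllegalStateException("cycle detected: the graph is not a DAG");
        }
        return order;
    }
}
```

The cycle check reflects the "acyclic" part of the name: if two steps depend on each other, no valid execution order exists, so the sketch rejects the graph rather than returning a partial order.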

The DAG is a dependency graph of the higher-level Cascading steps. The Cascading Query Planner iterates through the DAG, breaking it into smaller and smaller graphs (called expression graphs) until each graph matches a pattern associated with a unit of work: a mapper or a reducer.

Mapper_Reducer2

Figure 2: The application DAG broken down further into smaller components: steps and slices.

By decoupling the DAG from the units of work on the computation fabric, Cascading can run your application on many "Big Data" fabrics (such as MapReduce and Tez) without requiring you to rewrite or change your code.

You can add more granular metrics at the slice level of your application by adding counters. Click the Add counters button to display the available counters, then select each counter you want by clicking its checkbox.

Add_Counters

Figure 3: Adding counters to slices.

Understanding bottlenecks in your application

Driven provides insights into your application that were not previously available through the Hadoop dashboard.

In the Performance view, you can see slice information (a slice is a unit of work, such as a map or a reduce task) either for an individual slice or at an aggregate level.

Skew_Data

Figure 3: This example shows skewed data at the slice level.

Check whether any of your slices are skewed. In a MapReduce application, the data is divided and processed in equal-sized chunks. If certain slices take more time than others to finish a similar type of task on (presumably) similar-sized data, that is an anomaly. It could indicate a variety of problems, not all of which are within the scope of this document.

Often, these skews indicate that the application is processing a large number of small files, which is an opportunity for optimization. In other cases, depending on the 'skew dimension', they could indicate network issues, which can delay the shuffle-sort operations in MapReduce.
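
The Performance view shows skew visually, but the same check can be sketched in code if you have slice durations at hand. The following is a minimal sketch, not part of Driven or Cascading: the class SliceSkewCheck, the input array of slice durations in milliseconds, and the rule "longer than a chosen multiple of the median" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative skew check over slice durations; not part of Driven or Cascading. */
final class SliceSkewCheck {

    /** Median of the given durations (the input array is copied, not modified). */
    static double median(long[] durationsMs) {
        if (durationsMs.length == 0) {
            throw new IllegalArgumentException("no slice durations given");
        }
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /**
     * Returns the indexes of slices whose duration exceeds factor x median.
     * The factor is an assumed, tunable threshold (for example 2.0).
     */
    static List<Integer> findSkewedSlices(long[] durationsMs, double factor) {
        double threshold = median(durationsMs) * factor;
        List<Integer> skewed = new ArrayList<>();
        for (int i = 0; i < durationsMs.length; i++) {
            if (durationsMs[i] > threshold) {
                skewed.add(i);
            }
        }
        return skewed;
    }
}
```

The median is used as the baseline instead of the mean because a single very slow slice pulls the mean upward and can hide the very outlier you are looking for.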

Viewing Hadoop Dashboard

To view the Hadoop Dashboard, click the link shown at the step level.

Hadoop_JobTracker_Link

Figure 4: Link to Hadoop Dashboard.
