Driven User Guide

version 1.1.1

Understanding the Anatomy of your application

The application view provides insight into the construction of all the steps and flows that are part of your Cascading application. This view is particularly useful for tracking the development of an application over time as it grows in complexity and size.

DAG_Example_Figure1

Figure 1: A sample Directed Acyclic Graph (DAG) representing a Cascading application.

In addition, this application view can be used to:

  • Understand real-time dependencies between steps and flows

  • Visualize your application, tracking steps in the graph to line numbers in your code

  • Investigate log error messages and stack exceptions

  • Tune application logic

Understanding the Graph

When you execute your Cascading application, the underlying framework builds a rich state model to optimally execute the flow on the Hadoop cluster. The Driven plugin transmits this state model to the Driven application, where users can visualize the application as a Directed Acyclic Graph (DAG). Note that the DAG is created automatically by the Cascading layer; the application developer does not author the DAG shown by Driven through a separate interface.
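The state model described above can be thought of as a graph of steps whose execution order must respect their dependencies. The following is a minimal sketch in plain Java (not the Cascading API; all names are hypothetical) of representing such a model and deriving a valid execution order with a topological sort:

```java
import java.util.*;

// Hypothetical sketch (not the Cascading API): a state model viewed as a DAG
// of steps, where a valid execution order is any topological ordering.
public class DagSketch {

    // Adjacency list: an edge a -> b means step b consumes step a's output.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    public void addStep(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
    }

    public void addDependency(String upstream, String downstream) {
        addStep(upstream);
        addStep(downstream);
        edges.get(upstream).add(downstream);
    }

    // Kahn's algorithm: repeatedly emit steps with no unprocessed predecessors.
    public List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String s : edges.keySet()) inDegree.put(s, 0);
        for (List<String> outs : edges.values())
            for (String d : outs) inDegree.merge(d, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String s = ready.remove();
            order.add(s);
            for (String d : edges.get(s))
                if (inDegree.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (order.size() != edges.size())
            throw new IllegalStateException("cycle detected: not a DAG");
        return order;
    }
}
```

The acyclicity check at the end reflects why the graph must be a DAG: a cycle would mean no valid execution order exists.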

Note
Driven renders the DAG from the execution instance of the application. This becomes compelling when, over time, no one is able to document how the application was developed. Because the data is stored in Driven’s persistence layer, an application’s DAG representation can always be recreated. Without such an interface, it is very difficult to answer questions mapping business needs to technical implementation, especially for applications delivered by large teams spread across different regions.

In the graph, each node corresponds to a step or a processing function in your application code. You can jump to the specific code for a step by clicking the node link.

Tap_Details

Figure 2: Click the desired flow in the DAG to see details of the tap including the code line number.

Viewing the Graph

Your application can be viewed in three different ways:

Contracted View - The Contracted View is useful for complex and large applications.

Logical View - The Logical View (default) shows all the steps and taps (excluding implicit taps) and built-in functions.

Physical View - The Physical View shows all the steps, including implicit taps and built-in functions, and may therefore show more detail than the Logical View.

Driven models Cascading’s pipes metaphor by connecting the steps with lines. Note that these dependencies are inferred by Cascading — a step is dependent on another step only if it relies on the execution of the previous step to process the data. Cascading dynamically determines the dependencies between the flows — if the output (sink) of one flow is consumed by another flow (as a source), Driven will notate that dependency by connecting the two flows.
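The sink-to-source inference described above can be sketched in plain Java (this is an illustration of the rule, not Cascading internals; the flow names and tap paths are hypothetical):

```java
import java.util.*;

// Hypothetical sketch of the dependency rule Driven displays: if one flow's
// sink path is another flow's source path, the second flow depends on the first.
public class FlowDependencies {

    public static final class Flow {
        final String name;
        final Set<String> sources; // tap paths this flow reads
        final Set<String> sinks;   // tap paths this flow writes

        public Flow(String name, Set<String> sources, Set<String> sinks) {
            this.name = name;
            this.sources = sources;
            this.sinks = sinks;
        }
    }

    // Returns edges "producer -> consumer" wherever a sink feeds a source.
    public static List<String> inferEdges(List<Flow> flows) {
        List<String> result = new ArrayList<>();
        for (Flow producer : flows)
            for (Flow consumer : flows)
                if (producer != consumer)
                    for (String sink : producer.sinks)
                        if (consumer.sources.contains(sink))
                            result.add(producer.name + " -> " + consumer.name);
        return result;
    }
}
```

Given a flow that writes `/clean` and another that reads `/clean`, this yields the single edge the DAG would draw between them.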

Visualizing your end-to-end application as a DAG, along with operational data such as the data read and written at each step, can provide important insights for improving the performance of the application. For example, reviewing the DAG can expose opportunities to introduce Filter functions upstream in your code to reduce the volume of data being processed by the pipes, or to make Join functions more efficient.
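The payoff of filtering upstream can be made concrete with a small plain-Java sketch (not Cascading pipe assemblies; the record values and prefix predicate are hypothetical). A naive join compares every left record with every right record, so dropping irrelevant records before the join shrinks the work directly:

```java
import java.util.*;

// Illustrative sketch: filtering records before a join step reduces the
// volume of data the join must process.
public class UpstreamFilter {

    // Number of record pairs a naive nested-loop join would compare.
    public static long joinComparisons(List<String> left, List<String> right) {
        return (long) left.size() * right.size();
    }

    // Keep only records matching a predicate before handing them to the join.
    public static List<String> filterUpstream(List<String> records, String prefix) {
        List<String> kept = new ArrayList<>();
        for (String r : records)
            if (r.startsWith(prefix)) kept.add(r);
        return kept;
    }
}
```

Filtering four left records down to two halves the comparisons the downstream join performs; at cluster scale, the same principle reduces the bytes flowing through every pipe below the filter.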

Real-time visibility into your application

Driven renders your application as soon as your Cascading application starts. Driven provides real-time progress on your application, including highlighting the steps currently being executed, the number of steps completed, and the data read and written, among other things.

Having the most current information can be very useful: you can terminate a long-running job if you believe it is not executing properly. Likewise, if you see a sudden slowdown in the progress of your application, you can immediately start investigating the cause (a network storm or a rogue job submitted to the cluster, for example).

Counters_Updated

Figure 3: This example shows the application's counters being updated in real time.

One of the most interesting insights is the ability to track, in real time, the percentage of the application that has completed. For long-running applications (which are also very expensive), it is often useful to spot-check this behavior to ensure that there are no anomalies.

Status State of the Application

As the application runs, its status state is displayed instantly for further investigation, if necessary. The status states include:

Successful_State - The Successful status indicates that the application ran successfully.

Running_State - The Running status indicates that the application is currently running.

Warning_State - The Pending status indicates that there are problems during runtime.

Warning_State - The Started status indicates that the application has started.

Submitted_State - The Submitted status indicates that the application has been submitted for running.

Stopped_State - The Stopped status indicates that the application was stopped from running.

Failed_State - The Failed status indicates that the application failed to run.
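The lifecycle behind these states can be sketched as a small state machine. The following plain-Java sketch is an assumption about plausible transitions based only on the states listed above; Driven's actual transition rules are not documented here:

```java
import java.util.*;

// Hypothetical sketch of the status lifecycle; the exact transitions Driven
// uses are an assumption, inferred only from the states named in this guide.
public class AppStatus {

    public enum State { SUBMITTED, STARTED, RUNNING, PENDING, SUCCESSFUL, STOPPED, FAILED }

    // Assumed forward transitions between the states listed in this guide.
    private static final Map<State, Set<State>> NEXT = new EnumMap<>(State.class);
    static {
        NEXT.put(State.SUBMITTED, EnumSet.of(State.STARTED, State.STOPPED, State.FAILED));
        NEXT.put(State.STARTED, EnumSet.of(State.RUNNING, State.STOPPED, State.FAILED));
        NEXT.put(State.RUNNING, EnumSet.of(State.PENDING, State.SUCCESSFUL, State.STOPPED, State.FAILED));
        NEXT.put(State.PENDING, EnumSet.of(State.RUNNING, State.FAILED));
        NEXT.put(State.SUCCESSFUL, EnumSet.noneOf(State.class));
        NEXT.put(State.STOPPED, EnumSet.noneOf(State.class));
        NEXT.put(State.FAILED, EnumSet.noneOf(State.class));
    }

    public static boolean canTransition(State from, State to) {
        return NEXT.get(from).contains(to);
    }

    // Successful, Stopped, and Failed are terminal: no further transitions.
    public static boolean isTerminal(State s) {
        return NEXT.get(s).isEmpty();
    }
}
```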

Drilling down to the Flow, Slice, and Step Level

Driven allows you to drill down in your application to the level of individual flows and slices to view the nodes (mappers and reducers), whereupon you can further investigate associated tags.

Slice_Performance2

Figure 4: Drill down to the flows and slices to investigate any performance issues.

Stack Trace and Hadoop Job Tracker

For applications with a Failed status Failed_State, you can view the stack trace to investigate the errors further. Click Show failure info to display the stack trace.

Stack_Trace2

Figure 5: In this example, the stack trace shows the steps processed and their code line number.

A Hadoop Job Tracker dashboard tracks the completion percentage of mappers and reducers. Tracking progress at the application level is not easy on other compute fabrics because no underlying framework has end-to-end visibility into the application logic. For example, tracking progress in an ETL application developed with scripts, extended with Java-based User Defined Functions (UDFs), and orchestrated with brittle bash scripts is not feasible. In Cascading, however, regardless of the complexity of the application, the application is compiled into one JAR file and runs in a single JVM, which makes deploying and monitoring it possible. This also enables tracing the code and getting meaningful stack traces.

Hadoop_JobTracker_Page

Figure 6: Hadoop Job Tracker page.

Uncovering bottlenecks with the Timeline view

Cascading-based applications benefit from a framework that creates a state model and collects a rich set of instrumentation counters. Driven helps you visualize these counters in the right context, providing a methodology for tuning your applications.

Timeline_Diagnostics

Figure 7: The Timeline view helps you quickly scan the flows of your application to uncover any bottlenecks.

The Timeline view provides a detailed dashboard of the flows that comprise the application, helping you quickly identify which part of your application needs attention (assuming you will first attempt to tune the more expensive parts of the application).

To understand whether there is any processing latency in your application due to data, network, compute resources, or application logic, refer to the timing counters.
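A typical use of such timing counters is ranking flows by duration so tuning effort goes to the most expensive flow first. A minimal plain-Java sketch, assuming per-flow durations have already been collected (the flow names and counter values are hypothetical):

```java
import java.util.*;

// Illustrative sketch: given per-flow timing counters, rank flows from
// longest-running to shortest so the most expensive flow is tuned first.
public class TimelineScan {

    // Returns flow names ordered from longest to shortest duration (millis).
    public static List<String> slowestFirst(Map<String, Long> durations) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(durations.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) names.add(e.getKey());
        return names;
    }
}
```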
