
Driven Administration Guide

version 1.1.1

Understanding the Driven Architecture

Driven is a web application that provides visibility into all stages of your Cascading application, or of an application built with any framework that is developed on Cascading, such as Scalding. A Driven installation consists of a client-side install of the Driven plugin and an application running within the Apache Tomcat server (or any J2EE servlet container).

Figure 1: Driven deployment architecture

By default, the Driven server (the Driven application) is installed as an all-inclusive Web Application Archive (WAR) file containing the Tomcat server and a persistence layer that is based on Elasticsearch. The Driven application is architected for fault tolerance and scalability through redundancy in the persistence layer, and it includes features for backup and restore.

When you execute your Cascading application, the underlying framework builds a rich state model to optimally execute the flow on the Hadoop cluster. This includes taking higher-level Cascading primitives (see the Cascading User Guide) and mapping them to constructs available on the underlying computational fabric (MapReduce, Tez, local in-memory mode, etc.) using a sophisticated pattern-matching rules engine.

The Driven plugin collects this information and sends it to the Driven server for visualization. In addition, the Driven plugin collects rich metadata about each “slice” (a unit of work such as a mapper or a reducer) from the Hadoop NameNode. The Driven application analyzes and correlates the resulting statistics.

This architecture has several implications:

  • Your Cascading application will not appear to have completed execution until all the computation is finished and the Driven plugin has successfully transmitted all telemetry data to the Driven server.

  • The additional latency that results from the transmission of the telemetry data does not affect the time taken by your application to finish its data computation. In other words, your application’s Service Level Agreements (SLAs), such as producing a data set within a certain time period, are not affected, because the computation of data on the Hadoop cluster and the transmission of the telemetry data from your client plugin are decoupled.

  • The amount of data collected by the Cascading framework and the Driven plugin depends on the complexity of your application, such as the number of operations and branches. In addition, the Driven plugin collects information about each task/job that is part of your application. The larger your data sets, the more information is collected for analysis from the Hadoop NameNode. As a result, make sure that you have provisioned adequate resources for your Hadoop NameNode to prevent a bottleneck in collecting the telemetry signals. Also, make sure that you provision adequate memory for your Cascading application running with the Driven plugin. Do not rely on system defaults. To set your JVM memory limits, see Provisioning adequate memory for your Cascading applications.
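
The first two implications above can be easier to reason about with a toy model. The following Python sketch is illustrative only and is not Driven or Cascading code: the function name run_application, the "slice-N" event labels, and the simulated send delay are all assumptions made for this example. It models a compute step that records telemetry without waiting on the network, and a background sender that must drain its queue before the client process is considered finished.

```python
import queue
import threading
import time


def run_application(num_events, send_delay=0.01):
    """Toy model (hypothetical, not the Driven plugin): returns
    (compute_done, client_done, sent), with times in seconds."""
    events = queue.Queue()
    sent = []

    def sender():
        # Stand-in for the plugin transmitting telemetry to the server.
        while True:
            event = events.get()
            if event is None:  # sentinel: no more telemetry to send
                break
            time.sleep(send_delay)  # simulated network latency
            sent.append(event)

    worker = threading.Thread(target=sender)
    worker.start()

    start = time.monotonic()
    for i in range(num_events):
        # The compute step only enqueues telemetry; it never waits for the send.
        events.put(f"slice-{i}")
    compute_done = time.monotonic() - start  # the data set would be ready here

    events.put(None)
    worker.join()  # the client does not return until all telemetry is sent
    client_done = time.monotonic() - start
    return compute_done, client_done, sent


if __name__ == "__main__":
    compute_done, client_done, sent = run_application(20)
    print(f"computation finished after {compute_done:.4f}s")
    print(f"client finished after {client_done:.4f}s ({len(sent)} events sent)")
```

In this model the time at which the computation finishes does not depend on the send delay, while the time at which the client returns does. That mirrors the distinction drawn above between when the data set is produced and when the application appears to have completed.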