Cascading 4.0 User Guide - Cascades

1. Introduction

1.1. What Is Cascading?

2. Diving into the APIs

2.1. Anatomy of a Word-Count Application

3. Cascading Basic Concepts

3.1. Terminology

3.3. Pipes

3.4. Platforms

3.6. Sink Modes

3.7. Flows

4. Tuple Fields

4.1. Field Sets

5. Pipe Assemblies

5.1. Each and Every Pipes

5.2. Merge

5.3. GroupBy

5.4. CoGroup

5.5. HashJoin

6. Flows

6.1. Creating Flows from Pipe Assemblies

7. Cascades

7.1. Creating a Cascade

8. Configuring

8.1. Introduction

9. Local Platform

9.1. Building an Application

10. The Apache Hadoop Platforms

10.1. What is Apache Hadoop?

11. Apache Hadoop MapReduce Platform

11.1. Configuring Applications

11.3. Building

12. Apache Tez Platform

12.1. Configuring Applications

12.2. Building

13. Using and Developing Operations

13.1. Introduction

13.2. Functions

13.3. Filters

13.4. Aggregators

13.5. Buffers

14. Custom Taps and Schemes

14.1. Introduction

14.2. Custom Taps

15. Advanced Processing

15.1. SubAssemblies

16. Built-In Operations

16.1. Identity Function

16.9. Assertions

16.11. Buffers

17. Built-in SubAssemblies

17.1. Optimized Aggregations

18. Cascading Best Practices

18.1. Unit Testing

19. Extending Cascading

19.1. Scripting

20. Cookbook: Code Examples of Cascading Idioms

20.1. Tuples and Fields

20.5. API Usage

21. The Cascading Process Planner

21.1. FlowConnector

21.3. RuleRegistry

Cascades

cascade

A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order.

Further, Cascade instances act like a compiler build file: a Cascade only executes Flows that have stale sinks (i.e., output data that is older than the input data). For more about flows and sinks, see Skipping Flows.

Creating a Cascade

Example 1. Creating a new Cascade
CascadeConnector connector = new CascadeConnector();
Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );

When passing Flows to the CascadeConnector, order is not important. The CascadeConnector automatically identifies the dependencies between the given Flows and creates a scheduler that starts each Flow as its data sources become available. If two or more Flow instances have no interdependencies, they are submitted together so that they can execute in parallel.

If an instance of cascading.flow.FlowSkipStrategy is given to a Cascade instance (via the Cascade.setFlowSkipStrategy() method), it is checked for every Flow instance managed by that Cascade, and all skip strategies on those Flow instances are ignored.

The Cascade Topological Scheduler

Cascading has a simple class, Cascade, that executes a collection of Cascading Flows on a target cluster in dependency order.

The CascadeConnector class constructs a Cascade by building a virtual, internal graph that renders each Flow as a "vertex" and renders each file as an "edge." As a Cascade executes, the processes trace the topology of the graph by plotting each vertex in order of dependencies. When all incoming edges (i.e., files) of a vertex are available, it is scheduled on the cluster.

Consider the following example.

  • Flow 1 reads input file A and outputs B.

  • Flow 2 expects input B and outputs C and D.

  • Flow 3 expects input C and outputs E.

In the example above, Flow 1 goes first, Flow 2 goes second, and Flow 3 is last.

If two or more Flows are independent of one another, they are scheduled concurrently.

By default, if any outputs from a Flow are newer than the inputs, the Flow is skipped. The assumption is that the Flow was executed recently, since the output is not stale. So there is no reason to re-execute it and use up resources or add time to the application. A compiler behaves analogously when a source file is not updated before a recompile.

The Cascade topological scheduler is particularly helpful when you have a large set of jobs, with varying interdependencies, that must be executed as a logical unit. You can just pass the jobs to the CascadeConnector, which can determine the sequence of flows to uphold the dependency order.