Cascading 4.0 User Guide - Cascades
Cascades
A Cascade allows multiple Flow instances to be executed as a single logical unit. If there are dependencies between the Flows, they are executed in the correct order.
Further, Cascade instances act like a build file: a Cascade executes only those Flows that have stale sinks (i.e., output data that is older than the input data). For more about flows and sinks, see Skipping Flows.
Creating a Cascade
CascadeConnector connector = new CascadeConnector();
Cascade cascade = connector.connect( flowFirst, flowSecond, flowThird );
When passing Flows to the CascadeConnector, order is not important. The CascadeConnector automatically identifies the dependencies between the given Flows and creates a scheduler that starts each Flow as its data sources become available. If two or more Flow instances have no interdependencies, they are submitted together so that they can execute in parallel.
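Once connected, the whole unit of work is started by calling complete() on the Cascade; the call blocks until every Flow has finished or one of them fails:

cascade.complete();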
If an instance of cascading.flow.FlowSkipStrategy is given to a Cascade instance (via the Cascade.setFlowSkipStrategy() method), it is checked for every Flow instance managed by that Cascade, and all skip strategies on those Flow instances are ignored.
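For example, to skip every Flow whose sink outputs already exist, whether or not they are stale, a strategy such as the built-in cascading.flow.FlowSkipIfSinkExists can be set on the Cascade. This is a minimal sketch; the cascade instance is assumed to be connected as shown above:

// override the per-Flow strategies: skip any Flow whose sinks already exist
cascade.setFlowSkipStrategy( new FlowSkipIfSinkExists() );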
The Cascade Topological Scheduler
Cascading has a simple class, Cascade, that executes a collection of Cascading Flows on a target cluster in dependency order.
The CascadeConnector class constructs a Cascade by building an internal graph that represents each Flow as a "vertex" and each intermediate file as an "edge." As the Cascade executes, the scheduler traverses this graph in dependency order: when all the incoming edges (i.e., files) of a vertex are available, the corresponding Flow is scheduled on the cluster.
Consider the following example.
- Flow 1 reads input file A and outputs B.
- Flow 2 expects input B and outputs C and D.
- Flow 3 expects input C and outputs E.
In the example above, Flow 1 goes first, Flow 2 goes second, and Flow 3 is last.
If two or more Flows are independent of one another, they are scheduled concurrently.
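A minimal sketch of this example, assuming flow1, flow2, and flow3 are hypothetical Flow instances already bound to taps for files A through E. The Flows are passed to the connector out of order, yet the scheduler still runs flow1 first, then flow2, then flow3:

Cascade cascade = new CascadeConnector().connect( flow3, flow1, flow2 );
cascade.complete(); // blocks until all three Flows have finished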
By default, if all of a Flow's outputs are newer than its inputs, the Flow is skipped: its output is not stale, so re-executing the Flow would only consume resources and add time to the application. A compiler behaves analogously when a source file has not been updated since the last build.
The Cascade topological scheduler is particularly helpful when you have a large set of jobs, with varying interdependencies, that must be executed as a logical unit. You simply pass the jobs to the CascadeConnector, which determines a sequence of Flows that satisfies the dependency order.