Cascading 4.0 User Guide - Introduction
Introduction
What Is Cascading?
Cascading is a data processing API and processing planner used for defining, sharing, and executing data-oriented applications. These applications can execute on a single computing node or distributed computing platform with minimal code changes.
Cascading applications can be written in a way that minimizes coupling to any specific computing platform, allowing developers to create data-parallel applications with few platform dependencies. Fewer platform dependencies improve portability.
On a single node, Cascading’s "local mode" can be used to efficiently test code and process local files before the application is deployed on a cluster.
On a distributed computing cluster, Cascading natively supports the Apache Hadoop platform. Dedicated planners are provided for both MapReduce and Apache Tez directed acyclic graphs (DAGs).
Additionally, Cascading can support new platforms as they emerge so business logic will not need to be rewritten in order to leverage state-of-the-art technologies. Visit http://www.cascading.org for announcements and details on newly added platforms.
Cascading is open-source under the Apache 2.0 License.
Another Perspective
Another way to look at Cascading is to consider the typical stack behind running a query written in a language such as SQL:
- The query syntax is parsed into an abstract syntax tree (AST).
- The AST is logically reduced to an intermediate model. During this processing, many logical optimizations can be applied.
- The intermediate model is translated into an executable model. Much of the processing between the intermediate and executable models applies physical optimizations.
Cascading provides both the intermediate model and a planner that translates that model into a physical, executable model. The intermediate model is presented as a Java API, and the planner is both modifiable and pluggable.
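For instance, a minimal local-mode sketch of this division of labor might look like the following, where the Pipe assembly is the intermediate model and the FlowConnector is the planner (the file paths and class name are illustrative):

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class CopyApp
  {
  public static void main( String[] args )
    {
    // taps bind the logical assembly to concrete data sources and sinks
    Tap inTap = new FileTap( new TextLine( new Fields( "line" ) ), "input.txt" );
    Tap outTap = new FileTap( new TextLine(), "output.txt" );

    // the pipe assembly is the intermediate model (here a simple pass-through)
    Pipe pipe = new Pipe( "copy" );

    // the planner translates the assembly into an executable Flow
    Flow flow = new LocalFlowConnector().connect( inTap, outTap, pipe );
    flow.complete();
    }
  }

The same assembly can be handed to a different FlowConnector to target a different platform, which is the portability property described above.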
Why Use Cascading?
Cascading was developed to allow organizations to rapidly develop complex data processing applications. The need for Cascading is typically driven by one or more of four cases:
- Increasing data size
- Increasing computation requirements
- Increasing complexity in data centers
- Increasing need for accountability and manageability
Cascading applications scale seamlessly with load. Business logic written for small data that may become big data does not require changes as the data grows.
Cascading provides many core data processing primitives and operations. Business logic can be coded directly against the Cascading APIs, or higher-order frameworks and languages can be created on top. These frameworks and languages can be used to improve developer or analyst productivity for a specific use case or type of enterprise.
Cascading as a platform allows integration and business logic to remain decoupled until runtime, so that different storage platforms can be bound to the logic on demand. New platforms can be leveraged during computation as business needs or infrastructure resources change.
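For example, the assembly from the earlier local-mode sketch could be rebound to HDFS without touching the business logic, simply by swapping the taps, scheme, and planner. This sketch assumes the Hadoop 2 MapReduce binding; the paths are illustrative:

import cascading.flow.Flow;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// the logical assembly is unchanged from the local-mode version
Pipe pipe = new Pipe( "copy" );

// only the taps and the planner change in order to target HDFS and MapReduce
Tap inTap = new Hfs( new TextLine( new Fields( "offset", "line" ) ), "hdfs:/data/input" );
Tap outTap = new Hfs( new TextLine(), "hdfs:/data/output" );

Flow flow = new Hadoop2MR1FlowConnector().connect( inTap, outTap, pipe );
flow.complete();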
Cascading can be managed. Besides a number of core features that allow for Cascading applications to be easily operationalized, Cascading works with commercial tools (like Driven) so that applications can be monitored in real-time. Developers gain in-depth understanding of application behavior at runtime in order to improve performance and reliability. Operations teams that are responsible for managing business processes can easily monitor application status.
The Cascading Philosophy
Cascading was designed to be extensible, to behave deterministically, and to fail fast where possible.
Many default features and behaviors of Cascading can be replaced or overridden. Where the Cascading User Guide does not provide guidance, review the Javadoc for APIs and properties that can be used or modified.
The execution plans created by the Cascading planner are both intuitive and stable:
- As processing logic is added during application development, the resulting plan grows only incrementally, in proportion to the changes that were just applied. After becoming familiar with Cascading, you can easily understand the implication of adding a new CoGroup to the assembly and how it will behave on the cluster (see the sketch following this list).
- Every execution plan is the same when there are no code changes. Across versions of Cascading, any likely behavioral differences are documented (if not accompanied by a way to revert the behavior).
- The Cascading planner creates all units of work up front. The planner verifies all dependencies are met from available resources like sources and sinks, down to the field level. If a downstream operation requires a specific field (for example, "zipcode"), the planner guarantees it is available upstream from a data source or operation.
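To make the first point concrete, a CoGroup is added to an assembly like any other pipe, and the planner introduces exactly one corresponding grouping step for it. A minimal sketch, with illustrative pipe and field names:

import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// two upstream branches, assumed to carry fields (url, pageRank) and (url, visitCount)
Pipe urls = new Pipe( "urls" );
Pipe visits = new Pipe( "visits" );

// join the branches on "url"; the declared result fields must be unique,
// so the right-hand key is renamed to "visitedUrl"
Pipe joined = new CoGroup( urls, new Fields( "url" ), visits, new Fields( "url" ),
  new Fields( "url", "pageRank", "visitedUrl", "visitCount" ) );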
Many of these ideals seem obvious, but many systems regenerate plans on the fly or attempt shortcuts during execution. Even when those approaches succeed, there is no certainty that they improve runtime performance.
For applications that may run for hours or days at a stretch, it is not acceptable to fail partway through in a non-deterministic (non-repeatable) fashion.
Who Are the Users?
Cascading users typically fall into three roles:
The application executor is a person (for example, a developer or analyst) or process (for example, a cron job) that runs a data processing application on a given cluster. This is typically done via the command line, using a prepackaged JAR file compiled against the Cascading libraries. The application can accept command-line parameters for customization during execution.
The process assembler assembles data processing workflows into unique applications. This work is generally a development task that involves chaining together operations to act on input data sets so that they produce output data sets. Development can be done with the default Java Cascading API, the Fluid API, or with a scripting language such as Scala, Clojure, Groovy, JRuby, or Jython. Cascading also supports domain-specific languages implemented in these scripting languages.
The operation developer writes individual functions or operations (typically in Java) or reusable subassemblies that act on the data that passes through the data processing workflow. A trivial example would be a parser that takes a string and converts it to an integer. Operations are equivalent to Java functions in the sense that they take input arguments and return data. Operations can execute at any granularity, from simply parsing a string to performing complex procedures on the argument data using third-party libraries. Cascading provides many prebuilt operations.
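A minimal sketch of that trivial parser, following Cascading's BaseOperation/Function pattern (the class name is ours):

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// a trivial custom operation: parse a string argument into an integer
public class ParseIntFunction extends BaseOperation implements Function
  {
  public ParseIntFunction( Fields fieldDeclaration )
    {
    super( 1, fieldDeclaration ); // expects one argument, declares the given output fields
    }

  @Override
  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
    {
    String value = functionCall.getArguments().getString( 0 );
    functionCall.getOutputCollector().add( new Tuple( Integer.parseInt( value ) ) );
    }
  }

Such an operation would typically be applied to a stream with an Each pipe, naming the argument field it should receive.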
All three roles can be filled by a single developer. But because Cascading supports a clean separation of these responsibilities, in some organizations non-developers run ad-hoc applications or build production processes.
As of Cascading 3.0, two new roles have emerged:
The platform optimizer improves execution runtime or resource utilization. This role can be responsible for creating workload-specific query-plan rules that target specific improvements. Another responsibility of the platform optimizer could be to add new core primitives that can be leveraged in application logic.
The platform developer ports Cascading to new computing platforms. This is an emerging API, but it has already proven to be robust and powerful. As business needs change and new technologies emerge, a developer can create bindings to these new technologies, allowing existing investments in the Cascading API and the broader ecosystem to be leveraged.