Cascading 4.0 User Guide - Extending Cascading

Extending Cascading

Scripting

The Cascading API was designed with scripting in mind. Any JVM-compatible scripting language can import and instantiate Cascading classes, create pipe assemblies and Flows, and execute those Flows. And if the scripting language in question supports domain-specific language (DSL) creation, users can create their own DSLs to handle common idioms.

The Cascading website (http://cascading.org/extensions/) includes information on scripting language bindings that are publicly available.

Custom Types and Serialization

The Tuple class is a generic container for all java.lang.Object instances.

Thus any primitive value or custom class can be stored in a Tuple instance and returned as a result value by a Function, Aggregator, or Buffer.

Unfortunately, there is no common cross-platform method for managing the serialization of custom types. See the platform-specific topics of this User Guide for details about registering serializers that Cascading can adopt at runtime.

Custom Comparators and Hashing

Frequently, objects in one Tuple are compared to objects in a second Tuple. This is especially true during the sort phase of GroupBy and CoGroup. By default, Cascading uses the native Object methods equals() and hashCode() to compare two values and to get a consistent hash code for a given value, respectively.
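Because the default behavior relies on equals() and hashCode(), a custom type stored in a Tuple should implement both consistently. The following is a minimal sketch of such a type; the Person class and its two fields are hypothetical and chosen only for illustration.

```java
import java.util.Objects;

// Hypothetical custom type; the class name and fields are illustrative only.
final class Person {
    private final String lastName;
    private final String firstName;

    Person(String lastName, String firstName) {
        this.lastName = lastName;
        this.firstName = firstName;
    }

    String getLastName() {
        return lastName;
    }

    String getFirstName() {
        return firstName;
    }

    // Two Person values are equal when both name parts are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Person)) {
            return false;
        }
        Person that = (Person) other;
        return Objects.equals(lastName, that.lastName)
            && Objects.equals(firstName, that.firstName);
    }

    // Built from the same fields as equals(), so equal values hash equally.
    @Override
    public int hashCode() {
        return Objects.hash(lastName, firstName);
    }
}
```

Keeping hashCode() derived from exactly the fields that equals() inspects is what makes the hash code consistent for values that compare as equal.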

There are two different approaches that you can take to override the default behavior:

  • Create a java.util.Comparator class to perform comparisons on a given field in a Tuple. For instance, to secondary-sort a collection of custom Person objects in a GroupBy, use the Fields.setComparator() method to assign the custom Comparator to the Fields instance that specifies the sort fields.

  • Alternatively, you can set a default Comparator for a Flow or for a local Pipe instance in one of the following ways:

    • Either calling FlowProps.setDefaultTupleElementComparator() on a Properties instance

    • Or using the cascading.flow.tuple.element.comparator property key
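The property-key approach can be sketched with plain java.util.Properties. This is a minimal sketch that assumes the property value is the fully qualified class name of the Comparator; com.example.PersonComparator and the helper class below are hypothetical names, not part of Cascading.

```java
import java.util.Properties;

// Sketch: registering a default element comparator by property key.
final class DefaultComparatorProps {
    // Property key named in the list above.
    static final String COMPARATOR_KEY = "cascading.flow.tuple.element.comparator";

    // Returns a Properties instance carrying the comparator class name.
    // Assumption: the value is a fully qualified Comparator class name.
    static Properties withDefaultComparator(String comparatorClassName) {
        Properties properties = new Properties();
        properties.setProperty(COMPARATOR_KEY, comparatorClassName);
        return properties;
    }

    public static void main(String[] args) {
        // com.example.PersonComparator is a hypothetical class name.
        Properties properties = withDefaultComparator("com.example.PersonComparator");
        System.out.println(properties.getProperty(COMPARATOR_KEY));
    }
}
```

The resulting Properties instance would then be supplied along with the rest of the Flow configuration.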

If the hash code must also be customized, the custom Comparator can implement the cascading.tuple.Hasher interface.
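As an illustration, a Comparator that also customizes hashing might look like the sketch below. So that it compiles without Cascading on the classpath, it declares a local stand-in Hasher interface, on the assumption that cascading.tuple.Hasher declares a single hashCode(value) method; in a real application the Cascading interface would be imported instead. The CaseInsensitiveComparator name is hypothetical, and Serializable is included on the assumption that the comparator travels with the Fields instance it is attached to.

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.Locale;

// Local stand-in so the sketch compiles on its own.
// Assumption: cascading.tuple.Hasher has this single-method shape.
interface Hasher<V> {
    int hashCode(V value);
}

// Hypothetical comparator that orders and hashes Strings without regard to case.
final class CaseInsensitiveComparator
        implements Comparator<String>, Hasher<String>, Serializable {
    private static final long serialVersionUID = 1L;

    // Shared normalization keeps compare() and hashCode(value) consistent.
    // In this sketch a null value is treated like the empty string.
    private static String normalize(String value) {
        return value == null ? "" : value.toLowerCase(Locale.ROOT);
    }

    @Override
    public int compare(String lhs, String rhs) {
        return normalize(lhs).compareTo(normalize(rhs));
    }

    // Values that compare as equal must produce the same hash code.
    @Override
    public int hashCode(String value) {
        return normalize(value).hashCode();
    }
}
```

An instance of such a class could be passed to Fields.setComparator() for the relevant field, so that values differing only in case sort together and hash to the same value.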

For more information, see the Javadoc.