Cascading 4.0 User Guide - Extending Cascading
Extending Cascading
Scripting
The Cascading API was designed with scripting in mind. Any JVM-compatible scripting language can import and instantiate Cascading classes, create pipe assemblies and Flows, and execute those Flows. If the scripting language supports the creation of domain-specific languages (DSLs), users can create their own DSLs to handle common idioms.
The Cascading website (http://cascading.org/extensions/) includes information on scripting language bindings that are publicly available.
Custom Types and Serialization
The Tuple class is a generic container for all java.lang.Object instances.
Thus any primitive value or custom class can be stored in a Tuple instance — that is, returned by a Function, Aggregator, or Buffer as a result value.
Unfortunately, there is no common cross-platform method for managing the serialization of custom types. See the platform-specific topics of this User Guide for details about registering serializers that Cascading can adopt at runtime.
Custom Comparators and Hashing
Frequently, objects in one Tuple are compared to objects in a second Tuple. This is especially true during the sort phase of GroupBy and CoGroup. By default, Cascading uses the native Object methods equals() and hashCode() to compare two values and to get a consistent hash code for a given value, respectively.
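Because the default comparison relies on equals() and hashCode(), a custom value class stored in a Tuple should override both methods consistently. The following is a minimal sketch in plain Java with no Cascading dependencies. The AccountId class and its fields are hypothetical and serve only as an illustration.

```java
import java.util.Objects;

// Hypothetical value class, for illustration only; it is not part of Cascading.
final class AccountId {
    private final String region;
    private final long number;

    AccountId(String region, long number) {
        this.region = Objects.requireNonNull(region, "region");
        this.number = number;
    }

    // Two AccountId values are equal when both fields match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof AccountId)) {
            return false;
        }
        AccountId that = (AccountId) other;
        return number == that.number && region.equals(that.region);
    }

    // Built from the same fields as equals(), so equal values always hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(region, number);
    }
}
```

If the two methods disagree, values that should land in the same grouping may be treated as distinct, so it is worth deriving both from the same set of fields.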
There are two different approaches that you can take to override the default behavior:
- Create a java.util.Comparator class to perform comparisons on a given field in a Tuple. For instance, to secondary-sort a collection of custom Person objects in a GroupBy, use the Fields.setComparator() method to assign the custom Comparator to the Fields instance that specifies the sort fields.
- Alternatively, set a default Comparator for a Flow or for a local Pipe instance in one of the following ways:
  - Call FlowProps.setDefaultTupleElementComparator() on a Properties instance.
  - Set the cascading.flow.tuple.element.comparator property key.
- If the hash code must also be customized, the custom Comparator can implement the cascading.tuple.Hasher interface.
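The approaches listed above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the definitive Cascading usage. The Person, PersonComparator, and DefaultComparatorConfig classes are hypothetical. The sketch deliberately avoids Cascading imports so that it compiles on its own. It assumes that the property value is the Comparator class name and that the Hasher interface is satisfied by a single hashCode(value) method. Check both assumptions against the Javadoc.

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.Locale;
import java.util.Properties;

// Hypothetical value class, for illustration only; it is not part of Cascading.
final class Person {
    final String lastName;
    final int age;

    Person(String lastName, int age) {
        this.lastName = lastName;
        this.age = age;
    }
}

// Orders Person values by lower-cased last name, then by age.
// Serializable is an assumption: a comparator may need to be shipped to remote tasks.
final class PersonComparator implements Comparator<Person>, Serializable {
    private static final long serialVersionUID = 1L;

    private static String nameKey(Person person) {
        return person.lastName.toLowerCase(Locale.ROOT);
    }

    @Override
    public int compare(Person left, Person right) {
        int byName = nameKey(left).compareTo(nameKey(right));
        return byName != 0 ? byName : Integer.compare(left.age, right.age);
    }

    // Hash code that agrees with compare(): values that compare as equal hash alike.
    // Assumption: with Cascading on the classpath, the class would also declare
    // "implements cascading.tuple.Hasher<Person>", and this method would fulfil it.
    public int hashCode(Person value) {
        return 31 * nameKey(value).hashCode() + value.age;
    }
}

final class DefaultComparatorConfig {
    // Property key as named in this guide.
    static final String COMPARATOR_KEY = "cascading.flow.tuple.element.comparator";

    // Assumption: the property value is the fully qualified Comparator class name.
    static Properties create() {
        Properties properties = new Properties();
        properties.setProperty(COMPARATOR_KEY, PersonComparator.class.getName());
        return properties;
    }
}
```

In a Cascading application, a PersonComparator instance would be attached to the sort fields through Fields.setComparator(), or the Properties instance would be passed on when the Flow is configured. Those calls are omitted here so that the sketch has no external dependencies.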
For more information, see the Javadoc.