Cascading 4.0 User Guide - Built-In SubAssemblies

1. Introduction: 1.1. What Is Cascading?

1.2. Another Perspective

1.3. Why Use Cascading?

1.4. The Cascading Philosophy

1.5. Who Are the Users?
2. Diving into the APIs: 2.1. Anatomy of a Word-Count Application

2.2. Fluid: An Alternative Fluent API
3. Cascading Basic Concepts: 3.1. Terminology

3.2. Pipe Assemblies

3.3. Pipes

3.4. Platforms

3.5. Sourcing and Sinking Data

3.6. Sink Modes

3.7. Flows
4. Tuple Fields: 4.1. Field Sets

4.2. Field Algebra

4.3. Field Typing

4.4. Type Coercion
5. Pipe Assemblies: 5.1. Each and Every Pipes

5.2. Merge

5.3. GroupBy

5.4. CoGroup

5.5. HashJoin
6. Flows: 6.1. Creating Flows from Pipe Assemblies

6.2. Configuring Flows

6.3. Skipping Flows

6.4. Creating Custom Flows

6.5. Process Levels in the Flow Hierarchy

6.6. Runtime Metrics
7. Cascades: 7.1. Creating a Cascade

7.2. The Cascade Topological Scheduler
8. Configuring: 8.1. Introduction

8.2. Creating Properties

8.3. Passing Properties
9. Local Platform: 9.1. Building an Application

9.2. Executing an Application

9.3. Source and Sink Taps

9.4. Troubleshooting and Debugging
10. The Apache Hadoop Platforms: 10.1. What is Apache Hadoop?

10.2. Hadoop 1 MapReduce vs. Hadoop 2 MapReduce

10.3. Hadoop 2 MapReduce vs Hadoop 2 Tez

10.4. Configuring Applications

10.5. Building an Application

10.6. Executing an Application

10.7. Troubleshooting and Debugging

10.8. Source and Sink Taps

10.9. Custom Taps and Schemes

10.10. Partial Aggregation instead of Combiners

10.11. Custom Types and Serialization
11. Apache Hadoop MapReduce Platform: 11.1. Configuring Applications

11.2. Creating Flows from a JobConf

11.3. Building
12. Apache Tez Platform: 12.1. Configuring Applications

12.2. Building
13. Using and Developing Operations: 13.1. Introduction

13.2. Functions

13.3. Filters

13.4. Aggregators

13.5. Buffers

13.6. Operation and BaseOperation
14. Custom Taps and Schemes: 14.1. Introduction

14.2. Custom Taps

14.3. Custom Schemes

14.4. Taps with File and Nonfile Resources

14.5. Tap Life-Cycle Methods
15. Advanced Processing: 15.1. SubAssemblies

15.2. Stream Assertions

15.3. Failure Traps

15.4. Checkpointing

15.5. Restarting a Checkpointed Flow

15.6. Flow and Cascade Event Handling

15.7. PartitionTaps

15.8. Partial Aggregation instead of Combiners
16. Built-In Operations: 16.1. Identity Function

16.2. Debug Function

16.3. Sample and Limit Functions

16.4. Insert Function

16.5. Text Functions

16.6. Regular Expression Operations

16.7. Java Expression Operations

16.8. XML Operations

16.9. Assertions

16.10. Logical Filter Operators

16.11. Buffers
17. Built-in SubAssemblies: 17.1. Optimized Aggregations

17.2. Stream Shaping
18. Cascading Best Practices: 18.1. Unit Testing

18.2. Flow Granularity

18.3. SubAssemblies, not Factories

18.4. Logical Responsibilities for SubAssemblies

18.5. Java Operators in Field Names

18.6. Debugging Planner Failures

18.7. Optimizing Joins

18.8. Debugging Streams

18.9. Handling Good and Bad Data

18.10. Maintaining State in Operations

18.11. Fields Constants

18.12. Checking the Source Code
19. Extending Cascading: 19.1. Scripting

19.2. Custom Types and Serialization

19.3. Custom Comparators and Hashing
20. Cookbook: Code Examples of Cascading Idioms: 20.1. Tuples and Fields

20.2. Stream Shaping

20.3. Common Operations

20.4. Stream Ordering

20.5. API Usage
21. The Cascading Process Planner: 21.1. FlowConnector

21.2. RuleRegistrySet

21.3. RuleRegistry

21.4. Debugging RuleRegistrySets

Built-in SubAssemblies

There are a number of helper SubAssemblies provided by the core Cascading library.

Many of the assemblies that are described below can be coded to ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform.

Optimized Aggregations

The following SubAssemblies are implementations or optimizations of more discrete aggregate functions. Many of these SubAssemblies rely on the AggregateBy base class, which can be subclassed by developers to create custom aggregations leveraging the internal partial aggregation implementation.

Unique

The cascading.pipe.assembly.Unique SubAssembly is used to remove duplicate values in a Tuple stream. Uniqueness is determined by the values of all fields listed in uniqueFields. Use Fields.ALL as the uniqueFields argument to find all distinct Tuples in a stream.

Example 1. Using the Unique SubAssembly

// incoming -> first, last

assembly = new Unique( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last

Unique uses the FirstNBuffer to more efficiently determine unique values.

AggregateBy

The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the partial aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.

Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.

Example 2. Composing partial aggregations with AggregateBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );

// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );

Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );

assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );

To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.

AverageBy

The cascading.pipe.assembly.AverageBy SubAssembly calculates an average of the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 3. Using AverageBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );

CountBy

The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 4. Using CountBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );

SumBy

The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so that they may be executed simultaneously over the same grouping.

Example 5. Using SumBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
  new SumBy( assembly, groupingFields, valueField, sumField, long.class );

FirstBy

The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 6. Using FirstBy

Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );

// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );

assembly =
  new FirstBy( assembly, groupingFields, valueField );

Note if the valueFields Fields instance has field comparators, they will be used to sort the order of argument values. Otherwise, the fields will not be sorted in any deterministic order.

Stream Shaping

Coerce

The cascading.pipe.assembly.Coerce SubAssembly is used to coerce a set of values from one type to another type — for example, to convert the age field from a String to an Integer.

Example 7. Using Coerce

// incoming -> first, last, age

assembly =
  new Coerce( assembly, new Fields( "age" ), Integer.class );

// outgoing -> first, last, age

Discard

The cascading.pipe.assembly.Discard SubAssembly is used to shape the Tuple stream by discarding all fields given on the constructor. All unlisted fields are retained.

Example 8. Using Discard

// incoming -> first, last, age

assembly = new Discard( assembly, new Fields( "age" ) );

// outgoing -> first, last

Rename

The cascading.pipe.assembly.Rename SubAssembly is used to rename a field.

Example 9. Using Rename

// incoming -> first, last, age

assembly =
  new Rename( assembly, new Fields( "age" ), new Fields( "years" ) );

// outgoing -> first, last, years

Retain

The cascading.pipe.assembly.Retain SubAssembly is used to shape the Tuple stream by retaining all fields given on the constructor. All unlisted fields are discarded.

Example 10. Using Retain

// incoming -> first, last, age

assembly = new Retain( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last