Cascading 4.0 User Guide - Built-In SubAssemblies

1. Introduction

1.1. What Is Cascading?

2. Diving into the APIs

2.1. Anatomy of a Word-Count Application

3. Cascading Basic Concepts

3.1. Terminology

3.3. Pipes

3.4. Platforms

3.6. Sink Modes

3.7. Flows

4. Tuple Fields

4.1. Field Sets

5. Pipe Assemblies

5.1. Each and Every Pipes

5.2. Merge

5.3. GroupBy

5.4. CoGroup

5.5. HashJoin

6. Flows

6.1. Creating Flows from Pipe Assemblies

7. Cascades

7.1. Creating a Cascade

8. Configuring

8.1. Introduction

9. Local Platform

9.1. Building an Application

10. The Apache Hadoop Platforms

10.1. What is Apache Hadoop?

11. Apache Hadoop MapReduce Platform

11.1. Configuring Applications

11.3. Building

12. Apache Tez Platform

12.1. Configuring Applications

12.2. Building

13. Using and Developing Operations

13.1. Introduction

13.2. Functions

13.3. Filters

13.4. Aggregators

13.5. Buffers

14. Custom Taps and Schemes

14.1. Introduction

14.2. Custom Taps

15. Advanced Processing

15.1. SubAssemblies

16. Built-In Operations

16.1. Identity Function

16.9. Assertions

16.11. Buffers

17. Built-in SubAssemblies

17.1. Optimized Aggregations

18. Cascading Best Practices

18.1. Unit Testing

19. Extending Cascading

19.1. Scripting

20. Cookbook: Code Examples of Cascading Idioms

20.1. Tuples and Fields

20.5. API Usage

21. The Cascading Process Planner

21.1. FlowConnector

21.3. RuleRegistry

Built-in SubAssemblies

There are a number of helper SubAssemblies provided by the core Cascading library.

Many of the assemblies that are described below can be coded to ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform.

Optimized Aggregations

The following SubAssemblies are implementations or optimizations of more discrete aggregate functions. Many of these SubAssemblies rely on the AggregateBy base class, which can be subclassed by developers to create custom aggregations leveraging the internal partial aggregation implementation.

Unique

The cascading.pipe.assembly.Unique SubAssembly is used to remove duplicate values in a Tuple stream. Uniqueness is determined by the values of all fields listed in uniqueFields. Use Fields.ALL as the uniqueFields argument to find all distinct Tuples in a stream.

Example 1. Using the Unique SubAssembly
// incoming -> first, last

assembly = new Unique( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last

Unique uses the FirstNBuffer to more efficiently determine unique values.

AggregateBy

The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the partial aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.

Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.

Example 2. Composing partial aggregations with AggregateBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );

// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );

Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );

assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );

To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.

AverageBy

The cascading.pipe.assembly.AverageBy SubAssembly calculates an average of the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 3. Using AverageBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );

CountBy

The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 4. Using CountBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );

SumBy

The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so that they may be executed simultaneously over the same grouping.

Example 5. Using SumBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
  new SumBy( assembly, groupingFields, valueField, sumField, long.class );

FirstBy

The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.

Example 6. Using FirstBy
Pipe assembly = new Pipe( "assembly" );

// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );

// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );

assembly =
  new FirstBy( assembly, groupingFields, valueField );

Note if the valueFields Fields instance has field comparators, they will be used to sort the order of argument values. Otherwise, the fields will not be sorted in any deterministic order.

Stream Shaping

Coerce

The cascading.pipe.assembly.Coerce SubAssembly is used to coerce a set of values from one type to another type — for example, to convert the age field from a String to an Integer.

Example 7. Using Coerce
// incoming -> first, last, age

assembly =
  new Coerce( assembly, new Fields( "age" ), Integer.class );

// outgoing -> first, last, age

Discard

The cascading.pipe.assembly.Discard SubAssembly is used to shape the Tuple stream by discarding all fields given on the constructor. All unlisted fields are retained.

Example 8. Using Discard
// incoming -> first, last, age

assembly = new Discard( assembly, new Fields( "age" ) );

// outgoing -> first, last

Rename

The cascading.pipe.assembly.Rename SubAssembly is used to rename a field.

Example 9. Using Rename
// incoming -> first, last, age

assembly =
  new Rename( assembly, new Fields( "age" ), new Fields( "years" ) );

// outgoing -> first, last, years

Retain

The cascading.pipe.assembly.Retain SubAssembly is used to shape the Tuple stream by retaining all fields given on the constructor. All unlisted fields are discarded.

Example 10. Using Retain
// incoming -> first, last, age

assembly = new Retain( assembly, new Fields( "first", "last" ) );

// outgoing -> first, last