Cascading 4.0 User Guide - Built-In SubAssemblies
- 1. Introduction
-
1.1. What Is Cascading?
1.2. Another Perspective
1.3. Why Use Cascading?
1.5. Who Are the Users?
- 2. Diving into the APIs
- 3. Cascading Basic Concepts
-
3.1. Terminology
3.2. Pipe Assemblies
3.3. Pipes
3.4. Platforms
3.6. Sink Modes
3.7. Flows
- 4. Tuple Fields
-
4.1. Field Sets
4.2. Field Algebra
4.3. Field Typing
4.4. Type Coercion
- 5. Pipe Assemblies
-
5.1. Each and Every Pipes
5.2. Merge
5.3. GroupBy
5.4. CoGroup
5.5. HashJoin
- 6. Flows
-
6.1. Creating Flows from Pipe Assemblies
6.2. Configuring Flows
6.3. Skipping Flows
6.6. Runtime Metrics
- 7. Cascades
-
7.1. Creating a Cascade
- 8. Configuring
-
8.1. Introduction
8.2. Creating Properties
8.3. Passing Properties
- 9. Local Platform
-
9.3. Source and Sink Taps
- 10. The Apache Hadoop Platforms
-
10.1. What is Apache Hadoop?
10.4. Configuring Applications
10.5. Building an Application
10.6. Executing an Application
10.8. Source and Sink Taps
10.9. Custom Taps and Schemes
- 11. Apache Hadoop MapReduce Platform
-
11.1. Configuring Applications
11.3. Building
- 12. Apache Tez Platform
-
12.1. Configuring Applications
12.2. Building
- 13. Using and Developing Operations
-
13.1. Introduction
13.2. Functions
13.3. Filters
13.4. Aggregators
13.5. Buffers
- 14. Custom Taps and Schemes
-
14.1. Introduction
14.2. Custom Taps
14.3. Custom Schemes
14.5. Tap Life-Cycle Methods
- 15. Advanced Processing
-
15.1. SubAssemblies
15.2. Stream Assertions
15.3. Failure Traps
15.4. Checkpointing
15.7. PartitionTaps
- 16. Built-In Operations
-
16.1. Identity Function
16.2. Debug Function
16.4. Insert Function
16.5. Text Functions
16.8. XML Operations
16.9. Assertions
16.10. Logical Filter Operators
16.11. Buffers
- 17. Built-in SubAssemblies
-
17.1. Optimized Aggregations
17.2. Stream Shaping
- 18. Cascading Best Practices
-
18.1. Unit Testing
18.2. Flow Granularity
18.7. Optimizing Joins
18.8. Debugging Streams
18.11. Fields Constants
18.12. Checking the Source Code
- 19. Extending Cascading
-
19.1. Scripting
- 20. Cookbook: Code Examples of Cascading Idioms
-
20.1. Tuples and Fields
20.2. Stream Shaping
20.3. Common Operations
20.4. Stream Ordering
20.5. API Usage
- 21. The Cascading Process Planner
-
21.1. FlowConnector
21.2. RuleRegistrySet
21.3. RuleRegistry
Built-in SubAssemblies
There are a number of helper SubAssemblies provided by the core Cascading library.
Many of the assemblies that are described below can be coded to ignore null values. This allows for an optional but closer resemblance to how similar functions in SQL perform. |
Optimized Aggregations
The following SubAssemblies are implementations or optimizations of more discrete aggregate functions. Many of these SubAssemblies rely on the AggregateBy base class, which can be subclassed by developers to create custom aggregations leveraging the internal partial aggregation implementation.
Unique
The cascading.pipe.assembly.Unique SubAssembly is used to remove duplicate values in a Tuple stream. Uniqueness is determined by the values of all fields listed in uniqueFields. Use Fields.ALL as the uniqueFields argument to find all distinct Tuples in a stream.
// incoming -> first, last
assembly = new Unique( assembly, new Fields( "first", "last" ) );
// outgoing -> first, last
Unique uses the FirstNBuffer to more efficiently determine unique values.
AggregateBy
The cascading.pipe.assembly.AggregateBy SubAssembly is an implementation of the partial aggregation pattern, and is the base class for built-in and custom partial aggregation implementations like AverageBy or CountBy.
Generally the AggregateBy class is used to combine multiple AggregateBy subclasses into a single Pipe.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
// note we do not pass the parent assembly Pipe in
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size", long.class );
SumBy sumBy = new SumBy( valueField, sumField );
Fields countField = new Fields( "num-events" );
CountBy countBy = new CountBy( countField );
assembly = new AggregateBy( assembly, groupingFields, sumBy, countBy );
To create a custom partial aggregation, subclass the AggregateBy class and implement the appropriate internal interfaces. See the Javadoc for details.
AverageBy
The cascading.pipe.assembly.AverageBy SubAssembly calculates an average of the given valueFields and returns the result in the averageField field. AverageBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields avgField = new Fields( "avg-size" );
assembly = new AverageBy( assembly, groupingFields, valueField, avgField );
CountBy
The cascading.pipe.assembly.CountBy SubAssembly performs a count over the given groupingFields and returns the result in the countField field. CountBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields countField = new Fields( "count" );
assembly = new CountBy( assembly, groupingFields, countField );
SumBy
The cascading.pipe.assembly.SumBy SubAssembly performs a sum over the given valueFields and returns the result in the sumField field. SumBy may be combined with other AggregateBy subclasses so that they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
Fields sumField = new Fields( "total-size" );
assembly =
new SumBy( assembly, groupingFields, valueField, sumField, long.class );
FirstBy
The cascading.pipe.assembly.FirstBy SubAssembly is used to return the first encountered value in the given valueFields. FirstBy may be combined with other AggregateBy subclasses so they may be executed simultaneously over the same grouping.
Pipe assembly = new Pipe( "assembly" );
// ...
Fields groupingFields = new Fields( "date" );
Fields valueField = new Fields( "size" );
// we want the largest size in this grouping
valueField.setComparator( "size", new LongComparator() );
assembly =
new FirstBy( assembly, groupingFields, valueField );
Note if the valueFields Fields instance has field comparators, they will be used to sort the order of argument values. Otherwise, the fields will not be sorted in any deterministic order.
Stream Shaping
Coerce
The cascading.pipe.assembly.Coerce SubAssembly is used to coerce a set of values from one type to another type — for example, to convert the age field from a String to an Integer.
// incoming -> first, last, age
assembly =
new Coerce( assembly, new Fields( "age" ), Integer.class );
// outgoing -> first, last, age
Discard
The cascading.pipe.assembly.Discard SubAssembly is used to shape the Tuple stream by discarding all fields given on the constructor. All unlisted fields are retained.
// incoming -> first, last, age
assembly = new Discard( assembly, new Fields( "age" ) );
// outgoing -> first, last
Rename
The cascading.pipe.assembly.Rename SubAssembly is used to rename a field.
// incoming -> first, last, age
assembly =
new Rename( assembly, new Fields( "age" ), new Fields( "years" ) );
// outgoing -> first, last, years
Retain
The cascading.pipe.assembly.Retain SubAssembly is used to shape the Tuple stream by retaining all fields given on the constructor. All unlisted fields are discarded.
// incoming -> first, last, age
assembly = new Retain( assembly, new Fields( "first", "last" ) );
// outgoing -> first, last