Cascading 4.0 User Guide - Apache Tez Platform
- 1. Introduction
-
1.1. What Is Cascading?
1.2. Another Perspective
1.3. Why Use Cascading?
1.5. Who Are the Users?
- 2. Diving into the APIs
- 3. Cascading Basic Concepts
-
3.1. Terminology
3.2. Pipe Assemblies
3.3. Pipes
3.4. Platforms
3.6. Sink Modes
3.7. Flows
- 4. Tuple Fields
-
4.1. Field Sets
4.2. Field Algebra
4.3. Field Typing
4.4. Type Coercion
- 5. Pipe Assemblies
-
5.1. Each and Every Pipes
5.2. Merge
5.3. GroupBy
5.4. CoGroup
5.5. HashJoin
- 6. Flows
-
6.1. Creating Flows from Pipe Assemblies
6.2. Configuring Flows
6.3. Skipping Flows
6.6. Runtime Metrics
- 7. Cascades
-
7.1. Creating a Cascade
- 8. Configuring
-
8.1. Introduction
8.2. Creating Properties
8.3. Passing Properties
- 9. Local Platform
-
9.3. Source and Sink Taps
- 10. The Apache Hadoop Platforms
-
10.1. What is Apache Hadoop?
10.4. Configuring Applications
10.5. Building an Application
10.6. Executing an Application
10.8. Source and Sink Taps
10.9. Custom Taps and Schemes
- 11. Apache Hadoop MapReduce Platform
-
11.1. Configuring Applications
11.3. Building
- 12. Apache Tez Platform
-
12.1. Configuring Applications
12.2. Building
- 13. Using and Developing Operations
-
13.1. Introduction
13.2. Functions
13.3. Filters
13.4. Aggregators
13.5. Buffers
- 14. Custom Taps and Schemes
-
14.1. Introduction
14.2. Custom Taps
14.3. Custom Schemes
14.5. Tap Life-Cycle Methods
- 15. Advanced Processing
-
15.1. SubAssemblies
15.2. Stream Assertions
15.3. Failure Traps
15.4. Checkpointing
15.7. PartitionTaps
- 16. Built-In Operations
-
16.1. Identity Function
16.2. Debug Function
16.4. Insert Function
16.5. Text Functions
16.8. XML Operations
16.9. Assertions
16.10. Logical Filter Operators
16.11. Buffers
- 17. Built-in SubAssemblies
-
17.1. Optimized Aggregations
17.2. Stream Shaping
- 18. Cascading Best Practices
-
18.1. Unit Testing
18.2. Flow Granularity
18.7. Optimizing Joins
18.8. Debugging Streams
18.11. Fields Constants
18.12. Checking the Source Code
- 19. Extending Cascading
-
19.1. Scripting
- 20. Cookbook: Code Examples of Cascading Idioms
-
20.1. Tuples and Fields
20.2. Stream Shaping
20.3. Common Operations
20.4. Stream Ordering
20.5. API Usage
- 21. The Cascading Process Planner
-
21.1. FlowConnector
21.2. RuleRegistrySet
21.3. RuleRegistry
Apache Tez Platform
The following documentation covers details about using Cascading on the Apache Tez platform that are not covered in the Apache Hadoop documentation of this guide.
The most up-to-date information about running Cascading on Apache Tez and supported Tez releases can be found in a GitHub repo README at:
- Released Source
-
https://github.com/cascading/cascading/tree/3.1/cascading-hadoop2-tez
- Work-in-Progress Source
-
https://github.com/cwensel/cascading/tree/wip-3.1/cascading-hadoop2-tez
Apache Tez is a noticeable improvement over MapReduce. Tez’s merits include:
-
No more "identity mappers" — mappers that simply forward data to a reducer
-
Support for multiple outputs
-
No prefixing data with join ordinality
-
Suppression of sorting when not required
-
Removal of HDFS as an intermediate store between jobs
Configuring Applications
During runtime, Hadoop must be told which application JAR file should be pushed to the cluster.
In order to remain platform-independent, the AppProps class should be used as described in the configuring applications for Hadoop documentation.
Building
Cascading ships with several JARs and dependencies in the download archive.
Alternatively, Cascading is available over Maven and Ivy through the Conjars repository, along with a number of other Cascading-related projects. See http://conjars.org for more information.
The Cascading Hadoop artifacts include the following:
- cascading-core-3.x.y.jar
-
This JAR contains the Cascading Core class files. It should be packaged with lib/*.jar when using Hadoop.
- cascading-hadoop2-tez-3.x.y.jar
-
This JAR contains the Cascading Hadoop 2 and Apache Tez specific dependencies. It should be packaged with lib/*.jar when using Hadoop.
- cascading-hadoop2-tez-stats-3.x.y.jar
-
This JAR is a dependency of cascading-hadoop2-tez-3.x.y.jar and will be automatically included in a Maven or Gradle build.
Cascading works with either of the Hadoop processing modes — the default local stand-alone mode and the distributed cluster mode. As specified in the Hadoop documentation, running in cluster mode requires the creation of a Hadoop job JAR that includes the Cascading JARs, plus any needed third-party JARs, in its lib directory. This is true regardless of whether they are Cascading Hadoop-mode applications or raw Apache Tez applications.