Cascading 4.0 User Guide - Apache Tez Platform

Apache Tez Platform

The following documentation covers details about using Cascading on the Apache Tez platform that are not covered in the Apache Hadoop documentation of this guide.

The most up-to-date information about running Cascading on Apache Tez and supported Tez releases can be found in a GitHub repo README at:

Apache Tez is a noticeable improvement over MapReduce. Tez’s merits include:

  • No more "identity mappers" — mappers that simply forward data to a reducer

  • Support for multiple outputs

  • No prefixing data with join ordinality

  • Suppression of sorting when not required

  • Removal of HDFS as an intermediate store between jobs

Configuring Applications

At runtime, Hadoop must be told which application JAR file should be pushed to the cluster.

To remain platform-independent, use the AppProps class as described in the Configuring Applications section of the Hadoop documentation.
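
As a minimal sketch of that advice, the snippet below sets the application JAR through AppProps and passes the resulting properties to the Tez flow connector. The class name Main and the values "sample-app" and "1.2.3" are placeholders chosen for illustration, and the example assumes the Cascading core and Tez JARs are on the classpath.

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.tez.Hadoop2TezFlowConnector;
import cascading.property.AppProps;

public class Main {
  public static void main(String[] args) {
    // Identify the application JAR by naming a class packaged inside it.
    // The name and version strings are placeholder values.
    Properties properties = AppProps.appProps()
        .setName("sample-app")
        .setVersion("1.2.3")
        .setJarClass(Main.class)
        .buildProperties();

    // Pass the properties to the Tez planner. Taps and pipe assemblies are
    // omitted here; flows built with this connector pick up the JAR setting.
    FlowConnector flowConnector = new Hadoop2TezFlowConnector(properties);
  }
}
```

If the JAR location is known but no suitable class is available, setJarPath can be called in place of setJarClass.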

Building

Cascading ships with several JARs and dependencies in the download archive.

Alternatively, Cascading is available to Maven and Ivy builds through the Conjars repository, along with a number of other Cascading-related projects. See http://conjars.org for more information.

The Cascading Hadoop artifacts include the following:

cascading-core-3.x.y.jar

This JAR contains the Cascading Core class files. It should be packaged with lib/*.jar when using Hadoop.

cascading-hadoop2-tez-3.x.y.jar

This JAR contains the Cascading Hadoop 2 and Apache Tez specific dependencies. It should be packaged with lib/*.jar when using Hadoop.

cascading-hadoop2-tez-stats-3.x.y.jar

This JAR is a dependency of cascading-hadoop2-tez-3.x.y.jar and will be automatically included in a Maven or Gradle build.

Cascading works with either of the Hadoop processing modes: the default local stand-alone mode and the distributed cluster mode. As specified in the Hadoop documentation, running in cluster mode requires the creation of a Hadoop job JAR that includes the Cascading JARs, plus any needed third-party JARs, in its lib directory. This requirement applies regardless of whether the application is a Cascading Hadoop-mode application or a raw Apache Tez application.
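
One way to confirm that a job JAR follows this layout is to list the JARs nested under its lib directory. The helper below is a hypothetical utility written for this guide, not part of Cascading; it uses only the standard java.util.jar API, and the class name JobJarCheck is an assumption.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class JobJarCheck {
  // Returns the sorted names of all entries under lib/ that end in .jar.
  static List<String> bundledLibs(String jobJarPath) throws IOException {
    try (JarFile jar = new JarFile(jobJarPath)) {
      return jar.stream()
          .map(JarEntry::getName)
          .filter(name -> name.startsWith("lib/") && name.endsWith(".jar"))
          .sorted()
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: java JobJarCheck path/to/job.jar
    for (String name : bundledLibs(args[0])) {
      System.out.println(name);
    }
  }
}
```

Running it against a freshly built job JAR should print the Cascading JARs and any third-party JARs; an empty listing suggests the dependencies were not packaged into lib.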