Cascading 4.0 User Guide - Local Platform
Local Platform
Building an Application
Cascading local mode has no special build requirements beyond those of any Java application executed from the command line. However, two top-level dependencies should be added to the build file:
- cascading-core-4.x.y.jar: this JAR contains the Cascading Core class files.
- cascading-local-4.x.y.jar: this JAR contains the Cascading local-mode class files.
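For example, in a Gradle build the dependencies might be declared as in the following sketch; the group ID and version shown are assumptions, so consult the Cascading release notes for the actual published coordinates:

    dependencies {
        // assumed coordinates; verify the group ID and version for your release
        implementation 'cascading:cascading-core:4.0.0'
        implementation 'cascading:cascading-local:4.0.0'
    }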
Executing an Application
Once the application and its "main" class are built, the application can be run like any other Java command-line application.
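For illustration, a minimal local-mode application might look like the following sketch; the class name, the copy-through assembly, and the file arguments are hypothetical:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    public class Main
      {
      public static void main( String[] args )
        {
        // read and write text files on the local filesystem
        Tap inTap = new FileTap( new TextLine(), args[ 0 ] );
        Tap outTap = new FileTap( new TextLine(), args[ 1 ] );

        Pipe copy = new Pipe( "copy" ); // pass tuples through unchanged

        FlowDef flowDef = FlowDef.flowDef()
          .addSource( copy, inTap )
          .addTailSink( copy, outTap );

        Flow flow = new LocalFlowConnector( new Properties() ).connect( flowDef );
        flow.complete(); // runs in the local JVM and blocks until done
        }
      }

Such a class could then be launched with a plain invocation like java -cp my-app.jar Main input.txt output.txt.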
Troubleshooting and Debugging
Unlike on the distributed platforms, IDE debugging and testing in Cascading local mode are straightforward, because all processing happens in the local JVM and in local memory. Therefore, the first recommendation for debugging Cascading applications on any platform is to write tests that run in Cascading local mode.
Because Cascading local mode runs entirely in memory, large data sets may cause an OutOfMemoryError. If this occurs, increase the Java runtime memory settings (for example, with -Xmx).
In addition to using an IDE debugger, you can use two Cascading features to help sort out runtime issues.
One feature is the Debug filter. Best practice is to sprinkle Debug operators (see Debug Function) throughout the pipe assembly and rely on the planner to remove them from the execution plan according to the DebugLevel set on the Flow.
Debug can print only to the local console via standard output or standard error. This limitation makes Debug harder to use on distributed platforms, where operations execute on the cluster rather than locally. Debug can optionally print the current field names, and a prefix can be set to help distinguish between instances of the Debug operation.
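As a sketch, a Debug operator might be inserted and controlled as follows; the pipe name, prefix, and chosen levels are illustrative:

    import cascading.flow.FlowDef;
    import cascading.operation.Debug;
    import cascading.operation.DebugLevel;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;

    Pipe assembly = new Pipe( "main" );

    // print the field names and each tuple, with a prefix to tell instances apart
    assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug( "after-parse", true ) );

    // a level of NONE tells the planner to strip all Debug operators from the plan
    FlowDef flowDef = FlowDef.flowDef()
      .setDebugLevel( DebugLevel.NONE );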
Additionally, the actual execution plan for a Flow can be written (and visualized) via the Flow.writeDOT() method. DOT files are simply text representations of graph data and can be read by tools like Graphviz and OmniGraffle.
In Cascading local mode, these execution plans are exactly as the pipe assemblies were coded, except that subassemblies are unwound and field names across the Flow are resolved by the planner. In other words, Fields.ALL and other wildcards are converted to the actual field names or ordinals.
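For example, assuming a connected Flow and a hypothetical output path:

    Flow flow = new LocalFlowConnector().connect( flowDef );
    flow.writeDOT( "build/flow.dot" ); // render with Graphviz: dot -Tpng build/flow.dot -o flow.png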
If the connect() method on the current FlowConnector fails, the resulting PlannerException has a writeDOT() method that shows the progress of the current planner.
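A sketch of capturing the partial plan on failure, with a hypothetical output path:

    import cascading.flow.planner.PlannerException;

    try
      {
      Flow flow = new LocalFlowConnector().connect( flowDef );
      }
    catch( PlannerException exception )
      {
      // writes the partial plan so the failure point can be inspected
      exception.writeDOT( "build/failed-plan.dot" );
      throw exception;
      }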
For planner-related errors that appear during runtime when executing a Flow, see the chapter on the Cascading Process Planner.