java.lang.Object
cascading.pipe.Pipe
cascading.pipe.Splice
cascading.pipe.HashJoin
public class HashJoin
The HashJoin pipe allows two or more tuple streams to be joined into a single stream via a Joiner when all but one of the tuple streams is considered small enough to fit into memory.
For each incoming Pipe instance, a Fields instance must be specified that denotes the field names or positions that should be joined with the other given Pipe instances. If the incoming Pipe instances declare one or more fields with the same name, declaredFields must be given to name the outgoing Tuple stream fields and so overcome field name collisions.
By default, HashJoin performs an inner join via the InnerJoin Joiner class.
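The accumulate-and-probe idea behind a hash join with inner-join semantics can be sketched in plain Java. This is only an illustration of the general technique, not Cascading's implementation: the `HashJoinSketch` class, its `Row` type, and the string output format are all hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a plain-Java hash join, not Cascading code.
class HashJoinSketch {
    // Hypothetical stand-in for a two-field tuple (join key plus one value).
    static final class Row {
        final String key;
        final String value;

        Row(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    static List<String> innerJoin(List<Row> lhs, List<Row> rhs) {
        // Accumulate the (small) right hand side in memory, grouped by join key.
        Map<String, List<Row>> rhsByKey = new HashMap<>();
        for (Row r : rhs) {
            rhsByKey.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }

        // Stream the (large) left hand side and probe the map once per row.
        // Rows without a match on the other side are dropped (inner join).
        List<String> joined = new ArrayList<>();
        for (Row l : lhs) {
            for (Row r : rhsByKey.getOrDefault(l.key, List.of())) {
                joined.add(l.key + ":" + l.value + "," + r.value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Row> lhs = List.of(new Row("a", "1"), new Row("b", "2"), new Row("c", "3"));
        List<Row> rhs = List.of(new Row("a", "x"), new Row("c", "y"), new Row("d", "z"));
        System.out.println(innerJoin(lhs, rhs)); // prints [a:1,x, c:3,y]
    }
}
```

Only the right hand side is held in memory here, which is why the size of that side, rather than the left, bounds the memory needed.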
Self joins can be achieved by using a constructor that takes a single Pipe and a numSelfJoins value. A value of 1 for numSelfJoins will join the Pipe with itself once. Note that a self join will block until all data is accumulated, so the stream must be reasonably small.
Note that "outer" joins on the left most side will not behave as expected. All observed keys on the right most sides will be emitted with null for the left most stream, so when running distributed, duplicate values will emerge from every Map task split on the MapReduce platform.
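The duplication described above can be modeled with a simplified plain-Java sketch. It assumes each task sees one split of the left hand keys plus the complete right hand key set, and emits every right hand key it did not match with a null left side. The class and method names (`OuterJoinSplitSketch`, `joinSplit`, `runAllSplits`) are hypothetical, and this is a model of the symptom, not Cascading's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: models why unmatched right hand keys repeat per split.
class OuterJoinSplitSketch {
    // Joins one split of left hand keys against the full right hand key set.
    static List<String> joinSplit(List<String> lhsSplit, Set<String> rhsKeys) {
        List<String> out = new ArrayList<>();
        Set<String> matched = new HashSet<>();
        for (String l : lhsSplit) {
            if (rhsKeys.contains(l)) {
                out.add(l + "," + l);
                matched.add(l);
            }
        }
        // Every right hand key this split did not match is emitted with a
        // null left side, even if another split does hold the matching key.
        for (String r : rhsKeys) {
            if (!matched.contains(r)) {
                out.add("null," + r);
            }
        }
        return out;
    }

    // Runs every split independently, as separate tasks would.
    static List<String> runAllSplits(List<List<String>> splits, Set<String> rhsKeys) {
        List<String> all = new ArrayList<>();
        for (List<String> split : splits) {
            all.addAll(joinSplit(split, rhsKeys));
        }
        return all;
    }

    public static void main(String[] args) {
        Set<String> rhsKeys = new TreeSet<>(List.of("a", "c", "d"));
        List<String> all = runAllSplits(List.of(List.of("a", "b"), List.of("c")), rhsKeys);
        System.out.println(all);
        // "d" matches nothing, so it is emitted once per split.
        System.out.println(Collections.frequency(all, "null,d")); // prints 2
    }
}
```

In this model the key "a" is also emitted as "null,a" by the second split even though the first split matched it, which is the kind of unexpected result the note warns about.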
HashJoin does not scale well to large data sizes and thus requires streams with more data on the left hand side to
join with more sparse data on the right hand side. That is, always attempt to effect M x N joins where M is large
and N is small, instead of where M is small and N is large. Right hand side streams will be accumulated, and
spilled to disk if the collection reaches a specific threshold when using Hadoop.
If spills are happening, consider increasing the spill thresholds; see SpillableTupleMap.
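To make the effect of a spill threshold concrete, here is a minimal plain-Java sketch of a collection that writes its in-memory contents to a temporary file whenever it reaches a configured size. The `SpillingListSketch` class and its methods are hypothetical and are not related to the SpillableTupleMap API; the sketch only shows that a higher threshold means fewer spills.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a threshold-based spill to disk, not Cascading code.
class SpillingListSketch {
    private final int threshold;
    private final List<String> inMemory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    SpillingListSketch(int threshold) {
        this.threshold = threshold;
    }

    // Values must not contain line breaks, since each is stored as one line.
    void add(String value) {
        inMemory.add(value);
        if (inMemory.size() >= threshold) {
            spill();
        }
    }

    // Writes the in-memory values to a temp file and frees the memory.
    private void spill() {
        try {
            Path file = Files.createTempFile("spill-sketch", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, inMemory);
            spillFiles.add(file);
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spillCount() {
        return spillFiles.size();
    }

    // Reads spilled values back in insertion order, then the in-memory tail.
    List<String> readAll() {
        List<String> all = new ArrayList<>();
        try {
            for (Path file : spillFiles) {
                all.addAll(Files.readAllLines(file));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        all.addAll(inMemory);
        return all;
    }

    public static void main(String[] args) {
        for (int threshold : new int[] {2, 10}) {
            SpillingListSketch list = new SpillingListSketch(threshold);
            for (int i = 1; i <= 5; i++) {
                list.add("t" + i);
            }
            System.out.println("threshold " + threshold + " -> spills: " + list.spillCount());
        }
    }
}
```

With five values, a threshold of 2 spills twice while a threshold of 10 never touches disk, at the cost of holding everything in memory.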
If one of the right hand side streams starts larger than memory but is filtered (likely by a Filter implementation) down to the point where it fits into memory, it may be useful to use a Checkpoint Pipe to persist the stream and force a new FlowStep (MapReduce job) to read the data from disk, instead of applying the filter redundantly. This will minimize the amount of data "replicated" across the network.
See TupleCollectionFactory and TupleMapFactory for a means to use alternative spillable types.
See Also: InnerJoin, OuterJoin, LeftJoin, RightJoin, MixedJoin, Fields, SpillableTupleMap, Serialized Form
Field Summary |
---|
Fields inherited from class cascading.pipe.Splice |
---|
declaredFields, keyFieldsMap, resultGroupFields, sortFieldsMap |
Fields inherited from class cascading.pipe.Pipe |
---|
configDef, name, parent, previous, stepConfigDef |
Constructor Summary | |
---|---|
HashJoin(Pipe[] pipes, Fields[] joinFields, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe[] pipes, Fields[] joinFields, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Joiner joiner) | Constructor HashJoin creates a new HashJoin instance. |
Method Summary |
---|
Methods inherited from class cascading.pipe.Splice |
---|
equals, getDeclaredFields, getJoinDeclaredFields, getJoiner, getKeySelectors, getName, getNumSelfJoins, getPipePos, getPrevious, getSortingSelectors, hashCode, isCoGroup, isEquivalentTo, isGroupBy, isJoin, isMerge, isSorted, isSortReversed, outgoingScopeFor, printInternal, resolveIncomingOperationPassThroughFields, toString |
Methods inherited from class cascading.pipe.Pipe |
---|
getConfigDef, getHeads, getParent, getStepConfigDef, getTrace, hasConfigDef, hasStepConfigDef, id, named, names, pipes, print, resolveIncomingOperationArgumentFields, setParent |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
@ConstructorProperties(value={"joinName","lhs","lhsJoinFields","rhs","rhsJoinFields"})
public HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields)
Parameters: joinName, lhs, lhsJoinFields, rhs, rhsJoinFields

@ConstructorProperties(value={"joinName","lhs","lhsJoinFields","rhs","rhsJoinFields","joiner"})
public HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Joiner joiner)
Parameters: joinName, lhs, lhsJoinFields, rhs, rhsJoinFields, joiner

@ConstructorProperties(value={"joinName","lhs","lhsJoinFields","rhs","rhsJoinFields","declaredFields"})
public HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields)
Parameters: joinName, lhs, lhsJoinFields, rhs, rhsJoinFields, declaredFields

@ConstructorProperties(value={"joinName","lhs","lhsJoinFields","rhs","rhsJoinFields","declaredFields","joiner"})
public HashJoin(String joinName, Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields, Joiner joiner)
Parameters: joinName, lhs, lhsJoinFields, rhs, rhsJoinFields, declaredFields, joiner

@ConstructorProperties(value={"joinName","pipe","joinFields","numSelfJoins","declaredFields","joiner"})
public HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields, Joiner joiner)
Parameters: joinName, pipe, joinFields, numSelfJoins, declaredFields, joiner

@ConstructorProperties(value={"joinName","pipe","joinFields","numSelfJoins","declaredFields"})
public HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields)
Parameters: joinName, pipe, joinFields, numSelfJoins, declaredFields

@ConstructorProperties(value={"joinName","pipe","joinFields","numSelfJoins","joiner"})
public HashJoin(String joinName, Pipe pipe, Fields joinFields, int numSelfJoins, Joiner joiner)
Parameters: joinName, pipe, joinFields, numSelfJoins, joiner

@ConstructorProperties(value={"joinName","pipes","joinFields","declaredFields","joiner"})
public HashJoin(String joinName, Pipe[] pipes, Fields[] joinFields, Fields declaredFields, Joiner joiner)
Parameters: joinName, pipes, joinFields, declaredFields, joiner

@ConstructorProperties(value={"lhs","lhsJoinFields","rhs","rhsJoinFields"})
public HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields)
Parameters: lhs, lhsJoinFields, rhs, rhsJoinFields

@ConstructorProperties(value={"lhs","lhsJoinFields","rhs","rhsJoinFields","joiner"})
public HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Joiner joiner)
Parameters: lhs, lhsJoinFields, rhs, rhsJoinFields, joiner

@ConstructorProperties(value={"lhs","lhsJoinFields","rhs","rhsJoinFields","declaredFields"})
public HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields)
Parameters: lhs, lhsJoinFields, rhs, rhsJoinFields, declaredFields

@ConstructorProperties(value={"lhs","lhsJoinFields","rhs","rhsJoinFields","declaredFields","joiner"})
public HashJoin(Pipe lhs, Fields lhsJoinFields, Pipe rhs, Fields rhsJoinFields, Fields declaredFields, Joiner joiner)
Parameters: lhs, lhsJoinFields, rhs, rhsJoinFields, declaredFields, joiner

@ConstructorProperties(value={"pipe","joinFields","numSelfJoins","declaredFields","joiner"})
public HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields, Joiner joiner)
Parameters: pipe, joinFields, numSelfJoins, declaredFields, joiner

@ConstructorProperties(value={"pipe","joinFields","numSelfJoins","declaredFields"})
public HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Fields declaredFields)
Parameters: pipe, joinFields, numSelfJoins, declaredFields

@ConstructorProperties(value={"pipe","joinFields","numSelfJoins","joiner"})
public HashJoin(Pipe pipe, Fields joinFields, int numSelfJoins, Joiner joiner)
Parameters: pipe, joinFields, numSelfJoins, joiner

@ConstructorProperties(value={"pipes","joinFields","declaredFields","joiner"})
public HashJoin(Pipe[] pipes, Fields[] joinFields, Fields declaredFields, Joiner joiner)
Parameters: pipes, joinFields, declaredFields, joiner