cascading.tap.hadoop
Class Hfs

java.lang.Object
  extended by cascading.tap.Tap<JobConf,RecordReader,OutputCollector>
      extended by cascading.tap.hadoop.Hfs
All Implemented Interfaces:
FlowElement, FileType<JobConf>, Serializable
Direct Known Subclasses:
Dfs, Lfs, TempHfs

public class Hfs
extends Tap<JobConf,RecordReader,OutputCollector>
implements FileType<JobConf>

Class Hfs is the base class for all Hadoop file system access. Hfs may only be used with the HadoopFlowConnector when creating Hadoop executable Flow instances.

Paths typically should point to a directory, where in turn all the "part" files immediately in that directory will be included. This is the practice Hadoop expects. Sub-directories are not included and typically result in a failure.

To include sub-directories, Hadoop supports "globing". Globing is a frustrating feature and is supported more robustly by GlobHfs and less so by Hfs.

Hfs will accept /* (wildcard) paths, but not all convenience methods like getSize(org.apache.hadoop.mapred.JobConf) will behave properly or reliably. Nor can the Hfs instance with a wildcard path be used as a sink to write data.

In those cases use GlobHfs since it is a sub-class of MultiSourceTap.

Optionally use Dfs or Lfs for resources specific to Hadoop Distributed file system or the Local file system, respectively. Using Hfs is the best practice when possible, Lfs and Dfs are conveniences.

Use the Hfs class if the 'kind' of resource is unknown at design time. To use, prefix a scheme to the 'stringPath'. Where hdfs://... will denote Dfs, and file://... will denote Lfs.

Call setTemporaryDirectory(java.util.Map, String) to use a different temporary file directory path other than the current Hadoop default path.

By default Cascading on Hadoop will assume any source or sink Tap using the file:// URI scheme intends to read files from the local client filesystem (for example when using the Lfs Tap) where the Hadoop job jar is started, Tap so will force any MapReduce jobs reading or writing to file:// resources to run in Hadoop "standalone mode" so that the file can be read.

To change this behavior, HfsProps.setLocalModeScheme(java.util.Map, String) to set a different scheme value, or to "none" to disable entirely for the case the file to be read is available on every Hadoop processing node in the exact same path.

Hfs can optionally combine multiple small files (or a series of small "blocks") into larger "splits". This reduces the number of resulting map tasks created by Hadoop and can improve application performance.

This is enabled by calling HfsProps.setUseCombinedInput(boolean) to true. By default, merging or combining splits into large ones is disabled.

See Also:
Serialized Form

Field Summary
protected  String stringPath
          Field stringPath
static String TEMPORARY_DIRECTORY
          Deprecated. see HfsProps.TEMPORARY_DIRECTORY
 
Constructor Summary
protected Hfs()
           
  Hfs(Fields fields, String stringPath)
          Deprecated. 
  Hfs(Fields fields, String stringPath, boolean replace)
          Deprecated. 
  Hfs(Fields fields, String stringPath, SinkMode sinkMode)
          Deprecated. 
protected Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme)
           
  Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme, String stringPath)
          Constructor Hfs creates a new Hfs instance.
  Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme, String stringPath, boolean replace)
          Deprecated. 
  Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme, String stringPath, SinkMode sinkMode)
          Constructor Hfs creates a new Hfs instance.
 
Method Summary
protected  void applySourceConfInitIdentifiers(FlowProcess<JobConf> process, JobConf conf, String... fullIdentifiers)
           
 boolean createResource(JobConf conf)
           
 boolean deleteChildResource(JobConf conf, String childIdentifier)
           
 boolean deleteResource(JobConf conf)
           
 long getBlockSize(JobConf conf)
          Method getBlockSize returns the blocksize specified by the underlying file system for this resource.
 String[] getChildIdentifiers(JobConf conf)
           
 String[] getChildIdentifiers(JobConf conf, int depth, boolean fullyQualified)
           
protected static boolean getCombinedInputSafeMode(JobConf conf)
           
protected  FileSystem getDefaultFileSystem(JobConf jobConf)
           
 URI getDefaultFileSystemURIScheme(JobConf jobConf)
          Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.
protected  FileSystem getFileSystem(JobConf jobConf)
           
 String getFullIdentifier(JobConf conf)
           
 String getIdentifier()
           
protected static String getLocalModeScheme(JobConf conf, String defaultValue)
           
 long getModifiedTime(JobConf conf)
           
 Path getPath()
           
 int getReplication(JobConf conf)
          Method getReplication returns the replication specified by the underlying file system for this resource.
 long getSize(JobConf conf)
           
static String getTemporaryDirectory(Map<Object,Object> properties)
          Deprecated. see HfsProps
static Path getTempPath(JobConf conf)
           
 URI getURIScheme(JobConf jobConf)
           
protected static boolean getUseCombinedInput(JobConf conf)
           
 boolean isDirectory(JobConf conf)
           
protected  String makeTemporaryPathDirString(String name)
           
protected  URI makeURIScheme(JobConf jobConf)
           
 TupleEntryIterator openForRead(FlowProcess<JobConf> flowProcess, RecordReader input)
           
 TupleEntryCollector openForWrite(FlowProcess<JobConf> flowProcess, OutputCollector output)
           
 boolean resourceExists(JobConf conf)
           
protected  void setStringPath(String stringPath)
           
static void setTemporaryDirectory(Map<Object,Object> properties, String tempDir)
          Deprecated. see HfsProps
protected  void setUriScheme(URI uriScheme)
           
 void sinkConfInit(FlowProcess<JobConf> process, JobConf conf)
           
 void sourceConfInit(FlowProcess<JobConf> process, JobConf conf)
           
protected  void sourceConfInitAddInputPath(JobConf conf, Path qualifiedPath)
           
protected  void sourceConfInitComplete(FlowProcess<JobConf> process, JobConf conf)
           
protected static void verifyNoDuplicates(JobConf conf)
           
 
Methods inherited from class cascading.tap.Tap
commitResource, createResource, deleteResource, equals, flowConfInit, getConfigDef, getFullIdentifier, getModifiedTime, getScheme, getSinkFields, getSinkMode, getSourceFields, getStepConfigDef, getTrace, hasConfigDef, hashCode, hasStepConfigDef, id, isEquivalentTo, isKeep, isReplace, isSink, isSource, isTemporary, isUpdate, openForRead, openForWrite, outgoingScopeFor, presentSinkFields, presentSourceFields, resolveIncomingOperationArgumentFields, resolveIncomingOperationPassThroughFields, resourceExists, retrieveSinkFields, retrieveSourceFields, rollbackResource, setScheme, taps, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

TEMPORARY_DIRECTORY

@Deprecated
public static final String TEMPORARY_DIRECTORY
Deprecated. see HfsProps.TEMPORARY_DIRECTORY
Field TEMPORARY_DIRECTORY

See Also:
Constant Field Values

stringPath

protected String stringPath
Field stringPath

Constructor Detail

Hfs

protected Hfs()

Hfs

@ConstructorProperties(value="scheme")
protected Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme)

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath"})
public Hfs(Fields fields,
                                                 String stringPath)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath","replace"})
public Hfs(Fields fields,
                                                 String stringPath,
                                                 boolean replace)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String
replace - of type boolean

Hfs

@Deprecated
@ConstructorProperties(value={"fields","stringPath","sinkMode"})
public Hfs(Fields fields,
                                                 String stringPath,
                                                 SinkMode sinkMode)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
fields - of type Fields
stringPath - of type String
sinkMode - of type SinkMode

Hfs

@ConstructorProperties(value={"scheme","stringPath"})
public Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme,
                                      String stringPath)
Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String

Hfs

@Deprecated
@ConstructorProperties(value={"scheme","stringPath","replace"})
public Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme,
                                                 String stringPath,
                                                 boolean replace)
Deprecated. 

Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String
replace - of type boolean

Hfs

@ConstructorProperties(value={"scheme","stringPath","sinkMode"})
public Hfs(Scheme<JobConf,RecordReader,OutputCollector,?,?> scheme,
                                      String stringPath,
                                      SinkMode sinkMode)
Constructor Hfs creates a new Hfs instance.

Parameters:
scheme - of type Scheme
stringPath - of type String
sinkMode - of type SinkMode
Method Detail

setTemporaryDirectory

@Deprecated
public static void setTemporaryDirectory(Map<Object,Object> properties,
                                                    String tempDir)
Deprecated. see HfsProps

Method setTemporaryDirectory sets the temporary directory on the given properties object.

Parameters:
properties - of type Map
tempDir - of type String

getTemporaryDirectory

@Deprecated
public static String getTemporaryDirectory(Map<Object,Object> properties)
Deprecated. see HfsProps

Method getTemporaryDirectory returns the configured temporary directory from the given properties object.

Parameters:
properties - of type Map
Returns:
a String or null if not set

getLocalModeScheme

protected static String getLocalModeScheme(JobConf conf,
                                           String defaultValue)

getUseCombinedInput

protected static boolean getUseCombinedInput(JobConf conf)

getCombinedInputSafeMode

protected static boolean getCombinedInputSafeMode(JobConf conf)

setStringPath

protected void setStringPath(String stringPath)

setUriScheme

protected void setUriScheme(URI uriScheme)

getURIScheme

public URI getURIScheme(JobConf jobConf)

makeURIScheme

protected URI makeURIScheme(JobConf jobConf)

getDefaultFileSystemURIScheme

public URI getDefaultFileSystemURIScheme(JobConf jobConf)
Method getDefaultFileSystemURIScheme returns the URI scheme for the default Hadoop FileSystem.

Parameters:
jobConf - of type JobConf
Returns:
URI

getDefaultFileSystem

protected FileSystem getDefaultFileSystem(JobConf jobConf)

getFileSystem

protected FileSystem getFileSystem(JobConf jobConf)

getIdentifier

public String getIdentifier()
Specified by:
getIdentifier in class Tap<JobConf,RecordReader,OutputCollector>

getPath

public Path getPath()

getFullIdentifier

public String getFullIdentifier(JobConf conf)
Overrides:
getFullIdentifier in class Tap<JobConf,RecordReader,OutputCollector>

sourceConfInit

public void sourceConfInit(FlowProcess<JobConf> process,
                           JobConf conf)
Overrides:
sourceConfInit in class Tap<JobConf,RecordReader,OutputCollector>

verifyNoDuplicates

protected static void verifyNoDuplicates(JobConf conf)

applySourceConfInitIdentifiers

protected void applySourceConfInitIdentifiers(FlowProcess<JobConf> process,
                                              JobConf conf,
                                              String... fullIdentifiers)

sourceConfInitAddInputPath

protected void sourceConfInitAddInputPath(JobConf conf,
                                          Path qualifiedPath)

sourceConfInitComplete

protected void sourceConfInitComplete(FlowProcess<JobConf> process,
                                      JobConf conf)

sinkConfInit

public void sinkConfInit(FlowProcess<JobConf> process,
                         JobConf conf)
Overrides:
sinkConfInit in class Tap<JobConf,RecordReader,OutputCollector>

openForRead

public TupleEntryIterator openForRead(FlowProcess<JobConf> flowProcess,
                                      RecordReader input)
                               throws IOException
Specified by:
openForRead in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

openForWrite

public TupleEntryCollector openForWrite(FlowProcess<JobConf> flowProcess,
                                        OutputCollector output)
                                 throws IOException
Specified by:
openForWrite in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

createResource

public boolean createResource(JobConf conf)
                       throws IOException
Specified by:
createResource in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

deleteResource

public boolean deleteResource(JobConf conf)
                       throws IOException
Specified by:
deleteResource in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

deleteChildResource

public boolean deleteChildResource(JobConf conf,
                                   String childIdentifier)
                            throws IOException
Throws:
IOException

resourceExists

public boolean resourceExists(JobConf conf)
                       throws IOException
Specified by:
resourceExists in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

isDirectory

public boolean isDirectory(JobConf conf)
                    throws IOException
Specified by:
isDirectory in interface FileType<JobConf>
Throws:
IOException

getSize

public long getSize(JobConf conf)
             throws IOException
Specified by:
getSize in interface FileType<JobConf>
Throws:
IOException

getBlockSize

public long getBlockSize(JobConf conf)
                  throws IOException
Method getBlockSize returns the blocksize specified by the underlying file system for this resource.

Parameters:
conf - of JobConf
Returns:
long
Throws:
IOException - when

getReplication

public int getReplication(JobConf conf)
                   throws IOException
Method getReplication returns the replication specified by the underlying file system for this resource.

Parameters:
conf - of JobConf
Returns:
int
Throws:
IOException - when

getChildIdentifiers

public String[] getChildIdentifiers(JobConf conf)
                             throws IOException
Specified by:
getChildIdentifiers in interface FileType<JobConf>
Throws:
IOException

getChildIdentifiers

public String[] getChildIdentifiers(JobConf conf,
                                    int depth,
                                    boolean fullyQualified)
                             throws IOException
Specified by:
getChildIdentifiers in interface FileType<JobConf>
Throws:
IOException

getModifiedTime

public long getModifiedTime(JobConf conf)
                     throws IOException
Specified by:
getModifiedTime in class Tap<JobConf,RecordReader,OutputCollector>
Throws:
IOException

getTempPath

public static Path getTempPath(JobConf conf)

makeTemporaryPathDirString

protected String makeTemporaryPathDirString(String name)


Copyright © 2007-2013 Concurrent, Inc. All Rights Reserved.