java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.tools.arc.ArcSegmentCreator
public class ArcSegmentCreator
The ArcSegmentCreator is a replacement for the fetcher that takes arc files as input and produces a Nutch segment as output.
Arc files are essentially tars of gzip-compressed records and are produced by both the Internet Archive project and the Grub distributed crawler project.
| Field Summary | |
|---|---|
static org.apache.commons.logging.Log |
LOG
|
static String |
URL_VERSION
|
| Constructor Summary | |
|---|---|
ArcSegmentCreator()
|
|
ArcSegmentCreator(org.apache.hadoop.conf.Configuration conf)
Constructor that sets the job configuration. |
|
| Method Summary | |
|---|---|
void |
close()
|
void |
configure(org.apache.hadoop.mapred.JobConf job)
Configures the job. |
void |
createSegments(org.apache.hadoop.fs.Path arcFiles,
org.apache.hadoop.fs.Path segmentsOutDir)
Creates and runs the job that converts arc files to segments. |
static String |
generateSegmentName()
Generates a random name for the segments. |
static void |
main(String[] args)
|
void |
map(org.apache.hadoop.io.Text key,
org.apache.hadoop.io.BytesWritable bytes,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output,
org.apache.hadoop.mapred.Reporter reporter)
Runs the Map job to translate an arc record into output for Nutch segments. |
int |
run(String[] args)
|
| Methods inherited from class org.apache.hadoop.conf.Configured |
|---|
getConf, setConf |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface org.apache.hadoop.conf.Configurable |
|---|
getConf, setConf |
| Field Detail |
|---|
public static final org.apache.commons.logging.Log LOG
public static final String URL_VERSION
| Constructor Detail |
|---|
public ArcSegmentCreator()
public ArcSegmentCreator(org.apache.hadoop.conf.Configuration conf)
Constructor that sets the job configuration.
conf - The job configuration.
| Method Detail |
|---|
public static String generateSegmentName()
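The summary describes this method only as generating a random name for the segments. As a minimal sketch of one way a segment name could be produced, the example below assumes a `yyyyMMddHHmmss` timestamp, like the names commonly seen on Nutch segment directories. That format is an assumption rather than this method's confirmed behavior, and `SegmentNameSketch` and `segmentNameFor` are hypothetical names that do not exist in Nutch.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class SegmentNameSketch {
    // Assumed name format: a 14-digit UTC timestamp (yyyyMMddHHmmss).
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    // Hypothetical helper: formats the given instant as a segment name.
    static String segmentNameFor(Instant when) {
        return FORMAT.format(when);
    }

    public static void main(String[] args) {
        // Prints a name derived from the current time.
        System.out.println(segmentNameFor(Instant.now()));
    }
}
```

Taking the instant as a parameter keeps the helper deterministic, which makes it easy to check in isolation.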
public void configure(org.apache.hadoop.mapred.JobConf job)
Configures the job. Sets the URL filters, scoring filters, URL normalizers and other relevant data.
configure in interface org.apache.hadoop.mapred.JobConfigurable
job - The job configuration.
public void close()
close in interface Closeable
public void map(org.apache.hadoop.io.Text key,
org.apache.hadoop.io.BytesWritable bytes,
org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,NutchWritable> output,
org.apache.hadoop.mapred.Reporter reporter)
throws IOException
Runs the Map job to translate an arc record into output for Nutch segments.
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.Text,NutchWritable>
key - The arc record header.
bytes - The arc record raw content bytes.
output - The output collector.
reporter - The progress reporter.
IOException
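The `key` parameter is described only as the arc record header. As a sketch of what handling such a header might involve, the example below assumes the five-field, space-separated layout of a version 1 ARC URL record (URL, IP address, archive date, content type, length). `ArcHeaderSketch`, `ArcHeader` and `parse` are hypothetical names, not part of Nutch, and the actual `map` implementation may read the header differently.

```java
final class ArcHeaderSketch {
    // Hypothetical value holder for the assumed header fields.
    static final class ArcHeader {
        final String url;
        final String ipAddress;
        final String archiveDate;
        final String contentType;
        final long length;

        ArcHeader(String url, String ipAddress, String archiveDate,
                  String contentType, long length) {
            this.url = url;
            this.ipAddress = ipAddress;
            this.archiveDate = archiveDate;
            this.contentType = contentType;
            this.length = length;
        }
    }

    // Splits a header line on whitespace and rejects lines that do not
    // have exactly the five fields this sketch assumes.
    static ArcHeader parse(String headerLine) {
        String[] fields = headerLine.trim().split("\\s+");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 header fields but found " + fields.length);
        }
        return new ArcHeader(fields[0], fields[1], fields[2], fields[3],
                             Long.parseLong(fields[4]));
    }

    public static void main(String[] args) {
        // Illustrative header line, not taken from a real arc file.
        ArcHeader header = parse(
            "http://example.org/index.html 192.0.2.1 20090102030405 text/html 202");
        System.out.println(header.url + " " + header.contentType + " " + header.length);
    }
}
```

Rejecting malformed headers up front keeps bad records from flowing into later processing. A mapper might instead prefer to log and skip them.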
public void createSegments(org.apache.hadoop.fs.Path arcFiles,
org.apache.hadoop.fs.Path segmentsOutDir)
throws IOException
Creates and runs the job that converts arc files to segments.
arcFiles - The path to the directory holding the arc files.
segmentsOutDir - The output directory for writing the segments.
IOException - If an IO error occurs while running the job.
public static void main(String[] args)
throws Exception
Exception
public int run(String[] args)
throws Exception
run in interface org.apache.hadoop.util.Tool
Exception