java.lang.Object
  org.apache.nutch.urlfilter.api.RegexURLFilterBase
public abstract class RegexURLFilterBase
Generic URL filter based on regular expressions.
The regular-expression rules are read from a file. Each implementation supplies the name of its rules file through the getRulesFile(Configuration) method.
The file contains one rule per line, in the form:
[+-]<regex>
where plus (+) means go ahead and index the URL and minus (-) means do not.
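The rule format above can be illustrated with a small, self-contained sketch. It does not use any Nutch classes: `RuleFileSketch`, `Rule`, `parse`, and `filter` are hypothetical names chosen for illustration. It also assumes behavior the text above does not state: that lines starting with `#` and empty lines are skipped, that the first matching rule decides the outcome, and that a rejected URL (or one matching no rule) is reported as `null`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the Nutch implementation. */
final class RuleFileSketch {

    /** One parsed rule: sign is true for '+', false for '-'. */
    static final class Rule {
        final boolean sign;
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Parses one rule per line, each in the form [+-]<regex>. */
    static List<Rule> parse(Reader reader) throws IOException {
        List<Rule> rules = new ArrayList<>();
        BufferedReader in = new BufferedReader(reader);
        String line;
        while ((line = in.readLine()) != null) {
            // Assumption: empty lines and '#' comment lines are skipped.
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char first = line.charAt(0);
            if (first != '+' && first != '-') {
                throw new IllegalArgumentException("Invalid rule: " + line);
            }
            rules.add(new Rule(first == '+', line.substring(1)));
        }
        return rules;
    }

    /** Assumption: the first matching rule decides; no match means reject (null). */
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.sign ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Two rules: reject image URLs, accept anything else over http(s).
        String text = "-\\.(gif|jpg)$\n+^https?://\n";
        List<Rule> rules = parse(new StringReader(text));
        System.out.println(filter(rules, "http://example.org/index.html"));
        System.out.println(filter(rules, "http://example.org/logo.gif"));
    }
}
```

With these two rules, the first URL is returned unchanged and the second yields `null`, because the minus rule for image suffixes is listed before the plus rule.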
| Field Summary |
|---|
| Fields inherited from interface org.apache.nutch.net.URLFilter |
|---|
X_POINT_ID |
| Constructor Summary | |
|---|---|
| | RegexURLFilterBase() Constructs a new empty RegexURLFilterBase. |
| protected | RegexURLFilterBase(Reader reader) Constructs a new RegexURLFilterBase and initializes it with a Reader of rules. |
| | RegexURLFilterBase(String filename) Constructs a new RegexURLFilterBase and initializes it with a file of rules. |
| Method Summary | |
|---|---|
| protected abstract RegexRule | createRule(boolean sign, String regex) Creates a new RegexRule. |
| String | filter(String url) |
| org.apache.hadoop.conf.Configuration | getConf() |
| protected abstract String | getRulesFile(org.apache.hadoop.conf.Configuration conf) Returns the name of the file of rules to use for a particular implementation. |
| static void | main(RegexURLFilterBase filter, String[] args) Filter the standard input using a RegexURLFilterBase. |
| void | setConf(org.apache.hadoop.conf.Configuration conf) |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public RegexURLFilterBase()

public RegexURLFilterBase(String filename)
                   throws IOException,
                          IllegalArgumentException
Parameters:
filename - the name of the rules file.
Throws:
IOException
IllegalArgumentException

protected RegexURLFilterBase(Reader reader)
                      throws IOException,
                             IllegalArgumentException
Parameters:
reader - a reader of rules.
Throws:
IOException
IllegalArgumentException

| Method Detail |
|---|
protected abstract RegexRule createRule(boolean sign,
                                        String regex)
Creates a new RegexRule.
Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.

protected abstract String getRulesFile(org.apache.hadoop.conf.Configuration conf)
Returns the name of the file of rules to use for a particular implementation.
Parameters:
conf - the current configuration.

public String filter(String url)
Specified by:
filter in interface URLFilter

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

public static void main(RegexURLFilterBase filter,
                        String[] args)
                 throws IOException,
                        IllegalArgumentException
Filter the standard input using a RegexURLFilterBase.
Parameters:
filter - the RegexURLFilterBase to use for filtering the standard input.
args - some optional parameters (not used).
Throws:
IOException
IllegalArgumentException
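
A line-by-line filtering loop of the kind main describes could look like the sketch below. It is not the Nutch source: `FilterLoopSketch` and `run` are hypothetical names, a `UnaryOperator<String>` stands in for the filter, and the `+url` / `-url` output convention and the use of `null` to signal rejection are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for illustration; not the Nutch implementation. */
final class FilterLoopSketch {

    /**
     * Reads one URL per line and writes "+url" when the filter keeps it,
     * or "-url" when the filter returns null (assumed rejection signal).
     */
    static void run(UnaryOperator<String> filter, Reader in, PrintWriter out)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String kept = filter.apply(line);
            out.println(kept != null ? "+" + kept : "-" + line);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Toy filter: keep only URLs that start with "http://".
        UnaryOperator<String> httpOnly =
                url -> url.startsWith("http://") ? url : null;
        run(httpOnly, new InputStreamReader(System.in), new PrintWriter(System.out));
    }
}
```

Passing the filter as a function keeps the loop independent of any particular rule set, so the same loop can be exercised with a toy filter, as here, or with a real implementation.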