1. A complete crawl procedure can be presented by the following pseudo-code:
inject: pass links from the urls file to the webDB;
for (i = 0; i < depth; i++) {
    generate: create a new segment and generate a fetchlist from the webDB;
    fetch: fetch content from the URLs in the new fetchlist;
    parse: parse the content of the new segment;
    updatedb: add new links to the crawldb according to the new segment;
}
invertlinks: create the linkdb, listing incoming links for each url;
index: create indexes for the segments;
dedup: delete duplicate documents from the segment indexes;
merge: merge all indexes into a single index;
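The pseudo-code above can be sketched as a small Python function that lists the command names in the order they would run for a given depth. This is only an illustration of the ordering: the function name `crawl_steps` is hypothetical, and it returns command names rather than invoking Nutch, so no command-line arguments are assumed.

```python
def crawl_steps(depth):
    """Return the ordered Nutch command names for a crawl of the given depth.

    Hypothetical helper: it mirrors the pseudo-code only and does not run Nutch.
    """
    steps = ["inject"]  # seed the webDB once, before the loop
    for _ in range(depth):
        # one generate/fetch/parse/updatedb round per level of depth
        steps += ["generate", "fetch", "parse", "updatedb"]
    # post-processing runs once, after all rounds have finished
    steps += ["invertlinks", "index", "dedup", "merge"]
    return steps


if __name__ == "__main__":
    print(" -> ".join(crawl_steps(2)))
```

Each extra level of depth adds one more generate/fetch/parse/updatedb round, while inject and the four indexing steps run exactly once.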
2. Nutch provides a set of utility commands:
for webDB:
readdb: Read utility
mergedb: merger
convdb: old version converter
for linkdb:
readlinkdb: Read utility
mergelinkdb: merger
for segment:
readseg: Read utility
mergesegs: merger
3. Besides those above, there are two system commands:
plugin: plugin registry
server: a search server
4. Here is a complete list of all commands with a brief description of each:
| Command | Input | Output | Task |
| --- | --- | --- | --- |
| crawl | urls dir | all | do the whole thing in a single command |
| inject | urls dir | webDB | pass links from the urls file to the webDB |
| generate | webDB | a segment | create a new segment and generate a fetchlist from the webDB |
| freegen | urls dir | a segment | create a new segment and generate a fetchlist from a plain text file |
| fetch | a segment | a segment | fetch content from the URLs in the fetchlist |
| fetch2 | a segment | a segment | another fetcher |
| parse | a segment | a segment | parse the content in a segment |
| updatedb | a segment | webDB | add new links to the crawldb according to the new segment |
| invertlinks | segments | linkdb | maintain an inverted link map, listing incoming links for each url |
| index | segments, linkdb, webDB | indexes | create indexes for the segments |
| dedup | indexes dir | indexes dir | delete duplicate documents in a set of Lucene indexes |
| merge | indexes dir | index | merge all indexes into a single index |
| readdb | webDB | information about the webDB | read utility for the webDB |
| mergedb | webDBs | webDB | merge several webDBs |
| readlinkdb | linkdb | information about the linkdb | read utility for the linkdb |
| mergelinkdb | linkdbs | linkdb | merge several linkdbs |
| readseg | segment | information about the segment | read utility for the segment |
| mergesegs | segments | segment | merge several segments |
| convdb | webDB | webDB | convert an old webDB into the new version |
| plugin | plugin class | NA | register a plugin |
| server | port, indexdir | NA | run a search server |
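The input and output columns describe a data flow: each step consumes what an earlier step produced. As a rough sketch, the Python below encodes the main crawl steps from the table and checks that every input is available when its step runs. The names `PIPELINE` and `missing_inputs` are hypothetical, and the labels are normalized for the illustration ("a segment" and "segments" both become `segment`, "indexes dir" becomes `indexes`).

```python
# (command, inputs, output) rows taken from the table, with normalized labels.
PIPELINE = [
    ("inject", {"urls dir"}, "webDB"),
    ("generate", {"webDB"}, "segment"),
    ("fetch", {"segment"}, "segment"),
    ("parse", {"segment"}, "segment"),
    ("updatedb", {"segment"}, "webDB"),
    ("invertlinks", {"segment"}, "linkdb"),
    ("index", {"segment", "linkdb", "webDB"}, "indexes"),
    ("dedup", {"indexes"}, "indexes"),
    ("merge", {"indexes"}, "index"),
]


def missing_inputs(pipeline, available):
    """Return (command, missing inputs) pairs for steps run before their inputs exist."""
    have = set(available)
    problems = []
    for command, inputs, output in pipeline:
        missing = inputs - have
        if missing:
            problems.append((command, sorted(missing)))
        have.add(output)  # the output becomes available to later steps
    return problems


if __name__ == "__main__":
    # Starting from only a urls dir, the full pipeline has no gaps.
    print(missing_inputs(PIPELINE, {"urls dir"}))
    # Skipping inject leaves generate without a webDB.
    print(missing_inputs(PIPELINE[1:], {"urls dir"}))
```

Starting from nothing but a urls dir, the full sequence has no gaps. Dropping inject leaves generate without a webDB, which is why the crawl pseudo-code runs inject before the loop.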