Search Engine

Java, Web, Search Engine

1. A complete crawl procedure can be expressed with the following pseudo-code:

inject: add the URLs listed in the urls file to the WebDB;
for (i = 0; i < depth; i++) {
    generate: create a new segment and generate a fetchlist from the WebDB;
    fetch: fetch the content of the URLs in the new fetchlist;
    parse: parse the content of the new segment;
    updatedb: add the new links found in the new segment to the WebDB (crawldb);
}
invertlinks: create the linkdb, listing the incoming links for each URL;
index: create indexes for the segments;
dedup: delete duplicate documents from the segment indexes;
merge: merge all the indexes into a single index;
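
The loop structure above can be sketched in Python. This is only an illustration of the ordering of the steps, not how Nutch itself drives a crawl: the function name `crawl_steps` is made up for this example, and the arguments each command takes are deliberately left out (they are elided as `...` when printed).

```python
def crawl_steps(depth):
    """Return the ordered command names for a crawl of the given depth.

    Hypothetical helper: it mirrors the pseudo-code, nothing more.
    """
    steps = ["inject"]  # seed the WebDB once
    for _ in range(depth):
        # one generate/fetch/parse/updatedb round per level of depth
        steps += ["generate", "fetch", "parse", "updatedb"]
    # link inversion and indexing happen once, after all rounds
    steps += ["invertlinks", "index", "dedup", "merge"]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(crawl_steps(2), start=1):
        # arguments are omitted on purpose; see the usage text of each command
        print(f"{number:2d}. bin/nutch {step} ...")
```

With a depth of 2 the list contains one inject, two rounds of the four per-segment commands, and the four final commands, so 13 steps in total.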

2. Nutch also provides a set of utility commands:
for the WebDB:
  readdb: read utility
  mergedb: merge utility
  convdb: converter for the old database version
for the linkdb:
  readlinkdb: read utility
  mergelinkdb: merge utility
for segments:
  readseg: read utility
  mergesegs: merge utility

3. Besides those above, there are two system commands:
plugin: load a plugin and run one of its classes
server: run a search server


4. Here is a complete list of all the commands, with a short description of each:

Command      Input                    Output                     Task
crawl        urls dir                 all                        do the whole crawl in a single command
inject       urls dir                 WebDB                      add the URLs listed in the urls file to the WebDB
generate     WebDB                    a segment                  create a new segment and generate a fetchlist from the WebDB
freegen      urls dir                 a segment                  create a new segment and generate a fetchlist from plain text
fetch        a segment                a segment                  fetch the content of the URLs in the fetchlist
fetch2       a segment                a segment                  another fetcher (the Fetcher2 implementation)
parse        a segment                a segment                  parse the content of a segment
updatedb     a segment                WebDB                      add the new links found in the segment to the WebDB (crawldb)
invertlinks  segments                 linkdb                     maintain an inverted link map, listing the incoming links for each URL
index        segments, linkdb, WebDB  indexes                    create indexes for the segments
dedup        indexes dir              indexes dir                delete duplicate documents in a set of Lucene indexes
merge        indexes dir              index                      merge all the indexes into a single index
readdb       WebDB                    information about WebDB    read utility for the WebDB
mergedb      WebDBs                   WebDB                      merge several WebDBs
readlinkdb   linkdb                   information about linkdb   read utility for the linkdb
mergelinkdb  linkdbs                  linkdb                     merge several linkdbs
readseg      segment                  information about segment  read utility for a segment
mergesegs    segments                 segment                    merge several segments
convdb       WebDB                    WebDB                      convert an old WebDB to the new version
plugin       plugin class             N/A                        load a plugin and run one of its classes
server       port, index dir          N/A                        run a search server
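
The input and output columns can also be read as a data flow: each command needs certain data structures and produces others. As a rough sketch, the Python below transcribes those two columns for the crawl commands and checks that a given command order never uses something that has not been produced yet. The names `COMMANDS` and `check_order` are invented for this example, and "a segment" and "segments" are collapsed into one label for simplicity.

```python
# (inputs, outputs) per command, transcribed from the table.
# Simplification for this sketch: "a segment"/"segments" -> "segment".
COMMANDS = {
    "inject":      ({"urls"}, {"webdb"}),
    "generate":    ({"webdb"}, {"segment"}),
    "fetch":       ({"segment"}, {"segment"}),
    "parse":       ({"segment"}, {"segment"}),
    "updatedb":    ({"segment"}, {"webdb"}),
    "invertlinks": ({"segment"}, {"linkdb"}),
    "index":       ({"segment", "linkdb", "webdb"}, {"indexes"}),
    "dedup":       ({"indexes"}, {"indexes"}),
    "merge":       ({"indexes"}, {"index"}),
}


def check_order(steps, available=("urls",)):
    """Walk the steps in order and fail on the first unmet input.

    Returns the set of data structures that exist after the last step.
    """
    have = set(available)
    for step in steps:
        needs, makes = COMMANDS[step]
        missing = needs - have
        if missing:
            raise ValueError(f"{step} needs {sorted(missing)} first")
        have |= makes
    return have


if __name__ == "__main__":
    order = ["inject", "generate", "fetch", "parse", "updatedb",
             "invertlinks", "index", "dedup", "merge"]
    print(sorted(check_order(order)))
```

Running `index` before `invertlinks`, for example, raises a `ValueError` because the linkdb does not exist yet. This only models the table; it does not check what the real commands accept.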

posted on 2007-10-03 23:32 by 专心练剑. Category: Search Engine