

   1. IndexWriter

       IndexWriter is the central component of the indexing process. This class creates
       a new index and adds documents to an existing index. You can think of IndexWriter
       as an object that gives you write access to the index but doesn’t let you read
       or search it.
       variables:
         Directory directory -  where the index is stored
         Analyzer analyzer   -  how text is analyzed
       methods:
         addDocument(Document, Analyzer)  -  adds a document, optionally with an analyzer
                                             other than the writer's default
         addIndexes(Directory[] dirs)     -  merges other indexes into this one
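
       A minimal sketch of using IndexWriter to create an index and add one document,
       assuming the Lucene 1.4-era API these notes describe; the index path and field
       values are just placeholders:

       import org.apache.lucene.analysis.standard.StandardAnalyzer;
       import org.apache.lucene.document.Document;
       import org.apache.lucene.document.Field;
       import org.apache.lucene.index.IndexWriter;

       public class IndexerSketch {
           public static void main(String[] args) throws Exception {
               // 'true' creates a new index at the given path, overwriting any existing one
               IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

               Document doc = new Document();
               doc.add(Field.Keyword("path", "/docs/a.txt"));   // stored verbatim, not analyzed
               doc.add(Field.Text("contents", "hello lucene")); // analyzed, indexed, stored
               writer.addDocument(doc);                         // uses the writer's default analyzer

               writer.optimize(); // optional: merge segments for faster searching
               writer.close();
           }
       }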

    2. Directory
       The Directory class represents the location of a Lucene index. It is an abstract
       class that allows its subclasses (two of which are included in the Lucene core) to
       store the index as they see fit.
       Five concrete implementations (core and contrib) are listed below:

       CompoundFileReader - for accessing a compound stream.
       DbDirectory - a Berkeley DB 4.3 based implementation
       FSDirectory - Straightforward implementation of Directory as a directory of files
       JEDirectory - Port of Andi Vajda's DbDirectory to the Java Edition of Berkeley DB
       RAMDirectory - A memory-resident Directory implementation.
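
       A small sketch contrasting the two core implementations, FSDirectory and
       RAMDirectory, again assuming the Lucene 1.4-era API; the path is a placeholder:

       import org.apache.lucene.analysis.SimpleAnalyzer;
       import org.apache.lucene.index.IndexWriter;
       import org.apache.lucene.store.Directory;
       import org.apache.lucene.store.FSDirectory;
       import org.apache.lucene.store.RAMDirectory;

       public class DirectoryDemo {
           public static void main(String[] args) throws Exception {
               // On-disk index: 'true' creates the directory, overwriting an existing index
               Directory onDisk = FSDirectory.getDirectory("/tmp/lucene-index", true);
               // Memory-resident index: handy for tests and short-lived indexes
               Directory inMemory = new RAMDirectory();

               // The same IndexWriter code works against either implementation
               IndexWriter writer = new IndexWriter(inMemory, new SimpleAnalyzer(), true);
               writer.close();
           }
       }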

    3. Analyzer
       The abstract class Analyzer is in charge of extracting tokens out of text to be indexed
       and eliminating the rest. Analyzers are an important part of Lucene and can be used for
       much more than simple input filtering. Lucene comes with several implementations:
       BrazilianAnalyzer - br
       ChineseAnalyzer   - cn
       CJKAnalyzer       - cjk
       CzechAnalyzer     - cz
       DutchAnalyzer     - nl
       FrenchAnalyzer    - fr
       GermanAnalyzer    - de
       GreekAnalyzer     - el
       RussianAnalyzer   - ru
       ThaiAnalyzer      - th
       KeywordAnalyzer   - "Tokenizes" the entire stream as a single token.
       PatternAnalyzer   - tokenizes text with a regular expression (java.util.regex) pattern
       PerFieldAnalyzerWrapper - used to facilitate scenarios where different fields require
                                 different analysis techniques.
       SimpleAnalyzer    - filters LetterTokenizer with LowerCaseFilter.
       SnowballAnalyzer  - Filters StandardTokenizer with StandardFilter->LowerCaseFilter
                         ->StopFilter->SnowballFilter
       StandardAnalyzer  - using a list of English stop words
       StopAnalyzer      - Filters LetterTokenizer with LowerCaseFilter and StopFilter
       WhitespaceAnalyzer - An Analyzer that uses WhitespaceTokenizer
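
       A rough sketch that prints the tokens a few of these analyzers produce for the same
       input, assuming the Lucene 1.4-era TokenStream API (next() returns null at the end
       of the stream); the field name "contents" and the sample text are arbitrary:

       import java.io.StringReader;
       import org.apache.lucene.analysis.Analyzer;
       import org.apache.lucene.analysis.SimpleAnalyzer;
       import org.apache.lucene.analysis.StopAnalyzer;
       import org.apache.lucene.analysis.Token;
       import org.apache.lucene.analysis.TokenStream;
       import org.apache.lucene.analysis.WhitespaceAnalyzer;
       import org.apache.lucene.analysis.standard.StandardAnalyzer;

       public class AnalyzerDemo {
           static void show(Analyzer analyzer, String text) throws Exception {
               TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
               System.out.print(analyzer.getClass().getName() + ": ");
               for (Token t = stream.next(); t != null; t = stream.next()) {
                   System.out.print("[" + t.termText() + "] ");
               }
               System.out.println();
           }

           public static void main(String[] args) throws Exception {
               String text = "The Quick Brown FOX jumped";
               show(new WhitespaceAnalyzer(), text); // splits on whitespace only
               show(new SimpleAnalyzer(), text);     // letters only, lowercased
               show(new StopAnalyzer(), text);       // lowercased, English stop words removed
               show(new StandardAnalyzer(), text);   // grammar-based tokenizer + stop words
           }
       }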

    4. Document
       Documents are the unit of indexing and search. It represents a collection of fields.
       Fields of a document represent the document or meta-data associated with that document.
       The meta-data such as author, title, subject, date modified, and so on, are indexed
       and stored separately as fields of a document.

       Variables:
           List fields;
           float boost;
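
       A short sketch of building a Document from meta-data fields like those above; the
       field values are hypothetical, and setBoost illustrates the boost variable (again
       using the Lucene 1.4-era factory methods):

       import org.apache.lucene.document.Document;
       import org.apache.lucene.document.Field;

       public class DocumentDemo {
           public static void main(String[] args) {
               Document doc = new Document();
               // Meta-data stored as separate fields of the document
               doc.add(Field.Keyword("author", "J. Doe"));
               doc.add(Field.Keyword("modified", "2007-10-31"));
               doc.add(Field.Text("title", "Lucene indexing notes"));
               doc.add(Field.Text("subject", "IndexWriter, Directory, Analyzer"));
               // The 'boost' variable: weight this document more heavily at search time
               doc.setBoost(1.5f);
               System.out.println(doc);
           }
       }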

    5. Field
       Each field corresponds to a piece of data that is either queried against or retrieved
       from the index during search.

       Lucene offers four different types of fields:

       Keyword  —  Isn’t analyzed, but is indexed and stored in the index verbatim. This type
       is suitable for fields whose original value should be preserved in its entirety, such
       as URLs, file system paths, dates, personal names, Social Security numbers, telephone
       numbers, and so on.

       UnIndexed  —  Is neither analyzed nor indexed, but its value is stored in the index as
       is. This type is suitable for fields that you need to display with search results, but
       whose values you’ll never search directly.

       UnStored — The opposite of UnIndexed. This field type is analyzed and indexed but isn’t
       stored in the index. It’s suitable for indexing a large amount of text that doesn’t
       need to be retrieved in its original form, such as bodies of web pages, or any other type
       of text document.

       Text  —  Is analyzed and indexed. This implies that fields of this type can be
       searched against, but be cautious about the field size. If the data indexed is a
       String, it’s also stored; but if the data (as in our Indexer example) comes from a
       Reader, it isn’t stored.

       Finally, UnStored and Text fields can be used to create term vectors (an advanced topic,
       covered in section 5.7).
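
       A sketch creating one field of each of the four types, assuming the Lucene 1.4-era
       factory methods; the URL, paths, and the file page.txt are placeholders:

       import java.io.FileReader;
       import org.apache.lucene.document.Document;
       import org.apache.lucene.document.Field;

       public class FieldTypesDemo {
           public static void main(String[] args) throws Exception {
               Document doc = new Document();
               doc.add(Field.Keyword("url", "http://example.com/a.html"));  // indexed verbatim + stored
               doc.add(Field.UnIndexed("thumbnail", "/img/a.png"));         // stored only, never searched
               doc.add(Field.UnStored("body", "full text of the page ...")); // analyzed + indexed, not stored
               doc.add(Field.Text("title", "Example page"));                // analyzed, indexed, and stored
               // Reader variant of Text: analyzed and indexed, but not stored
               doc.add(Field.Text("contents", new FileReader("page.txt")));
           }
       }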
posted on 2007-10-31 20:05 by 专心练剑, category: Search Engine