﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>IT博客-Rukas - Oh, My Blog!-随笔分类-搜索技术</title><link>http://www.cnitblog.com/rukas/category/1400.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 04 Oct 2011 03:39:00 GMT</lastBuildDate><pubDate>Tue, 04 Oct 2011 03:39:00 GMT</pubDate><ttl>60</ttl><item><title>解析搜索引擎常用的排序技术 </title><link>http://www.cnitblog.com/rukas/archive/2005/11/30/5028.html</link><dc:creator>Rukas - Oh, My Blog!</dc:creator><author>Rukas - Oh, My Blog!</author><pubDate>Wed, 30 Nov 2005 06:56:00 GMT</pubDate><guid>http://www.cnitblog.com/rukas/archive/2005/11/30/5028.html</guid><wfw:comment>http://www.cnitblog.com/rukas/comments/5028.html</wfw:comment><comments>http://www.cnitblog.com/rukas/archive/2005/11/30/5028.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cnitblog.com/rukas/comments/commentRss/5028.html</wfw:commentRss><trackback:ping>http://www.cnitblog.com/rukas/services/trackbacks/5028.html</trackback:ping><description><![CDATA[<FONT size=2>&nbsp;&nbsp; </FONT>
<P><FONT size=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Google有一个创始人叫Larry Page，据说PageRank的专利是他申请的，于是依据他的名字就有了Page Rank。国内也有一家很成功的搜索引擎公司，叫百度（</FONT><A href="http://www.baidu.com/"><FONT size=2>http://www.baidu.com</FONT></A><FONT size=2> ）。百度的创始人李彦宏说，早在1996年他就申请了名为超链分析的专利，PageRank的原理和超链分析的原理是一样的，而且PageRank目前还在Paten-pending（专利申请中）。言下之意是这里面存在专利所有权的问题。这里不讨论专利所有权，只是从中可看出，成功搜索引擎的排序技术，就其原理上来说都差不多，那就是链接分析。超链分析和PageRank都属于链接分析。<BR><BR><STRONG>PageRank 分析：</STRONG><BR>PageRank的原理类似于科技论文中的引用机制：谁的论文被引用次数多，谁就是权威。说的更白话一点：张三在谈话中提到了张曼玉，李四在谈话中也提到张曼玉，王五在谈话中还提到张曼玉，这就说明张曼玉一定是很有名的人。在互联网上，链接就相当于“引用”，在B网页中链接了A，相当于B在谈话时提到了A，如果在C、D、E、F中都链接了A，那么说明A网页是最重要的，A网页的PageRank值也就最高。 <BR>&nbsp;&nbsp;&nbsp; 如何计算PageRank值有一个简单的公式 ： <BR></FONT><IMG alt="" src="http://www.fullsearcher.com/image/5.jpg"><BR><FONT size=2>其中：系数为一个大于0，小于1的数。一般设置为0.85。网页1、网页2至网页N表示所有链接指向A的网页。</FONT></P>
<P><FONT size=2>由以上公式可以看出三点 ： <BR>&nbsp;&nbsp; 1、链接指向A的网页越多，A的级别越高。即A的级别和指向A的网页个数成正比，在公式中表示，N越大， A的级别越高； <BR>&nbsp; </FONT><FONT size=2>2、链接指向A的网页，其网页级别越高， A的级别也越高。即A的级别和指向A的网页自己的网页级别成正比，在公式中表示，网页N级别越高， A的级别也越高； <BR>&nbsp; </FONT><FONT size=2>3、链接指向A的网页，其链出的个数越多，A的级别越低。即A的级别和指向A的网页自己的网页链出个数成反比，在公式中现实，网页N链出个数越多，A的级别越低。 </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp;&nbsp; 每个网页有一个PageRank值，这样形成一个巨大的方程组，对这个方程组求解，就能得到每个网页的PageRank值。互联网上有上百亿个网页，那么这个方程组就有上百亿个未知数，这个方程虽然是有解，但计算毕竟太复杂了，不可能把这所有的页面放在一起去求解的。对具体的计算方法有兴趣的朋友可以去参考一些数值计算方面的书。 </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;总之，PageRank有效地利用了互联网所拥有的庞大链接构造的特性。 从网页A导向网页B的链接，用Google创始人的话讲，是页面A对页面B的支持投票，Google根据这个投票数来判断页面的重要性，但Google除了看投票数（链接数）以外，对投票者（链接的页面）也进行分析。「重要性」高的页面所投的票的评价会更高，因为接受这个投票页面会被理解为「重要的物品」。从新浪、雅虎、微软的首页都有我网页的三个链接的话，可能比我在其他网站找三十个链接还强。如果还有人不理解这个原理，就去想想有句成语叫：三人成虎。如果有三个人都说北京大街上有老虎，那么许多人会认为有老虎，如果这三个人都是国家领导人的话，那么所有人都会认为北京大街上有老虎。 </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp; 每个网页都会有PageRank值，如果大家想知道自己网站的网页PageRank值是多少，最简单的办法就是下载一个Google的免费工具</FONT><A href="http://toolbar.google.com/"><FONT size=2>http://toolbar.google.com/</FONT></A><FONT size=2> <BR>&nbsp;&nbsp; 据Google技术负责人介绍，Google除了用PageRank衡量网页的重要程度以外还有其它上百种因素来参与排序。其它搜索引擎也是如此，不可能按照某一种规则来进行搜索结果的排序。 </FONT></P>
<P><FONT size=2><STRONG>HillTop算法：<BR></STRONG>HillTop同样是一项搜索引擎结果排序的专利，是Google的一个工程师Bharat在2001年获得的专利。Google的排序规则经常在变化，但变化最大的一次也就是基于HillTop算法进行了优化。HillTop究竟原理如何，值得Google如此青睐？ </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp; 其实HillTop算法的指导思想和PageRank的是一致的，都是通过网页被链接的数量和质量来确定搜索结果的排序权重。但HillTop认为只计算来自具有相同主题的相关文档链接对于搜索者的价值会更大：即主题相关网页之间的链接对于权重计算的贡献比主题不相关的链接价值要更高。如果网站是介绍“服装”的，有10个链接都是从“服装”相关的网站链接过来，那这10个链接比另外10个从“电器”相关网站链接过来的贡献要大。Bharat称这种对主题有影响的文档为“专家”文档，从这些专家文档页面到目标文档的链接决定了被链接网页“权重得分”的主要部分。 </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp; 与PageRank结合HillTop算法确定网页与搜索关键词的匹配程度的基本排序过程取代了过份依靠PageRank的值去寻找那些权威页面的方法。这对于两个具有同样主题而且PR相近的网页排序过程中，HillTop算法就显得非常的重要了。HillTop同时也避免了许多想通过增加许多无效链接来提高网页PageRank值的做弊方法。 </FONT></P>
<P><FONT size=2><STRONG>锚文本（Anchor Text）</STRONG><BR>锚文本名字听起来难以理解，实际上锚文本就是链接文本。例如，在个人网站上把中央电视台（ </FONT><A href="http://www.cctv.com/"><FONT size=2>www.cctv.com</FONT></A><FONT size=2> ）做为新闻频道的链接，访问者通过点击网站上的“新闻频道”就能进入 </FONT><A href="http://www.cctv.com/"><FONT size=2>http://www.cctv.com</FONT></A><FONT size=2> 网站，那么“新闻频道”就是中央电视台网站首页的锚文本。 <BR>&nbsp;&nbsp;&nbsp; 锚文本可以做为锚文本所在的页面的内容的评估。正常来讲，页面中增加的链接都会和页面本身的内容有一定的关系。服装的行业网站上会增加一些同行网站的链接或者一些做服装的知名企业的链接；另一方面，锚文本能做为对所指向页面的评估。锚文本能精确的描述所指向页面的内容，个人网站上增加Google的链接，锚文本为“搜索引擎”。这样通过锚文本本身就能知道，Google是搜索引擎。 <BR>&nbsp;&nbsp;&nbsp; 锚文本对搜索引擎起的作用还表现为可以收集一些搜索引擎不能索引的文件。例如，网站上增加了一张张曼玉的照片，格式为jpg文件，搜索引擎目前很难索引（一般只处理文本）。若这张照片链接的锚文本为“张曼玉的照片”，那么搜索引擎就能识别这张图片是张曼玉的照片，以后访问者搜索“张曼玉”的时候，这张图片就能被搜索到。<BR>&nbsp;&nbsp;&nbsp; 由此可见，在网计中选择合适的锚文本，会让所在网页和所指向网页的重要程度有所提升。 </FONT></P>
<P><FONT size=2><STRONG>页面版式:</STRONG></FONT></P>
<P><FONT size=2>每个网页都有版式，包括标题、字体、标签等等。搜索引擎也会利用这些版式来识别搜索词与页面内容的相关程度。以静态的html格式的网页为例，搜索引擎通过网络蜘蛛把网页抓取下来后，需要提取里面的正文内容，过滤其他html代码。在提取内容的时候，搜索引擎就可以记录所有版式信息，包括：哪些词是在标题中出现，哪些词是在正文中出现，哪些词的字体比其他的字体大，哪些词是加粗过，哪些词是用KeyWord标识过的等等。这样在搜索结果中就可以根据这些信息来确定所搜索的结果和搜索词的相关程度。例如搜索“毛泽东”，假如有两个结果，一篇文章标题是《毛泽东的一生》，另一篇文章的标题是《江青的一生》但内容有提到毛泽东，这时搜索引擎会认为前者比较重要，因为“毛泽东”在标题里出现了。 <BR>&nbsp;&nbsp;&nbsp; 因此，合理的利用网页的页面版式，会提升网页在搜索结果页的排序位置</FONT></P>
<P><FONT size=2><STRONG>收费排名:</STRONG><BR>应该说收费排名并不属于排序技术（这里指的收费排名也包括竞价排名），而是一种搜索引擎的赢利模式。但收费排名已经最直接的影响到了搜索引擎的排序，在此也略做说明。 </FONT></P>
<P><FONT size=2>&nbsp;&nbsp;&nbsp; 用户可以购买某个关键词的排名，只要向搜索引擎公司交纳一定的费用，就可以让用户的网站排在搜索结果的前几位，按照不同关键词、不同位置、时间长短来定义价格。价格从几千元到几十万元不等（像“六合彩”在3721上的排名费用大多是几十万）。&nbsp;</FONT><FONT size=2><BR>&nbsp; &nbsp;对于企业来说，收费排名是提升网站在搜索引擎中排名的最直接和最简单的办法。如今，如何提升网页在搜索引擎中的排序，已经形成了一门职业，叫SEO（Search Engine Optimization），即搜索引擎优化。SEO是针对搜索引擎排序的技术，通过修改网页（或者网站）结构和主动增加网站链接等方法来让搜索引擎认为这些网页是很重要的，从而提升网页在搜索引擎结果中的排序.</FONT></P><img src ="http://www.cnitblog.com/rukas/aggbug/5028.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cnitblog.com/rukas/" target="_blank">Rukas - Oh, My Blog!</a> 2005-11-30 14:56 <a href="http://www.cnitblog.com/rukas/archive/2005/11/30/5028.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>搜索引擎设计实用教程-以百度为例 之三:对百度分词算法的进一步分析</title><link>http://www.cnitblog.com/rukas/archive/2005/11/30/5027.html</link><dc:creator>Rukas - Oh, My Blog!</dc:creator><author>Rukas - Oh, My Blog!</author><pubDate>Wed, 30 Nov 2005 06:55:00 GMT</pubDate><guid>http://www.cnitblog.com/rukas/archive/2005/11/30/5027.html</guid><wfw:comment>http://www.cnitblog.com/rukas/comments/5027.html</wfw:comment><comments>http://www.cnitblog.com/rukas/archive/2005/11/30/5027.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cnitblog.com/rukas/comments/commentRss/5027.html</wfw:commentRss><trackback:ping>http://www.cnitblog.com/rukas/services/trackbacks/5027.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT face=Arial size=2>/*版权声明：可以任意转载，转载时请务必标明文章原始出处和作者信息 .*/</FONT>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT size=2><SPAN lang=EN-US><SPAN style="mso-spacerun: yes"></SPAN></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT size=3><SPAN lang=EN-US><SPAN style="mso-spacerun: yes"><FONT face="Times New Roman">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </FONT></SPAN></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">搜</SPAN></FONT><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">索引擎设计实用教程</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">(3)-</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">以百度为例</SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-US><FONT face="Times New Roman"><FONT size=3>&nbsp;</FONT></FONT></SPAN><FONT size=3><SPAN lang=EN-US><SPAN style="mso-spacerun: yes"><FONT face="Times New Roman">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </FONT></SPAN></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">之三</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">:</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">对百度分词算法的进一步分析</SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-US><FONT face="Times New Roman"><FONT size=3>&nbsp;<SPAN><STRONG>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</STRONG></SPAN><SPAN><STRONG>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</STRONG></SPAN></FONT></FONT></SPAN><SPAN><STRONG>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</STRONG> </SPAN></P>
<P></P>
<P></P><SPAN><STRONG>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </STRONG>&nbsp;中科院软件所</SPAN> <SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">malefactor</SPAN> 
<P></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-US style="FONT-SIZE: 10.5pt; FONT-FAMILY: 'Times New Roman'; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: 宋体; mso-font-kerning: 1.0pt; mso-ansi-language: EN-US; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2005</SPAN><SPAN style="FONT-SIZE: 10.5pt; FONT-FAMILY: 宋体; mso-bidi-font-size: 12.0pt; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'; mso-font-kerning: 1.0pt; mso-ansi-language: EN-US; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA">年</SPAN><SPAN lang=EN-US style="FONT-SIZE: 10.5pt; FONT-FAMILY: 'Times New Roman'; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: 宋体; mso-font-kerning: 1.0pt; mso-ansi-language: EN-US; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA">11</SPAN><SPAN style="FONT-SIZE: 10.5pt; FONT-FAMILY: 宋体; mso-bidi-font-size: 12.0pt; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'; mso-bidi-font-family: 'Times New Roman'; mso-font-kerning: 1.0pt; mso-ansi-language: EN-US; mso-fareast-language: ZH-CN; mso-bidi-language: AR-SA">月</SPAN></P>
<P></P>
<P></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 21pt; mso-char-indent-count: 2.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">上面说过</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">经过分析得出百度的分词系统采用双向最大匹配分词</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">但是后来发现推理过程</SPAN></FONT><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">中存在一个漏洞</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">而且推导出来的百度分词算法步骤还是过于繁琐</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以进一步进行分析</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN></FONT><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">看看是否前面的推导有错误</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 21pt; mso-char-indent-count: 2.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么以前的分析有什么漏洞呢</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">我们推导百度分词有反向最大匹配的依据是百度将</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">分词为</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">从这里看好像采用了反向最大匹配</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">因为正向最大匹配的结果应该是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北京</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">华</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">但是由此就推论说百度采用了双向最大匹配还是太仓促了</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">前面文章我们也讲过</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度有两个词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">一个普通词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">一个专有词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">而且是专有词典的词汇先切分</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">然后将剩余片断交给普通词典去切分</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以上面的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">之所以被切分成</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">另外一个可能是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">:</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云这个词汇是在专有词典里面存储的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以先分析</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这样得出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">剩下</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">没什么好切分的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以输出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 31.5pt; mso-char-indent-count: 3.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这里只是假设</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么是否确实</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">在专有词典呢</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">我们再看一个例子</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东北京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度切分的结果是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">在普通词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果是反向切分</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么结果应该是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">东北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果是正向切分应该是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北京</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">华</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">无论如何都分不出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这说明什么</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">说明</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">是在那个专有词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以先切分出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">然后剩下的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">交由普通词典切分</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">明显是正向最大匹配的结果输出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">当然按照我们在第一篇文章的算法推导</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">的切分也会得出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">北</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">的结论</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">但是明显比正向最大匹配多几个判断步骤</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">既然效果一样</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">另外一个更加简洁的方法也能说得通</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那当然选择简便的方法了</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以初步判断百度采取的是正向最大匹配</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 21pt; mso-char-indent-count: 2.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">我们继续测试采用何种分词算法</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">为了减少专有词典首先分词造成的影响</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么查询里面不能出现相对特殊的词汇</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">构筑查询</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">天才能量级</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这里应该没有专有词典出现过的词汇</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度切分为</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">天才</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">能量</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">级</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">看来是正向最大匹配的结果</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">另外</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果所有查询词汇都出现在专有词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么采取的是何种方法</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这样首先就得保证词汇都出现在专有词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这么保证这一点呢</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">我们构造查询</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">铺陈晓东方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度切分为</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">铺</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">可以看出</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">是在专有词典的所以先切分出来</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">另外一个例子</SPAN><SPAN lang=EN-US><FONT face="Times New Roman"> "</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东京城</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度切分为</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">山东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京城</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">说明</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">东京</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">是在普通词典的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.OK,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">构造查询</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">通过前面分析可以看出两个词汇都在专有词典里面</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度切分为</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">京华烟云</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">说明对于专有词典词汇也是采取正向最大匹配或者双向最大匹配</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么使用反向最大匹配了吗</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">?</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">构造查询例子</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东方不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">",</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">首先我们肯定</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">和</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">东方不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">都是在专有词典出现的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果是正向切分</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">那么应该是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">或者</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">如果是反向切分则是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">晓</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">东方不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">可以看出百度的切分是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">或者</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">说明采用的是正向最大匹配</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">通过分析</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">百度的词典不包含</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">"</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">这个单词</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以实际上百度的切分结果是</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&lt;</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">陈晓东</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">方</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">败</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">&gt;,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">很明显这和我们以前推导的算法是有矛盾的</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以以前的分析算法确实有问题</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以结论是百度采取的是正向最大匹配算法</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 21pt; mso-char-indent-count: 2.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">重新归纳一下百度的分词系统</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">:</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">首先用专有词典采用最大正向匹配分词</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">切分出部分结果</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">剩余没有切分交给普通词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">同样采取正向最大匹配分词</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">最后输出结果</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt; TEXT-INDENT: 21pt; mso-char-indent-count: 2.0; mso-char-indent-size: 10.5pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">另外</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,GOOGLE</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">也是采用正向最大匹配分词算法</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不过好像没有那个专用词典</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">所以很多专名都被切碎了</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><FONT size=3><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">从这点讲</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,GOOGLE</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">在中文词典构建上比百度差些</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">还需要加把子力气才行</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">,</FONT></SPAN><SPAN style="FONT-FAMILY: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman'">不过这也不是什么多难的事</SPAN><SPAN lang=EN-US><FONT face="Times New Roman">.</FONT></SPAN></FONT></P>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt"><SPAN lang=EN-US><FONT size=3><FONT face="Times New Roman"> 
<P></P></FONT></FONT></SPAN><img src ="http://www.cnitblog.com/rukas/aggbug/5027.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cnitblog.com/rukas/" target="_blank">Rukas - Oh, My Blog!</a> 2005-11-30 14:55 <a href="http://www.cnitblog.com/rukas/archive/2005/11/30/5027.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>搜索引擎设计实用教程-以百度为例 之二:Spelling Checker拼写检查错误提示(以及拼音提示功能) </title><link>http://www.cnitblog.com/rukas/archive/2005/11/30/5026.html</link><dc:creator>Rukas - Oh, My Blog!</dc:creator><author>Rukas - Oh, My Blog!</author><pubDate>Wed, 30 Nov 2005 06:54:00 GMT</pubDate><guid>http://www.cnitblog.com/rukas/archive/2005/11/30/5026.html</guid><wfw:comment>http://www.cnitblog.com/rukas/comments/5026.html</wfw:comment><comments>http://www.cnitblog.com/rukas/archive/2005/11/30/5026.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cnitblog.com/rukas/comments/commentRss/5026.html</wfw:commentRss><trackback:ping>http://www.cnitblog.com/rukas/services/trackbacks/5026.html</trackback:ping><description><![CDATA[<P>搜索引擎设计实用教程(2)-以百度为例 <BR>　　之二:Spelling Checker拼写检查错误提示(以及拼音提示功能) <BR>　　<BR>　　中科院软件所 张俊林</P>
<P>　　2005年11月 </P>
<P><BR>　　拼写检查错误提示是搜索引擎都具备的一个功能,也就是说用户提交查询给搜索引擎,搜索引擎检查看是否用户输入的拼写有错误,对于中文用户来说一般造成的错误是输入法造成的错误.那么我们就来分析看看百度是怎么实现这一功能的. <BR>　　我们分析拼写检查系统关注以下几个问题: <BR>　　(1)系统如何判断用户的输入是有可能发生错误的查询呢? <BR>　　(2)如果判断是可能错误的查询输入,如何提示正确的词汇呢? <BR>　　<BR>　　那么百度是如何做的呢?百度判断用户输入是否错误的标准,我觉得应该是查字典,如果发现字典里面不包含这个词汇,那么很有可能是个错误的输入,此时启动错误提示功能,这个很好判断,因为如果是一个正常词汇的话,百度一般不会有错误提示,而你故意输入一个词典不可能包含的所谓词汇,此时百度一般会提示你正确的检索词汇. <BR>　　那么百度是怎么提示正确词汇的呢?很明显是通过拼音的方式,比如我输入查询” 制才”,百度提供的提示词汇为: “:制裁 质材 纸材”,都是同音字.所以百度必然维持着一个同音词词典,里面保留着同音词信息,比如可能包含着下面这条词条: “ zhi cai à制裁,质材,纸材”,另外还有一个标注拼音程序,现在能够看到的基本流程是: 用户输入” 制才”,查词典,发现没有这个词汇,OK,启动标注拼音程序,将” 制才”标注为拼音”zhi cai”,然后查找同音词词典,发现同音词” 制裁,质材,纸材”,那么提示用户可能的正确拼写. <BR>　　整体流程看起来很简单,但是还有一些遗留的小问题,比如是否将词表里面所有同音词都作为用户的提示信息呢?比如 某个拼音有10个同音词,是否都输出呢?百度并没有将所有同音词都输出而是选择一定筛选标准,选择其中几个输出.怎么证明这一点?我们看看拼音”liu li”的同音词,紫光输入法提示同音词汇有” 流丽 流离 琉璃 流利”4个,我们看看百度返回几个,输入”流厉”作为查询,这里是故意输入一个词典不包含的词汇,这样百度的拼写检查才开始工作,百度提示: "琉璃 刘丽 刘莉 ",这说明什么?说明不是所有同音词都输出,而是选择输出,那么选择的标准是什么?我能够猜测到的方法是对于用户查询LOG进行统计,提取用户查询次数多的那些同音词输出,如果是这样的话,上面的例子说明用户搜索”琉璃”次数比其它的都要高些,次之是” 刘丽”,再次是” 刘莉”,看来大家都喜欢查询自己或者认识的人的名字. <BR>　　另外一个小问题:同音词词典包含2字词,3字词,那么是否包含4字词以及更长的词条?是否包含一字词? 这里一字词好回答,不用测试也能知道肯定不包含,因为你输入一个字,谁知道是否是错误的呢?反正只要是汉字就能在词表里面找到,所以没有判断依据.二字词是包含的,上面有例子,三字词也包含,比如查询 "中城药"百度错误提示:”中成药”,修改查询为"重城药",还是提示”中成药” ,再次修改查询 "重城要",百度依然提示”中成药”. 那么4字词汇呢? <BR>　　百度还是会给你提示的,下面是个例子: <BR>　　输入:静华烟云 提示 京华烟云 <BR>　　输入 静话烟云 提示 京华烟云 <BR>　　输入 静话阎晕 提示 京华烟云 <BR>　　那么更长的词汇是否提示呢?也提示,比如我输入: "落花世界有风军",这个查询是什么意思,估计读过古诗的都知道,看看百度的提示”落花时节又逢君”,这说明什么?说明同音词词典包含不同长度的同音词信息,另外也说明了百度的核心中文处理技术,也就是那个词典,还真挺大的. <BR>　　但是,如果用户输入的查询由两个或者两个以上子字符串构成,那么百度的错误提示功能就罢工了,比如输入查询"哀体",百度提示”艾提 挨踢”,但是.输入为 "我 哀体",则没有任何错误提示. <BR>　　还有一个比较重要的问题:如果汉字是多音字那么怎么处理?百度呢比较偷懒,它根本就没有对多音字做处理.我们来看看百度的一个标注拼音的错误,在看这个错误前先看看对于多音字百度是怎么提示错误的,我们输入查询"俱长",百度提示”剧场 局长”, “俱长”的拼音有两个:”ju zhang /ju chang” ,可见如果是多音字则几种情况都提示..现在我们来看看错误的情况, 我们输入查询”剧常”,百度提示”:剧场 局长”,提示为”剧场"当然好解释,因为是同音字,但是为什么 "局长"也会被提示呢?这说明百度的同音字词典有错误,说明在”ju chang”这个词条里面包含”局长”这个错误的同音词.让我们顺藤摸瓜,这个错误又说明什么问题呢?说明百度的同音词典是自动生成的,而且没有人工校对.还说明在自动生成同音词典的过程中,百度不是根据对一篇文章标注拼音然后在抽取词汇和对应的拼音信息获得的,而是完全按照某个词典的词条来标注音节的,所以对于多音字造成的错误无法识别出来,如果是对篇章进行拼音标注,可能就不会出现这种很容易发现的错误标注. 当然还有另外一种解释,就是"局长"是故意被百度提示出来可能的正确提示词汇,因为考虑到南方人"zh'和 "ch"等前后鼻音分不清么,那么是这样的么?我们继续测试到底是何种情况.是百度有错误还是这是百度的先进的算法? <BR>　　我们考虑词汇"长大",故意错误输入为"赃大",如果百度考虑到了前后鼻音的问题,那么应该会提示"长大",但是百度提示是"藏大".这说明什么?说明百度并没有考虑前后鼻音问题,根本就是系统错误. 我们输入查询"悬赏",故意将之错误输入为"悬桑",没有错误提示,说明确实没有考虑这种情况.前鼻音没有考虑,那么后鼻音考虑了么,我们输入”:经常”,故意改为后鼻音 "经缠",百度提示为"经产 经忏",还是没有考虑后鼻音.这基本可以确定是百度系统的错误导致. <BR>　　根据以上推导,我们可以得出如下结论:百度是将分词词典里面每个词条利用拼音标注程序标注成拼音,然后形成同音词词典,所以两个词典是同样大的,而且这个词典也随着分词词典的增长而在不断增长. 至于标注过程中多音字百度没有考虑,如果是多音字就标注成多个发音组合,通过这种方式形成同音词词典.这样的同音词词典显然包含着很多错误. <BR>　　最后一个问题:百度对于英文进行拼写检查么?让我们试试看,输入查询”china”,不错,搜到不少结果,专注中文搜索的百度还能搜索到英文,真是意外的惊喜.变换一下查询”chine”,会更加意外惊喜的给我们提示”china”吗?百度提示的是: 吃呢 持呢,原来是不小心触发了百度的拼音搜索功能了.那么拼音搜索和中文检查错误是否采用同一套同音词词典呢,让我们来实验一下,搜索”rongji”,百度提示” 榕基 溶剂 容积”,OK,换个中文查询”容机”,百度提示” 榕基 溶剂 容积”,看来使用的是同一套同音词词典.也就是说百度的中文纠错和拼音检索使用的机制相同,中文纠错多了一道拼音注音的过程而已.难道这就是传说中那个百度的”事实上是一个无比强大的拼音输入法”的拼音提示功能么? <BR>　　最后让我们总结归纳一下百度的拼写检查系统: <BR>　　后台作业: (1)前面的文章我们说过,百度分词使用的词典至少包含两个词典一个是普通词典,另外一个是专用词典(专名等),百度利用拼音标注程序依次扫描所有词典中的每个词条,然后标注拼音,如果是多音字则把多个音都标上,比如”长大”,会被标注为”zhang da/chang da”两个词条. <BR>　　(2)通过标注完的词条,建立同音词词典,比如上面的”长大”,会有两个词条: zhang daà长大” , chang daà长大. <BR>　　(3)利用用户查询LOG频率信息给予每个中文词条一个权重; <BR>　　(4)OK,同音词词典建立完成了,当然随着分词词典的逐步扩大,同音词词典也跟着同步扩大; <BR>　　<BR>　　拼写检查: <BR>　　(1)用户输入查询,如果是多个子字符串,不作拼写检查; <BR>　　(2)对于用户查询,先查分词词典,如果发现有这个单词词条,OK,不作拼写检查; <BR>　　(3)如果发现词典里面不包含用户查询,启动拼写检查系统;首先利用拼音标注程序对用户输入进行拼音标注; <BR>　　(4)对于标注好的拼音在同音词词典里面扫描,如果没有发现则不作任何提示; <BR>　　(5)如果发现有词条,则按照顺序输出权重比较大的几个提示结果; <BR>　　<BR>　　拼音提示: <BR>　　(1)对于用户输入的拼音在同音词词典里面扫描,如果没有发现则不作任何提示; <BR>　　(2)如果发现有词条,则按照顺序输出权重比较大的几个提示结果; </P><BR><BR>
<P id=TBPingURL>Trackback: http://tb.blog.csdn.net/TrackBack.aspx?PostId=537266<BR></P><img src ="http://www.cnitblog.com/rukas/aggbug/5026.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cnitblog.com/rukas/" target="_blank">Rukas - Oh, My Blog!</a> 2005-11-30 14:54 <a href="http://www.cnitblog.com/rukas/archive/2005/11/30/5026.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>搜索引擎设计实用教程-以百度为例 之一:查询处理以及分词技术</title><link>http://www.cnitblog.com/rukas/archive/2005/11/30/5025.html</link><dc:creator>Rukas - Oh, My Blog!</dc:creator><author>Rukas - Oh, My Blog!</author><pubDate>Wed, 30 Nov 2005 06:52:00 GMT</pubDate><guid>http://www.cnitblog.com/rukas/archive/2005/11/30/5025.html</guid><wfw:comment>http://www.cnitblog.com/rukas/comments/5025.html</wfw:comment><comments>http://www.cnitblog.com/rukas/archive/2005/11/30/5025.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cnitblog.com/rukas/comments/commentRss/5025.html</wfw:commentRss><trackback:ping>http://www.cnitblog.com/rukas/services/trackbacks/5025.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: 搜索引擎设计实用教程-以百度为例 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 之一:查询处理以及分词技术 ...&nbsp;&nbsp;<a href='http://www.cnitblog.com/rukas/archive/2005/11/30/5025.html'>阅读全文</a><img src ="http://www.cnitblog.com/rukas/aggbug/5025.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cnitblog.com/rukas/" target="_blank">Rukas - Oh, My Blog!</a> 2005-11-30 14:52 <a href="http://www.cnitblog.com/rukas/archive/2005/11/30/5025.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>