Thunder Bird
Communication & Improvement
posts - 47,  comments - 155,  trackbacks - 0

搞自然语言处理的应该不会对这个概念感到陌生,编辑距离就是用来计算从原串(s)转换到目标串(t)所需要的最少的插入,删除和替换的数目,在NLP中应用比较广泛,如一些评测方法中就用到了(wer,mWer等),同时也常用来计算你对原文本所作的改动数。
编辑距离的算法是首先由俄国科学家Levenshtein提出的,故又叫Levenshtein Distance。
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,

  • If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
  • If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

  • Spell checking
  • Speech recognition
  • DNA analysis
  • Plagiarism detection

The Algorithm

Steps

Step Description
1 Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.
2 Initialize the first row to 0..n.
Initialize the first column to 0..m.
3 Examine each character of s (i from 1 to n).
4 Examine each character of t (j from 1 to m).
5 If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.
6 Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7 After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

    G U M B O
  0 1 2 3 4 5
G 1          
A 2          
M 3          
B 4          
O 5          
L 6          

Steps 3 to 6 When i = 1

    G U M B O
  0 1 2 3 4 5
G 1 0        
A 2 1        
M 3 2        
B 4 3        
O 5 4        
L 6 5        

Steps 3 to 6 When i = 2

    G U M B O
  0 1 2 3 4 5
G 1 0 1      
A 2 1 1      
M 3 2 2      
B 4 3 3      
O 5 4 4      
L 6 5 5      

Steps 3 to 6 When i = 3

    G U M B O
  0 1 2 3 4 5
G 1 0 1 2    
A 2 1 1 2    
M 3 2 2 1    
B 4 3 3 2    
O 5 4 4 3    
L 6 5 5 4    

Steps 3 to 6 When i = 4

    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3  
A 2 1 1 2 3  
M 3 2 2 1 2  
B 4 3 3 2 1  
O 5 4 4 3 2  
L 6 5 5 4 3  

Steps 3 to 6 When i = 5

    G U M B O
  0 1 2 3 4 5
G 1 0 1 2 3 4
A 2 1 1 2 3 4
M 3 2 2 1 2 3
B 4 3 3 2 1 2
O 5 4 4 3 2 1
L 6 5 5 4 3 2

Step 7

The distance is in the lower right hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and 1 insertion = 2 changes).
由于,我在实际应用中要处理中文,每个汉字在内存中占两个字节,如果单纯用上述程序进行比较,就会有一些微小错误容易让人忽视,如汉字的“啊”和“阿”他们就有一个字节是相同的,一个字节是不同的,利用上述程序统计出的更改数除以2就会出现半个字,所以,对于汉英混合文本统计更改数时,需先判断当前进行比较的两个字是汉字还是西文字母,然后填写一个代价矩阵,在填写时,如果是汉字,要把其相邻的两个字节对应的代价矩阵赋为同一个值,具体做法,请看代码:
LD(const char *s, const char *t)
{
 int *d; // pointer to matrix
 int n; // length of s
 int m; // length of t
 int i; // iterates through s
 int j; // iterates through t
 char s_i1; // ith character of s
 char s_i2; // ith character of s
 char t_j1; // jth character of t
 char t_j2; // jth character of t
 int *cost; // cost代价矩阵
 int result; // result
 int cell; // contents of target cell
 int above; // contents of cell immediately above
 int left; // contents of cell immediately to left
 int diag; // contents of cell immediately above and to left
 int sz; // number of cells in matrix

 // Step 1 

 n = strlen (s);
 m = strlen (t);
 if (n == 0)
 {
  return m;
 }
 if (m == 0)
 {
  return n;
 }
 sz = (n+1) * (m+1) * sizeof (int);
 d = (int *) malloc (sz);
 cost = (int *) malloc (sz);

 // Step 2

 for (i = 0; i <= n; i++)
 {
  PutAt (d, i, 0, n, i);
 }

 for (j = 0; j <= m; j++)
 {
  PutAt (d, 0, j, n, j);
 }
 for (int g=0;g<=m;g++)//把代价距离矩阵全部初始化为同一个值,以后可根据此值判断相应的方格是否被赋过值
 {
  for(int h=0;h<=n;h++)
  {
   PutAt(cost,h,g,n,2);
  }
 }
 // Step 3

 for (i = 1; i <= n; i++)
 {

  s_i1 = s[i-1];
  s_i2 = s[i];
  bool sbd=false;
  bool tbd=false;
  if(s_i1>=' '&&s_i1<='@'||s_i1>='A'&&s_i1<='~')
  {//s为标点符号或其他非中文符号和数字
  sbd=true;
  }
  // Step 4

  for (j = 1; j <= m; j++)
  {

   tbd=false;
   t_j1 = t[j-1];
   t_j2 = t[j];
   // Step 5
   if(t_j1>=' '&&t_j1<='@'||t_j1>='A'&&t_j1<='~')
   {//t也为标点符号
    tbd=true;
   }
   if(!sbd)
   {//s为汉字
    if(!tbd)
    {//t也为汉字
     if (s_i1 == t_j1&&s_i2 == t_j2)
     {
      bool tt=false;
      int temp=GetAt(cost,i,j,n);
      if(temp==2)
      {
       PutAt(cost,i,j,n,0);
       tt=true;
      }
      if(tt)
      {//因为st全市汉字,所以把代价矩阵他相邻的未赋过值的三个格赋值
       int temp1=GetAt(cost,i+1,j,n);
       if(temp1==2)
       {
        PutAt(cost,i+1,j,n,0);
       }
       int temp2=GetAt(cost,i,j+1,n);
       if(temp2==2)
       {
        PutAt(cost,i,j+1,n,0);
       }
       int temp3=GetAt(cost,i+1,j+1,n);
       if(temp3==2)
       {
        PutAt(cost,i+1,j+1,n,0);
       }
      }
     }
     else
     {
      bool tt=false;
      int temp=GetAt(cost,i,j,n);
      if(temp==2)
      {
       PutAt(cost,i,j,n,1);
       tt=true;
      }
      if(tt)
      {
       int temp1=GetAt(cost,i+1,j,n);
       if(temp1==2)
       {
        PutAt(cost,i+1,j,n,1);
       }
       int temp2=GetAt(cost,i,j+1,n);
       if(temp2==2)
       {
        PutAt(cost,i,j+1,n,1);
       }
       int temp3=GetAt(cost,i+1,j+1,n);
       if(temp3==2)
       {
        PutAt(cost,i+1,j+1,n,1);
       }
      }
     }
    }
    else
    {//t为符号
     bool tt=false;
     int temp=GetAt(cost,i,j,n);
     if(temp==2)
     {
      PutAt(cost,i,j,n,1);
      tt=true;
     }
     if(tt)
     {
      int temp1=GetAt(cost,i+1,j,n);
      if(temp1==2)
      {
       PutAt(cost,i+1,j,n,1);
      }
     }
    
    }

   }
   else
   {//s为符号
    if(!tbd)
    {//t为汉字 
     bool tt=false;
     int temp=GetAt(cost,i,j,n);
     if(temp==2)
     {
      PutAt(cost,i,j,n,1);
      tt=true;
     }
     if(tt)
     {
      int temp1=GetAt(cost,i,j+1,n);
      if(temp1==2)
      {
       PutAt(cost,i,j+1,n,1);
      }
     }
    }
    else
    {
     if(s_i1==t_j1)
     {
      int temp=GetAt(cost,i,j,n);
      if(temp==2)
      {
       PutAt(cost,i,j,n,0);
      }
     }
     else
     {
      int temp=GetAt(cost,i,j,n);
      if(temp==2)
      {
       PutAt(cost,i,j,n,1);
      }
     }
    }

   }

    // Step 6

   above = GetAt (d,i-1,j, n);
   left = GetAt (d,i, j-1, n);
   diag = GetAt (d, i-1,j-1, n);
   int curcost=GetAt(cost,i,j,n);
   cell = Minimum (above + 1, left + 1, diag + curcost);
   PutAt (d, i, j, n, cell);
  }
 }

  // Step 7

  result = GetAt (d, n, m, n);
  free (d);
  return result;

}

posted on 2005-12-27 20:44 Thunder 阅读(12041) 评论(13)  编辑 收藏 引用

FeedBack:
# re: 编辑距离(Levenshtein Distance)
2006-04-10 16:13 | bawfnjechen@gmail.com
我需要一个java版的,谢谢  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2006-04-10 17:31 | Thunder
Java version
public class Distance {

//****************************
// Get minimum of three values
//****************************

private int Minimum (int a, int b, int c) {
int mi;

mi = a;
if (b < mi) {
mi = b;
}
if (c < mi) {
mi = c;
}
return mi;

}

//*****************************
// Compute Levenshtein distance
//*****************************

public int LD (String s, String t) {
int d[][]; // matrix
int n; // length of s
int m; // length of t
int i; // iterates through s
int j; // iterates through t
char s_i; // ith character of s
char t_j; // jth character of t
int cost; // cost

// Step 1

n = s.length ();
m = t.length ();
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
d = new int[n+1][m+1];

// Step 2

for (i = 0; i <= n; i++) {
d[i][0] = i;
}

for (j = 0; j <= m; j++) {
d[0][j] = j;
}

// Step 3

for (i = 1; i <= n; i++) {

s_i = s.charAt (i - 1);

// Step 4

for (j = 1; j <= m; j++) {

t_j = t.charAt (j - 1);

// Step 5

if (s_i == t_j) {
cost = 0;
}
else {
cost = 1;
}

// Step 6

d[i][j] = Minimum (d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1] + cost);

}

}

// Step 7

return d[n][m];

}

}

  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2006-05-14 02:18 | o0晨0o
麻烦把PutAt()和GetAt()两个函数也传上来吧,谢谢!  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2006-05-14 17:00 | Thunder
double CLdDlg::Minimum(double a, double b, double c)
{
double mi;

mi = a;
if (b < mi) {
mi = b;
}
if (c < mi) {
mi = c;
}
return mi;
}

double* CLdDlg::GetCellPointer(double *pOrigin, int col, int row, int nCols)
{
return pOrigin + col + (row * (nCols + 1));
}

double CLdDlg::GetAt(double *pOrigin, int col, int row, int nCols)
{
double *pCell;

pCell = GetCellPointer (pOrigin, col, row, nCols);
return *pCell;
}

void CLdDlg::PutAt(double *pOrigin, int col, int row, int nCols, double x)
{
double *pCell;
pCell = GetCellPointer (pOrigin, col, row, nCols);
*pCell = x;
}
  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2006-06-19 15:49 | Blau
请问有LD算法的完整C++版吗?谢谢  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2006-08-15 18:11 | yhhu
没必要这么复杂吧。只需对汉字字符串做预处理,转换成每个元素为单字节或者双字节的新字符串。后续处理照旧。  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2007-04-13 12:25 | Raining
提交一个比较简洁的,稳定的编辑距离实现。
如下代码用于比较中文串的编加距离。如果是英文串,就要简单些。

//最小编辑距离方法比较相似度。
int minEditDistance( const std::string& left, const std::string& right )
{
int lz = (int)left.size();
int rz = (int)right.size();
int size = lz>rz?lz:rz;

char lc[size][2];
char rc[size][2];

const char * ccl = left.c_str();
const char * ccr = right.c_str();

//先形成两个汉字序列。
int llen = 0;
for ( int i = 0; i < lz; i++ )
{
if ( 0 > ccl[i] )
{
lc[llen][0] = ccl[i++];
if ( i != lz ) lc[llen][1] = ccl[i];
}
else
{
lc[llen][0] = ccl[i];
lc[llen][1] = 0;
}
llen++;
}

int rlen = 0;
for ( int i = 0; i < rz; i++ )
{
if ( 0 > ccr[i] )
{
rc[rlen ][0] = ccr[i++];
if ( i != rz ) rc[rlen ][1] = ccr[i];
}
else
{
rc[rlen ][0] = ccr[i];
rc[rlen ][1] = 0;
}
rlen++;
}

//开始最小编辑距离计算。
int tmp[(llen+1)*(rlen+1)];
int ** C = new int*[llen+1];
for ( int i = 0; i < llen+1; i++ ) C[i] = tmp + i*(rlen+1);

int x, y, z;

for ( int i = 0; i < llen+1; i++ )
{
C[i][0] = i;
}
for ( int j = 0; j < llen+1; j++ )
{
C[0][j] = j;
}

for ( int i = 1; i < llen+1; i++ )
{
for ( int j = 1; j < rlen+1; j++ )
{
x = C[i-1][j] + 1;
y = C[i][j-1] + 1;

if ( lc[i-1][0] == rc[j-1][0] && lc[i-1][1] == rc[j-1][1] )
{
z = C[i-1][j-1];
}
else
{
z = C[i-1][j-1] + 1;
}

int min = x;
if ( y < min ) min = y;
if ( z < min ) min = z;

C[i][j] = min;
}
}

int result = C[llen-1][rlen-1];

delete [] C;

return result;
}
  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-03-11 13:38 | G36
这么简单的算法被你搞成这样,看着就想吐.世界上由很多种编码的,你先把字符串转化成unicode然后再算不就完了,I服了U  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-03-31 15:23 | ictliu
@G36
楼上不懂不要装懂!估计你至多也就是本科学历!
Levenshtein Distance是非常经典的一个动态规划算法!  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-04-18 11:32 |
楼主写的代码 根本不正确
可以使用,但是得到的是错误答案!  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-04-18 11:35 |
因为处理都是错误的搞毛啊!  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-04-18 11:36 |
英文都不能处理

代码错误!!!!

鉴定完毕  回复  更多评论
  
# re: 编辑距离(Levenshtein Distance)
2008-05-30 16:09 | schin
晕 中国的研究生国际上不认的 大家都是本科学历  回复  更多评论
  
只有注册用户登录后才能发表评论。

<2005年12月>
27282930123
45678910
11121314151617
18192021222324
25262728293031
1234567

常用链接

留言簿(8)

随笔档案

相册

搜索

  •  

最新评论

阅读排行榜

评论排行榜