How pinyin search works with the ES pinyin plugin, and why match_phrase matters

Tags: search, es, pinyin, how pinyin search works, match_phrase

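Background

Chinese search very often needs pinyin search, for example when searching people's names, and there is basically no way around this plugin.

Introduction

Plugin on GitHub: (link)

The example at the end of the README is quite interesting. After a series of steps, 刘德华 (Liu Dehua) gets indexed, and then liudh, 刘dh, and all sorts of odd queries still find it. Why is that? Let's take a close look.

Following the configuration from the official README:

Configuring the analyzer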

PUT /medcl3/
{
   "settings" : {
       "analysis" : {
           "analyzer" : {
               "pinyin_analyzer" : {
                   "tokenizer" : "my_pinyin"
                   }
           },
           "tokenizer" : {
               "my_pinyin" : {
                   "type" : "pinyin",
                   "keep_first_letter":true,
                   "keep_separate_first_letter" : true,
                   "keep_full_pinyin" : true,
                   "keep_original" : false,
                   "limit_first_letter_length" : 16,
                   "lowercase" : true
               }
           }
       }
   }
}

The key piece is the tokenizer, my_pinyin.
Its relevant settings are:
keep_first_letter: true; 刘德华 -> ldh
keep_separate_first_letter: true; 刘德华 -> l, d, h
keep_full_pinyin: true; 刘德华 -> liu, de, hua
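
To make the three options concrete, here is a minimal Python sketch that mimics the token stream they produce. It is not the plugin's code: the PINYIN table and the pinyin_tokens function are made up for illustration, and the real plugin ships its own dictionary and handles many more cases.

```python
# Hypothetical lookup table for this example only; the real plugin
# uses its own pinyin dictionary.
PINYIN = {"刘": "liu", "德": "de", "华": "hua", "黄": "huang", "渤": "bo"}


def pinyin_tokens(text, limit_first_letter_length=16):
    """Return (token, position) pairs roughly mirroring the three options."""
    syllables = [PINYIN[ch] for ch in text]
    tokens = []
    for pos, syllable in enumerate(syllables):
        tokens.append((syllable[0], pos))  # keep_separate_first_letter
        tokens.append((syllable, pos))     # keep_full_pinyin
    # keep_first_letter: the joined initials, emitted at position 0
    initials = "".join(s[0] for s in syllables)[:limit_first_letter_length]
    tokens.append((initials, 0))
    # sorted() is stable, so tokens at the same position keep their order
    return sorted(tokens, key=lambda t: t[1])


tokens = pinyin_tokens("刘德华")
for token, position in tokens:
    print(position, token)
```

For 刘德华 this gives seven tokens spread over three positions, with l, liu, and ldh all sharing position 0.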

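With these settings in place, run analyze on 刘德华 (the requests below use my own index, hjxtest_pinyin, which is assumed to have the same analysis settings):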

GET /hjxtest_pinyin/_analyze
{
  "text": "刘德华",
  "analyzer": "pinyin_analyzer"
}

The result is exactly the seven tokens listed above:

{
  "tokens": [
    {
      "token": "l",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "de",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    }
  ]
}
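
The searches later in this post query a name.pinyin field, but the mapping that creates it is not shown. Below is a plausible sketch, written as a Python dict so it can be dumped to JSON and used as a mapping body. The field layout (a keyword field name with a text sub-field pinyin) is an assumption for illustration, not necessarily the mapping actually used here.

```python
import json

# Assumed mapping: "name" is kept as a keyword, with a "pinyin" sub-field
# analyzed by the pinyin_analyzer defined in the index settings.
mapping = {
    "properties": {
        "name": {
            "type": "keyword",
            "fields": {
                "pinyin": {"type": "text", "analyzer": "pinyin_analyzer"}
            },
        }
    }
}

body = json.dumps(mapping, ensure_ascii=False, indent=2)
print(body)
```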

Once the index is built, a search for liudh first tokenizes the query with the same analyzer:

GET /hjxtest_pinyin/_analyze
{
  "text": "liudh",
  "analyzer": "pinyin_analyzer"
}

Tokenization result:

{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "liudh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "d",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    }
  ]
}

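As you can see, our mighty tokenizer splits the query into liu + d + h + liudh.
Recall the terms in the inverted index we built: liu, de, hua, l, d, h, ldh.
At search time, liu, d, and h all hit our document, so of course it shows up in the results: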

GET /hjxtest_pinyin/_search
{
  "query": {"match": {
    "name.pinyin": "liudh"
  }}
}

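But then we noticed something interesting: searching for liudh also returns 黄渤 (Huang Bo). What on earth is going on?

My blind guess: when 黄渤 is analyzed, the result is
huang + bo + h + b + hb
and at search time it matches the h from liudh.

Let's verify. Analyzing 黄渤 gives: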

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "huang",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "hb",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "bo",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    }
  ]
}

Just as guessed.
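
The false hit can be reproduced with a toy model of match, which by default is satisfied when any single query term appears in the document. The token sets below are copied from the _analyze outputs in this post; match_any is a hypothetical helper, and relevance scoring is ignored entirely.

```python
# Index-time tokens, copied from the _analyze outputs in this post.
DOCS = {
    "刘德华": {"l", "liu", "ldh", "d", "de", "h", "hua"},
    "黄渤": {"h", "huang", "hb", "b", "bo"},
}
# Query-time tokens produced by analyzing "liudh".
QUERY_TOKENS = {"liu", "liudh", "d", "h"}


def match_any(query_tokens, docs):
    """Return every doc sharing at least one term with the query,
    together with the shared terms (OR semantics, no scoring)."""
    hits = {}
    for name, terms in docs.items():
        shared = query_tokens & terms
        if shared:
            hits[name] = shared
    return hits


hits = match_any(QUERY_TOKENS, DOCS)
for name, shared in hits.items():
    print(name, sorted(shared))
```

刘德华 shares liu, d, and h with the query, while 黄渤 shares only h, and one shared term is enough for it to be returned.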

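So what do we do? Precision this low is hard to accept.

Notice that the query example on GitHub actually uses match_phrase rather than match.

What's the difference? See the official documentation.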


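match_phrase requires not only that the query and the document share terms, but also that those terms appear in the same order, at matching relative positions.

In our example, when I search for liudh, the document's liu, d, and h must also appear in that order, so only 刘德华 matches: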

GET /hjxtest_pinyin/_search
{
  "query": {"match_phrase": {
    "name.pinyin": "liudh"
  }}
}

That improves precision.
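
The position check can be sketched in Python as well. This is a simplified model of phrase matching with no slop, not Lucene's actual implementation: tokens that share a position are treated as alternatives, and the document must supply one of them at each consecutive position. The (token, position) pairs are copied from the _analyze outputs in this post; group_by_position and phrase_match are hypothetical helpers.

```python
# (token, position) pairs copied from the _analyze outputs in this post.
DOCS = {
    "刘德华": [("l", 0), ("liu", 0), ("ldh", 0), ("d", 1), ("de", 1),
            ("h", 2), ("hua", 2)],
    "黄渤": [("h", 0), ("huang", 0), ("hb", 0), ("b", 1), ("bo", 1)],
}
QUERY = [("liu", 0), ("liudh", 0), ("d", 1), ("h", 2)]


def group_by_position(tokens):
    """Map each position to the set of terms found there."""
    groups = {}
    for term, pos in tokens:
        groups.setdefault(pos, set()).add(term)
    return groups


def phrase_match(query_tokens, doc_tokens):
    """True if some start position in the doc lines up with every query
    position, allowing any of the alternative terms at each position."""
    q = group_by_position(query_tokens)
    d = group_by_position(doc_tokens)
    base = min(q)
    for start in d:
        if all(q[p] & d.get(start + p - base, set()) for p in q):
            return True
    return False


results = {name: phrase_match(QUERY, tokens) for name, tokens in DOCS.items()}
print(results)
```

黄渤 does contain h, but at position 0, with no liu or liudh before it and no d in between, so the phrase check fails and only 刘德华 remains.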


Article link: https://blog.csdn.net/waltonhuang/article/details/106834903
