Search returns empty results on version 6.5.4 #167

Open
lht1221 opened this issue Aug 23, 2019 · 4 comments
lht1221 commented Aug 23, 2019

Documents are indexed with index_ansj and searched with query_ansj. The abbreviated mapping configuration is:

"analysis": {
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": [
                "my_char_filter"
            ],
            "tokenizer": "index_ansj"
        }
    },
    "char_filter": {
        "my_char_filter": {
            "type": "html_strip"
        }
    }
},
"title": {
    "search_analyzer": "query_ansj",
    "fielddata": true,
    "analyzer": "my_analyzer",
    "type": "text"
},

The following query returns no results (the search term is wrapped in quotes):

{
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_field": "baseInfo.title",
                        "query": "\"全新奥迪\""
                    }
                }
            ],
            "must_not": [ ],
            "should": [ ]
        }
    },
    "from": 0,
    "size": 10,
    "sort": [ ],
    "aggs": { }
}
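Inside a JSON string, the inner quotes of a phrase query have to be escaped as \". A minimal Python sketch that builds the same request body and lets json.dumps produce the escapes; the helper name build_phrase_query is hypothetical and not part of any library:

```python
import json


def build_phrase_query(field: str, phrase: str, size: int = 10) -> dict:
    """Build a query_string search body that matches `phrase` as an exact phrase."""
    # Wrapping the term in double quotes makes query_string treat it as a phrase.
    # Any double quote already inside the phrase is backslash-escaped first.
    quoted = '"' + phrase.replace('"', '\\"') + '"'
    return {
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"default_field": field, "query": quoted}}
                ]
            }
        },
        "from": 0,
        "size": size,
    }


body = build_phrase_query("baseInfo.title", "全新奥迪")
# ensure_ascii=False keeps the Chinese characters readable in the output.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Serializing the body this way avoids writing the backslashes by hand, which is where a quoted term is easy to get wrong.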

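Without the quotes the search works, but it returns a very large number of results.
Not every quoted term comes back empty, though: terms such as "谍照曝光" return results normally.
Looking at the default dictionary, it seems that a word whose only part of speech is n cannot be searched together with any characters added before or after it.
For example, "谍照表" has part of speech n, and the original text of the article is "本田Urban EV谍照表示其车型由概念车的三门版". Searching for "谍照表" returns results, but "谍照表示" returns nothing. Searching for the two terms "谍照表" + "示", or "谍照表" + "表示", together does find the article.
The same approach does return results on version 5.5.0, but that setup defines no separate search analyzer and uses dic_ansj for everything. Its mapping is:

"analysis": {
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": [
                "my_char_filter"
            ],
            "tokenizer": "dic_ansj"
        }
    },
    "char_filter": {
        "my_char_filter": {
            "type": "html_strip"
        }
    }
},
"title": {
    "fielddata": true,
    "analyzer": "my_analyzer",
    "type": "text"
}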

I'm not great at explaining things, so thank you for your patience and help.

shi-yuan (Member) commented

Tokens for "本田Urban EV谍照表示其车型由概念车的三门版":

{
    "tokens": [
       ...,
        {
            "token": "",
            "start_offset": 10,
            "end_offset": 11,
            "type": "null",
            "position": 4
        },
        {
            "token": "",
            "start_offset": 11,
            "end_offset": 12,
            "type": "v",
            "position": 5
        },
        {
            "token": "表示",
            "start_offset": 12,
            "end_offset": 14,
            "type": "v",
            "position": 6
        },
        ...
    ]
}

Tokens for "谍照表示":

{
    "tokens": [
        {
            "token": "",
            "start_offset": 0,
            "end_offset": 1,
            "type": "null",
            "position": 0
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "v",
            "position": 1
        },
        {
            "token": "表示",
            "start_offset": 2,
            "end_offset": 4,
            "type": "v",
            "position": 2
        }
    ]
}

shi-yuan (Member) commented

Searching for "谍照表示" does work:

{
  "query": {
    "query_string": {
      "query": "\"谍照表示\"",
      "default_field": "title"
    }
  }
}

shi-yuan (Member) commented Aug 25, 2019

For this, I suggest comparing the analyzer tokens with the _termvectors stored in the index.
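As a rough sketch of that comparison, the two helpers below only build the request path and body for Elasticsearch's _analyze and _termvectors APIs; sending them is left to whatever HTTP client is in use. The index name my_index, the type _doc, and the document id 1 are placeholders, and the helper names are hypothetical. The termvectors path follows the 6.x layout that still includes a mapping type.

```python
from typing import Optional, Tuple


def analyze_request(index: str, text: str,
                    analyzer: Optional[str] = None,
                    field: Optional[str] = None) -> Tuple[str, dict]:
    """Return (path, body) for the _analyze API."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer  # e.g. the search-side analyzer name
    if field is not None:
        body["field"] = field  # use the analyzer mapped to this field
    return f"/{index}/_analyze", body


def termvectors_request(index: str, doc_type: str, doc_id: str,
                        field: str) -> Tuple[str, dict]:
    """Return (path, body) for the _termvectors API of one stored document."""
    path = f"/{index}/{doc_type}/{doc_id}/_termvectors"
    # Ask for positions and offsets so they can be compared with _analyze output.
    body = {"fields": [field], "positions": True, "offsets": True}
    return path, body


# Query-side tokens: what the search analyzer produces for the phrase.
print(analyze_request("my_index", "谍照表示", analyzer="query_ansj"))
# Index-side tokens: what was actually stored for the document's title field.
print(termvectors_request("my_index", "_doc", "1", "title"))
```

Putting the two responses side by side shows whether the positions of the query tokens line up with the positions stored for the document.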

lht1221 (Author) commented Apr 8, 2020

@shi-yuan
Sorry to bother you again after such a long time. I believe I have pinned down why the search returns no data.
For example, the sentence above is actually tokenized as:
{ "token": "谍", "start_offset": 10, "end_offset": 11, "type": "ng", "position": 6 } ,
{ "token": "照表", "start_offset": 11, "end_offset": 13, "type": "n", "position": 7 } ,
{ "token": "照", "start_offset": 11, "end_offset": 12, "type": "v", "position": 8 } ,
{ "token": "表示", "start_offset": 12, "end_offset": 14, "type": "v", "position": 9 } ,
{ "token": "表", "start_offset": 12, "end_offset": 13, "type": "n", "position": 10 } ,
{ "token": "示", "start_offset": 13, "end_offset": 14, "type": "vg", "position": 11 }

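"谍照表" corresponds to "position": 6 + "position": 7, which are consecutive, so it returns results.
"谍照表示" corresponds to "position": 6 + "position": 7 + "position": 11 (or some other combination), but positions 8-10 are skipped in between, so the search finds no match.
I think the cause is that a phrase is found when the positions of its tokens are consecutive, and the result is empty when they are not.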
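The hypothesis can be checked with a small simulation. The sketch below is a simplified model of exact phrase matching (no slop), not Lucene's actual implementation. The index-side positions are the ones listed above; the way each query is split into terms is an assumption, since the query-side output with positions is not shown for every case.

```python
# Index-side tokens and their positions, as listed in the comment above.
INDEX_TOKENS = [
    ("谍", 6), ("照表", 7), ("照", 8), ("表示", 9), ("表", 10), ("示", 11),
]


def phrase_matches(index_tokens, query_terms):
    """Return True if query_terms occur at strictly consecutive positions."""
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    if not query_terms:
        return False
    # Try every occurrence of the first term as the start of the phrase.
    for start in positions.get(query_terms[0], set()):
        if all(start + i in positions.get(term, set())
               for i, term in enumerate(query_terms)):
            return True
    return False


# Assumed query-side splits (hypothetical):
print(phrase_matches(INDEX_TOKENS, ["谍", "照表"]))
print(phrase_matches(INDEX_TOKENS, ["谍", "照表", "示"]))
print(phrase_matches(INDEX_TOKENS, ["谍", "照", "表示"]))
```

Under these assumptions the first call succeeds and the other two fail, which agrees with the behavior reported above: every overlapping token consumes its own position, so terms that are adjacent in the text end up several positions apart.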

Another example is the phrase 安达保险金融险部相关人士介绍.
I search with "安达保险" in quotes so that the term is not split up.
The tokens do contain 安达保险 ("position": 4774 + "position": 4779), but because the positions are not consecutive, "安达保险" cannot be found. Searching for "安达保" in the same quoted way ("position": 4775 + "position": 4776) does return results.
{ "token": "安达", "start_offset": 3410, "end_offset": 3412, "type": "nz", "position": 4774 } ,
{ "token": "安", "start_offset": 3410, "end_offset": 3411, "type": "ag", "position": 4775 } ,
{ "token": "达保", "start_offset": 3411, "end_offset": 3413, "type": "nr", "position": 4776 } ,
{ "token": "达", "start_offset": 3411, "end_offset": 3412, "type": "v", "position": 4777 } ,
{ "token": "保险金", "start_offset": 3412, "end_offset": 3415, "type": "n", "position": 4778 } ,
{ "token": "保险", "start_offset": 3412, "end_offset": 3414, "type": "n", "position": 4779 } ,
{ "token": "保", "start_offset": 3412, "end_offset": 3413, "type": "v", "position": 4780 } ,
{ "token": "险", "start_offset": 3413, "end_offset": 3414, "type": "ng", "position": 4781 } ,
{ "token": "金融", "start_offset": 3414, "end_offset": 3416, "type": "n", "position": 4782 } ,
{ "token": "金", "start_offset": 3414, "end_offset": 3415, "type": "b", "position": 4783 } ,
{ "token": "融", "start_offset": 3415, "end_offset": 3416, "type": "vi", "position": 4784 } ,
{
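To see which adjacent token pairs in this output can form a phrase, one option is to compare offsets with positions: two tokens are adjacent in the text when one ends where the other starts, and they are usable in an exact phrase only if their positions differ by 1. A minimal sketch over the tokens quoted above (the function name is hypothetical):

```python
# (token, start_offset, end_offset, position), copied from the output above.
TOKENS = [
    ("安达", 3410, 3412, 4774), ("安", 3410, 3411, 4775),
    ("达保", 3411, 3413, 4776), ("达", 3411, 3412, 4777),
    ("保险金", 3412, 3415, 4778), ("保险", 3412, 3414, 4779),
    ("保", 3412, 3413, 4780), ("险", 3413, 3414, 4781),
    ("金融", 3414, 3416, 4782), ("金", 3414, 3415, 4783),
    ("融", 3415, 3416, 4784),
]


def adjacent_pair_gaps(tokens):
    """Map (left, right) to the position gap for tokens adjacent in the text."""
    gaps = {}
    for left, _, left_end, left_pos in tokens:
        for right, right_start, _, right_pos in tokens:
            # Adjacent in the text: right starts exactly where left ends.
            if right_start == left_end:
                gaps[(left, right)] = right_pos - left_pos
    return gaps


for (left, right), gap in adjacent_pair_gaps(TOKENS).items():
    status = "consecutive" if gap == 1 else f"gap of {gap}"
    print(f"{left}+{right}: {status}")
```

With the positions quoted above, 安达+保险 has a gap of 5 while 安+达保 is consecutive, which matches the search results described in this comment.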
