Elasticsearch Stop Token Filter Not Working

[Posted]: 2022-01-17 23:17:14
[Question]:

I created an index in Elasticsearch 7.10 that looks like this:

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

As you can see, I've configured a custom analyzer named my_analyzer that applies the stop token filter. Based on the documentation, I expected this filter to strip English stop words from every text-type property of my documents at index time.
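
For reference, when the stop filter is used without options it falls back to the built-in _english_ stop word list, so the bare "stop" entry above should be equivalent to spelling the filter out explicitly (a sketch; my_english_stop is just a hypothetical name):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stemmer",
            "my_english_stop"
          ]
        }
      }
    }
  }
}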

Indeed, if I send a POST request to http://localhost:30200/my_index/_analyze with this request body:

{
  "analyzer": "my_analyzer",
  "text": "If you are a horse, I do not want that cake"
}
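
For completeness, here is the same request as a curl command (a sketch, using the non-default port 30200 from my setup):

curl -X POST "http://localhost:30200/my_index/_analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "analyzer": "my_analyzer",
    "text": "If you are a horse, I do not want that cake"
  }'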

The response I receive indicates that the tokens "if", "a", "not", and "that" have been removed from the provided text:

{
    "tokens": [
        {
            "token": "you",
            "start_offset": 3,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "ar",
            "start_offset": 7,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "hors",
            "start_offset": 13,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "i",
            "start_offset": 20,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "do",
            "start_offset": 22,
            "end_offset": 24,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "want",
            "start_offset": 29,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "cake",
            "start_offset": 39,
            "end_offset": 43,
            "type": "<ALPHANUM>",
            "position": 10
        }
    ]
}

However, if I index a document whose description property contains the string "If you are a horse, I do not want that cake" and then query the index with a GET request to http://localhost:30200/my_index/_search using this request body:

{
  "query": {
    "multi_match" : {
      "query": "that", 
      "fields": ["description"]
    }
  }
}

the document is returned, even though the analyzer should have removed the word "that":

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "27ibobulhqhc7s96jbz6653ud",
                "_score": 0.2876821,
                "_source": {
                    "id": "27ibobulhqhc7s96jbz6653ud",
                    "title": "muscular yak",
                    "description": "If you are a horse, I do not want that cake"
                }
            }
        ]
    }
}

What gives? If the stop filter were stripping English stop words from the indexed text properties, I would expect a query for one of those stop words to return zero results. Do I have to explicitly tell Elasticsearch to use my_analyzer when indexing documents or when processing queries?

For what it's worth, the other filters I configured (lowercase and stemmer) appear to work as expected. Only stop is giving me trouble.


[Solution 1]:

You're almost there. You just need to map the description field to the custom analyzer you created, as shown below. This ensures that the content of the description field is analyzed with my_analyzer both at index time and at search time.

{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"          // note this
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
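
Note that the analyzer of an existing field cannot be changed in place, so this mapping has to be applied when the index is created; you will need to recreate the index (or reindex into a new one) and index your documents again. After that, you can confirm which analyzer the field actually uses by passing "field" instead of "analyzer" to _analyze (a sketch, reusing the host, port, and sample sentence from the question):

curl -X POST "http://localhost:30200/my_index/_analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "field": "description",
    "text": "If you are a horse, I do not want that cake"
  }'

With the mapping above, this should return the same stop-word-free tokens as the my_analyzer example in the question, and the multi_match query for "that" should now return zero hits. Without the mapping, the field falls back to the index default (the standard analyzer), which keeps "that" and explains the hit you were seeing.

Alternatively, if you would rather apply my_analyzer to every text field without mapping each one, you can register it under the reserved name default in the index settings (a sketch):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stemmer",
            "stop"
          ]
        }
      }
    }
  }
}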