Elasticsearch

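Overview

The Elastic Stack, also known as the ELK Stack, consists of Elasticsearch, Kibana, Beats, and Logstash. It can securely and reliably ingest data from any source in any format, then search, analyze, and visualize it in real time. Elasticsearch, or ES for short, is a search server built on Lucene: an open-source, highly scalable, distributed full-text search engine that forms the core of the Elastic Stack. It stores and retrieves data in near real time and scales out well, to hundreds of servers handling petabytes of data.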

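Web search engines such as Google and Baidu build an index from the keywords in web pages. When we type keywords into the search box, they look those keywords up in the index and return every page that matches. Searching application logs in a typical project is another common case. Relational databases do not handle searches over this kind of unstructured text well.

Full-text search in a traditional database tends to be an afterthought, partly because few people store large text fields in a database in the first place. A full-text query has to scan the whole table, and once the data set is large, tuning the SQL makes little difference. Adding an index helps, but it is costly to maintain, because the index has to be updated on every insert and update.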

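For these reasons, conventional search performs very poorly in production environments with the following characteristics:

The data being searched is a large volume of unstructured text
The number of records reaches hundreds of thousands, millions, or more
A large number of interactive text-based queries must be supported
Full-text queries need to be very flexible
Highly relevant search results are required, and no available relational database can deliver them
There is comparatively little need for handling different record types, operating on non-text data, or secure transaction processing

To solve the performance problems of searching both structured and unstructured data, we need a specialized, robust, and powerful full-text search engine.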

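The full-text search engines discussed here are the mainstream engines in wide use today. They work by having an indexing program scan every word in a document and build an index entry for each word, recording how many times and at which positions the word occurs. When a user runs a query, the search program looks the terms up in this prebuilt index and returns the results. The process resembles finding a character through the lookup table of a dictionary.

This article covers Elasticsearch 8.x; for the earlier 7.x releases, please consult other resources.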
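To make the indexing process described above more concrete, here is a minimal Java sketch of an inverted index. It is only an illustration: the class name InvertedIndexSketch, its methods, and the naive lowercase-and-split tokenizer are assumptions made for this example, not how Lucene or Elasticsearch actually implement analysis or storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: term -> (document id -> positions of the term in that document).
class InvertedIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    // A real analyzer does far more (stemming, stop words, language rules).
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // Look a term up directly in the index instead of scanning every document.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Elasticsearch is a search engine");
        index.add(2, "A search engine builds an index; search uses it");
        System.out.println(index.lookup("search")); // {1=[3], 2=[1, 6]}
    }
}
```

Looking up a term is a single map access, which is why a query does not need to scan every document the way a table scan does.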

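Index operations

Creating an index

In Elasticsearch, an index is a collection of documents. We can create a new index with a PUT request. For example, to create an index named "my_index":

PUT my_index

Note: index names must be lowercase and unique. It is easy to forget whether an index has already been created; in that case, use HEAD to check whether the index exists:

HEAD my_index

If it does not exist, the request returns 404 - Not Found.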
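The same existence check can be made from application code. The sketch below uses only the JDK HTTP client and is an assumption-laden illustration: the class name IndexExistsSketch, its method names, and the base URL http://localhost:9200 are made up for this example, and a secured 8.x cluster would additionally need HTTPS and credentials.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch: check whether an index exists via HEAD /<index>.
class IndexExistsSketch {

    // Build the HEAD request; baseUrl is an assumed address such as http://localhost:9200.
    static HttpRequest headRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // 200 means the index exists, 404 means it does not; anything else is treated as an error.
    static boolean existsFromStatus(int statusCode) {
        switch (statusCode) {
            case 200: return true;
            case 404: return false;
            default: throw new IllegalStateException("Unexpected status " + statusCode);
        }
    }

    // Sends the request; needs a reachable cluster, so it is not called from main.
    static boolean indexExists(HttpClient client, String baseUrl, String index) throws Exception {
        HttpResponse<Void> response =
                client.send(headRequest(baseUrl, index), HttpResponse.BodyHandlers.discarding());
        return existsFromStatus(response.statusCode());
    }

    public static void main(String[] args) {
        HttpRequest request = headRequest("http://localhost:9200", "my_index");
        System.out.println(request.method() + " " + request.uri());
    }
}
```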

Querying an index

Use a GET request to retrieve an index. For example, to retrieve the index named "my_index":

GET my_index

You can also list all indices with _cat:

GET _cat/indices

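Modifying an index

Elasticsearch does not let you change the core definition of an existing index, such as its name or the type of an already mapped field. If you need to change these, the only option is to create a new index.

Deleting an index

To delete an index, use DELETE followed by the index name:

DELETE my_index

If acknowledged is true in the response, the index was deleted.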

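Document operations

Creating a document

In Elasticsearch, documents are the data we search and analyze. We can add documents to an index with PUT or POST. With PUT you must supply a unique identifier after _doc/; with POST to my_index/_doc, Elasticsearch generates one for you. For example, to add a document with id 1001 to the "my_index" index: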

PUT my_index/_doc/1001
{
  "id": 1001,
  "title": "My Document",
  "content": "This is my document."
}

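Querying a document

Use a GET request to retrieve a document:

GET my_index/_doc/1001

If the document was created with POST, so that the id is long or unknown, you can instead search the documents in the index (a search returns the first 10 hits by default):

GET my_index/_search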

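Updating a document

Updating a document looks just like creating one: send the new content to the same id. Note that PUT replaces the whole document rather than merging individual fields.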

PUT my_index/_doc/1001
{
  "id": 1001,
  "title": "My Document",
  "content": "This is my update document."
}

Deleting a document

Use a DELETE request to delete a document. For example, to delete the document with id 1 from the "my_index" index:

DELETE my_index/_doc/1

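Searching documents

Elasticsearch offers several ways to search documents. For example, a match query on a single field (use match_all instead to return every document):

GET my_index/_search
{
  "query": {
    "match": {
      "name": "张三"
    }
  }
}

Note: match is an analyzed query. The query string is split into tokens, and by default a document matches if the field contains any one of them. For exact matching use term, which does not analyze the query string. Because text fields are themselves analyzed at index time, term is best suited to keyword fields.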

GET my_index/_search
{
  "query": {
    "term": {
      "content": "apple"
    }
  }
}
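The difference between match and term can be illustrated without a cluster. The following Java sketch is a simplified model, not Elasticsearch code: the class MatchVsTermSketch and its analyze method are hypothetical, and the "analyzer" merely lowercases and splits on non-alphanumeric characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical model of how match and term treat the query string.
class MatchVsTermSketch {

    // Stand-in for an analyzer: lowercase, then split on non-alphanumeric characters.
    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // match: analyze the query as well, then accept the document if any query token is indexed.
    static boolean match(String fieldValue, String query) {
        List<String> indexed = analyze(fieldValue);
        return analyze(query).stream().anyMatch(indexed::contains);
    }

    // term: compare the query string as-is against the indexed tokens.
    static boolean term(String fieldValue, String query) {
        return analyze(fieldValue).contains(query);
    }

    public static void main(String[] args) {
        String content = "This is my document.";
        System.out.println(match(content, "My photo")); // true: the token "my" is indexed
        System.out.println(term(content, "My"));        // false: the indexed token is "my"
        System.out.println(term(content, "my"));        // true
    }
}
```

This is why a term query against an analyzed text field can unexpectedly return nothing: the query string is compared verbatim with tokens that were already transformed at index time.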

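When querying Elasticsearch, you can combine multiple conditions with a bool query, and use sort together with from / size to sort and paginate the results. The sample query below has two conditions:

match query: matches documents whose title field contains "apple"
range query: filters on the timestamp field, keeping only documents from the start of yesterday through the end of today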

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "apple"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-1d/d",
              "lte": "now/d"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "from": 0,
  "size": 10
}
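The from value is usually derived from a page number. The Java sketch below shows one way to do that; the class PagingSketch and the 1-based page convention are assumptions for this example. The 10,000 limit reflects the default value of the index.max_result_window setting, beyond which from / size requests are rejected.

```java
// Hypothetical helper that converts a 1-based page number into the "from" offset.
class PagingSketch {
    // Default of index.max_result_window; from + size may not exceed it.
    static final int MAX_RESULT_WINDOW = 10_000;

    static int from(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be positive");
        }
        long from = (long) (page - 1) * size;
        if (from + size > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds " + MAX_RESULT_WINDOW);
        }
        return (int) from;
    }

    public static void main(String[] args) {
        System.out.println(from(1, 10)); // 0
        System.out.println(from(3, 10)); // 20
    }
}
```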

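Aggregations

In Elasticsearch, aggregations are a powerful feature for grouping, counting, and computing statistics over documents according to given conditions. Some common examples:

Terms Aggregation: groups documents by field value and counts the documents in each group. For example, to collect the distinct values of the "category" field and the number of documents for each: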

GET my_index/_search
{
  "aggs": {
    "category_count": {
      "terms": {
        "field": "category"
      }
    }
  }
}
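Conceptually, a terms aggregation is a group-by-and-count. The Java sketch below mimics that over an in-memory list; the class TermsAggSketch is hypothetical and ignores details of the real aggregation, which by default returns only the top 10 buckets ordered by document count.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory equivalent of counting documents per "category" value.
class TermsAggSketch {

    static Map<String, Long> countBy(List<String> categories) {
        return categories.stream()
                .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countBy(List.of("fruit", "phone", "fruit"))); // {fruit=2, phone=1}
    }
}
```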

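Range Aggregation: groups documents into ranges of a field and counts the documents in each range. For example, to group documents into price brackets by the "price" field and count the documents in each bracket: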

GET my_index/_search
{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 200 },
          { "from": 200 }
        ]
      }
    }
  }
}
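One detail worth spelling out is the boundary behavior: each range includes its from value and excludes its to value, so a price of exactly 100 falls into the second bucket. The Java sketch below reproduces that bucketing in memory. The class RangeAggSketch is hypothetical, and the bucket key strings are modeled on the keys Elasticsearch generates by default, which is an assumption you should check against a real response.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory version of the three price ranges used above.
class RangeAggSketch {

    // "from" is inclusive, "to" is exclusive.
    static String bucketFor(double price) {
        if (price < 100) {
            return "*-100.0";
        }
        if (price < 200) {
            return "100.0-200.0";
        }
        return "200.0-*";
    }

    // Count how many prices fall into each bucket, keeping empty buckets at zero.
    static Map<String, Long> count(double... prices) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        buckets.put("*-100.0", 0L);
        buckets.put("100.0-200.0", 0L);
        buckets.put("200.0-*", 0L);
        for (double price : prices) {
            buckets.merge(bucketFor(price), 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(count(50, 100, 150, 200, 999)); // {*-100.0=1, 100.0-200.0=2, 200.0-*=2}
    }
}
```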

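Date Histogram Aggregation: groups documents by a date field and computes the document count or other metrics for each time interval. For example, to group documents by day on the "timestamp" field and count the documents per day: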

GET my_index/_search
{
  "aggs": {
    "daily_count": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      }
    }
  }
}
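The daily grouping can be pictured as truncating each timestamp to its calendar day and counting. The Java sketch below does this in memory, assuming UTC day boundaries; the class DateHistogramSketch is hypothetical and leaves out everything else the real aggregation offers, such as time zones and sub-aggregations.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory version of a per-day document count.
class DateHistogramSketch {

    // Truncate each timestamp to its UTC calendar day, then count per day.
    static Map<LocalDate, Long> dailyCount(List<Instant> timestamps) {
        return timestamps.stream().collect(Collectors.groupingBy(
                ts -> ts.atOffset(ZoneOffset.UTC).toLocalDate(),
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(dailyCount(List.of(
                Instant.parse("2024-01-01T23:59:59Z"),
                Instant.parse("2024-01-02T00:00:00Z"),
                Instant.parse("2024-01-02T12:00:00Z")))); // {2024-01-01=1, 2024-01-02=2}
    }
}
```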

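Index templates

Index templates are an important tool for creating and managing indices in Elasticsearch. A template automatically applies settings, mappings, and other configuration when a new index is created, which makes index management easier and more consistent. The following template automatically configures indices whose names start with my-index-: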

PUT _index_template/my_template
{
  "index_patterns": [
    "my-index-*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "properties": {
        "message": {
          "type": "text"
        }
      }
    }
  }
}

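In this example we create an index template named my_template. index_patterns defines the index name patterns the template applies to; my-index-* matches every index whose name starts with my-index-. settings holds the settings for those indices, for example number_of_shards sets the number of primary shards to 1. mappings defines the data type and other attributes of each field, for example the message field has type text. When a new index matching the pattern is created, these settings and mappings are applied to it automatically.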
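To see which index names a pattern such as my-index-* would cover, the wildcard can be modeled as "any run of characters". The Java sketch below is a simplified stand-in for that matching; the class IndexPatternSketch is hypothetical and is not how Elasticsearch resolves templates internally.

```java
import java.util.regex.Pattern;

// Hypothetical wildcard matcher: '*' matches any run of characters, everything else is literal.
class IndexPatternSketch {

    static boolean matches(String pattern, String indexName) {
        String[] literals = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.matches(regex.toString(), indexName);
    }

    public static void main(String[] args) {
        System.out.println(matches("my-index-*", "my-index-2024")); // true
        System.out.println(matches("my-index-*", "my-index"));      // false: the trailing '-' is missing
    }
}
```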

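Chinese word segmentation

For Chinese text in Elasticsearch, the ik analyzer is a common recommendation. It is an open-source Chinese analyzer that is integrated into Elasticsearch through the elasticsearch-analysis-ik plugin. Using it involves the following steps:

Install the ik analyzer plugin. Run the following command from the bin directory of the Elasticsearch installation:

./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.2/elasticsearch-analysis-ik-7.10.2.zip

The version number in the command must match your Elasticsearch version; adjust it for your environment.

Use the ik analyzer when creating an index. You can specify the ik analyzer at index creation time. For example: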

PUT /my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "ik_max_word": {
            "tokenizer": "ik_max_word"
          },
          "ik_smart": {
            "tokenizer": "ik_smart"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

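The example above creates an index named my_index that uses the ik analyzers. analyzer is set to ik_max_word, which splits text at the finest granularity, while search_analyzer is set to ik_smart, which splits at the coarsest granularity. Choose whichever suits your needs.

Use the ik analyzer at query time. You can also specify the ik analyzer in the query itself. For example: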

POST /my_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "我要买手机",
        "analyzer": "ik_smart"
      }
    }
  }
}

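Here the content field is matched using the ik_smart analyzer. Overall, the ik analyzer handles Chinese word segmentation well and improves the accuracy of indexing and retrieval.

Document scoring

Elasticsearch ranks query results with a built-in relevance algorithm that produces a relative score for each document, indicating how well it matches the query. This score helps you understand and order the results. Scoring rests on two core concepts:

Inverse Document Frequency (IDF): how rare a term is across the index. A term that appears in many documents gets a low IDF because it is not very distinctive; a term that appears in few documents gets a high IDF because it is more likely to be a key term for the query
Term Frequency (TF): how often a term occurs in a particular document. The more often it occurs, the higher its TF. TF reflects how important the term is within that document

Combined, the two give the TF-IDF formula for the relevance between a document and a query. By default Elasticsearch computes relevance with the BM25 algorithm (BM stands for Best Matching; 25 is the number of that iteration of the scheme), which takes term frequency, inverse document frequency, and document length into account.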
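To show how term frequency, inverse document frequency, and document length interact, here is a Java sketch of the textbook BM25 formula for a single term. It is an illustration under assumptions: the class Bm25Sketch and its parameter names are made up, k1 = 1.2 and b = 0.75 are the commonly cited defaults, and the exact constants and normalization used inside Lucene and Elasticsearch may differ, so the numbers will not necessarily equal a real _score.

```java
// Hypothetical single-term BM25 calculation (textbook form).
class Bm25Sketch {
    static final double K1 = 1.2;  // term frequency saturation
    static final double B = 0.75;  // strength of document length normalization

    // Rare terms (small docFreq relative to docCount) get a larger IDF.
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf: occurrences of the term in the document; docLen and avgDocLen are measured in terms.
    static double score(double tf, long docCount, long docFreq, double docLen, double avgDocLen) {
        double lengthNorm = K1 * (1 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (K1 + 1) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        // Same term and index statistics; only frequency or document length changes.
        System.out.println(score(1, 100, 10, 10, 10));
        System.out.println(score(3, 100, 10, 10, 10)); // higher: the term occurs more often
        System.out.println(score(1, 100, 10, 40, 10)); // lower: the document is much longer
    }
}
```

Unlike plain TF-IDF, the contribution of term frequency saturates: repeating a term many times raises the score less and less, and long documents are penalized relative to the average length.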

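Using this scoring mechanism, Elasticsearch assigns each matching document a score based on the query and the relevance algorithm, and sorts the results by score (descending by default). The documents in the result are therefore ordered from most to least relevant, which makes it easier for users to find the best matches. Here is a simple example:

Suppose we have an index of product documents, each with a title field and a description field. We want to search for products containing the keyword "手机" (mobile phone) and sort them by relevance to the query.

Create the index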

PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text"
      }
    }
  }
}

Insert the documents

POST /products/_doc/1
{
  "title": "iPhone 12",
  "description": "全新款 iPhone 12，具有出色的摄影功能和强大的性能。"
}

POST /products/_doc/2
{
  "title": "小米 11",
  "description": "小米 11 是一款性能强劲，价格亲民的智能手机。"
}

POST /products/_doc/3
{
  "title": "华为 P40",
  "description": "华为 P40 系列拥有卓越的摄影体验和高性能处理器。"
}

Run the search

POST /products/_search
{
  "query": {
    "match": {
      "description": "手机"
    }
  }
}

The result looks like this:

"hits": [
  {
    "_index": "products",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.548771,
    "_source": {
      "title": "小米 11",
      "description": "小米 11 是一款性能强劲，价格亲民的智能手机。"
    }
  },
  {
    "_index": "products",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.422087,
    "_source": {
      "title": "iPhone 12",
      "description": "全新款 iPhone 12，具有出色的摄影功能和强大的性能。"
    }
  },
  {
    "_index": "products",
    "_type": "_doc",
    "_id": "3",
    "_score": 0.123301,
    "_source": {
      "title": "华为 P40",
      "description": "华为 P40 系列拥有卓越的摄影体验和高性能处理器。"
    }
  }
]

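In this result, documents with a higher relevance score rank first: "小米 11" scores highest, followed by "iPhone 12" and then "华为 P40". Treat the listing as illustrative. With the mappings above, only document 2 actually contains "手机", so a real cluster would return just that hit, and 8.x responses no longer include the _type field.

If you want to change the order of results without changing the data, you can use boost to weight specific parts of a query, raising or lowering their contribution to the relevance score. boost can be applied to a whole query clause or to the clause for a specific field or term. For example, if the "title" field matters more to the search results, you could write:

POST /products/_search
{
  "query": {
    "match": {
      "title": {
        "query": "宠物",
        "boost": 2
      }
    }
  }
}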

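Java API

With Elasticsearch 8.x the concept of mapping types was removed. To accommodate this change in the data model, Elastic has recommended the new Elasticsearch Java Client since version 7.15. According to the official documentation, it requires the following: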

Java 8 or later.
A JSON object mapping library to allow seamless integration of your application classes with the Elasticsearch API. The examples below show usage with Jackson.

<dependency>
  <groupId>co.elastic.clients</groupId>
  <artifactId>elasticsearch-java</artifactId>
  <version>8.11.3</version>
</dependency>

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.12.3</version>
</dependency>

After setting up the dependencies, your application may fail with ClassNotFoundException: jakarta.json.spi.JsonProvider. If that happens, explicitly add the jakarta.json:jakarta.json-api:2.0.1 dependency:

<dependency>
  <groupId>jakarta.json</groupId>
  <artifactId>jakarta.json-api</artifactId>
  <version>2.0.1</version>
</dependency>

Connecting

You can connect to the Elastic Cloud using an API key and the Elasticsearch endpoint.

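import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.Header;
import org.apache.http.HttpHost;
import org.apache.http.message.BasicHeader;
import org.elasticsearch.client.RestClient;

// URL and API key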
String serverUrl = "https://localhost:9200";
String apiKey = "VnVhQ2ZHY0JDZGJrU...";

// Create the low-level client
RestClient restClient = RestClient
    .builder(HttpHost.create(serverUrl))
    .setDefaultHeaders(new Header[]{
        new BasicHeader("Authorization", "ApiKey " + apiKey)
    })
    .build();

// Create the transport with a Jackson mapper
ElasticsearchTransport transport = new RestClientTransport(
    restClient, new JacksonJsonpMapper());

// And create the API client
ElasticsearchClient esClient = new ElasticsearchClient(transport);

Creating an index

This is how you create the products index:

esClient.indices().create(c -> c
    .index("products")
);

Indexing documents

This is a simple way of indexing a document, here a Product application object:

Product product = new Product("bk-1", "City bike", 123.0);

IndexResponse response = esClient.index(i -> i
    .index("products")
    .id(product.getSku())
    .document(product)
);

logger.info("Indexed with version " + response.version());
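The client examples assume a Product application class that is not shown. A minimal version might look like the sketch below. The field names sku, name, and price are inferred from the constructor call and the getSku() / getName() getters used in the examples; everything else, including the setters, is an assumption for illustration.

```java
// Hypothetical application class assumed by the client examples.
// In a real project it would normally be a public class in its own file.
class Product {
    private String sku;
    private String name;
    private double price;

    // No-argument constructor so that a JSON mapper such as Jackson can instantiate the class.
    public Product() {
    }

    public Product(String sku, String name, double price) {
        this.sku = sku;
        this.name = name;
        this.price = price;
    }

    public String getSku() { return sku; }
    public void setSku(String sku) { this.sku = sku; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public double getPrice() { return price; }
    public void setPrice(double price) { this.price = price; }

    public static void main(String[] args) {
        Product product = new Product("bk-1", "City bike", 123.0);
        System.out.println(product.getSku() + " " + product.getName() + " " + product.getPrice());
    }
}
```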

Getting documents

You can get documents by using the following code:

GetResponse<Product> response = esClient.get(g -> g
    .index("products") 
    .id("bk-1"),
    Product.class      
);

if (response.found()) {
    Product product = response.source();
    logger.info("Product name " + product.getName());
} else {
    logger.info ("Product not found");
}

Searching documents

This is how you can create a single match query with the Java client:

String searchText = "bike";

SearchResponse<Product> response = esClient.search(s -> s
        .index("products")
        .query(q -> q
            .match(t -> t
                .field("name")
                .query(searchText)
            )
        ),
    Product.class
);

Updating documents

This is how you can update a document, for example to add a new field:

Product product = new Product("bk-1", "City bike", 123.0);

esClient.update(u -> u
        .index("products")
        .id("bk-1")
        .upsert(product),
    Product.class
);

Deleting documents

esClient.delete(d -> d.index("products").id("bk-1"));

Deleting an index

// Delete the whole "products" index
esClient.indices().delete(d -> d
    .index("products")
);

Bulk

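A BulkRequest contains a collection of operations, each of which is a type with several variants. To create this request, it is convenient to use a builder object for the main request and the fluent DSL for each operation. The example below shows how to index a list of application objects: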

List<Product> products = fetchProducts();

BulkRequest.Builder br = new BulkRequest.Builder();

for (Product product : products) {
    br.operations(op -> op           
        .index(idx -> idx            
            .index("products")       
            .id(product.getSku())
            .document(product)
        )
    );
}

BulkResponse result = esClient.bulk(br.build());

// Log errors, if any
if (result.errors()) {
    logger.error("Bulk had errors");
    for (BulkResponseItem item: result.items()) {
        if (item.error() != null) {
            logger.error(item.error().reason());
        }
    }
}

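Adds an operation (remember that list properties are additive). op is a builder for BulkOperation, which is a variant type. This type has index, create, update, and delete variants
Selects the index operation variant; idx is a builder for IndexOperation
Sets the properties of the index operation, similar to single-document indexing: index name, identifier, and document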

Learn more about the API conventions of the Java client.

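EQL

EQL (Event Query Language) is a query language for event-based time series data such as logs, metrics, and traces. Compared with the regular Query DSL, it can express relationships between events, for example sequences of events. A basic query has the form "event category where condition", and an EQL search expects each document to carry a timestamp field and an event category field (@timestamp and event.category by default). The examples below use any to match events of every category: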

Matching a field value

GET /my_index/_eql/search
{
  "query": "any where source.ip == \"192.168.0.1\""
}

This query searches the "my_index" index for all events whose "source.ip" field has the value "192.168.0.1".

Range query

GET /my_index/_eql/search
{
  "query": "any where response_time > 100 and response_time < 200"
}

This query searches the "my_index" index for all events whose "response_time" value is greater than 100 and less than 200.

Boolean query

GET /my_index/_eql/search
{
  "query": "any where user.role == \"admin\" and (status == \"active\" or status == \"pending\")"
}

This query searches the "my_index" index for events that meet both conditions: "user.role" is "admin", and "status" is either "active" or "pending".

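SQL

Elasticsearch SQL is an X-Pack feature for running SQL-style queries against Elasticsearch. It lets you use familiar SQL syntax to query the data stored in Elasticsearch. Here are some examples; first, let's create some data: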

PUT /library/_bulk?refresh
{"index":{"_id": "Leviathan Wakes"}}
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"index":{"_id": "Hyperion"}}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"index":{"_id": "Dune"}}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}

You can then run SQL through the SQL search API:

POST /_sql?format=txt
{
  "query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}

The result looks like this:

    author     |     name      |  page_count   |      release_date
---------------+---------------+---------------+------------------------
Dan Simmons    |Hyperion       |482            |1989-05-26T00:00:00.000Z
Frank Herbert  |Dune           |604            |1965-06-01T00:00:00.000Z

You can also use the SQL CLI. There is a script in x-pack's bin directory that starts it: