Elasticsearch

Elasticsearch is a distributed indexing and document database with search, aggregation and sharding functionality. It's written in Java and builds on Lucene.

Concepts

Documents are the unit of data storage and retrieval, expressed in JSON.
Document Types define the properties of Documents, giving them a schema.
Indices represent collections of related documents.
Index Templates allow reusing index schemas, providing a base from which Indices can be created.
Clusters are deployments of Elasticsearch, comprised of one or more Nodes and a set of Indices.
Nodes store, index, and search the Indices.
Sharding distributes storage and retrieval operations across Cluster Nodes; shards can also be replicated to protect against Node failure.

Installation

Configuration is stored in /etc/elasticsearch/elasticsearch.yml.

Since the Elasticsearch mmapfs storage type makes extensive use of memory mapping, it's usually necessary to raise the vm.max_map_count sysctl. The Debian and RPM packages supplied by Elastic may raise this on installation.
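A quick way to see whether the limit needs raising is to compare the live value with the required minimum. The sketch below assumes a Linux host and assumes 262144 as the target, the figure Elastic's documentation recommends; the function names are illustrative only.

```python
from pathlib import Path

# Assumption: 262144 is the minimum recommended in Elastic's documentation.
REQUIRED_MAX_MAP_COUNT = 262144
# Linux exposes the sysctl through procfs; other platforms will not have this file.
SYSCTL_PATH = Path("/proc/sys/vm/max_map_count")


def needs_raise(current: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True when the current limit is below the required one."""
    return current < required


def read_max_map_count(path: Path = SYSCTL_PATH) -> int:
    """Read the live vm.max_map_count value from procfs."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    if SYSCTL_PATH.exists():
        current = read_max_map_count()
        print(current, "raise needed" if needs_raise(current) else "ok")
```

If the check reports that a raise is needed, sysctl -w vm.max_map_count=262144 applies it until the next reboot.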

Management

CAT API

The CAT API provides (barely) human-readable output in a pinch. List all actions by querying /_cat, get help with the help query string parameter, verbose output (a header row) with v, and limit the columns shown with h and a comma-delimited list of column names.
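The parameters above combine into a single query string. A minimal sketch of a URL builder follows; the base URL and the cat_url helper are assumptions for illustration, and the column names in the usage line are the ones the indices action is expected to offer.

```python
from urllib.parse import quote


def cat_url(base_url, action="", columns=None, verbose=False, show_help=False):
    """Build a CAT API URL; an empty action lists the available actions."""
    path = "/_cat" + (f"/{action}" if action else "")
    parts = []
    if show_help:
        parts.append("help")  # describe the columns an action offers
    if verbose:
        parts.append("v")  # include a header row
    if columns:
        # h takes a comma-delimited list of column names
        parts.append("h=" + quote(",".join(columns), safe=","))
    query = "?" + "&".join(parts) if parts else ""
    return base_url.rstrip("/") + path + query


if __name__ == "__main__":
    print(cat_url("http://localhost:9200", "indices",
                  columns=["index", "docs.count"], verbose=True))
```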

Sharding and replication

Each index has its own sharding and replication configuration. In versions before 7.0, each index defaults to five primary shards with one replica of each (two copies of every shard overall); from 7.0 onwards the default is a single primary shard.
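The per-index configuration is supplied as a settings object when the index is created. The sketch below builds that body and works out how many shard copies the cluster ends up holding; number_of_shards and number_of_replicas are the real setting names, while the helper functions are hypothetical.

```python
import json


def index_settings(shards=5, replicas=1):
    """Body for an index creation request with explicit sharding."""
    return {
        "settings": {
            "number_of_shards": shards,      # primary shards
            "number_of_replicas": replicas,  # replica copies per primary
        }
    }


def total_shard_copies(shards, replicas):
    """Each primary plus its replicas: shards * (1 + replicas)."""
    return shards * (1 + replicas)


if __name__ == "__main__":
    print(json.dumps(index_settings(), indent=2))
    print(total_shard_copies(5, 1))
```

With five shards and one replica the cluster holds ten shard copies in total, so a single-node cluster cannot place the replicas anywhere.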

REST API

Index management

Create by issuing a PUT for the named index:

PUT /:indexName

Delete with a DELETE to the index location:

DELETE /:indexName
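As a rough sketch, both calls can be prepared with Python's standard library. The base URL is an assumption (a local node on the default port) and the function names are made up for this example; nothing is sent until the request is passed to urllib.request.urlopen.

```python
from urllib.request import Request

# Assumption: a node listening locally on the default HTTP port.
BASE_URL = "http://localhost:9200"


def create_index_request(index_name, base_url=BASE_URL):
    """Prepare (but do not send) PUT /:indexName."""
    return Request(f"{base_url}/{index_name}", method="PUT")


def delete_index_request(index_name, base_url=BASE_URL):
    """Prepare (but do not send) DELETE /:indexName."""
    return Request(f"{base_url}/{index_name}", method="DELETE")


if __name__ == "__main__":
    req = create_index_request("logs")
    print(req.get_method(), req.full_url)
```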

Document management

Documents are nested below their type and ID, created or updated with a PUT and a JSON payload:

PUT /:indexName/:documentTypeName/:documentId

:documentJson

They can be fetched by the same URL:

GET /:indexName/:documentTypeName/:documentId

The _source query parameter can be set to false to omit the source record from the response, or field1,field2 to include only field1 and field2.

Deletion is via a DELETE to the same URL:

DELETE /:indexName/:documentTypeName/:documentId
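The three document calls share one URL shape, so a small helper can build it along with the optional _source parameter. This is a sketch only: document_url and put_document_request are hypothetical names, and the index, type and field names in the usage lines are placeholders.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def document_url(base_url, index, doc_type, doc_id, source=None):
    """Build /:indexName/:documentTypeName/:documentId.

    source=False omits the source record; a list of field names
    restricts the returned source to those fields.
    """
    segments = [quote(str(s), safe="") for s in (index, doc_type, doc_id)]
    url = "/".join([base_url.rstrip("/")] + segments)
    if source is False:
        url += "?_source=false"
    elif source:
        url += "?_source=" + ",".join(quote(f, safe="") for f in source)
    return url


def put_document_request(url, document):
    """Prepare (but do not send) a PUT carrying the document as JSON."""
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


if __name__ == "__main__":
    url = document_url("http://localhost:9200", "blog", "post", 1)
    req = put_document_request(url, {"title": "Hi"})
    print(req.get_method(), req.full_url, req.data)
```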

Relevance

Relevance scoring is based on TF/IDF:

TF (term frequency) measures how often the term appears in the field value. The more appearances, the more relevant the match.
IDF (inverse document frequency) measures how often the term appears across the index. The more common the term, the less weight it carries.
Field-length normalisation adjusts scores such that longer field values are considered less relevant.
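To make the three factors concrete, here is a simplified sketch using the formulas from Lucene's classic TF/IDF similarity. It scores a single term only and leaves out query normalisation, coordination and boosts, so the numbers illustrate the trade-offs rather than reproduce a real score. Recent Elasticsearch versions default to BM25, a refinement of the same idea.

```python
import math


def tf(term_freq):
    """Term frequency: square root of the occurrences in the field."""
    return math.sqrt(term_freq)


def idf(doc_count, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: longer fields weigh less."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(term_freq, doc_count, doc_freq, num_terms):
    """Simplified single-term weight: tf * idf * norm."""
    return tf(term_freq) * idf(doc_count, doc_freq) * field_norm(num_terms)


if __name__ == "__main__":
    # Same term and frequency, first in a short field and then in a long one.
    print(term_weight(term_freq=2, doc_count=1000, doc_freq=9, num_terms=4))
    print(term_weight(term_freq=2, doc_count=1000, doc_freq=9, num_terms=100))
```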

Query DSL

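Elasticsearch's query DSL exposes most of Lucene's functionality in a JSON document or set of query parameters. The API is stateless: an ordinary search keeps no server-side cursor between requests, so results are paged by re-running the query with the from and size parameters (the scroll API exists for walking very large result sets).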

There are two contexts used in querying Elasticsearch:

The filter context determines only whether a document is included in the result for a query, answering a yes/no question without scoring or ranking by relevance.
The query context applies relevance scoring too, answering the "how well does the document match" question.

GET /:indexName/:documentTypeNameList/_search

{
  "query": {
    "match": {
      "field1": "value1"
    }
  },
  "size": 100,
  "sort": {
    "field1": {
      "order": "asc"
    }
  }
}
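The two contexts can be mixed in one request using a bool query: clauses under must run in query context and contribute to the score, while clauses under filter only include or exclude documents. A minimal sketch of building such a body follows; the search_body helper and the field names in the usage line are placeholders.

```python
import json


def search_body(text_field, text, filter_field, filter_value, size=10):
    """Search body with a scored match clause and an unscored term filter."""
    return {
        "query": {
            "bool": {
                # query context: affects relevance scoring
                "must": [{"match": {text_field: text}}],
                # filter context: yes/no inclusion only
                "filter": [{"term": {filter_field: filter_value}}],
            }
        },
        "size": size,
    }


if __name__ == "__main__":
    print(json.dumps(search_body("title", "quick fox", "status", "published"), indent=2))
```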

Elasticsearch supports different types of query:

Term-level:
- exists finds documents that contain any indexed value for a field.
- regexp finds documents that contain a value matching the specified regular expression.
- terms seeks one or more exact terms in a field.
Full text:
- match looks for an inexact match for a set of values within the text of a document and accepts an and/or operator, which defaults to or.
- match_phrase attempts to match the entire phrase, useful for matching quotes.
- match_phrase_prefix matches phrases beginning with the specified prefix.
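For reference, the query types listed above take roughly the following JSON shapes. The builder functions are hypothetical conveniences and the field names and values in the usage lines are placeholders; each returned dictionary goes under the "query" key of a search body.

```python
def exists(field):
    """Documents with any indexed value for the field."""
    return {"exists": {"field": field}}


def regexp(field, pattern):
    """Documents whose term matches the regular expression."""
    return {"regexp": {field: pattern}}


def terms(field, values):
    """Documents containing at least one of the exact terms."""
    return {"terms": {field: list(values)}}


def match(field, text, operator="or"):
    """Full-text match; operator is "or" (the default) or "and"."""
    return {"match": {field: {"query": text, "operator": operator}}}


def match_phrase(field, phrase):
    """Full-text match on the whole phrase, in order."""
    return {"match_phrase": {field: phrase}}


def match_phrase_prefix(field, prefix):
    """Phrase match where the last word is treated as a prefix."""
    return {"match_phrase_prefix": {field: prefix}}


if __name__ == "__main__":
    print({"query": match("body", "quick brown fox", operator="and")})
    print({"query": terms("tag", ["search", "lucene"])})
```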

The explain parameter gives justification for scoring.
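In a request body the parameter is a top-level "explain": true, and each hit then carries an _explanation tree of value, description and details entries. The sketch below switches the flag on and flattens such a tree into indented lines; both helpers are hypothetical and the tree layout is assumed from typical responses.

```python
import copy


def with_explain(body):
    """Return a copy of a search body with explain switched on."""
    explained = copy.deepcopy(body)
    explained["explain"] = True
    return explained


def flatten_explanation(node, depth=0):
    """Turn an _explanation tree into indented 'value: description' lines."""
    lines = [f"{'  ' * depth}{node['value']}: {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


if __name__ == "__main__":
    print(with_explain({"query": {"match": {"field1": "value1"}}}))
```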
