Elasticsearch
Elasticsearch is a distributed indexing and document database with search, aggregation, and sharding functionality. It's written in Java and builds on Lucene.
Concepts
- Documents are the unit of data storage and retrieval, expressed in JSON.
- Document Types define the properties of Documents, giving them a schema.
- Indices represent collections of related documents.
- Index Templates allow reusing index schemas, providing a base from which Indices can be created.
- Clusters are deployments of Elasticsearch, comprised of one or more Nodes and a set of Indices.
- Nodes store, index, and search the Indices.
- Sharding allows us to distribute storage and retrieval operations across Cluster Nodes, and to replicate the data to protect against Node failure.
Installation
Configuration is stored in /etc/elasticsearch/elasticsearch.yml.
Since the Elasticsearch mmapfs storage type makes extensive use of memory mapping, it's usually necessary to raise the vm.max_map_count sysctl. The Debian and RPM packages supplied by Elastic may raise this on installation.
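As a rough way to check whether the limit needs raising on a Linux host, the sketch below reads the current value from /proc and compares it with 262144, the minimum Elastic's documentation recommends. The function names are made up for illustration, and the /proc path is an assumption that only holds on Linux.

```python
from pathlib import Path

# Minimum value recommended by Elastic's documentation for vm.max_map_count.
RECOMMENDED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the current limit as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def needs_raising(current, minimum=RECOMMENDED_MAX_MAP_COUNT):
    """True when the limit is known and sits below the minimum."""
    return current is not None and current < minimum


current = read_max_map_count()
if needs_raising(current):
    print(f"vm.max_map_count is {current}; consider: "
          f"sysctl -w vm.max_map_count={RECOMMENDED_MAX_MAP_COUNT}")
else:
    print(f"vm.max_map_count: {current}")
```

A value set with sysctl -w lasts only until the next reboot; persisting it usually means adding it to a file under /etc/sysctl.d.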
Management
CAT API
The CAT API provides (barely) human readable output in a pinch. List all actions by querying /_cat, get help with the help query string parameter, get verbose output with v, and limit the headers with h and a comma-delimited list of values.
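To make the parameter combinations concrete, here is a small sketch that assembles CAT API paths. The cat_path helper and its argument names are hypothetical; indices and nodes are real CAT actions, and index and docs.count are columns of the indices action.

```python
def cat_path(action="", *, show_help=False, verbose=False, headers=None):
    """Build a CAT API path such as /_cat/indices?v&h=index,docs.count."""
    path = "/_cat" + (f"/{action}" if action else "")
    params = []
    if show_help:
        params.append("help")          # describe the available columns
    if verbose:
        params.append("v")             # include a header row in the output
    if headers:
        params.append("h=" + ",".join(headers))  # restrict the columns
    return path + ("?" + "&".join(params) if params else "")


print(cat_path())                                   # list all actions
print(cat_path("nodes", show_help=True))
print(cat_path("indices", verbose=True, headers=["index", "docs.count"]))
```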
Sharding and replication
Each index has its own sharding and replication configuration. By default, each index is assigned five primary shards and one replica of each, for two copies of the data overall (from Elasticsearch 7.0 the default is one primary shard).
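The per-index configuration is supplied as a settings payload when the index is created. The sketch below builds such a payload and works out how many shard copies the cluster ends up holding; the helper names are illustrative, while number_of_shards and number_of_replicas are the actual setting names.

```python
import json


def index_settings(shards=5, replicas=1):
    """Settings body to send when creating an index."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,      # primary shards
                "number_of_replicas": replicas,  # replicas per primary
            }
        }
    }


def total_shard_copies(shards, replicas):
    """Primaries plus all of their replicas."""
    return shards * (1 + replicas)


print(json.dumps(index_settings(shards=3, replicas=2), indent=2))
print(total_shard_copies(3, 2))
```

The number of primary shards is fixed at creation time, whereas the replica count can be changed later on a live index.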
REST API
Index management
Create by issuing a PUT for the named index:
PUT /:indexName
Delete with a DELETE to the index location:
DELETE /:indexName
Document management
Documents are nested below their type and ID, created or updated with a PUT and a JSON payload:
PUT /:indexName/:documentTypeName/:documentId
:documentJson
They can be fetched by the same URL:
GET /:indexName/:documentTypeName/:documentId
The _source query parameter can be set to false to omit the source record from the response, or to field1,field2 to include only field1 and field2.
Deletion is via a DELETE to the same URL:
DELETE /:indexName/:documentTypeName/:documentId
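The three document operations can be sketched with the Python standard library. Nothing is sent here: the functions only construct urllib Request objects. The base URL, the library index, the book type, and the helper names are all assumptions made for the example.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed local node


def document_url(index, doc_type, doc_id, source=None):
    """URL for one document; source=False omits _source, a list selects fields."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    if source is False:
        url += "?_source=false"
    elif source:
        url += "?_source=" + ",".join(source)
    return url


def put_document(index, doc_type, doc_id, document):
    """Create or update a document with a JSON payload."""
    return Request(
        document_url(index, doc_type, doc_id),
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_document(index, doc_type, doc_id, source=None):
    return Request(document_url(index, doc_type, doc_id, source), method="GET")


def delete_document(index, doc_type, doc_id):
    return Request(document_url(index, doc_type, doc_id), method="DELETE")


req = get_document("library", "book", "1", source=["title", "author"])
print(req.get_method(), req.full_url)
```

Passing any of these objects to urllib.request.urlopen would perform the call against a running node.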
Relevance
Relevance scoring is based on TF/IDF:
- TF (term frequency) measures how often the term appears in the field value. The more appearances, the more relevant the match.
- IDF (inverse document frequency) measures how often a term appears across the index. The more common the term, the less relevant it is.
- Field-length normalisation adjusts scores such that longer field values are considered less relevant.
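The three factors can be illustrated with a toy scorer. This is a simplified sketch loosely modelled on Lucene's classic similarity, not the exact formula Elasticsearch applies (which has further terms and varies by version). The documents are invented and are assumed to be already tokenised.

```python
import math


def term_frequency(term, field_tokens):
    """Grows with the number of occurrences, dampened by a square root."""
    return math.sqrt(field_tokens.count(term))


def inverse_document_frequency(term, docs):
    """Shrinks as more documents in the index contain the term."""
    containing = sum(1 for tokens in docs if term in tokens)
    return 1.0 + math.log(len(docs) / (containing + 1))


def field_length_norm(field_tokens):
    """Shorter fields get a larger norm than longer ones."""
    return 1.0 / math.sqrt(len(field_tokens)) if field_tokens else 0.0


def score(term, field_tokens, docs):
    return (term_frequency(term, field_tokens)
            * inverse_document_frequency(term, docs)
            * field_length_norm(field_tokens))


docs = [
    ["quick", "brown", "fox"],
    ["quick", "quick", "dog"],
    ["lazy", "dog", "sleeps", "all", "day", "long"],
]
for tokens in docs:
    print(tokens, round(score("dog", tokens, docs), 3))
```

With this data, the shorter of the two documents containing "dog" scores higher, which is the field-length effect described above.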
Query DSL
Elasticsearch's query DSL exposes most of Lucene's functionality in a JSON document or set of query parameters. The API is stateless: a plain search holds no cursor between requests, so paging through results means re-issuing the query with different from and size values.
There are two contexts used in querying Elasticsearch:
- The filter context determines whether a document is included in the result for a query, answering only the yes/no "does the document match" question and not attempting to rank results by relevance.
- The query context applies relevance scoring too, answering the "how well does the document match" question.
GET /:indexName/:documentTypeNameList/_search
{
  "query": {
    "match": {
      "field1": "value1"
    }
  },
  "size": 100,
  "sort": {
    "field1": {
      "order": "asc"
    }
  }
}
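The two contexts described above can be combined in one request using a bool query, where must clauses run in query context and filter clauses run in filter context. The sketch below only builds the request body; the field names and the search_body helper are hypothetical.

```python
import json


def search_body(text_field, text, filter_field, filter_value, size=10):
    """Score on a full-text match while filtering on an exact term."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {text_field: text}}],
                # Filter context: include or exclude only, no scoring.
                "filter": [{"term": {filter_field: filter_value}}],
            }
        },
        "size": size,
    }


body = search_body("title", "desert planet", "status", "published")
print(json.dumps(body, indent=2))
```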
Elasticsearch supports different types of query:
- Term-level:
  - exists finds documents that contain any indexed value for a field.
  - regexp finds documents that contain a value matching the specified regular expression.
  - terms seeks one or more exact terms in a field.
- Full text:
  - match looks for an inexact match for a set of values within the text of a document and accepts an and/or operator, which defaults to or.
  - match_phrase attempts to match the entire phrase, useful for matching quotes.
  - match_phrase_prefix matches phrases beginning with the specified prefix.
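As a quick reference for the shape of each clause, the sketch below builds the JSON for the query types listed above. The builder functions and the field names are illustrative; the clause layouts follow the query DSL.

```python
import json


def exists(field):
    return {"exists": {"field": field}}


def regexp(field, pattern):
    return {"regexp": {field: pattern}}


def terms(field, values):
    return {"terms": {field: list(values)}}


def match(field, text, operator="or"):
    # operator controls whether any ("or") or all ("and") words must match.
    return {"match": {field: {"query": text, "operator": operator}}}


def match_phrase(field, phrase):
    return {"match_phrase": {field: phrase}}


def match_phrase_prefix(field, prefix):
    return {"match_phrase_prefix": {field: prefix}}


for clause in (
    exists("author"),
    regexp("isbn", "978-.*"),
    terms("genre", ["scifi", "fantasy"]),
    match("title", "desert planet", operator="and"),
    match_phrase("quote", "fear is the mind-killer"),
    match_phrase_prefix("title", "children of"),
):
    print(json.dumps({"query": clause}))
```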
The explain parameter gives justification for scoring.