Elasticsearch Tuning and Tips

Elasticsearch is a distributed database with an HTTP API. Here are some things I’ve learned. I’ve installed Elasticsearch version 7.x on Mac OS via Homebrew

Concepts

Mapping concepts from an RDMBS can be helpful.

Index - this is like a RDBMS table
Document - this is like a RDBMS row
Mapping - this is like an RDBMS DDL structure, although it can be applied upfront or later on. Implicit vs. explicit.
Immutable documents - documents are immutable, when updating a document, it is not modified in place but is marked for deletion and replaced by a new version with the changes. PostgreSQL works this way as well for row updates and deletes.

More Concepts

These concepts are specific to the architecture of Elasticsearch and scalability.

Shard - A self-contained index
- Primary shard - for indexing requests. Each document is in a primary shard. Fixed at index creation.
- Replica shard - a copy of a primary shard. Replica shards can be added to scale search requests.
Node - (servers) nodes serve primary or replica shards
Cluster - a collection of nodes
Deployment (cluster) - this is Elastic.co terminology that is synonymous with cluster. A deployment will contain an Elasticsearch cluster, as well as nodes for other services like Kibana.
Segment merging - how Elasticsearch processes deleted documents

API Concepts

Elasticsearch has an HTTP API. That means HTTP verbs like POST, PUT, GET and DELETE are mapped to concepts like creating, updating, searching and deleting things.

Elasticsearch also supports a bulk API that can be used to create and delete multiple documents.

App Development Concerns

If the application provides explicit mappings for an index, do not create the indices manually but create them via the application so that they have the correct mapping types.

Create Index

curl -XPUT 'http://localhost:9200/foo'

Add Documents To Index

Create a document with id 1 in the index foo with a title of “My title”.

curl -H 'Content-Type: application/json' -X POST 'localhost:9200/foo/_doc/1?pretty' -d '
                                                  {
                                                  "title": "My title"
                                                  }'

Search an Index

There are various ways of querying, this is using the Query String format. We can search for the document we just put into the index.

Adding pretty onto the end will format the JSON output on multiple lines and with indentation.

curl -X GET 'localhost:9200/foo/_search?q=title:title&pretty'

Count documents

Count API

GET /index/_count

More Queries

GET /_cat/indices
GET /_cat/indices/*pattern*
GET /index/_search # list top 10 documents
GET /index/_search -d '{"foo":"bar"}' # some JSON search payload
GET /index/_search -d '{"foo":"bar"}' # some JSON search payload
GET /_cat/shards
GET /_cat/shards/index-name # shards for particular index

Example search request with a payload via curl:

curl -H "Content-Type: application/json" "http://localhost:9200/some_index_name/_search?pretty" -d '
{
    "from": 0,
    "size": 50,
    "sort": [
    ],
    "query": {
        "bool": {
            "filter": [
            ]
        }
    }
}
'

Check index mapping types:

GET /index_name/_mapping

Tuning

Parameter	Default
index.refresh_interval	Every 1s	Tune for indexing speed

Logs

On Mac OS ES 7 via Homebrew. Tailing the log file:

tail -f /usr/local/var/log/elasticsearch/elasticsearch_brew.log

Running via Docker (Recommended method):

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.0

Some activity will be logged like index creation.

Stats

Check the index stats, e.g. for deleted documents:

GET /index/_stats

For a given index, the deleted documents were as much as 32% of the total number of documents in a performance sensitive index we have with over 100 GB of size, and with wildcard queries which are already more costly.

According to how Lucene handles deleted documents, this percentage is within the normal range though.

Use Cases

As a primary database

Elasticsearch can be used as a primary database in a way similar to a RDBMS like PostgreSQL.

The operational concerns here are more about indexing rate, search speed etc. as opposed to search results relevancy.

Resources

As a search engine

Elasticsearch has powerful capabilities for searching.

Tools

Kibana - visualization tool, search logs, API console
Rally benchmarking

Some tooling in Ruby

Tracking searches - Searchjoy
Elasticsearch ORM for Ruby - Chewy

Andrew Atkinson