Performance Tuning with Elasticsearch

This time I would like to take a deeper look at Elasticsearch search performance.

The content comes from performance work I handled for our product.

Background for the improvement effort

Elasticsearch is a full-text search engine built on Apache Lucene. It stores data as documents and enables flexible search. In RDB terms, an index is like a table and a document is like a row.

Documents can contain strings and numbers, of course, but also arrays and objects. That flexibility can come at a cost: as the number of documents grows or document structures become more complex and bloated, performance can degrade.

For those reasons our search had stopped delivering the performance we expected, and we kicked off a performance improvement project.

Reviewing the Elasticsearch document structure

In our service we store the ads we publish in two units: “media” and “products”. Conceptually, a media item has a one-to-many relationship with products.

Search results are displayed at the media level, while the search itself starts at the product level. Because of this, we had to examine whether the documents we store in the index should be media-based or product-based.

To put it concretely, we had two options to consider: storing one document per media item, with its products (and their data) held in nested fields, or storing one document per product and grouping the results back into media at search time.

Products carry multiple pieces of normalized information: where they are listed, price, the stations they are associated with, whether they belong to an entire railway line rather than a single station, and so on. This makes search even more complicated.

As a result, we could not immediately decide which document structure would be better.

We therefore measured search speed using both structures and compared how much performance each delivered.

Pattern 1: Documents per media item (three nested levels)

If you search for Elasticsearch performance tips, you will often see warnings that holding child and grandchild elements in nested fields can hurt performance.

If we store documents per media item, a single document can hold the media, child elements (products), and grandchild elements (data attached to each product)—three levels deep.

One characteristic of nested fields is that even if you update only a child or grandchild element, Elasticsearch updates the entire document. This hurts write performance.

That said, on Apache Lucene (the engine backing Elasticsearch) the data in nested fields is stored as separate hidden documents. The cost of maintaining those mappings is high, so write performance suffers, but in our tests we did not observe any noticeable search slowdown (likely because we do not have an extreme number of nested fields).

Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/nested.html#_limits_on_nested_mappings_and_objects
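
For illustration, here is a minimal sketch of the three-level, media-based mapping described above. It assumes the official Elasticsearch Python client (8.x-style keyword arguments), and the index and field names (media, products, stations, and so on) are made up for the example, not our actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Media-based index: one document per media item, with products held in a
# nested field and product-specific data nested one level deeper (3 levels).
es.indices.create(
    index="media",
    mappings={
        "properties": {
            "media_name": {"type": "text"},
            "products": {
                "type": "nested",
                "properties": {
                    "price": {"type": "long"},
                    # grandchild data attached to each product
                    "stations": {
                        "type": "nested",
                        "properties": {
                            "station_name": {"type": "keyword"},
                            "whole_line": {"type": "boolean"},
                        },
                    },
                },
            },
        }
    },
)
```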

Write performance degradation is a real concern. We also anticipated that the data we keep in nested fields would grow, so we ran tests where we increased the number of products and product-specific data. In those tests we hit performance issues with inner hits when retrieving the nested fields.

This likely happened because the documents in the index were media-based and we were not narrowing the search down at the product level, so the number of nested fields inspected via inner hits was large.
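
The kind of nested query with inner hits we mean looks roughly like this; the conditions and field names follow the hypothetical mapping sketched earlier.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Nested query against the media-based index, using inner_hits to see which
# products matched. When the query barely narrows things down, inner_hits has
# to materialize matches across a huge number of nested products.
resp = es.search(
    index="media",
    query={
        "nested": {
            "path": "products",
            "query": {"range": {"products.price": {"lte": 50000}}},
            "inner_hits": {"size": 3},  # cap the matched products returned per media document
        }
    },
)
for hit in resp["hits"]["hits"]:
    matched = hit["inner_hits"]["products"]["hits"]["hits"]
    print(hit["_id"], len(matched))
```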

Pattern 2: Documents per product (two nested levels)

Next we tried storing documents per product and aggregating them by media using aggregations.

Even after switching to product-based documents we still needed to keep data tied to each product, so nested fields remained. The depth shrank by one level, leaving us with parent and child—two levels.

Once we switched to product-based documents, write performance improved dramatically. Updates could be made per product, which drastically lowered the cost for media that have large product counts.
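
As a rough sketch under the same assumptions (hypothetical names, official Python client), the product-based layout and a per-product update could look like this:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Product-based index: one document per product, carrying the id of the media
# it belongs to plus its own nested data (2 levels instead of 3).
es.indices.create(
    index="products",
    mappings={
        "properties": {
            "media_id": {"type": "keyword"},
            "price": {"type": "long"},
            "stations": {
                "type": "nested",
                "properties": {
                    "station_name": {"type": "keyword"},
                    "whole_line": {"type": "boolean"},
                },
            },
        }
    },
)

# Updating one product now rewrites only that small document, instead of
# re-indexing a whole media document together with all of its nested fields.
es.update(index="products", id="product-123", doc={"price": 48000})
```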

In terms of search, we do not view aggregations themselves as slow. However, we found that performance degraded heavily when the number of aggregated documents grew large.

Some media have tens of thousands of associated products. Every time such a media item is returned, Elasticsearch needs to aggregate tens of thousands of products into a single media entry, which is very expensive.

On top of that, when we retrieve inner hits, they are fetched per document. That means we would have to fetch tens of thousands of inner hits as well.
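
For reference, here is a simplified version of the kind of grouping we mean, aggregating product documents by a hypothetical media_id field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Group product documents back into media with a terms aggregation. A media
# item with tens of thousands of products means tens of thousands of documents
# folded into a single bucket, which is where this pattern gets expensive.
resp = es.search(
    index="products",
    size=0,  # we only want the buckets, not the individual product hits
    query={"range": {"price": {"lte": 50000}}},
    aggs={
        "by_media": {
            "terms": {"field": "media_id", "size": 20},
            "aggs": {"min_price": {"min": {"field": "price"}}},
        }
    },
)
for bucket in resp["aggregations"]["by_media"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["min_price"]["value"])
```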

So at this point we still cannot declare which document structure is superior. The takeaway is to monitor performance closely and choose the structure that best fits the real-world workload.

Tips for using aggregations

Aggregations are used to group multiple documents, similar to GROUP BY in SQL. Dashboards, for example, often rely on aggregations to display reports.

Here are three pain points we ran into when working with them.

  1. Inner hits are unavailable

Elasticsearch queries return the documents that match the search conditions, and you control how many are returned with the size parameter. When you use aggregations the results are grouped, so you typically set size to 0, which means you cannot retrieve inner hits. In hindsight that seems obvious, since the results are aggregated, but we could not read inner hits even per bucket.

Consequently, if we need inner hits we issue a query without aggregations. As long as you properly narrow the target documents, performance should remain acceptable.
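
Here is a sketch of that two-step approach, again with hypothetical field names: the aggregation query supplies the media ids, and a second query fetches the inner hits for just those ids.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step 1 (not shown): run the aggregation query with size=0 and collect the
# media ids to display. Step 2: a separate, non-aggregated query restricted to
# those ids, which is the only place inner_hits can be read.
media_ids = ["media-1", "media-2"]  # e.g. taken from the aggregation buckets

resp = es.search(
    index="products",
    query={
        "bool": {
            # narrow to the media we actually display so inner_hits stays cheap
            "filter": [{"terms": {"media_id": media_ids}}],
            "must": [
                {
                    "nested": {
                        "path": "stations",
                        "query": {"term": {"stations.whole_line": True}},
                        "inner_hits": {},
                    }
                }
            ],
        }
    },
)
```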

  2. Filter as much as possible before aggregating

You can apply filters after aggregating, analogous to HAVING in SQL.

The idea is the same as SQL: performance improves if you do as much filtering as possible before aggregating.

The same principle applies to inner hits. The key is to remove unnecessary documents ahead of time.
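
To make the contrast concrete, here is a sketch that filters in the query first and reserves bucket_selector (the HAVING-style post-filter) for a condition that genuinely depends on the aggregated values; the published field is invented for the example.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Filter in the query (the WHERE analogue) so fewer documents ever reach the
# aggregation; use bucket_selector (the HAVING analogue) only for conditions
# that depend on the aggregated values themselves.
resp = es.search(
    index="products",
    size=0,
    query={"bool": {"filter": [{"term": {"published": True}}]}},  # filter as early as possible
    aggs={
        "by_media": {
            "terms": {"field": "media_id", "size": 20},
            "aggs": {
                "min_price": {"min": {"field": "price"}},
                # post-aggregation filter: keep only buckets whose cheapest product is under 50,000
                "cheap_only": {
                    "bucket_selector": {
                        "buckets_path": {"minPrice": "min_price"},
                        "script": "params.minPrice < 50000",
                    }
                },
            },
        }
    },
)
```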

  3. Sorting works a bit differently

Aggregations let you sort after grouping. You can sort by fields defined in the aggregation.

We use bucket_sort for sorting, but in our tests we could not use scripts inside the sort clause the way we can in a normal Elasticsearch query.

To work around this we defined the value we wanted using bucket_script inside the aggregation and then referenced that from bucket_sort. It works differently than standard queries, so watch out for that detail.
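
A minimal sketch of that bucket_script plus bucket_sort combination, using a made-up derived value (the price spread per media):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# bucket_sort cannot run an arbitrary script the way a query-level sort can,
# so the derived value is first materialized with bucket_script and then
# referenced by name from bucket_sort.
resp = es.search(
    index="products",
    size=0,
    aggs={
        "by_media": {
            "terms": {"field": "media_id", "size": 100},
            "aggs": {
                "min_price": {"min": {"field": "price"}},
                "max_price": {"max": {"field": "price"}},
                # derived value: price spread within each media bucket
                "price_spread": {
                    "bucket_script": {
                        "buckets_path": {"minP": "min_price", "maxP": "max_price"},
                        "script": "params.maxP - params.minP",
                    }
                },
                # sort the buckets by the derived value and page through them
                "sorted_buckets": {
                    "bucket_sort": {
                        "sort": [{"price_spread": {"order": "desc"}}],
                        "from": 0,
                        "size": 20,
                    }
                },
            },
        }
    },
)
```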

Infrastructure configuration

The first thing to address on the infrastructure side is storage. Having ample disk space is a given, but using HDDs risks performance degradation.

The official documentation for indexing and search explicitly recommends SSDs, so it is worth considering them.

Reference:

We had already chosen SSDs during the initial build, so we did not run comparative tests this time.

A quick aside for those running on AWS EC2: for EBS SSD (general purpose) you can choose between gp2 and gp3. gp2 offers burst capability, while gp3 does not but lets you configure IOPS and throughput in advance.

The maximum IOPS and throughput for gp2 when bursting depend on the EBS volume size. Because our volumes are not that large, we concluded gp3 was a better fit than relying on gp2 bursts.

When you use gp3 with the default IOPS and throughput, it is also cheaper than gp2. We recommend picking between gp2 and gp3 based on the EBS size you need.
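
If you provision the volume programmatically, the gp3 IOPS and throughput are set at creation time. The sketch below uses boto3 with placeholder values (region, size, tags), so treat it as a starting point rather than a recommendation.

```python
import boto3

# Provision a gp3 data volume with explicit IOPS / throughput instead of
# relying on gp2 burst credits. All values here are placeholders; gp3 defaults
# to 3000 IOPS and 125 MiB/s, and both can be raised independently of size.
ec2 = boto3.client("ec2", region_name="ap-northeast-1")
volume = ec2.create_volume(
    AvailabilityZone="ap-northeast-1a",
    Size=200,          # GiB
    VolumeType="gp3",
    Iops=3000,
    Throughput=250,    # MiB/s; configurable only for gp3
    TagSpecifications=[
        {"ResourceType": "volume", "Tags": [{"Key": "Name", "Value": "es-data"}]}
    ],
)
print(volume["VolumeId"])
```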

Next is shard count. Shards are where the data is actually stored, and you can split an index across multiple shards.

Splitting an index across shards enables parallel execution, which can improve performance.

On the other hand, adding more shards also adds overhead on the servers, so the shard count needs to be tuned carefully.
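
In concrete terms, the primary shard count is just an index setting chosen at creation time (changing it later means reindexing or using the shrink/split APIs); here is a minimal sketch with a hypothetical index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The primary shard count is fixed when the index is created, so it lives in
# the index settings alongside the replica count.
es.indices.create(
    index="products_v2",
    settings={
        "number_of_shards": 3,    # lets a single search run on shards in parallel
        "number_of_replicas": 1,  # replicas add redundancy and read capacity
    },
    mappings={
        "properties": {
            "media_id": {"type": "keyword"},
            "price": {"type": "long"},
        }
    },
)
```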

The Elasticsearch documentation has a good summary, so if you are reviewing shard counts it is worth a read.

Reference:

Other observations

If you have Kibana, you can inspect queries using the profiler (Kibana -> Dev Tools -> Search Profiler). It shows the queries actually executed on Apache Lucene and measures execution time. If you configure multiple shards you can also see how long each shard takes.

I recommend giving it a try if you have access to Kibana.
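
For completeness, the same timings can be requested outside Kibana by adding the profile flag to a search; here is a small sketch of reading the per-shard Lucene timings from the response:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# profile=True returns the Lucene queries that were actually executed,
# broken down per shard, much like Kibana's Search Profiler.
resp = es.search(
    index="products",
    query={"term": {"media_id": "media-1"}},
    profile=True,
)
for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        for query in search["query"]:
            print(shard["id"], query["type"], query["time_in_nanos"])
```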

It is equally important to monitor real server load. Even though I focused on Elasticsearch search performance here, it is not uncommon to discover the root cause elsewhere.

When you work on performance improvements, the first step is to identify the bottleneck. If Elasticsearch is the culprit, you can then decide whether to optimize the query, tune parameters, or scale the infrastructure.

Start by observing the server metrics to determine where the bottleneck is.

Closing thoughts

This post covered Elasticsearch performance.

For services with search functionality, performance is always a critical topic. A slow service can be extremely stressful for users and can have a huge impact overall.

As data grows and logic evolves, we have to keep watching performance closely.

I hope this write-up helps anyone working with Elasticsearch.