
September 12, 2024
Decision Nodes

Analyzers in Elasticsearch and OpenSearch: How to Design the Searches You Need

Paul Lisker, Senior Software Engineer at Mark43

Paul is passionate about the growing intersection of technology and government. In his ten years of experience, he has worked in both the public and private sectors, ranging from the Federal Trade Commission to public safety software platforms. You can learn more about him at lisker.me.

It’s been several successful months at Makers and Markers, the artisanal whiteboards and whiteboard-related products company where you recently set up a search indexing pipeline to keep Elasticsearch (or, if you’ve switched to AWS, OpenSearch) up to date. With a treasure trove of data now readily available for searching, you realize it’s time to make sure your searches are optimized to surface the information your users need.

Over the years, you’ve heard something about Elasticsearch’s inverted index and its reliance on tokenizers, filters, and analyzers, but you’re at a loss as to how to begin. You find yourself asking: which analyzers should I use? How can I create an analyzer that fits the needs of my business?

This article covers different approaches to optimizing your searches, considering the complexity and benefits that each presents.

Fundamentals of Elasticsearch Indexing

Elasticsearch is a powerful search engine built on Apache Lucene; OpenSearch is a more recent fork of Elasticsearch. Both are NoSQL datastores optimized for storing and searching full-text documents, making them incredibly popular choices to back search functionality.

The Inverted Index

At its core, Elasticsearch’s capabilities are powered by its inverted index. Demystifying the inverted index gives us a window into the importance of tokenization and lays the foundation for the analyzers that help make Elasticsearch so versatile.

Whereas a traditional forward index would store documents, pointing to their content, an inverted index stores the content, pointing to the documents. This inversion works by first splitting the text content into tokens by some predefined tokenizer.

[Diagram: an inverted index mapping each token from three product names to the documents that contain it]

In this simple example, we can see how three of our company’s products are indexed into Elasticsearch, with each word (each “token”) pointing to the document on which it appears. The choice of tokenizer, therefore, is crucial. The standard tokenizer usually suffices, splitting on most punctuation and word boundaries. However, it can be paired with token filters that normalize the tokens according to some heuristic. For example, to enable case-insensitive searches, the aptly named lowercase token filter normalizes each letter in each token to lowercase!
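If you want to see this in action, the _analyze API lets you run text through a tokenizer and token filters without indexing anything. Here is a minimal sketch (the product name is just an illustration, and the request assumes a running cluster):

# Run a product name through the standard tokenizer and the lowercase token filter.
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Artisanal Whiteboard Eraser"
}

The response lists the tokens artisanal, whiteboard, and eraser, which is exactly what would land in the inverted index for that field.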

Choosing an Analyzer

Analyzers are simply a packaged pipeline of character filters, a tokenizer, and token filters. In other words, if the tokenizer and filters are the stations where the document is transformed, step by step, the analyzer is the full assembly line, from document to resulting tokens.

[Diagram: an analyzer as an assembly line running a document through character filters, a tokenizer, and token filters]

The standard analyzer, which Elasticsearch uses for all text analysis by default, divides text into tokens by word boundaries, lowercases the tokens, and removes most punctuation. It can also be configured to filter out stop words (common words, such as “the”) in English, although stop-word removal is disabled by default.

While this analyzer would likely be helpful in surfacing our wide selection of artisanal whiteboard products to prospective buyers, it would be a poor fit for employees searching by SKU codes, where an exact match is crucial. In this scenario, a Keyword Analyzer, which keeps the entire field value as a single token, performs no normalization, and preserves characters such as hyphens, would be ideal.
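As a sketch of how that distinction might look in an index mapping (the index and field names here are hypothetical), the product name gets full-text analysis while the SKU is kept as one exact token:

PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "standard" },
      "sku":  { "type": "text", "analyzer": "keyword" }
    }
  }
}

In practice, you could also map the SKU with the keyword field type directly, which behaves the same way for exact matches.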

To understand how the built-in analyzers that come with Elasticsearch work, we can look at how each of them processes a simple sentence.

[Diagram: how Elasticsearch’s built-in analyzers each tokenize the same simple sentence]
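You can reproduce this kind of comparison yourself with the _analyze API by swapping out the analyzer name; for example (the sentence is arbitrary):

# Standard analyzer: lowercased word tokens
POST _analyze
{
  "analyzer": "standard",
  "text": "The Quick-Dry Markers erase cleanly!"
}

# Keyword analyzer: the entire text as a single token
POST _analyze
{
  "analyzer": "keyword",
  "text": "The Quick-Dry Markers erase cleanly!"
}

The first request returns the tokens the, quick, dry, markers, erase, and cleanly, while the second returns the original sentence untouched as one token.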

Understanding the fundamentals of how Elasticsearch indexing works, particularly through the use of analyzers, provides a solid foundation for customizing your search capabilities. While the default analyzers like the standard and keyword analyzers are quite powerful and often sufficient for many use cases, there are scenarios where tailored text processing can significantly enhance search results.

Pre-Indexing vs. Post-Indexing Analyzers

But be careful: there’s a catch! Since documents are analyzed during indexing, analyzers cannot be changed for a given index after data has been ingested. After all, queries must use the same analyzer as used during indexing to ensure correct search results. If you find the need to change analyzers (perhaps you’ve realized that product names in the artisanal whiteboard world are notoriously case-sensitive), you’ll need to reindex your data into a new index configured with the proper analyzers.
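The reindex itself is straightforward with the _reindex API. The sketch below assumes you have already created a hypothetical products-v2 index with the new analyzer configuration:

POST _reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products-v2" }
}

Once the copy completes, you can point an index alias (or your application) at products-v2 and retire the old index.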

Fortunately, while most analyzers are pre-indexing analyzers that apply to both the indexed documents and the query, a small category of post-indexing analyzers is available that apply only to the query. These analyzers, such as those built with the synonym, phonetic, or n-gram filters, can therefore be added and modified after the fact.

Recently, customers were struggling with a location search query. They were searching for an address on Martin Luther King, Jr., Ave, but were querying for “MLK Ave”. By configuring a synonym that treated “Martin Luther King, Jr.” as equivalent to “MLK” and adding it to the existing index, we enabled users to immediately get the desired results.
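As a sketch of what that might look like (the index, field, and filter names are illustrative), the synonym filter lives in a search analyzer, so it rewrites queries rather than the indexed addresses:

PUT /locations
{
  "settings": {
    "analysis": {
      "filter": {
        "address_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["mlk => martin luther king jr"]
        }
      },
      "analyzer": {
        "address_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "address_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "address_search"
      }
    }
  }
}

With this in place, a query for “MLK Ave” is expanded at search time to match the tokens already indexed for “Martin Luther King, Jr., Ave”.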

[Diagram: post-indexing analyzers, including those built with synonym, phonetic, and n-gram filters]

If you consider the analyzers in the graphic above, you can see that the phonetic analyzer can be particularly helpful with difficult-to-spell words (“whitebored”, perhaps?), whereas the n-gram analyzer can help with autocomplete as users are typing.
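For the autocomplete case, an analyzer built around an edge n-gram token filter might look like the following sketch (the index name, filter name, and gram sizes are illustrative):

PUT /products-autocomplete
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_edge"]
        }
      }
    }
  }
}

At index time, “whiteboard” is broken into wh, whi, whit, and so on, so a user who has only typed “whi” already gets a match; you would typically pair this with a plain search analyzer so the query text itself isn’t exploded into grams.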

Creating a Custom Analyzer

You’ve presented your new search of product reviews to your Product Manager, and the feedback starts pouring back. Our users are so fond of writing smileys on our excellent whiteboards that they’ve switched to searching for reviews with emoticons! But with the volume of feedback, it’s hard to understand the general reception: are users frustrated with inaccurate searches with false positives and negatives abounding? Or do they love the new search functionality you’ve created?

Rather than panicking, though, you have a spark of inspiration: using your newfound knowledge of token filters and tokenizers, you’ll package several helpful transformations to create a new custom analyzer to parse through the barrage of emoticon-filled feedback coming your way.

First, you add a character filter that transforms the emoticons :) and :( to happy and sad, respectively; character filters run before tokenization, so the emoticons are still intact when the filter sees them. Then, you add a pattern tokenizer that splits the text on whitespace and the remaining punctuation. Finally, you add a stop token filter to remove stopwords. You bundle these up as your custom emoticons analyzer, stand up the index, and you’re set!
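Put together, the emoticons analyzer might look something like this sketch (the index name, tokenizer pattern, and emoticon mappings are illustrative):

PUT /reviews
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "emoticons_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "review": { "type": "text", "analyzer": "emoticons_analyzer" }
    }
  }
}

A review like “Love this board :)” would be indexed as the tokens love, board, and happy, so a search for “happy” surfaces the smiley reviews.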

Once you’ve deployed this analyzer and the upgraded search, you can follow up and fine-tune it with synonyms, continuing to respond to customer feedback and the logs you’ve set up to monitor customer satisfaction.

Performance of Different Analyzers

While the choice of analyzers will most directly impact the quality of the results of a search, it can also affect the performance of the search cluster by increasing the size of a given index on disk, taxing the CPU, and consuming more memory.

For example, a Keyword Analyzer, which treats the entire text of a given field as a single token, has a minimal impact on the search cluster, requiring little memory or CPU. In fact, keyword analyzers produce some of the fastest queries, but this speed comes with a tradeoff: keyword analyzers cannot leverage any of the fuzzy search or partial-match features that Elasticsearch provides, rendering them less flexible for text searches.

On the other extreme, n-gram analyzers or synonym analyzers can lead to very resource-intensive queries, depending on how many terms are generated as part of the query. If the n-gram analyzer is configured to generate many overlapping terms of varying lengths, and the search text is lengthy, the final query could result in a high number of terms to be matched. Similarly, synonym analyzers can expand a query to contain many equivalent terms. In both these situations, the complex query could impact performance, with both high CPU and memory consumed during query execution.

Elasticsearch attempts to limit the size of queries and the impact they may have on its cluster health. Some of these limits can be configured, such as by increasing the max_expansions on a given query, allowing for more complex queries. However, this is always a delicate balance that should be thoroughly tested against the resources available to your search cluster. After all, while increasing the value may result in higher-quality search results, the performance of the search itself may suffer.
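For example, max_expansions caps how many terms a single clause is allowed to expand into; the sketch below raises it on a match_phrase_prefix query (the index, field, and query text are illustrative):

GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "artisanal whiteb",
        "max_expansions": 100
      }
    }
  }
}

The default is 50; raising it lets the prefix match against more candidate terms, at the cost of a more expensive query.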

Decision Tree for Choosing an Analyzer

Question 1: Is the data you are indexing exact data, such as SKU codes or email addresses?

If the data is exact, use a Keyword Analyzer to preserve the data’s formatting.

Question 2: Does the data have any differentiating special formatting or require case-sensitivity?

Perhaps your product codes are distinguished by case, or have some meaningful formatting or punctuation? A Custom Analyzer that uses the standard tokenizer, disables the lowercase filter, and preserves the important punctuation with a character filter could be ideal (see the sketch after this decision tree).

Question 3: In what language is your text written?

To remove stopwords and preserve meaningful tokens for your search, use a language-specific analyzer. Besides their built-in stopword lists, these analyzers can usually be configured with your own list of additional stopwords.

Question 4: Would the search benefit from fuzzy searching or insensitivity to spelling?

Add an analyzer with a phonetic token filter to be flexible with spelling, and an n-gram or edge n-gram filter for fuzzy searching or partial-word matching. Phonetic filters can use different encoders, defaulting to metaphone, an improved version of the Soundex encoder.

Question 5: Are there any known equivalent terms, whether synonyms or abbreviations?

If yes, create a custom analyzer with a synonym filter configured to your needs. Remember, you can edit this later on!

If you answered no to all of these questions, and the text is written in English, the Standard Analyzer would likely suffice!
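As an illustration of Question 2, here is a sketch of a case-preserving custom analyzer (all names are hypothetical): it skips the lowercase filter and uses a mapping character filter to swap hyphens for underscores, so that a code like WB-2000 should stay together as the single token WB_2000 rather than being split.

PUT /product-codes
{
  "settings": {
    "analysis": {
      "char_filter": {
        "preserve_hyphens": {
          "type": "mapping",
          "mappings": ["- => _"]
        }
      },
      "analyzer": {
        "case_sensitive_codes": {
          "type": "custom",
          "char_filter": ["preserve_hyphens"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": { "type": "text", "analyzer": "case_sensitive_codes" }
    }
  }
}

Because queries against the field use the same analyzer by default, searches stay case-sensitive as well.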

While developing a search, I try sample queries with the Analyze API to understand the impact of the analyzer on my search. Finally, it is worth mentioning that multiple analyzers can be combined in a single query using a Boolean Query. Together, these approaches create a highly configurable search service that helps ensure you can optimize the search for your needs.
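For instance, assuming a hypothetical multi-field mapping where name is a text field that also has a name.exact keyword sub-field, a bool query can reward exact matches while still catching full-text ones:

GET /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "dry erase markers" } },
        { "term": { "name.exact": { "value": "Dry Erase Markers", "boost": 2.0 } } }
      ]
    }
  }
}

Documents that match the exact, unanalyzed sub-field score higher, while the analyzed match clause keeps recall high.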

Conclusion

Elasticsearch and OpenSearch’s success in search is driven by the inverted index created in their Apache Lucene core. However, it’s their implementation of analyzers that turns this fundamental technology into one flexible enough to meet the needs of your product. At Makers and Markers, this meant using Elasticsearch’s built-in functionalities to help your customers search for your products and creating a custom analyzer to meet the needs of the users’ emoticon searches. The flexibility that tokenizers and filters provide creates a search service that is malleable and easily adaptable to your particular context, whether that involves selling whiteboard products or working in another industry. In my career, I’ve used Elasticsearch to make financial data, emergency locations, and other public safety records searchable. So keep trying out analyzers, and remember that creating custom solutions can be both rewarding and the optimal path forward. Happy searching!