Stemming is the process of reducing a word to its root form. This ensures that variants of a word match during a search. For example, walking and walked can both be stemmed to the same root word: walk. Once stemmed, an occurrence of either word matches the other in a search.
Stemming is language-dependent but often involves removing prefixes and suffixes from words.
In some cases, the root form of a stemmed word may not be a real word. For example, jumping and jumpiness can both be stemmed to jumpi. While jumpi isn’t a real English word, it doesn’t matter for search; if all variants of a word are reduced to the same root form, they will match correctly.
In Elasticsearch, stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words:
- Algorithmic stemmers: stem words based on a set of rules
- Dictionary stemmers: stem words by looking them up in a dictionary

<aside> 💡 Because stemming changes tokens, we recommend using the same stemmer token filters during index and search analysis.
</aside>
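As a rough illustration of using the same stemmer at index and search time, the sketch below builds a create-index request body as a Python dictionary. It uses the `stemmer` token filter with `language` set to `english` inside a custom analyzer, then assigns that analyzer as both `analyzer` and `search_analyzer` on a text field. The names `my_stemmer`, `my_analyzer`, and `body` are placeholders invented for this example, and the exact settings you need may differ.

```python
import json

# Hypothetical names: "my_stemmer", "my_analyzer", and the "body" field
# are placeholders for this sketch, not names from the original text.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Stemmer token filter configured for English
                "my_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the stemmer sees normalized tokens
                    "filter": ["lowercase", "my_stemmer"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                # Same analyzer (and so the same stemmer) for both phases
                "analyzer": "my_analyzer",
                "search_analyzer": "my_analyzer",
            }
        }
    },
}

# Print the JSON that would be sent as the create-index request body
print(json.dumps(index_settings, indent=2))
```

If `search_analyzer` is omitted, the field's `analyzer` is used for searches as well; it is spelled out here only to make the pairing explicit.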
Algorithmic stemmers apply a series of rules to each word to reduce it to its root form. For example, an algorithmic stemmer for English may remove the -s and -es suffixes from the end of plural words.
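To make the idea of rule-based stemming concrete, here is a minimal sketch of a toy plural stripper. It is not the algorithm used by any Elasticsearch stemmer; the function name `strip_plural` and its specific rules are assumptions chosen for illustration only.

```python
def strip_plural(word: str) -> str:
    """Toy rule-based stemmer: removes a trailing -es or -s from a word."""
    lower = word.lower()

    # Rule 1: drop "-es" after sibilant endings (boxes -> box, dishes -> dish)
    for ending in ("sses", "xes", "ches", "shes", "zes"):
        if lower.endswith(ending):
            return lower[:-2]

    # Rule 2: drop a plain "-s", but leave short words and words ending
    # in "ss", "us", or "is" alone (glass, bus, analysis)
    if (
        len(lower) > 3
        and lower.endswith("s")
        and not lower.endswith(("ss", "us", "is"))
    ):
        return lower[:-1]

    # No rule applied: return the word unchanged
    return lower


for w in ("cats", "boxes", "dishes", "glass"):
    print(w, "->", strip_plural(w))
```

Real algorithmic stemmers apply many more rules in a fixed order, but the principle is the same: each rule only rewrites the characters already present in the word.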
Advantages:
However, most algorithmic stemmers only alter the existing text of a word. This means they may not work well with irregular words that don’t contain their root form, such as:
- be, are, and am
- mouse and mice
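The limitation can be seen in a small sketch. The toy rule below only strips a trailing "s", so it cannot connect mice to mouse; a lookup table, in the spirit of a dictionary stemmer, can. Both `rule_stem` and the hand-written `IRREGULAR` table are hypothetical and exist only for this example.

```python
def rule_stem(word: str) -> str:
    """Toy rule: strip a trailing "s" unless the word ends in "ss"."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Hypothetical hand-written table mapping irregular forms to a root
IRREGULAR = {"mice": "mouse", "are": "be", "am": "be"}


def lookup_stem(word: str) -> str:
    """Check the table first, then fall back to the rule."""
    return IRREGULAR.get(word, rule_stem(word))


# The rule alone leaves the irregular pair unmatched
print(rule_stem("mice") == rule_stem("mouse"))

# With the lookup table, both forms reduce to the same root
print(lookup_stem("mice") == lookup_stem("mouse"))
```

The rule has nothing to work with because the root form does not appear inside the irregular word, so only extra knowledge about the word can bridge the gap.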