Analyzers & Tokenizers

Text analysis in Elasticsearch — inverted index tokenization, built-in analyzers, tokenizers, token filters, custom analyzers, and common errors.

35m · 20m reading · 15m lab

How Analysis Works

Every text field goes through a three-step pipeline:

Input Text → Character Filters → Tokenizer → Token Filters → Indexed Tokens
| Step | Purpose | Example |
|---|---|---|
| Character Filter | Strip or replace characters before tokenization | `<b>Hello</b>` → `Hello` |
| Tokenizer | Split text into tokens | `"quick brown fox"` → `["quick", "brown", "fox"]` |
| Token Filter | Transform tokens (lowercase, stemming, synonyms) | `["Running", "FAST"]` → `["run", "fast"]` |
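All three stages can be combined in a single `_analyze` request to watch the pipeline run end to end (using the built-in `html_strip` character filter):

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Quick</b> Brown FOX"
}'
```

The HTML tags are stripped first, the standard tokenizer splits the remaining text, and the lowercase filter produces `["quick", "brown", "fox"]`.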

Tokenization in Action

Step 1: Raw Text

"Elasticsearch is blazing fast and scalable."

Step 2: Tokenization (split into words)

["Elasticsearch", "is", "blazing", "fast", "and", "scalable"]

Step 3: Lowercasing + Stop Word Removal

["elasticsearch", "blazing", "fast", "scalable"]

Stop words like is, and, the are removed (depending on the analyzer). The remaining tokens are stored in the inverted index.

Token Filters

| Filter | Input | Output | Use Case |
|---|---|---|---|
| Lowercasing | Elasticsearch | elasticsearch | Case-insensitive search |
| Punctuation | full-text | full, text | Word boundary splitting |
| Stop words | is, and, the | (removed) | Reduce noise |
| Stemming | indexing | index | Match word variants |
| Synonyms | fast | fast, quick | Expand search terms |
| Edge N-Gram | search | s, se, sea... | Autocomplete |

Choosing the Right Tokenizer

| Use Case | Best Tokenizer |
|---|---|
| Free text (articles, blog posts) | standard |
| Tags, exact keywords, IPs | keyword |
| Custom structured logs | pattern, whitespace |
| File paths | path_hierarchy |
| URLs and emails | uax_url_email |
| Autocomplete | edge_ngram, ngram |
| Source code, identifiers | whitespace, keyword |
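The less common tokenizers are easy to try directly. For example, `path_hierarchy` emits every ancestor of a path, which is what makes "find everything under /var/log" queries work:

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "tokenizer": "path_hierarchy",
  "text": "/var/log/nginx/access.log"
}'
```

Tokens: `["/var", "/var/log", "/var/log/nginx", "/var/log/nginx/access.log"]`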

Testing Analyzers

Use the _analyze API to see how text is processed:

curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox jumps over 2 lazy dogs!"
}'
Response shows the tokens: ["the", "quick", "brown", "fox", "jumps", "over", "2", "lazy", "dogs"]

Test a Custom Analyzer on an Existing Index

curl -X POST "http://localhost:9200/my-index/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "autocomplete",
  "text": "elasticsearch"
}'

Test Individual Components

# Test just a tokenizer
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "text": "user-name@host.com logged-in"
}'

# Test tokenizer + filters
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}'

Built-in Analyzers

Standard Analyzer (default)

Tokenizes on word boundaries, lowercases, removes punctuation.

curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "user.name@example.com logged-in at 10:30"
}'
Tokens: ["user.name", "example.com", "logged", "in", "at", "10", "30"]

Simple Analyzer

Splits on non-letter characters, lowercases.

curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "Error-404: Page not found!"
}'
Tokens: ["error", "page", "not", "found"]

Whitespace Analyzer

Splits on whitespace only. No lowercasing.
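Running the same input as the simple-analyzer example makes the contrast visible — case and punctuation are preserved:

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "Error-404: Page not found!"
}'
```

Tokens: `["Error-404:", "Page", "not", "found!"]`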

Keyword Analyzer

No tokenization — the entire input becomes a single token. Used for exact-match fields.
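Verifying with `_analyze` shows the whole input emitted unchanged as one token:

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "keyword",
  "text": "New York City"
}'
```

Tokens: `["New York City"]` — spaces, case, everything intact.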

Pattern Analyzer

Splits on a regex pattern (default: \W+).
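With the default pattern, any run of non-word characters is a delimiter, and the pattern analyzer lowercases by default:

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "Comma,Separated,Values"
}'
```

Tokens: `["comma", "separated", "values"]`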

Language Analyzers

Built-in analyzers for 30+ languages with stemming and stop words:

curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "english",
  "text": "The foxes were running quickly through the forests"
}'
Tokens: ["fox", "run", "quickli", "through", "forest"] — notice stemming.

Analyzer Comparison

Input: `"user-name@host.com"`

| Analyzer | Tokens | Behavior |
|---|---|---|
| standard | ["user", "name", "host.com"] | Word boundaries |
| simple | ["user", "name", "host", "com"] | Non-letters split |
| whitespace | ["user-name@host.com"] | Single token |
| keyword | ["user-name@host.com"] | Entire input |
| english | ["user", "name", "host.com"] | Stemmed words |

Built-in Tokenizers

| Tokenizer | Behavior |
|---|---|
| standard | Unicode word boundaries |
| letter | Splits on non-letters |
| whitespace | Splits on whitespace |
| keyword | No split (entire input) |
| pattern | Regex-based splitting |
| path_hierarchy | Splits file paths: /a/b/c → ["/a", "/a/b", "/a/b/c"] |
| ngram | Character n-grams for autocomplete |
| edge_ngram | Leading-edge n-grams |

Token Filters Reference

| Filter | Effect |
|---|---|
| lowercase | "FOX" → "fox" |
| uppercase | "fox" → "FOX" |
| stop | Remove stop words (the, is, at) |
| stemmer | "running" → "run" |
| synonym | "quick" → ["quick", "fast"] |
| asciifolding | "café" → "cafe" |
| trim | Remove leading/trailing whitespace |
| unique | Deduplicate tokens |
| ngram | Generate n-grams from each token |
| edge_ngram | Generate edge n-grams |
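Filters run in the order listed, and order matters. This request chains lowercase, asciifolding, and unique so three spellings of the same word collapse to one token:

```shell
curl -X POST "http://localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding", "unique"],
  "text": "Café café CAFE"
}'
```

Tokens: `["cafe"]` — lowercase normalizes case, asciifolding strips the accent, unique deduplicates.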

Creating Custom Analyzers

Autocomplete Analyzer

curl -X PUT "http://localhost:9200/search-index" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}'
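A quick sanity check of the index above (the document and query text are illustrative): index one title, then confirm a partial prefix matches through the stored edge n-grams.

```shell
# Index a sample document; refresh=true makes it searchable immediately
curl -X POST "http://localhost:9200/search-index/_doc?refresh=true" \
  -H 'Content-Type: application/json' -d'
{"title": "elasticsearch tutorial"}'

# "elas" matches because indexing produced the edge n-grams el, ela, elas, ...
curl -X GET "http://localhost:9200/search-index/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{"query": {"match": {"title": "elas"}}}'
```

Note the asymmetry: documents are indexed with the n-gram analyzer, but queries use `autocomplete_search`, so the search term is not itself split into n-grams.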

Log Message Analyzer

For analyzing log messages where you want to preserve paths and error codes:

curl -X PUT "http://localhost:9200/logs-analyzed" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "log_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["lowercase", "stop"],
          "char_filter": ["html_strip"]
        }
      },
      "tokenizer": {
        "pattern": {
          "type": "pattern",
          "pattern": "[\\s,;:]+",
          "lowercase": true
        }
      }
    }
  }
}'
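To verify that paths survive analysis, run a sample log line through the analyzer (the log text is illustrative):

```shell
curl -X POST "http://localhost:9200/logs-analyzed/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "log_analyzer",
  "text": "ERROR: connection refused, retrying /api/v1/health"
}'
```

Because the tokenizer splits only on whitespace, commas, semicolons, and colons, `/api/v1/health` comes through as a single token rather than being broken at the slashes.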

Synonym Analyzer

curl -X PUT "http://localhost:9200/products" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "laptop, notebook, portable computer",
            "phone, mobile, smartphone, cell",
            "tv, television, monitor, display"
          ]
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "product_synonyms"]
        }
      }
    }
  }
}'
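Checking the expansion with `_analyze` shows how the filter behaves:

```shell
curl -X POST "http://localhost:9200/products/_analyze?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "product_analyzer",
  "text": "laptop"
}'
```

The response includes `laptop` plus its synonyms at the same position; a multi-word synonym like "portable computer" is emitted as separate tokens.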

Multi-fields: Different Analyzers for the Same Field

Index a field multiple ways for different search strategies (the custom autocomplete analyzer referenced here must also be defined in the index settings, as in the earlier autocomplete example):

curl -X PUT "http://localhost:9200/articles" \
  -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete"
          },
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}'
Now you can search:
  • title — full-text search with standard analyzer
  • title.keyword — exact match, aggregations, sorting
  • title.autocomplete — typeahead suggestions
  • title.english — language-aware search with stemming
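Sub-fields are addressed with dot notation in queries; for example (query text is illustrative):

```shell
# Exact match against the un-analyzed keyword sub-field
curl -X GET "http://localhost:9200/articles/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{"query": {"term": {"title.keyword": "Intro to Elasticsearch"}}}'

# Typeahead-style match against the autocomplete sub-field
curl -X GET "http://localhost:9200/articles/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{"query": {"match": {"title.autocomplete": "intr"}}}'
```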

Field Types and Analysis

| Field Type | Analyzed? | Use Case |
|---|---|---|
| text | Yes | Full-text search |
| keyword | No | Exact match, aggregations, sorting |
| integer, long | No | Numeric range queries |
| date | No | Date range queries |
| boolean | No | Filters |
| ip | No | IP range queries |

Common Errors

Analyzer Not Found

Error:

{
  "type": "mapper_parsing_exception",
  "reason": "analyzer [autocomplete_analyzer] has not been defined in the mapping"
}

Cause: The custom analyzer is referenced in mappings but not defined in settings.

Fix: Define the analyzer in the settings.analysis block of the index:

curl -X PUT "http://localhost:9200/my-index" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_analyzer"
      }
    }
  }
}'

Key rule: Custom analyzers must be defined in settings at index creation time. You cannot add them to an existing index without closing it first.
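A sketch of that close-then-reopen workflow on an existing index (the analyzer definition here is deliberately minimal; while closed, the index rejects reads and writes):

```shell
# 1. Close the index
curl -X POST "http://localhost:9200/my-index/_close"

# 2. Add the analyzer to the closed index's settings
curl -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' -d'
{
  "analysis": {
    "analyzer": {
      "autocomplete_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}'

# 3. Reopen the index
curl -X POST "http://localhost:9200/my-index/_open"
```

Existing documents are not re-analyzed by this change; reindex if already-indexed fields need the new analyzer.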

Lab: Build and Test Analyzers

  1. Create an index with a custom autocomplete analyzer
  2. Test the analyzer with the _analyze API
  3. Index sample documents
  4. Search with partial queries and verify autocomplete behavior
  5. Compare results between standard and english analyzers
  6. Create a synonym analyzer and test synonym expansion
