
Module 10: Analyzers & Scoring

Overview

This module covers advanced text analysis and scoring techniques in Azure AI Search, focusing on how to optimize search relevance through custom analyzers, scoring profiles, and relevance tuning strategies.

Learning Objectives

By the end of this module, you will be able to:

  • Understand Text Analysis: Master how Azure AI Search processes and analyzes text during indexing and querying
  • Configure Built-in Analyzers: Select and configure appropriate analyzers for different languages and use cases
  • Create Custom Analyzers: Build custom text analysis pipelines with tokenizers, filters, and character filters
  • Implement Scoring Profiles: Design and implement custom scoring algorithms to improve search relevance
  • Optimize Search Relevance: Apply advanced techniques for relevance tuning and result ranking
  • Test and Debug Analyzers: Use tools and techniques to test and troubleshoot analyzer configurations

Prerequisites

Before starting this module, ensure you have:

  • Completed Module 9: Advanced Querying
  • Understanding of search relevance concepts
  • Basic knowledge of text processing and linguistics
  • Familiarity with JSON configuration syntax
  • Access to Azure AI Search service with admin permissions

Module Structure

1. Text Analysis Fundamentals

Understanding the Analysis Process

Text analysis in Azure AI Search occurs during both indexing and querying phases:

Indexing Phase:

  1. Character Filtering: Removes or replaces characters (HTML stripping, pattern replacement)
  2. Tokenization: Breaks text into individual tokens/terms
  3. Token Filtering: Applies filters such as lowercasing, stemming, and stop word removal
  4. Storage: Processed tokens are stored in the inverted index

Query Phase:

  1. Query Analysis: User queries undergo the same analysis process
  2. Term Matching: Analyzed query terms are matched against indexed terms
  3. Scoring: Relevance scores are calculated from term frequency and other factors

Built-in Analyzers

Azure AI Search provides several built-in analyzers:

Language Analyzers:

  • en.microsoft - Microsoft English analyzer with advanced linguistic processing
  • en.lucene - Apache Lucene English analyzer
  • standard.lucene - Language-neutral standard analyzer (the default when no analyzer is specified)
  • Language-specific analyzers for 50+ languages

Specialized Analyzers:

  • keyword - Treats the entire input as a single token (exact matching)
  • pattern - Uses regular expressions for tokenization
  • simple - Divides text at non-letter characters and lowercases
  • stop - Like simple, but also removes stop words
  • whitespace - Splits on whitespace only

Example: Analyzer Comparison

{
  "fields": [
    {
      "name": "title_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "title_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "title_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    }
  ]
}

2. Custom Analyzer Configuration

Analyzer Components

Custom analyzers consist of three main components:

1. Character Filters (optional): process text before tokenization. Common filters: html_strip, mapping, pattern_replace

2. Tokenizer (required, exactly one): breaks text into tokens. Options include standard, keyword, pattern, whitespace, edgeNGram, nGram

3. Token Filters (optional): process tokens after tokenization. Common filters: lowercase, stemmer, stopwords, synonym, phonetic

Custom Analyzer Example

{
  "analyzers": [
    {
      "name": "custom_english_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "charFilters": ["html_strip"],  // Predefined char filter; no custom definition required
      "tokenFilters": [
        "lowercase",
        "english_stop",
        "english_stemmer",
        "custom_synonyms"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "html_strip",
      "@odata.type": "#Microsoft.Azure.Search.HtmlStripCharFilter"
    }
  ],
  "tokenFilters": [
    {
      "name": "english_stop",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but"]
    },
    {
      "name": "english_stemmer",
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "language": "english"
    },
    {
      "name": "custom_synonyms",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": ["car,automobile,vehicle", "happy,joyful,glad"]
    }
  ]
}

3. Advanced Analyzer Techniques

N-gram Analyzers for Partial Matching

N-gram analyzers enable partial word matching and autocomplete functionality:

{
  "analyzers": [
    {
      "name": "autocomplete_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "autocomplete_tokenizer",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [
    {
      "name": "autocomplete_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram": 2,
      "maxGram": 25,
      "tokenChars": ["letter", "digit"]
    }
  ]
}
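
To use the analyzer, assign it at indexing time only and pair it with a plain analyzer at query time, so that user input is not itself broken into n-grams. A minimal field sketch (the productName field name is illustrative):

{
  "name": "productName",
  "type": "Edm.String",
  "indexAnalyzer": "autocomplete_analyzer",
  "searchAnalyzer": "standard.lucene",
  "searchable": true
}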

Phonetic Matching

Implement phonetic matching for names and similar-sounding terms:

{
  "tokenFilters": [
    {
      "name": "phonetic_filter",
      "@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
      "encoder": "soundex"
    }
  ]
}
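
The filter only takes effect once an analyzer references it and a field uses that analyzer. A sketch of the wiring (name_phonetic_analyzer is an illustrative name):

{
  "analyzers": [
    {
      "name": "name_phonetic_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": ["lowercase", "phonetic_filter"]
    }
  ]
}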

Pattern-Based Tokenization

Use regular expressions for specialized tokenization. By default the pattern defines the delimiters, so \W+ splits an address such as jane.doe@example.com into the tokens jane, doe, example, and com:

{
  "tokenizers": [
    {
      "name": "email_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "pattern": "\\W+",
      "flags": ["CASE_INSENSITIVE"]
    }
  ]
}

4. Scoring Profiles and Relevance Tuning

Understanding Default Scoring

Azure AI Search uses the BM25 ranking algorithm by default; the older classic similarity is a modified TF-IDF (Term Frequency-Inverse Document Frequency) model. Both rely on the same core signals:

Components:

  • Term Frequency (TF): How often a term appears in a document
  • Inverse Document Frequency (IDF): How rare a term is across the corpus
  • Field Length Normalization: Matches in shorter fields contribute more than matches in long fields
  • Coordinate Factor (classic similarity only): Documents matching more query terms score higher
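
On services where BM25 is active, its two parameters can be tuned in the index definition through the similarity property. A minimal sketch using the documented default values:

{
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": 1.2,  // Term-frequency saturation: higher values let repeated terms keep boosting the score
    "b": 0.75   // Field length normalization strength: 0 disables it entirely
  }
}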

Scoring Profile Structure

{
  "scoringProfiles": [
    {
      "name": "content_boost_profile",
      "text": {
        "weights": {
          "title": 3.0,
          "description": 2.0,
          "content": 1.0,
          "tags": 1.5
        }
      },
      "functions": [
        {
          "type": "freshness",
          "fieldName": "publishDate",
          "boost": 2.0,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P30D"
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 1.5,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": true
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
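
A scoring profile only changes ranking when it is applied, either by setting defaultScoringProfile on the index or by naming it in the query. A search request using the profile above:

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "wireless headphones",
  "scoringProfile": "content_boost_profile"
}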

Scoring Function Types

1. Freshness Functions

Boost recent content based on date/time fields:

{
  "type": "freshness",
  "fieldName": "lastModified",
  "boost": 2.0,
  "interpolation": "linear",
  "freshness": {
    "boostingDuration": "P7D"  // 7 days
  }
}

2. Magnitude Functions

Boost based on numeric field values:

{
  "type": "magnitude",
  "fieldName": "viewCount",
  "boost": 1.8,
  "interpolation": "logarithmic",
  "magnitude": {
    "boostingRangeStart": 0,
    "boostingRangeEnd": 1000,
    "constantBoostBeyondRange": true
  }
}

3. Distance Functions

Boost based on geographic proximity:

{
  "type": "distance",
  "fieldName": "location",
  "boost": 2.0,
  "interpolation": "linear",
  "distance": {
    "referencePointParameter": "currentLocation",
    "boostingDistance": 10
  }
}
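
Distance functions need the reference point supplied at query time through scoringParameters; the point is given as longitude,latitude and boostingDistance is measured in kilometers. A hedged example, assuming the function above lives in a profile named geo_boost_profile:

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "coffee shop",
  "scoringProfile": "geo_boost_profile",
  "scoringParameters": ["currentLocation--122.335,47.608"]
}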

5. Testing and Debugging Analyzers

Analyze API for Testing

Use the Analyze API to test how text is processed:

POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "en.microsoft"
}

Testing Custom Analyzers

POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "HTML <b>bold</b> text with UPPERCASE",
  "tokenizer": "standard",
  "charFilters": ["html_strip"],
  "tokenFilters": ["lowercase"]
}
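
For this input, the markup is stripped and case is normalized, so the response should contain the tokens html, bold, text, with, and uppercase, each with offset and position metadata. An abridged response shape (offsets are illustrative):

{
  "tokens": [
    { "token": "html", "startOffset": 0, "endOffset": 4, "position": 0 },
    { "token": "bold", "startOffset": 8, "endOffset": 12, "position": 1 },
    ...
  ]
}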

Analyzer Testing Best Practices

  1. Test with Real Data: Use actual content from your domain
  2. Test Edge Cases: Empty strings, special characters, very long text
  3. Compare Analyzers: Test multiple analyzers with the same content
  4. Validate Token Output: Ensure tokens match your expectations
  5. Performance Testing: Measure analysis time for large documents

6. Performance Considerations

Analyzer Performance Impact

Indexing Performance:

  • Complex analyzers slow down indexing
  • Character filters add processing overhead
  • Multiple token filters compound processing time
  • N-gram analyzers significantly increase index size

Query Performance:

  • Query-time analysis adds to search latency
  • Complex analyzers impact real-time search responsiveness
  • Consider using different analyzers for indexing and searching

Optimization Strategies

1. Selective Field Analysis

Only apply complex analyzers to fields that need them:

{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft",  // Complex analyzer for important field
      "searchable": true
    },
    {
      "name": "category",
      "type": "Edm.String",
      "analyzer": "keyword",  // Simple analyzer for exact matching
      "searchable": true
    }
  ]
}

2. Separate Index and Search Analyzers

Use different analyzers for indexing and searching:

{
  "name": "content",
  "type": "Edm.String",
  "indexAnalyzer": "complex_indexing_analyzer",
  "searchAnalyzer": "simple_search_analyzer",
  "searchable": true
}

7. Common Use Cases and Patterns

E-commerce Search Optimization

{
  "scoringProfiles": [
    {
      "name": "ecommerce_boost",
      "text": {
        "weights": {
          "productName": 4.0,
          "brand": 3.0,
          "description": 1.0,
          "category": 2.0
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5
          }
        },
        {
          "type": "magnitude",
          "fieldName": "salesCount",
          "boost": 1.5,
          "interpolation": "logarithmic",
          "magnitude": {
            "boostingRangeStart": 0,
            "boostingRangeEnd": 10000
          }
        }
      ]
    }
  ]
}

Multi-language Content

{
  "analyzers": [
    {
      "name": "multilingual_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": [
        "lowercase",
        "asciifolding",  // Converts accented characters to ASCII equivalents
        "multilingual_stemmer"  // Custom stemmer filter, defined separately
      ]
    }
  ]
}
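
A common alternative, or complement, is to keep one field per language, each with its own language analyzer. A sketch with illustrative field names:

{
  "fields": [
    {
      "name": "description_en",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "description_fr",
      "type": "Edm.String",
      "analyzer": "fr.microsoft",
      "searchable": true
    }
  ]
}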

Technical Content Search

Preserve domain terminology and code-like tokens when indexing technical documentation:

{
  "analyzers": [
    {
      "name": "technical_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": [
        "lowercase",
        "technical_synonyms",
        "code_preservation"  // Preserve code-like tokens
      ]
    }
  ]
}

Best Practices

Analyzer Selection Guidelines

  1. Start Simple: Begin with built-in analyzers before creating custom ones
  2. Language-Specific: Use language analyzers for single-language content
  3. Domain-Specific: Create custom analyzers for specialized domains
  4. Test Thoroughly: Always test analyzers with representative data
  5. Monitor Performance: Track indexing and query performance impact

Scoring Profile Guidelines

  1. Field Weights: Assign higher weights to more important fields
  2. Function Balance: Don't over-boost with too many functions
  3. Business Logic: Align scoring with business objectives
  4. A/B Testing: Test different profiles to measure effectiveness
  5. Regular Review: Update profiles based on user behavior analytics

Common Pitfalls to Avoid

  1. Over-Engineering: Don't create overly complex analyzers without clear benefits
  2. Inconsistent Analysis: Ensure index and search analyzers are compatible
  3. Performance Neglect: Monitor the impact of complex analyzers on performance
  4. Insufficient Testing: Test analyzers with edge cases and real data
  5. Static Configuration: Regularly review and update analyzer configurations

Troubleshooting

Common Issues

1. Unexpected Search Results

  • Cause: Analyzer mismatch between indexing and searching
  • Solution: Use the Analyze API to verify token output

2. Poor Performance

  • Cause: Complex analyzers or too many token filters
  • Solution: Simplify analyzers or use different analyzers for index vs. search

3. Missing Results

  • Cause: Over-aggressive filtering (stop words, stemming)
  • Solution: Review filter configurations and test with sample queries

4. Scoring Issues

  • Cause: Incorrect scoring profile configuration
  • Solution: Test scoring profiles with known data sets

Debugging Tools

  1. Analyze API: Test text processing
  2. Search Explorer: Test queries with different scoring profiles
  3. Query Logs: Analyze search patterns and performance
  4. Index Statistics: Monitor index size and field usage (see the example request below)
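
For example, per-index statistics (document count and storage size) are available from the index stats endpoint:

GET https://[service-name].search.windows.net/indexes/[index-name]/stats?api-version=2024-07-01
api-key: [admin-key]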

Next Steps

After completing this module, you should:

  1. Practice: Implement custom analyzers for your specific use case
  2. Experiment: Test different scoring profiles with your data
  3. Monitor: Set up monitoring for search performance and relevance
  4. Advance: Proceed to Module 11: Facets & Aggregations

This module provides comprehensive coverage of text analysis and scoring in Azure AI Search. The combination of proper analyzer configuration and scoring profiles is crucial for delivering relevant search results that meet user expectations.