
Module 10: Analyzers & Scoring

Overview

This module covers advanced text analysis and scoring techniques in Azure AI Search, focusing on how to optimize search relevance through custom analyzers, scoring profiles, and relevance tuning strategies.

Learning Objectives

By the end of this module, you will be able to:

  • Understand Text Analysis: Master how Azure AI Search processes and analyzes text during indexing and querying
  • Configure Built-in Analyzers: Select and configure appropriate analyzers for different languages and use cases
  • Create Custom Analyzers: Build custom text analysis pipelines with tokenizers, filters, and character filters
  • Implement Scoring Profiles: Design and implement custom scoring algorithms to improve search relevance
  • Optimize Search Relevance: Apply advanced techniques for relevance tuning and result ranking
  • Test and Debug Analyzers: Use tools and techniques to test and troubleshoot analyzer configurations

Prerequisites

Before starting this module, ensure you have:

  • Completed Module 9: Advanced Querying
  • Understanding of search relevance concepts
  • Basic knowledge of text processing and linguistics
  • Familiarity with JSON configuration syntax
  • Access to Azure AI Search service with admin permissions

Module Structure

1. Text Analysis Fundamentals

Understanding the Analysis Process

Text analysis in Azure AI Search occurs during both indexing and querying phases:

Indexing Phase:

  1. Character Filtering: Removes or replaces characters (HTML stripping, pattern replacement)
  2. Tokenization: Breaks text into individual tokens/terms
  3. Token Filtering: Applies filters such as lowercasing, stemming, and stop word removal
  4. Storage: Processed tokens are stored in the inverted index

Query Phase:

  1. Query Analysis: User queries undergo the same analysis process
  2. Term Matching: Analyzed query terms are matched against indexed terms
  3. Scoring: Relevance scores are calculated from term frequency and other factors

Built-in Analyzers

Azure AI Search provides several built-in analyzers:

Language Analyzers:

  • en.microsoft - Microsoft English analyzer with advanced linguistic processing
  • en.lucene - Apache Lucene English analyzer
  • standard.lucene - Language-neutral standard analyzer (the default when no analyzer is specified)
  • Language-specific analyzers for 50+ languages

Specialized Analyzers:

  • keyword - Treats the entire input as a single token (exact matching)
  • pattern - Uses regular expressions for tokenization
  • simple - Divides text at non-letter characters and lowercases
  • stop - Like simple, but also removes stop words
  • whitespace - Splits on whitespace only

Example: Analyzer Comparison

{
  "fields": [
    {
      "name": "title_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "title_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "title_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    }
  ]
}

2. Custom Analyzer Configuration

Analyzer Components

Custom analyzers consist of three main components:

1. Character Filters (optional): process text before tokenization. Common filters: html_strip, mapping, pattern_replace

2. Tokenizer (required, exactly one): breaks text into tokens. Options include standard, keyword, pattern, whitespace, edgeNGram, nGram

3. Token Filters (optional): process tokens after tokenization. Common filters: lowercase, stemmer, stopwords, synonym, phonetic

Custom Analyzer Example

{
  "analyzers": [
    {
      "name": "custom_english_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "charFilters": ["html_strip"],  // Predefined char filter; no custom definition required
      "tokenFilters": [
        "lowercase",
        "english_stop",
        "english_stemmer",
        "custom_synonyms"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "html_strip",
      "@odata.type": "#Microsoft.Azure.Search.HtmlStripCharFilter"
    }
  ],
  "tokenFilters": [
    {
      "name": "english_stop",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but"]
    },
    {
      "name": "english_stemmer",
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "language": "english"
    },
    {
      "name": "custom_synonyms",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": ["car,automobile,vehicle", "happy,joyful,glad"]
    }
  ]
}

3. Advanced Analyzer Techniques

N-gram Analyzers for Partial Matching

N-gram analyzers enable partial word matching and autocomplete functionality:

{
  "analyzers": [
    {
      "name": "autocomplete_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "autocomplete_tokenizer",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [
    {
      "name": "autocomplete_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram": 2,
      "maxGram": 25,
      "tokenChars": ["letter", "digit"]
    }
  ]
}
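
To use the analyzer, assign it at indexing time only and pair it with a plain analyzer at query time, so that user input is not itself broken into n-grams. A minimal field sketch (the productName field name is illustrative):

{
  "name": "productName",
  "type": "Edm.String",
  "indexAnalyzer": "autocomplete_analyzer",
  "searchAnalyzer": "standard.lucene",
  "searchable": true
}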

Phonetic Matching

Implement phonetic matching for names and similar-sounding terms:

{
  "tokenFilters": [
    {
      "name": "phonetic_filter",
      "@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
      "encoder": "soundex"
    }
  ]
}
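
The filter only takes effect once an analyzer references it and a field uses that analyzer. A sketch of the wiring (name_phonetic_analyzer is an illustrative name):

{
  "analyzers": [
    {
      "name": "name_phonetic_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": ["lowercase", "phonetic_filter"]
    }
  ]
}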

Pattern-Based Tokenization

Use regular expressions for specialized tokenization. By default the pattern defines the delimiters, so \W+ splits an address such as jane.doe@example.com into the tokens jane, doe, example, and com:

{
  "tokenizers": [
    {
      "name": "email_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
      "pattern": "\\W+",
      "flags": ["CASE_INSENSITIVE"]
    }
  ]
}

4. Scoring Profiles and Relevance Tuning

Understanding Default Scoring

Azure AI Search uses the BM25 ranking algorithm by default; the older classic similarity is a modified TF-IDF (Term Frequency-Inverse Document Frequency) model. Both rely on the same core signals:

Components:

  • Term Frequency (TF): How often a term appears in a document
  • Inverse Document Frequency (IDF): How rare a term is across the corpus
  • Field Length Normalization: Matches in shorter fields contribute more than matches in long fields
  • Coordinate Factor (classic similarity only): Documents matching more query terms score higher
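
On services where BM25 is active, its two parameters can be tuned in the index definition through the similarity property. A minimal sketch using the documented default values:

{
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": 1.2,  // Term-frequency saturation: higher values let repeated terms keep boosting the score
    "b": 0.75   // Field length normalization strength: 0 disables it entirely
  }
}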

Scoring Profile Structure

{
  "scoringProfiles": [
    {
      "name": "content_boost_profile",
      "text": {
        "weights": {
          "title": 3.0,
          "description": 2.0,
          "content": 1.0,
          "tags": 1.5
        }
      },
      "functions": [
        {
          "type": "freshness",
          "fieldName": "publishDate",
          "boost": 2.0,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P30D"
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 1.5,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": true
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
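
A scoring profile only changes ranking when it is applied, either by setting defaultScoringProfile on the index or by naming it in the query. A search request using the profile above:

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "wireless headphones",
  "scoringProfile": "content_boost_profile"
}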

Scoring Function Types

1. Freshness Functions

Boost recent content based on date/time fields:

{
  "type": "freshness",
  "fieldName": "lastModified",
  "boost": 2.0,
  "interpolation": "linear",
  "freshness": {
    "boostingDuration": "P7D"  // 7 days
  }
}

2. Magnitude Functions

Boost based on numeric field values:

{
  "type": "magnitude",
  "fieldName": "viewCount",
  "boost": 1.8,
  "interpolation": "logarithmic",
  "magnitude": {
    "boostingRangeStart": 0,
    "boostingRangeEnd": 1000,
    "constantBoostBeyondRange": true
  }
}

3. Distance Functions

Boost based on geographic proximity:

{
  "type": "distance",
  "fieldName": "location",
  "boost": 2.0,
  "interpolation": "linear",
  "distance": {
    "referencePointParameter": "currentLocation",
    "boostingDistance": 10
  }
}
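
Distance functions need the reference point supplied at query time through scoringParameters; the point is given as longitude,latitude and boostingDistance is measured in kilometers. A hedged example, assuming the function above lives in a profile named geo_boost_profile:

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "coffee shop",
  "scoringProfile": "geo_boost_profile",
  "scoringParameters": ["currentLocation--122.335,47.608"]
}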

5. Testing and Debugging Analyzers

Analyze API for Testing

Use the Analyze API to test how text is processed:

POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "The quick brown fox jumps over the lazy dog",
  "analyzer": "en.microsoft"
}

Testing Custom Analyzers

POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "HTML <b>bold</b> text with UPPERCASE",
  "tokenizer": "standard",
  "charFilters": ["html_strip"],
  "tokenFilters": ["lowercase"]
}
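
For this input, the markup is stripped and case is normalized, so the response should contain the tokens html, bold, text, with, and uppercase, each with offset and position metadata. An abridged response shape (offsets are illustrative):

{
  "tokens": [
    { "token": "html", "startOffset": 0, "endOffset": 4, "position": 0 },
    { "token": "bold", "startOffset": 8, "endOffset": 12, "position": 1 },
    ...
  ]
}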

Analyzer Testing Best Practices

  1. Test with Real Data: Use actual content from your domain
  2. Test Edge Cases: Empty strings, special characters, very long text
  3. Compare Analyzers: Test multiple analyzers with the same content
  4. Validate Token Output: Ensure tokens match your expectations
  5. Performance Testing: Measure analysis time for large documents

6. Performance Considerations

Analyzer Performance Impact

Indexing Performance:

  • Complex analyzers slow down indexing
  • Character filters add processing overhead
  • Multiple token filters compound processing time
  • N-gram analyzers significantly increase index size

Query Performance:

  • Query-time analysis adds to search latency
  • Complex analyzers impact real-time search responsiveness
  • Consider using different analyzers for indexing and searching

Optimization Strategies

1. Selective Field Analysis

Only apply complex analyzers to fields that need them:

{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft",  // Complex analyzer for important field
      "searchable": true
    },
    {
      "name": "category",
      "type": "Edm.String",
      "analyzer": "keyword",  // Simple analyzer for exact matching
      "searchable": true
    }
  ]
}

2. Separate Index and Search Analyzers

Use different analyzers for indexing and searching:

{
  "name": "content",
  "type": "Edm.String",
  "indexAnalyzer": "complex_indexing_analyzer",
  "searchAnalyzer": "simple_search_analyzer",
  "searchable": true
}

7. Common Use Cases and Patterns

E-commerce Search Optimization

{
  "scoringProfiles": [
    {
      "name": "ecommerce_boost",
      "text": {
        "weights": {
          "productName": 4.0,
          "brand": 3.0,
          "description": 1.0,
          "category": 2.0
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5
          }
        },
        {
          "type": "magnitude",
          "fieldName": "salesCount",
          "boost": 1.5,
          "interpolation": "logarithmic",
          "magnitude": {
            "boostingRangeStart": 0,
            "boostingRangeEnd": 10000
          }
        }
      ]
    }
  ]
}

Multi-language Content

{
  "analyzers": [
    {
      "name": "multilingual_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": [
        "lowercase",
        "asciifolding",  // Converts accented characters to ASCII equivalents
        "multilingual_stemmer"  // Custom stemmer filter, defined separately
      ]
    }
  ]
}
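
A common alternative, or complement, is to keep one field per language, each with its own language analyzer. A sketch with illustrative field names:

{
  "fields": [
    {
      "name": "description_en",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "description_fr",
      "type": "Edm.String",
      "analyzer": "fr.microsoft",
      "searchable": true
    }
  ]
}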

Technical Content Search

Preserve domain terminology and code-like tokens when indexing technical documentation:

{
  "analyzers": [
    {
      "name": "technical_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": [
        "lowercase",
        "technical_synonyms",
        "code_preservation"  // Preserve code-like tokens
      ]
    }
  ]
}

Best Practices

Analyzer Selection Guidelines

  1. Start Simple: Begin with built-in analyzers before creating custom ones
  2. Language-Specific: Use language analyzers for single-language content
  3. Domain-Specific: Create custom analyzers for specialized domains
  4. Test Thoroughly: Always test analyzers with representative data
  5. Monitor Performance: Track indexing and query performance impact

Scoring Profile Guidelines

  1. Field Weights: Assign higher weights to more important fields
  2. Function Balance: Don't over-boost with too many functions
  3. Business Logic: Align scoring with business objectives
  4. A/B Testing: Test different profiles to measure effectiveness
  5. Regular Review: Update profiles based on user behavior analytics

Common Pitfalls to Avoid

  1. Over-Engineering: Don't create overly complex analyzers without clear benefits
  2. Inconsistent Analysis: Ensure index and search analyzers are compatible
  3. Performance Neglect: Monitor the impact of complex analyzers on performance
  4. Insufficient Testing: Test analyzers with edge cases and real data
  5. Static Configuration: Regularly review and update analyzer configurations

Troubleshooting

Common Issues

1. Unexpected Search Results

  • Cause: Analyzer mismatch between indexing and searching
  • Solution: Use the Analyze API to verify token output

2. Poor Performance

  • Cause: Complex analyzers or too many token filters
  • Solution: Simplify analyzers or use different analyzers for index vs. search

3. Missing Results

  • Cause: Over-aggressive filtering (stop words, stemming)
  • Solution: Review filter configurations and test with sample queries

4. Scoring Issues

  • Cause: Incorrect scoring profile configuration
  • Solution: Test scoring profiles with known data sets

Debugging Tools

  1. Analyze API: Test text processing
  2. Search Explorer: Test queries with different scoring profiles
  3. Query Logs: Analyze search patterns and performance
  4. Index Statistics: Monitor index size and field usage (see the example request below)
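
For example, per-index statistics (document count and storage size) are available from the index stats endpoint:

GET https://[service-name].search.windows.net/indexes/[index-name]/stats?api-version=2024-07-01
api-key: [admin-key]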

Next Steps

After completing this module, you should:

  1. Practice: Implement custom analyzers for your specific use case
  2. Experiment: Test different scoring profiles with your data
  3. Monitor: Set up monitoring for search performance and relevance
  4. Advance: Proceed to Module 11: Facets & Aggregations

This module provides comprehensive coverage of text analysis and scoring in Azure AI Search. The combination of proper analyzer configuration and scoring profiles is crucial for delivering relevant search results that meet user expectations.