Module 10: Analyzers & Scoring
Overview
This module covers advanced text analysis and scoring techniques in Azure AI Search, focusing on how to optimize search relevance through custom analyzers, scoring profiles, and relevance tuning strategies.
Learning Objectives
By the end of this module, you will be able to:
- Understand Text Analysis: Master how Azure AI Search processes and analyzes text during indexing and querying
- Configure Built-in Analyzers: Select and configure appropriate analyzers for different languages and use cases
- Create Custom Analyzers: Build custom text analysis pipelines with tokenizers, filters, and character filters
- Implement Scoring Profiles: Design and implement custom scoring algorithms to improve search relevance
- Optimize Search Relevance: Apply advanced techniques for relevance tuning and result ranking
- Test and Debug Analyzers: Use tools and techniques to test and troubleshoot analyzer configurations
Prerequisites
Before starting this module, ensure you have:
- Completed Module 9: Advanced Querying
- Understanding of search relevance concepts
- Basic knowledge of text processing and linguistics
- Familiarity with JSON configuration syntax
- Access to Azure AI Search service with admin permissions
Module Structure
1. Text Analysis Fundamentals
Understanding the Analysis Process
Text analysis in Azure AI Search occurs during both indexing and querying phases:
Indexing Phase:
1. Character Filtering: Removes or replaces characters (HTML stripping, pattern replacement)
2. Tokenization: Breaks text into individual tokens/terms
3. Token Filtering: Applies filters like lowercase, stemming, stop word removal
4. Storage: Processed tokens are stored in the inverted index
Query Phase:
1. Query Analysis: User queries undergo the same analysis process
2. Term Matching: Analyzed query terms are matched against indexed terms
3. Scoring: Relevance scores are calculated based on term frequency and other factors
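As a concrete illustration, here is how one string might move through an indexing pipeline built from an HTML-stripping character filter, the standard tokenizer, and lowercase plus stemming filters (the exact tokens depend on the analyzer; the Analyze API in section 5 can confirm them):

Input: "<p>Running DOGS!</p>"
1. Character filter (html_strip): "Running DOGS!"
2. Tokenizer (standard): ["Running", "DOGS"]
3. Token filters (lowercase, stemmer): ["run", "dog"]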
Built-in Analyzers
Azure AI Search provides several built-in analyzers:
Language Analyzers:
- en.microsoft - Microsoft English analyzer with advanced linguistic processing
- en.lucene - Apache Lucene English analyzer
- standard.lucene - Language-neutral standard analyzer
- Language-specific analyzers for 50+ languages
Specialized Analyzers:
- keyword - Treats entire input as a single token (exact matching)
- pattern - Uses regular expressions for tokenization
- simple - Divides text at non-letter characters and lowercases the result
- stop - Like simple, with stop word removal added
- whitespace - Splits on whitespace only
Example: Analyzer Comparison
{
"fields": [
{
"name": "title_standard",
"type": "Edm.String",
"analyzer": "standard.lucene",
"searchable": true
},
{
"name": "title_english",
"type": "Edm.String",
"analyzer": "en.microsoft",
"searchable": true
},
{
"name": "title_keyword",
"type": "Edm.String",
"analyzer": "keyword",
"searchable": true
}
]
}
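To get a feel for the difference, consider the input "Wi-Fi Routers" indexed into each field. The output below is illustrative, not authoritative; Microsoft language analyzers apply lemmatization, so verify the exact tokens with the Analyze API from section 5:

"Wi-Fi Routers"
- title_standard (standard.lucene) → wi, fi, routers
- title_english (en.microsoft) → approximately wi, fi, router, routers (both surface and base forms may be emitted)
- title_keyword (keyword) → Wi-Fi Routers (a single exact token)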
2. Custom Analyzer Configuration
Analyzer Components
Custom analyzers consist of three main components:
1. Character Filters (Optional)
- Process text before tokenization
- Common filters: html_strip, mapping, pattern_replace
2. Tokenizer (Required)
- Breaks text into tokens
- Options: standard, keyword, pattern, whitespace, edgeNGram, nGram
3. Token Filters (Optional)
- Process tokens after tokenization
- Common filters: lowercase, stemmer, stopwords, synonym, phonetic
Custom Analyzer Example
{
"analyzers": [
{
"name": "custom_english_analyzer",
"tokenizer": "standard",
"charFilters": ["html_strip"],
"tokenFilters": [
"lowercase",
"english_stop",
"english_stemmer",
"custom_synonyms"
]
}
],
"charFilters": [
{
"name": "html_strip",
"@odata.type": "#Microsoft.Azure.Search.HtmlStripCharFilter"
}
],
"tokenFilters": [
{
"name": "english_stop",
"@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
"stopwords": ["the", "and", "or", "but"]
},
{
"name": "english_stemmer",
"@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
"language": "english"
},
{
"name": "custom_synonyms",
"@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
"synonyms": ["car,automobile,vehicle", "happy,joyful,glad"]
}
]
}
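Once defined in the index, the custom analyzer is assigned to a field by name, exactly like a built-in one (the field name here is illustrative):

{
  "name": "articleBody",
  "type": "Edm.String",
  "analyzer": "custom_english_analyzer",
  "searchable": true
}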
3. Advanced Analyzer Techniques
N-gram Analyzers for Partial Matching
N-gram analyzers enable partial word matching and autocomplete functionality:
{
"analyzers": [
{
"name": "autocomplete_analyzer",
"tokenizer": "autocomplete_tokenizer",
"tokenFilters": ["lowercase"]
}
],
"tokenizers": [
{
"name": "autocomplete_tokenizer",
"@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
"minGram": 2,
"maxGram": 25,
"tokenChars": ["letter", "digit"]
}
]
}
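One caution for typical autocomplete scenarios: apply the edge n-gram analyzer only at indexing time and pair it with a plain analyzer at query time; otherwise the query itself is broken into fragments and matches far too broadly. A sketch of the usual pairing (the field name is illustrative; indexAnalyzer/searchAnalyzer are covered in section 6):

{
  "name": "productName",
  "type": "Edm.String",
  "indexAnalyzer": "autocomplete_analyzer",
  "searchAnalyzer": "standard.lucene",
  "searchable": true
}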
Phonetic Matching
Implement phonetic matching for names and similar-sounding terms:
{
"tokenFilters": [
{
"name": "phonetic_filter",
"@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
"encoder": "soundex"
}
]
}
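With the soundex encoder, similar-sounding names collapse to the same four-character code, so a query for one spelling matches documents containing the others:

"Smith"  → S530
"Smyth"  → S530
"Smithe" → S530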
Pattern-Based Tokenization
Use regular expressions for specialized tokenization:
{
"tokenizers": [
{
"name": "email_tokenizer",
"@odata.type": "#Microsoft.Azure.Search.PatternTokenizer",
"pattern": "\\W+",
"flags": ["CASE_INSENSITIVE"]
}
]
}
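Because the pattern \W+ treats every run of non-word characters as a delimiter, an email address splits into its components:

"john.doe@example.com" → john, doe, example, com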
4. Scoring Profiles and Relevance Tuning
Understanding Default Scoring
Azure AI Search ranks documents with the BM25 similarity algorithm, an evolution of classic TF-IDF (services created before July 2020 may still use the older TF-IDF-based ranker):
Components:
- Term Frequency (TF): How often a term appears in a document, with diminishing returns as the count grows
- Inverse Document Frequency (IDF): How rare a term is across the corpus; rarer terms contribute more
- Field Length Normalization: Matches in shorter fields count for more than matches in long ones
- Tuning Parameters: k1 (term-frequency saturation) and b (strength of length normalization), adjustable per index
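For background, the textbook BM25 formula is shown below; Azure AI Search does not document its exact internal variant, so treat this as a conceptual reference rather than the service's literal implementation:

\[
\text{score}(d,q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
\frac{\mathrm{tf}(t,d)\,(k_1 + 1)}{\mathrm{tf}(t,d) + k_1\left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]

Here tf(t,d) is the term's frequency in the document, |d| is the field length, and avgdl is the average field length across the index. Raising k1 increases the reward for repeated terms; raising b strengthens the length penalty. Both can be tuned through the index's "similarity" setting.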
Scoring Profile Structure
{
"scoringProfiles": [
{
"name": "content_boost_profile",
"text": {
"weights": {
"title": 3.0,
"description": 2.0,
"content": 1.0,
"tags": 1.5
}
},
"functions": [
{
"type": "freshness",
"fieldName": "publishDate",
"boost": 2.0,
"interpolation": "linear",
"freshness": {
"boostingDuration": "P30D"
}
},
{
"type": "magnitude",
"fieldName": "rating",
"boost": 1.5,
"interpolation": "linear",
"magnitude": {
"boostingRangeStart": 1,
"boostingRangeEnd": 5,
"constantBoostBeyondRange": true
}
}
],
"functionAggregation": "sum"
}
]
}
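A scoring profile has no effect until it is referenced at query time (or set as the index's defaultScoringProfile). A minimal search request using the profile above might look like this:

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "relevance tuning",
  "scoringProfile": "content_boost_profile"
}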
Scoring Function Types
1. Freshness Functions
Boost recent content based on date/time fields:
{
"type": "freshness",
"fieldName": "lastModified",
"boost": 2.0,
"interpolation": "linear",
"freshness": {
"boostingDuration": "P7D" // 7 days
}
}
2. Magnitude Functions
Boost based on numeric field values:
{
"type": "magnitude",
"fieldName": "viewCount",
"boost": 1.8,
"interpolation": "logarithmic",
"magnitude": {
"boostingRangeStart": 0,
"boostingRangeEnd": 1000,
"constantBoostBeyondRange": true
}
}
3. Distance Functions
Boost based on geographic proximity:
{
"type": "distance",
"fieldName": "location",
"boost": 2.0,
"interpolation": "linear",
"distance": {
"referencePointParameter": "currentLocation",
"boostingDistance": 10
}
}
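Distance functions need the reference point supplied at query time through scoringParameters; the value is a lon,lat pair bound to the parameter name declared in the profile. A sketch (the profile name below is illustrative):

POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "coffee",
  "scoringProfile": "geo_boost_profile",
  "scoringParameters": ["currentLocation--122.335114,47.605657"]
}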
5. Testing and Debugging Analyzers
Analyze API for Testing
Use the Analyze API to test how text is processed:
POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "The quick brown fox jumps over the lazy dog",
"analyzer": "en.microsoft"
}
Testing Custom Analyzers
POST https://[service-name].search.windows.net/indexes/[index-name]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "HTML <b>bold</b> text with UPPERCASE",
"tokenizer": "standard",
"charFilters": ["html_strip"],
"tokenFilters": ["lowercase"]
}
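The response is a list of tokens. For the request above you would expect the markup stripped and everything lowercased, roughly as follows (offsets refer to positions in the original text and are illustrative):

{
  "tokens": [
    { "token": "html", "startOffset": 0, "endOffset": 4, "position": 0 },
    { "token": "bold", "startOffset": 8, "endOffset": 12, "position": 1 },
    { "token": "text", "startOffset": 17, "endOffset": 21, "position": 2 },
    { "token": "with", "startOffset": 22, "endOffset": 26, "position": 3 },
    { "token": "uppercase", "startOffset": 27, "endOffset": 36, "position": 4 }
  ]
}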
Analyzer Testing Best Practices
- Test with Real Data: Use actual content from your domain
- Test Edge Cases: Empty strings, special characters, very long text
- Compare Analyzers: Test multiple analyzers with the same content
- Validate Token Output: Ensure tokens match your expectations
- Performance Testing: Measure analysis time for large documents
6. Performance Considerations
Analyzer Performance Impact
Indexing Performance:
- Complex analyzers slow down indexing
- Character filters add processing overhead
- Multiple token filters compound processing time
- N-gram analyzers significantly increase index size (see the illustration below)
Query Performance:
- Query-time analysis adds to search latency
- Complex analyzers impact real-time search
- Consider using different analyzers for indexing vs. searching
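The index-size effect is easy to quantify: with the edge n-gram tokenizer from section 3 (minGram 2, maxGram 25), a word of length n produces n − 1 tokens instead of one:

"search"    → se, sea, sear, searc, search                               (5 tokens)
"searching" → se, sea, sear, searc, search, searchi, searchin, searching (8 tokens)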
Optimization Strategies
1. Selective Field Analysis
Only apply complex analyzers to fields that need them:
{
"fields": [
{
"name": "title",
"type": "Edm.String",
"analyzer": "en.microsoft", // Complex analyzer for important field
"searchable": true
},
{
"name": "category",
"type": "Edm.String",
"analyzer": "keyword", // Simple analyzer for exact matching
"searchable": true
}
]
}
2. Separate Index and Search Analyzers
Use different analyzers for indexing and searching:
{
"name": "content",
"type": "Edm.String",
"indexAnalyzer": "complex_indexing_analyzer",
"searchAnalyzer": "simple_search_analyzer",
"searchable": true
}
7. Common Use Cases and Patterns
E-commerce Search Optimization
{
"scoringProfiles": [
{
"name": "ecommerce_boost",
"text": {
"weights": {
"productName": 4.0,
"brand": 3.0,
"description": 1.0,
"category": 2.0
}
},
"functions": [
{
"type": "magnitude",
"fieldName": "rating",
"boost": 2.0,
"interpolation": "linear",
"magnitude": {
"boostingRangeStart": 1,
"boostingRangeEnd": 5
}
},
{
"type": "magnitude",
"fieldName": "salesCount",
"boost": 1.5,
"interpolation": "logarithmic",
"magnitude": {
"boostingRangeStart": 0,
"boostingRangeEnd": 10000
}
}
]
}
]
}
Multi-language Content
{
"analyzers": [
{
"name": "multilingual_analyzer",
"tokenizer": "standard",
"tokenFilters": [
"lowercase",
"asciifolding", // Converts accented characters
"multilingual_stemmer"
]
}
]
}
Technical Documentation Search
{
"analyzers": [
{
"name": "technical_analyzer",
"tokenizer": "standard",
"tokenFilters": [
"lowercase",
"technical_synonyms",
"code_preservation" // Preserve code-like tokens
]
}
]
}
Best Practices
Analyzer Selection Guidelines
- Start Simple: Begin with built-in analyzers before creating custom ones
- Language-Specific: Use language analyzers for single-language content
- Domain-Specific: Create custom analyzers for specialized domains
- Test Thoroughly: Always test analyzers with representative data
- Monitor Performance: Track indexing and query performance impact
Scoring Profile Guidelines
- Field Weights: Assign higher weights to more important fields
- Function Balance: Don't over-boost with too many functions
- Business Logic: Align scoring with business objectives
- A/B Testing: Test different profiles to measure effectiveness
- Regular Review: Update profiles based on user behavior analytics
Common Pitfalls to Avoid
- Over-Engineering: Don't create overly complex analyzers without clear benefits
- Inconsistent Analysis: Ensure index and search analyzers are compatible
- Performance Neglect: Monitor the impact of complex analyzers on performance
- Insufficient Testing: Test analyzers with edge cases and real data
- Static Configuration: Regularly review and update analyzer configurations
Troubleshooting
Common Issues
1. Unexpected Search Results
- Cause: Analyzer mismatch between indexing and searching
- Solution: Use the Analyze API to verify token output
2. Poor Performance
- Cause: Complex analyzers or too many token filters
- Solution: Simplify analyzers or use different analyzers for index vs. search
3. Missing Results
- Cause: Over-aggressive filtering (stop words, stemming)
- Solution: Review filter configurations and test with sample queries
4. Scoring Issues
- Cause: Incorrect scoring profile configuration
- Solution: Test scoring profiles with known data sets
Debugging Tools
- Analyze API: Test text processing
- Search Explorer: Test queries with different scoring profiles
- Query Logs: Analyze search patterns and performance
- Index Statistics: Monitor index size and field usage
Next Steps
After completing this module, you should:
- Practice: Implement custom analyzers for your specific use case
- Experiment: Test different scoring profiles with your data
- Monitor: Set up monitoring for search performance and relevance
- Advance: Proceed to Module 11: Facets & Aggregations
Additional Resources
- Azure AI Search Analyzers Documentation
- Scoring Profiles Documentation
- Text Analysis in Azure AI Search
- Custom Analyzers Tutorial
- Relevance Tuning Guide
This module provides comprehensive coverage of text analysis and scoring in Azure AI Search. The combination of proper analyzer configuration and scoring profiles is crucial for delivering relevant search results that meet user expectations.