Module 10: Analyzers & Scoring - Best Practices
Analyzer Best Practices
1. Analyzer Selection Strategy
Start with Built-in Analyzers
// ✅ Good: Start simple
{
  "name": "title",
  "type": "Edm.String",
  "analyzer": "en.microsoft",
  "searchable": true
}
// ❌ Avoid: Complex custom analyzer as first choice
{
  "name": "title",
  "type": "Edm.String",
  "analyzer": "overly_complex_custom_analyzer",
  "searchable": true
}
Language-Specific Optimization
// ✅ Good: Use appropriate language analyzer
{
  "fields": [
    {
      "name": "title_en",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "title_fr",
      "type": "Edm.String",
      "analyzer": "fr.microsoft",
      "searchable": true
    }
  ]
}
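When an index carries the same text in several languages, the per-language fields can be generated instead of written by hand. The following is a minimal Python sketch under stated assumptions: `language_fields` is a hypothetical helper (not part of any SDK), and it assumes the `<base>_<lang>` field naming and `<lang>.microsoft` analyzer naming used in the example above.

```python
def language_fields(base_name, languages, family="microsoft"):
    """Build one searchable string field per language code.

    Hypothetical helper for illustration; it only assembles dictionaries
    shaped like the field definitions shown above.
    """
    return [
        {
            "name": f"{base_name}_{lang}",
            "type": "Edm.String",
            "analyzer": f"{lang}.{family}",
            "searchable": True,
        }
        for lang in languages
    ]


fields = language_fields("title", ["en", "fr"])
for field in fields:
    print(field["name"], "->", field["analyzer"])
```

Check that each generated analyzer name exists for your service before using it, since not every language code has an analyzer in every family.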
Field-Specific Analyzer Assignment
// ✅ Good: Match analyzer to field purpose
{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft", // Rich analysis for search
      "searchable": true
    },
    {
      "name": "productCode",
      "type": "Edm.String",
      "analyzer": "keyword", // Exact matching for codes
      "searchable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "analyzer": "standard.lucene", // Balanced for content
      "searchable": true
    }
  ]
}
2. Custom Analyzer Design
Keep It Simple
// ✅ Good: Simple, focused custom analyzer
{
  "analyzers": [
    {
      "name": "product_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "product_synonyms"
      ]
    }
  ]
}
// ❌ Avoid: Overly complex analyzer
{
  "analyzers": [
    {
      "name": "complex_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "charFilters": ["html_strip", "mapping", "pattern_replace"],
      "tokenFilters": [
        "lowercase", "stemmer", "stopwords", "synonym",
        "phonetic", "ngram", "shingle", "trim"
      ]
    }
  ]
}
Modular Component Design
// ✅ Good: Reusable components
{
  "tokenFilters": [
    {
      "name": "common_stopwords",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but", "in", "on", "at"]
    },
    {
      "name": "english_stemmer",
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "language": "english"
    }
  ],
  "analyzers": [
    {
      "name": "content_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "common_stopwords", "english_stemmer"]
    },
    {
      "name": "title_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "common_stopwords"] // No stemming for titles
    }
  ]
}
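Reusable components only help if every analyzer refers to a filter that actually exists. A small check like the one below can catch misspelled references before the index definition is submitted. This is a sketch under stated assumptions: `undefined_filters` is a hypothetical helper, and `PREDEFINED_FILTERS` is a deliberately incomplete subset of predefined token filter names that you would extend for your service.

```python
# Incomplete, illustrative subset of predefined token filter names.
PREDEFINED_FILTERS = {"lowercase", "stopwords", "stemmer", "trim", "unique", "asciifolding"}


def undefined_filters(config):
    """Map each analyzer name to the filter names it references but nobody defines."""
    custom = {f["name"] for f in config.get("tokenFilters", [])}
    known = custom | PREDEFINED_FILTERS
    problems = {}
    for analyzer in config.get("analyzers", []):
        missing = [name for name in analyzer.get("tokenFilters", []) if name not in known]
        if missing:
            problems[analyzer["name"]] = missing
    return problems


config = {
    "tokenFilters": [{"name": "common_stopwords"}, {"name": "english_stemmer"}],
    "analyzers": [
        {"name": "content_analyzer",
         "tokenFilters": ["lowercase", "common_stopwords", "english_stemmer"]},
        {"name": "title_analyzer",
         "tokenFilters": ["lowercase", "common_stopword"]},  # deliberate typo
    ],
}
print(undefined_filters(config))
```

Running the check in a pre-deployment step keeps a typo in one analyzer from failing the whole index update.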
3. Performance Optimization
Separate Index and Search Analyzers
// ✅ Good: Different analyzers for indexing vs searching
{
  "name": "content",
  "type": "Edm.String",
  "indexAnalyzer": "comprehensive_indexing_analyzer",
  "searchAnalyzer": "simple_search_analyzer",
  "searchable": true
}
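Split analyzers come with pairing rules that are easy to break by accident. To the best of my knowledge, `indexAnalyzer` and `searchAnalyzer` have to be set together and cannot be combined with `analyzer` on the same field; the sketch below encodes those two rules as assumptions. `analyzer_assignment_errors` is a hypothetical helper, not an SDK function.

```python
def analyzer_assignment_errors(field):
    """Return a list of problems with a field's analyzer properties."""
    errors = []
    has_index = bool(field.get("indexAnalyzer"))
    has_search = bool(field.get("searchAnalyzer"))
    # Assumed rule 1: the two split properties are only valid as a pair.
    if has_index != has_search:
        errors.append("indexAnalyzer and searchAnalyzer must be set together")
    # Assumed rule 2: "analyzer" excludes the split properties.
    if field.get("analyzer") and (has_index or has_search):
        errors.append("analyzer cannot be combined with indexAnalyzer/searchAnalyzer")
    return errors


good = {"name": "content",
        "indexAnalyzer": "comprehensive_indexing_analyzer",
        "searchAnalyzer": "simple_search_analyzer"}
bad = {"name": "content",
       "analyzer": "en.microsoft",
       "indexAnalyzer": "comprehensive_indexing_analyzer"}
print(analyzer_assignment_errors(good))
print(analyzer_assignment_errors(bad))
```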
Selective Complex Analysis
// ✅ Good: Apply complex analyzers selectively
{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft", // Complex for important field
      "searchable": true
    },
    {
      "name": "tags",
      "type": "Collection(Edm.String)",
      "analyzer": "keyword", // Simple for tags
      "searchable": true
    },
    {
      "name": "metadata",
      "type": "Edm.String",
      "analyzer": "simple", // Basic for metadata
      "searchable": true
    }
  ]
}
4. Testing and Validation
Comprehensive Testing Strategy
// Test cases for analyzer validation
// Stems such as "lazi" come from the Porter stemmer in en.lucene, which also
// keeps "over"; en.microsoft lemmatizes instead and would not produce them.
{
  "testCases": [
    {
      "input": "The quick brown fox jumps over the lazy dog",
      "expectedTokens": ["quick", "brown", "fox", "jump", "over", "lazi", "dog"],
      "analyzer": "en.lucene"
    },
    {
      "input": "HTML <b>bold</b> and <i>italic</i> text",
      "expectedTokens": ["html", "bold", "italic", "text"],
      "analyzer": "html_strip_analyzer"
    },
    {
      "input": "user@example.com and admin@test.org",
      "expectedTokens": ["user", "example.com", "admin", "test.org"],
      "analyzer": "email_analyzer"
    }
  ]
}
Edge Case Testing
### Test empty input
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "",
  "analyzer": "en.microsoft"
}

### Test special characters
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "Special chars: @#$%^&*()_+-=[]{}|;':\",./<>?",
  "analyzer": "standard.lucene"
}

### Test very long text
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "[very long text string...]",
  "analyzer": "en.microsoft"
}
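The same edge cases can be prepared from a script so they run with every configuration change. The sketch below only builds the requests with the standard library and does not send them. `build_analyze_request` is a hypothetical helper, and `my-service`, `my-index`, and `<key>` are placeholders standing in for the bracketed values above.

```python
import json
import urllib.request


def build_analyze_request(service, index, text, analyzer, api_key,
                          api_version="2024-07-01"):
    """Prepare (but do not send) a POST to the index's analyze endpoint."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Empty input, special characters, and a long repeated string.
EDGE_CASES = ["", "Special chars: @#$%^&*()_+-=[]{}|;':\",./<>?", "word " * 5000]

prepared = [
    build_analyze_request("my-service", "my-index", text, "standard.lucene", "<key>")
    for text in EDGE_CASES
]
for req in prepared:
    print(req.get_method(), req.full_url, len(req.data), "bytes")
# To send one request: urllib.request.urlopen(req), then parse the JSON body.
```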
Scoring Profile Best Practices
1. Field Weight Strategy
Hierarchical Field Importance
// ✅ Good: Clear hierarchy of field importance
{
  "scoringProfiles": [
    {
      "name": "content_relevance",
      "text": {
        "weights": {
          "title": 5.0,    // Highest importance
          "summary": 3.0,  // High importance
          "content": 1.0,  // Base importance
          "tags": 2.0,     // Medium importance
          "category": 1.5  // Low-medium importance
        }
      }
    }
  ]
}
// ❌ Avoid: Unclear or extreme weights
{
  "text": {
    "weights": {
      "title": 100.0,  // Too extreme
      "content": 1.1,  // Too similar
      "tags": 1.05     // Minimal difference
    }
  }
}
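The "too extreme" and "too similar" judgments above can be turned into a rough automated check. This is a sketch only: `weight_warnings` is a hypothetical helper, and the `max_ratio` and `min_gap` thresholds are illustrative assumptions, not limits imposed by the service.

```python
def weight_warnings(weights, max_ratio=10.0, min_gap=0.10):
    """Flag extreme or near-identical text weights (assumes positive weights)."""
    if not weights:
        return []
    warnings = []
    ordered = sorted(weights.items(), key=lambda item: item[1])
    lowest, highest = ordered[0], ordered[-1]
    # Extreme spread: the largest weight dwarfs the smallest.
    if highest[1] / lowest[1] > max_ratio:
        warnings.append(f"{highest[0]} is more than {max_ratio:g}x {lowest[0]}")
    # Near-duplicates: neighbouring weights differ by less than min_gap.
    for (name_a, w_a), (name_b, w_b) in zip(ordered, ordered[1:]):
        if (w_b - w_a) / w_a < min_gap:
            warnings.append(f"{name_a} and {name_b} differ by less than {min_gap:.0%}")
    return warnings


good = {"title": 5.0, "summary": 3.0, "content": 1.0, "tags": 2.0, "category": 1.5}
bad = {"title": 100.0, "content": 1.1, "tags": 1.05}
print(weight_warnings(good))
print(weight_warnings(bad))
```

The first call returns no warnings; the second flags both the extreme `title` weight and the barely distinguishable `tags` and `content` weights.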
Business Logic Alignment
// ✅ Good: E-commerce scoring aligned with business goals
{
  "scoringProfiles": [
    {
      "name": "ecommerce_relevance",
      "text": {
        "weights": {
          "productName": 4.0,  // Product name most important
          "brand": 3.0,        // Brand recognition
          "description": 1.0,  // Base content
          "category": 2.0      // Navigation aid
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "magnitude": {
            "boostingRangeStart": 3.0,
            "boostingRangeEnd": 5.0
          }
        },
        {
          "type": "magnitude",
          "fieldName": "salesCount",
          "boost": 1.5,
          "magnitude": {
            "boostingRangeStart": 10,
            "boostingRangeEnd": 1000
          }
        }
      ]
    }
  ]
}
2. Scoring Function Design
Balanced Function Usage
// ✅ Good: Balanced combination of functions
{
  "functions": [
    {
      "type": "freshness",
      "fieldName": "publishDate",
      "boost": 1.5, // Moderate boost
      "interpolation": "linear",
      "freshness": {
        "boostingDuration": "P30D" // 30 days
      }
    },
    {
      "type": "magnitude",
      "fieldName": "viewCount",
      "boost": 1.3, // Moderate boost
      "interpolation": "logarithmic",
      "magnitude": {
        "boostingRangeStart": 0,
        "boostingRangeEnd": 10000
      }
    }
  ],
  "functionAggregation": "sum"
}
// ❌ Avoid: Excessive boosting
{
  "functions": [
    {
      "type": "freshness",
      "boost": 10.0, // Too high
      "freshness": {
        "boostingDuration": "P1D" // Too short
      }
    }
  ]
}
Appropriate Interpolation
// ✅ Good: Match interpolation to data distribution
// ("boost" is omitted here for brevity; a real function definition needs one)
{
  "functions": [
    {
      "type": "magnitude",
      "fieldName": "price",
      "interpolation": "linear", // Good for price ranges
      "magnitude": {
        "boostingRangeStart": 10,
        "boostingRangeEnd": 1000
      }
    },
    {
      "type": "magnitude",
      "fieldName": "viewCount",
      "interpolation": "logarithmic", // Good for view counts
      "magnitude": {
        "boostingRangeStart": 1,
        "boostingRangeEnd": 1000000
      }
    }
  ]
}
3. Profile Testing and Validation
A/B Testing Framework
# Example A/B testing approach.
# search_with_profile, precision_at_k, recall_at_k, and ndcg_at_k are helper
# functions you supply; they are not defined in this snippet.
def compare_scoring_profiles(query, profile_a, profile_b, test_documents):
    """Compare two scoring profiles with the same query"""
    results_a = search_with_profile(query, profile_a)
    results_b = search_with_profile(query, profile_b)
    metrics = {
        'profile_a': calculate_relevance_metrics(results_a, test_documents),
        'profile_b': calculate_relevance_metrics(results_b, test_documents)
    }
    return metrics


def calculate_relevance_metrics(results, ground_truth):
    """Calculate precision, recall, and NDCG"""
    return {
        'precision_at_5': precision_at_k(results, ground_truth, 5),
        'recall_at_10': recall_at_k(results, ground_truth, 10),
        'ndcg_at_10': ndcg_at_k(results, ground_truth, 10)
    }
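The metric helpers named in the framework could look like the following. This is one possible sketch, assuming `results` is a ranked list of document ids, `ground_truth` is a set of relevant ids, and relevance is binary (a document is either relevant or not); graded relevance would need a different gain term in NDCG.

```python
import math


def precision_at_k(results, relevant, k):
    """Fraction of the top-k positions filled by relevant documents."""
    return sum(1 for doc in results[:k] if doc in relevant) / k


def recall_at_k(results, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in results[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(results, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(results[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


ranked = ["doc1", "doc7", "doc2", "doc9", "doc3"]
relevant = {"doc1", "doc2", "doc3", "doc4"}
print("precision@5:", precision_at_k(ranked, relevant, 5))
print("recall@10:", recall_at_k(ranked, relevant, 10))
print("ndcg@10:", round(ndcg_at_k(ranked, relevant, 10), 3))
```

NDCG is the only one of the three that rewards placing relevant documents higher, which is why it is useful when two profiles return the same documents in a different order.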
Performance Monitoring
// Monitor scoring profile impact
{
  "monitoringQueries": [
    {
      "query": "machine learning",
      "expectedTopResults": ["doc1", "doc2", "doc3"],
      "scoringProfile": "content_relevance"
    },
    {
      "query": "data science python",
      "expectedTopResults": ["doc4", "doc5", "doc6"],
      "scoringProfile": "technical_relevance"
    }
  ]
}
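A monitoring list like this is only useful if something compares it with live results. The sketch below shows one way to do that, assuming a caller-supplied `run_query(query, scoring_profile)` function that returns ranked document ids. Both helper names and the 0.6 threshold are illustrative assumptions.

```python
def top_result_overlap(actual_ids, expected_ids):
    """Share of expected documents found in the first len(expected_ids) results."""
    if not expected_ids:
        return 1.0
    window = set(actual_ids[:len(expected_ids)])
    return sum(1 for doc in expected_ids if doc in window) / len(expected_ids)


def failing_queries(monitoring_queries, run_query, threshold=0.6):
    """Return (query, overlap) pairs whose overlap falls below the threshold."""
    failures = []
    for case in monitoring_queries:
        actual = run_query(case["query"], case["scoringProfile"])
        overlap = top_result_overlap(actual, case["expectedTopResults"])
        if overlap < threshold:
            failures.append((case["query"], overlap))
    return failures


cases = [
    {"query": "machine learning",
     "expectedTopResults": ["doc1", "doc2", "doc3"],
     "scoringProfile": "content_relevance"},
    {"query": "data science python",
     "expectedTopResults": ["doc4", "doc5", "doc6"],
     "scoringProfile": "technical_relevance"},
]
# Canned results standing in for a real search call.
canned = {"machine learning": ["doc2", "doc1", "doc3"],
          "data science python": ["doc4", "doc9", "doc8"]}
failures = failing_queries(cases, lambda query, profile: canned[query])
print(failures)
```

Order within the window is ignored here, so the first query passes even though its top three documents are shuffled.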
Configuration Management
1. Version Control
Structured Configuration Files
// ✅ Good: Well-organized analyzer configuration
{
  "version": "1.2.0",
  "description": "Product search analyzers for e-commerce",
  "analyzers": [
    {
      "name": "product_name_analyzer",
      "description": "Optimized for product name search",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "product_synonyms"]
    }
  ],
  "tokenFilters": [
    {
      "name": "product_synonyms",
      "description": "Product-specific synonym mappings",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": [
        "laptop,notebook,computer",
        "phone,mobile,smartphone"
      ]
    }
  ]
}
Environment-Specific Configurations
// Development environment
{
  "environment": "development",
  "scoringProfiles": [
    {
      "name": "dev_scoring",
      "text": {
        "weights": {
          "title": 2.0, // Lower weights for testing
          "content": 1.0
        }
      }
    }
  ]
}
// Production environment
{
  "environment": "production",
  "scoringProfiles": [
    {
      "name": "prod_scoring",
      "text": {
        "weights": {
          "title": 4.0, // Optimized weights
          "content": 1.0
        }
      }
    }
  ]
}
2. Documentation Standards
Analyzer Documentation
{
  "analyzers": [
    {
      "name": "technical_content_analyzer",
      "description": "Specialized analyzer for technical documentation",
      "use_cases": [
        "API documentation search",
        "Code snippet analysis",
        "Technical term matching"
      ],
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "technical_stopwords",
        "code_preservation"
      ],
      "performance_notes": "Optimized for technical terms, preserves camelCase",
      "last_updated": "2024-01-15",
      "owner": "search-team"
    }
  ]
}
Scoring Profile Documentation
{
  "scoringProfiles": [
    {
      "name": "content_discovery",
      "description": "Optimizes for content discoverability and freshness",
      "business_logic": "Prioritizes recent, highly-rated content",
      "target_scenarios": [
        "Blog post search",
        "News article discovery",
        "Recent content browsing"
      ],
      "performance_impact": "Low - simple functions only",
      "a_b_test_results": {
        "improvement_in_ctr": "15%",
        "test_period": "2024-01-01 to 2024-01-31"
      }
    }
  ]
}
Monitoring and Maintenance
1. Performance Monitoring
Key Metrics to Track
# Monitoring checklist
monitoring_metrics = {
    'indexing_performance': {
        'documents_per_second': 'target: >100/sec',
        'indexing_latency': 'target: <2s per document',
        'memory_usage': 'target: <80% of allocated'
    },
    'query_performance': {
        'query_latency': 'target: <100ms',
        'throughput': 'target: >50 queries/sec',
        'cache_hit_rate': 'target: >70%'
    },
    'relevance_metrics': {
        'click_through_rate': 'target: >5%',
        'zero_result_rate': 'target: <10%',
        'user_satisfaction': 'target: >4.0/5.0'
    }
}
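Targets written as strings are readable but cannot be checked automatically. One option is to keep a numeric copy of the targets you want to alert on. The sketch below covers four of the checklist targets; the metric key names, the `TARGETS` table, and `missed_targets` are all illustrative assumptions, and rates are expressed as fractions rather than percentages.

```python
import operator

# Numeric form of a few checklist targets: (comparison, threshold).
TARGETS = {
    "query_latency_ms": (operator.lt, 100),
    "queries_per_second": (operator.gt, 50),
    "zero_result_rate": (operator.lt, 0.10),
    "click_through_rate": (operator.gt, 0.05),
}


def missed_targets(measured, targets=TARGETS):
    """Return the sorted names of measured metrics that miss their target."""
    return sorted(
        name for name, value in measured.items()
        if name in targets and not targets[name][0](value, targets[name][1])
    )


sample = {"query_latency_ms": 140, "queries_per_second": 75,
          "zero_result_rate": 0.04, "click_through_rate": 0.03}
print(missed_targets(sample))
```

Metrics without an entry in `TARGETS` are ignored, so the table can grow one target at a time.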
Automated Testing
# analyze_text and assert_tokens_match are helper functions you supply.
# Porter-style stems such as "simpl" are what en.lucene emits.
def validate_analyzer_performance():
    """Automated analyzer validation"""
    test_cases = [
        ("simple text", "en.lucene", ["simpl", "text"]),
        ("HTML <b>content</b>", "html_analyzer", ["html", "content"]),
        ("user@domain.com", "email_analyzer", ["user", "domain.com"])
    ]
    for text, analyzer, expected in test_cases:
        result = analyze_text(text, analyzer)
        assert_tokens_match(result, expected)
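The comparison step in the validation loop could be implemented along these lines. This is a sketch, assuming the analyze response body carries a `tokens` array whose entries each have a `token` string; `tokens_from_response` is a hypothetical helper, and the sample response is hand-written rather than captured from a service.

```python
def tokens_from_response(response):
    """Pull token strings, in order, out of an analyze response body."""
    return [entry["token"] for entry in response.get("tokens", [])]


def assert_tokens_match(actual, expected):
    """Raise AssertionError with a readable summary when token lists differ."""
    if actual != expected:
        missing = [t for t in expected if t not in actual]
        extra = [t for t in actual if t not in expected]
        raise AssertionError(
            f"tokens differ: missing={missing} unexpected={extra} actual={actual}"
        )


# Hand-written sample in the assumed response shape.
sample_response = {
    "tokens": [
        {"token": "html", "startOffset": 0, "endOffset": 4, "position": 0},
        {"token": "content", "startOffset": 8, "endOffset": 15, "position": 1},
    ]
}
assert_tokens_match(tokens_from_response(sample_response), ["html", "content"])
print("tokens match")
```

Comparing whole lists keeps the check sensitive to token order, which matters for phrase queries.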
2. Regular Maintenance
Quarterly Review Checklist
- [ ] Analyzer Performance: Review indexing and query performance metrics
- [ ] Relevance Metrics: Analyze click-through rates and user satisfaction
- [ ] Configuration Updates: Update synonyms, stop words, and scoring weights
- [ ] A/B Testing: Test new analyzer and scoring configurations
- [ ] Documentation: Update configuration documentation and best practices
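Synonym Management
// ✅ Good: Organized synonym management
// This is the source-controlled form. The REST API expects "synonyms" as one
// newline-delimited string, so join the rules and drop the "//" category
// markers, "version", and "last_updated" before uploading.
{
  "synonymMaps": [
    {
      "name": "product_synonyms_v2",
      "format": "solr",
      "synonyms": [
        "// Electronics",
        "laptop,notebook,computer,pc",
        "phone,mobile,smartphone,cell",
        "// Clothing",
        "shirt,top,blouse",
        "pants,trousers,jeans"
      ],
      "version": "2.1",
      "last_updated": "2024-01-15"
    }
  ]
}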
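Grouped synonym rules are pleasant to maintain but usually need flattening before upload. The sketch below assumes the synonym map body takes its rules as a single newline-delimited string in Solr format. `build_synonym_map` is a hypothetical helper, and the category names exist only for maintainers and are dropped from the payload.

```python
def build_synonym_map(name, groups):
    """Flatten {category: [rules]} into a Solr-format synonym map body."""
    rules = [
        rule.strip()
        for group in groups.values()
        for rule in group
        if rule.strip()  # skip blank entries
    ]
    return {"name": name, "format": "solr", "synonyms": "\n".join(rules)}


groups = {
    "Electronics": ["laptop,notebook,computer,pc", "phone,mobile,smartphone,cell"],
    "Clothing": ["shirt,top,blouse", "pants,trousers,jeans"],
}
payload = build_synonym_map("product_synonyms_v2", groups)
print(payload["synonyms"])
```

Keeping version and review metadata in source control rather than in the payload avoids sending properties the service does not expect.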
Common Anti-Patterns to Avoid
1. Analyzer Anti-Patterns
// ❌ Don't: Over-complex analyzer chains
{
  "name": "bad_analyzer",
  "tokenizer": "standard_v2",
  "charFilters": ["html_strip", "mapping", "pattern_replace"],
  "tokenFilters": [
    "lowercase", "stemmer", "stopwords", "synonym",
    "phonetic", "ngram", "shingle", "trim", "unique"
  ]
}
// ❌ Don't: Inconsistent analyzer usage
{
  "fields": [
    {
      "name": "title_en",
      "analyzer": "en.microsoft"
    },
    {
      "name": "title_fr",
      "analyzer": "standard.lucene" // Should use fr.microsoft
    }
  ]
}
2. Scoring Anti-Patterns
// ❌ Don't: Extreme scoring weights
{
  "text": {
    "weights": {
      "title": 1000.0, // Too extreme
      "content": 0.01  // Too low
    }
  }
}
// ❌ Don't: Too many scoring functions
{
  "functions": [
    {"type": "freshness", "boost": 2.0},
    {"type": "magnitude", "boost": 3.0},
    {"type": "distance", "boost": 1.5},
    {"type": "freshness", "boost": 1.8}, // Duplicate types
    {"type": "magnitude", "boost": 2.2}  // Too many functions
  ]
}
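The function anti-patterns above lend themselves to a simple lint step. The following sketch flags crowded profiles, repeated function types, and large boosts. `function_warnings` is a hypothetical helper and both limits are illustrative assumptions rather than service limits; repeated types on different fields can be legitimate, so treat the output as a prompt for review.

```python
from collections import Counter


def function_warnings(functions, max_functions=3, max_boost=5.0):
    """Flag crowded, duplicated, or heavy-handed scoring functions."""
    warnings = []
    if len(functions) > max_functions:
        warnings.append(f"{len(functions)} functions (more than {max_functions})")
    counts = Counter(fn["type"] for fn in functions)
    for fn_type, count in sorted(counts.items()):
        if count > 1:
            warnings.append(f"{count} functions of type {fn_type}")
    for fn in functions:
        if fn.get("boost", 0) > max_boost:
            warnings.append(f"{fn['type']} boost {fn['boost']} exceeds {max_boost}")
    return warnings


crowded = [
    {"type": "freshness", "boost": 2.0},
    {"type": "magnitude", "boost": 3.0},
    {"type": "distance", "boost": 1.5},
    {"type": "freshness", "boost": 1.8},
    {"type": "magnitude", "boost": 2.2},
]
for warning in function_warnings(crowded):
    print(warning)
```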
Following these practices helps keep an Azure AI Search implementation fast, maintainable, and relevant. Validate each analyzer or scoring change against your own queries and metrics before rolling it out.