Module 10: Analyzers & Scoring - Best Practices

Analyzer Best Practices

1. Analyzer Selection Strategy

Start with Built-in Analyzers

// ✅ Good: Start simple
{
  "name": "title",
  "type": "Edm.String",
  "analyzer": "en.microsoft",
  "searchable": true
}

// ❌ Avoid: Complex custom analyzer as first choice
{
  "name": "title",
  "type": "Edm.String",
  "analyzer": "overly_complex_custom_analyzer",
  "searchable": true
}

Language-Specific Optimization

// ✅ Good: Use appropriate language analyzer
{
  "fields": [
    {
      "name": "title_en",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "title_fr",
      "type": "Edm.String",
      "analyzer": "fr.microsoft",
      "searchable": true
    }
  ]
}

Field-Specific Analyzer Assignment

// ✅ Good: Match analyzer to field purpose
{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft",  // Rich analysis for search
      "searchable": true
    },
    {
      "name": "productCode",
      "type": "Edm.String",
      "analyzer": "keyword",  // Exact matching for codes
      "searchable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "analyzer": "standard.lucene",  // Balanced for content
      "searchable": true
    }
  ]
}

2. Custom Analyzer Design

Keep It Simple

// ✅ Good: Simple, focused custom analyzer
{
  "analyzers": [
    {
      "name": "product_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "product_synonyms"
      ]
    }
  ]
}

// ❌ Avoid: Overly complex analyzer
{
  "analyzers": [
    {
      "name": "complex_analyzer",
      "tokenizer": "standard_v2",
      "charFilters": ["html_strip", "mapping", "pattern_replace"],
      "tokenFilters": [
        "lowercase", "stemmer", "stopwords", "synonym", 
        "phonetic", "ngram", "shingle", "trim"
      ]
    }
  ]
}

Modular Component Design

// ✅ Good: Reusable components
{
  "tokenFilters": [
    {
      "name": "common_stopwords",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but", "in", "on", "at"]
    },
    {
      "name": "english_stemmer",
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "language": "english"
    }
  ],
  "analyzers": [
    {
      "name": "content_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "common_stopwords", "english_stemmer"]
    },
    {
      "name": "title_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "common_stopwords"]  // No stemming for titles
    }
  ]
}

3. Performance Optimization

Separate Index and Search Analyzers

// ✅ Good: Different analyzers for indexing vs searching
{
  "name": "content",
  "type": "Edm.String",
  "indexAnalyzer": "comprehensive_indexing_analyzer",
  "searchAnalyzer": "simple_search_analyzer",
  "searchable": true
}

Selective Complex Analysis

// ✅ Good: Apply complex analyzers selectively
{
  "fields": [
    {
      "name": "title",
      "type": "Edm.String",
      "analyzer": "en.microsoft",  // Complex for important field
      "searchable": true
    },
    {
      "name": "tags",
      "type": "Collection(Edm.String)",
      "analyzer": "keyword",  // Simple for tags
      "searchable": true
    },
    {
      "name": "metadata",
      "type": "Edm.String",
      "analyzer": "simple",  // Basic for metadata
      "searchable": true
    }
  ]
}

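4. Testing and Validation

Comprehensive Testing Strategy

// Test cases for analyzer validation. The expectedTokens values are illustrative;
// capture real output from the Analyze API before using them as assertions.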
{
  "testCases": [
    {
      "input": "The quick brown fox jumps over the lazy dog",
      "expectedTokens": ["quick", "brown", "fox", "jump", "lazi", "dog"],
      "analyzer": "en.microsoft"
    },
    {
      "input": "HTML <b>bold</b> and <i>italic</i> text",
      "expectedTokens": ["html", "bold", "italic", "text"],
      "analyzer": "html_strip_analyzer"
    },
    {
      "input": "user@example.com and admin@test.org",
      "expectedTokens": ["user", "example.com", "admin", "test.org"],
      "analyzer": "email_analyzer"
    }
  ]
}

Edge Case Testing

### Test empty input
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "",
  "analyzer": "en.microsoft"
}

### Test special characters
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "Special chars: @#$%^&*()_+-=[]{}|;':\",./<>?",
  "analyzer": "standard.lucene"
}

### Test very long text
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [key]

{
  "text": "[very long text string...]",
  "analyzer": "en.microsoft"
}
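
The edge-case requests above can be generated from one helper so every analyzer is tested with the same inputs. The following is a minimal sketch using only the standard library. The function names, the `EDGE_CASES` inputs (including the stand-in for the long text), and the service, index, and key values are assumptions for illustration, and the request is only built here, not sent. The response shape assumed by `extract_tokens` (a `tokens` array whose items carry a `token` string) should be confirmed against the Analyze API reference.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # same version as the requests above

# Hypothetical edge-case inputs mirroring the three requests above.
EDGE_CASES = {
    "empty": "",
    "special_chars": "Special chars: @#$%^&*()_+-=[]{}|;':\",./<>?",
    "very_long": "lorem ipsum " * 5000,  # stand-in for a long document
}


def build_analyze_request(service, index, api_key, text, analyzer):
    """Build (but do not send) a POST request for the Analyze endpoint."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}/analyze"
        f"?api-version={API_VERSION}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_tokens(response_json):
    """Return the token strings from a parsed Analyze response, in order."""
    return [item["token"] for item in response_json.get("tokens", [])]
```

To send a request, pass the returned object to `urllib.request.urlopen` and parse the body with `json.load` before calling `extract_tokens`.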

Scoring Profile Best Practices

1. Field Weight Strategy

Hierarchical Field Importance

// ✅ Good: Clear hierarchy of field importance
{
  "scoringProfiles": [
    {
      "name": "content_relevance",
      "text": {
        "weights": {
          "title": 5.0,        // Highest importance
          "summary": 3.0,      // High importance
          "content": 1.0,      // Base importance
          "tags": 2.0,         // Medium importance
          "category": 1.5      // Low-medium importance
        }
      }
    }
  ]
}

// ❌ Avoid: Unclear or extreme weights
{
  "text": {
    "weights": {
      "title": 100.0,    // Too extreme
      "content": 1.1,    // Too similar
      "tags": 1.05       // Minimal difference
    }
  }
}
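
The two weight maps above can be told apart mechanically. The sketch below is one possible lint check, not an Azure feature: `lint_field_weights`, `max_ratio`, and `min_gap` are hypothetical names and thresholds chosen for illustration and should be tuned to your own relevance testing.

```python
def lint_field_weights(weights, max_ratio=10.0, min_gap=0.25):
    """Return warnings for weight maps that are extreme or nearly flat.

    max_ratio and min_gap are hypothetical thresholds, not Azure limits.
    """
    warnings = []
    if not weights:
        return warnings
    ordered = sorted(weights.items(), key=lambda item: item[1])
    lowest, highest = ordered[0][1], ordered[-1][1]
    if lowest <= 0:
        # This lint only reasons about positive weights.
        warnings.append("weights must be positive for this check")
        return warnings
    ratio = highest / lowest
    if ratio > max_ratio:
        warnings.append(f"max/min weight ratio {ratio:.1f} exceeds {max_ratio}")
    # Adjacent weights that barely differ rarely change the ranking.
    for (name_a, w_a), (name_b, w_b) in zip(ordered, ordered[1:]):
        if 0 < w_b - w_a < min_gap:
            warnings.append(f"{name_a} and {name_b} differ by only {w_b - w_a:.2f}")
    return warnings
```

With these thresholds the hierarchical map (5.0, 3.0, 2.0, 1.5, 1.0) produces no warnings, while the map with 100.0, 1.1, and 1.05 triggers both the ratio warning and the near-duplicate warning.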

Business Logic Alignment

// ✅ Good: E-commerce scoring aligned with business goals
{
  "scoringProfiles": [
    {
      "name": "ecommerce_relevance",
      "text": {
        "weights": {
          "productName": 4.0,    // Product name most important
          "brand": 3.0,          // Brand recognition
          "description": 1.0,    // Base content
          "category": 2.0        // Navigation aid
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "magnitude": {
            "boostingRangeStart": 3.0,
            "boostingRangeEnd": 5.0
          }
        },
        {
          "type": "magnitude",
          "fieldName": "salesCount",
          "boost": 1.5,
          "magnitude": {
            "boostingRangeStart": 10,
            "boostingRangeEnd": 1000
          }
        }
      ]
    }
  ]
}

2. Scoring Function Design

Balanced Function Usage

// ✅ Good: Balanced combination of functions
{
  "functions": [
    {
      "type": "freshness",
      "fieldName": "publishDate",
      "boost": 1.5,          // Moderate boost
      "interpolation": "linear",
      "freshness": {
        "boostingDuration": "P30D"  // 30 days
      }
    },
    {
      "type": "magnitude",
      "fieldName": "viewCount",
      "boost": 1.3,          // Moderate boost
      "interpolation": "logarithmic",
      "magnitude": {
        "boostingRangeStart": 0,
        "boostingRangeEnd": 10000
      }
    }
  ],
  "functionAggregation": "sum"
}

// ❌ Avoid: Excessive boosting
{
  "functions": [
    {
      "type": "freshness",
      "boost": 10.0,    // Too high
      "freshness": {
        "boostingDuration": "P1D"  // Too short
      }
    }
  ]
}

Appropriate Interpolation

// ✅ Good: Match interpolation to data distribution
{
  "functions": [
    {
      "type": "magnitude",
      "fieldName": "price",
      "boost": 1.5,                  // Required; illustrative value
      "interpolation": "linear",     // Good for price ranges
      "magnitude": {
        "boostingRangeStart": 10,
        "boostingRangeEnd": 1000
      }
    },
    {
      "type": "magnitude",
      "fieldName": "viewCount",
      "boost": 1.3,                   // Required; illustrative value
      "interpolation": "logarithmic", // Good for view counts
      "magnitude": {
        "boostingRangeStart": 1,
        "boostingRangeEnd": 1000000
      }
    }
  ]
}

3. Profile Testing and Validation

A/B Testing Framework

# Example A/B testing approach
def compare_scoring_profiles(query, profile_a, profile_b, test_documents):
    """Compare two scoring profiles with the same query"""

    results_a = search_with_profile(query, profile_a)
    results_b = search_with_profile(query, profile_b)

    metrics = {
        'profile_a': calculate_relevance_metrics(results_a, test_documents),
        'profile_b': calculate_relevance_metrics(results_b, test_documents)
    }

    return metrics

def calculate_relevance_metrics(results, ground_truth):
    """Calculate precision, recall, and NDCG"""
    return {
        'precision_at_5': precision_at_k(results, ground_truth, 5),
        'recall_at_10': recall_at_k(results, ground_truth, 10),
        'ndcg_at_10': ndcg_at_k(results, ground_truth, 10)
    }
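
The helpers `precision_at_k`, `recall_at_k`, and `ndcg_at_k` are referenced above but not defined. A minimal sketch follows, assuming `results` is a list of unique document IDs in rank order and `ground_truth` is a set of relevant IDs (binary relevance). Graded relevance judgments would need a different gain term in the NDCG calculation.

```python
import math


def precision_at_k(results, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(results, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(results, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, doc_id in enumerate(results[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Precision here divides by `k` even when fewer than `k` results come back, which penalizes short result lists. Divide by `len(results[:k])` instead if that is not the behavior you want.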

Performance Monitoring

// Monitor scoring profile impact
{
  "monitoringQueries": [
    {
      "query": "machine learning",
      "expectedTopResults": ["doc1", "doc2", "doc3"],
      "scoringProfile": "content_relevance"
    },
    {
      "query": "data science python",
      "expectedTopResults": ["doc4", "doc5", "doc6"],
      "scoringProfile": "technical_relevance"
    }
  ]
}
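
A monitoring entry like the ones above can be checked against live results with a small comparison function. This is a sketch under assumptions: `check_monitoring_query` and `min_overlap` are hypothetical names, and `actual_ids` is assumed to be the list of document keys returned by a search that used the entry's scoring profile.

```python
def check_monitoring_query(entry, actual_ids, min_overlap=1.0):
    """Compare expected top results with the IDs a live query returned."""
    expected = entry["expectedTopResults"]
    # Only the first len(expected) results count as "top" results.
    top = actual_ids[: len(expected)]
    missing = [doc_id for doc_id in expected if doc_id not in top]
    found = len(expected) - len(missing)
    overlap = found / len(expected) if expected else 1.0
    return {
        "query": entry["query"],
        "scoringProfile": entry["scoringProfile"],
        "overlap": overlap,
        "missing": missing,
        "passed": overlap >= min_overlap,
    }
```

The check ignores order within the top results. Lower `min_overlap` if some churn in the top positions is acceptable.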

Configuration Management

1. Version Control

Structured Configuration Files

// ✅ Good: Well-organized analyzer configuration
{
  "version": "1.2.0",
  "description": "Product search analyzers for e-commerce",
  "analyzers": [
    {
      "name": "product_name_analyzer",
      "description": "Optimized for product name search",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase", "product_synonyms"]
    }
  ],
  "tokenFilters": [
    {
      "name": "product_synonyms",
      "description": "Product-specific synonym mappings",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": [
        "laptop,notebook,computer",
        "phone,mobile,smartphone"
      ]
    }
  ]
}

Environment-Specific Configurations

// Development environment
{
  "environment": "development",
  "scoringProfiles": [
    {
      "name": "dev_scoring",
      "text": {
        "weights": {
          "title": 2.0,    // Lower weights for testing
          "content": 1.0
        }
      }
    }
  ]
}

// Production environment
{
  "environment": "production",
  "scoringProfiles": [
    {
      "name": "prod_scoring",
      "text": {
        "weights": {
          "title": 4.0,    // Optimized weights
          "content": 1.0
        }
      }
    }
  ]
}

2. Documentation Standards

Analyzer Documentation

{
  "analyzers": [
    {
      "name": "technical_content_analyzer",
      "description": "Specialized analyzer for technical documentation",
      "use_cases": [
        "API documentation search",
        "Code snippet analysis",
        "Technical term matching"
      ],
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "technical_stopwords",
        "code_preservation"
      ],
      "performance_notes": "Optimized for technical terms, preserves camelCase",
      "last_updated": "2024-01-15",
      "owner": "search-team"
    }
  ]
}

Scoring Profile Documentation

{
  "scoringProfiles": [
    {
      "name": "content_discovery",
      "description": "Optimizes for content discoverability and freshness",
      "business_logic": "Prioritizes recent, highly-rated content",
      "target_scenarios": [
        "Blog post search",
        "News article discovery",
        "Recent content browsing"
      ],
      "performance_impact": "Low - simple functions only",
      "a_b_test_results": {
        "improvement_in_ctr": "15%",
        "test_period": "2024-01-01 to 2024-01-31"
      }
    }
  ]
}

Monitoring and Maintenance

1. Performance Monitoring

Key Metrics to Track

# Monitoring checklist
monitoring_metrics = {
    'indexing_performance': {
        'documents_per_second': 'target: >100/sec',
        'indexing_latency': 'target: <2s per document',
        'memory_usage': 'target: <80% of allocated'
    },
    'query_performance': {
        'query_latency': 'target: <100ms',
        'throughput': 'target: >50 queries/sec',
        'cache_hit_rate': 'target: >70%'
    },
    'relevance_metrics': {
        'click_through_rate': 'target: >5%',
        'zero_result_rate': 'target: <10%',
        'user_satisfaction': 'target: >4.0/5.0'
    }
}

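Automated Testing

def validate_analyzer_performance():
    """Automated analyzer validation"""
    # Expected tokens are illustrative; record the real Analyze API output
    # for each analyzer before relying on these assertions.
    test_cases = [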
        ("simple text", "en.microsoft", ["simpl", "text"]),
        ("HTML <b>content</b>", "html_analyzer", ["html", "content"]),
        ("user@domain.com", "email_analyzer", ["user", "domain.com"])
    ]

    for text, analyzer, expected in test_cases:
        result = analyze_text(text, analyzer)
        assert_tokens_match(result, expected)
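
The loop above stops at the first mismatch and relies on two undefined helpers. One way to fill them in is sketched below: `assert_tokens_match` compares token lists, and a hypothetical `run_analyzer_cases` collects every failure. The analyze function is passed in as a parameter (an assumption made here so the harness can run against a stub or a real Analyze API client).

```python
def assert_tokens_match(actual, expected):
    """Raise AssertionError with both lists when the tokens differ."""
    if list(actual) != list(expected):
        raise AssertionError(f"expected {list(expected)!r}, got {list(actual)!r}")


def run_analyzer_cases(test_cases, analyze_fn):
    """Run every (text, analyzer, expected) case and collect the failures."""
    failures = []
    for text, analyzer, expected in test_cases:
        try:
            assert_tokens_match(analyze_fn(text, analyzer), expected)
        except AssertionError as exc:
            failures.append((text, analyzer, str(exc)))
    return failures


if __name__ == "__main__":
    # Stub analyzer for a dry run: lowercases and splits on whitespace.
    stub = lambda text, analyzer: text.lower().split()
    cases = [("Simple Text", "stub", ["simple", "text"])]
    print(run_analyzer_cases(cases, stub))
```

Returning the full failure list makes it easier to see whether an analyzer change broke one case or many.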

2. Regular Maintenance

Quarterly Review Checklist

[ ] Analyzer Performance: Review indexing and query performance metrics
[ ] Relevance Metrics: Analyze click-through rates and user satisfaction
[ ] Configuration Updates: Update synonyms, stop words, and scoring weights
[ ] A/B Testing: Test new analyzer and scoring configurations
[ ] Documentation: Update configuration documentation and best practices

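Synonym Management

// ✅ Good: Organized synonym management, kept as a source file in version control.
// The Synonym Map REST API expects "synonyms" as one newline-delimited string, so
// join the rules with "\n" before uploading. The "version" and "last_updated" keys
// are tracking metadata, not synonym map properties.
{
  "synonymMaps": [
    {
      "name": "product_synonyms_v2",
      "format": "solr",
      "synonyms": [
        // Electronics
        "laptop,notebook,computer,pc",
        "phone,mobile,smartphone,cell",
        // Clothing
        "shirt,top,blouse",
        "pants,trousers,jeans"
      ],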
      "version": "2.1",
      "last_updated": "2024-01-15"
    }
  ]
}
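
The service takes a synonym map's `synonyms` as a single string with one rule per line, so a grouped source file like the one above needs flattening before upload. The sketch below shows one way to do that. `build_synonym_map_payload` and the grouped-dictionary input shape are assumptions for illustration, and the function only builds the payload; it does not call the service.

```python
def build_synonym_map_payload(name, grouped_rules):
    """Flatten grouped rules into the newline-delimited form used for upload."""
    lines = []
    for rules in grouped_rules.values():
        for rule in rules:
            rule = rule.strip()
            # Skip blanks and comment-style labels so they never become rules.
            if not rule or rule.startswith(("//", "#")):
                continue
            lines.append(rule)
    return {"name": name, "format": "solr", "synonyms": "\n".join(lines)}


if __name__ == "__main__":
    groups = {
        "Electronics": ["laptop,notebook,computer,pc", "phone,mobile,smartphone,cell"],
        "Clothing": ["shirt,top,blouse", "pants,trousers,jeans"],
    }
    print(build_synonym_map_payload("product_synonyms_v2", groups))
```

Keeping the group names as dictionary keys preserves the organization in source control without sending label strings to the service as rules.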

Common Anti-Patterns to Avoid

1. Analyzer Anti-Patterns

// ❌ Don't: Over-complex analyzer chains
{
  "name": "bad_analyzer",
  "tokenizer": "standard_v2",
  "charFilters": ["html_strip", "mapping", "pattern_replace"],
  "tokenFilters": [
    "lowercase", "stemmer", "stopwords", "synonym", 
    "phonetic", "ngram", "shingle", "trim", "unique"
  ]
}

// ❌ Don't: Inconsistent analyzer usage
{
  "fields": [
    {
      "name": "title_en",
      "analyzer": "en.microsoft"
    },
    {
      "name": "title_fr", 
      "analyzer": "standard.lucene"  // Should use fr.microsoft
    }
  ]
}

2. Scoring Anti-Patterns

// ❌ Don't: Extreme scoring weights
{
  "text": {
    "weights": {
      "title": 1000.0,  // Too extreme
      "content": 0.01   // Too low
    }
  }
}

// ❌ Don't: Too many scoring functions
{
  "functions": [
    {"type": "freshness", "boost": 2.0},
    {"type": "magnitude", "boost": 3.0},
    {"type": "distance", "boost": 1.5},
    {"type": "freshness", "boost": 1.8},  // Duplicate types
    {"type": "magnitude", "boost": 2.2}   // Too many functions
  ]
}
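
A check for the function anti-pattern above can run in CI before a profile is deployed. The sketch below is illustrative only: `lint_scoring_functions` and the `max_functions` threshold are hypothetical. It flags a duplicate only when both the type and the `fieldName` repeat, so two magnitude functions on different fields (as in the e-commerce profile earlier) are not reported.

```python
from collections import Counter


def lint_scoring_functions(functions, max_functions=3):
    """Return warnings for profiles with too many or duplicated functions.

    max_functions is a hypothetical team threshold, not an Azure limit.
    """
    warnings = []
    if len(functions) > max_functions:
        warnings.append(
            f"{len(functions)} functions defined; threshold is {max_functions}"
        )
    # A duplicate means the same function type applied to the same field.
    counts = Counter((f.get("type"), f.get("fieldName")) for f in functions)
    for (func_type, field), count in counts.items():
        if count > 1:
            warnings.append(f"{count} {func_type} functions on field {field!r}")
    return warnings
```

Run against the five abbreviated functions above (which omit `fieldName`), it reports the count warning plus the repeated freshness and magnitude entries.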

Following these practices helps keep an Azure AI Search implementation fast, maintainable, and relevant. Validate each change against your own data and queries before rolling it out.