Module 10: Troubleshooting - Analyzers & Scoring¶
Common Analyzer Issues¶
Issue 1: Analyzer Not Found Error¶
Error Message:
{
"error": {
"code": "InvalidRequestError",
"message": "The analyzer 'custom_analyzer' is not defined in the index."
}
}
Cause: The analyzer is referenced in a field but not defined in the index schema.
Solution:
- Verify the analyzer is defined in the index schema:
{
"analyzers": [
{
"name": "custom_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
]
}
- Ensure analyzer name matches exactly (case-sensitive)
- Check that the index was created/updated with the analyzer definition
Prevention: - Always define analyzers before referencing them in fields - Use consistent naming conventions - Validate JSON schema before deployment
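Beyond these precautions, a quick programmatic preflight can catch a missing definition before it surfaces as a query-time error. The helper below is a minimal sketch that fetches the index definition over REST (the same GET request used for debugging scoring profiles later in this module) and checks its analyzers array; the service name, index name, and key are placeholders:
import requests

def analyzer_is_defined(service_name, admin_key, index_name, analyzer_name):
    """Fetch the index definition and check whether the analyzer is declared."""
    url = f"https://{service_name}.search.windows.net/indexes/{index_name}"
    response = requests.get(
        url,
        headers={'api-key': admin_key},
        params={'api-version': '2024-07-01'}
    )
    response.raise_for_status()
    index_def = response.json()
    defined = {a['name'] for a in index_def.get('analyzers', [])}
    return analyzer_name in defined

# Fail fast before referencing the analyzer in a field definition
if not analyzer_is_defined("my-service", "<admin-key>", "my-index", "custom_analyzer"):
    raise ValueError("custom_analyzer is not defined in the index schema")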
Issue 2: Unexpected Tokenization Results¶
Problem: Analyzer produces unexpected tokens or doesn't process text as expected.
Symptoms: - Search results don't match expectations - Tokens are not what you anticipated - Missing or extra tokens in analysis output
Debugging Steps:
- Use Analyze API to test:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"analyzer": "your_analyzer_name"
}
- Test individual components:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
- Compare with built-in analyzers:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"analyzer": "standard.lucene"
}
Common Causes and Solutions:
| Issue | Cause | Solution |
|---|---|---|
| HTML tags in tokens | Missing HTML strip filter | Add html_strip character filter |
| Uppercase tokens | Missing lowercase filter | Add lowercase token filter |
| Stop words not removed | Missing stop word filter | Add stopwords token filter |
| No stemming | Missing stemmer | Add stemmer token filter |
| Wrong language processing | Incorrect language analyzer | Use appropriate language-specific analyzer |
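When several rows of this table apply to the same field, the corresponding filters can be combined in a single custom analyzer. The definition below is an illustrative sketch (expressed as a Python dict ready to place in an index update body): the name html_text_analyzer is hypothetical, the @odata.type marks it as a custom analyzer, and whether you also need a stemmer or a language-specific analyzer depends on your content:
# Hypothetical analyzer combining fixes from the table above:
# strip HTML, then lowercase and remove stop words.
html_text_analyzer = {
    "name": "html_text_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "standard",
    "charFilters": ["html_strip"],
    "tokenFilters": ["lowercase", "stopwords"]
}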
Issue 3: Performance Problems¶
Symptoms: - Slow indexing performance - High query latency - Memory usage issues - Timeouts during indexing
Diagnostic Steps:
- Measure baseline performance:
import time

def measure_indexing_performance(documents, analyzer_name):
    """Time a document upload and report throughput for the given analyzer."""
    start_time = time.time()
    # Index documents (search_client is an already-configured SearchClient)
    result = search_client.upload_documents(documents)
    end_time = time.time()
    duration = end_time - start_time
    docs_per_second = len(documents) / duration
    print(f"Analyzer: {analyzer_name}")
    print(f"Duration: {duration:.2f}s")
    print(f"Docs/second: {docs_per_second:.2f}")
    return result
- Profile analyzer complexity:
// Simple analyzer (fast)
{
"name": "simple_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
// Complex analyzer (slower)
{
"name": "complex_analyzer",
"tokenizer": "standard",
"charFilters": ["html_strip", "mapping"],
"tokenFilters": [
"lowercase", "stemmer", "stopwords",
"synonym", "phonetic", "ngram"
]
}
Optimization Strategies:
- Simplify analyzers:
- Remove unnecessary token filters
- Use built-in analyzers when possible
- Avoid complex character filters
- Use different analyzers for indexing vs. searching:
{
"name": "content",
"type": "Edm.String",
"indexAnalyzer": "comprehensive_analyzer",
"searchAnalyzer": "simple_analyzer",
"searchable": true
}
- Selective field analysis:
{
"fields": [
{
"name": "title",
"analyzer": "en.microsoft" // Complex for important field
},
{
"name": "metadata",
"analyzer": "keyword" // Simple for exact matching
}
]
}
Issue 4: Character Filter Problems¶
Problem: Character filters not working as expected.
Common Issues:
- HTML not being stripped:
// ❌ Wrong: Missing character filter
{
"name": "html_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
// ✅ Correct: Include HTML strip filter
{
"name": "html_analyzer",
"tokenizer": "standard",
"charFilters": ["html_strip"],
"tokenFilters": ["lowercase"]
}
- Pattern replacement not working:
// ❌ Wrong: Invalid regex pattern
{
"name": "pattern_replace",
"@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
"pattern": "[invalid regex",
"replacement": ""
}
// ✅ Correct: Valid regex pattern
{
"name": "pattern_replace",
"@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
"pattern": "\\d+",
"replacement": "NUMBER"
}
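Because an invalid regular expression is the usual culprit, it is worth compiling the pattern locally before pushing an index update. This is only a rough sanity check: Python's re engine is not identical to the Java-style regex the service uses, but it catches obvious syntax errors such as the unterminated character class above:
import re

def validate_char_filter_pattern(pattern):
    """Compile the regex locally; an invalid pattern raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error as e:
        print(f"Invalid pattern '{pattern}': {e}")
        return False

validate_char_filter_pattern("[invalid regex")  # False - reports the syntax error
validate_char_filter_pattern("\\d+")            # True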
Testing Character Filters:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "<p>Test <b>HTML</b> content</p>",
"charFilters": ["html_strip"],
"tokenizer": "standard"
}
Common Scoring Issues¶
Issue 5: Scoring Profile Not Applied¶
Problem: Search results don't reflect expected scoring profile behavior.
Symptoms: - Results order unchanged when using scoring profile - Expected boosting not visible in scores - Scoring functions seem to have no effect
Debugging Steps:
- Verify scoring profile exists:
GET https://[service].search.windows.net/indexes/[index]?api-version=2024-07-01
api-key: [admin-key]
- Check scoring profile syntax:
{
"scoringProfiles": [
{
"name": "content_boost",
"text": {
"weights": {
"title": 3.0,
"content": 1.0
}
}
}
]
}
- Test with scoring profile parameter:
POST https://[service].search.windows.net/indexes/[index]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]
{
"search": "test query",
"scoringProfile": "content_boost",
"includeTotalResultCount": true
}
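To see whether the profile actually changes ranking, it can help to run the same query with and without it and compare the relevance scores side by side. The sketch below uses the SDK call pattern that appears in the Performance Monitor tool later in this module; search_client, the query text, and the title/id fields are placeholders:
def compare_scoring(search_client, query, profile_name, top=5):
    """Print @search.score for the same query with and without the profile."""
    baseline = search_client.search(search_text=query, top=top)
    boosted = search_client.search(search_text=query, scoring_profile=profile_name, top=top)
    print("Without profile:")
    for doc in baseline:
        print(f"  {doc['@search.score']:.4f}  {doc.get('title', doc.get('id'))}")
    print(f"With '{profile_name}':")
    for doc in boosted:
        print(f"  {doc['@search.score']:.4f}  {doc.get('title', doc.get('id'))}")

compare_scoring(search_client, "test query", "content_boost")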
Common Causes:
| Issue | Cause | Solution |
|---|---|---|
| Profile not applied | Missing scoringProfile parameter | Add parameter to search request |
| Field weights ignored | Field not searchable | Ensure fields have "searchable": true |
| Functions not working | Invalid field references | Verify field names and types |
| No score difference | Insufficient test data | Use diverse test documents |
Issue 6: Scoring Function Errors¶
Error Message:
{
"error": {
"code": "InvalidRequestError",
"message": "The field 'invalidField' referenced in scoring function does not exist."
}
}
Common Function Issues:
- Field reference errors:
// ❌ Wrong: Field doesn't exist
{
"type": "magnitude",
"fieldName": "nonexistent_field",
"boost": 2.0
}
// ✅ Correct: Valid field reference
{
"type": "magnitude",
"fieldName": "rating",
"boost": 2.0,
"magnitude": {
"boostingRangeStart": 1,
"boostingRangeEnd": 5
}
}
- Invalid field types:
// ❌ Wrong: Using string field for magnitude function
{
"type": "magnitude",
"fieldName": "title", // String field
"boost": 2.0
}
// ✅ Correct: Using numeric field
{
"type": "magnitude",
"fieldName": "rating", // Numeric field
"boost": 2.0
}
- Missing required parameters:
// ❌ Wrong: Missing required parameters
{
"type": "freshness",
"fieldName": "publishDate",
"boost": 2.0
}
// ✅ Correct: All required parameters
{
"type": "freshness",
"fieldName": "publishDate",
"boost": 2.0,
"interpolation": "linear",
"freshness": {
"boostingDuration": "P30D"
}
}
Issue 7: Distance Function Problems¶
Problem: Geographic distance scoring not working correctly.
Common Issues:
- Invalid coordinate format:
// ❌ Wrong: Invalid coordinates
{
"location": {
"type": "Point",
"coordinates": [200, 100] // Invalid longitude/latitude
}
}
// ✅ Correct: Valid coordinates [longitude, latitude]
{
"location": {
"type": "Point",
"coordinates": [-122.131577, 47.678581]
}
}
- Missing scoring parameters:
// ❌ Wrong: Missing location parameter
POST /indexes/restaurants/docs/search
{
"search": "pizza",
"scoringProfile": "location_boost"
}
// ✅ Correct: Include location parameter
POST /indexes/restaurants/docs/search
{
"search": "pizza",
"scoringProfile": "location_boost",
"scoringParameters": ["userLocation:-122.133577,47.679581"]
}
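The same request can be issued from Python; the SDK's SearchClient.search accepts scoring_parameters as a list of strings in the same name-values format. The index, profile, and coordinates below are the placeholders from the REST example, and the name field is assumed to exist on the documents:
results = search_client.search(
    search_text="pizza",
    scoring_profile="location_boost",
    scoring_parameters=["userLocation--122.133577,47.679581"]
)
for doc in results:
    print(f"{doc['@search.score']:.4f}  {doc.get('name')}")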
Diagnostic Tools and Techniques¶
Tool 1: Analyze API Testing¶
Comprehensive analyzer testing:
def test_analyzer_comprehensive(service_name, admin_key, index_name, analyzer_name):
    """Comprehensive analyzer testing"""
    test_cases = [
        "Simple text",
        "HTML <b>bold</b> content",
        "Special chars: @#$%^&*()",
        "Numbers: 123 and 456.789",
        "Email: user@example.com",
        "URLs: https://www.example.com",
        "Mixed: The quick brown fox jumps over the lazy dog!",
        ""  # Empty string
    ]
    for text in test_cases:
        result = analyze_text(service_name, admin_key, index_name, text, analyzer_name)
        print(f"Input: '{text}'")
        print(f"Tokens: {[token['token'] for token in result['tokens']]}")
        print("---")

def analyze_text(service_name, admin_key, index_name, text, analyzer_name):
    """Call Analyze API"""
    import requests
    url = f"https://{service_name}.search.windows.net/indexes/{index_name}/analyze"
    headers = {
        'Content-Type': 'application/json',
        'api-key': admin_key
    }
    data = {
        'text': text,
        'analyzer': analyzer_name
    }
    response = requests.post(url, headers=headers, json=data, params={'api-version': '2024-07-01'})
    return response.json()
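For example, to exercise a custom analyzer on a test index (service name, key, and index are placeholders):
test_analyzer_comprehensive(
    service_name="my-service",
    admin_key="<admin-key>",
    index_name="my-index",
    analyzer_name="custom_analyzer"
)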
Tool 2: Scoring Profile Validator¶
def validate_scoring_profile(profile_config, index_schema):
    """Validate scoring profile against index schema"""
    errors = []
    # Check field weights
    if 'text' in profile_config and 'weights' in profile_config['text']:
        for field_name in profile_config['text']['weights']:
            if not is_field_searchable(field_name, index_schema):
                errors.append(f"Field '{field_name}' is not searchable")
    # Check scoring functions
    if 'functions' in profile_config:
        for func in profile_config['functions']:
            field_name = func.get('fieldName')
            func_type = func.get('type')
            if not field_exists(field_name, index_schema):
                errors.append(f"Field '{field_name}' does not exist")
            elif not is_field_compatible(field_name, func_type, index_schema):
                errors.append(f"Field '{field_name}' is not compatible with {func_type} function")
    return errors

def is_field_searchable(field_name, index_schema):
    """Check if field is searchable"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field.get('searchable', False)
    return False

def is_field_compatible(field_name, func_type, index_schema):
    """Check if field type is compatible with function type"""
    field_type = get_field_type(field_name, index_schema)
    compatibility = {
        'magnitude': ['Edm.Double', 'Edm.Int32', 'Edm.Int64'],
        'freshness': ['Edm.DateTimeOffset'],
        'distance': ['Edm.GeographyPoint']
    }
    return field_type in compatibility.get(func_type, [])

# Simple lookups used above
def field_exists(field_name, index_schema):
    """Check if the field exists in the index schema"""
    return any(field['name'] == field_name for field in index_schema['fields'])

def get_field_type(field_name, index_schema):
    """Return the EDM type of a field, or None if it does not exist"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field['type']
    return None
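The validator can be run against the live index definition retrieved with the same GET request shown in Issue 5. The sketch below reuses the content_boost profile from earlier; the service, index, key, and the rating field are placeholders:
import requests

index_schema = requests.get(
    "https://my-service.search.windows.net/indexes/my-index",
    headers={"api-key": "<admin-key>"},
    params={"api-version": "2024-07-01"}
).json()

profile = {
    "name": "content_boost",
    "text": {"weights": {"title": 3.0, "content": 1.0}},
    "functions": [
        {"type": "magnitude", "fieldName": "rating", "boost": 2.0}
    ]
}

for problem in validate_scoring_profile(profile, index_schema):
    print("WARNING:", problem)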
Tool 3: Performance Monitor¶
import time
import statistics
from datetime import datetime

class PerformanceMonitor:
    def __init__(self):
        self.metrics = []

    def measure_operation(self, operation_name, operation_func, *args, **kwargs):
        """Measure operation performance"""
        start_time = time.time()
        start_memory = self.get_memory_usage()
        try:
            result = operation_func(*args, **kwargs)
            success = True
            error = None
        except Exception as e:
            result = None
            success = False
            error = str(e)
        end_time = time.time()
        end_memory = self.get_memory_usage()
        metric = {
            'operation': operation_name,
            'timestamp': datetime.now(),
            'duration': end_time - start_time,
            'memory_delta': end_memory - start_memory,
            'success': success,
            'error': error
        }
        self.metrics.append(metric)
        return result, metric

    def get_memory_usage(self):
        """Get current memory usage (simplified)"""
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / 1024 / 1024  # MB

    def generate_report(self):
        """Generate performance report"""
        if not self.metrics:
            return "No metrics collected"
        operations = {}
        for metric in self.metrics:
            op_name = metric['operation']
            if op_name not in operations:
                operations[op_name] = []
            operations[op_name].append(metric)
        report = []
        for op_name, op_metrics in operations.items():
            durations = [m['duration'] for m in op_metrics if m['success']]
            success_rate = len([m for m in op_metrics if m['success']]) / len(op_metrics)
            if durations:
                report.append(f"{op_name}:")
                report.append(f"  Average duration: {statistics.mean(durations):.3f}s")
                report.append(f"  Min duration: {min(durations):.3f}s")
                report.append(f"  Max duration: {max(durations):.3f}s")
                report.append(f"  Success rate: {success_rate:.1%}")
                report.append("")
        return "\n".join(report)
# Usage example: search_client is an already-configured SearchClient,
# and documents is a list of documents to upload
monitor = PerformanceMonitor()

# Measure indexing performance
result, metric = monitor.measure_operation(
    "document_indexing",
    search_client.upload_documents,
    documents
)

# Measure query performance
result, metric = monitor.measure_operation(
    "search_query",
    search_client.search,
    "test query",
    scoring_profile="content_boost"
)

print(monitor.generate_report())
Prevention Strategies¶
1. Development Best Practices¶
- Test Early and Often: Use Analyze API during development
- Version Control: Track analyzer and scoring profile changes
- Documentation: Document analyzer purposes and expected behavior
- Validation: Implement automated validation for configurations
2. Deployment Checklist¶
- [ ] Analyzer definitions are complete and valid
- [ ] All referenced fields exist and have correct attributes
- [ ] Scoring profiles reference valid fields with appropriate types
- [ ] Performance testing completed with representative data
- [ ] Backup and rollback procedures in place
3. Monitoring Setup¶
# Example monitoring configuration
monitoring_config = {
    'performance_thresholds': {
        'indexing_rate': 100,   # docs per second
        'query_latency': 100,   # milliseconds
        'error_rate': 0.01      # 1%
    },
    'test_queries': [
        'machine learning',
        'data science',
        'artificial intelligence'
    ],
    'alert_conditions': [
        'query_latency > 200ms',
        'error_rate > 5%',
        'indexing_rate < 50 docs/sec'
    ]
}
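A lightweight check built on this configuration might time each test query and flag breaches of the latency threshold. The sketch below reuses the placeholder search_client from the earlier examples and measures latency client-side, so network time is included:
import time

def run_latency_checks(search_client, config):
    """Time each test query and compare against the configured threshold."""
    threshold_ms = config['performance_thresholds']['query_latency']
    for query in config['test_queries']:
        start = time.time()
        # Consume the iterator so the request actually executes
        hits = list(search_client.search(search_text=query, top=10))
        elapsed_ms = (time.time() - start) * 1000
        status = "OK" if elapsed_ms <= threshold_ms else "ALERT"
        print(f"[{status}] '{query}': {elapsed_ms:.0f} ms, {len(hits)} results")

run_latency_checks(search_client, monitoring_config)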
Getting Help¶
1. Azure Support Resources¶
- Azure Portal: Monitor service health and metrics
- Azure Support: Create support tickets for complex issues
- Documentation: Official Azure AI Search documentation
- Community Forums: Stack Overflow, Microsoft Q&A
2. Diagnostic Information to Collect¶
When reporting issues, include:
- Service Details: Service name, tier, region
- Index Schema: Complete index definition
- Analyzer Configuration: Full analyzer and scoring profile definitions
- Sample Data: Representative test documents
- Error Messages: Complete error responses
- Performance Metrics: Timing and throughput measurements
3. Common Support Scenarios¶
| Issue Type | Information Needed | Expected Resolution Time |
|---|---|---|
| Configuration errors | Index schema, error messages | 1-2 business days |
| Performance issues | Metrics, sample data, usage patterns | 3-5 business days |
| Unexpected behavior | Test cases, expected vs. actual results | 2-3 business days |
| Service limits | Usage patterns, scaling requirements | 1-2 business days |
This troubleshooting guide covers the most common issues encountered when working with analyzers and scoring profiles in Azure AI Search. Regular testing and monitoring help prevent many of these issues from occurring in production.