Module 10: Troubleshooting - Analyzers & Scoring

Common Analyzer Issues

Issue 1: Analyzer Not Found Error

Error Message:

{
  "error": {
    "code": "InvalidRequestError",
    "message": "The analyzer 'custom_analyzer' is not defined in the index."
  }
}

Cause: The analyzer is referenced in a field but not defined in the index schema.

Solution:

  1. Verify the analyzer is defined in the index schema:

{
  "analyzers": [
    {
      "name": "custom_analyzer",
      "tokenizer": "standard",
      "tokenFilters": ["lowercase"]
    }
  ]
}
  2. Ensure the analyzer name matches exactly (names are case-sensitive)
  3. Check that the index was created/updated with the analyzer definition

Prevention:

  • Always define analyzers before referencing them in fields
  • Use consistent naming conventions
  • Validate the JSON schema before deployment
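
Because the index definition contains both the analyzer definitions and every field-level reference, a short script can diff the two before deployment. Below is a minimal sketch; the helper name and the built-in analyzer list are our own illustration (and the list is deliberately partial), not an official API.

BUILT_IN_ANALYZERS = {"standard.lucene", "simple", "whitespace", "keyword", "stop", "pattern"}

def check_analyzer_references(index_schema):
    """Return fields that reference analyzers missing from the schema (sketch)."""
    defined = {a["name"] for a in index_schema.get("analyzers", [])}
    problems = []
    for field in index_schema.get("fields", []):
        for key in ("analyzer", "indexAnalyzer", "searchAnalyzer"):
            name = field.get(key)
            # Dotted names (e.g., "en.microsoft", "standard.lucene") are built-in
            # language/Lucene analyzers; the set above covers only a few of the rest.
            if name and name not in defined and name not in BUILT_IN_ANALYZERS and "." not in name:
                problems.append(f"Field '{field['name']}' references undefined analyzer '{name}'")
    return problems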

Issue 2: Unexpected Tokenization Results

Problem: Analyzer produces unexpected tokens or doesn't process text as expected.

Symptoms:

  • Search results don't match expectations
  • Tokens are not what you anticipated
  • Missing or extra tokens in analysis output

Debugging Steps:

  1. Use the Analyze API to test the full analyzer:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "Your test text here",
  "analyzer": "your_analyzer_name"
}
  2. Test individual components:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "Your test text here",
  "tokenizer": "standard",
  "tokenFilters": ["lowercase"]
}
  3. Compare with built-in analyzers:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "Your test text here",
  "analyzer": "standard.lucene"
}

Common Causes and Solutions:

| Issue | Cause | Solution |
| --- | --- | --- |
| HTML tags in tokens | Missing HTML strip filter | Add the html_strip character filter |
| Uppercase tokens | Missing lowercase filter | Add the lowercase token filter |
| Stop words not removed | Missing stop word filter | Add the stopwords token filter |
| No stemming | Missing stemmer | Add a stemmer token filter |
| Wrong language processing | Incorrect language analyzer | Use the appropriate language-specific analyzer |
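
When a custom analyzer misbehaves, it often helps to diff its output against a known-good baseline. The sketch below builds on the Analyze API calls shown above (the helper name is our own); it runs the same text through the custom analyzer and standard.lucene and prints the token differences:

import requests

def diff_analyzers(service, admin_key, index, text, custom, baseline="standard.lucene"):
    """Print token differences between two analyzers for the same input text."""
    url = f"https://{service}.search.windows.net/indexes/{index}/analyze"
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    params = {"api-version": "2024-07-01"}

    def tokens(analyzer):
        response = requests.post(url, headers=headers, params=params,
                                 json={"text": text, "analyzer": analyzer})
        response.raise_for_status()
        return [t["token"] for t in response.json()["tokens"]]

    custom_tokens, baseline_tokens = tokens(custom), tokens(baseline)
    print(f"{custom}: {custom_tokens}")
    print(f"{baseline}: {baseline_tokens}")
    print(f"Only in {custom}: {sorted(set(custom_tokens) - set(baseline_tokens))}")
    print(f"Only in {baseline}: {sorted(set(baseline_tokens) - set(custom_tokens))}")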

Issue 3: Performance Problems

Symptoms:

  • Slow indexing performance
  • High query latency
  • Memory usage issues
  • Timeouts during indexing

Diagnostic Steps:

  1. Measure baseline performance:
import time

def measure_indexing_performance(documents, analyzer_name):
    """Time a bulk upload; assumes a configured SearchClient as `search_client`."""
    start_time = time.time()
    result = search_client.upload_documents(documents)  # index the documents
    duration = time.time() - start_time

    docs_per_second = len(documents) / duration

    print(f"Analyzer: {analyzer_name}")
    print(f"Duration: {duration:.2f}s")
    print(f"Docs/second: {docs_per_second:.2f}")
    return docs_per_second
  2. Profile analyzer complexity:
// Simple analyzer (fast)
{
  "name": "simple_analyzer",
  "tokenizer": "standard",
  "tokenFilters": ["lowercase"]
}

// Complex analyzer (slower)
{
  "name": "complex_analyzer",
  "tokenizer": "standard",
  "charFilters": ["html_strip", "mapping"],
  "tokenFilters": [
    "lowercase", "stemmer", "stopwords", 
    "synonym", "phonetic", "ngram"
  ]
}

Optimization Strategies:

  1. Simplify analyzers:
     • Remove unnecessary token filters
     • Use built-in analyzers when possible
     • Avoid complex character filters

  2. Use different analyzers for indexing vs. searching:

{
  "name": "content",
  "type": "Edm.String",
  "indexAnalyzer": "comprehensive_analyzer",
  "searchAnalyzer": "simple_analyzer",
  "searchable": true
}
  3. Selective field analysis:
{
  "fields": [
    {
      "name": "title",
      "analyzer": "en.microsoft"  // Complex for important field
    },
    {
      "name": "metadata",
      "analyzer": "keyword"  // Simple for exact matching
    }
  ]
}

Issue 4: Character Filter Problems

Problem: Character filters not working as expected.

Common Issues:

  1. HTML not being stripped:
// ❌ Wrong: Missing character filter
{
  "name": "html_analyzer",
  "tokenizer": "standard",
  "tokenFilters": ["lowercase"]
}

// ✅ Correct: Include HTML strip filter
{
  "name": "html_analyzer",
  "tokenizer": "standard",
  "charFilters": ["html_strip"],
  "tokenFilters": ["lowercase"]
}
  2. Pattern replacement not working:
// ❌ Wrong: Invalid regex pattern
{
  "name": "pattern_replace",
  "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
  "pattern": "[invalid regex",
  "replacement": ""
}

// ✅ Correct: Valid regex pattern
{
  "name": "pattern_replace",
  "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
  "pattern": "\\d+",
  "replacement": "NUMBER"
}
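
Invalid patterns like the one above can be caught before deployment with a quick compile check. A rough sketch: Azure AI Search pattern filters use Lucene (Java-style) regular expressions, which differ from Python's re dialect in places, so a pattern passing this check is not guaranteed valid on the service, but obvious breakage such as unbalanced brackets is flagged early.

import re

def precheck_pattern(pattern):
    """Rough pre-check that a regex compiles (Python dialect; see note above)."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        print(f"Pattern '{pattern}' failed to compile: {exc}")
        return False

precheck_pattern("[invalid regex")  # False: unterminated character set
precheck_pattern("\\d+")            # True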

Testing Character Filters:

POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "<p>Test <b>HTML</b> content</p>",
  "charFilters": ["html_strip"],
  "tokenizer": "standard"
}

Common Scoring Issues

Issue 5: Scoring Profile Not Applied

Problem: Search results don't reflect expected scoring profile behavior.

Symptoms:

  • Results order unchanged when using a scoring profile
  • Expected boosting not visible in scores
  • Scoring functions seem to have no effect

Debugging Steps:

  1. Verify the scoring profile exists:
GET https://[service].search.windows.net/indexes/[index]?api-version=2024-07-01
api-key: [admin-key]
  2. Check the scoring profile syntax:
{
  "scoringProfiles": [
    {
      "name": "content_boost",
      "text": {
        "weights": {
          "title": 3.0,
          "content": 1.0
        }
      }
    }
  ]
}
  3. Test with the scoringProfile parameter:
POST https://[service].search.windows.net/indexes/[index]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "test query",
  "scoringProfile": "content_boost",
  "includeTotalResultCount": true
}

Common Causes:

| Issue | Cause | Solution |
| --- | --- | --- |
| Profile not applied | Missing scoringProfile parameter | Add the parameter to the search request |
| Field weights ignored | Field not searchable | Ensure fields have "searchable": true |
| Functions not working | Invalid field references | Verify field names and types |
| No score difference | Insufficient test data | Use diverse test documents |
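
To confirm a profile actually changes ranking, run the same query with and without the scoringProfile parameter and compare the returned scores. A minimal sketch (the helper name is our own):

import requests

def compare_scoring(service, query_key, index, query, profile):
    """Run a query with and without a scoring profile and print the scores."""
    url = f"https://{service}.search.windows.net/indexes/{index}/docs/search"
    headers = {"Content-Type": "application/json", "api-key": query_key}
    params = {"api-version": "2024-07-01"}

    for label, body in [("baseline", {"search": query, "top": 5}),
                        ("profiled", {"search": query, "top": 5, "scoringProfile": profile})]:
        response = requests.post(url, headers=headers, params=params, json=body)
        response.raise_for_status()
        hits = response.json()["value"]
        # "title" is an assumed field name; use one that exists in your index
        print(label, [(hit.get("title"), round(hit["@search.score"], 3)) for hit in hits])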

Issue 6: Scoring Function Errors

Error Message:

{
  "error": {
    "code": "InvalidRequestError",
    "message": "The field 'invalidField' referenced in scoring function does not exist."
  }
}

Common Function Issues:

  1. Field reference errors:
// ❌ Wrong: Field doesn't exist
{
  "type": "magnitude",
  "fieldName": "nonexistent_field",
  "boost": 2.0
}

// ✅ Correct: Valid field reference
{
  "type": "magnitude",
  "fieldName": "rating",
  "boost": 2.0,
  "magnitude": {
    "boostingRangeStart": 1,
    "boostingRangeEnd": 5
  }
}
  2. Invalid field types:
// ❌ Wrong: Using string field for magnitude function
{
  "type": "magnitude",
  "fieldName": "title",  // String field
  "boost": 2.0
}

// ✅ Correct: Using numeric field
{
  "type": "magnitude",
  "fieldName": "rating",  // Numeric field
  "boost": 2.0
}
  3. Missing required parameters:
// ❌ Wrong: Missing required parameters
{
  "type": "freshness",
  "fieldName": "publishDate",
  "boost": 2.0
}

// ✅ Correct: All required parameters
{
  "type": "freshness",
  "fieldName": "publishDate",
  "boost": 2.0,
  "interpolation": "linear",
  "freshness": {
    "boostingDuration": "P30D"
  }
}
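
Since the required parameter object varies by function type, a small table-driven check can catch omissions like the one above before deployment. A sketch; the map below reflects the function types discussed in this module and is not an exhaustive statement of service requirements:

# Parameter object expected for each scoring function type (illustrative).
REQUIRED_PARAMS = {
    "magnitude": "magnitude",   # boostingRangeStart / boostingRangeEnd
    "freshness": "freshness",   # boostingDuration (e.g., "P30D")
    "distance": "distance",     # referencePointParameter / boostingDistance
}

def check_function_parameters(func):
    """Return an error message if a scoring function lacks its parameter object."""
    needed = REQUIRED_PARAMS.get(func.get("type"))
    if needed and needed not in func:
        return f"{func['type']} function on '{func.get('fieldName')}' is missing '{needed}'"
    return None

# Example: flags the broken freshness function shown above
print(check_function_parameters({"type": "freshness", "fieldName": "publishDate", "boost": 2.0}))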

Issue 7: Distance Function Problems

Problem: Geographic distance scoring not working correctly.

Common Issues:

  1. Invalid coordinate format:
// ❌ Wrong: Invalid coordinates
{
  "location": {
    "type": "Point",
    "coordinates": [200, 100]  // Invalid longitude/latitude
  }
}

// ✅ Correct: Valid coordinates [longitude, latitude]
{
  "location": {
    "type": "Point",
    "coordinates": [-122.131577, 47.678581]
  }
}
  2. Missing scoring parameters:
// ❌ Wrong: Missing location parameter
POST /indexes/restaurants/docs/search
{
  "search": "pizza",
  "scoringProfile": "location_boost"
}

// ✅ Correct: Include location parameter
POST /indexes/restaurants/docs/search
{
  "search": "pizza",
  "scoringProfile": "location_boost",
  "scoringParameters": ["userLocation:-122.133577,47.679581"]
}
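
Coordinate mistakes are easy to validate locally: GeoJSON order is [longitude, latitude], with longitude in [-180, 180] and latitude in [-90, 90]. A minimal sketch (the helper name is our own):

def validate_coordinates(lon, lat):
    """Check a GeoJSON [longitude, latitude] pair against valid ranges."""
    errors = []
    if not -180 <= lon <= 180:
        errors.append(f"longitude {lon} outside [-180, 180]")
    if not -90 <= lat <= 90:
        errors.append(f"latitude {lat} outside [-90, 90]")
    return errors

print(validate_coordinates(200, 100))                # both values out of range
print(validate_coordinates(-122.131577, 47.678581))  # [] means valid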

Diagnostic Tools and Techniques

Tool 1: Analyze API Testing

Comprehensive analyzer testing:

def test_analyzer_comprehensive(service_name, admin_key, index_name, analyzer_name):
    """Comprehensive analyzer testing"""

    test_cases = [
        "Simple text",
        "HTML <b>bold</b> content",
        "Special chars: @#$%^&*()",
        "Numbers: 123 and 456.789",
        "Email: user@example.com",
        "URLs: https://www.example.com",
        "Mixed: The quick brown fox jumps over the lazy dog!",
        ""  # Empty string
    ]

    for text in test_cases:
        result = analyze_text(service_name, admin_key, index_name, text, analyzer_name)
        print(f"Input: '{text}'")
        print(f"Tokens: {[token['token'] for token in result['tokens']]}")
        print("---")

def analyze_text(service_name, admin_key, index_name, text, analyzer_name):
    """Call Analyze API"""
    import requests

    url = f"https://{service_name}.search.windows.net/indexes/{index_name}/analyze"
    headers = {
        'Content-Type': 'application/json',
        'api-key': admin_key
    }
    data = {
        'text': text,
        'analyzer': analyzer_name
    }

    response = requests.post(url, headers=headers, json=data, params={'api-version': '2024-07-01'})
    response.raise_for_status()
    return response.json()

Tool 2: Scoring Profile Validator

def validate_scoring_profile(profile_config, index_schema):
    """Validate scoring profile against index schema"""

    errors = []

    # Check field weights (the service may return "text": null, so guard for it)
    if profile_config.get('text') and 'weights' in profile_config['text']:
        for field_name in profile_config['text']['weights']:
            if not is_field_searchable(field_name, index_schema):
                errors.append(f"Field '{field_name}' is not searchable")

    # Check scoring functions ("functions" may also be null)
    if profile_config.get('functions'):
        for func in profile_config['functions']:
            field_name = func.get('fieldName')
            func_type = func.get('type')

            if not field_exists(field_name, index_schema):
                errors.append(f"Field '{field_name}' does not exist")
            elif not is_field_compatible(field_name, func_type, index_schema):
                errors.append(f"Field '{field_name}' is not compatible with {func_type} function")

    return errors

def is_field_searchable(field_name, index_schema):
    """Check if field is searchable"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field.get('searchable', False)
    return False

def is_field_compatible(field_name, func_type, index_schema):
    """Check if field type is compatible with function type"""
    field_type = get_field_type(field_name, index_schema)

    compatibility = {
        'magnitude': ['Edm.Double', 'Edm.Int32', 'Edm.Int64'],
        'freshness': ['Edm.DateTimeOffset'],
        'distance': ['Edm.GeographyPoint']
    }

    return field_type in compatibility.get(func_type, [])

def field_exists(field_name, index_schema):
    """Check if a field is present in the index schema"""
    return any(field['name'] == field_name for field in index_schema['fields'])

def get_field_type(field_name, index_schema):
    """Return the EDM type of a field, or None if it does not exist"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field['type']
    return None
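
Wiring the validator to a live index takes only a schema fetch. A sketch, using the same requests style as Tool 1 (the service name, index name, and key below are placeholders):

import requests

service, index, admin_key = "my-service", "my-index", "<admin-key>"  # placeholders

response = requests.get(
    f"https://{service}.search.windows.net/indexes/{index}",
    headers={"api-key": admin_key},
    params={"api-version": "2024-07-01"},
)
response.raise_for_status()
schema = response.json()

for profile in schema.get("scoringProfiles") or []:
    for error in validate_scoring_profile(profile, schema):
        print(f"{profile['name']}: {error}")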

Tool 3: Performance Monitor

import time
import statistics
from datetime import datetime

class PerformanceMonitor:
    def __init__(self):
        self.metrics = []

    def measure_operation(self, operation_name, operation_func, *args, **kwargs):
        """Measure operation performance"""
        start_time = time.time()
        start_memory = self.get_memory_usage()

        try:
            result = operation_func(*args, **kwargs)
            success = True
            error = None
        except Exception as e:
            result = None
            success = False
            error = str(e)

        end_time = time.time()
        end_memory = self.get_memory_usage()

        metric = {
            'operation': operation_name,
            'timestamp': datetime.now(),
            'duration': end_time - start_time,
            'memory_delta': end_memory - start_memory,
            'success': success,
            'error': error
        }

        self.metrics.append(metric)
        return result, metric

    def get_memory_usage(self):
        """Get current memory usage (simplified)"""
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / 1024 / 1024  # MB

    def generate_report(self):
        """Generate performance report"""
        if not self.metrics:
            return "No metrics collected"

        operations = {}
        for metric in self.metrics:
            op_name = metric['operation']
            if op_name not in operations:
                operations[op_name] = []
            operations[op_name].append(metric)

        report = []
        for op_name, op_metrics in operations.items():
            durations = [m['duration'] for m in op_metrics if m['success']]
            success_rate = len([m for m in op_metrics if m['success']]) / len(op_metrics)

            if durations:
                report.append(f"{op_name}:")
                report.append(f"  Average duration: {statistics.mean(durations):.3f}s")
                report.append(f"  Min duration: {min(durations):.3f}s")
                report.append(f"  Max duration: {max(durations):.3f}s")
                report.append(f"  Success rate: {success_rate:.1%}")
                report.append("")

        return "\n".join(report)

# Usage example (assumes a configured SearchClient as `search_client`
# and `documents`, a list of documents to index)
monitor = PerformanceMonitor()

# Measure indexing performance
result, metric = monitor.measure_operation(
    "document_indexing",
    search_client.upload_documents,
    documents
)

# Measure query performance
result, metric = monitor.measure_operation(
    "search_query",
    search_client.search,
    "test query",
    scoring_profile="content_boost"
)

print(monitor.generate_report())

Prevention Strategies

1. Development Best Practices

  • Test Early and Often: Use Analyze API during development
  • Version Control: Track analyzer and scoring profile changes
  • Documentation: Document analyzer purposes and expected behavior
  • Validation: Implement automated validation for configurations

2. Deployment Checklist

  • [ ] Analyzer definitions are complete and valid
  • [ ] All referenced fields exist and have correct attributes
  • [ ] Scoring profiles reference valid fields with appropriate types
  • [ ] Performance testing completed with representative data
  • [ ] Backup and rollback procedures in place

3. Monitoring Setup

# Example monitoring configuration
monitoring_config = {
    'performance_thresholds': {
        'indexing_rate': 100,  # docs per second
        'query_latency': 100,  # milliseconds
        'error_rate': 0.01     # 1%
    },
    'test_queries': [
        'machine learning',
        'data science',
        'artificial intelligence'
    ],
    'alert_conditions': [
        'query_latency > 200ms',
        'error_rate > 5%',
        'indexing_rate < 50 docs/sec'
    ]
}
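
A sketch of how such a configuration might drive automated checks; the evaluation logic and metric names are our own, and in practice these thresholds would typically feed Azure Monitor alerts or a CI job:

def check_thresholds(measured, config):
    """Compare measured metrics against the configured thresholds (sketch)."""
    alerts = []
    thresholds = config['performance_thresholds']
    if measured['indexing_rate'] < thresholds['indexing_rate']:
        alerts.append(f"indexing rate {measured['indexing_rate']:.0f} docs/sec "
                      f"below {thresholds['indexing_rate']}")
    if measured['query_latency'] > thresholds['query_latency']:
        alerts.append(f"query latency {measured['query_latency']:.0f} ms "
                      f"above {thresholds['query_latency']} ms")
    if measured['error_rate'] > thresholds['error_rate']:
        alerts.append(f"error rate {measured['error_rate']:.1%} "
                      f"above {thresholds['error_rate']:.0%}")
    return alerts

# Hypothetical measured values, e.g., aggregated from the PerformanceMonitor above
print(check_thresholds(
    {'indexing_rate': 80, 'query_latency': 150, 'error_rate': 0.02},
    monitoring_config,
))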

Getting Help

1. Azure Support Resources

  • Azure Portal: Monitor service health and metrics
  • Azure Support: Create support tickets for complex issues
  • Documentation: Official Azure AI Search documentation
  • Community Forums: Stack Overflow, Microsoft Q&A

2. Diagnostic Information to Collect

When reporting issues, include:

  • Service Details: Service name, tier, region
  • Index Schema: Complete index definition
  • Analyzer Configuration: Full analyzer and scoring profile definitions
  • Sample Data: Representative test documents
  • Error Messages: Complete error responses
  • Performance Metrics: Timing and throughput measurements

3. Common Support Scenarios

| Issue Type | Information Needed | Expected Resolution Time |
| --- | --- | --- |
| Configuration errors | Index schema, error messages | 1-2 business days |
| Performance issues | Metrics, sample data, usage patterns | 3-5 business days |
| Unexpected behavior | Test cases, expected vs. actual results | 2-3 business days |
| Service limits | Usage patterns, scaling requirements | 1-2 business days |

This troubleshooting guide covers the most common issues encountered when working with analyzers and scoring profiles in Azure AI Search. Regular testing and monitoring help prevent many of these issues from occurring in production.