Module 10: Troubleshooting - Analyzers & Scoring¶
Common Analyzer Issues¶
Issue 1: Analyzer Not Found Error¶
Error Message:
{
"error": {
"code": "InvalidRequestError",
"message": "The analyzer 'custom_analyzer' is not defined in the index."
}
}
Cause: The analyzer is referenced in a field but not defined in the index schema.
Solution:
- Verify the analyzer is defined in the index schema:
{
"analyzers": [
{
"name": "custom_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
]
}
- Ensure analyzer name matches exactly (case-sensitive)
- Check that the index was created/updated with the analyzer definition
Prevention: - Always define analyzers before referencing them in fields - Use consistent naming conventions - Validate JSON schema before deployment
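Beyond these precautions, a quick programmatic preflight can catch a missing definition before it surfaces as a query-time error. The helper below is a minimal sketch that fetches the index definition over REST (the same GET request used for debugging scoring profiles later in this module) and checks its analyzers array; the service name, index name, and key are placeholders:
import requests

def analyzer_is_defined(service_name, admin_key, index_name, analyzer_name):
    """Fetch the index definition and check whether the analyzer is declared."""
    url = f"https://{service_name}.search.windows.net/indexes/{index_name}"
    response = requests.get(
        url,
        headers={'api-key': admin_key},
        params={'api-version': '2024-07-01'}
    )
    response.raise_for_status()
    index_def = response.json()
    defined = {a['name'] for a in index_def.get('analyzers', [])}
    return analyzer_name in defined

# Fail fast before referencing the analyzer in a field definition
if not analyzer_is_defined("my-service", "<admin-key>", "my-index", "custom_analyzer"):
    raise ValueError("custom_analyzer is not defined in the index schema")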
Issue 2: Unexpected Tokenization Results¶
Problem: Analyzer produces unexpected tokens or doesn't process text as expected.
Symptoms: - Search results don't match expectations - Tokens are not what you anticipated - Missing or extra tokens in analysis output
Debugging Steps:
- Use Analyze API to test:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"analyzer": "your_analyzer_name"
}
- Test individual components:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
- Compare with built-in analyzers:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "Your test text here",
"analyzer": "standard.lucene"
}
Common Causes and Solutions:
| Issue | Cause | Solution |
|---|---|---|
| HTML tags in tokens | Missing HTML strip filter | Add html_strip character filter |
| Uppercase tokens | Missing lowercase filter | Add lowercase token filter |
| Stop words not removed | Missing stop word filter | Add stopwords token filter |
| No stemming | Missing stemmer | Add stemmer token filter |
| Wrong language processing | Incorrect language analyzer | Use appropriate language-specific analyzer |
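When several rows of this table apply to the same field, the corresponding filters can be combined in a single custom analyzer. The definition below is an illustrative sketch (expressed as a Python dict ready to place in an index update body): the name html_text_analyzer is hypothetical, the @odata.type marks it as a custom analyzer, and whether you also need a stemmer or a language-specific analyzer depends on your content:
# Hypothetical analyzer combining fixes from the table above:
# strip HTML, then lowercase and remove stop words.
html_text_analyzer = {
    "name": "html_text_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "standard",
    "charFilters": ["html_strip"],
    "tokenFilters": ["lowercase", "stopwords"]
}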
Issue 3: Performance Problems¶
Symptoms: - Slow indexing performance - High query latency - Memory usage issues - Timeouts during indexing
Diagnostic Steps:
- Measure baseline performance:
import time

def measure_indexing_performance(documents, analyzer_name):
    """Time a document upload and report throughput for the given analyzer."""
    start_time = time.time()
    # Index documents (search_client is an already-configured SearchClient)
    result = search_client.upload_documents(documents)
    end_time = time.time()
    duration = end_time - start_time
    docs_per_second = len(documents) / duration
    print(f"Analyzer: {analyzer_name}")
    print(f"Duration: {duration:.2f}s")
    print(f"Docs/second: {docs_per_second:.2f}")
    return result
- Profile analyzer complexity:
// Simple analyzer (fast)
{
"name": "simple_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
// Complex analyzer (slower)
{
"name": "complex_analyzer",
"tokenizer": "standard",
"charFilters": ["html_strip", "mapping"],
"tokenFilters": [
"lowercase", "stemmer", "stopwords",
"synonym", "phonetic", "ngram"
]
}
Optimization Strategies:
- Simplify analyzers:
- Remove unnecessary token filters
- Use built-in analyzers when possible
- Avoid complex character filters
- Use different analyzers for indexing vs. searching:
{
"name": "content",
"type": "Edm.String",
"indexAnalyzer": "comprehensive_analyzer",
"searchAnalyzer": "simple_analyzer",
"searchable": true
}
- Selective field analysis:
{
"fields": [
{
"name": "title",
"analyzer": "en.microsoft" // Complex for important field
},
{
"name": "metadata",
"analyzer": "keyword" // Simple for exact matching
}
]
}
Issue 4: Character Filter Problems¶
Problem: Character filters not working as expected.
Common Issues:
- HTML not being stripped:
// ❌ Wrong: Missing character filter
{
"name": "html_analyzer",
"tokenizer": "standard",
"tokenFilters": ["lowercase"]
}
// ✅ Correct: Include HTML strip filter
{
"name": "html_analyzer",
"tokenizer": "standard",
"charFilters": ["html_strip"],
"tokenFilters": ["lowercase"]
}
- Pattern replacement not working:
// ❌ Wrong: Invalid regex pattern
{
"name": "pattern_replace",
"@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
"pattern": "[invalid regex",
"replacement": ""
}
// ✅ Correct: Valid regex pattern
{
"name": "pattern_replace",
"@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
"pattern": "\\d+",
"replacement": "NUMBER"
}
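Because an invalid regular expression is the usual culprit, it is worth compiling the pattern locally before pushing an index update. This is only a rough sanity check: Python's re engine is not identical to the Java-style regex the service uses, but it catches obvious syntax errors such as the unterminated character class above:
import re

def validate_char_filter_pattern(pattern):
    """Compile the regex locally; an invalid pattern raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error as e:
        print(f"Invalid pattern '{pattern}': {e}")
        return False

validate_char_filter_pattern("[invalid regex")  # False - reports the syntax error
validate_char_filter_pattern("\\d+")            # True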
Testing Character Filters:
POST https://[service].search.windows.net/indexes/[index]/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]
{
"text": "<p>Test <b>HTML</b> content</p>",
"charFilters": ["html_strip"],
"tokenizer": "standard"
}
Common Scoring Issues¶
Issue 5: Scoring Profile Not Applied¶
Problem: Search results don't reflect expected scoring profile behavior.
Symptoms: - Results order unchanged when using scoring profile - Expected boosting not visible in scores - Scoring functions seem to have no effect
Debugging Steps:
- Verify scoring profile exists:
GET https://[service].search.windows.net/indexes/[index]?api-version=2024-07-01
api-key: [admin-key]
- Check scoring profile syntax:
{
"scoringProfiles": [
{
"name": "content_boost",
"text": {
"weights": {
"title": 3.0,
"content": 1.0
}
}
}
]
}
- Test with scoring profile parameter:
POST https://[service].search.windows.net/indexes/[index]/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]
{
"search": "test query",
"scoringProfile": "content_boost",
"includeTotalResultCount": true
}
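To see whether the profile actually changes ranking, it can help to run the same query with and without it and compare the relevance scores side by side. The sketch below uses the SDK call pattern that appears in the Performance Monitor tool later in this module; search_client, the query text, and the title/id fields are placeholders:
def compare_scoring(search_client, query, profile_name, top=5):
    """Print @search.score for the same query with and without the profile."""
    baseline = search_client.search(search_text=query, top=top)
    boosted = search_client.search(search_text=query, scoring_profile=profile_name, top=top)
    print("Without profile:")
    for doc in baseline:
        print(f"  {doc['@search.score']:.4f}  {doc.get('title', doc.get('id'))}")
    print(f"With '{profile_name}':")
    for doc in boosted:
        print(f"  {doc['@search.score']:.4f}  {doc.get('title', doc.get('id'))}")

compare_scoring(search_client, "test query", "content_boost")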
Common Causes:
| Issue | Cause | Solution |
|---|---|---|
| Profile not applied | Missing scoringProfile parameter | Add parameter to search request |
| Field weights ignored | Field not searchable | Ensure fields have "searchable": true |
| Functions not working | Invalid field references | Verify field names and types |
| No score difference | Insufficient test data | Use diverse test documents |
Issue 6: Scoring Function Errors¶
Error Message:
{
"error": {
"code": "InvalidRequestError",
"message": "The field 'invalidField' referenced in scoring function does not exist."
}
}
Common Function Issues:
- Field reference errors:
// ❌ Wrong: Field doesn't exist
{
"type": "magnitude",
"fieldName": "nonexistent_field",
"boost": 2.0
}
// ✅ Correct: Valid field reference
{
"type": "magnitude",
"fieldName": "rating",
"boost": 2.0,
"magnitude": {
"boostingRangeStart": 1,
"boostingRangeEnd": 5
}
}
- Invalid field types:
// ❌ Wrong: Using string field for magnitude function
{
"type": "magnitude",
"fieldName": "title", // String field
"boost": 2.0
}
// ✅ Correct: Using numeric field
{
"type": "magnitude",
"fieldName": "rating", // Numeric field
"boost": 2.0
}
- Missing required parameters:
// ❌ Wrong: Missing required parameters
{
"type": "freshness",
"fieldName": "publishDate",
"boost": 2.0
}
// ✅ Correct: All required parameters
{
"type": "freshness",
"fieldName": "publishDate",
"boost": 2.0,
"interpolation": "linear",
"freshness": {
"boostingDuration": "P30D"
}
}
Issue 7: Distance Function Problems¶
Problem: Geographic distance scoring not working correctly.
Common Issues:
- Invalid coordinate format:
// ❌ Wrong: Invalid coordinates
{
"location": {
"type": "Point",
"coordinates": [200, 100] // Invalid longitude/latitude
}
}
// ✅ Correct: Valid coordinates [longitude, latitude]
{
"location": {
"type": "Point",
"coordinates": [-122.131577, 47.678581]
}
}
- Missing scoring parameters:
// ❌ Wrong: Missing location parameter
POST /indexes/restaurants/docs/search
{
"search": "pizza",
"scoringProfile": "location_boost"
}
// ✅ Correct: Include location parameter
POST /indexes/restaurants/docs/search
{
"search": "pizza",
"scoringProfile": "location_boost",
"scoringParameters": ["userLocation:-122.133577,47.679581"]
}
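The same request can be issued from Python; the SDK's SearchClient.search accepts scoring_parameters as a list of strings in the same name-values format. The index, profile, and coordinates below are the placeholders from the REST example, and the name field is assumed to exist on the documents:
results = search_client.search(
    search_text="pizza",
    scoring_profile="location_boost",
    scoring_parameters=["userLocation--122.133577,47.679581"]
)
for doc in results:
    print(f"{doc['@search.score']:.4f}  {doc.get('name')}")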
Diagnostic Tools and Techniques¶
Tool 1: Analyze API Testing¶
Comprehensive analyzer testing:
def test_analyzer_comprehensive(service_name, admin_key, index_name, analyzer_name):
    """Comprehensive analyzer testing"""
    test_cases = [
        "Simple text",
        "HTML <b>bold</b> content",
        "Special chars: @#$%^&*()",
        "Numbers: 123 and 456.789",
        "Email: user@example.com",
        "URLs: https://www.example.com",
        "Mixed: The quick brown fox jumps over the lazy dog!",
        ""  # Empty string
    ]
    for text in test_cases:
        result = analyze_text(service_name, admin_key, index_name, text, analyzer_name)
        print(f"Input: '{text}'")
        print(f"Tokens: {[token['token'] for token in result['tokens']]}")
        print("---")

def analyze_text(service_name, admin_key, index_name, text, analyzer_name):
    """Call Analyze API"""
    import requests
    url = f"https://{service_name}.search.windows.net/indexes/{index_name}/analyze"
    headers = {
        'Content-Type': 'application/json',
        'api-key': admin_key
    }
    data = {
        'text': text,
        'analyzer': analyzer_name
    }
    response = requests.post(url, headers=headers, json=data, params={'api-version': '2024-07-01'})
    return response.json()
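For example, to exercise a custom analyzer on a test index (service name, key, and index are placeholders):
test_analyzer_comprehensive(
    service_name="my-service",
    admin_key="<admin-key>",
    index_name="my-index",
    analyzer_name="custom_analyzer"
)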
Tool 2: Scoring Profile Validator¶
def validate_scoring_profile(profile_config, index_schema):
    """Validate scoring profile against index schema"""
    errors = []
    # Check field weights
    if 'text' in profile_config and 'weights' in profile_config['text']:
        for field_name in profile_config['text']['weights']:
            if not is_field_searchable(field_name, index_schema):
                errors.append(f"Field '{field_name}' is not searchable")
    # Check scoring functions
    if 'functions' in profile_config:
        for func in profile_config['functions']:
            field_name = func.get('fieldName')
            func_type = func.get('type')
            if not field_exists(field_name, index_schema):
                errors.append(f"Field '{field_name}' does not exist")
            elif not is_field_compatible(field_name, func_type, index_schema):
                errors.append(f"Field '{field_name}' is not compatible with {func_type} function")
    return errors

def is_field_searchable(field_name, index_schema):
    """Check if field is searchable"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field.get('searchable', False)
    return False

def is_field_compatible(field_name, func_type, index_schema):
    """Check if field type is compatible with function type"""
    field_type = get_field_type(field_name, index_schema)
    compatibility = {
        'magnitude': ['Edm.Double', 'Edm.Int32', 'Edm.Int64'],
        'freshness': ['Edm.DateTimeOffset'],
        'distance': ['Edm.GeographyPoint']
    }
    return field_type in compatibility.get(func_type, [])

# Simple lookups used above
def field_exists(field_name, index_schema):
    """Check if the field exists in the index schema"""
    return any(field['name'] == field_name for field in index_schema['fields'])

def get_field_type(field_name, index_schema):
    """Return the EDM type of a field, or None if it does not exist"""
    for field in index_schema['fields']:
        if field['name'] == field_name:
            return field['type']
    return None
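The validator can be run against the live index definition retrieved with the same GET request shown in Issue 5. The sketch below reuses the content_boost profile from earlier; the service, index, key, and the rating field are placeholders:
import requests

index_schema = requests.get(
    "https://my-service.search.windows.net/indexes/my-index",
    headers={"api-key": "<admin-key>"},
    params={"api-version": "2024-07-01"}
).json()

profile = {
    "name": "content_boost",
    "text": {"weights": {"title": 3.0, "content": 1.0}},
    "functions": [
        {"type": "magnitude", "fieldName": "rating", "boost": 2.0}
    ]
}

for problem in validate_scoring_profile(profile, index_schema):
    print("WARNING:", problem)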
Tool 3: Performance Monitor¶
import time
import statistics
from datetime import datetime

class PerformanceMonitor:
    def __init__(self):
        self.metrics = []

    def measure_operation(self, operation_name, operation_func, *args, **kwargs):
        """Measure operation performance"""
        start_time = time.time()
        start_memory = self.get_memory_usage()
        try:
            result = operation_func(*args, **kwargs)
            success = True
            error = None
        except Exception as e:
            result = None
            success = False
            error = str(e)
        end_time = time.time()
        end_memory = self.get_memory_usage()
        metric = {
            'operation': operation_name,
            'timestamp': datetime.now(),
            'duration': end_time - start_time,
            'memory_delta': end_memory - start_memory,
            'success': success,
            'error': error
        }
        self.metrics.append(metric)
        return result, metric

    def get_memory_usage(self):
        """Get current memory usage (simplified)"""
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / 1024 / 1024  # MB

    def generate_report(self):
        """Generate performance report"""
        if not self.metrics:
            return "No metrics collected"
        operations = {}
        for metric in self.metrics:
            op_name = metric['operation']
            if op_name not in operations:
                operations[op_name] = []
            operations[op_name].append(metric)
        report = []
        for op_name, op_metrics in operations.items():
            durations = [m['duration'] for m in op_metrics if m['success']]
            success_rate = len([m for m in op_metrics if m['success']]) / len(op_metrics)
            if durations:
                report.append(f"{op_name}:")
                report.append(f"  Average duration: {statistics.mean(durations):.3f}s")
                report.append(f"  Min duration: {min(durations):.3f}s")
                report.append(f"  Max duration: {max(durations):.3f}s")
                report.append(f"  Success rate: {success_rate:.1%}")
                report.append("")
        return "\n".join(report)
# Usage example: search_client is an already-configured SearchClient,
# and documents is a list of documents to upload
monitor = PerformanceMonitor()

# Measure indexing performance
result, metric = monitor.measure_operation(
    "document_indexing",
    search_client.upload_documents,
    documents
)

# Measure query performance
result, metric = monitor.measure_operation(
    "search_query",
    search_client.search,
    "test query",
    scoring_profile="content_boost"
)

print(monitor.generate_report())
Prevention Strategies¶
1. Development Best Practices¶
- Test Early and Often: Use Analyze API during development
- Version Control: Track analyzer and scoring profile changes
- Documentation: Document analyzer purposes and expected behavior
- Validation: Implement automated validation for configurations
2. Deployment Checklist¶
- [ ] Analyzer definitions are complete and valid
- [ ] All referenced fields exist and have correct attributes
- [ ] Scoring profiles reference valid fields with appropriate types
- [ ] Performance testing completed with representative data
- [ ] Backup and rollback procedures in place
3. Monitoring Setup¶
# Example monitoring configuration
monitoring_config = {
    'performance_thresholds': {
        'indexing_rate': 100,   # docs per second
        'query_latency': 100,   # milliseconds
        'error_rate': 0.01      # 1%
    },
    'test_queries': [
        'machine learning',
        'data science',
        'artificial intelligence'
    ],
    'alert_conditions': [
        'query_latency > 200ms',
        'error_rate > 5%',
        'indexing_rate < 50 docs/sec'
    ]
}
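A lightweight check built on this configuration might time each test query and flag breaches of the latency threshold. The sketch below reuses the placeholder search_client from the earlier examples and measures latency client-side, so network time is included:
import time

def run_latency_checks(search_client, config):
    """Time each test query and compare against the configured threshold."""
    threshold_ms = config['performance_thresholds']['query_latency']
    for query in config['test_queries']:
        start = time.time()
        # Consume the iterator so the request actually executes
        hits = list(search_client.search(search_text=query, top=10))
        elapsed_ms = (time.time() - start) * 1000
        status = "OK" if elapsed_ms <= threshold_ms else "ALERT"
        print(f"[{status}] '{query}': {elapsed_ms:.0f} ms, {len(hits)} results")

run_latency_checks(search_client, monitoring_config)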
Getting Help¶
1. Azure Support Resources¶
- Azure Portal: Monitor service health and metrics
- Azure Support: Create support tickets for complex issues
- Documentation: Official Azure AI Search documentation
- Community Forums: Stack Overflow, Microsoft Q&A
2. Diagnostic Information to Collect¶
When reporting issues, include:
- Service Details: Service name, tier, region
- Index Schema: Complete index definition
- Analyzer Configuration: Full analyzer and scoring profile definitions
- Sample Data: Representative test documents
- Error Messages: Complete error responses
- Performance Metrics: Timing and throughput measurements
3. Common Support Scenarios¶
| Issue Type | Information Needed | Expected Resolution Time |
|---|---|---|
| Configuration errors | Index schema, error messages | 1-2 business days |
| Performance issues | Metrics, sample data, usage patterns | 3-5 business days |
| Unexpected behavior | Test cases, expected vs. actual results | 2-3 business days |
| Service limits | Usage patterns, scaling requirements | 1-2 business days |
This troubleshooting guide covers the most common issues encountered when working with analyzers and scoring profiles in Azure AI Search. Regular testing and monitoring help prevent many of these issues from occurring in production.