# Module 10: Practice Implementation - Analyzers & Scoring
## Overview
This practice guide provides hands-on exercises to implement and test custom analyzers and scoring profiles in Azure AI Search. Each exercise builds upon previous concepts and includes validation steps.
## Exercise 1: Built-in Analyzer Comparison
### Objective
Compare different built-in analyzers to understand their behavior and choose appropriate analyzers for different content types.
### Setup

Create a test index with multiple analyzer fields:

```json
{
  "name": "analyzer-test-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    }
  ]
}
```
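If you prefer to script index creation rather than use the portal, a minimal sketch using the `requests` library is below. The service URL, admin key, and the local file name are placeholders you would substitute with your own values:

```python
import json
import requests

SERVICE = "https://<service-name>.search.windows.net"  # assumption: your service endpoint
ADMIN_KEY = "<admin-key>"                              # assumption: your admin API key
API_VERSION = "2024-07-01"

def create_index(index_definition: dict) -> None:
    """Create (or overwrite) an index from a JSON definition via the REST API."""
    name = index_definition["name"]
    response = requests.put(
        f"{SERVICE}/indexes/{name}?api-version={API_VERSION}",
        headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
        data=json.dumps(index_definition),
    )
    response.raise_for_status()

# Assumption: the definition above is saved locally as analyzer-test-index.json
with open("analyzer-test-index.json") as f:
    create_index(json.load(f))
```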
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "content_standard": "The quick brown foxes are running through the forest",
      "content_english": "The quick brown foxes are running through the forest",
      "content_keyword": "The quick brown foxes are running through the forest",
      "content_simple": "The quick brown foxes are running through the forest"
    },
    {
      "id": "2",
      "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_english": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting"
    }
  ]
}
```
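To load this data, a minimal sketch using the Python SDK's `upload_documents` (which sets the indexing action for you); the service name and key are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <admin-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="analyzer-test-index",
    credential=AzureKeyCredential("<admin-key>"),
)

sentences = {
    "1": "The quick brown foxes are running through the forest",
    "2": "HTML <b>bold</b> and <i>italic</i> formatting",
}
# Copy the same text into every analyzer field so the comparison is fair.
fields = ("content_standard", "content_english", "content_keyword", "content_simple")
docs = [{"id": doc_id, **{f: text for f in fields}} for doc_id, text in sentences.items()]

result = client.upload_documents(documents=docs)
print([(r.key, r.succeeded) for r in result])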
### Analysis Tasks

- Test Tokenization: Use the Analyze API to see how each analyzer processes text:

  ```http
  POST https://[service].search.windows.net/indexes/analyzer-test-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "The quick brown foxes are running",
    "analyzer": "standard.lucene"
  }
  ```

- Compare Results: Run the same text through each analyzer and document the differences (a helper that automates this comparison appears after this list)
- Search Testing: Perform searches and compare result relevance
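To build the comparison quickly, you can drive the Analyze API from Python instead of issuing four requests by hand. A minimal sketch, assuming the azure-search-documents package and placeholder service values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import AnalyzeTextOptions

# Assumptions: <service-name> and <admin-key> are placeholders for your service.
index_client = SearchIndexClient(
    endpoint="https://<service-name>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

text = "The quick brown foxes are running"
for analyzer in ["standard.lucene", "en.microsoft", "keyword", "simple"]:
    result = index_client.analyze_text(
        "analyzer-test-index",
        AnalyzeTextOptions(text=text, analyzer_name=analyzer),
    )
    # Print the token stream each analyzer produces for the same input
    print(f"{analyzer}: {[t.token for t in result.tokens]}")
```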
### Expected Outcomes
- Standard Analyzer: Lowercases, removes punctuation, basic tokenization
- English Analyzer: Stemming (foxes → fox, running → run), stop word removal
- Keyword Analyzer: Treats entire input as single token
- Simple Analyzer: Basic lowercase and whitespace tokenization
### Validation
Create a comparison table:
| Analyzer | Input | Tokens | Notes |
|---|---|---|---|
| standard.lucene | "running foxes" | ["running", "foxes"] | Basic processing |
| en.microsoft | "running foxes" | ["run", "fox"] | Stemming applied |
| keyword | "running foxes" | ["running foxes"] | Single token |
| simple | "Running Foxes" | ["running", "foxes"] | Lowercase only |
## Exercise 2: Custom Analyzer Creation
### Objective
Build a custom analyzer for e-commerce product search that handles HTML content and applies domain-specific processing.
### Custom Analyzer Definition

Note that `html_strip` is a predefined character filter, so it is referenced by name and needs no custom definition; only the mapping filter is declared. Custom analyzers also require the `#Microsoft.Azure.Search.CustomAnalyzer` type:

```json
{
  "name": "ecommerce-analyzer-index",
  "analyzers": [
    {
      "name": "product_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "charFilters": ["html_strip", "product_mapping"],
      "tokenFilters": [
        "lowercase",
        "product_stopwords",
        "product_synonyms"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "product_mapping",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": [
        "&=>and",
        "@=>at"
      ]
    }
  ],
  "tokenFilters": [
    {
      "name": "product_stopwords",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"]
    },
    {
      "name": "product_synonyms",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": [
        "laptop,notebook,computer",
        "phone,mobile,smartphone",
        "tv,television,monitor"
      ]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "productName",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    }
  ]
}
```
### Test Implementation

- Create the Index: Deploy the custom analyzer configuration
- Test Analysis: Verify the analyzer processes text correctly:

  ```http
  POST https://[service].search.windows.net/indexes/ecommerce-analyzer-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "<p>High-performance <b>laptop</b> & notebook computer</p>",
    "analyzer": "product_analyzer"
  }
  ```

- Expected Output: roughly `["high", "performance", "laptop", "notebook", "computer"]`. The `&` is first mapped to `and` and then removed by `product_stopwords`; the synonym filter also emits additional position-overlapping tokens for the laptop/notebook/computer group.
### Sample Data

```json
{
  "value": [
    {
      "id": "1",
      "productName": "Dell XPS 13 Laptop",
      "description": "<p>Ultra-thin <b>notebook</b> computer with high performance</p>"
    },
    {
      "id": "2",
      "productName": "iPhone 14 Pro",
      "description": "<div>Advanced <i>smartphone</i> with professional camera</div>"
    },
    {
      "id": "3",
      "productName": "Samsung 55\" Smart TV",
      "description": "4K <b>television</b> with streaming capabilities"
    }
  ]
}
```
### Validation Tasks

- HTML Stripping: Verify HTML tags are removed
- Character Mapping: Confirm `&` becomes `and` (note that the mapped `and` is then removed by `product_stopwords`, so inspect the intermediate step by temporarily dropping that filter if needed)
- Synonym Expansion: Test that searching for "laptop" finds "notebook" products (see the sketch after this list)
- Stop Word Removal: Verify common words are filtered out
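A minimal sketch of the synonym check using the Python SDK, with placeholder service values; if the synonym filter works, the query should return the Dell product even though "laptop" only appears in its name, not the description:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="ecommerce-analyzer-index",
    credential=AzureKeyCredential("<query-key>"),
)

# "laptop", "notebook", and "computer" are equivalent synonyms in product_analyzer,
# so searching any one of them should match documents containing the others.
for doc in client.search(search_text="laptop", search_fields=["productName", "description"]):
    print(doc["id"], doc["productName"])
```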
## Exercise 3: N-gram Analyzer for Autocomplete

### Objective
Implement an edge n-gram analyzer to enable autocomplete functionality.
### Autocomplete Analyzer Configuration

The separate index and search analyzers are the key here: documents are indexed as edge n-grams, while query text is tokenized normally, so a typed prefix like "mach" matches an indexed gram directly.

```json
{
  "name": "autocomplete-index",
  "analyzers": [
    {
      "name": "autocomplete_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "autocomplete_tokenizer",
      "tokenFilters": ["lowercase"]
    },
    {
      "name": "search_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [
    {
      "name": "autocomplete_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram": 2,
      "maxGram": 25,
      "tokenChars": ["letter", "digit"]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "indexAnalyzer": "autocomplete_analyzer",
      "searchAnalyzer": "search_analyzer",
      "searchable": true
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "title": "Machine Learning Fundamentals"
    },
    {
      "id": "2",
      "title": "Deep Learning with Python"
    },
    {
      "id": "3",
      "title": "Natural Language Processing"
    }
  ]
}
```
### Testing Autocomplete

- Analyze Indexing: See how text is tokenized for indexing:

  ```http
  POST https://[service].search.windows.net/indexes/autocomplete-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "Machine Learning",
    "analyzer": "autocomplete_analyzer"
  }
  ```

  Expected tokens: `["ma", "mac", "mach", "machi", "machin", "machine", "le", "lea", "lear", "learn", "learni", "learnin", "learning"]`

- Test Autocomplete Queries (a small type-ahead simulation follows this list):

  ```http
  POST https://[service].search.windows.net/indexes/autocomplete-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "mach",
    "searchFields": "title"
  }
  ```
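To see the experience end to end, you can replay a user typing progressively longer prefixes. A minimal sketch, assuming the index and documents above exist and using placeholder service values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="autocomplete-index",
    credential=AzureKeyCredential("<query-key>"),
)

# Simulate a user typing; each prefix should keep matching "Machine Learning Fundamentals".
for prefix in ["ma", "mac", "mach", "machine l"]:
    results = client.search(search_text=prefix, search_fields=["title"], top=5)
    print(f"{prefix!r} -> {[doc['title'] for doc in results]}")
```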
### Validation
- Partial matches work (searching "mach" finds "Machine Learning")
- Performance is acceptable for autocomplete scenarios
- Index size increase is manageable
## Exercise 4: Basic Scoring Profile

### Objective
Create a scoring profile that weights different fields and applies magnitude boosting.
### Scoring Profile Configuration

```json
{
  "name": "content-scoring-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "category",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "viewCount",
      "type": "Edm.Int32",
      "filterable": true
    },
    {
      "name": "publishDate",
      "type": "Edm.DateTimeOffset",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "content_relevance",
      "text": {
        "weights": {
          "title": 4.0,
          "content": 1.0,
          "category": 2.0
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "magnitude",
          "fieldName": "viewCount",
          "boost": 1.5,
          "interpolation": "logarithmic",
          "magnitude": {
            "boostingRangeStart": 0,
            "boostingRangeEnd": 10000,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "freshness",
          "fieldName": "publishDate",
          "boost": 1.3,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P30D"
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "title": "Introduction to Machine Learning",
      "content": "Machine learning is a powerful subset of artificial intelligence...",
      "category": "Technology",
      "rating": 4.5,
      "viewCount": 1250,
      "publishDate": "2024-01-15T10:00:00Z"
    },
    {
      "id": "2",
      "title": "Advanced Machine Learning Techniques",
      "content": "Deep dive into advanced machine learning algorithms...",
      "category": "Technology",
      "rating": 4.8,
      "viewCount": 850,
      "publishDate": "2024-02-20T14:30:00Z"
    },
    {
      "id": "3",
      "title": "Machine Learning in Practice",
      "content": "Practical applications of machine learning in business...",
      "category": "Business",
      "rating": 3.9,
      "viewCount": 2100,
      "publishDate": "2023-12-10T09:15:00Z"
    }
  ]
}
```
### Testing Scoring

- Default Scoring:

  ```http
  POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "machine learning",
    "select": "id,title,rating,viewCount,publishDate",
    "top": 10
  }
  ```

- Custom Scoring Profile (a side-by-side comparison sketch follows this list):

  ```http
  POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "machine learning",
    "scoringProfile": "content_relevance",
    "select": "id,title,rating,viewCount,publishDate",
    "top": 10
  }
  ```
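A minimal sketch that runs both queries through the Python SDK and prints document IDs with their `@search.score`, so the ranking change is easy to eyeball; service values are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<query-key>"),
)

def ranking(scoring_profile=None):
    """Return (id, score) pairs in ranked order for the same query."""
    results = client.search(
        search_text="machine learning",
        scoring_profile=scoring_profile,
        select=["id", "title"],
        top=10,
    )
    return [(doc["id"], round(doc["@search.score"], 3)) for doc in results]

print("default:          ", ranking())
print("content_relevance:", ranking("content_relevance"))
```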
### Analysis Tasks

- Compare Rankings: Document how result order changes with the scoring profile
- Score Analysis: Inspect the `@search.score` value returned with each result to quantify the differences
- Field Weight Impact: Test queries that match different fields
- Function Impact: Analyze how rating, view count, and freshness affect scores
### Expected Observations
- Higher-rated content should rank higher
- Recent content gets freshness boost
- Popular content (high view count) gets magnitude boost
- Title matches score higher than content matches
## Exercise 5: Advanced Scoring with Distance

### Objective
Implement location-based scoring for a restaurant search scenario.
### Location-Based Index

```json
{
  "name": "restaurant-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "name",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "cuisine",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "location",
      "type": "Edm.GeographyPoint",
      "filterable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "priceRange",
      "type": "Edm.Int32",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "location_relevance",
      "text": {
        "weights": {
          "name": 3.0,
          "cuisine": 2.0,
          "description": 1.0
        }
      },
      "functions": [
        {
          "type": "distance",
          "fieldName": "location",
          "boost": 2.0,
          "interpolation": "linear",
          "distance": {
            "referencePointParameter": "userLocation",
            "boostingDistance": 5
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 1.5,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "name": "Mario's Italian Kitchen",
      "cuisine": "Italian",
      "description": "Authentic Italian cuisine with fresh pasta",
      "location": {
        "type": "Point",
        "coordinates": [-122.131577, 47.678581]
      },
      "rating": 4.5,
      "priceRange": 3
    },
    {
      "id": "2",
      "name": "Sakura Sushi",
      "cuisine": "Japanese",
      "description": "Fresh sushi and traditional Japanese dishes",
      "location": {
        "type": "Point",
        "coordinates": [-122.135577, 47.680581]
      },
      "rating": 4.8,
      "priceRange": 4
    }
  ]
}
```
### Location-Based Search

Scoring parameters use the `name-value` format, so the separator dash comes directly before the (negative) longitude. The `boostingDistance` of 5 defined above is measured in kilometers from this reference point.

```http
POST https://[service].search.windows.net/indexes/restaurant-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "italian",
  "scoringProfile": "location_relevance",
  "scoringParameters": ["userLocation--122.133577,47.679581"],
  "select": "id,name,cuisine,rating,location",
  "top": 10
}
```
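The same query via the Python SDK, where scoring parameters are passed as a list of strings in the same `name-value` form; service values are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="restaurant-index",
    credential=AzureKeyCredential("<query-key>"),
)

results = client.search(
    search_text="italian",
    scoring_profile="location_relevance",
    scoring_parameters=["userLocation--122.133577,47.679581"],
    select=["id", "name", "cuisine", "rating"],
    top=10,
)
for doc in results:
    print(round(doc["@search.score"], 3), doc["name"])
```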
### Validation
- Restaurants closer to user location rank higher
- High-rated restaurants get additional boost
- Distance function works correctly with geographic coordinates
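To double-check the distance boost against ground truth, you can compare the service's ranking with raw great-circle distances; for test data at this scale a small haversine helper is enough:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers (sufficient for a sanity check)."""
    dlon, dlat = radians(lon2 - lon1), radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Coordinates from the test data and the userLocation scoring parameter above.
user = (-122.133577, 47.679581)
restaurants = {
    "Mario's Italian Kitchen": (-122.131577, 47.678581),
    "Sakura Sushi": (-122.135577, 47.680581),
}
for name, (lon, lat) in restaurants.items():
    print(name, round(haversine_km(*user, lon, lat), 3), "km")
```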
## Exercise 6: Performance Testing and Optimization

### Objective
Measure and optimize analyzer and scoring profile performance.
### Performance Test Setup

- Create Large Test Dataset: Generate 10,000+ documents (a simple generator sketch follows this list)
- Measure Indexing Performance: Time document indexing with different analyzers
- Measure Query Performance: Test query latency with different scoring profiles
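A few lines of Python can fabricate a corpus that fits the Exercise 4 schema. A sketch, with the field names assumed from `content-scoring-index` and placeholder service values:

```python
import random

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: placeholders for your service; schema matches content-scoring-index.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<admin-key>"),
)

TOPICS = ["machine learning", "data science", "search relevance", "cloud architecture"]

def generate_documents(count=10_000):
    for i in range(count):
        topic = random.choice(TOPICS)
        yield {
            "id": str(i),
            "title": f"{topic.title()} Article {i}",
            "content": f"An article about {topic}. " * 20,
            "category": random.choice(["Technology", "Business"]),
            "rating": round(random.uniform(1.0, 5.0), 1),
            "viewCount": random.randint(0, 10_000),
            "publishDate": "2024-01-01T00:00:00Z",
        }

# Upload in batches of 1,000 documents, the maximum per indexing request.
docs = list(generate_documents())
for start in range(0, len(docs), 1000):
    client.upload_documents(documents=docs[start:start + 1000])
```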
### Performance Testing Script (Python)

```python
import time
import statistics

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: placeholders for your service endpoint, key, and index.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<admin-key>"),
)

def measure_indexing_performance(search_client, documents, analyzer_name):
    """Measure indexing throughput for an index that uses a specific analyzer."""
    start_time = time.time()
    try:
        result = search_client.upload_documents(documents)
        duration = time.time() - start_time
        return {
            'analyzer': analyzer_name,
            'duration': duration,
            'docs_per_second': len(documents) / duration,
            'success_count': len([r for r in result if r.succeeded]),
        }
    except Exception as e:
        return {'error': str(e)}

def measure_query_performance(search_client, query, scoring_profile=None, iterations=10):
    """Measure query latency over several iterations."""
    latencies = []
    for _ in range(iterations):
        start_time = time.time()
        search_params = {'search_text': query}
        if scoring_profile:
            search_params['scoring_profile'] = scoring_profile
        results = search_client.search(**search_params)
        list(results)  # Force execution; results are fetched lazily
        latencies.append((time.time() - start_time) * 1000)  # Convert to ms
    return {
        'query': query,
        'scoring_profile': scoring_profile,
        'avg_latency_ms': statistics.mean(latencies),
        'min_latency_ms': min(latencies),
        'max_latency_ms': max(latencies),
        'std_dev_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0,
    }

# Example usage (assumes test_docs holds the documents to index; the analyzer
# name is a label here -- point `client` at the matching index for each run)
def run_performance_tests():
    analyzers = ['standard.lucene', 'en.microsoft', 'custom_analyzer']
    for analyzer in analyzers:
        perf = measure_indexing_performance(client, test_docs, analyzer)
        if 'error' in perf:
            print(f"Analyzer {analyzer}: failed ({perf['error']})")
        else:
            print(f"Analyzer {analyzer}: {perf['docs_per_second']:.2f} docs/sec")

    profiles = [None, 'content_relevance', 'location_relevance']
    for profile in profiles:
        perf = measure_query_performance(client, "machine learning", profile)
        print(f"Profile {profile}: {perf['avg_latency_ms']:.2f}ms avg")
```
### Optimization Strategies

- Analyzer Optimization:
  - Use simpler analyzers for less critical fields
  - Implement separate index/search analyzers
  - Remove unnecessary token filters
- Scoring Profile Optimization:
  - Reduce the number of scoring functions
  - Use appropriate interpolation methods
  - Balance boost values
- Index Design Optimization:
  - Selective field analysis
  - Appropriate field attributes
  - Efficient data types
## Exercise 7: A/B Testing Framework

### Objective
Implement A/B testing to compare different analyzer and scoring configurations.
### A/B Testing Implementation

```python
import random
import statistics
import time
from datetime import datetime

class ABTestFramework:
    def __init__(self, search_client):
        self.search_client = search_client
        self.test_results = []

    def run_ab_test(self, query, config_a, config_b, user_sessions=100):
        """Run an A/B test comparing two search configurations."""
        results = {
            'config_a': {'queries': [], 'metrics': {}},
            'config_b': {'queries': [], 'metrics': {}},
        }
        for _ in range(user_sessions):
            # Randomly assign each session to the A or B group
            if random.random() < 0.5:
                group, config = 'config_a', config_a
            else:
                group, config = 'config_b', config_b
            query_result = self.execute_search(query, config)
            results[group]['queries'].append(query_result)

        # Aggregate per-group metrics
        for group in ('config_a', 'config_b'):
            results[group]['metrics'] = self.calculate_metrics(results[group]['queries'])
        return results

    def execute_search(self, query, config):
        """Execute a search with a specific configuration and time it."""
        search_params = {'search_text': query, 'top': 10}
        if 'scoring_profile' in config:
            search_params['scoring_profile'] = config['scoring_profile']
        if 'scoring_parameters' in config:
            search_params['scoring_parameters'] = config['scoring_parameters']

        start_time = time.time()
        results = list(self.search_client.search(**search_params))
        end_time = time.time()
        return {
            'query': query,
            'results': results,
            'latency': (end_time - start_time) * 1000,  # ms
            'result_count': len(results),
            'timestamp': datetime.now(),
        }

    def calculate_metrics(self, query_results):
        """Calculate aggregate performance metrics for one group."""
        latencies = [r['latency'] for r in query_results]
        result_counts = [r['result_count'] for r in query_results]
        return {
            'avg_latency': statistics.mean(latencies),
            'avg_results': statistics.mean(result_counts),
            'total_queries': len(query_results),
            'success_rate': len([r for r in query_results if r['result_count'] > 0]) / len(query_results),
        }

# Example A/B test (assumes search_client is a configured SearchClient and that
# an 'enhanced_relevance' profile has been defined alongside 'content_relevance')
def run_scoring_ab_test():
    ab_tester = ABTestFramework(search_client)
    config_a = {'scoring_profile': 'content_relevance'}
    config_b = {'scoring_profile': 'enhanced_relevance'}
    test_queries = [
        "machine learning",
        "data science",
        "artificial intelligence",
        "python programming",
    ]
    for query in test_queries:
        results = ab_tester.run_ab_test(query, config_a, config_b)
        print(f"Query: {query}")
        print(f"Config A - Avg Latency: {results['config_a']['metrics']['avg_latency']:.2f}ms")
        print(f"Config B - Avg Latency: {results['config_b']['metrics']['avg_latency']:.2f}ms")
        print(f"Config A - Success Rate: {results['config_a']['metrics']['success_rate']:.2%}")
        print(f"Config B - Success Rate: {results['config_b']['metrics']['success_rate']:.2%}")
        print("---")
```
## Validation and Assessment

### Completion Checklist
- [ ] Exercise 1: Successfully compared built-in analyzers
- [ ] Exercise 2: Created and tested custom analyzer
- [ ] Exercise 3: Implemented n-gram analyzer for autocomplete
- [ ] Exercise 4: Built basic scoring profile with field weights and functions
- [ ] Exercise 5: Implemented location-based scoring
- [ ] Exercise 6: Conducted performance testing and optimization
- [ ] Exercise 7: Set up A/B testing framework
### Assessment Criteria

- Technical Implementation (40%)
  - Correct analyzer and scoring profile syntax
  - Proper use of Azure AI Search APIs
  - Error handling and validation
- Performance Optimization (30%)
  - Measured performance impact
  - Applied optimization strategies
  - Balanced functionality vs. performance
- Testing and Validation (20%)
  - Comprehensive test coverage
  - Proper use of the Analyze API
  - A/B testing implementation
- Documentation and Analysis (10%)
  - Clear documentation of configurations
  - Analysis of results and trade-offs
  - Recommendations for production use
## Next Steps
After completing these exercises:
- Apply to Real Data: Implement analyzers and scoring for your actual use case
- Monitor Production: Set up monitoring for performance and relevance
- Iterate and Improve: Use A/B testing to continuously optimize
- Advanced Topics: Explore semantic search and vector search capabilities
These practical exercises provide hands-on experience with the core concepts of text analysis and scoring in Azure AI Search, preparing you for real-world implementation scenarios.