Exercise 1: Analyzer Comparison Lab¶
📋 Exercise Details¶
- Difficulty: Beginner
- Duration: 45-60 minutes
- Skills: Built-in analyzers, text analysis, tokenization comparison
🎯 Objective¶
Learn to compare and evaluate different built-in analyzers in Azure AI Search by creating a comprehensive testing framework that demonstrates how various analyzers process text differently.
📚 Prerequisites¶
- Completed Module 10 documentation reading
- Azure AI Search service (Standard tier or higher)
- Admin API key for index creation
- Basic understanding of JSON and REST APIs
- Development environment set up (Python, JavaScript, or C#)
🛠️ Instructions¶
Step 1: Create Analyzer Comparison Index¶
Create an index with fields using different built-in analyzers:
```json
{
  "name": "analyzer-comparison-lab",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    },
    {
      "name": "content_whitespace",
      "type": "Edm.String",
      "analyzer": "whitespace",
      "searchable": true
    }
  ]
}
```
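If you prefer to create the index from code rather than the portal, here is a minimal sketch using the REST API with the `requests` library. The local file name and placeholder credentials are assumptions; replace them with your own values.

```python
import json
import requests

# Assumed placeholders - replace with your service name and admin key.
SERVICE_NAME = "your-service"
ADMIN_KEY = "your-admin-key"
ENDPOINT = f"https://{SERVICE_NAME}.search.windows.net"

# Assumes the index definition above was saved as analyzer-comparison-lab.json.
with open("analyzer-comparison-lab.json") as f:
    index_definition = json.load(f)

# PUT /indexes/{name} creates the index, or updates it if it already exists.
response = requests.put(
    f"{ENDPOINT}/indexes/{index_definition['name']}",
    headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
    params={"api-version": "2024-07-01"},
    json=index_definition,
)
response.raise_for_status()
print(f"Index '{index_definition['name']}' is ready")
```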
Step 2: Prepare Test Content¶
Create diverse test documents that highlight analyzer differences:
```json
[
  {
    "id": "1",
    "content_standard": "The quick brown foxes are running through the forest",
    "content_english": "The quick brown foxes are running through the forest",
    "content_keyword": "The quick brown foxes are running through the forest",
    "content_simple": "The quick brown foxes are running through the forest",
    "content_whitespace": "The quick brown foxes are running through the forest"
  },
  {
    "id": "2",
    "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_english": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_whitespace": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text"
  },
  {
    "id": "3",
    "content_standard": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_english": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_keyword": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_simple": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_whitespace": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com"
  }
]
```
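To get these documents into the index, post them to the documents endpoint. The sketch below assumes the array above was saved locally as test-documents.json and reuses the placeholder credentials from Step 1.

```python
import json
import requests

# Assumed placeholders - replace with your service name and admin key.
ENDPOINT = "https://your-service.search.windows.net"
ADMIN_KEY = "your-admin-key"
INDEX_NAME = "analyzer-comparison-lab"

# Documents from Step 2 (assumed saved locally as test-documents.json).
with open("test-documents.json") as f:
    documents = json.load(f)

# Each document needs an @search.action; mergeOrUpload inserts or updates.
payload = {"value": [{"@search.action": "mergeOrUpload", **doc} for doc in documents]}

response = requests.post(
    f"{ENDPOINT}/indexes/{INDEX_NAME}/docs/index",
    headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
    params={"api-version": "2024-07-01"},
    json=payload,
)
response.raise_for_status()
print(f"Uploaded {len(documents)} documents")
```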
Step 3: Implement Analyzer Testing Framework¶
Create a testing framework that uses the Analyze API to compare how different analyzers process text:
Python Implementation¶
```python
import json
import requests
from typing import Dict, List


class AnalyzerComparator:
    def __init__(self, service_name: str, admin_key: str, index_name: str):
        self.service_name = service_name
        self.admin_key = admin_key
        self.index_name = index_name
        self.endpoint = f"https://{service_name}.search.windows.net"

    def analyze_text(self, text: str, analyzer: str) -> List[str]:
        """Analyze text using the specified analyzer."""
        # The Analyze Text operation lives at /indexes/{index}/search.analyze.
        url = f"{self.endpoint}/indexes/{self.index_name}/search.analyze"
        headers = {
            "Content-Type": "application/json",
            "api-key": self.admin_key
        }
        data = {
            "text": text,
            "analyzer": analyzer
        }

        response = requests.post(url, headers=headers, json=data,
                                 params={"api-version": "2024-07-01"})

        if response.status_code == 200:
            result = response.json()
            return [token["token"] for token in result.get("tokens", [])]
        else:
            print(f"Error analyzing with {analyzer}: {response.text}")
            return []

    def compare_analyzers(self, text: str, analyzers: List[str]) -> Dict[str, List[str]]:
        """Compare how different analyzers process the same text."""
        results = {}
        for analyzer in analyzers:
            results[analyzer] = self.analyze_text(text, analyzer)
        return results

    def print_comparison(self, text: str, analyzers: List[str]):
        """Print a formatted comparison of analyzer results."""
        print(f"\nInput text: {text}")
        print("-" * 60)

        results = self.compare_analyzers(text, analyzers)
        for analyzer, tokens in results.items():
            print(f"{analyzer:15} -> {tokens}")

        # Analysis
        print("\nAnalysis:")
        token_counts = {analyzer: len(tokens) for analyzer, tokens in results.items()}
        for analyzer, count in token_counts.items():
            print(f"  {analyzer}: {count} tokens")

        # Identify unique behaviors
        if "en.microsoft" in results and "standard.lucene" in results:
            english_tokens = set(results["en.microsoft"])
            standard_tokens = set(results["standard.lucene"])
            if len(english_tokens) < len(standard_tokens):
                removed = standard_tokens - english_tokens
                print(f"  English analyzer removed: {removed}")

        if "keyword" in results:
            print("  Keyword analyzer preserved the entire input as a single token")


# Usage example
comparator = AnalyzerComparator("your-service", "your-key", "analyzer-comparison-lab")

test_texts = [
    "The quick brown foxes are running",
    "HTML <b>bold</b> formatting",
    "user@example.com and admin@test.org"
]
analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

for text in test_texts:
    comparator.print_comparison(text, analyzers)
```
Step 4: Conduct Systematic Testing¶
Test each analyzer with different types of content:
- Basic Text Processing
  - Simple sentences
  - Punctuation handling
  - Case sensitivity
- Linguistic Features
  - Stemming behavior
  - Stop word removal
  - Pluralization
- Special Content
  - HTML tags
  - Email addresses
  - URLs and special characters
  - Numbers and dates
- Edge Cases
  - Empty strings
  - Very long text
  - Special Unicode characters
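One way to work through these categories is to group a few sample strings per category and run each through the comparator from Step 3. The sample strings below are illustrative, not prescribed by the exercise:

```python
# Illustrative test strings grouped by the categories above.
systematic_tests = {
    "Basic text processing": ["The Quick Brown Fox.", "Hello, world!"],
    "Linguistic features": ["running runners ran", "The cats and the dogs"],
    "Special content": ["<p>HTML</p>", "user@example.com", "https://www.example.com", "Order 42 on 2024-07-01"],
    "Edge cases": ["", "café naïve résumé", "x" * 500],
}

analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

for category, samples in systematic_tests.items():
    print(f"\n=== {category} ===")
    for text in samples:
        comparator.print_comparison(text, analyzers)
```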
Step 5: Search Behavior Analysis¶
Compare search results using different analyzer fields:
```python
def compare_search_results(search_client, query: str):
    """Compare search results across different analyzer fields."""
    fields_to_test = [
        ("content_standard", "Standard"),
        ("content_english", "English"),
        ("content_keyword", "Keyword"),
        ("content_simple", "Simple"),
        ("content_whitespace", "Whitespace")
    ]

    print(f"\nSearch query: '{query}'")
    print("=" * 50)

    for field, name in fields_to_test:
        try:
            results = list(search_client.search(
                search_text=query,
                search_fields=[field],
                select=["id", field],
                top=3
            ))

            print(f"\n{name} Analyzer ({field}):")
            if results:
                for i, result in enumerate(results, 1):
                    score = result.get("@search.score", 0)
                    content = result.get(field, "")[:50] + "..."
                    print(f"  {i}. Score: {score:.3f} - {content}")
            else:
                print("  No results found")

        except Exception as e:
            print(f"  Error: {e}")
```
✅ Validation¶
Expected Outcomes¶
Document your findings for each analyzer:
- Standard Analyzer (standard.lucene)
  - Tokenizes on whitespace and punctuation
  - Converts to lowercase
  - Removes most punctuation
  - Language-neutral processing
- English Analyzer (en.microsoft)
  - Includes stemming (running → run, foxes → fox)
  - Removes English stop words (the, and, or, etc.)
  - Advanced linguistic processing
  - Better for English content search
- Keyword Analyzer (keyword)
  - Treats the entire input as a single token
  - Preserves exact text including case and punctuation
  - Perfect for exact matching scenarios
- Simple Analyzer (simple)
  - Splits on non-letter characters
  - Converts to lowercase
  - No linguistic processing
  - Fast and lightweight
- Whitespace Analyzer (whitespace)
  - Splits only on whitespace
  - Preserves punctuation and case
  - Minimal processing
Validation Checklist¶
- [ ] Index created successfully with all analyzer fields
- [ ] Documents uploaded and indexed properly
- [ ] Analyze API returns expected tokens for each analyzer
- [ ] Search results differ appropriately between analyzers
- [ ] Performance differences documented
- [ ] Use case recommendations identified
Test Cases¶
Create a test suite that validates expected behavior:
```python
test_cases = [
    {
        "input": "The quick brown foxes are running",
        "expected": {
            "standard.lucene": ["the", "quick", "brown", "foxes", "are", "running"],
            # Stemmed, stop words removed; exact Microsoft analyzer output can
            # vary slightly between service versions, so adjust if needed.
            "en.microsoft": ["quick", "brown", "fox", "run"],
            "keyword": ["The quick brown foxes are running"],
            "simple": ["the", "quick", "brown", "foxes", "are", "running"],
            "whitespace": ["The", "quick", "brown", "foxes", "are", "running"]
        }
    },
    {
        "input": "HTML <b>bold</b> text",
        "expected": {
            "standard.lucene": ["html", "b", "bold", "b", "text"],
            "keyword": ["HTML <b>bold</b> text"],
            "simple": ["html", "b", "bold", "b", "text"]
        }
    }
]


def validate_analyzer_behavior(comparator, test_cases):
    """Validate that analyzers behave as expected."""
    passed = 0
    failed = 0

    for test_case in test_cases:
        input_text = test_case["input"]
        expected = test_case["expected"]

        print(f"\nTesting: {input_text}")

        for analyzer, expected_tokens in expected.items():
            actual_tokens = comparator.analyze_text(input_text, analyzer)

            if actual_tokens == expected_tokens:
                print(f"  ✅ {analyzer}: PASS")
                passed += 1
            else:
                print(f"  ❌ {analyzer}: FAIL")
                print(f"     Expected: {expected_tokens}")
                print(f"     Actual:   {actual_tokens}")
                failed += 1

    print(f"\nValidation Results: {passed} passed, {failed} failed")
    return failed == 0
```
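You can then run the suite with the comparator from Step 3. Because Microsoft analyzer output can shift slightly between service versions, treat failures as prompts to inspect the actual tokens rather than hard errors:

```python
# Run the validation suite against the live service.
all_passed = validate_analyzer_behavior(comparator, test_cases)
print("All expected behaviors confirmed" if all_passed else "Review the failures above")
```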
🚀 Extensions¶
Extension 1: Language-Specific Testing¶
Test language-specific analyzers (fr.microsoft, de.microsoft, etc.) with multilingual content.
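A possible starting point, reusing the comparator from Step 3 (the analyzer names are built-in language analyzers; the sentences are only illustrative samples):

```python
# Illustrative multilingual samples paired with language-specific analyzers.
multilingual_tests = [
    ("fr.microsoft", "Les enfants jouaient dans les jardins"),
    ("de.microsoft", "Die Kinder spielten in den Gärten"),
    ("es.microsoft", "Los niños jugaban en los jardines"),
]

for analyzer, sentence in multilingual_tests:
    tokens = comparator.analyze_text(sentence, analyzer)
    print(f"{analyzer:15} -> {tokens}")
```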
Extension 2: Performance Benchmarking¶
Measure and compare the performance of different analyzers:
```python
import time


def benchmark_analyzers(comparator, text, analyzers, iterations=100):
    """Benchmark analyzer performance."""
    results = {}

    for analyzer in analyzers:
        start_time = time.time()
        for _ in range(iterations):
            comparator.analyze_text(text, analyzer)
        end_time = time.time()

        avg_time = (end_time - start_time) / iterations * 1000  # ms
        results[analyzer] = avg_time

    print(f"\nPerformance Benchmark ({iterations} iterations):")
    for analyzer, avg_time in sorted(results.items(), key=lambda x: x[1]):
        print(f"  {analyzer}: {avg_time:.2f}ms average")

    return results
```
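Because the comparator calls the REST Analyze API, this benchmark measures round-trip latency (network plus service time) rather than analyzer cost in isolation, but it still supports relative comparisons. A short usage sketch with a reduced iteration count to avoid throttling on smaller tiers:

```python
benchmark_analyzers(
    comparator,
    "The quick brown foxes are running through the forest",
    ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"],
    iterations=20,
)
```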
Extension 3: Custom Test Data¶
Create test data specific to your domain (e-commerce, legal, medical, etc.) and analyze how different analyzers handle domain-specific terminology.
Extension 4: Visualization¶
Create charts or graphs showing the differences in token counts, processing times, and search result relevance across analyzers.
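For example, a minimal sketch using matplotlib (assumed installed) to chart token counts per analyzer for one input:

```python
import matplotlib.pyplot as plt

# Token counts per analyzer for a single sample input.
text = "The quick brown foxes are running through the forest"
analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]
counts = {a: len(comparator.analyze_text(text, a)) for a in analyzers}

plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("Token count")
plt.title("Tokens produced per analyzer")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```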
💡 Solutions¶
Key Insights¶
After completing this exercise, you should understand:
- When to use each analyzer:
  - Standard: General-purpose, multilingual content
  - English: English-language content requiring linguistic processing
  - Keyword: Exact matching, IDs, codes
  - Simple: Lightweight processing, tags, categories
  - Whitespace: Preserving punctuation, minimal processing
- Trade-offs:
  - Processing complexity vs. performance
  - Search recall vs. precision
  - Index size vs. search capabilities
- Testing methodology:
  - Use representative data
  - Test edge cases
  - Measure both functionality and performance
  - Validate with real search scenarios
Common Pitfalls¶
- Not testing with real data: Synthetic test data may not reveal real-world issues
- Ignoring performance: Complex analyzers can impact indexing and query performance
- Over-engineering: Jumping straight to complex or custom analyzers; start with simple analyzers and add complexity only as needed
- Inconsistent testing: Using different test data for different analyzers
Best Practices¶
- Always test analyzers with your actual content
- Consider both indexing and query performance
- Document your findings and rationale for analyzer choices
- Plan for future content types and languages
- Monitor analyzer performance in production
Next Exercise: Custom Analyzer Workshop - Learn to build domain-specific analyzers with custom components.