Exercise 1: Analyzer Comparison Lab¶
📋 Exercise Details¶
- Difficulty: Beginner
- Duration: 45-60 minutes
- Skills: Built-in analyzers, text analysis, tokenization comparison
🎯 Objective¶
Learn to compare and evaluate different built-in analyzers in Azure AI Search by creating a comprehensive testing framework that demonstrates how various analyzers process text differently.
📚 Prerequisites¶
- Completed Module 10 documentation reading
- Azure AI Search service (Standard tier or higher)
- Admin API key for index creation
- Basic understanding of JSON and REST APIs
- Development environment set up (Python, JavaScript, or C#)
🛠️ Instructions¶
Step 1: Create Analyzer Comparison Index¶
Create an index with fields using different built-in analyzers:
```json
{
  "name": "analyzer-comparison-lab",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    },
    {
      "name": "content_whitespace",
      "type": "Edm.String",
      "analyzer": "whitespace",
      "searchable": true
    }
  ]
}
```
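If you prefer to create the index from code rather than the portal, here is a minimal sketch using the REST API with the `requests` library. The local file name and placeholder credentials are assumptions; replace them with your own values.

```python
import json
import requests

# Assumed placeholders - replace with your service name and admin key.
SERVICE_NAME = "your-service"
ADMIN_KEY = "your-admin-key"
ENDPOINT = f"https://{SERVICE_NAME}.search.windows.net"

# Assumes the index definition above was saved as analyzer-comparison-lab.json.
with open("analyzer-comparison-lab.json") as f:
    index_definition = json.load(f)

# PUT /indexes/{name} creates the index, or updates it if it already exists.
response = requests.put(
    f"{ENDPOINT}/indexes/{index_definition['name']}",
    headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
    params={"api-version": "2024-07-01"},
    json=index_definition,
)
response.raise_for_status()
print(f"Index '{index_definition['name']}' is ready")
```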
Step 2: Prepare Test Content¶
Create diverse test documents that highlight analyzer differences:
```json
[
  {
    "id": "1",
    "content_standard": "The quick brown foxes are running through the forest",
    "content_english": "The quick brown foxes are running through the forest",
    "content_keyword": "The quick brown foxes are running through the forest",
    "content_simple": "The quick brown foxes are running through the forest",
    "content_whitespace": "The quick brown foxes are running through the forest"
  },
  {
    "id": "2",
    "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_english": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_whitespace": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text"
  },
  {
    "id": "3",
    "content_standard": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_english": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_keyword": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_simple": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_whitespace": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com"
  }
]
```
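To get these documents into the index, post them to the documents endpoint. The sketch below assumes the array above was saved locally as test-documents.json and reuses the placeholder credentials from Step 1.

```python
import json
import requests

# Assumed placeholders - replace with your service name and admin key.
ENDPOINT = "https://your-service.search.windows.net"
ADMIN_KEY = "your-admin-key"
INDEX_NAME = "analyzer-comparison-lab"

# Documents from Step 2 (assumed saved locally as test-documents.json).
with open("test-documents.json") as f:
    documents = json.load(f)

# Each document needs an @search.action; mergeOrUpload inserts or updates.
payload = {"value": [{"@search.action": "mergeOrUpload", **doc} for doc in documents]}

response = requests.post(
    f"{ENDPOINT}/indexes/{INDEX_NAME}/docs/index",
    headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
    params={"api-version": "2024-07-01"},
    json=payload,
)
response.raise_for_status()
print(f"Uploaded {len(documents)} documents")
```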
Step 3: Implement Analyzer Testing Framework¶
Create a testing framework that uses the Analyze API to compare how different analyzers process text:
Python Implementation¶
```python
import json
import requests
from typing import Dict, List


class AnalyzerComparator:
    def __init__(self, service_name: str, admin_key: str, index_name: str):
        self.service_name = service_name
        self.admin_key = admin_key
        self.index_name = index_name
        self.endpoint = f"https://{service_name}.search.windows.net"

    def analyze_text(self, text: str, analyzer: str) -> List[str]:
        """Analyze text using the specified analyzer."""
        # The Analyze Text operation lives at /indexes/{index}/search.analyze.
        url = f"{self.endpoint}/indexes/{self.index_name}/search.analyze"
        headers = {
            "Content-Type": "application/json",
            "api-key": self.admin_key
        }
        data = {
            "text": text,
            "analyzer": analyzer
        }

        response = requests.post(url, headers=headers, json=data,
                                 params={"api-version": "2024-07-01"})

        if response.status_code == 200:
            result = response.json()
            return [token["token"] for token in result.get("tokens", [])]
        else:
            print(f"Error analyzing with {analyzer}: {response.text}")
            return []

    def compare_analyzers(self, text: str, analyzers: List[str]) -> Dict[str, List[str]]:
        """Compare how different analyzers process the same text."""
        results = {}
        for analyzer in analyzers:
            results[analyzer] = self.analyze_text(text, analyzer)
        return results

    def print_comparison(self, text: str, analyzers: List[str]):
        """Print a formatted comparison of analyzer results."""
        print(f"\nInput text: {text}")
        print("-" * 60)

        results = self.compare_analyzers(text, analyzers)
        for analyzer, tokens in results.items():
            print(f"{analyzer:15} -> {tokens}")

        # Analysis
        print("\nAnalysis:")
        token_counts = {analyzer: len(tokens) for analyzer, tokens in results.items()}
        for analyzer, count in token_counts.items():
            print(f"  {analyzer}: {count} tokens")

        # Identify unique behaviors
        if "en.microsoft" in results and "standard.lucene" in results:
            english_tokens = set(results["en.microsoft"])
            standard_tokens = set(results["standard.lucene"])
            if len(english_tokens) < len(standard_tokens):
                removed = standard_tokens - english_tokens
                print(f"  English analyzer removed: {removed}")

        if "keyword" in results:
            print("  Keyword analyzer preserved the entire input as a single token")


# Usage example
comparator = AnalyzerComparator("your-service", "your-key", "analyzer-comparison-lab")

test_texts = [
    "The quick brown foxes are running",
    "HTML <b>bold</b> formatting",
    "user@example.com and admin@test.org"
]
analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

for text in test_texts:
    comparator.print_comparison(text, analyzers)
```
Step 4: Conduct Systematic Testing¶
Test each analyzer with different types of content:
- Basic Text Processing
  - Simple sentences
  - Punctuation handling
  - Case sensitivity
- Linguistic Features
  - Stemming behavior
  - Stop word removal
  - Pluralization
- Special Content
  - HTML tags
  - Email addresses
  - URLs and special characters
  - Numbers and dates
- Edge Cases
  - Empty strings
  - Very long text
  - Special Unicode characters
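One way to work through these categories is to group a few sample strings per category and run each through the comparator from Step 3. The sample strings below are illustrative, not prescribed by the exercise:

```python
# Illustrative test strings grouped by the categories above.
systematic_tests = {
    "Basic text processing": ["The Quick Brown Fox.", "Hello, world!"],
    "Linguistic features": ["running runners ran", "The cats and the dogs"],
    "Special content": ["<p>HTML</p>", "user@example.com", "https://www.example.com", "Order 42 on 2024-07-01"],
    "Edge cases": ["", "café naïve résumé", "x" * 500],
}

analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

for category, samples in systematic_tests.items():
    print(f"\n=== {category} ===")
    for text in samples:
        comparator.print_comparison(text, analyzers)
```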
Step 5: Search Behavior Analysis¶
Compare search results using different analyzer fields:
```python
def compare_search_results(search_client, query: str):
    """Compare search results across different analyzer fields."""
    fields_to_test = [
        ("content_standard", "Standard"),
        ("content_english", "English"),
        ("content_keyword", "Keyword"),
        ("content_simple", "Simple"),
        ("content_whitespace", "Whitespace")
    ]

    print(f"\nSearch query: '{query}'")
    print("=" * 50)

    for field, name in fields_to_test:
        try:
            results = list(search_client.search(
                search_text=query,
                search_fields=[field],
                select=["id", field],
                top=3
            ))

            print(f"\n{name} Analyzer ({field}):")
            if results:
                for i, result in enumerate(results, 1):
                    score = result.get("@search.score", 0)
                    content = result.get(field, "")[:50] + "..."
                    print(f"  {i}. Score: {score:.3f} - {content}")
            else:
                print("  No results found")

        except Exception as e:
            print(f"  Error: {e}")
```
✅ Validation¶
Expected Outcomes¶
Document your findings for each analyzer:
- Standard Analyzer (standard.lucene)
  - Tokenizes on whitespace and punctuation
  - Converts to lowercase
  - Removes most punctuation
  - Language-neutral processing
- English Analyzer (en.microsoft)
  - Includes stemming (running → run, foxes → fox)
  - Removes English stop words (the, and, or, etc.)
  - Advanced linguistic processing
  - Better for English content search
- Keyword Analyzer (keyword)
  - Treats the entire input as a single token
  - Preserves exact text including case and punctuation
  - Perfect for exact matching scenarios
- Simple Analyzer (simple)
  - Splits on non-letter characters
  - Converts to lowercase
  - No linguistic processing
  - Fast and lightweight
- Whitespace Analyzer (whitespace)
  - Splits only on whitespace
  - Preserves punctuation and case
  - Minimal processing
Validation Checklist¶
- [ ] Index created successfully with all analyzer fields
- [ ] Documents uploaded and indexed properly
- [ ] Analyze API returns expected tokens for each analyzer
- [ ] Search results differ appropriately between analyzers
- [ ] Performance differences documented
- [ ] Use case recommendations identified
Test Cases¶
Create a test suite that validates expected behavior:
```python
test_cases = [
    {
        "input": "The quick brown foxes are running",
        "expected": {
            "standard.lucene": ["the", "quick", "brown", "foxes", "are", "running"],
            # Stemmed, stop words removed; exact Microsoft analyzer output can
            # vary slightly between service versions, so adjust if needed.
            "en.microsoft": ["quick", "brown", "fox", "run"],
            "keyword": ["The quick brown foxes are running"],
            "simple": ["the", "quick", "brown", "foxes", "are", "running"],
            "whitespace": ["The", "quick", "brown", "foxes", "are", "running"]
        }
    },
    {
        "input": "HTML <b>bold</b> text",
        "expected": {
            "standard.lucene": ["html", "b", "bold", "b", "text"],
            "keyword": ["HTML <b>bold</b> text"],
            "simple": ["html", "b", "bold", "b", "text"]
        }
    }
]


def validate_analyzer_behavior(comparator, test_cases):
    """Validate that analyzers behave as expected."""
    passed = 0
    failed = 0

    for test_case in test_cases:
        input_text = test_case["input"]
        expected = test_case["expected"]

        print(f"\nTesting: {input_text}")

        for analyzer, expected_tokens in expected.items():
            actual_tokens = comparator.analyze_text(input_text, analyzer)

            if actual_tokens == expected_tokens:
                print(f"  ✅ {analyzer}: PASS")
                passed += 1
            else:
                print(f"  ❌ {analyzer}: FAIL")
                print(f"     Expected: {expected_tokens}")
                print(f"     Actual:   {actual_tokens}")
                failed += 1

    print(f"\nValidation Results: {passed} passed, {failed} failed")
    return failed == 0
```
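You can then run the suite with the comparator from Step 3. Because Microsoft analyzer output can shift slightly between service versions, treat failures as prompts to inspect the actual tokens rather than hard errors:

```python
# Run the validation suite against the live service.
all_passed = validate_analyzer_behavior(comparator, test_cases)
print("All expected behaviors confirmed" if all_passed else "Review the failures above")
```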
🚀 Extensions¶
Extension 1: Language-Specific Testing¶
Test language-specific analyzers (fr.microsoft, de.microsoft, etc.) with multilingual content.
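A possible starting point, reusing the comparator from Step 3 (the analyzer names are built-in language analyzers; the sentences are only illustrative samples):

```python
# Illustrative multilingual samples paired with language-specific analyzers.
multilingual_tests = [
    ("fr.microsoft", "Les enfants jouaient dans les jardins"),
    ("de.microsoft", "Die Kinder spielten in den Gärten"),
    ("es.microsoft", "Los niños jugaban en los jardines"),
]

for analyzer, sentence in multilingual_tests:
    tokens = comparator.analyze_text(sentence, analyzer)
    print(f"{analyzer:15} -> {tokens}")
```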
Extension 2: Performance Benchmarking¶
Measure and compare the performance of different analyzers:
```python
import time


def benchmark_analyzers(comparator, text, analyzers, iterations=100):
    """Benchmark analyzer performance."""
    results = {}

    for analyzer in analyzers:
        start_time = time.time()
        for _ in range(iterations):
            comparator.analyze_text(text, analyzer)
        end_time = time.time()

        avg_time = (end_time - start_time) / iterations * 1000  # ms
        results[analyzer] = avg_time

    print(f"\nPerformance Benchmark ({iterations} iterations):")
    for analyzer, avg_time in sorted(results.items(), key=lambda x: x[1]):
        print(f"  {analyzer}: {avg_time:.2f}ms average")

    return results
```
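Because the comparator calls the REST Analyze API, this benchmark measures round-trip latency (network plus service time) rather than analyzer cost in isolation, but it still supports relative comparisons. A short usage sketch with a reduced iteration count to avoid throttling on smaller tiers:

```python
benchmark_analyzers(
    comparator,
    "The quick brown foxes are running through the forest",
    ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"],
    iterations=20,
)
```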
Extension 3: Custom Test Data¶
Create test data specific to your domain (e-commerce, legal, medical, etc.) and analyze how different analyzers handle domain-specific terminology.
Extension 4: Visualization¶
Create charts or graphs showing the differences in token counts, processing times, and search result relevance across analyzers.
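For example, a minimal sketch using matplotlib (assumed installed) to chart token counts per analyzer for one input:

```python
import matplotlib.pyplot as plt

# Token counts per analyzer for a single sample input.
text = "The quick brown foxes are running through the forest"
analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]
counts = {a: len(comparator.analyze_text(text, a)) for a in analyzers}

plt.bar(list(counts.keys()), list(counts.values()))
plt.ylabel("Token count")
plt.title("Tokens produced per analyzer")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```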
💡 Solutions¶
Key Insights¶
After completing this exercise, you should understand:
- When to use each analyzer:
  - Standard: General-purpose, multilingual content
  - English: English-language content requiring linguistic processing
  - Keyword: Exact matching, IDs, codes
  - Simple: Lightweight processing, tags, categories
  - Whitespace: Preserving punctuation, minimal processing
- Trade-offs:
  - Processing complexity vs. performance
  - Search recall vs. precision
  - Index size vs. search capabilities
- Testing methodology:
  - Use representative data
  - Test edge cases
  - Measure both functionality and performance
  - Validate with real search scenarios
Common Pitfalls¶
- Not testing with real data: Synthetic test data may not reveal real-world issues
- Ignoring performance: Complex analyzers can impact indexing and query performance
- Over-engineering: Jumping straight to complex or custom analyzers; start with simple analyzers and add complexity only as needed
- Inconsistent testing: Using different test data for different analyzers
Best Practices¶
- Always test analyzers with your actual content
- Consider both indexing and query performance
- Document your findings and rationale for analyzer choices
- Plan for future content types and languages
- Monitor analyzer performance in production
Next Exercise: Custom Analyzer Workshop - Learn to build domain-specific analyzers with custom components.