Exercise 1: Analyzer Comparison Lab

📋 Exercise Details

  • Difficulty: Beginner
  • Duration: 45-60 minutes
  • Skills: Built-in analyzers, text analysis, tokenization comparison

🎯 Objective

Learn to compare and evaluate different built-in analyzers in Azure AI Search by creating a comprehensive testing framework that demonstrates how various analyzers process text differently.

📚 Prerequisites

  • Completed Module 10 documentation reading
  • Azure AI Search service (Standard tier or higher)
  • Admin API key for index creation
  • Basic understanding of JSON and REST APIs
  • Development environment set up (Python, JavaScript, or C#)

🛠️ Instructions

Step 1: Create Analyzer Comparison Index

Create an index with fields using different built-in analyzers:

{
  "name": "analyzer-comparison-lab",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String", 
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    },
    {
      "name": "content_whitespace",
      "type": "Edm.String",
      "analyzer": "whitespace",
      "searchable": true
    }
  ]
}
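
To create the index, you can PUT this definition to the Indexes endpoint. Below is a minimal sketch using the requests library, assuming the definition above is saved locally as analyzer-comparison-lab.json and using placeholder service values:

import json
import requests

service_name = "your-service"   # placeholder: your search service name
admin_key = "your-admin-key"    # placeholder: your admin API key

# Load the index definition shown above (assumed saved to a local file)
with open("analyzer-comparison-lab.json") as f:
    index_definition = json.load(f)

# PUT creates the index (or updates it if it already exists)
response = requests.put(
    f"https://{service_name}.search.windows.net/indexes/analyzer-comparison-lab",
    headers={"Content-Type": "application/json", "api-key": admin_key},
    params={"api-version": "2024-07-01"},
    json=index_definition
)
print(response.status_code)  # expect 201 for a newly created index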

Step 2: Prepare Test Content

Create diverse test documents that highlight analyzer differences:

[
  {
    "id": "1",
    "content_standard": "The quick brown foxes are running through the forest",
    "content_english": "The quick brown foxes are running through the forest",
    "content_keyword": "The quick brown foxes are running through the forest",
    "content_simple": "The quick brown foxes are running through the forest",
    "content_whitespace": "The quick brown foxes are running through the forest"
  },
  {
    "id": "2", 
    "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_english": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text",
    "content_whitespace": "HTML <b>bold</b> and <i>italic</i> formatting with UPPERCASE text"
  },
  {
    "id": "3",
    "content_standard": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_english": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_keyword": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_simple": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com",
    "content_whitespace": "Email: user@example.com, Phone: (555) 123-4567, URL: https://www.example.com"
  }
]
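
To index these documents, post them to the Documents - Index endpoint. The sketch below assumes the documents above are saved as test-documents.json and reuses the placeholder service values from Step 1:

import json
import requests

service_name = "your-service"   # placeholder
admin_key = "your-admin-key"    # placeholder

# Load the test documents shown above (assumed saved to a local file)
with open("test-documents.json") as f:
    documents = json.load(f)

# Each document needs an @search.action; "upload" inserts or replaces by key
payload = {"value": [{**doc, "@search.action": "upload"} for doc in documents]}

response = requests.post(
    f"https://{service_name}.search.windows.net/indexes/analyzer-comparison-lab/docs/index",
    headers={"Content-Type": "application/json", "api-key": admin_key},
    params={"api-version": "2024-07-01"},
    json=payload
)
print(response.status_code)  # 200 when all documents succeed; per-document status is in the response body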

Step 3: Implement Analyzer Testing Framework

Create a testing framework that uses the Analyze API to compare how different analyzers process text:

Python Implementation

import requests
import json
from typing import Dict, List

class AnalyzerComparator:
    def __init__(self, service_name: str, admin_key: str, index_name: str):
        self.service_name = service_name
        self.admin_key = admin_key
        self.index_name = index_name
        self.endpoint = f"https://{service_name}.search.windows.net"

    def analyze_text(self, text: str, analyzer: str) -> List[str]:
        """Analyze text using specified analyzer."""
        url = f"{self.endpoint}/indexes/{self.index_name}/analyze"
        headers = {
            "Content-Type": "application/json",
            "api-key": self.admin_key
        }
        data = {
            "text": text,
            "analyzer": analyzer
        }

        response = requests.post(url, headers=headers, json=data,
                                 params={"api-version": "2024-07-01"})

        if response.status_code == 200:
            result = response.json()
            return [token["token"] for token in result.get("tokens", [])]
        else:
            print(f"Error analyzing with {analyzer}: {response.text}")
            return []

    def compare_analyzers(self, text: str, analyzers: List[str]) -> Dict[str, List[str]]:
        """Compare how different analyzers process the same text."""
        results = {}
        for analyzer in analyzers:
            results[analyzer] = self.analyze_text(text, analyzer)
        return results

    def print_comparison(self, text: str, analyzers: List[str]):
        """Print a formatted comparison of analyzer results."""
        print(f"\nInput text: {text}")
        print("-" * 60)

        results = self.compare_analyzers(text, analyzers)

        for analyzer, tokens in results.items():
            print(f"{analyzer:15} -> {tokens}")

        # Analysis
        print(f"\nAnalysis:")
        token_counts = {analyzer: len(tokens) for analyzer, tokens in results.items()}

        for analyzer, count in token_counts.items():
            print(f"  {analyzer}: {count} tokens")

        # Identify unique behaviors
        if "en.microsoft" in results and "standard.lucene" in results:
            english_tokens = set(results["en.microsoft"])
            standard_tokens = set(results["standard.lucene"])

            if len(english_tokens) < len(standard_tokens):
                removed = standard_tokens - english_tokens
                print(f"  English analyzer removed: {removed}")

        if "keyword" in results:
            print(f"  Keyword analyzer preserved entire input as single token")

# Usage example
comparator = AnalyzerComparator("your-service", "your-key", "analyzer-comparison-lab")

test_texts = [
    "The quick brown foxes are running",
    "HTML <b>bold</b> formatting",
    "user@example.com and admin@test.org"
]

analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

for text in test_texts:
    comparator.print_comparison(text, analyzers)

Step 4: Conduct Systematic Testing

Test each analyzer with different types of content; a sample test pass covering these categories follows the list:

  1. Basic Text Processing
     • Simple sentences
     • Punctuation handling
     • Case sensitivity

  2. Linguistic Features
     • Stemming behavior
     • Stop word removal
     • Pluralization

  3. Special Content
     • HTML tags
     • Email addresses
     • URLs and special characters
     • Numbers and dates

  4. Edge Cases
     • Empty strings
     • Very long text
     • Special Unicode characters

Step 5: Search Behavior Analysis

Compare search results using different analyzer fields:

def compare_search_results(search_client, query: str):
    """Compare search results across different analyzer fields."""

    fields_to_test = [
        ("content_standard", "Standard"),
        ("content_english", "English"),
        ("content_keyword", "Keyword"),
        ("content_simple", "Simple"),
        ("content_whitespace", "Whitespace")
    ]

    print(f"\nSearch query: '{query}'")
    print("=" * 50)

    for field, name in fields_to_test:
        try:
            results = list(search_client.search(
                search_text=query,
                search_fields=[field],
                select=["id", field],
                top=3
            ))

            print(f"\n{name} Analyzer ({field}):")
            if results:
                for i, result in enumerate(results, 1):
                    score = result.get("@search.score", 0)
                    content = result.get(field, "")[:50] + "..."
                    print(f"  {i}. Score: {score:.3f} - {content}")
            else:
                print("  No results found")

        except Exception as e:
            print(f"  Error: {e}")

✅ Validation

Expected Outcomes

Document your findings for each analyzer:

  1. Standard Analyzer (standard.lucene)
     • Tokenizes on whitespace and punctuation
     • Converts to lowercase
     • Removes most punctuation
     • Language-neutral processing

  2. English Analyzer (en.microsoft)
     • Includes stemming (running → run, foxes → fox)
     • Removes English stop words (the, and, or, etc.)
     • Advanced linguistic processing
     • Better for English content search

  3. Keyword Analyzer (keyword)
     • Treats the entire input as a single token
     • Preserves exact text, including case and punctuation
     • Ideal for exact-matching scenarios

  4. Simple Analyzer (simple)
     • Splits on non-letter characters
     • Converts to lowercase
     • No linguistic processing
     • Fast and lightweight

  5. Whitespace Analyzer (whitespace)
     • Splits only on whitespace
     • Preserves punctuation and case
     • Minimal processing

Validation Checklist

  • [ ] Index created successfully with all analyzer fields
  • [ ] Documents uploaded and indexed properly
  • [ ] Analyze API returns expected tokens for each analyzer
  • [ ] Search results differ appropriately between analyzers
  • [ ] Performance differences documented
  • [ ] Use case recommendations identified

Test Cases

Create a test suite that validates expected behavior:

test_cases = [
    {
        "input": "The quick brown foxes are running",
        "expected": {
            "standard.lucene": ["the", "quick", "brown", "foxes", "are", "running"],
            "en.microsoft": ["quick", "brown", "fox", "run"],  # Stemmed, stop words removed
            "keyword": ["The quick brown foxes are running"],
            "simple": ["the", "quick", "brown", "foxes", "are", "running"],
            "whitespace": ["The", "quick", "brown", "foxes", "are", "running"]
        }
    },
    {
        "input": "HTML <b>bold</b> text",
        "expected": {
            "standard.lucene": ["html", "b", "bold", "b", "text"],
            "keyword": ["HTML <b>bold</b> text"],
            "simple": ["html", "b", "bold", "b", "text"]
        }
    }
]

def validate_analyzer_behavior(comparator, test_cases):
    """Validate that analyzers behave as expected."""
    passed = 0
    failed = 0

    for test_case in test_cases:
        input_text = test_case["input"]
        expected = test_case["expected"]

        print(f"\nTesting: {input_text}")

        for analyzer, expected_tokens in expected.items():
            actual_tokens = comparator.analyze_text(input_text, analyzer)

            if actual_tokens == expected_tokens:
                print(f"  ✅ {analyzer}: PASS")
                passed += 1
            else:
                print(f"  ❌ {analyzer}: FAIL")
                print(f"     Expected: {expected_tokens}")
                print(f"     Actual: {actual_tokens}")
                failed += 1

    print(f"\nValidation Results: {passed} passed, {failed} failed")
    return failed == 0

🚀 Extensions

Extension 1: Language-Specific Testing

Test language-specific analyzers (fr.microsoft, de.microsoft, etc.) with multilingual content.
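
The Analyze API accepts any built-in analyzer name, so the comparator from Step 3 can exercise language analyzers directly; the sample sentences below are illustrative:

multilingual_samples = {
    "fr.microsoft": "Les renards bruns courent rapidement dans la forêt",
    "de.microsoft": "Die schnellen braunen Füchse laufen durch den Wald",
    "es.microsoft": "Los zorros marrones corren por el bosque",
}

for analyzer, sentence in multilingual_samples.items():
    tokens = comparator.analyze_text(sentence, analyzer)
    print(f"{analyzer:15} -> {tokens}")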

Extension 2: Performance Benchmarking

Measure and compare the round-trip latency of the Analyze API for different analyzers (network overhead is included, so treat the numbers as relative):

import time

def benchmark_analyzers(comparator, text, analyzers, iterations=100):
    """Benchmark analyzer performance."""
    results = {}

    for analyzer in analyzers:
        start_time = time.time()

        for _ in range(iterations):
            comparator.analyze_text(text, analyzer)

        end_time = time.time()
        avg_time = (end_time - start_time) / iterations * 1000  # ms
        results[analyzer] = avg_time

    print(f"\nPerformance Benchmark ({iterations} iterations):")
    for analyzer, avg_time in sorted(results.items(), key=lambda x: x[1]):
        print(f"  {analyzer}: {avg_time:.2f}ms average")

    return results

Extension 3: Custom Test Data

Create test data specific to your domain (e-commerce, legal, medical, etc.) and analyze how different analyzers handle domain-specific terminology.
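
As a starting point, a sketch with illustrative e-commerce terms (substitute vocabulary from your own domain):

domain_terms = [
    "SKU-12345-BLK wireless noise-cancelling headphones",
    "2-pack USB-C 100W charging cables",
    "Refurbished laptop (Grade A) - 15% off",
]

for term in domain_terms:
    comparator.print_comparison(term, ["standard.lucene", "en.microsoft", "keyword", "whitespace"])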

Extension 4: Visualization

Create charts or graphs showing the differences in token counts, processing times, and search result relevance across analyzers.
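
A minimal sketch using matplotlib (assumed available) to chart token counts per analyzer for a single input; the same pattern extends to benchmark timings:

import matplotlib.pyplot as plt

text = "The quick brown foxes are running through the forest"
analyzers = ["standard.lucene", "en.microsoft", "keyword", "simple", "whitespace"]

# Token counts from the Analyze API, via the comparator defined in Step 3
token_counts = {a: len(comparator.analyze_text(text, a)) for a in analyzers}

plt.bar(list(token_counts.keys()), list(token_counts.values()))
plt.ylabel("Token count")
plt.title("Tokens produced per analyzer")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()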

💡 Solutions

Key Insights

After completing this exercise, you should understand:

  1. When to use each analyzer:
     • Standard: general-purpose, multilingual content
     • English: English-language content requiring linguistic processing
     • Keyword: exact matching, IDs, codes
     • Simple: lightweight processing, tags, categories
     • Whitespace: preserving punctuation, minimal processing

  2. Trade-offs:
     • Processing complexity vs. performance
     • Search recall vs. precision
     • Index size vs. search capabilities

  3. Testing methodology:
     • Use representative data
     • Test edge cases
     • Measure both functionality and performance
     • Validate with real search scenarios

Common Pitfalls

  • Not testing with real data: Synthetic test data may not reveal real-world issues
  • Ignoring performance: Complex analyzers can impact indexing and query performance
  • Over-engineering: Reaching for complex or custom analyzers when a simple built-in one would do; start simple and add complexity only as needed
  • Inconsistent testing: Using different test data for different analyzers

Best Practices

  • Always test analyzers with your actual content
  • Consider both indexing and query performance
  • Document your findings and rationale for analyzer choices
  • Plan for future content types and languages
  • Monitor analyzer performance in production

Next Exercise: Custom Analyzer Workshop - Learn to build domain-specific analyzers with custom components.