Module 10: Practice Implementation - Analyzers & Scoring

Overview

This practice guide provides hands-on exercises to implement and test custom analyzers and scoring profiles in Azure AI Search. Each exercise builds upon previous concepts and includes validation steps.

Exercise 1: Built-in Analyzer Comparison

Objective

Compare different built-in analyzers to understand their behavior and choose appropriate analyzers for different content types.

Setup

Create a test index with multiple analyzer fields:

{
  "name": "analyzer-test-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    }
  ]
}

Test Data

{
  "value": [
    {
      "id": "1",
      "content_standard": "The quick brown foxes are running through the forest",
      "content_english": "The quick brown foxes are running through the forest",
      "content_keyword": "The quick brown foxes are running through the forest",
      "content_simple": "The quick brown foxes are running through the forest"
    },
    {
      "id": "2",
      "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_english": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting"
    }
  ]
}

Analysis Tasks

  1. Test Tokenization: Use the Analyze API to see how each analyzer processes text:
POST https://[service].search.windows.net/indexes/analyzer-test-index/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "The quick brown foxes are running",
  "analyzer": "standard.lucene"
}
  2. Compare Results: Test the same text with different analyzers and document differences
  3. Search Testing: Perform searches and compare result relevance
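The comparison task can be scripted by building one Analyze API call per analyzer. A minimal sketch that only constructs the request URL and body; the service name is a placeholder, and actually sending the request (with an HTTP library and an admin key) is left out:

```python
# Sketch: build an Analyze API request per analyzer so the same text
# can be compared side by side. "my-service" is a placeholder.

def build_analyze_request(service, index, text, analyzer, api_version="2024-07-01"):
    """Return (url, body) for a POST to the Analyze API."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/analyze?api-version={api_version}")
    body = {"text": text, "analyzer": analyzer}
    return url, body

ANALYZERS = ["standard.lucene", "en.microsoft", "keyword", "simple"]

requests_to_send = [
    build_analyze_request("my-service", "analyzer-test-index",
                          "The quick brown foxes are running", a)
    for a in ANALYZERS
]
```

Posting each body and collecting the returned tokens gives the raw data for the comparison table below.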

Expected Outcomes

  • Standard Analyzer: Lowercases, removes punctuation, basic tokenization
  • English Analyzer: Stemming (foxes → fox, running → run), stop word removal
  • Keyword Analyzer: Treats entire input as single token
  • Simple Analyzer: Basic lowercase and whitespace tokenization

Validation

Create a comparison table:

Analyzer          Input              Tokens                  Notes
standard.lucene   "running foxes"    ["running", "foxes"]    Basic processing
en.microsoft      "running foxes"    ["run", "fox"]          Stemming applied
keyword           "running foxes"    ["running foxes"]       Single token
simple            "Running Foxes"    ["running", "foxes"]    Lowercase only

Exercise 2: Custom Analyzer Creation

Objective

Build a custom analyzer for e-commerce product search that handles HTML content and applies domain-specific processing.

Custom Analyzer Definition

{
  "name": "ecommerce-analyzer-index",
  "analyzers": [
    {
      "name": "product_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "charFilters": ["html_strip", "product_mapping"],
      "tokenFilters": [
        "lowercase",
        "product_stopwords",
        "product_synonyms"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "product_mapping",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": [
        "& => and",
        "@ => at"
      ]
    }
  ],
  "tokenFilters": [
    {
      "name": "product_stopwords",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"]
    },
    {
      "name": "product_synonyms",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": [
        "laptop,notebook,computer",
        "phone,mobile,smartphone",
        "tv,television,monitor"
      ]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "productName",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    }
  ]
}

Test Implementation

  1. Create the Index: Deploy the custom analyzer configuration
  2. Test Analysis: Verify the analyzer processes text correctly
POST https://[service].search.windows.net/indexes/ecommerce-analyzer-index/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "<p>High-performance <b>laptop</b> & notebook computer</p>",
  "analyzer": "product_analyzer"
}
  3. Expected Output: approximately ["high", "performance", "laptop", "notebook", "computer"]. Note that "and" is removed by the product_stopwords filter, and the synonym filter expands "laptop", "notebook", and "computer" to one another

Sample Data

{
  "value": [
    {
      "id": "1",
      "productName": "Dell XPS 13 Laptop",
      "description": "<p>Ultra-thin <b>notebook</b> computer with high performance</p>"
    },
    {
      "id": "2",
      "productName": "iPhone 14 Pro",
      "description": "<div>Advanced <i>smartphone</i> with professional camera</div>"
    },
    {
      "id": "3",
      "productName": "Samsung 55\" Smart TV",
      "description": "4K <b>television</b> with streaming capabilities"
    }
  ]
}

Validation Tasks

  1. HTML Stripping: Verify HTML tags are removed
  2. Character Mapping: Confirm & becomes and
  3. Synonym Expansion: Test that searching for "laptop" finds "notebook" products
  4. Stop Word Removal: Verify common words are filtered out
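Before deploying the index, the analyzer's behavior can be roughly predicted offline. A sketch that approximates the pipeline above (html_strip, mapping, tokenization, lowercase, stop words); synonym expansion is omitted, and the regex tokenizer only approximates the service's standard tokenizer:

```python
import re

# Mirrors the product_mapping char filter and product_stopwords token filter above
MAPPINGS = {"&": "and", "@": "at"}
STOPWORDS = {"the", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}

def predict_tokens(text):
    """Rough local approximation of product_analyzer (no synonym expansion)."""
    text = re.sub(r"<[^>]+>", " ", text)               # html_strip
    for src, dst in MAPPINGS.items():                  # mapping char filter
        text = text.replace(src, f" {dst} ")
    tokens = re.findall(r"[A-Za-z0-9]+", text)         # standard-ish tokenization
    tokens = [t.lower() for t in tokens]               # lowercase filter
    return [t for t in tokens if t not in STOPWORDS]   # stop word removal

print(predict_tokens("<p>High-performance <b>laptop</b> & notebook computer</p>"))
```

Comparing this local prediction against the Analyze API output is a quick way to catch misconfigured filters.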

Exercise 3: N-gram Analyzer for Autocomplete

Objective

Implement an edge n-gram analyzer to enable autocomplete functionality.

Autocomplete Analyzer Configuration

{
  "name": "autocomplete-index",
  "analyzers": [
    {
      "name": "autocomplete_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "autocomplete_tokenizer",
      "tokenFilters": ["lowercase"]
    },
    {
      "name": "search_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [
    {
      "name": "autocomplete_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram": 2,
      "maxGram": 25,
      "tokenChars": ["letter", "digit"]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "indexAnalyzer": "autocomplete_analyzer",
      "searchAnalyzer": "search_analyzer",
      "searchable": true
    }
  ]
}

Test Data

{
  "value": [
    {
      "id": "1",
      "title": "Machine Learning Fundamentals"
    },
    {
      "id": "2", 
      "title": "Deep Learning with Python"
    },
    {
      "id": "3",
      "title": "Natural Language Processing"
    }
  ]
}

Testing Autocomplete

  1. Analyze Indexing: See how text is tokenized for indexing:
POST https://[service].search.windows.net/indexes/autocomplete-index/analyze?api-version=2024-07-01
Content-Type: application/json
api-key: [admin-key]

{
  "text": "Machine Learning",
  "analyzer": "autocomplete_analyzer"
}

Expected tokens: ["ma", "mac", "mach", "machi", "machin", "machine", "le", "lea", "lear", "learn", "learni", "learnin", "learning"]

  2. Test Autocomplete Queries:
POST https://[service].search.windows.net/indexes/autocomplete-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "mach",
  "searchFields": "title"
}

Validation

  • Partial matches work (searching "mach" finds "Machine Learning")
  • Performance is acceptable for autocomplete scenarios
  • Index size increase is manageable
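The index-size impact of edge n-grams can be estimated offline by reproducing the tokenizer's output locally. A sketch mirroring the minGram/maxGram settings above (an approximation of the service tokenizer, splitting only on whitespace):

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    """Edge n-grams per word, mirroring the autocomplete_tokenizer settings."""
    grams = []
    for word in text.lower().split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])
    return grams

print(edge_ngrams("Machine Learning"))
```

Running this over a sample of titles and comparing total gram count to word count gives a rough multiplier for expected index growth.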

Exercise 4: Basic Scoring Profile

Objective

Create a scoring profile that weights different fields and applies magnitude boosting.

Scoring Profile Configuration

{
  "name": "content-scoring-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "category",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "viewCount",
      "type": "Edm.Int32",
      "filterable": true
    },
    {
      "name": "publishDate",
      "type": "Edm.DateTimeOffset",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "content_relevance",
      "text": {
        "weights": {
          "title": 4.0,
          "content": 1.0,
          "category": 2.0
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "magnitude",
          "fieldName": "viewCount",
          "boost": 1.5,
          "interpolation": "logarithmic",
          "magnitude": {
            "boostingRangeStart": 0,
            "boostingRangeEnd": 10000,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "freshness",
          "fieldName": "publishDate",
          "boost": 1.3,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P30D"
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}

Test Data

{
  "value": [
    {
      "id": "1",
      "title": "Introduction to Machine Learning",
      "content": "Machine learning is a powerful subset of artificial intelligence...",
      "category": "Technology",
      "rating": 4.5,
      "viewCount": 1250,
      "publishDate": "2024-01-15T10:00:00Z"
    },
    {
      "id": "2",
      "title": "Advanced Machine Learning Techniques",
      "content": "Deep dive into advanced machine learning algorithms...",
      "category": "Technology",
      "rating": 4.8,
      "viewCount": 850,
      "publishDate": "2024-02-20T14:30:00Z"
    },
    {
      "id": "3",
      "title": "Machine Learning in Practice",
      "content": "Practical applications of machine learning in business...",
      "category": "Business",
      "rating": 3.9,
      "viewCount": 2100,
      "publishDate": "2023-12-10T09:15:00Z"
    }
  ]
}

Testing Scoring

  1. Default Scoring:
POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "machine learning",
  "select": "id,title,rating,viewCount,publishDate",
  "top": 10
}
  2. Custom Scoring Profile:
POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "machine learning",
  "scoringProfile": "content_relevance",
  "select": "id,title,rating,viewCount,publishDate",
  "top": 10
}
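Once both queries return, rank movement between the default and profiled result lists can be tabulated with a small helper. The id orderings in the example call are hypothetical:

```python
def rank_changes(default_ids, profile_ids):
    """Map each document id to (default_rank, profile_rank) to show movement."""
    d = {doc_id: i + 1 for i, doc_id in enumerate(default_ids)}
    p = {doc_id: i + 1 for i, doc_id in enumerate(profile_ids)}
    return {doc_id: (d.get(doc_id), p.get(doc_id)) for doc_id in set(d) | set(p)}

# Hypothetical orderings from the two queries above
print(rank_changes(["1", "2", "3"], ["2", "1", "3"]))
```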

Analysis Tasks

  1. Compare Rankings: Document how result order changes with the scoring profile
  2. Score Analysis: Examine the "@search.score" value returned with each result to compare relevance scores
  3. Field Weight Impact: Test queries that match different fields
  4. Function Impact: Analyze how rating, view count, and freshness affect scores

Expected Observations

  • Higher-rated content should rank higher
  • Recent content gets freshness boost
  • Popular content (high view count) gets magnitude boost
  • Title matches score higher than content matches
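To build intuition for how the linear magnitude function shifts scores, the boost multiplier can be sketched locally. This mirrors the documented intent (interpolate from no boost at the range start to the full boost at the range end); the service's exact internal formula may differ:

```python
def magnitude_boost(value, range_start, range_end, boost,
                    constant_beyond_range=True):
    """Hedged sketch of a linear magnitude boost multiplier."""
    if value <= range_start:
        return 1.0                                    # below range: no boost
    if value >= range_end:
        return boost if constant_beyond_range else 1.0  # beyond range
    fraction = (value - range_start) / (range_end - range_start)
    return 1.0 + (boost - 1.0) * fraction             # linear interpolation

# rating 4.5 in the 1..5 range with boost 2.0
print(magnitude_boost(4.5, 1, 5, 2.0))  # → 1.875
```

Plugging in the sample documents' ratings and view counts shows why document 2 (rating 4.8) gains more from the rating function while document 3 (2100 views) gains more from the view-count function.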

Exercise 5: Advanced Scoring with Distance

Objective

Implement location-based scoring for a restaurant search scenario.

Location-Based Index

{
  "name": "restaurant-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "name",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "cuisine",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "location",
      "type": "Edm.GeographyPoint",
      "filterable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "priceRange",
      "type": "Edm.Int32",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "location_relevance",
      "text": {
        "weights": {
          "name": 3.0,
          "cuisine": 2.0,
          "description": 1.0
        }
      },
      "functions": [
        {
          "type": "distance",
          "fieldName": "location",
          "boost": 2.0,
          "interpolation": "linear",
          "distance": {
            "referencePointParameter": "userLocation",
            "boostingDistance": 5
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 1.5,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}

Test Data

{
  "value": [
    {
      "id": "1",
      "name": "Mario's Italian Kitchen",
      "cuisine": "Italian",
      "description": "Authentic Italian cuisine with fresh pasta",
      "location": {
        "type": "Point",
        "coordinates": [-122.131577, 47.678581]
      },
      "rating": 4.5,
      "priceRange": 3
    },
    {
      "id": "2",
      "name": "Sakura Sushi",
      "cuisine": "Japanese",
      "description": "Fresh sushi and traditional Japanese dishes",
      "location": {
        "type": "Point",
        "coordinates": [-122.135577, 47.680581]
      },
      "rating": 4.8,
      "priceRange": 4
    }
  ]
}

Test Query

POST https://[service].search.windows.net/indexes/restaurant-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "italian",
  "scoringProfile": "location_relevance",
  "scoringParameters": ["userLocation--122.133577,47.679581"],
  "select": "id,name,cuisine,rating,location",
  "top": 10
}

Validation

  • Restaurants closer to user location rank higher
  • High-rated restaurants get additional boost
  • Distance function works correctly with geographic coordinates
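As a sanity check, the great-circle distance from the userLocation reference point to each sample restaurant can be computed locally to confirm both fall well inside the 5 km boostingDistance:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers (haversine formula)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# (longitude, latitude), matching the coordinate order in the index documents
user = (-122.133577, 47.679581)
marios = (-122.131577, 47.678581)
sakura = (-122.135577, 47.680581)

print(haversine_km(*user, *marios), haversine_km(*user, *sakura))
```

Both sample restaurants sit a few hundred meters from the reference point, so each should receive nearly the full distance boost and ranking differences come mostly from text weights and rating.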

Exercise 6: Performance Testing and Optimization

Objective

Measure and optimize analyzer and scoring profile performance.

Performance Test Setup

  1. Create Large Test Dataset: Generate 10,000+ documents
  2. Measure Indexing Performance: Time document indexing with different analyzers
  3. Measure Query Performance: Test query latency with different scoring profiles
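For step 1, a deterministic generator can produce the synthetic corpus. A sketch using a fixed seed so benchmark runs are repeatable; the vocabulary and field names are illustrative:

```python
import random

def generate_test_documents(count, seed=42):
    """Generate synthetic documents for indexing/query benchmarks."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    vocab = ["machine", "learning", "search", "index", "analyzer",
             "scoring", "cloud", "azure", "data", "model"]
    docs = []
    for i in range(count):
        words = rng.choices(vocab, k=20)
        docs.append({
            "id": str(i),
            "title": " ".join(words[:4]).title(),
            "content": " ".join(words),
        })
    return docs

docs = generate_test_documents(10000)
print(len(docs), docs[0]["id"])
```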

Performance Testing Script (Python)

import time
import statistics
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

def measure_indexing_performance(search_client, documents, analyzer_name):
    """Measure indexing performance for specific analyzer"""
    start_time = time.time()

    try:
        result = search_client.upload_documents(documents)
        end_time = time.time()

        duration = end_time - start_time
        docs_per_second = len(documents) / duration

        return {
            'analyzer': analyzer_name,
            'duration': duration,
            'docs_per_second': docs_per_second,
            'success_count': len([r for r in result if r.succeeded])
        }
    except Exception as e:
        return {'error': str(e)}

def measure_query_performance(search_client, query, scoring_profile=None, iterations=10):
    """Measure query performance"""
    latencies = []

    for _ in range(iterations):
        start_time = time.time()

        search_params = {'search_text': query}
        if scoring_profile:
            search_params['scoring_profile'] = scoring_profile

        results = search_client.search(**search_params)
        list(results)  # Force execution

        end_time = time.time()
        latencies.append((end_time - start_time) * 1000)  # Convert to ms

    return {
        'query': query,
        'scoring_profile': scoring_profile,
        'avg_latency_ms': statistics.mean(latencies),
        'min_latency_ms': min(latencies),
        'max_latency_ms': max(latencies),
        'std_dev_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0
    }

# Example usage (assumes `client` is an initialized SearchClient and
# `test_docs` is a list of documents prepared for upload)
def run_performance_tests():
    # Test different analyzers
    analyzers = ['standard.lucene', 'en.microsoft', 'custom_analyzer']

    for analyzer in analyzers:
        perf = measure_indexing_performance(client, test_docs, analyzer)
        if 'error' not in perf:
            print(f"Analyzer {analyzer}: {perf['docs_per_second']:.2f} docs/sec")

    # Test scoring profiles (None = default scoring)
    profiles = [None, 'content_relevance', 'location_relevance']

    for profile in profiles:
        perf = measure_query_performance(client, "machine learning", profile)
        print(f"Profile {profile}: {perf['avg_latency_ms']:.2f}ms avg")

Optimization Strategies

  1. Analyzer Optimization:
     • Use simpler analyzers for less critical fields
     • Implement separate index/search analyzers
     • Remove unnecessary token filters
  2. Scoring Profile Optimization:
     • Reduce the number of scoring functions
     • Use appropriate interpolation methods
     • Balance boost values
  3. Index Design Optimization:
     • Selective field analysis
     • Appropriate field attributes
     • Efficient data types

Exercise 7: A/B Testing Framework

Objective

Implement A/B testing to compare different analyzer and scoring configurations.

A/B Testing Implementation

import random
import time
import statistics
from datetime import datetime

class ABTestFramework:
    def __init__(self, search_client):
        self.search_client = search_client
        self.test_results = []

    def run_ab_test(self, query, config_a, config_b, test_queries, user_sessions=100):
        """Run A/B test comparing two configurations"""

        results = {
            'config_a': {'queries': [], 'metrics': {}},
            'config_b': {'queries': [], 'metrics': {}}
        }

        for session in range(user_sessions):
            # Randomly assign to A or B group
            config = config_a if random.random() < 0.5 else config_b
            group = 'config_a' if config == config_a else 'config_b'

            # Run test query
            query_result = self.execute_search(query, config)
            results[group]['queries'].append(query_result)

        # Calculate metrics
        for group in ['config_a', 'config_b']:
            results[group]['metrics'] = self.calculate_metrics(results[group]['queries'])

        return results

    def execute_search(self, query, config):
        """Execute search with specific configuration"""
        search_params = {
            'search_text': query,
            'top': 10
        }

        if 'scoring_profile' in config:
            search_params['scoring_profile'] = config['scoring_profile']

        if 'scoring_parameters' in config:
            search_params['scoring_parameters'] = config['scoring_parameters']

        start_time = time.time()
        results = list(self.search_client.search(**search_params))
        end_time = time.time()

        return {
            'query': query,
            'results': results,
            'latency': (end_time - start_time) * 1000,
            'result_count': len(results),
            'timestamp': datetime.now()
        }

    def calculate_metrics(self, query_results):
        """Calculate performance metrics"""
        latencies = [r['latency'] for r in query_results]
        result_counts = [r['result_count'] for r in query_results]

        return {
            'avg_latency': statistics.mean(latencies),
            'avg_results': statistics.mean(result_counts),
            'total_queries': len(query_results),
            'success_rate': len([r for r in query_results if r['result_count'] > 0]) / len(query_results)
        }

# Example A/B test (assumes `search_client` is an initialized SearchClient)
def run_scoring_ab_test():
    ab_tester = ABTestFramework(search_client)

    config_a = {'scoring_profile': 'content_relevance'}
    config_b = {'scoring_profile': 'enhanced_relevance'}

    test_queries = [
        "machine learning",
        "data science",
        "artificial intelligence",
        "python programming"
    ]

    for query in test_queries:
        results = ab_tester.run_ab_test(query, config_a, config_b, test_queries)

        print(f"Query: {query}")
        print(f"Config A - Avg Latency: {results['config_a']['metrics']['avg_latency']:.2f}ms")
        print(f"Config B - Avg Latency: {results['config_b']['metrics']['avg_latency']:.2f}ms")
        print(f"Config A - Success Rate: {results['config_a']['metrics']['success_rate']:.2%}")
        print(f"Config B - Success Rate: {results['config_b']['metrics']['success_rate']:.2%}")
        print("---")

Validation and Assessment

Completion Checklist

  • [ ] Exercise 1: Successfully compared built-in analyzers
  • [ ] Exercise 2: Created and tested custom analyzer
  • [ ] Exercise 3: Implemented n-gram analyzer for autocomplete
  • [ ] Exercise 4: Built basic scoring profile with field weights and functions
  • [ ] Exercise 5: Implemented location-based scoring
  • [ ] Exercise 6: Conducted performance testing and optimization
  • [ ] Exercise 7: Set up A/B testing framework

Assessment Criteria

  1. Technical Implementation (40%)
     • Correct analyzer and scoring profile syntax
     • Proper use of Azure AI Search APIs
     • Error handling and validation
  2. Performance Optimization (30%)
     • Measured performance impact
     • Applied optimization strategies
     • Balanced functionality vs. performance
  3. Testing and Validation (20%)
     • Comprehensive test coverage
     • Proper use of Analyze API
     • A/B testing implementation
  4. Documentation and Analysis (10%)
     • Clear documentation of configurations
     • Analysis of results and trade-offs
     • Recommendations for production use

Next Steps

After completing these exercises:

  1. Apply to Real Data: Implement analyzers and scoring for your actual use case
  2. Monitor Production: Set up monitoring for performance and relevance
  3. Iterate and Improve: Use A/B testing to continuously optimize
  4. Advanced Topics: Explore semantic search and vector search capabilities

These practical exercises provide hands-on experience with the core concepts of text analysis and scoring in Azure AI Search, preparing you for real-world implementation scenarios.