# Module 10: Practice Implementation - Analyzers & Scoring
## Overview
This practice guide provides hands-on exercises to implement and test custom analyzers and scoring profiles in Azure AI Search. Each exercise builds upon previous concepts and includes validation steps.
## Exercise 1: Built-in Analyzer Comparison
### Objective
Compare different built-in analyzers to understand their behavior and choose appropriate analyzers for different content types.
### Setup

Create a test index with multiple analyzer fields:

```json
{
  "name": "analyzer-test-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content_standard",
      "type": "Edm.String",
      "analyzer": "standard.lucene",
      "searchable": true
    },
    {
      "name": "content_english",
      "type": "Edm.String",
      "analyzer": "en.microsoft",
      "searchable": true
    },
    {
      "name": "content_keyword",
      "type": "Edm.String",
      "analyzer": "keyword",
      "searchable": true
    },
    {
      "name": "content_simple",
      "type": "Edm.String",
      "analyzer": "simple",
      "searchable": true
    }
  ]
}
```
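If you prefer to script index creation rather than use the portal, a minimal sketch using the `requests` library is below. The service URL, admin key, and the local file name are placeholders you would substitute with your own values:

```python
import json
import requests

SERVICE = "https://<service-name>.search.windows.net"  # assumption: your service endpoint
ADMIN_KEY = "<admin-key>"                              # assumption: your admin API key
API_VERSION = "2024-07-01"

def create_index(index_definition: dict) -> None:
    """Create (or overwrite) an index from a JSON definition via the REST API."""
    name = index_definition["name"]
    response = requests.put(
        f"{SERVICE}/indexes/{name}?api-version={API_VERSION}",
        headers={"Content-Type": "application/json", "api-key": ADMIN_KEY},
        data=json.dumps(index_definition),
    )
    response.raise_for_status()

# Assumption: the definition above is saved locally as analyzer-test-index.json
with open("analyzer-test-index.json") as f:
    create_index(json.load(f))
```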
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "content_standard": "The quick brown foxes are running through the forest",
      "content_english": "The quick brown foxes are running through the forest",
      "content_keyword": "The quick brown foxes are running through the forest",
      "content_simple": "The quick brown foxes are running through the forest"
    },
    {
      "id": "2",
      "content_standard": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_english": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_keyword": "HTML <b>bold</b> and <i>italic</i> formatting",
      "content_simple": "HTML <b>bold</b> and <i>italic</i> formatting"
    }
  ]
}
```
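To load this data, a minimal sketch using the Python SDK's `upload_documents` (which sets the indexing action for you); the service name and key are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <admin-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="analyzer-test-index",
    credential=AzureKeyCredential("<admin-key>"),
)

sentences = {
    "1": "The quick brown foxes are running through the forest",
    "2": "HTML <b>bold</b> and <i>italic</i> formatting",
}
# Copy the same text into every analyzer field so the comparison is fair.
fields = ("content_standard", "content_english", "content_keyword", "content_simple")
docs = [{"id": doc_id, **{f: text for f in fields}} for doc_id, text in sentences.items()]

result = client.upload_documents(documents=docs)
print([(r.key, r.succeeded) for r in result])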
### Analysis Tasks

- Test Tokenization: Use the Analyze API to see how each analyzer processes text:

  ```http
  POST https://[service].search.windows.net/indexes/analyzer-test-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "The quick brown foxes are running",
    "analyzer": "standard.lucene"
  }
  ```

- Compare Results: Run the same text through each analyzer and document the differences (a helper that automates this comparison appears after this list)
- Search Testing: Perform searches and compare result relevance
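To build the comparison quickly, you can drive the Analyze API from Python instead of issuing four requests by hand. A minimal sketch, assuming the azure-search-documents package and placeholder service values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import AnalyzeTextOptions

# Assumptions: <service-name> and <admin-key> are placeholders for your service.
index_client = SearchIndexClient(
    endpoint="https://<service-name>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

text = "The quick brown foxes are running"
for analyzer in ["standard.lucene", "en.microsoft", "keyword", "simple"]:
    result = index_client.analyze_text(
        "analyzer-test-index",
        AnalyzeTextOptions(text=text, analyzer_name=analyzer),
    )
    # Print the token stream each analyzer produces for the same input
    print(f"{analyzer}: {[t.token for t in result.tokens]}")
```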
### Expected Outcomes
- Standard Analyzer: Lowercases, removes punctuation, basic tokenization
- English Analyzer: Stemming (foxes → fox, running → run), stop word removal
- Keyword Analyzer: Treats entire input as single token
- Simple Analyzer: Basic lowercase and whitespace tokenization
### Validation
Create a comparison table:
| Analyzer | Input | Tokens | Notes |
|---|---|---|---|
| standard.lucene | "running foxes" | ["running", "foxes"] | Basic processing |
| en.microsoft | "running foxes" | ["run", "fox"] | Stemming applied |
| keyword | "running foxes" | ["running foxes"] | Single token |
| simple | "Running Foxes" | ["running", "foxes"] | Lowercase only |
## Exercise 2: Custom Analyzer Creation
### Objective
Build a custom analyzer for e-commerce product search that handles HTML content and applies domain-specific processing.
### Custom Analyzer Definition

Note that `html_strip` is a predefined character filter, so it is referenced by name and needs no custom definition; only the mapping filter is declared. Custom analyzers also require the `#Microsoft.Azure.Search.CustomAnalyzer` type:

```json
{
  "name": "ecommerce-analyzer-index",
  "analyzers": [
    {
      "name": "product_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "charFilters": ["html_strip", "product_mapping"],
      "tokenFilters": [
        "lowercase",
        "product_stopwords",
        "product_synonyms"
      ]
    }
  ],
  "charFilters": [
    {
      "name": "product_mapping",
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "mappings": [
        "&=>and",
        "@=>at"
      ]
    }
  ],
  "tokenFilters": [
    {
      "name": "product_stopwords",
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "stopwords": ["the", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"]
    },
    {
      "name": "product_synonyms",
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "synonyms": [
        "laptop,notebook,computer",
        "phone,mobile,smartphone",
        "tv,television,monitor"
      ]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "productName",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "analyzer": "product_analyzer",
      "searchable": true
    }
  ]
}
```
### Test Implementation

- Create the Index: Deploy the custom analyzer configuration
- Test Analysis: Verify the analyzer processes text correctly:

  ```http
  POST https://[service].search.windows.net/indexes/ecommerce-analyzer-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "<p>High-performance <b>laptop</b> & notebook computer</p>",
    "analyzer": "product_analyzer"
  }
  ```

- Expected Output: roughly `["high", "performance", "laptop", "notebook", "computer"]`. The `&` is first mapped to `and` and then removed by `product_stopwords`; the synonym filter also emits additional position-overlapping tokens for the laptop/notebook/computer group.
### Sample Data

```json
{
  "value": [
    {
      "id": "1",
      "productName": "Dell XPS 13 Laptop",
      "description": "<p>Ultra-thin <b>notebook</b> computer with high performance</p>"
    },
    {
      "id": "2",
      "productName": "iPhone 14 Pro",
      "description": "<div>Advanced <i>smartphone</i> with professional camera</div>"
    },
    {
      "id": "3",
      "productName": "Samsung 55\" Smart TV",
      "description": "4K <b>television</b> with streaming capabilities"
    }
  ]
}
```
### Validation Tasks

- HTML Stripping: Verify HTML tags are removed
- Character Mapping: Confirm `&` becomes `and` (note that the mapped `and` is then removed by `product_stopwords`, so inspect the intermediate step by temporarily dropping that filter if needed)
- Synonym Expansion: Test that searching for "laptop" finds "notebook" products (see the sketch after this list)
- Stop Word Removal: Verify common words are filtered out
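A minimal sketch of the synonym check using the Python SDK, with placeholder service values; if the synonym filter works, the query should return the Dell product even though "laptop" only appears in its name, not the description:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="ecommerce-analyzer-index",
    credential=AzureKeyCredential("<query-key>"),
)

# "laptop", "notebook", and "computer" are equivalent synonyms in product_analyzer,
# so searching any one of them should match documents containing the others.
for doc in client.search(search_text="laptop", search_fields=["productName", "description"]):
    print(doc["id"], doc["productName"])
```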
## Exercise 3: N-gram Analyzer for Autocomplete

### Objective
Implement an edge n-gram analyzer to enable autocomplete functionality.
### Autocomplete Analyzer Configuration

The separate index and search analyzers are the key here: documents are indexed as edge n-grams, while query text is tokenized normally, so a typed prefix like "mach" matches an indexed gram directly.

```json
{
  "name": "autocomplete-index",
  "analyzers": [
    {
      "name": "autocomplete_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "autocomplete_tokenizer",
      "tokenFilters": ["lowercase"]
    },
    {
      "name": "search_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": ["lowercase"]
    }
  ],
  "tokenizers": [
    {
      "name": "autocomplete_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram": 2,
      "maxGram": 25,
      "tokenChars": ["letter", "digit"]
    }
  ],
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "indexAnalyzer": "autocomplete_analyzer",
      "searchAnalyzer": "search_analyzer",
      "searchable": true
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "title": "Machine Learning Fundamentals"
    },
    {
      "id": "2",
      "title": "Deep Learning with Python"
    },
    {
      "id": "3",
      "title": "Natural Language Processing"
    }
  ]
}
```
### Testing Autocomplete

- Analyze Indexing: See how text is tokenized for indexing:

  ```http
  POST https://[service].search.windows.net/indexes/autocomplete-index/analyze?api-version=2024-07-01
  Content-Type: application/json
  api-key: [admin-key]

  {
    "text": "Machine Learning",
    "analyzer": "autocomplete_analyzer"
  }
  ```

  Expected tokens: `["ma", "mac", "mach", "machi", "machin", "machine", "le", "lea", "lear", "learn", "learni", "learnin", "learning"]`

- Test Autocomplete Queries (a small type-ahead simulation follows this list):

  ```http
  POST https://[service].search.windows.net/indexes/autocomplete-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "mach",
    "searchFields": "title"
  }
  ```
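To see the experience end to end, you can replay a user typing progressively longer prefixes. A minimal sketch, assuming the index and documents above exist and using placeholder service values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="autocomplete-index",
    credential=AzureKeyCredential("<query-key>"),
)

# Simulate a user typing; each prefix should keep matching "Machine Learning Fundamentals".
for prefix in ["ma", "mac", "mach", "machine l"]:
    results = client.search(search_text=prefix, search_fields=["title"], top=5)
    print(f"{prefix!r} -> {[doc['title'] for doc in results]}")
```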
### Validation
- Partial matches work (searching "mach" finds "Machine Learning")
- Performance is acceptable for autocomplete scenarios
- Index size increase is manageable
## Exercise 4: Basic Scoring Profile

### Objective
Create a scoring profile that weights different fields and applies magnitude boosting.
### Scoring Profile Configuration

```json
{
  "name": "content-scoring-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "category",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "viewCount",
      "type": "Edm.Int32",
      "filterable": true
    },
    {
      "name": "publishDate",
      "type": "Edm.DateTimeOffset",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "content_relevance",
      "text": {
        "weights": {
          "title": 4.0,
          "content": 1.0,
          "category": 2.0
        }
      },
      "functions": [
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 2.0,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "magnitude",
          "fieldName": "viewCount",
          "boost": 1.5,
          "interpolation": "logarithmic",
          "magnitude": {
            "boostingRangeStart": 0,
            "boostingRangeEnd": 10000,
            "constantBoostBeyondRange": true
          }
        },
        {
          "type": "freshness",
          "fieldName": "publishDate",
          "boost": 1.3,
          "interpolation": "linear",
          "freshness": {
            "boostingDuration": "P30D"
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "title": "Introduction to Machine Learning",
      "content": "Machine learning is a powerful subset of artificial intelligence...",
      "category": "Technology",
      "rating": 4.5,
      "viewCount": 1250,
      "publishDate": "2024-01-15T10:00:00Z"
    },
    {
      "id": "2",
      "title": "Advanced Machine Learning Techniques",
      "content": "Deep dive into advanced machine learning algorithms...",
      "category": "Technology",
      "rating": 4.8,
      "viewCount": 850,
      "publishDate": "2024-02-20T14:30:00Z"
    },
    {
      "id": "3",
      "title": "Machine Learning in Practice",
      "content": "Practical applications of machine learning in business...",
      "category": "Business",
      "rating": 3.9,
      "viewCount": 2100,
      "publishDate": "2023-12-10T09:15:00Z"
    }
  ]
}
```
### Testing Scoring

- Default Scoring:

  ```http
  POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "machine learning",
    "select": "id,title,rating,viewCount,publishDate",
    "top": 10
  }
  ```

- Custom Scoring Profile (a side-by-side comparison sketch follows this list):

  ```http
  POST https://[service].search.windows.net/indexes/content-scoring-index/docs/search?api-version=2024-07-01
  Content-Type: application/json
  api-key: [query-key]

  {
    "search": "machine learning",
    "scoringProfile": "content_relevance",
    "select": "id,title,rating,viewCount,publishDate",
    "top": 10
  }
  ```
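A minimal sketch that runs both queries through the Python SDK and prints document IDs with their `@search.score`, so the ranking change is easy to eyeball; service values are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<query-key>"),
)

def ranking(scoring_profile=None):
    """Return (id, score) pairs in ranked order for the same query."""
    results = client.search(
        search_text="machine learning",
        scoring_profile=scoring_profile,
        select=["id", "title"],
        top=10,
    )
    return [(doc["id"], round(doc["@search.score"], 3)) for doc in results]

print("default:          ", ranking())
print("content_relevance:", ranking("content_relevance"))
```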
### Analysis Tasks

- Compare Rankings: Document how result order changes with the scoring profile
- Score Analysis: Inspect the `@search.score` value returned with each result to quantify the differences
- Field Weight Impact: Test queries that match different fields
- Function Impact: Analyze how rating, view count, and freshness affect scores
### Expected Observations
- Higher-rated content should rank higher
- Recent content gets freshness boost
- Popular content (high view count) gets magnitude boost
- Title matches score higher than content matches
## Exercise 5: Advanced Scoring with Distance

### Objective
Implement location-based scoring for a restaurant search scenario.
### Location-Based Index

```json
{
  "name": "restaurant-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "name",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "cuisine",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true
    },
    {
      "name": "description",
      "type": "Edm.String",
      "searchable": true
    },
    {
      "name": "location",
      "type": "Edm.GeographyPoint",
      "filterable": true
    },
    {
      "name": "rating",
      "type": "Edm.Double",
      "filterable": true
    },
    {
      "name": "priceRange",
      "type": "Edm.Int32",
      "filterable": true
    }
  ],
  "scoringProfiles": [
    {
      "name": "location_relevance",
      "text": {
        "weights": {
          "name": 3.0,
          "cuisine": 2.0,
          "description": 1.0
        }
      },
      "functions": [
        {
          "type": "distance",
          "fieldName": "location",
          "boost": 2.0,
          "interpolation": "linear",
          "distance": {
            "referencePointParameter": "userLocation",
            "boostingDistance": 5
          }
        },
        {
          "type": "magnitude",
          "fieldName": "rating",
          "boost": 1.5,
          "interpolation": "linear",
          "magnitude": {
            "boostingRangeStart": 1,
            "boostingRangeEnd": 5
          }
        }
      ],
      "functionAggregation": "sum"
    }
  ]
}
```
### Test Data

```json
{
  "value": [
    {
      "id": "1",
      "name": "Mario's Italian Kitchen",
      "cuisine": "Italian",
      "description": "Authentic Italian cuisine with fresh pasta",
      "location": {
        "type": "Point",
        "coordinates": [-122.131577, 47.678581]
      },
      "rating": 4.5,
      "priceRange": 3
    },
    {
      "id": "2",
      "name": "Sakura Sushi",
      "cuisine": "Japanese",
      "description": "Fresh sushi and traditional Japanese dishes",
      "location": {
        "type": "Point",
        "coordinates": [-122.135577, 47.680581]
      },
      "rating": 4.8,
      "priceRange": 4
    }
  ]
}
```
### Location-Based Search

Scoring parameters use the `name-value` format, so the separator dash comes directly before the (negative) longitude. The `boostingDistance` of 5 defined above is measured in kilometers from this reference point.

```http
POST https://[service].search.windows.net/indexes/restaurant-index/docs/search?api-version=2024-07-01
Content-Type: application/json
api-key: [query-key]

{
  "search": "italian",
  "scoringProfile": "location_relevance",
  "scoringParameters": ["userLocation--122.133577,47.679581"],
  "select": "id,name,cuisine,rating,location",
  "top": 10
}
```
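The same query via the Python SDK, where scoring parameters are passed as a list of strings in the same `name-value` form; service values are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: <service-name> and <query-key> are placeholders for your service.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="restaurant-index",
    credential=AzureKeyCredential("<query-key>"),
)

results = client.search(
    search_text="italian",
    scoring_profile="location_relevance",
    scoring_parameters=["userLocation--122.133577,47.679581"],
    select=["id", "name", "cuisine", "rating"],
    top=10,
)
for doc in results:
    print(round(doc["@search.score"], 3), doc["name"])
```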
### Validation
- Restaurants closer to user location rank higher
- High-rated restaurants get additional boost
- Distance function works correctly with geographic coordinates
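To double-check the distance boost against ground truth, you can compare the service's ranking with raw great-circle distances; for test data at this scale a small haversine helper is enough:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers (sufficient for a sanity check)."""
    dlon, dlat = radians(lon2 - lon1), radians(lat2 - lat1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Coordinates from the test data and the userLocation scoring parameter above.
user = (-122.133577, 47.679581)
restaurants = {
    "Mario's Italian Kitchen": (-122.131577, 47.678581),
    "Sakura Sushi": (-122.135577, 47.680581),
}
for name, (lon, lat) in restaurants.items():
    print(name, round(haversine_km(*user, lon, lat), 3), "km")
```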
## Exercise 6: Performance Testing and Optimization

### Objective
Measure and optimize analyzer and scoring profile performance.
### Performance Test Setup

- Create Large Test Dataset: Generate 10,000+ documents (a simple generator sketch follows this list)
- Measure Indexing Performance: Time document indexing with different analyzers
- Measure Query Performance: Test query latency with different scoring profiles
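A few lines of Python can fabricate a corpus that fits the Exercise 4 schema. A sketch, with the field names assumed from `content-scoring-index` and placeholder service values:

```python
import random

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: placeholders for your service; schema matches content-scoring-index.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<admin-key>"),
)

TOPICS = ["machine learning", "data science", "search relevance", "cloud architecture"]

def generate_documents(count=10_000):
    for i in range(count):
        topic = random.choice(TOPICS)
        yield {
            "id": str(i),
            "title": f"{topic.title()} Article {i}",
            "content": f"An article about {topic}. " * 20,
            "category": random.choice(["Technology", "Business"]),
            "rating": round(random.uniform(1.0, 5.0), 1),
            "viewCount": random.randint(0, 10_000),
            "publishDate": "2024-01-01T00:00:00Z",
        }

# Upload in batches of 1,000 documents, the maximum per indexing request.
docs = list(generate_documents())
for start in range(0, len(docs), 1000):
    client.upload_documents(documents=docs[start:start + 1000])
```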
### Performance Testing Script (Python)

```python
import time
import statistics

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumptions: placeholders for your service endpoint, key, and index.
client = SearchClient(
    endpoint="https://<service-name>.search.windows.net",
    index_name="content-scoring-index",
    credential=AzureKeyCredential("<admin-key>"),
)

def measure_indexing_performance(search_client, documents, analyzer_name):
    """Measure indexing throughput for an index that uses a specific analyzer."""
    start_time = time.time()
    try:
        result = search_client.upload_documents(documents)
        duration = time.time() - start_time
        return {
            'analyzer': analyzer_name,
            'duration': duration,
            'docs_per_second': len(documents) / duration,
            'success_count': len([r for r in result if r.succeeded]),
        }
    except Exception as e:
        return {'error': str(e)}

def measure_query_performance(search_client, query, scoring_profile=None, iterations=10):
    """Measure query latency over several iterations."""
    latencies = []
    for _ in range(iterations):
        start_time = time.time()
        search_params = {'search_text': query}
        if scoring_profile:
            search_params['scoring_profile'] = scoring_profile
        results = search_client.search(**search_params)
        list(results)  # Force execution; results are fetched lazily
        latencies.append((time.time() - start_time) * 1000)  # Convert to ms
    return {
        'query': query,
        'scoring_profile': scoring_profile,
        'avg_latency_ms': statistics.mean(latencies),
        'min_latency_ms': min(latencies),
        'max_latency_ms': max(latencies),
        'std_dev_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0,
    }

# Example usage (assumes test_docs holds the documents to index; the analyzer
# name is a label here -- point `client` at the matching index for each run)
def run_performance_tests():
    analyzers = ['standard.lucene', 'en.microsoft', 'custom_analyzer']
    for analyzer in analyzers:
        perf = measure_indexing_performance(client, test_docs, analyzer)
        if 'error' in perf:
            print(f"Analyzer {analyzer}: failed ({perf['error']})")
        else:
            print(f"Analyzer {analyzer}: {perf['docs_per_second']:.2f} docs/sec")

    profiles = [None, 'content_relevance', 'location_relevance']
    for profile in profiles:
        perf = measure_query_performance(client, "machine learning", profile)
        print(f"Profile {profile}: {perf['avg_latency_ms']:.2f}ms avg")
```
### Optimization Strategies

- Analyzer Optimization:
  - Use simpler analyzers for less critical fields
  - Implement separate index/search analyzers
  - Remove unnecessary token filters
- Scoring Profile Optimization:
  - Reduce the number of scoring functions
  - Use appropriate interpolation methods
  - Balance boost values
- Index Design Optimization:
  - Selective field analysis
  - Appropriate field attributes
  - Efficient data types
## Exercise 7: A/B Testing Framework

### Objective
Implement A/B testing to compare different analyzer and scoring configurations.
### A/B Testing Implementation

```python
import random
import statistics
import time
from datetime import datetime

class ABTestFramework:
    def __init__(self, search_client):
        self.search_client = search_client
        self.test_results = []

    def run_ab_test(self, query, config_a, config_b, user_sessions=100):
        """Run an A/B test comparing two search configurations."""
        results = {
            'config_a': {'queries': [], 'metrics': {}},
            'config_b': {'queries': [], 'metrics': {}},
        }
        for _ in range(user_sessions):
            # Randomly assign each session to the A or B group
            if random.random() < 0.5:
                group, config = 'config_a', config_a
            else:
                group, config = 'config_b', config_b
            query_result = self.execute_search(query, config)
            results[group]['queries'].append(query_result)

        # Aggregate per-group metrics
        for group in ('config_a', 'config_b'):
            results[group]['metrics'] = self.calculate_metrics(results[group]['queries'])
        return results

    def execute_search(self, query, config):
        """Execute a search with a specific configuration and time it."""
        search_params = {'search_text': query, 'top': 10}
        if 'scoring_profile' in config:
            search_params['scoring_profile'] = config['scoring_profile']
        if 'scoring_parameters' in config:
            search_params['scoring_parameters'] = config['scoring_parameters']

        start_time = time.time()
        results = list(self.search_client.search(**search_params))
        end_time = time.time()
        return {
            'query': query,
            'results': results,
            'latency': (end_time - start_time) * 1000,  # ms
            'result_count': len(results),
            'timestamp': datetime.now(),
        }

    def calculate_metrics(self, query_results):
        """Calculate aggregate performance metrics for one group."""
        latencies = [r['latency'] for r in query_results]
        result_counts = [r['result_count'] for r in query_results]
        return {
            'avg_latency': statistics.mean(latencies),
            'avg_results': statistics.mean(result_counts),
            'total_queries': len(query_results),
            'success_rate': len([r for r in query_results if r['result_count'] > 0]) / len(query_results),
        }

# Example A/B test (assumes search_client is a configured SearchClient and that
# an 'enhanced_relevance' profile has been defined alongside 'content_relevance')
def run_scoring_ab_test():
    ab_tester = ABTestFramework(search_client)
    config_a = {'scoring_profile': 'content_relevance'}
    config_b = {'scoring_profile': 'enhanced_relevance'}
    test_queries = [
        "machine learning",
        "data science",
        "artificial intelligence",
        "python programming",
    ]
    for query in test_queries:
        results = ab_tester.run_ab_test(query, config_a, config_b)
        print(f"Query: {query}")
        print(f"Config A - Avg Latency: {results['config_a']['metrics']['avg_latency']:.2f}ms")
        print(f"Config B - Avg Latency: {results['config_b']['metrics']['avg_latency']:.2f}ms")
        print(f"Config A - Success Rate: {results['config_a']['metrics']['success_rate']:.2%}")
        print(f"Config B - Success Rate: {results['config_b']['metrics']['success_rate']:.2%}")
        print("---")
```
## Validation and Assessment

### Completion Checklist
- [ ] Exercise 1: Successfully compared built-in analyzers
- [ ] Exercise 2: Created and tested custom analyzer
- [ ] Exercise 3: Implemented n-gram analyzer for autocomplete
- [ ] Exercise 4: Built basic scoring profile with field weights and functions
- [ ] Exercise 5: Implemented location-based scoring
- [ ] Exercise 6: Conducted performance testing and optimization
- [ ] Exercise 7: Set up A/B testing framework
### Assessment Criteria

- Technical Implementation (40%)
  - Correct analyzer and scoring profile syntax
  - Proper use of Azure AI Search APIs
  - Error handling and validation
- Performance Optimization (30%)
  - Measured performance impact
  - Applied optimization strategies
  - Balanced functionality vs. performance
- Testing and Validation (20%)
  - Comprehensive test coverage
  - Proper use of the Analyze API
  - A/B testing implementation
- Documentation and Analysis (10%)
  - Clear documentation of configurations
  - Analysis of results and trade-offs
  - Recommendations for production use
## Next Steps
After completing these exercises:
- Apply to Real Data: Implement analyzers and scoring for your actual use case
- Monitor Production: Set up monitoring for performance and relevance
- Iterate and Improve: Use A/B testing to continuously optimize
- Advanced Topics: Explore semantic search and vector search capabilities
These practical exercises provide hands-on experience with the core concepts of text analysis and scoring in Azure AI Search, preparing you for real-world implementation scenarios.