Module 3: Index Management¶
Overview¶
This module teaches you the fundamentals of index management in Azure AI Search. You'll learn how to create, configure, and maintain search indexes, design effective schemas, and implement robust data ingestion strategies. By the end of this module, you'll be able to build and manage production-ready search indexes that scale with your application needs.
Hands-On Learning Available
This module includes comprehensive Code Samples with interactive Jupyter notebooks, complete Python scripts, and advanced examples. The code samples are designed to complement this documentation with practical, runnable examples you can use immediately.
⚠️ IMPORTANT: Run Prerequisites Setup First!
Before using any examples, run the Prerequisites Setup to configure your environment and create sample indexes.
Quick Start Options:
- 🔧 Prerequisites Setup: Run setup_prerequisites.py - REQUIRED FIRST STEP
- 📓 Interactive Learning: Jupyter Notebook with step-by-step examples
- 🐍 Python Examples: Complete Python Scripts with all index operations
- 🔷 C# Examples: .NET Implementation for enterprise applications
- 🟨 JavaScript Examples: Node.js/Browser Code for web integration
- 🌐 REST API Examples: Direct HTTP Calls for any language
Learning Objectives¶
By completing this module, you will be able to:
- Create and configure search indexes using the Azure AI Search Python SDK
- Design effective index schemas with appropriate field types and attributes
- Implement data ingestion strategies for different data sources
- Manage index lifecycle operations (create, update, delete, rebuild)
- Handle index versioning and schema evolution
- Optimize index performance and storage
- Troubleshoot common index management issues
- Apply best practices for production index management
Prerequisites¶
Before starting with index management, you need to complete the setup process and have a solid understanding of basic search operations.
📋 Complete Prerequisites Setup →
The prerequisites setup includes:
- ✅ Environment Configuration - Azure service and API keys
- ✅ Development Setup - Required packages and tools
- ✅ Sample Index Creation - Practice indexes for learning
- ✅ Functionality Testing - All index operations verified
⚠️ CRITICAL: You must complete the Prerequisites Setup before attempting any examples in this module!
Index Fundamentals¶
What is a Search Index?¶
A search index in Azure AI Search is a persistent collection of documents that enables fast, full-text search operations. Think of it as a specialized database optimized for search rather than transactional operations.
Key Index Components¶
Every search index consists of:
- Schema Definition: The structure defining fields, data types, and attributes
- Documents: The actual data stored in the index
- Analyzers: Text processing rules for searchable fields
- Scoring Profiles: Custom relevance scoring configurations
- CORS Options: Cross-origin resource sharing settings
- Encryption Keys: Customer-managed encryption (optional)
Index vs Database Table¶
| Aspect | Search Index | Database Table |
|---|---|---|
| Purpose | Optimized for search and retrieval | Optimized for transactions |
| Schema | Flexible, search-focused fields | Rigid, normalized structure |
| Queries | Full-text search, faceting, filtering | SQL queries, joins |
| Performance | Fast search, slower writes | Fast writes, variable read speed |
| Scaling | Horizontal scaling with partitions | Vertical/horizontal scaling |
Creating Your First Index¶
Basic Index Creation¶
Let's start with a simple index for a blog application:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
SearchIndex,
SimpleField,
SearchableField,
ComplexField,
SearchFieldDataType
)
from azure.core.credentials import AzureKeyCredential
# Initialize the index client
index_client = SearchIndexClient(
endpoint="https://your-service.search.windows.net",
credential=AzureKeyCredential("your-admin-api-key")
)
# Define the index schema
fields = [
SimpleField(name="id", type=SearchFieldDataType.String, key=True),
SearchableField(name="title", type=SearchFieldDataType.String),
SearchableField(name="content", type=SearchFieldDataType.String),
SimpleField(name="author", type=SearchFieldDataType.String, filterable=True),
SimpleField(name="publishedDate", type=SearchFieldDataType.DateTimeOffset,
filterable=True, sortable=True),
SimpleField(name="category", type=SearchFieldDataType.String,
filterable=True, facetable=True),
SimpleField(name="tags", type=SearchFieldDataType.Collection(SearchFieldDataType.String),
filterable=True, facetable=True)
]
# Create the index
index = SearchIndex(name="blog-posts", fields=fields)
result = index_client.create_index(index)
print(f"Index '{result.name}' created successfully!")
Complete Index Creation Examples
For comprehensive index creation examples with error handling, validation, and advanced configurations, see the code samples:
- Python: 01_create_basic_index.py
- C#: 01_CreateBasicIndex.cs
- JavaScript: 01_create_basic_index.js
- REST API: 01_create_basic_index.http
Advanced Topics Available
Beyond basic creation, explore advanced index management:
- Schema Design: 02_schema_design.* - Complex fields and optimization
- Index Operations: 04_index_operations.* - Lifecycle management
- Performance: 05_performance_optimization.* - Batch sizing and parallel processing
- Error Handling: 06_error_handling.* - Comprehensive troubleshooting
Understanding Field Types¶
Azure AI Search supports various field types for different use cases:
String Fields¶
# Simple string field
SimpleField(name="id", type=SearchFieldDataType.String, key=True)
# Searchable string field (full-text search enabled)
SearchableField(name="title", type=SearchFieldDataType.String, analyzer_name="en.microsoft")
# String collection (array of strings)
SimpleField(name="tags", type=SearchFieldDataType.Collection(SearchFieldDataType.String))
Numeric Fields¶
# Integer field
SimpleField(name="viewCount", type=SearchFieldDataType.Int32, filterable=True, sortable=True)
# Double field for ratings
SimpleField(name="rating", type=SearchFieldDataType.Double, filterable=True, sortable=True)
# Long field for large numbers
SimpleField(name="fileSize", type=SearchFieldDataType.Int64, filterable=True)
Date and Boolean Fields¶
# Date field
SimpleField(name="publishedDate", type=SearchFieldDataType.DateTimeOffset,
filterable=True, sortable=True)
# Boolean field
SimpleField(name="isPublished", type=SearchFieldDataType.Boolean, filterable=True)
Geographic Fields¶
# Geographic point for location-based search
SimpleField(name="location", type=SearchFieldDataType.GeographyPoint, filterable=True)
Field Attributes¶
Control field behavior with these attributes:
SearchableField(
    name="content",
    type=SearchFieldDataType.String,
    searchable=True,              # Enable full-text search
    filterable=True,              # Enable filtering
    sortable=True,                # Enable sorting
    facetable=True,               # Enable faceting
    hidden=False,                 # Include in search results (the Python SDK exposes
                                  # the REST "retrievable" attribute as its inverse, hidden)
    analyzer_name="en.microsoft"  # Text analyzer
)
Attribute Combinations¶
| Use Case | Searchable | Filterable | Sortable | Facetable | Retrievable |
|---|---|---|---|---|---|
| Full-text search | ✅ | ❌ | ❌ | ❌ | ✅ |
| Exact match filter | ❌ | ✅ | ❌ | ❌ | ✅ |
| Sort results | ❌ | ❌ | ✅ | ❌ | ✅ |
| Faceted navigation | ❌ | ✅ | ❌ | ✅ | ✅ |
| Hidden metadata | ❌ | ✅ | ✅ | ❌ | ❌ |
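As a rough guide, the rows above translate into field definitions like the following sketch. The field names are illustrative, and the Python SDK expresses the REST retrievable attribute through its inverse, hidden:
# Illustrative field definitions matching the attribute combinations above
fields = [
    # Full-text search
    SearchableField(name="description", type=SearchFieldDataType.String),

    # Exact match filter
    SimpleField(name="status", type=SearchFieldDataType.String, filterable=True),

    # Sort results
    SimpleField(name="price", type=SearchFieldDataType.Double, sortable=True),

    # Faceted navigation
    SimpleField(name="brand", type=SearchFieldDataType.String,
                filterable=True, facetable=True),

    # Hidden metadata: usable in filters and sorts, never returned in results
    SimpleField(name="internalScore", type=SearchFieldDataType.Double,
                filterable=True, sortable=True, hidden=True)
]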
Schema Design Best Practices¶
Designing for Performance¶
1. Choose Appropriate Field Types¶
# Good: Use specific types
SimpleField(name="price", type=SearchFieldDataType.Double)
SimpleField(name="quantity", type=SearchFieldDataType.Int32)
# Avoid: Using strings for numeric data
# SimpleField(name="price", type=SearchFieldDataType.String) # Don't do this
2. Minimize Searchable Fields¶
# Only make fields searchable if they need full-text search
SearchableField(name="title", type=SearchFieldDataType.String) # Good
SearchableField(name="content", type=SearchFieldDataType.String) # Good
SimpleField(name="category", type=SearchFieldDataType.String) # Good - exact match only
3. Use Collections Wisely¶
# Good: For multiple related values
SimpleField(name="tags", type=SearchFieldDataType.Collection(SearchFieldDataType.String))
# Consider: Complex fields for structured data
ComplexField(name="author", fields=[
SimpleField(name="name", type=SearchFieldDataType.String),
SimpleField(name="email", type=SearchFieldDataType.String)
])
Schema Evolution Strategies¶
1. Additive Changes (Safe)¶
# Adding new fields is safe and doesn't require a rebuild
existing_index = index_client.get_index("existing-index")
existing_index.fields.append(
    SimpleField(name="newField", type=SearchFieldDataType.String, filterable=True)
)

# Update the index; re-using the fetched definition keeps analyzers, profiles, and other settings
index_client.create_or_update_index(existing_index)
2. Breaking Changes (Requires Rebuild)¶
# These changes require index rebuild:
# - Changing field type
# - Changing field attributes (searchable, filterable, etc.)
# - Removing fields
# - Changing analyzers
# Strategy: Create new index, migrate data, swap aliases
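A minimal sketch of that rebuild strategy, assuming documents can be re-read from their original source rather than from the old index. The helper fetch_source_documents, the index names, and the credentials are illustrative; if your service and SDK version support index aliases, swapping an alias is preferable to hard-coding the new index name in your application:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import SearchIndex

endpoint = "https://your-service.search.windows.net"
credential = AzureKeyCredential("your-admin-api-key")
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

def rebuild_index(old_name, new_name, new_fields, fetch_source_documents):
    """Create a new index with the changed schema, refill it, then retire the old one."""
    # 1. Create the new index alongside the old one
    index_client.create_index(SearchIndex(name=new_name, fields=new_fields))

    # 2. Re-ingest documents from the system of record in batches
    new_client = SearchClient(endpoint=endpoint, index_name=new_name, credential=credential)
    for batch in fetch_source_documents(batch_size=500):
        new_client.upload_documents(batch)

    # 3. Point the application (or an index alias, if available) at new_name,
    #    then delete the old index once traffic has moved over
    index_client.delete_index(old_name)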
Data Ingestion Strategies¶
Single Document Upload¶
from azure.search.documents import SearchClient
# Initialize search client (document uploads require an admin API key)
search_client = SearchClient(
    endpoint="https://your-service.search.windows.net",
    index_name="blog-posts",
    credential=AzureKeyCredential("your-admin-api-key")
)
# Upload a single document
document = {
"id": "1",
"title": "Getting Started with Azure AI Search",
"content": "Azure AI Search is a powerful search service...",
"author": "John Doe",
"publishedDate": "2024-01-15T10:00:00Z",
"category": "Tutorial",
"tags": ["azure", "search", "tutorial"]
}
result = search_client.upload_documents([document])
print(f"Document uploaded: {result[0].succeeded}")
Batch Document Upload¶
# Upload multiple documents efficiently
documents = [
{
"id": "1",
"title": "Azure AI Search Basics",
"content": "Learn the fundamentals...",
"author": "John Doe",
"publishedDate": "2024-01-15T10:00:00Z",
"category": "Tutorial",
"tags": ["azure", "search"]
},
{
"id": "2",
"title": "Advanced Search Techniques",
"content": "Master complex queries...",
"author": "Jane Smith",
"publishedDate": "2024-01-20T14:30:00Z",
"category": "Advanced",
"tags": ["search", "advanced", "queries"]
}
# ... more documents
]
# Upload in batches (recommended: 100-1000 documents per batch)
batch_size = 100
for i in range(0, len(documents), batch_size):
batch = documents[i:i + batch_size]
result = search_client.upload_documents(batch)
successful = sum(1 for r in result if r.succeeded)
print(f"Batch {i//batch_size + 1}: {successful}/{len(batch)} documents uploaded")
Handling Large Datasets¶
import json
from typing import Iterator
def load_documents_from_file(file_path: str, batch_size: int = 100) -> Iterator[list]:
"""Load documents from JSON file in batches"""
with open(file_path, 'r') as file:
documents = json.load(file)
for i in range(0, len(documents), batch_size):
yield documents[i:i + batch_size]
def upload_large_dataset(file_path: str):
"""Upload large dataset with progress tracking and error handling"""
total_uploaded = 0
total_failed = 0
for batch_num, batch in enumerate(load_documents_from_file(file_path), 1):
try:
result = search_client.upload_documents(batch)
successful = sum(1 for r in result if r.succeeded)
failed = len(batch) - successful
total_uploaded += successful
total_failed += failed
print(f"Batch {batch_num}: {successful}/{len(batch)} uploaded")
# Log failed documents
for i, r in enumerate(result):
if not r.succeeded:
print(f"Failed to upload document {batch[i]['id']}: {r.error_message}")
except Exception as e:
print(f"Batch {batch_num} failed completely: {e}")
total_failed += len(batch)
print(f"Upload complete: {total_uploaded} successful, {total_failed} failed")
Complete Data Ingestion Examples
For production-ready data ingestion with comprehensive error handling, retry logic, and performance optimization, see the data ingestion examples:
- Python: 03_data_ingestion.py
- C#: 03_DataIngestion.cs
- JavaScript: 03_data_ingestion.js
- REST API: 03_data_ingestion.http
Complete Code Sample Coverage
This module includes 6 comprehensive examples for each programming language:
📋 All Languages Include:
1. Basic Index Creation - Fundamentals and field types
2. Schema Design - Advanced patterns and best practices
3. Data Ingestion - Efficient upload strategies
4. Index Operations - Lifecycle management and maintenance
5. Performance Optimization - Batch sizing and parallel operations
6. Error Handling - Comprehensive troubleshooting and recovery
🔗 Quick Access:
- 🐍 Python Examples - 6 complete files
- 🔷 C# Examples - 6 complete files
- 🟨 JavaScript Examples - 6 complete files
- 🌐 REST API Examples - 6 complete files
- 📓 Interactive Notebook - All concepts in one place
Index Operations and Maintenance¶
Listing Indexes¶
# List all indexes in your service
for index in index_client.list_indexes():
    print(f"Index: {index.name}")
    print(f"  Fields: {len(index.fields)}")
    # Document count and storage size aren't part of the index definition;
    # fetch them separately as per-index statistics
    stats = index_client.get_index_statistics(index.name)
    print(f"  Statistics: {stats}")
Getting Index Information¶
# Get detailed information about a specific index
index = index_client.get_index("blog-posts")
print(f"Index Name: {index.name}")
print(f"Fields: {len(index.fields)}")
print(f"Analyzers: {len(index.analyzers) if index.analyzers else 0}")
print(f"Scoring Profiles: {len(index.scoring_profiles) if index.scoring_profiles else 0}")
# Display field information
for field in index.fields:
attributes = []
if field.searchable: attributes.append("searchable")
if field.filterable: attributes.append("filterable")
if field.sortable: attributes.append("sortable")
if field.facetable: attributes.append("facetable")
print(f" {field.name} ({field.type}) - {', '.join(attributes)}")
Updating Index Schema¶
# Get the existing index
existing_index = index_client.get_index("blog-posts")

# Add new fields (an additive, non-breaking change)
existing_index.fields.extend([
    SimpleField(name="viewCount", type=SearchFieldDataType.Int32,
                filterable=True, sortable=True),
    SimpleField(name="lastModified", type=SearchFieldDataType.DateTimeOffset,
                filterable=True, sortable=True)
])

# Update the index; re-using the fetched definition preserves analyzers,
# scoring profiles, CORS options, and every other setting
result = index_client.create_or_update_index(existing_index)
print(f"Index '{result.name}' updated successfully!")
Index Statistics and Monitoring¶
# Service-level statistics: aggregate counters (documents, indexes, storage) and limits
service_stats = index_client.get_service_statistics()
print(service_stats)

# Index-specific statistics: document count and storage size
index_stats = index_client.get_index_statistics("blog-posts")
print(index_stats)

# Quick document count for the index this search client is bound to
print(f"Documents in index: {search_client.get_document_count()}")
Deleting Indexes¶
# Delete an index (be careful!)
try:
index_client.delete_index("old-index")
print("Index deleted successfully")
except Exception as e:
print(f"Failed to delete index: {e}")
Advanced Index Configuration¶
Custom Analyzers¶
from azure.search.documents.indexes.models import CustomAnalyzer

# Define a custom analyzer: pattern tokenizer plus lowercase and ASCII-folding token filters
custom_analyzer = CustomAnalyzer(
    name="custom_analyzer",
    tokenizer_name="pattern",
    token_filters=["lowercase", "asciifolding"],
    char_filters=[]
)
# Create index with custom analyzer
fields = [
SimpleField(name="id", type=SearchFieldDataType.String, key=True),
SearchableField(name="title", type=SearchFieldDataType.String,
analyzer_name="custom_analyzer")
]
index = SearchIndex(
name="custom-analyzer-index",
fields=fields,
analyzers=[custom_analyzer]
)
result = index_client.create_index(index)
Scoring Profiles¶
from datetime import timedelta

from azure.search.documents.indexes.models import (
    ScoringProfile,
    TextWeights,
    FreshnessScoringFunction,
    FreshnessScoringParameters,
    ScoringFunctionInterpolation
)

# Define scoring profile
scoring_profile = ScoringProfile(
    name="boost_recent",
    text_weights=TextWeights(weights={"title": 2.0, "content": 1.0}),
    functions=[
        FreshnessScoringFunction(
            field_name="publishedDate",
            boost=2.0,
            interpolation=ScoringFunctionInterpolation.LINEAR,
            # Boost documents published within the last 30 days (ISO 8601 "P30D")
            parameters=FreshnessScoringParameters(boosting_duration=timedelta(days=30))
        )
    ]
)
# Create index with scoring profile
index = SearchIndex(
name="scored-index",
fields=fields,
scoring_profiles=[scoring_profile]
)
CORS Configuration¶
from azure.search.documents.indexes.models import CorsOptions
# Configure CORS for web applications
cors_options = CorsOptions(
allowed_origins=["https://mywebsite.com", "https://localhost:3000"],
max_age_in_seconds=300
)
index = SearchIndex(
name="web-index",
fields=fields,
cors_options=cors_options
)
Error Handling and Troubleshooting¶
Common Index Creation Errors¶
from azure.core.exceptions import HttpResponseError
import logging
def create_index_safely(index_definition):
"""Create index with comprehensive error handling"""
try:
result = index_client.create_index(index_definition)
print(f"Index '{result.name}' created successfully!")
return result
except HttpResponseError as e:
if e.status_code == 400:
logging.error(f"Bad request - check index definition: {e.message}")
elif e.status_code == 409:
logging.error(f"Index already exists: {index_definition.name}")
elif e.status_code == 403:
logging.error("Access denied - check your admin API key")
else:
logging.error(f"HTTP error {e.status_code}: {e.message}")
return None
except Exception as e:
logging.error(f"Unexpected error creating index: {str(e)}")
return None
Data Upload Error Handling¶
import time

from azure.core.exceptions import HttpResponseError

def upload_documents_safely(documents):
    """Upload documents with error handling and retry logic"""
    max_retries = 3
    retry_delay = 1
for attempt in range(max_retries):
try:
result = search_client.upload_documents(documents)
# Check for partial failures
successful = [r for r in result if r.succeeded]
failed = [r for r in result if not r.succeeded]
if failed:
print(f"Partial failure: {len(successful)}/{len(documents)} uploaded")
for failure in failed:
print(f"Failed: {failure.key} - {failure.error_message}")
else:
print(f"All {len(documents)} documents uploaded successfully")
return result
except HttpResponseError as e:
if e.status_code == 503 and attempt < max_retries - 1:
print(f"Service unavailable, retrying in {retry_delay} seconds...")
time.sleep(retry_delay)
retry_delay *= 2 # Exponential backoff
else:
print(f"Upload failed: {e.message}")
return None
except Exception as e:
print(f"Unexpected error: {str(e)}")
return None
return None
Performance Optimization¶
Index Design for Performance¶
1. Field Optimization¶
# Optimize field attributes for your use case
fields = [
SimpleField(name="id", type=SearchFieldDataType.String, key=True),
# Only searchable if full-text search is needed
SearchableField(name="title", type=SearchFieldDataType.String),
# Use SimpleField for exact-match scenarios
SimpleField(name="category", type=SearchFieldDataType.String,
filterable=True, facetable=True),
# Minimize sortable fields (they use more storage)
SimpleField(name="publishedDate", type=SearchFieldDataType.DateTimeOffset,
filterable=True, sortable=True),
    # Don't make large text fields sortable; hide fields from results if they never
    # need to be returned (the Python SDK expresses "not retrievable" as hidden=True)
    SearchableField(name="content", type=SearchFieldDataType.String,
                    hidden=True)
]
2. Batch Size Optimization¶
# Optimal batch sizes for different scenarios
def get_optimal_batch_size(document_size_kb):
"""Determine optimal batch size based on document size"""
if document_size_kb < 1:
return 1000 # Small documents
elif document_size_kb < 10:
return 500 # Medium documents
elif document_size_kb < 100:
return 100 # Large documents
else:
return 50 # Very large documents
3. Parallel Upload Strategy¶
import concurrent.futures
import json

def estimate_document_size(document) -> float:
    """Rough document size in KB, based on the document's JSON serialization."""
    return len(json.dumps(document).encode("utf-8")) / 1024

def parallel_upload(documents, max_workers=4):
    """Upload documents in parallel for better performance"""
    batch_size = get_optimal_batch_size(estimate_document_size(documents[0]))
batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_batch = {
executor.submit(search_client.upload_documents, batch): batch
for batch in batches
}
for future in concurrent.futures.as_completed(future_to_batch):
batch = future_to_batch[future]
try:
result = future.result()
results.extend(result)
print(f"Batch of {len(batch)} documents uploaded")
except Exception as e:
print(f"Batch upload failed: {e}")
return results
Best Practices Summary¶
Schema Design¶
- ✅ Use specific field types (Int32, Double) instead of strings for numeric data
- ✅ Only make fields searchable if they need full-text search
- ✅ Minimize sortable fields to reduce storage overhead
- ✅ Use collections for multi-value fields
- ✅ Plan for schema evolution with additive changes
Data Ingestion¶
- ✅ Use batch uploads (100-1000 documents per batch)
- ✅ Implement retry logic with exponential backoff
- ✅ Handle partial failures gracefully
- ✅ Monitor upload progress and performance
- ✅ Use parallel uploads for large datasets
Performance¶
- ✅ Choose appropriate batch sizes based on document size
- ✅ Use parallel uploads with thread pools
- ✅ Monitor index statistics and storage usage
- ✅ Optimize field attributes for your use case
- ✅ Consider index partitioning for very large datasets
Maintenance¶
- ✅ Regular monitoring of index health and performance
- ✅ Plan for index rebuilds when making breaking changes
- ✅ Implement proper error handling and logging
- ✅ Use staging indexes for testing schema changes
- ✅ Document your index schema and design decisions
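The staging-index practice above can be sketched in a few lines, assuming index_client is the SearchIndexClient used throughout this module and that the index names and extra field are illustrative:
# Copy the production schema into a staging index, apply the proposed change there,
# and test against the copy before touching production
staging_index = index_client.get_index("blog-posts")
staging_index.name = "blog-posts-staging"
staging_index.e_tag = None  # drop concurrency metadata copied from the production index
staging_index.fields.append(
    SimpleField(name="experimentalField", type=SearchFieldDataType.String, filterable=True)
)
index_client.create_or_update_index(staging_index)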
Next Steps¶
After completing this module, you should be comfortable with:
- Creating and configuring search indexes
- Designing effective schemas for your use case
- Implementing robust data ingestion strategies
- Managing index lifecycle operations
- Optimizing performance and handling errors
Recommended Learning Path:
- ✅ Complete the theory (this documentation)
- 🔬 Practice with the code samples
- 📝 Work through the exercises (coming soon)
- 🚀 Move to Module 4: Simple Queries and Filters
In the next module, you'll learn about Simple Queries and Filters, where you'll discover how to construct effective search queries and apply filters to refine your search results.
Code Samples and Hands-On Practice¶
Ready to put your knowledge into practice? This module includes comprehensive code samples across multiple programming languages.
👨💻 Complete Code Samples Guide →
What's included:
- ✅ Multi-Language Support - Python, C#, JavaScript, REST API (6 files each)
- ✅ Focused Examples - Each file covers one specific concept
- ✅ Interactive Learning - Jupyter notebooks for hands-on practice
- ✅ Production-Ready - Comprehensive error handling patterns
- ✅ Learning Paths - Beginner, quick reference, and cross-language options
Quick Start Options:
- 🐍 Python: cd code-samples/python/ && python 01_create_basic_index.py
- 🔷 C#: dotnet run 01_CreateBasicIndex.cs
- 🟨 JavaScript: node 01_create_basic_index.js
- 🌐 REST API: Open 01_create_basic_index.http in VS Code with the REST Client extension
- 📓 Interactive: jupyter notebook code-samples/notebooks/index_management.ipynb
📊 Complete Coverage Matrix:
| Topic | Python | C# | JavaScript | REST |
|---|---|---|---|---|
| Basic Index Creation | ✅ | ✅ | ✅ | ✅ |
| Schema Design | ✅ | ✅ | ✅ | ✅ |
| Data Ingestion | ✅ | ✅ | ✅ | ✅ |
| Index Operations | ✅ | ✅ | ✅ | ✅ |
| Performance Optimization | ✅ | ✅ | ✅ | ✅ |
| Error Handling | ✅ | ✅ | ✅ | ✅ |