Best Practices - Module 5: Data Sources & Indexers

Data Source Configuration

Connection Management

  • Use Managed Identity: Prefer managed identity over connection strings for enhanced security (see the sketch after this list)
  • Secure Connection Strings: Store connection strings in Azure Key Vault when API keys are necessary
  • Test Connections: Always validate data source connectivity before creating indexers
  • Monitor Quotas: Be aware of service tier limits for data sources and indexers
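
As a minimal sketch of the managed-identity guidance above, assuming the azure-search-documents Python SDK, a search service whose system-assigned identity has been granted access to the storage account, and placeholder resource names:

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient
    from azure.search.documents.indexes.models import (
        SearchIndexerDataSourceConnection,
        SearchIndexerDataContainer,
    )

    # Authenticate the management client with Azure AD instead of an admin key.
    client = SearchIndexerClient("https://<your-service>.search.windows.net",
                                 DefaultAzureCredential())

    # A ResourceId-style connection string tells the indexer to reach the storage
    # account with the search service's managed identity -- no account key involved.
    data_source = SearchIndexerDataSourceConnection(
        name="hotels-blob-ds-prod",
        type="azureblob",
        connection_string=(
            "ResourceId=/subscriptions/<subscription-id>/resourceGroups/<rg>"
            "/providers/Microsoft.Storage/storageAccounts/<storage-account>;"
        ),
        container=SearchIndexerDataContainer(name="hotel-docs"),
    )

    client.create_or_update_data_source_connection(data_source)

Because the service authenticates as itself, there is no account key to store, rotate, or leak; Key Vault remains the right place for any connection strings that cannot be avoided.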

Data Source Design

  • Single Responsibility: Create separate data sources for different types of data or environments
  • Descriptive Naming: Use clear, descriptive names that indicate the data source type and purpose
  • Environment Separation: Use different data sources for development, staging, and production

Indexer Configuration

Scheduling Strategy

  • Appropriate Frequency: Schedule indexers based on data change frequency, not arbitrarily
  • Off-Peak Hours: Run large indexing operations during low-traffic periods
  • Incremental Updates: Use change detection policies to minimize processing time
  • Batch Size Optimization: Configure batch sizes based on document size and complexity
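
A sketch of how a schedule and batch size might be attached to an indexer (same SDK as above; the four-hour interval, 02:00 UTC start time, and batch size of 50 are illustrative values, not recommendations):

    from datetime import datetime, timedelta, timezone

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient
    from azure.search.documents.indexes.models import (
        SearchIndexer,
        IndexingSchedule,
        IndexingParameters,
    )

    client = SearchIndexerClient("https://<your-service>.search.windows.net",
                                 DefaultAzureCredential())

    indexer = SearchIndexer(
        name="hotels-blob-indexer",
        data_source_name="hotels-blob-ds-prod",
        target_index_name="hotels-index",
        # Run every four hours, anchored to a low-traffic start time.
        schedule=IndexingSchedule(
            interval=timedelta(hours=4),
            start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
        ),
        # Smaller batches suit large or complex documents; larger suit small, flat ones.
        parameters=IndexingParameters(batch_size=50),
    )

    client.create_or_update_indexer(indexer)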

Field Mapping Best Practices

  • Explicit Mappings: Define field mappings explicitly rather than relying on automatic mapping
  • Data Type Consistency: Ensure source and target field types are compatible
  • Null Handling: Plan for null values and missing fields in source data
  • Complex Type Mapping: Use output field mappings to move enriched or complex content, such as skill outputs, into index fields
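
As an illustration of explicit mappings, hypothetical field names and the base64Encode mapping function might be combined like this:

    from azure.search.documents.indexes.models import (
        SearchIndexer,
        FieldMapping,
        FieldMappingFunction,
    )

    indexer = SearchIndexer(
        name="hotels-blob-indexer",
        data_source_name="hotels-blob-ds-prod",
        target_index_name="hotels-index",
        # Explicit source-to-index mappings instead of relying on name matching.
        field_mappings=[
            # Blob paths contain characters that are invalid in a document key,
            # so encode the path before using it as the key.
            FieldMapping(
                source_field_name="metadata_storage_path",
                target_field_name="id",
                mapping_function=FieldMappingFunction(name="base64Encode"),
            ),
            FieldMapping(source_field_name="metadata_storage_name",
                         target_field_name="file_name"),
        ],
        # Output field mappings run after enrichment and carry skill output or
        # other complex content into index fields.
        output_field_mappings=[
            FieldMapping(source_field_name="/document/organizations",
                         target_field_name="organizations"),
        ],
    )

The indexer is then created or updated with the same SearchIndexerClient call as in the scheduling sketch above.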

Performance Optimization

  • Index Schema Design: Design your index schema to minimize field mappings
  • Selective Field Extraction: Only extract and index fields that will be searched or filtered
  • Parallel Processing: Use multiple indexers for large datasets when possible (see the sketch after this list)
  • Resource Scaling: Consider scaling up your search service for large indexing operations
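
One way to apply the parallel-processing advice above is to partition a blob container into virtual folders, give each partition its own data source, and run one indexer per partition into the same index; the folder names here are hypothetical:

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient
    from azure.search.documents.indexes.models import (
        SearchIndexer,
        SearchIndexerDataSourceConnection,
        SearchIndexerDataContainer,
    )

    client = SearchIndexerClient("https://<your-service>.search.windows.net",
                                 DefaultAzureCredential())

    for partition in ["2023", "2024"]:          # hypothetical virtual folders
        ds = SearchIndexerDataSourceConnection(
            name=f"docs-ds-{partition}",
            type="azureblob",
            connection_string="<storage connection string or ResourceId>",
            # The container query narrows a blob data source to one virtual folder.
            container=SearchIndexerDataContainer(name="documents", query=partition),
        )
        client.create_or_update_data_source_connection(ds)

        client.create_or_update_indexer(SearchIndexer(
            name=f"docs-indexer-{partition}",
            data_source_name=ds.name,
            target_index_name="documents-index",   # every partition feeds one index
        ))

Each indexer tracks its own state, so a failed partition can be re-run without touching the others.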

Change Detection

SQL Database

  • Integrated Change Tracking: Use SQL Integrated Change Tracking for optimal performance
  • High Water Mark: Use a high water mark policy, typically on a rowversion column, when integrated change tracking is not available (for example, when indexing a view)
  • Soft Delete: Implement soft delete patterns rather than hard deletes when possible
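
A sketch of how these SQL policies attach to a data source definition (table and column names are hypothetical):

    from azure.search.documents.indexes.models import (
        SearchIndexerDataSourceConnection,
        SearchIndexerDataContainer,
        SqlIntegratedChangeTrackingPolicy,
        HighWaterMarkChangeDetectionPolicy,
        SoftDeleteColumnDeletionDetectionPolicy,
    )

    data_source = SearchIndexerDataSourceConnection(
        name="orders-sql-ds",
        type="azuresql",
        connection_string="<connection string, ideally resolved from Key Vault>",
        container=SearchIndexerDataContainer(name="dbo.Orders"),
        # Preferred when change tracking is enabled on the table; it also picks up deletes.
        data_change_detection_policy=SqlIntegratedChangeTrackingPolicy(),
    )

    # When integrated change tracking is not available (for example, when indexing a
    # view), the usual fallback is a high water mark column plus a soft-delete flag:
    fallback_change_policy = HighWaterMarkChangeDetectionPolicy(
        high_water_mark_column_name="RowVersion")
    fallback_delete_policy = SoftDeleteColumnDeletionDetectionPolicy(
        soft_delete_column_name="IsDeleted",
        soft_delete_marker_value="true")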

Blob Storage

  • LastModified Tracking: Blob indexers detect changes automatically from each blob's LastModified timestamp, so no change detection policy needs to be configured
  • Metadata Tracking: Leverage blob metadata for custom change detection logic
  • Container Organization: Organize blobs logically to optimize indexer performance
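
Change detection for blobs therefore needs no configuration, but deletions do. One common pattern, assuming a custom IsDeleted metadata property is set on blobs that should be removed, is a soft-delete deletion detection policy:

    from azure.search.documents.indexes.models import (
        SearchIndexerDataSourceConnection,
        SearchIndexerDataContainer,
        SoftDeleteColumnDeletionDetectionPolicy,
    )

    data_source = SearchIndexerDataSourceConnection(
        name="docs-blob-ds",
        type="azureblob",
        connection_string="<storage connection string or ResourceId>",
        # Scoping the data source to one virtual folder keeps runs small and predictable.
        container=SearchIndexerDataContainer(name="documents", query="contracts"),
        # LastModified-based change detection is built in; only deletions need a policy.
        # Blobs carrying the metadata pair IsDeleted=true are removed from the index.
        data_deletion_detection_policy=SoftDeleteColumnDeletionDetectionPolicy(
            soft_delete_column_name="IsDeleted",
            soft_delete_marker_value="true",
        ),
    )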

Cosmos DB

  • _ts High Water Mark: Use a high water mark change detection policy on the _ts property so that only documents changed since the last run are re-processed
  • Partition Strategy: Align indexer queries with your Cosmos DB partition strategy
  • Query Optimization: Use efficient queries to minimize RU consumption
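
A sketch of a Cosmos DB data source that combines a _ts high water mark with a selective query (account, database, and container names are hypothetical):

    from azure.search.documents.indexes.models import (
        SearchIndexerDataSourceConnection,
        SearchIndexerDataContainer,
        HighWaterMarkChangeDetectionPolicy,
    )

    data_source = SearchIndexerDataSourceConnection(
        name="products-cosmos-ds",
        type="cosmosdb",
        # The database name rides along in the connection string for Cosmos DB sources.
        connection_string=("AccountEndpoint=https://<account>.documents.azure.com;"
                           "AccountKey=<resolved from Key Vault>;Database=catalog"),
        container=SearchIndexerDataContainer(
            name="products",
            # Project only the fields the index needs, and keep _ts in the query so
            # the indexer can resume from the last processed change.
            query=("SELECT c.id, c.name, c.category, c._ts FROM c "
                   "WHERE c._ts >= @HighWaterMark ORDER BY c._ts"),
        ),
        # Only documents changed since the previous run are processed.
        data_change_detection_policy=HighWaterMarkChangeDetectionPolicy(
            high_water_mark_column_name="_ts"),
    )

Projecting only the needed columns keeps RU consumption down, while the @HighWaterMark filter and ORDER BY c._ts are what let the indexer resume where the previous run stopped.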

Error Handling and Monitoring

Robust Error Handling

  • Retry Policies: Configure appropriate retry policies for transient failures
  • Error Thresholds: Set failed-item thresholds so a run tolerates isolated bad documents but fails when errors pile up (see the sketch after this list)
  • Graceful Degradation: Design indexers to continue processing despite individual document failures
  • Logging Strategy: Implement comprehensive logging for troubleshooting
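
Within the indexer definition itself, the main knobs for tolerating individual bad documents are the failed-item thresholds; a sketch with illustrative values:

    from azure.search.documents.indexes.models import SearchIndexer, IndexingParameters

    indexer = SearchIndexer(
        name="docs-indexer",
        data_source_name="docs-blob-ds",
        target_index_name="documents-index",
        parameters=IndexingParameters(
            # Keep processing past individual bad documents...
            max_failed_items=100,            # ...but fail the run if errors pile up
            max_failed_items_per_batch=10,   # and abandon a clearly broken batch early.
            # (A value of -1 would tolerate any number of failures.)
        ),
    )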

Monitoring and Alerting

  • Status Monitoring: Regularly check indexer status and execution history (see the sketch after this list)
  • Performance Metrics: Monitor indexing duration and throughput
  • Error Alerting: Set up alerts for indexer failures or high error rates
  • Resource Utilization: Monitor search service resource usage during indexing
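
Indexer status and execution history are available programmatically, which makes them straightforward to feed into whatever alerting you already run; a minimal polling sketch with the azure-search-documents SDK:

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient

    client = SearchIndexerClient("https://<your-service>.search.windows.net",
                                 DefaultAzureCredential())

    status = client.get_indexer_status("docs-indexer")
    last = status.last_result

    if last is not None:
        print(f"status={last.status} "
              f"processed={last.item_count} failed={last.failed_item_count}")
        for error in last.errors or []:
            print("error:", error.error_message)

    # Execution history supports trend checks (duration, throughput) over recent runs.
    for run in (status.execution_history or [])[:5]:
        duration = run.end_time - run.start_time if run.end_time else None
        print(run.status, run.item_count, duration)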

Security Best Practices

Authentication and Authorization

  • Principle of Least Privilege: Grant minimum necessary permissions to indexers
  • Regular Key Rotation: Rotate API keys regularly if using key-based authentication
  • Network Security: Use private endpoints and firewall rules to restrict access
  • Audit Logging: Enable audit logging for indexer operations

Data Protection

  • Sensitive Data Handling: Avoid indexing sensitive or personally identifiable information
  • Data Encryption: Ensure data is encrypted in transit and at rest
  • Access Controls: Implement proper access controls on both source and target systems
  • Compliance: Ensure indexer operations comply with relevant data protection regulations

Development and Testing

Development Workflow

  • Environment Isolation: Use separate search services for development and production
  • Version Control: Store indexer definitions in version control systems
  • Automated Testing: Implement automated tests for indexer configurations
  • Documentation: Document indexer configurations and dependencies

Testing Strategies

  • Unit Testing: Test individual components like field mappings and transformations
  • Integration Testing: Test end-to-end indexer workflows with sample data
  • Performance Testing: Test indexer performance with production-like data volumes
  • Failure Testing: Test error handling and recovery scenarios
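
An integration test along these lines might run an indexer in a development service against a small, known sample container and assert on the execution result. A sketch using pytest, where the service name, indexer name, sample size, and timeout are all hypothetical:

    import time

    import pytest
    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient

    DEV_ENDPOINT = "https://<dev-service>.search.windows.net"   # never production


    @pytest.fixture(scope="module")
    def indexer_client():
        return SearchIndexerClient(DEV_ENDPOINT, DefaultAzureCredential())


    def test_sample_documents_are_indexed(indexer_client):
        indexer_client.run_indexer("docs-indexer-test")

        # Poll until the run finishes (a real test might use exponential backoff and
        # also verify that the result belongs to the run just triggered).
        for _ in range(30):
            result = indexer_client.get_indexer_status("docs-indexer-test").last_result
            if result is not None and result.status != "inProgress":
                break
            time.sleep(10)

        assert result is not None
        assert result.status == "success"
        assert result.failed_item_count == 0
        assert result.item_count == 25      # the known size of the sample data set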

Maintenance and Operations

Regular Maintenance

  • Index Rebuilding: Plan for full index rebuilds when they cannot be avoided, for example after breaking schema changes
  • Schema Evolution: Design for schema changes and field additions
  • Cleanup Procedures: Implement procedures for cleaning up obsolete indexers and data sources
  • Backup Strategy: Maintain backups of indexer configurations
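
Resets, rebuilds, cleanup, and configuration backups are all plain client calls, which makes them easy to fold into a runbook script; a sketch:

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexerClient

    client = SearchIndexerClient("https://<your-service>.search.windows.net",
                                 DefaultAzureCredential())

    # Full rebuild: clear the indexer's change-tracking state, then re-run so every
    # document in the source is processed again.
    client.reset_indexer("docs-indexer")
    client.run_indexer("docs-indexer")

    # Definitions can be pulled down and kept in version control as a lightweight backup.
    backup = client.get_indexer("docs-indexer")

    # Cleanup of retired resources: remove the indexer first, then its data source.
    client.delete_indexer("old-docs-indexer")
    client.delete_data_source_connection("old-docs-ds")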

Operational Excellence

  • Documentation: Maintain up-to-date documentation for all indexers
  • Runbooks: Create operational runbooks for common maintenance tasks
  • Change Management: Implement proper change management processes
  • Disaster Recovery: Plan for disaster recovery scenarios

Common Anti-Patterns to Avoid

Configuration Anti-Patterns

  • Over-Scheduling: Running indexers too frequently without considering data change patterns
  • Monolithic Indexers: Creating single indexers that handle too many different data types
  • Hardcoded Values: Embedding environment-specific values in indexer definitions
  • Ignoring Errors: Not properly handling or monitoring indexer errors

Performance Anti-Patterns

  • Full Rebuilds: Performing full index rebuilds when incremental updates would suffice
  • Inefficient Queries: Using inefficient source queries that scan entire datasets
  • Resource Contention: Running multiple resource-intensive indexers simultaneously
  • Oversized Batches: Using batch sizes that are too large for available resources

Security Anti-Patterns

  • Exposed Credentials: Storing connection strings or keys in code or configuration files
  • Excessive Permissions: Granting broader permissions than necessary
  • Unencrypted Connections: Using unencrypted connections to data sources
  • Missing Monitoring: Not monitoring for security-related events or anomalies

Checklist for Production Deployment

Pre-Deployment

  • [ ] All connection strings and credentials are secured
  • [ ] Indexer schedules are appropriate for production workloads
  • [ ] Error handling and retry policies are configured
  • [ ] Monitoring and alerting are set up
  • [ ] Performance testing has been completed

Post-Deployment

  • [ ] Indexer execution is monitored and verified
  • [ ] Performance metrics are within expected ranges
  • [ ] Error rates are acceptable
  • [ ] Documentation is updated
  • [ ] Team is trained on operational procedures

Performance Tuning Guidelines

Indexer Performance

  • Batch Size: Start with default batch sizes and adjust based on performance
  • Parallel Execution: Use multiple indexers for large datasets when appropriate
  • Resource Allocation: Ensure adequate search service capacity during indexing
  • Network Optimization: Minimize network latency between services

Query Performance

  • Index Design: Design indexes to support your query patterns efficiently
  • Field Selection: Only make fields searchable, filterable, or sortable when necessary
  • Analyzer Selection: Choose appropriate analyzers for your content and language
  • Caching Strategy: Implement appropriate caching strategies for frequently accessed data
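
As a sketch of keeping field attributes and analyzers deliberate (same SDK as earlier; the field names and the en.microsoft analyzer are illustrative choices):

    from azure.identity import DefaultAzureCredential
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SearchIndex,
        SimpleField,
        SearchableField,
        SearchFieldDataType,
    )

    index = SearchIndex(
        name="hotels-index",
        fields=[
            SimpleField(name="id", type=SearchFieldDataType.String, key=True),
            # Full-text searched, with a language analyzer that matches the content.
            SearchableField(name="description", analyzer_name="en.microsoft"),
            # Filter/facet only -- it stays out of full-text search entirely.
            SimpleField(name="category", type=SearchFieldDataType.String,
                        filterable=True, facetable=True),
            # Sort only.
            SimpleField(name="rating", type=SearchFieldDataType.Double, sortable=True),
        ],
    )

    SearchIndexClient("https://<your-service>.search.windows.net",
                      DefaultAzureCredential()).create_or_update_index(index)

Every attribute that is switched on adds index size or per-query work, so leaving attributes off by default and enabling them only when a query pattern needs them is the cheaper direction to err in.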

By following these best practices, you'll create robust, secure, and performant indexing solutions that scale with your needs and provide reliable data ingestion for your Azure AI Search implementation.