Elevate Your Search Experience by integrating Coveo into AEM Cloud implementation - Part 3

Table of contents
- 1: Quick Recap: Why Sitemap Connector?
- 2: Designing the Sitemap & Metadata Strategy
- 3: Creating and Configuring Coveo Source using Coveo Sitemap Connector
- 4: Creating Fields and Mapping Sitemap Metadata
- 5: Indexing Pipeline Extensions(IPE): Enhancing Metadata & Cleanup
- 6: Content Refresh Mechanism
- 8: Conclusion & Next Steps

This is the third post in my series, Elevate Your Search Experience by Integrating Coveo into AEM Cloud. It continues from Part 2: Practical Example of 5-Steps Coveo Implementation.
In the previous post, I mentioned that I would cover how we used the Coveo Sitemap connector for this implementation. In this post, I'll walk you through the practical steps we took to index AEM content with the Coveo Sitemap connector as part of our AEM Cloud replatforming project.
1: Quick Recap: Why Sitemap Connector?
As mentioned in my previous blog, after understanding our end users and their expectations, we determined that all the necessary content for search resided in AEM. The next crucial decision was selecting the right connector to index this content.
We chose the Coveo Sitemap connector over a Web connector for several compelling reasons:
Web sources explore websites by recursively following hyperlinks, which presents several challenges:
- Pages not reachable through hyperlinks aren't indexed
- The indexing process can take significant time
Sitemap sources, on the other hand, offer substantial advantages:
- Faster indexing by directly accessing pages listed in the sitemap
- Support for incremental refresh of content (using the lastmod tag)
- Greater control over content selection
- Ability to include custom metadata in the sitemap (if you control sitemap generation) for better search experiences
This choice perfectly aligned with our project's requirements for a personalized and secure search experience, as it provided us with a structured approach to content indexing while supporting the metadata needed for role-based content visibility.
2: Designing the Sitemap & Metadata Strategy
Before implementing the connector, we needed to establish a well-structured sitemap strategy that would support our requirements for role-based access and content organization.
The sitemap file we designed followed the standard sitemap protocol but was enhanced with Coveo-specific metadata to support our unique requirements. Here's a simplified example of a sample sitemap structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:coveo="https://www.coveo.com/schemas/metadata">
  <url>
    <loc>https://our-aem-site.com/content/page1</loc>
    <lastmod>2023-10-15T14:30:45+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <coveo:metadata>
      <contenttype>article</contenttype>
      <department>marketing</department>
      <accesslevel>public</accesslevel>
      <language>en-us</language>
    </coveo:metadata>
  </url>
  <url>
    <loc>https://our-aem-site.com/content/restricted/page2</loc>
    <lastmod>2023-11-02T09:15:32+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <coveo:metadata>
      <contenttype>documentation</contenttype>
      <department>engineering</department>
      <accesslevel>internal</accesslevel>
      <userole>admin,developer</userole>
      <language>en-us</language>
    </coveo:metadata>
  </url>
</urlset>
The key points in our sitemap design were:
- Extension with Coveo namespace: We added xmlns:coveo="https://www.coveo.com/schemas/metadata" to enable Coveo-specific metadata.
- Role-based metadata: We included user role information within the <coveo:metadata> element to support our security requirements.
- Content categorization: Additional metadata like contenttype, department, and language was included to enhance search faceting and filtering.
- Change frequency indicators: The lastmod and changefreq elements helped optimize the content refresh mechanism.
When dealing with special characters or complex values in metadata, we used CDATA sections:
<coveo:metadata>
  <description><![CDATA[This content includes special characters & symbols < > " ' that need to be preserved]]></description>
</coveo:metadata>
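One way to sanity-check a sitemap like this before pointing a source at it is to parse it and read the metadata back. The following is a minimal sketch using Python's standard library, not part of our implementation; the function name read_entries and the shortened SAMPLE document are illustrative assumptions. It also shows that a CDATA value comes back as plain text with its special characters intact.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample sitemap; the "sm" prefix is arbitrary.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "coveo": "https://www.coveo.com/schemas/metadata",
}

# Shortened, illustrative sitemap with one entry and one CDATA value.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:coveo="https://www.coveo.com/schemas/metadata">
  <url>
    <loc>https://our-aem-site.com/content/page1</loc>
    <lastmod>2023-10-15T14:30:45+00:00</lastmod>
    <coveo:metadata>
      <contenttype>article</contenttype>
      <description><![CDATA[Symbols & < > are preserved]]></description>
    </coveo:metadata>
  </url>
</urlset>"""


def read_entries(xml_text):
    """Return a list of (loc, lastmod, metadata dict) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        metadata = {}
        meta_el = url.find("coveo:metadata", NS)
        if meta_el is not None:
            for child in meta_el:
                # Unprefixed children inherit the default namespace,
                # so keep only the local part of the tag name.
                name = child.tag.split("}", 1)[-1]
                metadata[name] = (child.text or "").strip()
        entries.append((loc, lastmod, metadata))
    return entries


if __name__ == "__main__":
    for loc, lastmod, metadata in read_entries(SAMPLE):
        print(loc, lastmod, metadata)
```

A check like this can run as part of a build or smoke test to catch missing metadata keys before a rebuild is triggered.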
Implementation of Sitemap Structure in AEM:
The sitemap generation was integrated into our AEM implementation. We created a custom servlet that generated the sitemap dynamically, pulling content from AEM's repository and enriching it with the necessary Coveo metadata. This ensured that as content was updated and published in AEM, our sitemap would automatically reflect those changes, along with the appropriate metadata.
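To make the shape of the generated output concrete, here is a rough sketch of the same idea in Python. It is not our servlet code (which runs inside AEM); the build_sitemap function and the page dictionaries are hypothetical stand-ins for content read from the repository. The point it illustrates is that building the XML through a library, rather than concatenating strings, takes care of escaping special characters in metadata values.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
COVEO_NS = "https://www.coveo.com/schemas/metadata"

# Serialize sitemap elements without a prefix and Coveo elements as coveo:*.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("coveo", COVEO_NS)


def build_sitemap(pages):
    """Build sitemap XML from a list of page dicts (hypothetical input shape).

    Each page dict has "loc", "lastmod" and a "metadata" dict of name -> value.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        meta = ET.SubElement(url, f"{{{COVEO_NS}}}metadata")
        for name, value in page["metadata"].items():
            # Metadata children are unprefixed, as in the sample sitemap.
            ET.SubElement(meta, f"{{{SITEMAP_NS}}}{name}").text = value
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {
            "loc": "https://our-aem-site.com/content/page1",
            "lastmod": "2023-10-15T14:30:45+00:00",
            "metadata": {"contenttype": "article", "department": "R & D"},
        }
    ]))
```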
3: Creating and Configuring Coveo Source using Coveo Sitemap Connector
Once our sitemap structure was defined, the next step was to configure the Coveo Sitemap connector in the Coveo platform.
Key Considerations & Best Practices:
Sitemap Management: For optimal performance:
- Break down large sitemap files into multiple smaller files
- Ensure your sitemap includes lastmod dates in W3C DateTime format for proper content refresh
- The Sitemap connector doesn't support robots.txt files, so exclusions must be configured directly
Web Scraping Optimization:
- Configure exclusions for headers, footers, and navigation elements to improve search relevance
- Use the Web Scraper Helper for Coveo Chrome extension to test your configurations
Performance Considerations (as part of Advanced Settings):
- Set appropriate crawling speed by adjusting the "Time the crawler waits between requests"
- For JavaScript-heavy sites, enable "Execute JavaScript on pages" but be aware this significantly increases crawling time
- Set an appropriate value for "Add time for the crawler to wait" when JavaScript execution is enabled
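For the first point, splitting a large sitemap, a simple chunking step is usually enough. The sketch below is illustrative only: the file naming scheme (sitemap-1.xml, sitemap-2.xml, and so on) and both function names are assumptions, and it presumes the smaller files are either added to the source individually or listed in a standard sitemap index file. The 50,000 URL cap per file comes from the sitemap protocol itself.

```python
from xml.sax.saxutils import escape

# The sitemap protocol caps a single sitemap file at 50,000 URLs.
MAX_URLS_PER_FILE = 50000


def chunk_urls(urls, max_per_file=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most max_per_file."""
    if max_per_file <= 0:
        raise ValueError("max_per_file must be positive")
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]


def build_sitemap_index(base_url, chunk_count):
    """Build a sitemap index listing sitemap-1.xml .. sitemap-N.xml.

    The file names are hypothetical; use whatever paths your generator serves.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for n in range(1, chunk_count + 1):
        loc = escape(f"{base_url}/sitemap-{n}.xml")
        lines.append(f"  <sitemap><loc>{loc}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)


if __name__ == "__main__":
    urls = [f"https://our-aem-site.com/content/page{i}" for i in range(5)]
    chunks = chunk_urls(urls, max_per_file=2)
    print(build_sitemap_index("https://our-aem-site.com", len(chunks)))
```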
4: Creating Fields and Mapping Sitemap Metadata
Once our Sitemap connector was configured, the next crucial step was to create fields in Coveo and establish mappings between our sitemap metadata and these fields. This step ensures that all the rich metadata we embedded in our sitemap becomes searchable and filterable in our search experience.
Understanding Fields and Mappings
When indexing content using the Sitemap connector, Coveo automatically extracts metadata but doesn't automatically map all of it to searchable fields. To make our role-based metadata and other custom attributes available for search, we needed to create corresponding fields and mappings.
Creating Fields in Coveo
We created several custom fields to support our specific metadata requirements:
- contenttype: To categorize different types of content (String, Facet-enabled)
- department: To group content by organizational units (String, Facet-enabled)
- accesslevel: To control visibility of content (String, Facet-enabled)
- userole: To support role-based access (String, Multi-value facet)
- language: To enable language-specific filtering (String, Facet-enabled)
For each field, we configured appropriate options:
- Facet: Enabled for fields we wanted to use for filtering
- Multi-value facet: Enabled for userole since content could be accessible to multiple roles
- Free-text search: Enabled for fields we wanted users to be able to search directly
- Sort: Enabled for fields we needed for result ordering
Establishing Metadata Mappings
Field mappings determine which metadata value populates each field for a given Coveo source.
In the Coveo console, select the source and follow the UI instructions to add a mapping for each field we created.
Verifying Field Mappings
After creating our fields and establishing mappings, we rebuilt the source and used Coveo's Content Browser to verify that the metadata was correctly indexed:
- We searched for specific content items.
- We examined their field values to confirm that the fields were populated correctly according to the defined mapping rules.
Key Considerations
- Use a multi-value facet for fields that might contain multiple values separated by delimiters (for example, fields that back hierarchical facets).
- Choose intuitive field names and keep them consistent with your naming convention.
- Rebuild your source after adding new fields or mappings so that the changes take effect.
- Use the Coveo API for batch operations for more efficient processing.
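As a rough illustration of the batching idea, the sketch below only builds a JSON body describing the five fields listed earlier; it does not send any request. The property names (name, type, facet, multiValueFacet) and the notion of posting the whole list in one call are assumptions about the Coveo Field API and should be checked against the current API reference before use.

```python
import json

# (field name, is multi-value?) pairs taken from the fields described above.
FIELDS = [
    ("contenttype", False),
    ("department", False),
    ("accesslevel", False),
    ("userole", True),
    ("language", False),
]


def field_definition(name, multi_value):
    """Build one field definition.

    The key names are assumptions, not confirmed against the API reference.
    """
    return {
        "name": name,
        "type": "STRING",
        "facet": not multi_value,
        "multiValueFacet": multi_value,
    }


def build_batch_payload(fields=FIELDS):
    """Return the list of field definitions to submit in a single batch call."""
    return [field_definition(name, multi) for name, multi in fields]


if __name__ == "__main__":
    print(json.dumps(build_batch_payload(), indent=2))
```

Keeping the field list in one place like this also makes it easy to review field settings in version control rather than only in the console.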
5: Indexing Pipeline Extensions(IPE): Enhancing Metadata & Cleanup
Once our fields and mappings were in place, we wanted to further enhance and clean up our metadata during the indexing process. For this, we leveraged Coveo's Indexing Pipeline Extensions (IPEs) - Python scripts that execute at specific stages of the indexing process.
Understanding Indexing Pipeline Extensions
Indexing Pipeline Extensions are powerful tools that allowed us to transform metadata dynamically during indexing. They provided us with the ability to:
- Clean up and standardize metadata
- Enhance metadata with additional information
- Filter out unwanted content based on specific criteria
- Format values consistently across all content
Pre-Conversion vs. Post-Conversion Extensions
Coveo offers two types of extensions that run at different stages in the indexing pipeline:
Pre-Conversion Extensions: Run after crawling but before metadata processing
- Ideal for rejecting content early in the process
- Used when you need to modify the original document data
Post-Conversion Extensions: Run after processing and mapping
- Ideal for refining already mapped fields
- Used when you need access to the processed item body or fields
For our implementation, we created extensions of both types to address different requirements.
Examples:
Cleaning Title Metadata (Post-Conversion)
Standardize titles across our content with this post-conversion extension:
def get_safe_meta_data(meta_data_name):
    # Return the last value stored for the metadata key, or '' if it is missing
    safe_meta = ''
    meta_data_value = document.get_meta_data_value(meta_data_name)
    if meta_data_value:
        safe_meta = meta_data_value[-1]
    return safe_meta
title = get_safe_meta_data('title')
if title:
    # Remove redundant branding from titles
    if " | Company Name" in title:
        title = title.replace(" | Company Name", "")
        log(f'Standardized title to: {title}')
        document.add_meta_data({'title': title})
Filtering Content by Status (Pre-Conversion)
A pre-conversion extension to reject content with specific status values:
def get_safe_meta_data(meta_data_name):
    # Return the last value stored for the metadata key, or '' if it is missing
    safe_meta = ''
    meta_data_value = document.get_meta_data_value(meta_data_name)
    if meta_data_value:
        safe_meta = meta_data_value[-1]
    return safe_meta
content_status = get_safe_meta_data('content_status').lower()
# Reject archived or expired items so they never reach the index
if content_status == 'archived' or content_status == 'expired':
    log(f'REJECT: Content is {content_status}')
    document.reject()
Creating Reusable Extensions with Parameters
To maximize reusability, implement parametric extensions that could be used across different sources:
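# get_safe_meta_data is the same helper defined in the examples above
target_meta = parameters['target_meta_data']
meta_data_value = get_safe_meta_data(target_meta)
# Reject the item when the configured metadata key holds the configured value
if meta_data_value == parameters['target_meta_data_value']:
    log(f'REJECT: {target_meta} - {meta_data_value}')
    document.reject()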
Then configure this extension with source-specific parameters:
{
  "target_meta_data": "content_status",
  "target_meta_data_value": "archived"
}
Once all required extensions are created, associate each one with the corresponding source in the Coveo admin console and specify:
- Whether it should run at the pre-conversion or post-conversion stage
- The error handling behavior (Skip extension or Reject item)
- Optionally, a condition so that the IPE runs only on specific content (based on associated attributes)
Key Considerations for IPEs:
- Extensions have a 5-second execution limit - keep them efficient
- IPEs operate on one document at a time (not in batch)
- Use logging strategically to help with troubleshooting
- Apply conditions to run extensions only on relevant content
- Use pre-conversion extensions for early filtering to improve performance
- Test extensions thoroughly before applying to production
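One inexpensive way to test extension logic before it touches a real source is to run it locally against a stand-in for the document object. The sketch below is an assumption-laden illustration, not a Coveo tool: FakeDocument is a hypothetical stub that mimics only the three calls used in the examples above (get_meta_data_value, add_meta_data, reject), and the extension bodies are wrapped in functions so they can receive the stub and a log callable.

```python
class FakeDocument:
    """Hypothetical local stand-in for the IPE `document` object (tests only)."""

    def __init__(self, meta):
        # Store each value in a list, since the examples read values with [-1].
        self._meta = {key: [value] for key, value in meta.items()}
        self.rejected = False

    def get_meta_data_value(self, name):
        return self._meta.get(name, [])

    def add_meta_data(self, values):
        for key, value in values.items():
            self._meta.setdefault(key, []).append(value)

    def reject(self):
        self.rejected = True


def clean_title(document, log, suffix=" | Company Name"):
    """Same logic as the title clean-up example, parameterized for testing."""
    values = document.get_meta_data_value("title")
    title = values[-1] if values else ""
    if title and suffix in title:
        title = title.replace(suffix, "")
        log(f"Standardized title to: {title}")
        document.add_meta_data({"title": title})


def reject_by_status(document, log, blocked=("archived", "expired")):
    """Same logic as the status filter example, parameterized for testing."""
    values = document.get_meta_data_value("content_status")
    status = values[-1].lower() if values else ""
    if status in blocked:
        log(f"REJECT: Content is {status}")
        document.reject()


if __name__ == "__main__":
    messages = []
    doc = FakeDocument({"title": "Pricing | Company Name", "content_status": "Archived"})
    clean_title(doc, messages.append)
    reject_by_status(doc, messages.append)
    print(doc.get_meta_data_value("title")[-1], doc.rejected, messages)
```

The stub cannot reproduce platform behavior such as the execution time limit, so a run against a test source is still needed before production.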
6: Content Refresh Mechanism
After implementing all the previous steps - configuring the Sitemap connector, creating fields, establishing mappings, and adding Indexing Pipeline Extensions - one crucial step remains: rebuilding the source and setting up ongoing content refresh mechanisms.
Why Is Refreshing Content Important?
Any updates to content in AEM - whether additions, modifications, or deletions - need to be reflected in Coveo's index. This ensures that users always find the most current and relevant content in their search results.
Coveo offers three primary ways to update indexed content, each serving a distinct purpose:
Refresh
- What it does: The most lightweight operation
- How it works: Updates only items that have changed since the last refresh (based on lastmod timestamps in the sitemap)
- Advantages: Fastest operation, minimal resource impact
- Limitations: Does not detect deleted items
- Management: Can be scheduled on Source level or triggered manually
- Best for: Frequent updates (every few hours or daily, depending on the use case) to capture regular content changes
Rescan
- What it does: Recrawls all URLs in the sitemap
- How it works: Updates modified items and deletes those that no longer exist in the sitemap
- Advantages: More thorough than refresh while still being efficient
- Limitations: More resource-intensive than refresh
- Management: Can be scheduled on Source level or triggered manually
- Best for: Weekly updates to ensure complete data synchronization
Rebuild
- What it does: The most comprehensive operation
- How it works: Deletes all indexed items and performs a full re-indexing from scratch
- Advantages: Most comprehensive update, ensures complete synchronization
- Limitations: Most resource-intensive and time-consuming
- Management: Can only be triggered manually
- Best for: After configuration changes are applied on Source or for monthly maintenance
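The difference between a refresh and a rescan can be pictured with a small conceptual model. This is only a simplified illustration of the behavior described above, not how Coveo implements it; the function names and the dictionaries mapping URLs to lastmod values are hypothetical, and treating URLs not yet in the index as changed is an assumption.

```python
from datetime import datetime


def plan_refresh(indexed, sitemap):
    """URLs a lastmod-based refresh would pick up.

    `indexed` and `sitemap` map URL -> lastmod string (W3C DateTime with offset).
    Assumption: URLs not yet in the index count as changed.
    """
    changed = []
    for url, lastmod in sitemap.items():
        if url not in indexed:
            changed.append(url)
        elif datetime.fromisoformat(lastmod) > datetime.fromisoformat(indexed[url]):
            changed.append(url)
    return sorted(changed)


def plan_rescan(indexed, sitemap):
    """A rescan revisits every listed URL and drops items no longer listed."""
    to_crawl = sorted(sitemap)
    to_delete = sorted(set(indexed) - set(sitemap))
    return to_crawl, to_delete


if __name__ == "__main__":
    indexed = {
        "/page1": "2023-10-15T14:30:45+00:00",
        "/page2": "2023-11-02T09:15:32+00:00",
    }
    sitemap = {
        "/page1": "2023-12-01T08:00:00+00:00",  # modified since last index
        "/page3": "2023-12-02T10:00:00+00:00",  # newly published
    }
    print(plan_refresh(indexed, sitemap))
    print(plan_rescan(indexed, sitemap))
```

In this model, /page2 stays in the index after a refresh because nothing in the sitemap points to it any more; only the rescan notices that it is gone.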
Key Considerations for Optimal Content Refreshes
- Sitemap lastmod Format: Must use W3C DateTime format (YYYY-MM-DDThh:mm:ssTZD, for example 2023-10-15T14:30:45+00:00) for proper change detection
- Schedule Spacing: Ensure sufficient time between operations to prevent overlaps
- Content Update Frequency: Align your refresh schedules with how frequently your AEM content changes
- Resource Impact: More frequent updates consume more resources; balance frequency with need
By implementing this tiered approach to content updates, we ensured our search experience remained fresh and relevant, reflecting the latest content published in our AEM environment.
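Since change detection depends on well-formed lastmod values, it helps to produce them with a single helper rather than ad hoc string formatting. The following is a minimal sketch in Python, assuming that timestamps without timezone information are already in UTC; the function name to_lastmod is illustrative.

```python
from datetime import datetime, timezone


def to_lastmod(dt):
    """Format a datetime as W3C DateTime with seconds and a UTC offset."""
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Normalize to UTC and drop sub-second precision.
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")


if __name__ == "__main__":
    print(to_lastmod(datetime(2023, 10, 15, 14, 30, 45)))
```

Normalizing every value to one offset keeps the timestamps directly comparable across pages, regardless of where the content was authored.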
8: Conclusion & Next Steps
Implementing the Coveo Sitemap connector for our AEM Cloud project proved to be the right decision for achieving our goals of a personalized and secure search experience. The structured approach to content indexing, combined with rich metadata and intelligent refresh mechanisms, has created a foundation for an exceptional search experience for our users.
Key takeaways from our implementation:
- Metadata is fundamental: The rich metadata framework we established through the sitemap drives both relevance and security in our search experience
- Structured content refresh: Our tiered approach to content updates ensures freshness while minimizing resource usage
- Extensibility matters: The ability to transform and enhance content during indexing through IPEs has been invaluable
In the next part of this series, I will cover how we leveraged the Coveo Atomic UI framework to create a seamless and intuitive search interface that takes advantage of all the groundwork we've laid with our Sitemap connector implementation.
Stay tuned for more insights from our Coveo-AEM Cloud implementation journey!
Written by

Saurabh Kumar
Solutions Architect I Digital Platforms