Integrate Coveo with AEM Cloud Guide

This is the third post in my series, Elevate Your Search Experience by Integrating Coveo into AEM Cloud. It continues from Part 2: Practical Example of 5-Steps Coveo Implementation.

In my previous blog, I mentioned that I would cover details on how we used the Coveo Sitemap connector for this implementation. In this blog post, I'll walk you through the practical implementation steps we took to successfully index AEM content using the Coveo Sitemap connector as part of our AEM Cloud replatforming project.

1: Quick Recap: Why Sitemap Connector?

As mentioned in my previous blog, after understanding our end users and their expectations, we determined that all the necessary content for search resided in AEM. The next crucial decision was selecting the right connector to index this content.

We chose the Coveo Sitemap connector over a Web connector for several compelling reasons:

Web sources explore websites by recursively following hyperlinks, which presents several challenges:

Pages not reachable through hyperlinks aren't indexed
The indexing process can take significant time

Sitemap sources, on the other hand, offer substantial advantages:

Faster indexing by directly accessing pages listed in the sitemap
Support for incremental refresh of content (using the lastmod tag)
Greater control over content selection
Ability to include custom metadata in the sitemap (if you control sitemap generation) for better search experiences

This choice perfectly aligned with our project's requirements for a personalized and secure search experience, as it provided us with a structured approach to content indexing while supporting the metadata needed for role-based content visibility.

2: Designing the Sitemap & Metadata Strategy

Before implementing the connector, we needed to establish a well-structured sitemap strategy that would support our requirements for role-based access and content organization.

The sitemap file we designed followed the standard sitemap protocol but was enhanced with Coveo-specific metadata to support our unique requirements. Here's a simplified sample of that structure:


<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
        xmlns:coveo="https://www.coveo.com/schemas/metadata">
  <url>
    <loc>https://our-aem-site.com/content/page1</loc>
    <lastmod>2023-10-15T14:30:45+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <coveo:metadata>
      <contenttype>article</contenttype>
      <department>marketing</department>
      <accesslevel>public</accesslevel>
      <language>en-us</language>
    </coveo:metadata>
  </url>
  <url>
    <loc>https://our-aem-site.com/content/restricted/page2</loc>
    <lastmod>2023-11-02T09:15:32+00:00</lastmod>
    <changefreq>monthly</changefreq>
    <coveo:metadata>
      <contenttype>documentation</contenttype>
      <department>engineering</department>
      <accesslevel>internal</accesslevel>
      <userole>admin,developer</userole>
      <language>en-us</language>
    </coveo:metadata>
  </url>
</urlset>

The key points in our sitemap design were:

Extension with Coveo namespace: We added xmlns:coveo="https://www.coveo.com/schemas/metadata" to enable Coveo-specific metadata.
Role-based metadata: We included user role information within the <coveo:metadata> element to support our security requirements.
Content categorization: Additional metadata like contenttype, department, and language was included to enhance search faceting and filtering.
Change frequency indicators: The lastmod and changefreq elements helped optimize the content refresh mechanism.

When dealing with special characters or complex values in metadata, we used CDATA sections:

<coveo:metadata>
  <description><![CDATA[This content includes special characters & symbols < > " ' that need to be preserved]]></description>
</coveo:metadata>

Implementation of Sitemap Structure in AEM:

The sitemap generation was integrated into our AEM implementation. We created a custom servlet that generated the sitemap dynamically, pulling content from AEM's repository and enriching it with the necessary Coveo metadata. This ensured that as content was updated and published in AEM, our sitemap would automatically reflect those changes, along with the appropriate metadata.
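To make the shape of the generated output concrete, here is a minimal sketch of the sitemap-building logic. Our actual servlet lives in AEM; Python is used here only to stay consistent with the extension scripts later in this post. The `build_sitemap` function and the page dictionary shape are hypothetical, and the Coveo namespace URI is simply the one from the sample above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
COVEO_NS = "https://www.coveo.com/schemas/metadata"  # URI taken from the sample sitemap

# Serialize with the same prefixes as the sample: default namespace plus "coveo"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("coveo", COVEO_NS)


def build_sitemap(pages):
    """Return sitemap XML for a list of page dicts (hypothetical input shape)."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        metadata = ET.SubElement(url, f"{{{COVEO_NS}}}metadata")
        for name, value in page["metadata"].items():
            # Unprefixed children inherit the default namespace, as in the sample
            ET.SubElement(metadata, f"{{{SITEMAP_NS}}}{name}").text = value
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {
        "loc": "https://our-aem-site.com/content/page1",
        "lastmod": "2023-10-15T14:30:45+00:00",
        "metadata": {"contenttype": "article", "description": "Tips & tricks <beta>"},
    }
])
print(sitemap_xml)
```

Note that ElementTree escapes special characters as entities (for example `&amp;`) instead of emitting CDATA sections. An XML parser reads both forms as the same text.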

3: Creating and Configuring Coveo Source using Coveo Sitemap Connector

Once our sitemap structure was defined, the next step was to configure the Coveo Sitemap connector in the Coveo platform.

Key Considerations & Best Practices:

Sitemap Management: For optimal performance:
- Break down large sitemap files into multiple smaller files
- Ensure your sitemap includes lastmod dates in W3C DateTime format for proper content refresh
- The Sitemap connector doesn't support robots.txt files, so exclusions must be configured directly
Web Scraping Optimization:
- Configure exclusions for headers, footers, and navigation elements to improve search relevance.
- Use the Web Scraper Helper for Coveo Chrome extension to test your configurations
Performance Considerations (as part of Advanced Settings):
- Set appropriate crawling speed by adjusting the "Time the crawler waits between requests"
- For JavaScript-heavy sites, enable "Execute JavaScript on pages" but be aware this significantly increases crawling time
- Set an appropriate value for "Add time for the crawler to wait" when JavaScript execution is enabled
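Breaking a large sitemap into smaller files comes down to chunking the URL list before writing each file. The sketch below shows one way to do that. The `split_urls` helper and the 10,000 figure are illustrative assumptions (the sitemap protocol itself caps a single file at 50,000 URLs); pick a size that suits your content volume.

```python
def split_urls(urls, max_per_file=10000):
    """Yield successive chunks of at most max_per_file URLs."""
    for start in range(0, len(urls), max_per_file):
        yield urls[start:start + max_per_file]


# 25,000 hypothetical page URLs
urls = [f"https://our-aem-site.com/content/page{i}" for i in range(1, 25001)]
chunks = list(split_urls(urls))
print([len(chunk) for chunk in chunks])  # [10000, 10000, 5000]
```

Each chunk would then be written out as its own sitemap file.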

4: Creating Fields and Mapping Sitemap Metadata

Once our Sitemap connector was configured, the next crucial step was to create fields in Coveo and establish mappings between our sitemap metadata and these fields. This step ensures that all the rich metadata we embedded in our sitemap becomes searchable and filterable in our search experience.

Understanding Fields and Mappings

When indexing content using the Sitemap connector, Coveo automatically extracts metadata but doesn't automatically map all of it to searchable fields. To make our role-based metadata and other custom attributes available for search, we needed to create corresponding fields and mappings.

Creating Fields in Coveo

We created several custom fields to support our specific metadata requirements:

contenttype: To categorize different types of content (String, Facet-enabled)
department: To group content by organizational units (String, Facet-enabled)
accesslevel: To control visibility of content (String, Facet-enabled)
userole: To support role-based access (String, Multi-value facet)
language: To enable language-specific filtering (String, Facet-enabled)

For each field, we configured appropriate options:

Facet: Enabled for fields we wanted to use for filtering
Multi-value facet: Enabled for userole since content could be accessible to multiple roles
Free-text search: Enabled for fields we wanted users to be able to search directly
Sort: Enabled for fields we needed for result ordering

Establishing Metadata Mappings

Field mapping ensures that your data is structured correctly within the associated Coveo source.

In the Coveo console, select the source, open its mappings, and follow the UI instructions to add a mapping for each field we created.

Verifying Field Mappings

After creating our fields and establishing mappings, we rebuilt the source and used Coveo's Content Browser to verify that the metadata was correctly indexed:

We searched for specific content items.
We examined their field values to confirm that the fields were populated correctly according to the defined mapping rules.

Key Considerations

Use the Multi-value facet option for fields that may contain multiple values separated by a delimiter (hierarchical facets rely on delimited values in the same way).
Choose intuitive field names and keep them consistent with your naming convention.
Rebuild your source after adding new fields or mappings so the changes take effect.
Use the Coveo API to batch these operations for more efficient processing, including:
- Create Field
- Create Field Mappings associated with Source
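
As a rough illustration of what batching these operations could look like, the sketch below builds the request payloads for the five fields listed earlier. The property names (`type`, `facet`, `multiValueFacet`) and the `common.rules` mapping structure reflect my understanding of the Coveo field and source mapping models and should be treated as assumptions; the helper names are hypothetical. The HTTP calls are deliberately left out, so check the Coveo API reference for the exact endpoints and models before using this.

```python
import json

# Field options from the list above (property names are assumptions to verify)
FIELD_SPECS = {
    "contenttype": {"facet": True},
    "department": {"facet": True},
    "accesslevel": {"facet": True},
    "userole": {"multiValueFacet": True},
    "language": {"facet": True},
}


def build_field_models(specs):
    """Build one field definition per entry, all of type STRING."""
    return [{"name": name, "type": "STRING", **options} for name, options in specs.items()]


def build_mapping_rules(field_names):
    """Map each field to the sitemap metadata of the same name."""
    # %[name] is the mapping syntax for "the value of metadata `name`"
    return [{"field": name, "content": [f"%[{name}]"]} for name in field_names]


fields_payload = build_field_models(FIELD_SPECS)
mappings_payload = {"common": {"rules": build_mapping_rules(FIELD_SPECS)}}

print(json.dumps(fields_payload, indent=2))
print(json.dumps(mappings_payload, indent=2))
```

Keeping the field list in one structure means fields and mappings cannot drift apart when a new metadata element is added to the sitemap.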

5: Indexing Pipeline Extensions (IPE): Enhancing Metadata & Cleanup

Once our fields and mappings were in place, we wanted to further enhance and clean up our metadata during the indexing process. For this, we leveraged Coveo's Indexing Pipeline Extensions (IPEs) - Python scripts that execute at specific stages of the indexing process.

Understanding Indexing Pipeline Extensions

Indexing Pipeline Extensions are powerful tools that allowed us to transform metadata dynamically during indexing. They provided us with the ability to:

Clean up and standardize metadata
Enhance metadata with additional information
Filter out unwanted content based on specific criteria
Format values consistently across all content

Pre-Conversion vs. Post-Conversion Extensions

Coveo offers two types of extensions that run at different stages in the indexing pipeline:

Pre-Conversion Extensions: Run after crawling but before the item is converted (processed)
- Ideal for rejecting content early in the process
- Used when you need to modify the original document data
Post-Conversion Extensions: Run after the item has been converted, before mappings are applied
- Ideal for refining metadata before it is mapped to fields
- Used when you need access to the processed item body or metadata

For our implementation, we created extensions of both types to address different requirements.

Examples:

Cleaning Title Metadata (Post-Conversion)

Standardize titles across our content with this post-conversion extension:

# 'document' and 'log' are provided by the Coveo extension runtime.
def get_safe_meta_data(meta_data_name):
    # Return the last value found for the metadata, or '' when it is missing
    safe_meta = ''
    meta_data_value = document.get_meta_data_value(meta_data_name)
    if meta_data_value:
        safe_meta = meta_data_value[-1]
    return safe_meta

title = get_safe_meta_data('title')
if title:
    # Remove redundant branding from titles
    title = title.replace(" | Company Name", "")

    log(f'Standardized title to: {title}')
    document.add_meta_data({'title': title})

Filtering Content by Status (Pre-Conversion)

A pre-conversion extension that rejects content with specific status values:

def get_safe_meta_data(meta_data_name):
    safe_meta = ''
    meta_data_value = document.get_meta_data_value(meta_data_name)
    if meta_data_value:
        safe_meta = meta_data_value[-1]
    return safe_meta

content_status = get_safe_meta_data('content_status').lower()
if content_status == 'archived' or content_status == 'expired':
    log(f'REJECT: Content is {content_status}')
    document.reject()

Creating Reusable Extensions with Parameters

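To maximize reusability, implement parameterized extensions that can be shared across different sources:

def get_safe_meta_data(meta_data_name):
    # Same helper as above: last value of the metadata, or '' when absent
    values = document.get_meta_data_value(meta_data_name)
    return values[-1] if values else ''

# 'parameters' holds the JSON configured for this extension on the source
target_meta = parameters['target_meta_data']
meta_data_value = get_safe_meta_data(target_meta)
if meta_data_value == parameters['target_meta_data_value']:
    log(f'REJECT: {target_meta} - {meta_data_value}')
    document.reject()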

Then configure this extension with source-specific parameters:

{
  "target_meta_data": "content_status",
  "target_meta_data_value": "archived"
}

Once all required extensions are created, associate them with the corresponding source in the Coveo admin console and specify:

whether they should run at the pre-conversion or post-conversion stage
the error handling behavior (Skip extension or Reject item)
an optional condition, so that the IPE runs only on specific content (based on its attributes)

Key Considerations for IPEs:

Extensions have a 5-second execution limit - keep them efficient
IPEs operate on one document at a time (not in batch)
Use logging strategically to help with troubleshooting
Apply conditions to run extensions only on relevant content
Use pre-conversion extensions for early filtering to improve performance
Test extensions thoroughly before applying to production
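
One way to test extension logic before it reaches Coveo is to run it locally against a stand-in for the `document` object. The sketch below is an assumption-laden simplification: `FakeDocument` is a hypothetical stub that mimics only the three methods used in this post, and `reject_by_status` restates the status filter with its dependencies passed in as arguments so it can run outside the Coveo runtime.

```python
class FakeDocument:
    """Hypothetical stand-in for the `document` object Coveo injects."""

    def __init__(self, meta_data):
        self.meta_data = {name: list(values) for name, values in meta_data.items()}
        self.rejected = False

    def get_meta_data_value(self, name):
        # Simplification: return the list of values, empty when absent
        return self.meta_data.get(name, [])

    def add_meta_data(self, values):
        for name, value in values.items():
            self.meta_data.setdefault(name, []).append(value)

    def reject(self):
        self.rejected = True


def reject_by_status(document, log, blocked=("archived", "expired")):
    """Same logic as the status filter above, with dependencies passed in."""
    values = document.get_meta_data_value("content_status")
    status = values[-1].lower() if values else ""
    if status in blocked:
        log(f"REJECT: Content is {status}")
        document.reject()


messages = []
archived = FakeDocument({"content_status": ["Archived"]})
live = FakeDocument({"content_status": ["published"]})
reject_by_status(archived, messages.append)
reject_by_status(live, messages.append)
print(archived.rejected, live.rejected)  # True False
print(messages)  # ['REJECT: Content is archived']
```

A stub like this only checks the script's own logic. The real runtime, time limit, and metadata origins still need to be verified on a non-production source.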

6: Content Refresh Mechanism

After implementing all the previous steps - configuring the Sitemap connector, creating fields, establishing mappings, and adding Indexing Pipeline Extensions - one crucial step remains: rebuilding the source and setting up ongoing content refresh mechanisms.

Why Is Refreshing Content Important?

Any updates to content in AEM - whether additions, modifications, or deletions - need to be reflected in Coveo's index. This ensures that users always find the most current and relevant content in their search results.

Coveo offers three primary ways to update indexed content, each serving a distinct purpose:

Refresh
- What it does: The most lightweight operation
- How it works: Updates only items that have changed since the last refresh (based on lastmod timestamps in the sitemap)
- Advantages: Fastest operation, minimal resource impact
- Limitations: Does not detect deleted items
- Management: Can be scheduled on Source level or triggered manually
- Best for: Frequent updates (every few hours or daily, depending on the use case) to capture regular content changes
Rescan
- What it does: Recrawls all URLs in the sitemap
- How it works: Updates modified items and deletes those that no longer exist in the sitemap
- Advantages: More thorough than refresh while still being efficient
- Limitations: More resource-intensive than refresh
- Management: Can be scheduled on Source level or triggered manually
- Best for: Weekly updates to ensure complete data synchronization
Rebuild
- What it does: The most comprehensive operation
- How it works: Deletes all indexed items and performs a full re-indexing from scratch
- Advantages: Most comprehensive update, ensures complete synchronization
- Limitations: Most resource-intensive and time-consuming
- Management: Can only be triggered manually
- Best for: After configuration changes are applied on Source or for monthly maintenance
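
The difference between refresh and rescan is easiest to see in a toy model. The sketch below is not Coveo's implementation; it only mirrors the behaviour described above, using a dictionary of URL to lastmod as a stand-in for the index. The function names and sample data are hypothetical.

```python
from datetime import datetime, timezone


def refresh(index, sitemap, last_refresh):
    """Re-index only sitemap entries whose lastmod is newer than last_refresh."""
    updated = dict(index)
    for url, lastmod in sitemap.items():
        if lastmod > last_refresh:
            updated[url] = lastmod
    return updated  # URLs missing from the sitemap are left untouched


def rescan(index, sitemap):
    """Re-crawl every sitemap entry and report indexed URLs no longer listed."""
    rescanned = dict(sitemap)
    removed = set(index) - set(sitemap)
    return rescanned, removed


utc = timezone.utc
index = {
    "/page1": datetime(2023, 10, 15, tzinfo=utc),
    "/page2": datetime(2023, 11, 2, tzinfo=utc),
}
# /page1 was edited in AEM, /page2 was unpublished and dropped from the sitemap
sitemap = {"/page1": datetime(2023, 12, 1, tzinfo=utc)}
last_refresh = datetime(2023, 11, 15, tzinfo=utc)

after_refresh = refresh(index, sitemap, last_refresh)
rescanned, removed = rescan(index, sitemap)
print(sorted(after_refresh))  # ['/page1', '/page2']
print(sorted(rescanned), sorted(removed))  # ['/page1'] ['/page2']
```

In this model the unpublished page lingers after a refresh and only disappears on a rescan, which is why we schedule both.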

Key Considerations for Optimal Content Refreshes

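Sitemap lastmod Format: Must use W3C DateTime format, including the time zone designator when a time is given (for example YYYY-MM-DDThh:mm:ssTZD, as in 2023-10-15T14:30:45+00:00), for proper change detection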
Schedule Spacing: Ensure sufficient time between operations to prevent overlaps
Content Update Frequency: Align your refresh schedules with how frequently your AEM content changes
Resource Impact: More frequent updates consume more resources; balance frequency with need
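
For the lastmod format in particular, a small helper keeps every timestamp consistent. This is a minimal sketch using only the Python standard library; the `to_w3c_datetime` name is hypothetical, and normalizing to UTC is a choice made here for simplicity, not a Coveo requirement.

```python
from datetime import datetime, timezone


def to_w3c_datetime(moment):
    """Format a timezone-aware datetime as W3C DateTime with seconds precision."""
    if moment.tzinfo is None:
        raise ValueError("lastmod needs a timezone-aware datetime")
    # Normalize to UTC so every entry carries the same +00:00 designator
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


print(to_w3c_datetime(datetime(2023, 10, 15, 14, 30, 45, tzinfo=timezone.utc)))
# 2023-10-15T14:30:45+00:00
```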

By implementing this tiered approach to content updates, we ensured our search experience remained fresh and relevant, reflecting the latest content published in our AEM environment.

7: Conclusion & Next Steps

Implementing the Coveo Sitemap connector for our AEM Cloud project proved to be the right decision for achieving our goals of a personalized and secure search experience. The structured approach to content indexing, combined with rich metadata and intelligent refresh mechanisms, has created a foundation for an exceptional search experience for our users.

Key takeaways from our implementation:

Metadata is fundamental: The rich metadata framework we established through the sitemap drives both relevance and security in our search experience
Structured content refresh: Our tiered approach to content updates ensures freshness while minimizing resource usage
Extensibility matters: The ability to transform and enhance content during indexing through IPEs has been invaluable

In the next part of this series, I will cover how we leveraged the Coveo Atomic UI framework to create a seamless and intuitive search interface that takes advantage of all the groundwork we've laid with our Sitemap connector implementation.

Stay tuned for more insights from our Coveo-AEM Cloud implementation journey!
