Mastering Website Content Indexing with Sitecore Search

Amit Kumar
11 min read

👋Introduction

Enhancing search functionality and user experience on your website requires effective content indexing. This article shows you how to use Sitecore Search to index content from documents and websites, giving you the tools you need for effective content retrieval and discovery.

Sitecore Search is a headless content discovery platform powered by AI that helps you build predictive and custom search experiences across various content sources. To extract and index your material, the platform offers generic connections that you can configure.

Sitecore Search and Sitecore Discover are different products, but they have overlapping feature sets and are both built on ReflektionAI. Sitecore Discover is best suited for e-commerce applications (product results with personalization and recommendations), whereas Sitecore Search is best suited for content-driven applications.

You can find more details about what Sitecore Search is, its features, and its benefits at What is Sitecore Search?: A Definitive Introduction

🔎Website Content Indexing

Sitecore Search is a SaaS product that lets you index content from any source, including documents, and search that indexed content from any application, built on any tech stack, through the API endpoint that Sitecore Search provides.

When you are working with the Sitecore content management system (traditional or headless), you can use the default search provider, Solr, for both internal Sitecore CMS search and end-user website search, or you can use Sitecore Search for end-user website search.

For a headless implementation based on Sitecore XM Cloud, you have two options for end-user website search: use the Sitecore Experience Edge GraphQL (GQL) endpoint for simple search use cases, or use Sitecore Search or another third-party search provider.

Indexing content is important because it lets users easily find the content they are looking for, which improves the website's performance and user experience.

Sitecore Search requires a data source so that it can ingest data or content into the Sitecore Search system. When you use Sitecore Search as a search provider, it looks up the searched keywords in the indexed data and returns AI-based, personalised search results (and recommendations) to the end user.
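
To make the search side of this concrete, here is a minimal sketch of how an application might assemble the JSON body for a keyword search request. The body layout (a `widget.items` list with `entity`, `rfk_id`, and a `search.query.keyphrase` value) and the default widget ID `rfkid_7` are assumptions for illustration; check them against the Sitecore Search API documentation and your own widget configuration before relying on them.

```python
import json


def build_search_request(keyphrase, rfk_id="rfkid_7", entity="content", limit=10):
    """Build the JSON body for a keyword search.

    The field names below are assumptions, not a confirmed contract.
    """
    return {
        "widget": {
            "items": [
                {
                    "entity": entity,    # which indexed entity to search (assumed name)
                    "rfk_id": rfk_id,    # widget identifier (assumed default)
                    "search": {
                        "query": {"keyphrase": keyphrase},
                        "limit": limit,  # maximum number of results to return
                        "content": {},   # ask for the stored content attributes
                    },
                }
            ]
        }
    }


search_body = build_search_request("content indexing")
print(json.dumps(search_body, indent=2))
```

The application would POST this body to the search endpoint of its own Sitecore Search domain; the endpoint URL and API key are deliberately left out because they are specific to each account.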

Sitecore Search gives you the flexibility to index different types of content in different ways, depending on your requirements. With Sitecore Search, you can index content from HTML pages, API endpoints, documents, etc.

Please find below a diagram that explains the details of indexed content within the Sitecore Search system:

You can index your Sitecore XP, Sitecore XM, or Sitecore XM Cloud (XMC) website content into the Sitecore Search system in the following ways:

If you are using Sitecore XP or Sitecore XM based topologies to build your website, either ASP.NET MVC or Headless, then you can utilise Sitecore’s SaaS-based search provider, Sitecore Search, to index your website's data.

You can index data in the following ways:

  1. Sitecore Search provides an API Push source in the form of the Sitecore Search Ingestion API. You can call it from custom pipeline code (Sitecore Publish Pipeline > Publish End Event) on the Sitecore XP or Sitecore XM CMS role to send data to the Sitecore Search system for indexing.

  2. Sitecore Search provides the following pull sources, which can be used to crawl the data from your website or API endpoint:

    • API Crawler: If your content can only be accessed by an API endpoint, and the API returns JSON

    • Feed Crawler: Crawls feed files (CSV or JSON)

    • Web Crawler: If you have content in one locale, and all the content is accessible through a webpage

    • Advanced Web Crawler: If you need to index content in multiple languages or want to use JavaScript to extract attributes

It's recommended to use the Advanced Web Crawler to crawl the data (or content) from your website.

You also need to define the scope of the source that the crawler uses: the domains that the crawler is allowed to access, the URLs it must avoid, the maximum depth the crawler should go to, and more.
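
The idea of a crawler scope can be sketched as a simple URL filter. This is only an illustration of the concept (allowed domains, excluded URLs, maximum depth), not how Sitecore Search implements it; the function name `in_scope` and its parameters are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_domains, excluded_prefixes, depth, max_depth):
    """Return True if a discovered URL falls inside the crawl scope.

    Hypothetical helper: it mirrors the three scope rules described above.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Rule 1: the host must be one of the allowed domains.
    if host not in allowed_domains:
        return False
    # Rule 2: the path must not start with an excluded prefix.
    if any(parsed.path.startswith(prefix) for prefix in excluded_prefixes):
        return False
    # Rule 3: the link must not be deeper than the configured maximum depth.
    return depth <= max_depth


allowed = {"www.example.com"}
excluded = ["/admin", "/search"]
print(in_scope("https://www.example.com/blog/post", allowed, excluded, 2, 3))
print(in_scope("https://other.example.org/blog", allowed, excluded, 1, 3))
```

Here `depth` is the number of link hops from the starting URL, so a page that is reachable only through four links is skipped when `max_depth` is 3.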

The XM Cloud SaaS solution’s Content Management role uses a Solr-based search instance, provisioned by the Sitecore SaaS solution, for internal search. This Solr-based search provider is not available for XM Cloud website search (front-end search). XM Cloud does not provide a Content Delivery role and uses Sitecore Experience Edge for content delivery instead, so the CM role search indexes cannot be used on the delivery side.

Content delivery in the XM Cloud SaaS solution happens through Sitecore Experience Edge, so you can build front-end applications in any tech stack, preferably with the Sitecore JSS Next.js SDK. Unlike traditional Sitecore, where the web indexes are updated on publish so that updated content is available for website search, you can't use the Solr search provider that is available to the CMS role for this purpose.

If you are using Sitecore XM Cloud to build your headless website, then you can utilise Sitecore’s SaaS based search provider, Sitecore Search, to index your website's data.

You can index data in the following ways:

  1. You can use Sitecore XM Cloud workflow notifications to detect which content has changed state and send the relevant data to Sitecore Search via a webhook.

    Sitecore XM Cloud provides three types of webhooks:

  2. You can utilise the Sitecore Search Advanced Web Crawler to crawl the data (or content) from your website.

  3. You can also utilise the Sitecore PowerShell Extensions (SPE) task scheduler to run scripts at a specific time. This means you can create an SPE script that checks the state of items and pushes those items to Sitecore Search using the Sitecore Search Ingestion API.

    💡
    Sitecore XM Cloud is a SaaS platform, so we need to avoid any custom code deployment to the Sitecore XM CMS instance.

    You can check out more details about Sitecore XM Cloud Search options at Sitecore XM Cloud Search Options
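
Whichever push mechanism you choose (webhook, publish event, or scheduled script), the sender ends up building one request per changed item. The sketch below shows that step only and makes no network call. The URL layout (`/domains/.../sources/.../entities/.../documents`), the `locale` query parameter, the `{"document": {"id", "fields"}}` payload, and the placeholder base URL are all assumptions for illustration; confirm the real endpoint, HTTP method, and authentication in the Sitecore Search Ingestion API documentation.

```python
import json
from urllib.parse import quote


def build_ingestion_request(base_url, domain_id, source_id, entity, doc_id, fields,
                            locale="en_us"):
    """Return (url, json_body) for pushing one document.

    URL layout and payload shape are assumptions, not a confirmed contract.
    """
    url = (
        f"{base_url}/domains/{quote(domain_id)}/sources/{quote(source_id)}"
        f"/entities/{quote(entity)}/documents?locale={quote(locale)}"
    )
    payload = {"document": {"id": doc_id, "fields": fields}}
    return url, json.dumps(payload)


push_url, push_body = build_ingestion_request(
    base_url="https://ingestion.example.invalid/v1",  # placeholder, not a real host
    domain_id="12345",
    source_id="src1",
    entity="content",
    doc_id="https://www.example.com/blog/post",
    fields={"title": "My post", "type": "blog"},
)
print(push_url)
print(push_body)
```

Keeping the request-building step separate from the HTTP call makes it easy to reuse from a webhook receiver or a scheduled job, and to test without touching the live index.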

Sitecore Search can also index PDF files using the Sitecore Search document extractors. Because document extractors only support parsing HTML or JSON content, you need to know the HTML structure that represents your PDF files in order to extract attributes from them.

You can check more details at Sitecore Search Indexing PDFs | Sitecore Documentation

Please find below some tips and best practices for effective indexing:

🎯 You can use a single source to crawl content from different domains by listing them in the Sitecore Search Web Crawler Settings > ALLOWED DOMAINS attribute

🎯 You can use the Sitecore Search Trigger settings to define multiple triggers with different URLs

🎯 Set MAX URLS to a number large enough to cover all the pages you expect the crawler to index

🎯 Within a single source, you can define multiple trigger types with different URLs in the Trigger settings, so that content is fetched from each URL defined there

🎯 Use different taggers to handle different sets of attributes within a single document extractor

🎯 Sitecore Search isn't a replacement for Solr in the Sitecore CMS. It is similar to other search providers such as Coveo, Algolia, or SearchStax: it is used to index content for end users and is not meant for Sitecore’s internal search (backend search)

🎯 The reusable Sitecore Search Starter Kit code base with Sitecore Search widgets is available on GitHub, and the same open-source Sitecore Search Starter Kit code base is hosted here so that you can validate the Sitecore Search widgets and functionalities

🎯 It's good practice to use the canonical URL as the ID to avoid duplicate items

🎯 A single document cannot be processed by more than one extractor. However, you can set up more than one extractor if each one uses the URLs to Match field to target separate documents.

Otherwise, the last configured document extractor "wins" and is the one that will be run.
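
The canonical-URL tip above can be illustrated with a small normalisation helper. This is a sketch of one possible normalisation (lowercase scheme and host, no trailing slash, no query string or fragment), under the assumption that those differences do not change the page content on your site; the helper name `canonical_id` is hypothetical, and in practice you would normally read the page's own canonical link instead.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_id(url):
    """Reduce URL variants of the same page to a single ID (illustrative rules)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # note: this also drops any port number
    path = parts.path.rstrip("/") or "/"   # treat /post and /post/ as the same page
    # Drop the query string and fragment so tracking parameters do not create duplicates.
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


variants = [
    "https://Example.com/blog/post/",
    "https://example.com/blog/post?utm_source=newsletter",
    "https://example.com/blog/post#comments",
]
unique_ids = {canonical_id(u) for u in variants}
print(unique_ids)
```

All three variants collapse to one ID, so the index holds one document for the page instead of three.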

🤷‍♂️Sitecore Search FAQ

  • Is Experience Edge compatible with Sitecore Search?

    Sitecore Experience Edge (XE) and Sitecore Search are separate Sitecore SaaS products with no dependency on each other, so the question of compatibility between them does not arise.

    You can check more details about Sitecore Experience Edge at Quickstart guide - All about Sitecore Experience Edge ~ Amit's Blog and for Sitecore Search at What is Sitecore Search?: A Definitive Introduction

  • Can I use Sitecore Search with Sitecore XM Cloud?

    Yes, you can use it to implement end-user search functionality, but not for Sitecore XM Cloud CMS internal search.

  • Can I use Solr to implement the search in Sitecore XM Cloud?

    The XM Cloud SaaS solution’s Content Management role uses a Solr-based search instance, provisioned by the Sitecore SaaS solution, for internal search.

    You can also set up your own Solr instance, ingest data into it for indexing, and use it to implement end-user search (front-end search) in the head application.

  • Can I use the Sitecore XM Cloud provided GraphQL (GQL) endpoint to implement the front-end search or end-user search?

    Yes, you can, but only for simple search, not for advanced search functionality such as dynamic facets, search recommendations, boosting, etc.

  • What type of capabilities do you need to validate while selecting the external search provider for Sitecore XM Cloud?

    While selecting an external search provider, you should look for the following capabilities: the required security compliance and certifications, easy integration, multilingual support, a good fit with your organization's tech stack, AI/ML capability for content boosting, the ability to index content from different content sources, out-of-the-box front-end search components, personalized search results, analytics reporting, and, last but not least, good customer support.
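
For the GraphQL question above, a simple search against the Experience Edge endpoint might be assembled as in the following sketch. The `search` field with `where`, `first`, `total`, and `results`, the `_path` predicate, and the `CONTAINS` operator are written from memory of the Experience Edge schema and should be treated as assumptions; the `title` field name and the root item ID are hypothetical. Verify the query in your own GraphQL IDE before using it.

```python
import json

# Assumed query shape; field and operator names must be checked against your schema.
SIMPLE_SEARCH_QUERY = """
query SimpleSearch($rootItemId: String!, $term: String!, $first: Int!) {
  search(
    where: {
      AND: [
        { name: "_path", value: $rootItemId, operator: CONTAINS }
        { name: "title", value: $term, operator: CONTAINS }
      ]
    }
    first: $first
  ) {
    total
    results { id name }
  }
}
"""


def build_graphql_payload(root_item_id, term, first=10):
    """Return the JSON-serialisable body for a GraphQL POST request."""
    return {
        "query": SIMPLE_SEARCH_QUERY,
        "variables": {"rootItemId": root_item_id, "term": term, "first": first},
    }


gql_payload = build_graphql_payload("{HYPOTHETICAL-ROOT-ITEM-ID}", "sitecore", first=5)
print(json.dumps(gql_payload["variables"]))
```

A query like this only filters and pages through items, which is why it suits simple use cases and not facets, recommendations, or boosting.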

💡Conclusion

In this blog, we discussed in detail the options available for indexing website content from Sitecore XP, Sitecore XM, and Sitecore XM Cloud (XMC).

We also covered the search options available in a Sitecore XM Cloud (XMC)-based implementation.

By indexing content with Sitecore Search, you can empower users to discover and retrieve relevant content effortlessly. Implementing robust indexing strategies for both website and document content ensures a seamless search experience that enhances user engagement and satisfaction.

In upcoming blog posts, I will try to explain the different types of content indexing options in detail.
