Introduction

Navigating complex documentation can feel like finding your way through a maze. But what if you could simply ask a question and get an instant, relevant answer? I set out to make this a reality, using TemporalIO's Java documentation as my playground. I used LangChain to tie together Streamlit's interactive UI, OpenAI's embeddings, and Pinecone's vector store into an intuitive Q&A bot. The bot takes your query and fetches the most relevant chunks of documentation in response. Follow along as we build it step by step, and see how you can implement a similar tool over any set of markdown documents.

This is a three-part series in which I break each step down into its own blog post. If you are only interested in the final product, I've provided the link to the repo at the end of each post. As a general overview, the three main steps that go into creating this Streamlit app are as follows:

Downloading and Formatting TemporalIO's Documentation
Chunking the Documentation with LangChain and Creating Embeddings with OpenAI
Crafting the Streamlit App - The Q&A Bot

Enough talk, let's diveeeee in!

1. Downloading and Formatting TemporalIO's Documentation

In this section, we will dive into the Python code responsible for downloading and formatting TemporalIO's Java documentation. The code leverages several libraries, including BeautifulSoup, selenium, and html2text, to efficiently extract and transform the documentation into markdown files.

Step A: Setting Up the Required Libraries

Before diving into the code, you need to ensure you have the following Python libraries installed:

requests
BeautifulSoup (installed as beautifulsoup4, imported as bs4)
selenium
html2text

You can typically install these via pip with:

pip install requests beautifulsoup4 selenium html2text
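
If you want to confirm the install worked before running the scraper, a small check like the one below can help. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are names I made up for illustration. Its main point is that the pip name and the import name are not always the same (beautifulsoup4 is imported as bs4).

```python
import importlib.util

# pip package name -> module name used in "import ..." (hypothetical helper data)
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "html2text": "html2text",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```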

Step B: Define the Target Domain and URL

domain = "docs.temporal.io/dev-guide/java"
full_url = "https://docs.temporal.io/dev-guide/java/"

Here, we specify domain, the host and path of the section we want to scrape, and full_url, the main landing page of TemporalIO's Java documentation.
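
One way a value like domain can be put to work is as a filter that keeps the crawler inside the Java guide. The sketch below is an assumption about usage, not code from the repo: `in_scope` is a hypothetical helper that resolves a link against the current page and checks whether it falls under domain.

```python
from urllib.parse import urljoin, urlparse

domain = "docs.temporal.io/dev-guide/java"


def in_scope(link: str, base_url: str, scope: str = domain) -> bool:
    """Return True if link (absolute or relative) resolves to a page under scope."""
    absolute = urljoin(base_url, link)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, and similar links
    host_and_path = parsed.netloc + parsed.path
    # Match the scope itself or anything below it, so that a sibling such as
    # ".../dev-guide/javascript" is not mistaken for part of ".../dev-guide/java".
    return host_and_path == scope or host_and_path.startswith(scope.rstrip("/") + "/")
```

For example, in_scope("/dev-guide/java/foundations", full_url) is accepted, while a link to another site or to a different guide is rejected.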

Step C: Initialize the Selenium Web Driver

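from selenium import webdriver

options = webdriver.ChromeOptions()
# options.headless was deprecated and then removed in newer Selenium 4 releases; pass the flag instead
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get(full_url)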

The above code initializes a headless Chrome browser using selenium. A headless browser is like a regular browser, but without a graphical user interface. This is useful for automated tasks and scripts, like web scraping.

Step D: Define the Web Crawling Function

The crawl() function is the heart of the operation. This function:

Initializes a queue to manage the URLs we want to crawl and a seen set to keep track of URLs we've already visited.
Creates directories (text/ and processed/) for storing the markdown files and processed CSV files respectively.
Pops a URL from the queue and navigates to the page using selenium.
Waits for the desired content to load using WebDriverWait.
Parses the loaded page using BeautifulSoup to extract the content within a specific div that contains the documentation text.
Converts this HTML content to Markdown using html2text and saves it to a file.
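
The queue-and-seen bookkeeping described above can be sketched as follows. This is a simplified outline under stated assumptions, not the code from the repo. The browser-specific work (loading the page with selenium, waiting with WebDriverWait, extracting the div with BeautifulSoup, converting with html2text) is hidden behind a hypothetical `fetch_page` callable that returns the page's markdown plus any links to follow. The `url_to_filename` helper and the `out_dir`/`processed_dir` parameters are also my own naming.

```python
import os
from collections import deque
from typing import Callable, Iterable, List, Tuple


def url_to_filename(url: str) -> str:
    """Turn a URL into a flat file name, e.g. docs.temporal.io_dev-guide_java.md."""
    return url.split("://", 1)[-1].strip("/").replace("/", "_") + ".md"


def crawl(
    start_url: str,
    fetch_page: Callable[[str], Tuple[str, Iterable[str]]],
    out_dir: str = "text",
    processed_dir: str = "processed",
) -> List[str]:
    """Visit start_url (and any links fetch_page returns) once each, saving markdown."""
    os.makedirs(out_dir, exist_ok=True)        # markdown files
    os.makedirs(processed_dir, exist_ok=True)  # processed CSV files (written in a later step)

    queue = deque([start_url])  # URLs still to visit
    seen = {start_url}          # URLs already queued or visited
    saved = []

    while queue:
        url = queue.popleft()
        markdown, links = fetch_page(url)

        path = os.path.join(out_dir, url_to_filename(url))
        with open(path, "w", encoding="utf-8") as f:
            f.write(markdown)
        saved.append(path)

        # Enqueue only links we have not seen, so no page is fetched twice.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return saved
```

If `fetch_page` returns an empty list of links, the loop processes only the starting page. Returning the in-scope links found on each page turns the same loop into a multi-page crawl.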

Step E: Execute the Crawl

Finally, we call the crawl() function, initiating the process:

crawl(full_url)

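Note: Selenium needs a ChromeDriver that matches your installed Chrome. Recent Selenium releases (4.6 and later) can download one automatically through Selenium Manager; with older versions, make sure ChromeDriver is on your PATH. You may also want to add error handling, or extend the crawl to follow links so that it fetches documentation from multiple pages.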
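
As one possible shape for that error handling, here is a minimal retry wrapper. It is a sketch, not part of the original script: `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands in for whatever function loads a page. With selenium you would likely narrow the except clause to `TimeoutException` or `WebDriverException` from `selenium.common.exceptions`.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    delay: float = 1.0,
) -> Optional[str]:
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            print(f"Attempt {attempt}/{retries} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    return None
```

Returning None instead of raising lets the crawl skip a page that keeps failing and continue with the rest of the queue.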

Conclusion

We've covered how to crawl one of TemporalIO's Java documentation pages to create a markdown file. Next, we will be covering how to chunk and embed these documents using OpenAI, LangChain, and Pinecone. If you're interested in the full code, check out my GitHub Repo and give it a star!

Part 1/3: Creating a Markdown Q&A ChatBot with LangChain, OpenAI, and Pinecone: A Step-by-Step Guide