Let Computers Talk to Computers

Ben Kadish

Older computer systems required humans to adapt their behavior to computers. To search a news index, instead of describing the articles you were interested in, you might have to write a Boolean keyword search: ((“disco” OR “bell bottoms”) AND “Ronald Reagan”).

GenAI allows a much richer ‘do what I mean’ experience: a user can simply ask for “news articles about dance crazes in the 1980s.”

But what happens when your GenAI system has to dumb itself down to integrate with these older systems?

In a recent project we were identifying journalists who might be interested in writing about specific petitions. We were doing this by looking for articles that covered topics aligned with each petition.

The problem with this approach is that many news article search indexes were built in the era of transistor radios and neon sunglasses. If the right term doesn’t appear in the query, a perfectly relevant article won’t turn up.

For example, if a petition focuses on cell phone usage at a local middle school, you might search for articles within 50 kilometers of San Francisco containing the keyword “school.” However, this method would fail in a couple of ways:

  1. Irrelevant Results: An article about Cal Football could still contain the word “school,” making it show up in search results even though it’s unrelated to the petition.

  2. Missing Relevant Articles: Some articles that actually discuss cell phone policies might not explicitly include the word “school.”
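
Both failure modes are easy to reproduce. The sketch below is only an illustration: the two article snippets are invented, and `keyword_search` is a hypothetical stand-in for a plain substring match, not the behavior of any particular news index.

```python
# Toy article snippets, invented for illustration only.
articles = {
    "cal-football": "Cal Football opens the season as the school celebrates a new stadium.",
    "phone-ban": "District trustees vote to restrict student cell phones in classrooms.",
}


def keyword_search(docs, keyword):
    """Return the ids of documents whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]


hits = keyword_search(articles, "school")
print(hits)
```

The football article matches because it happens to contain the word, while the article about the phone policy never mentions “school” and is missed.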

A GenAI search index, obviously, would ‘understand’ the meaning of the search, and find much better results. But these articles are behind a brick wall built in the 1980s. How can we find good keywords to use for search?

One technique is to use a library like newspaper, which extracts topic keywords from text, and then combine those keywords into a Boolean search. For many articles, however, this didn’t yield relevant results, despite careful massaging.
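
As a rough sketch of the keyword route, the snippet below turns a list of extracted keywords into a Boolean OR query. The keyword list is hand-written here and stands in for whatever a keyword extractor would return, and `build_or_query` is a hypothetical helper, not part of any library.

```python
def build_or_query(keywords):
    """Join keywords into one parenthesized OR query, quoting each term."""
    seen = set()
    terms = []
    for kw in keywords:
        # Drop stray quotes and whitespace so the generated query stays well formed.
        cleaned = kw.strip().replace('"', "")
        # Skip empty strings and case-insensitive duplicates.
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            terms.append(f'"{cleaned}"')
    return "(" + " OR ".join(terms) + ")" if terms else ""


# Assumed output of a keyword extractor for the cell phone petition.
keywords = ["cell phone", "school", "students", "School"]
query = build_or_query(keywords)
print(query)
```

A query built this way inherits both problems described earlier: it is only as good as the individual words the extractor happens to pick.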

To address this, we used an LLM to generate Boolean search queries directly from the petition text. For instance, for a petition about unbanning cell phones in school, it generated the following query:

("education policy" OR "technology in education" OR "student rights" OR "school administration" OR "parental communication")
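
A minimal sketch of this step might look like the following. Everything here is an assumption made for illustration: the prompt wording is not the one used in the project, `complete` is a hypothetical callable that wraps whichever LLM client you use, and `fake_llm` is a stub so the example runs without network access.

```python
from typing import Callable

# Hypothetical prompt, written for this sketch.
GENERATE_QUERY_PROMPT = (
    "Write one Boolean search query (quoted phrases joined with OR/AND) that "
    "would find news articles related to this petition.\n\n"
    "Petition: {petition_text}\n\n"
    "Return only the query string."
)


def is_well_formed(query: str) -> bool:
    """Cheap sanity check: non-empty, balanced parentheses, even number of quotes."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(query.strip()) and depth == 0 and query.count('"') % 2 == 0


def generate_query(petition_text: str, complete: Callable[[str], str]) -> str:
    """Ask the LLM for a Boolean query and reject obviously malformed output."""
    prompt = GENERATE_QUERY_PROMPT.format(petition_text=petition_text)
    query = complete(prompt).strip()
    if not is_well_formed(query):
        raise ValueError(f"LLM returned a malformed query: {query!r}")
    return query


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return '("education policy" OR "student rights")'


print(generate_query("Allow cell phones at our middle school", fake_llm))
```

Passing the LLM in as a callable keeps the sketch independent of any vendor SDK, and the well-formedness check catches replies that would be rejected by a strict Boolean parser.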

The articles this approach found were almost all very relevant, but there were often very few of them. Rather than guess what a system built when disco was a recent memory might prefer, I set up an LLM to experiment until it found a good approach.

I started with the queries generated by the above approach, then used a prompt to iterate toward more results. It looked something like this:

REFINE_BOOLEAN_QUERY_PROMPT = """
We have a database of news articles and we're trying to find more articles. The current query isn't finding enough results.

Original Query: {original_query}
Results Found With Original Query: {num_results} articles
Number of relevant articles found from that query: {relevant_articles}

A thing to note here is that we want a higher percentage of relevant articles to be found.

Your task is to generate a broader, more inclusive search query that will find more articles while maintaining relevance.

Pick one of these strategies:

1. Break down compound terms and use OR
   Example: "climate change" -> ("climate change" OR (climate AND change))

2. Add common synonyms and related terms
   Example: "student" -> ("student" OR "students" OR "pupil" OR "learner")

3. Use broader categorical terms
   Example: "Tesla electric cars" -> ("Tesla" OR "electric vehicle" OR "EV" OR "automotive")

4. Extract key concepts and search them independently
   Example: "renewable energy policy" -> ("renewable" OR "sustainable") AND ("energy" OR "power")

IMPORTANT:
- Always use parentheses to group related terms
- Keep quotes around exact phrases
- Use OR between similar terms and AND between different concepts
- Make the query significantly broader than the original

Return only the new boolean query string with proper operators and formatting.
"""

We applied this process iteratively, with limits on result quality and on the number of iterations.
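
The loop can be sketched roughly as follows. This is a sketch under stated assumptions rather than the project's code: `search`, `is_relevant`, and `complete` are hypothetical callables for the news index, the relevance check, and the LLM; the thresholds are made-up defaults; and `REFINE_PROMPT` is a shortened stand-in for the refinement prompt shown above.

```python
from typing import Callable, List

# Shortened stand-in for the refinement prompt, with the same placeholders.
REFINE_PROMPT = (
    "Original Query: {original_query}\n"
    "Results Found With Original Query: {num_results} articles\n"
    "Number of relevant articles found from that query: {relevant_articles}\n"
    "Return only a broader boolean query string."
)


def refine_until_enough(
    query: str,
    search: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
    complete: Callable[[str], str],
    min_relevant: int = 5,       # assumed target number of relevant articles
    min_precision: float = 0.5,  # assumed quality limit
    max_iterations: int = 4,     # assumed iteration limit
) -> str:
    """Broaden the query until it finds enough relevant articles or hits a limit."""
    best_query, best_relevant = query, -1
    for _ in range(max_iterations):
        results = search(query)
        relevant = sum(1 for article in results if is_relevant(article))
        # Quality limit: stop if the broadened query mostly returns noise.
        if results and relevant / len(results) < min_precision:
            break
        if relevant > best_relevant:
            best_query, best_relevant = query, relevant
        if relevant >= min_relevant:
            break
        # Ask the LLM for a broader query, feeding back what the last one found.
        query = complete(REFINE_PROMPT.format(
            original_query=query,
            num_results=len(results),
            relevant_articles=relevant,
        )).strip()
    return best_query


# Stubs so the sketch runs offline: a two-entry "index" and a canned LLM reply.
index = {
    '("student rights")': ["a1"],
    '("student rights" OR "students" OR "pupil")': ["a1", "a2", "a3", "x9"],
}
broader = '("student rights" OR "students" OR "pupil")'
final = refine_until_enough(
    '("student rights")',
    search=lambda q: index.get(q, []),
    is_relevant=lambda article_id: article_id.startswith("a"),
    complete=lambda prompt: broader,
    min_relevant=3,
)
print(final)
```

Tracking the best query seen so far means a later, noisier broadening never replaces an earlier query that worked better.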

Result? We could generate an arbitrary number of good results without having to tease out our hair or put on rollerblades.
