Uncovering Semantic Relationships with the Universal Sentence Encoder

Harvey Ducay

As the amount of text data we interact with on a daily basis continues to grow, the ability to quickly identify meaningful connections between pieces of information becomes increasingly valuable. This is where semantic similarity models can be incredibly useful, by capturing the underlying meaning of text, rather than just looking at surface-level similarities.

One powerful tool for building semantic similarity models is the Universal Sentence Encoder, provided by the TensorFlow Hub library. In this article, I'll walk through how you can leverage this pre-trained model to uncover interesting relationships in your own text data.

Getting Started with the Universal Sentence Encoder

The Universal Sentence Encoder is a machine learning model that has been trained on a large corpus of text to produce high-quality vector representations of sentences and phrases. These vector embeddings encode the semantic meaning of the input, allowing you to easily compare the relatedness of different pieces of text.

To get started, you'll first need to import the necessary libraries:

from absl import logging

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

With the imports set up, you can then load the pre-trained Universal Sentence Encoder model from TensorFlow Hub:

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
# Alternative: "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
print("module %s loaded" % module_url)

# Wrap the model in a small helper that maps a list of strings to embeddings.
def embed(texts):
  return model(texts)

The model can now encode your text data into semantic embeddings, which you can use to compute similarity scores. The embed helper defined above simply passes a list of strings to the model and returns one 512-dimensional vector per input.
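
As a quick sanity check, here is a minimal sketch (reusing two sentences from the example set later in this article) that confirms each input comes back as a 512-dimensional vector:

# Each input sentence is mapped to a single 512-dimensional vector.
sample_embeddings = embed(["I like my phone", "Will it snow tomorrow?"])
print(sample_embeddings.shape)  # (2, 512)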

Next, we define two helper functions for plotting. Visualizing the pairwise similarity scores as a heatmap makes it easy to see which sentences the model considers related:

def plot_similarity(labels, features, rotation):
  # USE embeddings are approximately unit-length, so the inner product
  # between two of them is effectively their cosine similarity.
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  # Embed the messages, then plot their pairwise similarity matrix.
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

Computing Semantic Similarity

Let's say you have a dataset of customer messages, and you want to identify which messages are discussing similar topics. You can use the Universal Sentence Encoder to generate embeddings for each message, and then calculate the pairwise cosine similarity between those embeddings.

The resulting similarity matrix contains values roughly between 0 and 1: scores near 1 indicate that two messages are semantically almost identical, while scores near 0 indicate they are essentially unrelated.

You can then use this matrix to cluster messages, identify outliers, or visualize the semantic relationships between your data points.
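
If you want the similarity matrix itself rather than a plot, a minimal sketch looks like this. The customer messages here are hypothetical placeholders for your own data, and the explicit normalization guarantees the inner product is exactly the cosine similarity:

# Hypothetical customer messages; substitute your own dataset.
customer_messages = [
    "My order arrived late.",
    "The delivery took longer than promised.",
    "How do I reset my password?",
]

embeddings = embed(customer_messages).numpy()

# Normalize rows to unit length so the inner product is the cosine similarity.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = np.inner(normalized, normalized)

print(np.round(similarity_matrix, 2))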

Exploring and Evaluating the Model

One way to get a better understanding of how the Universal Sentence Encoder is capturing semantic meaning is to examine the similarity scores for a few sample messages. For example:

messages = [
    # Smartphones
    "I like my phone",
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",
    "Global warming is real",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
    "Is paleo better than keto?",

    # Asking about age
    "How old are you?",
    "what is your age?",
]

run_and_plot(messages)

By reviewing these examples, you can start to get a sense of how well the model is performing and where it may be struggling. Additionally, you can manually review high and low similarity pairs to further evaluate the model's effectiveness for your specific use case.
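
One way to automate that review is to rank every pair of messages by score. The helper below is my own addition rather than part of the TensorFlow Hub tutorial, but it only reuses the embed function defined earlier:

# Rank all message pairs by similarity for manual inspection.
def top_pairs(messages_, k=3):
  embeddings_ = embed(messages_)
  sims = np.inner(embeddings_, embeddings_)
  pairs = sorted(
      (sims[i][j], messages_[i], messages_[j])
      for i in range(len(messages_))
      for j in range(i + 1, len(messages_)))
  print("Least similar:")
  for score, a, b in pairs[:k]:
    print("  %.2f | %s | %s" % (score, a, b))
  print("Most similar:")
  for score, a, b in pairs[-k:]:
    print("  %.2f | %s | %s" % (score, a, b))

top_pairs(messages)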

Conclusion

The Universal Sentence Encoder is a powerful tool that can help unlock the semantic insights hidden within your text data. By leveraging this pre-trained model, you can quickly generate high-quality vector representations of your content and uncover meaningful relationships that would be difficult to spot through manual review alone.

Of course, as with any machine learning model, it's important to carefully evaluate its performance and understand its limitations. But with a bit of experimentation and exploration, you can harness the power of semantic similarity to drive valuable discoveries in your data.


P.S. Let's Build Something Cool Together!

Drowning in data? Pipelines giving you a headache? I've been there – and I actually enjoy fixing these things. I'm that data engineer who:

- Makes ETL pipelines behave
- Turns data warehouse chaos into zen
- Gets ML models from laptop to production

If you found this post interesting, connect with me on LinkedIn and make sure to leave a message!
