Removing Redundant User Highlights Efficiently

How did I come across this?

Recently I started working on a project titled Kindle-Clippings: A tool that makes organizing, accessing and using kindle highlights efficient and easy.
While developing the tool, I came across the way a kindle device stores user highlights.
- The good thing, it takes the highlight, it’s page number, loc(line of content) and timestamp information.
- The bad thing, it stores all the highlights, and edits to the highlights.
- This means if you highlight once, change the highlight, or edit it. All the different versions of highlights will be saved.

This would result in user highlights containing redundant, half-highlighted quotes and data. Bad user experience XC.

Problem Statement : Remove Redundant Highlights from user highlights file

Now, I had to figure out how to remove the highlights while saving backend resources as I was limited by compute power on the render instance I was using.
I came up with a couple of ideas
- Checking each highlight with other highlights
- Doing a plag-check and removing similar ones
- checking overlapping intervals and adding a time constraint

Solution 1 - Check each Highlight with other highlights

This approach is very naive.
The second I thought this, I knew it wasn’t going to work.
Given I have on average let’s say 3000 highlights, I would be processing 9000000 highlights in total (n*n).
This time complexity was a big no.
But we got 1 improvement in this stage, instead of comparing each highlight with each other, what we will be doing is compare each highlight with other highlights for the same book!
- This would drop the time to O(n * m²) where M was smaller

Solution 2 - Doing a Plagiarism Check and removing similar highlights

The roots of this idea go to a professor from our college who used to plag-check our assignments.
- This works for 180-200 students, but checking for each highlight with the other even in the same book is time consuming as the previous approach mentioned.
- To reduce time complexity even more, I set a location limit for the plag-check by
  - sorting the quotes by start location
  - plagiarism checking only within a fixed range of location around highlights own location. i.e. highlight.location ± 5 .
  - If the similarityScore was above a certain threshold value, the highlight would be considered duplicate and will be removed.
- The problem with this approach were
  - Deciding the similarityScore
    - too less and we would have false positive getting valuable knowledge lost
    - too more and we would have false negatives, adding redundant highlights, leading to a bad user experience.
    - There was still considerable time being spent on processing all the quotes against each other.

Solution 3 - Overlapping Sub-intervals

Then it hit me, this seemed like a leetcode problem.

I had location values, which was a string, it could either be a string number, or a string range.

I parsed the value into start and end

  function parseLocation(location) {
      if (!location) return { start: 0, end: 0 };
      if (typeof location === 'number') return { start: location, end: location };
      const parts = location.split('-').map(x => parseInt(x.trim(), 10));
      if (parts.length === 2 && !isNaN(parts[0]) && !isNaN(parts[1])) {
          return { start: parts[0], end: parts[1] };
      }
      if (parts.length === 1 && !isNaN(parts[0])) {
          return { start: parts[0], end: parts[0] };
      }
      return { start: 0, end: 0 };
  }

This gave me the start and end line number for every quote.

Now, most; if not all the times, what users do is mess their highlights up by a word or two missing or additional one being highlighted.
- When this happens the user highlights the content again till they finally get it right.
- This gave me the first big assumption, temporal value i.e. timestamp field was valuable.
  - If I could get all the highlights done by the user around a single intended highlight. I could decide the winner here by choosing the latest timestamp value.

Now figuring out which highlights are same and which are different.

This is where the problem of overlapping sub intervals came in.

      // sort highlights by locStart here
      highlights.sort((a, b) => a.locStart - b.locStart);

      let current = highlights[0];
      for (let i = 1; i < highlights.length; i++) {
          const next = highlights[i];
          if (current.locEnd >= next.locStart) {
              // Overlapping or touching intervals, keep the latest one out of the two
              let currentTime = new Date(current.timestamp).getTime();
              let nextTime = new Date(next.timestamp).getTime();
              if (nextTime > currentTime) {
                  // If next highlight is more recent, update current and forget about it ;)
                  current = next
              }
              // current.locEnd = Math.max(current.locEnd, next.locEnd);
          } else {
              // No overlap, push the current highlight and move to the next
              solution.push(current);
              current = next;
          }
      }
      return solution;

Here, we start by sorting the highlights for the current book based on their start locations. O(n*mlogm)
Now, the algorithm finds if at any point the current-highlight being compared is longer or equal to the next one.
- Checking for overlap here. If at any point next→start is less than current→end; it means next is starting within current.
- And just like that we found an overlapping interval.
  - This is where the leetcode algorithm differs from the requirement here.
  - In the leetcode algorithm you would go on to merge the interval
  - But here, we would instead update the current highlight with whichever one is the latest among the two being compared.
  - If the current and the next highlight differs and does not overlap, we save the current highlight in the result and move on to the next highlight.

Accepting the 3rd Solution

A major reason for accepting the 3rd solution is it’s time complexity.
- That small difference of m*logm instead of m² in O(n*m*logm) can lead to a world of difference when the data grows.
But there are more reasons for sticking with the third solution such as:
- Relying on a similarityScore can be dangerous as it would lead to false-positives and false-negatives.
- The overlapping intervals is easy to read and understand than plagiarism checker giving higher maintainability and supporting better for the current use case.

Real-time results

For the first check, the location based approach dropped a total of 1710 highlights (27% of total highlights)

It dropped the size of the final archive file from 1.8MB to 1.5MB, a 17% drop.

Thoughts

The reason I am writing this would be to remind you, and myself in the future that leetcode problems have meaning when you can remember them out of context of the leetcode.
If you are developing anything similar this might help you, if it helps I am happy ;)
If you have read this article so far, do try out the website https://kindle-clippings.pages.dev/.
Want to read more, unpublished blogs of mine? books | computer Science

Overlapping intervals to remove Redundant User Highlights