How A Niche Algorithm Helped Me Build My App
What was the problem?
Throughout my college courses, I learned a lot about various algorithms. However, I never got the opportunity to examine any business or practical use case for them. Thus, when I discovered an opportunity to use an algorithm for a speciality case in my very own app, I was excited.
A problem I was encountering involved taking the data from the Google Civic API to create Official entities in my app, and saving information from other APIs to the relevant Official objects. For example, if I log in as a user with a Georgia address and create an Official object for Jon Ossoff, how do I get info about Jon Ossoff from the OpenSecrets API, which doesn't use an address to determine which elected official to retrieve info for? To answer that question, I needed a way to uniquely identify elected officials in my app. This has (unsurprisingly) been solved already.
The Open Civic Data Identifier
Hashnode, I present to you the OCD ID! To quote the Center for Tech & Civic Life:
"OCD ID's are semi-hierarchical, human and machine-readable unique identifiers for political geographies. OCD-IDs help eliminate problems caused by agencies, organizations, or systems using slightly different colloquial names to refer to the same geography (e.g. North Carolina’s 9th Congressional District, NC-9, NC09, etc.). OCD-IDs can be used for political geographies as large as countries or as small as precinct splits, and everything in between."
(If you want to learn more from the Center for Tech & Civic Life, here is a link to a PDF that goes into more depth.)
In the case of LegisLink, OCD IDs are given to every federally elected official. For example, Jon Ossoff's is ocd-person/4e48da38-17ab-5580-bced-2ea00b9b2843. As it turns out, a number of APIs and applications use OCD IDs for their core functionality, and LegisLink will be no different. Unfortunately, while many APIs supply OCD IDs for officials in one way or another, the Google Civic Information API does not. So the question is: how do I find the OCD IDs for the elected officials returned from the Google Civic API calls when the user logs in to the app?
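To make the "human and machine-readable" part concrete, here is a minimal sketch of splitting an OCD identifier into its type and its path segments. OCDIdentifier is a hypothetical helper I am introducing for illustration; it is not part of LegisLink or of any library.

```swift
/// A parsed Open Civic Data identifier such as "ocd-person/<uuid>".
/// Hypothetical helper type, shown only to illustrate the ID's shape.
struct OCDIdentifier: Equatable {
    let kind: String        // "person", "jurisdiction", "division", ...
    let segments: [String]  // everything after the type, split on "/"

    init?(_ raw: String) {
        let parts = raw.split(separator: "/").map(String.init)
        // A usable ID needs the "ocd-<kind>" head plus at least one segment.
        guard let head = parts.first, head.hasPrefix("ocd-"), parts.count > 1 else {
            return nil
        }
        kind = String(head.dropFirst("ocd-".count))
        segments = Array(parts.dropFirst())
    }
}

let person = OCDIdentifier("ocd-person/4e48da38-17ab-5580-bced-2ea00b9b2843")
let jurisdiction = OCDIdentifier("ocd-jurisdiction/country:us/government")
```

A person ID has a single opaque segment (a UUID), while geography-style IDs carry a readable hierarchy such as country:us.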
The issue at hand
Now that I had a better understanding of what I could do to stitch together information about elected officials from different APIs, I needed to figure out how to go from having little more than their name to getting their OCD ID. To start, you can check out my first blog post, My Exciting And Harrowing Journey With The Google Civic Information API, to learn more about what exactly the Google Civic Information API gives me with each call. Long story short, it does not return an OCD ID for the officials themselves. That is where the algorithm would come in handy.
Yet Another Open Source Solution
Ultimately, I knew the starting point was the part of my application that gets the info from the Google Civic Info API, and the end goal was acquiring the OCD ID. I needed some way to cross-reference the personally identifying info I was receiving with some assortment of OCD IDs. The best I could come up with regarding which info to use from the API was the names of congresspeople. Okay, that's a start. But where would I get the OCD IDs? After some time perusing the internet, I found this open-source repository.
The Open States People repository is a GitHub repo "[which] contains YAML files with official information on state legislators, governors, and some municipal leaders." In short, the Open States People repo houses a trove of YAML files corresponding to each individual member of Congress, past and present! Not only that, but the data is incredibly thorough and very valuable, and happens to contain exactly what I need: an OCD ID.
Additionally, it is based on a different GitHub repo that also houses similar data, called congress-legislators. Despite the name of the GitHub organization that hosts it (unitedstates), that repo is a community-maintained, public-domain project rather than an official government one. Still neat! Now that I had an idea of where I could get this data from, you might be wondering what the rest of the data provided by the repo actually looks like. Here is a sample file I grabbed for a random congressman:
id: ocd-person/8ea538b9-6af5-5683-8b4b-9061f89691c1
name: Emanuel Cleaver
given_name: Emanuel
family_name: Cleaver
gender: M
birth_date: 1944-10-26
image: https://theunitedstates.io/images/congress/450x550/C001061.jpg
party:
- start_date: '2005-01-04'
  end_date: '2025-01-03'
  name: Democratic
roles:
- start_date: '2005-01-04'
  end_date: '2007-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2007-01-04'
  end_date: '2009-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2009-01-06'
  end_date: '2011-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2011-01-05'
  end_date: '2013-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2013-01-03'
  end_date: '2015-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2015-01-06'
  end_date: '2017-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2017-01-03'
  end_date: '2019-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2019-01-03'
  end_date: '2021-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2021-01-03'
  end_date: '2023-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
- start_date: '2023-01-03'
  end_date: '2025-01-03'
  type: lower
  jurisdiction: ocd-jurisdiction/country:us/government
  district: MO-5
offices:
- classification: capitol
  address: 2217 Rayburn House Office Building Washington DC 20515-2505
  voice: 202-225-4535
- classification: district
  address: 1923 Main St; Higginsville, MO 64037
  voice: 660-584-7373
  fax: 660-584-7227
  name: 'District Office #1'
- classification: district
  address: 411 W Maple Ave F; Independence, MO 64050-2840
  voice: 816-833-4545
  fax: 816-833-2991
  name: 'District Office #2'
- classification: district
  address: 101 W. 31 St.; Kansas City, MO 64108
  voice: 816-842-4545
  fax: 816-471-5215
  name: 'District Office #3'
links:
- url: https://cleaver.house.gov
  note: website
ids:
  twitter: RepCleaver
  youtube: UCVC-7CLFjXczoihr5FxDLsw
  facebook: emanuelcleaverii
other_identifiers:
- scheme: bioguide
  identifier: C001061
- scheme: thomas
  identifier: 01790
- scheme: govtrack
  identifier: '400639'
- scheme: opensecrets
  identifier: N00026790
- scheme: votesmart
  identifier: '39507'
- scheme: fec
  identifier: H4MO05234
- scheme: cspan
  identifier: '10933'
- scheme: wikipedia
  identifier: Emanuel Cleaver
- scheme: house_history
  identifier: '11786'
- scheme: maplight
  identifier: '645'
- scheme: icpsr
  identifier: '20517'
- scheme: wikidata
  identifier: Q1334654
- scheme: google_entity_id
  identifier: kg:/m/04cbbm
- scheme: ballotpedia
  identifier: Emanuel Cleaver
sources:
- url: https://theunitedstates.io/
The file has everything from contact information and party membership to office locations and a host of other identifiers from third parties, like C-SPAN and VoteSmart. I would eventually come to use this trove of data for other parts of the app as well. Of course, for this particular problem the most important bit of information I need is the first line in the YAML file, the id. Not only that, but their names are also in the files. Maybe I could use that to match the names from the Google Civic Info API to the correct YAML file? This was perfect, and the next few steps became pretty clear.
Entering the maze
Now I had my starting point as well as another piece of the puzzle: the full name of a congressperson provided by the Google Civic Information API, and a file that I could parse to grab the OCD ID for that very congressperson. Fantastic! The next step was figuring out how to get this data into the project. Admittedly, I chose a low-tech solution for this problem: I downloaded the folder containing the YAML files for the current Congress from the repo and loaded it into the Xcode project locally. In the future, this data should probably be stored outside of the app (perhaps in the cloud) and downloaded/cached when needed. Additionally, there will need to be some mechanism for automatically updating my copy of the data whenever the parent repository is updated. For now, this will do. The folder is named "US Congress Directory" and will be the resting place for data on all 535 members of the United States Congress.
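For reference, collecting the files from a folder like that can be sketched as follows. yamlFilePaths(in:) is a hypothetical helper, not code from the app, and it assumes the legislator files use the .yml extension.

```swift
import Foundation

/// Returns the full paths of the `.yml` files directly inside a folder,
/// sorted by file name. Hypothetical helper for illustration only.
func yamlFilePaths(in folderPath: String) throws -> [String] {
    let folderURL = URL(fileURLWithPath: folderPath, isDirectory: true)
    return try FileManager.default
        .contentsOfDirectory(atPath: folderPath)      // file names only, no paths
        .filter { $0.hasSuffix(".yml") }               // skip anything that is not YAML
        .sorted()
        .map { folderURL.appendingPathComponent($0).path }
}
```

In an app, the folder path would likely come from the bundle, for example Bundle.main.path(forResource: "US Congress Directory", ofType: nil), assuming the folder was added to the project as a folder reference.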
Okay, great! Now I have the data. I can now clearly identify the following baby steps to ultimately solve this problem:
1. Access the data in the YAML files for a particular congressperson, essentially using a string representing their name in the app to match with the correct file.
2. Once I can identify which file to use, parse the YAML into something Swift can understand.
3. After Swifty-ing the YAML file, attach the OCD ID (and other data) to the correct congressperson and their Swift-y representation in my app.
I must admit, at this point I felt like a real software developer. Here I was, staring at a complex problem that needed some sort of algorithmic solution.
Feeling warm and fuzzy inside
I quickly realized this problem would be much more difficult than simply matching strings, because the names in the YAML files could be ever-so-slightly different from the names given to me by the Google Civic Info API. For example, Senator Raphael Warnock is referred to as "Raphael Warnock" on Google's side, while in the YAML file from OpenStates he is referred to as "Raphael G. Warnock". This means there could be any number of discrepancies in the names that could throw off my efforts, as Senator Warnock is far from the only person who has such a difference. How can I overcome this problem? I'm trying to use inconsistent, non-unique values to find unique items.
In going back to the drawing board, I realized I didn't need the name from Google's API and the name in the YAML file to be exactly the same; I just needed to be able to determine that they are close enough. So I started doing some research, and I found a particular technique that could do just the trick: Fuzzy Search!
Of course, you might be wondering what that is. Fuzzy Search "is a technique that uses search algorithms to find strings that match patterns approximately." Once I realized I could use a Fuzzy Search algorithm, I set out on finding a Fuzzy Search Swift package, which I found! Additionally, I found another Swift package called Yams that will help me work with the YAML files.
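To give a feel for what "approximately" can mean, here is a minimal sketch of one classic measure, the Levenshtein edit distance. This is a simpler stand-in that I am using for illustration, not necessarily the algorithm the fuzzy search package uses internally, and the function names are my own.

```swift
/// Levenshtein edit distance: the minimum number of single-character
/// insertions, deletions, or substitutions needed to turn `a` into `b`.
/// Comparison is case-insensitive.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a.lowercased()), t = Array(b.lowercased())
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    // `previous` holds the distances from the first i-1 characters of s
    // to every prefix of t.
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}

/// Edit distance scaled to 0...1, where 0 means the strings are identical.
func normalizedDistance(_ a: String, _ b: String) -> Double {
    let longest = max(a.count, b.count)
    return longest == 0 ? 0 : Double(editDistance(a, b)) / Double(longest)
}
```

"Raphael Warnock" and "Raphael G. Warnock" differ by only three inserted characters ("G", ".", and a space), so they land much closer together than two unrelated names would.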
With an algorithm in mind and packages I could use in my app to implement it, I set out to do just that.
Fuzzy yams and Swifty shenanigans
Time for some code! Woo!
First, we need to load the YAML files into iterable Swift objects. That way, we can loop through the list and search for the correct information using the response from the Google Civic Info API. Here is some of the code from the function that loads the data from the YAML files:
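for item in items {
    // Build the full path to this YAML file and read it as text.
    let filePath = folderPath! + item
    let contents = try String(contentsOfFile: filePath, encoding: .utf8)
    // Decode the YAML text into the app's CandidateYAML model using Yams.
    let decoder = YAMLDecoder()
    let decoded = try decoder.decode(CandidateYAML.self, from: contents)
    membersOfCongress.append(decoded)
    ...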
We loop through the files in the folder (a.k.a. each item in items), and for each file we pass its content to the YAMLDecoder() object provided by Yams. The content is then transformed into a custom CandidateYAML object (referred to locally in the for-loop as decoded), which is just the Swift-y representation of a YAML file. This object is then added to an array called membersOfCongress. The membersOfCongress array is the Swift-y representation of the entire list of congresspeople. This allows me to easily access their properties.
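The CandidateYAML type itself is not shown in this post, so here is a minimal sketch of what a Codable model for a subset of the file might look like. LegislatorRecord and decodeRecord(fromJSON:) are hypothetical names, and the sketch decodes a JSON stand-in so that it runs without a YAML package.

```swift
import Foundation

/// Hypothetical model of a subset of one legislator file. The app's real
/// CandidateYAML type may look different.
struct LegislatorRecord: Codable {
    struct OtherIdentifier: Codable {
        let scheme: String
        let identifier: String
    }

    let id: String
    let name: String
    let otherIdentifiers: [OtherIdentifier]

    // Map the snake_case key in the file to a camelCase Swift property.
    enum CodingKeys: String, CodingKey {
        case id, name
        case otherIdentifiers = "other_identifiers"
    }
}

// JSON stand-in for the YAML, trimmed to the fields modeled above.
let sampleJSON = """
{
  "id": "ocd-person/8ea538b9-6af5-5683-8b4b-9061f89691c1",
  "name": "Emanuel Cleaver",
  "other_identifiers": [
    { "scheme": "bioguide", "identifier": "C001061" },
    { "scheme": "opensecrets", "identifier": "N00026790" }
  ]
}
"""

/// Decodes one record from JSON text (a stand-in for the YAML decoding step).
func decodeRecord(fromJSON json: String) throws -> LegislatorRecord {
    try JSONDecoder().decode(LegislatorRecord.self, from: Data(json.utf8))
}
```

Because the type is plain Codable, the same struct could be handed to the YAMLDecoder call used in the loop above. Keys that the struct does not declare (roles, offices, and so on) are simply not read.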
Next, we're going to talk about a very specific function, which I titled bundleOpenStatesData. This function is responsible for binding the data from the correct YAML file for a user's congressperson to the actual in-app object that represents a user's official. Its first action is calling the function discussed above that creates an iterable list representing the YAML files. Next, it passes this list to a function I created, appropriately titled fuzzySearch. We'll go over that next.
First, I save the user's three congresspeople to an array called currentLawmakers (dramatic, I know). The rest of the code is presented below:
var ocdIDArray = [String]()
var best_score = 0.0
var best_match_id = ""
let removeCharacters: Set<Character> = [".", ","]
let fuse = Fuse()
for var lawmaker in currentLawmakers {
    // Start each lawmaker with a score far above any real result; lower is better.
    best_score = 150.0
    // Reset the ID so a lawmaker with no match does not inherit the previous one's.
    best_match_id = ""
    lawmaker.name.removeAll(where: { removeCharacters.contains($0) })
    let lawmakerPattern = fuse.createPattern(from: lawmaker.name)
    candidates.forEach {
        // search returns an optional, so unwrap it before comparing scores.
        if let result = fuse.search(lawmakerPattern, in: $0.name), result.score < best_score {
            best_score = result.score
            best_match_id = $0.id
        }
    }
    ocdIDArray.append(best_match_id)
}
return ocdIDArray
The ocdIDArray is going to be the list of OCD IDs that this function will return, which we will then save to their appropriate congressperson. The best_score and best_match_id variables are initialized with arbitrary placeholder values, and removeCharacters is a set of characters I am opting to remove from each lawmaker's name before doing anything else, for the sake of simplicity. Then I declare a Fuse object, which allows me to access the Fuzzy Search package's features.
Next, I start a loop that goes through the array containing a user's elected officials. For each official in that list, we do the following:
1. The best_score variable is a way of assessing how similar two strings are. The closer to 0, the more similar they are. For each lawmaker, I am making it obscenely high to start.
2. I remove the unwanted characters from the current lawmaker's name.
3. Next, I pass the lawmaker's name to the createPattern function from Fuse, which helps improve performance.
4. Now, we start looping through the master list of congresspeople we passed in earlier.
5. The fuse.search line compares the lawmaker's name to that of the current entry in the YAML list, and saves the outcome to the result variable.
6. The score from that result is then compared to best_score. If it is lower (i.e. closer to 0), then we save it as the new best_score and save the OCD ID from that YAML data to the best_match_id variable. If the score is not lower, we keep going.
7. The OCD ID of the closest-matching lawmaker from the list of YAML data is then saved to the ocdIDArray variable.
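The steps above boil down to "score every candidate, keep the lowest score". Here is a self-contained sketch of that pattern that runs without Fuse. The word-overlap score is a deliberately simple stand-in of my own and not what Fuse computes; the threshold parameter is an extra safeguard that is not in the original function, and the IDs in roster are placeholders.

```swift
/// Lowercases a name, strips "." and ",", and splits it into a set of words.
func nameTokens(_ name: String) -> Set<String> {
    let cleaned = name.lowercased().filter { $0 != "." && $0 != "," }
    return Set(cleaned.split(separator: " ").map(String.init))
}

/// Stand-in score: 0 when both names share every word, 1 when they share none
/// (one minus the Jaccard similarity of the two word sets).
func nameScore(_ a: String, _ b: String) -> Double {
    let x = nameTokens(a), y = nameTokens(b)
    let union = x.union(y)
    guard !union.isEmpty else { return 1 }
    return 1 - Double(x.intersection(y).count) / Double(union.count)
}

/// Returns the id of the lowest-scoring candidate, or nil when even the best
/// candidate scores above `threshold`.
func bestMatchID(for name: String,
                 in candidates: [(id: String, name: String)],
                 threshold: Double = 0.5) -> String? {
    let scored = candidates.map { (id: $0.id, score: nameScore(name, $0.name)) }
    guard let best = scored.min(by: { $0.score < $1.score }), best.score <= threshold else {
        return nil
    }
    return best.id
}

// Placeholder IDs, not real OCD IDs.
let roster = [
    (id: "ocd-person/example-1", name: "Raphael G. Warnock"),
    (id: "ocd-person/example-2", name: "Jon Ossoff"),
]
```

Returning nil instead of an empty string makes the "no good match" case explicit, so the caller has to decide what to do when a name cannot be matched.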
After this point, the list of OCD IDs is returned to the function that called fuzzySearch() and the OCD IDs are saved to the user's congresspeople. Then, the list of YAML data is parsed again using the OCD IDs and other data is saved to the user's congresspeople. Here is some of the code:
for candidate in candidates {
    if senatorOne.ocdID == candidate.id {
        for other_id in candidate.otherIdentifiers {
            if other_id.scheme == "bioguide" {
                senatorOne.bioguideID = other_id.identifier
            } else if other_id.scheme == "govtrack" {
                senatorOne.govtrackID = other_id.identifier
            } else if other_id.scheme == "opensecrets" {
                senatorOne.opensecretsID = other_id.identifier
            } else if other_id.scheme == "votesmart" {
                senatorOne.votesmartID = other_id.identifier
            } else if other_id.scheme == "fec" {
                senatorOne.fecID = other_id.identifier
            } else if other_id.scheme == "wikidata" {
                senatorOne.wikidataID = other_id.identifier
            } else {
                continue
            }
        }
    }
}
After this code is run, we have finally accomplished our goal: OCD IDs and various other identifiers are now saved to the objects representing the user's congresspeople.
Final thoughts
First, thank you so much for reading my most in-depth and technical piece on this blog thus far. It was a blast writing it and I hope you took something away from reading it!
This has been the first time I have really reviewed this code in months since writing it, and I can see a few improvements I could make. Namely, fixing variable names to be consistently camelCase. Next, I could also refactor the for-loop code for grabbing the other IDs to be its own function, as I have 3 in total for all 3 politicians.
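As a rough idea of what that second refactor could look like, here is a minimal sketch that replaces the if/else chain with a lookup table. Official and OtherIdentifier are hypothetical stand-ins reduced to the fields used here, not the app's real types, and applyIdentifiers is a name I made up.

```swift
/// Hypothetical stand-ins for the app's types, reduced to the fields used here.
struct OtherIdentifier {
    let scheme: String
    let identifier: String
}

struct Official {
    var ocdID: String
    var bioguideID: String?
    var govtrackID: String?
    var opensecretsID: String?
    var votesmartID: String?
    var fecID: String?
    var wikidataID: String?
}

/// Copies the third-party identifiers the app cares about onto `official`.
/// Schemes that are not in the table are ignored.
func applyIdentifiers(_ identifiers: [OtherIdentifier], to official: inout Official) {
    // One table replaces the if/else chain: scheme name -> property to fill.
    let targets: [String: WritableKeyPath<Official, String?>] = [
        "bioguide": \Official.bioguideID,
        "govtrack": \Official.govtrackID,
        "opensecrets": \Official.opensecretsID,
        "votesmart": \Official.votesmartID,
        "fec": \Official.fecID,
        "wikidata": \Official.wikidataID,
    ]
    for entry in identifiers {
        if let keyPath = targets[entry.scheme] {
            official[keyPath: keyPath] = entry.identifier
        }
    }
}
```

With something like this, each of the three loops would shrink to a single call per matching candidate, assuming the officials are stored as mutable values of one shared type.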
Honestly, I am really glad I was able to re-examine some complex code after months of not having to read it. I had to review the Yams and Fuse documentation along the way, but I consider it a positive that my code is still decently readable many months after initially writing it.
Until next time, friends!
Written by
Mason Cochran
Hi! I am a software developer hailing from Atlanta, Georgia! While I presently work in the insurance industry, I am keenly interested in how we can leverage technology to improve civic knowledge and engagement!