Preface

After watching the first lecture on Search and organizing my own notes, I now have a solid understanding of the course content. It’s time to try a problem set to reinforce what I’ve learned.

Problem Set 0: Degrees

Requirements Analysis

The problem provides some data for analysis, including actor names, movie titles, and which movies each actor has appeared in.

The goal is to find the shortest connection chain between two actors through shared movie collaborations.

For example, if Actor A and Actor C have never worked together, but A and B collaborated in Movie 1, and B and C collaborated in Movie 2, then the degrees of separation between Actor A and Actor C is 2:

Actor A → Movie 1 → Actor B → Movie 2 → Actor C

We can define this as a search problem: our states are people, and our actions are movies, which take us from one actor to another (a single movie can indeed connect to multiple actors, but that works fine for this problem).

The initial state and goal state are defined by the two actors we are trying to connect. By using Breadth-First Search (BFS), we can find the shortest path from one actor to another. Yes, finding the shortest path requires BFS.

Data Source

The CSV data to be analyzed has been provided:

people.csv contains information about the actors.

id,name,birth
102,"Kevin Bacon",1958
129,"Tom Cruise",1962
144,"Cary Elwes",1962
158,"Tom Hanks",1956
1597,"Mandy Patinkin",1952
163,"Dustin Hoffman",1937
1697,"Chris Sarandon",1942
193,"Demi Moore",1962
197,"Jack Nicholson",1937
200,"Bill Paxton",1955
398,"Sally Field",1946
420,"Valeria Golino",1965
596520,"Gerald R. Molen",1935
641,"Gary Sinise",1955
705,"Robin Wright",1966
914612,"Emma Watson",1990

movies.csv contains information about the movies.

id,title,year
112384,"Apollo 13",1995
104257,"A Few Good Men",1992
109830,"Forrest Gump",1994
93779,"The Princess Bride",1987
95953,"Rain Man",1988

stars.csv records which actors appeared in which movies.

person_id,movie_id
102,104257
102,112384
129,104257
129,95953
144,93779
158,109830
158,112384
1597,93779
163,95953
1697,93779
193,104257
197,104257
200,112384
398,109830
420,95953
596520,95953
641,109830
641,112384
705,109830
705,93779

Data Structure Design

We need to read the above CSV data into Python and construct data structures that allow for fast lookups. To achieve this, we create three dictionaries:

# Maps names to a set of corresponding person_ids
names = { name.lower(): set_of_person_ids }

# Maps person_ids to a dictionary of: name, birth, movies (a set of movie_ids)
people = { person_id: {"name": name, "birth": birth, "movies": set_of_movie_ids} }

# Maps movie_ids to a dictionary of: title, year, stars (a set of person_ids)
movies = { movie_id: {"title": title, "year": year, "stars": set_of_person_ids} }

Using a person’s name, we can find the corresponding ID(s). Since multiple people can have the same name, this returns a set of IDs, but each individual has a unique person_id.
Using a person_id, we can find all the movies that person has appeared in.
Using a movie_id, we can find all the person_id of actors who starred in that movie.

Problem Abstraction

This problem can be viewed as a graph search:

Nodes: contain an actor’s person_id and a reference to their parent node.
Frontier: are movie_ids, representing the movie through which the node was reached; different actors are connected via different movies.
Goal: find the shortest path from the source actor to the target actor. To achieve this, we need:
- Use BFS to ensure the shortest path is found.
- Each node should store its parent information to allow backtracking and reconstructing the path.

Input & Output

Input example:

Actor Name：A, Actor Name：C

Output example:

Degrees of separation: 2 1: Actor A and Actor B starred in Movie 1 2: Actor B and Actor C starred in Movie 2

Code Implementation

The code is as follows. There isn’t much to explain as the comments are quite clear; the core functionality is implemented in the shortest_path function.

util.py

# Each node stores the person_id, a reference to its parent node, and the action 
# (i.e., the movie_id through which this node was reached).
class Node():
    def __init__(self, state, parent, action):
        self.state = state
        self.parent = parent
        self.action = action

class StackFrontier():
    def __init__(self):
        self.frontier = []

    def add(self, node):
        self.frontier.append(node)

    def contains_state(self, state):
        return any(node.state == state for node in self.frontier)

    def empty(self):
        return len(self.frontier) == 0

    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        else:
            node = self.frontier[-1]
            self.frontier = self.frontier[:-1]
            return node

# The BFS algorithm uses a queue, and nodes are removed 
# following the First-In-First-Out (FIFO) principle.
class QueueFrontier(StackFrontier):
    def remove(self):
        if self.empty():
            raise Exception("empty frontier")
        else:
            node = self.frontier[0]
            self.frontier = self.frontier[1:]
            return node

degrees.py

import csv
import sys

from util import Node, StackFrontier, QueueFrontier

# Maps names to a set of corresponding person_ids
names = {}

# Maps person_ids to a dictionary of: name, birth, movies (a set of movie_ids)
people = {}

# Maps movie_ids to a dictionary of: title, year, stars (a set of person_ids)
movies = {}

"""
Read the provided CSV files and extract the information into the dictionaries we created.
Each person_id maps to all the movies that person has appeared in.
Each movie_id maps to all actors who starred in that movie.
The names dictionary handles duplicate names: multiple people can have the same name, 
but each person_id is unique.
"""
def load_data(directory):
    """
    Load data from CSV files into memory.
    """
    # Load people
    with open(f"{directory}/people.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            people[row["id"]] = {
                "name": row["name"],
                "birth": row["birth"],
                "movies": set()
            }
            if row["name"].lower() not in names:
                names[row["name"].lower()] = {row["id"]}
            else:
                names[row["name"].lower()].add(row["id"])

    # Load movies
    with open(f"{directory}/movies.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            movies[row["id"]] = {
                "title": row["title"],
                "year": row["year"],
                "stars": set()
            }

    # Load stars
    with open(f"{directory}/stars.csv", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                people[row["person_id"]]["movies"].add(row["movie_id"])
                movies[row["movie_id"]]["stars"].add(row["person_id"])
            except KeyError:
                pass

def main():
    if len(sys.argv) > 2:
        sys.exit("Usage: python degrees.py [directory]")
    directory = sys.argv[1] if len(sys.argv) == 2 else "large"

    # Load data from files into memory
    print("Loading data...")
    load_data(directory)
    print("Data loaded.")

    source = person_id_for_name(input("Name: "))
    if source is None:
        sys.exit("Person not found.")
    target = person_id_for_name(input("Name: "))
    if target is None:
        sys.exit("Person not found.")

    path = shortest_path(source, target)

    if path is None:
        print("Not connected.")
    else:
        degrees = len(path)
        print(f"{degrees} degrees of separation.")
        path = [(None, source)] + path
        for i in range(degrees):
            person1 = people[path[i][1]]["name"]
            person2 = people[path[i + 1][1]]["name"]
            movie = movies[path[i + 1][0]]["title"]
            print(f"{i + 1}: {person1} and {person2} starred in {movie}")

def create_path(node, movie_id=None, person_id=None):
    """
    Already found the connections, then build the path from source to target.
    Return path
    """
    path = []
    if movie_id is not None and person_id is not None:
        path.append((movie_id, person_id))

    while node.parent is not None:
        path.append((node.action, node.state))
        node = node.parent
    path.reverse()
    return path

def shortest_path(source, target):
    """
    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.

    If no possible path, returns None.
    """

    start = Node(state=source, parent=None, action=None)
    frontier = QueueFrontier()
    frontier.add(start)

    explored = set()

    while True:
        if frontier.empty():
            return None
        node = frontier.remove()

        if node.state == target:
            return create_path(node)

        explored.add(node.state)

        neighbors = neighbors_for_person(node.state)
        for movie_id, person_id in neighbors:
            if person_id == target:
                return create_path(node, movie_id, person_id)
            if person_id not in explored and not frontier.contains_state(person_id):
                child = Node(state=person_id, parent=node, action=movie_id)
                frontier.add(child)
    return None

def person_id_for_name(name):
    """
    Returns the IMDB id for a person's name,
    resolving ambiguities as needed.
    """
    person_ids = list(names.get(name.lower(), set()))
    if len(person_ids) == 0:
        return None
    elif len(person_ids) > 1:
        print(f"Which '{name}'?")
        for person_id in person_ids:
            person = people[person_id]
            name = person["name"]
            birth = person["birth"]
            print(f"ID: {person_id}, Name: {name}, Birth: {birth}")
        try:
            person_id = input("Intended Person ID: ")
            if person_id in person_ids:
                return person_id
        except ValueError:
            pass
        return None
    else:
        return person_ids[0]

# Find all neighboring nodes through a given person_id.
def neighbors_for_person(person_id):
    """
    Returns (movie_id, person_id) pairs for people
    who starred with a given person.
    """
    movie_ids = people[person_id]["movies"]
    neighbors = set()
    for movie_id in movie_ids:
        for person_id in movies[movie_id]["stars"]:
            neighbors.add((movie_id, person_id))
    return neighbors

if __name__ == "__main__":
    main()

Afterthoughts

This project is not particularly difficult and follows the same logic as the Maze example explained in the video lectures. The framework is already set up, and you only need to implement the core shortest_path function. Once you understand the Maze code, completing degrees becomes much easier. The specific code syntax is not the key here—the important part is grasping the search algorithm’s logic.

Next, I will take on the challenge of the next search algorithm: adversarial search.

CS50 AI with Python – Lecture 0: Search – Problem Set 0: Degrees