CS50 AI with Python – Lecture 2: Uncertainty – Problem Set 0: PageRank


Preface
This lesson was difficult to understand because it involves probability concepts, which are quite abstract, and the formulas are numerous. I spent quite a bit of time organizing notes to clarify the core concepts, though I didn’t fully master some of the deeper points. So when I started the assignment, I didn’t have much confidence. That’s okay—just start, and solve problems as they arise.
Assignment Requirements
Implement a core idea similar to Google’s early search algorithm to measure webpage importance. The importance of a page depends not only on the number of people visiting it but also on the importance of pages linking to it.
Goal: Calculate the PageRank of a set of webpages, representing their relative importance.
Input: A collection of webpages, each possibly containing links to other pages.
Output: PageRank values for each page, representing the probability that a random surfer would land on that page.
Problem Analysis
Some pages may have no outgoing links. To ensure the surfer can always reach any page in the set, we introduce a damping factor d. With probability d (typically 0.85), the random surfer follows a link from the current page. With probability 1 - d, the surfer randomly jumps to any page in the corpus (including the current page).
Random Surfer Model:
The surfer clicks on links at random.
- With probability d, they follow an outgoing link, distributed evenly across all links.
- With probability 1 - d, they jump randomly to any page.
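As a quick sanity check, the two rules above can be worked through by hand on a small made-up corpus (the file names and numbers below are just for illustration, not part of the assignment's distribution code):

```python
# Toy illustration of one random-surfer step with d = 0.85 and 3 pages.
damping = 0.85
corpus = {
    "1.html": {"2.html", "3.html"},
    "2.html": {"3.html"},
    "3.html": {"2.html"},
}
page = "1.html"

# Every page gets the random-jump share: (1 - d) / N = 0.15 / 3 = 0.05
dist = {p: (1 - damping) / len(corpus) for p in corpus}

# The pages linked from the current page split the remaining d evenly:
# 0.85 / 2 = 0.425 added to each of "2.html" and "3.html"
for linked in corpus[page]:
    dist[linked] += damping / len(corpus[page])

print({p: round(v, 3) for p, v in dist.items()})
# {'1.html': 0.05, '2.html': 0.475, '3.html': 0.475}
```

Note that the probabilities always sum to 1, since the d and 1 - d shares are each distributed completely.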
We need to compute PageRank using both sampling and iterative formula methods.
Transition Model
The transition_model function should return a dictionary representing the probability distribution of the next page a random surfer will visit, given the corpus, the current page, and the damping factor.
The algorithm follows the Random Surfer Model described above.
👉 Sampling Method
The sample_pagerank function takes three parameters: the corpus of webpages (corpus), the damping factor (damping_factor), and the number of samples (n).
Formula: PageRank = P(page) = (number of times the page is visited in the samples) / (total number of samples)
Procedure:
1. Randomly select an initial page to visit.
2. Use transition_model to determine the probability distribution of the next page.
3. Randomly choose the next page according to this probability distribution.
4. Increment the visit counter for the selected page.
5. Repeat steps 2–4 until the desired number of samples is reached.
6. Calculate the PageRank for each page based on the sample counts.
flowchart TD
A[Randomly select initial page] --> B[Loop for n samples]
B --> C[Call transition_model to generate probability distribution]
C --> D[Select next page based on distribution and increment counter]
D --> B
B --> E[Compute PageRank = visit_count / n]
E --> F[Return pageRank dictionary]
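The loop in the flowchart can be sketched end to end on a toy corpus. The corpus and the helper `step_distribution` below are inventions for this sketch (the real assignment uses `transition_model` and the crawled corpus):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

damping, n = 0.85, 10000
corpus = {
    "1.html": {"2.html", "3.html"},
    "2.html": {"3.html"},
    "3.html": {"2.html"},
}

def step_distribution(corpus, page, d):
    """Next-page distribution for the random surfer (same idea as transition_model)."""
    dist = {p: (1 - d) / len(corpus) for p in corpus}
    links = corpus[page] or set(corpus)  # dangling page: treat as linking everywhere
    for linked in links:
        dist[linked] += d / len(links)
    return dist

page = random.choice(list(corpus))   # step 1: random start page
counts = {p: 0 for p in corpus}
counts[page] += 1
for _ in range(n - 1):               # steps 2-4, repeated
    dist = step_distribution(corpus, page, damping)
    page = random.choices(list(dist), weights=dist.values())[0]
    counts[page] += 1

# PageRank estimate = visit count / total samples
rank = {p: c / n for p, c in counts.items()}
print({p: round(r, 3) for p, r in rank.items()})
```

With 10,000 samples the estimates settle near the true values for this corpus (about 0.05 for page 1 and 0.475 for pages 2 and 3), though any single run carries sampling noise.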
👉 Iterative Formula Method
The iterate_pagerank function takes the webpage corpus (corpus) and the damping factor (damping_factor), and computes PageRank iteratively until the change in PageRank for each page is less than 0.001.
Formula:
PR(p) = (1 - d) / N + d * Σ_{i ∈ M(p)} PR(i) / NumLinks(i)
where:
d = damping_factor
N = total number of pages
M(p) = set of pages linking to p
NumLinks(i) = number of outgoing links on page i
[!caution]
For pages with no outgoing links, assume they link to all pages (including themselves).
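A minimal sketch of what this convention means in practice, using a made-up two-page corpus (file names are hypothetical):

```python
# A dangling page (no outgoing links) is treated as if it linked to
# every page in the corpus, itself included.
corpus = {
    "a.html": {"b.html"},
    "b.html": set(),          # dangling page
}

effective_links = {
    page: links if links else set(corpus)   # empty set -> all pages
    for page, links in corpus.items()
}

print(sorted(effective_links["b.html"]))  # ['a.html', 'b.html']
```

This keeps the total probability mass at 1: a dangling page's rank is redistributed evenly instead of leaking out of the system.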
Algorithm Analysis:
PageRank PR(p) represents the probability that a random surfer eventually lands on page p. There are two ways to reach p:
1. Random jump to page p
   - Probability: 1 - d
   - Equally distributed across all pages: (1 - d) / N
2. Arrival via links from other pages
   - Probability: d
   - For each page i linking to p:
     - PageRank of i: PR(i)
     - Number of outgoing links: NumLinks(i)
     - Probability of moving from i to p: PR(i) / NumLinks(i)
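To make the two routes concrete, here is one hand-checked update for a made-up three-page corpus, starting from a uniform rank (the corpus is invented for this note):

```python
# One update of the iterative formula, combining the two routes above.
d, N = 0.85, 3
corpus = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
rank = {p: 1 / N for p in corpus}   # start from a uniform distribution

# Only C links to A, and C has one outgoing link, so:
# PR(A) = (1 - d)/N + d * PR(C)/NumLinks(C)
#       = 0.15/3   + 0.85 * (1/3)/1
new_A = (1 - d) / N + d * rank["C"] / len(corpus["C"])
print(round(new_A, 4))  # 0.3333
```

Repeating this update for every page, then replacing the old ranks with the new ones, converges toward the stationary PageRank values.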
Code Implementation
import os
import random
import re
import sys

DAMPING = 0.85
SAMPLES = 10000


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pagerank.py corpus")
    corpus = crawl(sys.argv[1])
    ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
    print(f"PageRank Results from Sampling (n = {SAMPLES})")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")
    ranks = iterate_pagerank(corpus, DAMPING)
    print("PageRank Results from Iteration")
    for page in sorted(ranks):
        print(f"  {page}: {ranks[page]:.4f}")
def crawl(directory):
    """
    Parse a directory of HTML pages and check for links to other pages.
    Return a dictionary where each key is a page, and values are
    a list of all other pages in the corpus that are linked to by the page.
    """
    pages = dict()

    # Extract all links from HTML files
    for filename in os.listdir(directory):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(directory, filename)) as f:
            contents = f.read()
            links = re.findall(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"", contents)
            # Exclude links from a page to itself
            pages[filename] = set(links) - {filename}

    # Only include links to other pages in the corpus
    for filename in pages:
        pages[filename] = set(
            link for link in pages[filename]
            if link in pages
        )

    return pages
def transition_model(corpus, page, damping_factor):
    """
    Return a probability distribution over which page to visit next,
    given a current page.

    With probability `damping_factor`, choose a link at random
    linked to by `page`. With probability `1 - damping_factor`, choose
    a link at random chosen from all pages in the corpus.
    """
    prob_dist = {}

    # Initialize every page with the random-jump probability
    for a in corpus:
        prob_dist[a] = (1 - damping_factor) / len(corpus)

    # Probability is evenly distributed among all outgoing pages
    if corpus[page]:
        for p in corpus[page]:
            prob_dist[p] += damping_factor / len(corpus[page])
    # No outgoing links from the page: pretend it links to all pages
    else:
        for p in corpus:
            prob_dist[p] += damping_factor / len(corpus)

    return prob_dist
def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    # Pick a random page to start; it counts as the first of the n samples
    page = random.choice(list(corpus.keys()))
    pageCount = {page: 1}
    pageRank = {}

    # Initialize each page's rank to 0
    for p in corpus:
        pageRank[p] = 0

    for i in range(n - 1):
        # Generate a sample from the transition model
        sample = transition_model(corpus, page, damping_factor)
        # Randomly choose a page based on the sample's probability distribution
        possiblePage = random.choices(list(sample.keys()), list(sample.values()))[0]
        # Accumulate the count of pages visited
        if possiblePage not in pageCount:
            pageCount[possiblePage] = 1
        else:
            pageCount[possiblePage] += 1
        # Move to the next page
        page = possiblePage

    # Calculate PageRank according to the visit counters
    for p in pageCount:
        pageRank[p] = pageCount[p] / n

    return pageRank
def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    pageRank = {}
    delta = 0.001

    # Initialize each page's rank evenly
    for p in corpus:
        pageRank[p] = 1 / len(corpus)

    # PR(p) = (1 - damping_factor) / N + damping_factor * sum over q in M(p) of PR(q) / L(q)
    """e.g. There are 3 pages: A, B, C
    {
        "A.html": {"B.html", "C.html"},
        "B.html": {"C.html"},
        "C.html": {"A.html"}
    }
    Only C links to A, so:
    PR(A) = ((1 - damping_factor) / 3) + damping_factor * (PR(C) / L(C))
    L(q) is the number of outgoing links from page q.
    """
    while True:
        newRank = {}
        for p in corpus:
            newRank[p] = (1 - damping_factor) / len(corpus)
            for links in corpus:
                # If a page has no outgoing links, pretend it links to all pages (including itself)
                if not corpus[links]:
                    newRank[p] += damping_factor * pageRank[links] / len(corpus)
                # If a page links to p, add its contribution per the formula
                elif p in corpus[links]:
                    newRank[p] += damping_factor * pageRank[links] / len(corpus[links])
        # If the change is less than the threshold 0.001 for all pages, stop and return
        if all(abs(newRank[k] - pageRank[k]) < delta for k in newRank):
            return newRank
        pageRank = newRank.copy()


if __name__ == "__main__":
    main()
Reflection
Although the course content involves abstract probability theory and formulas, implementing the code itself is not too difficult. The key to the assignment is carefully reading the problem statement and then translating the formulas into code step by step. Understanding the model assumptions and calculation logic is harder than programming. If the requirements are not fully understood at the start, it’s easy to go off track.
Written by Shichun Min