Generative AI for learning by quizzing

Vishnu S

Research has shown that being quizzed on material and actively engaging with it is one of the most effective ways to learn and retain it. But that doesn’t mean that all quizzes are created equal. There is a “Goldilocks” zone of sorts for question difficulty, where questions are neither so easy that you can answer them without thinking, nor so hard that you are demotivated to even attempt them.

That sweet spot of difficulty keeps learners on their toes, coming back for more questions, which will ultimately help their understanding of the material.

The building blocks of such a system are:

  • Generating questions of different difficulty levels

  • Retrieval-augmented generation (RAG) for fetching relevant subject matter

  • Using graphs to track subject-matter coverage and concept hierarchies

Generating questions from a text corpus involves a rich interplay of natural language processing (NLP), large language models (LLMs), and graph-based methods. This process can be tailored to produce various types of questions—factual, inferential, multiple-choice, and open-ended—depending on the intended use case, such as educational assessment, conversational AI, or content enrichment.

The process begins with preprocessing the corpus. Text is segmented into sentences and paragraphs, cleaned to remove noise, and analyzed to identify key entities and concepts using techniques like Named Entity Recognition (NER), coreference resolution, and keyword extraction. These steps help isolate the most informative parts of the text.

Next, candidate sentences are selected based on their semantic richness. Sentences containing definitions, causal relationships, or important facts are prioritized using scoring methods such as TF-IDF or contextual embeddings from models like BERT. These sentences serve as the foundation for question generation.
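As a sketch of this scoring step, here is a minimal, dependency-free TF-IDF ranking of candidate sentences. It stands in for what a production system would do with scikit-learn's `TfidfVectorizer` or contextual embeddings; the sentences are illustrative:

```python
import math
import re
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the mean TF-IDF weight of its terms.

    Each sentence is treated as its own 'document'; higher scores
    suggest sentences with more distinctive vocabulary.
    """
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        weight = sum(
            (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(weight / len(tf))
    return scores

sentences = [
    "The Eiffel Tower is in Paris.",
    "Paris is in France.",
    "The tower is tall.",
]
scores = tfidf_sentence_scores(sentences)
```

Sentences can then be sorted by score and the top few passed on to the question-generation stage.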

For factual and multiple-choice questions, several approaches can be used. Rule-based systems apply syntactic parsing to extract subject-verb-object structures and transform them into questions using predefined templates. For example, from “The Eiffel Tower is in Paris,” a rule might generate “Where is the Eiffel Tower located?” Neural models like T5 and BART, fine-tuned for question generation, can produce more flexible and abstractive questions from input passages. Additionally, models trained on datasets like SQuAD can extract question-answer pairs directly from text. Multiple-choice questions require an additional step: generating distractors. These can be selected using semantic similarity measures (e.g., WordNet or embedding distances) or entity type matching to ensure plausible alternatives.
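The rule-based route can be illustrated with a toy template. The sketch below assumes a single hypothetical pattern, “&lt;subject&gt; is in &lt;place&gt;”; a real system would rely on a full syntactic parse rather than a regex:

```python
import re

def location_question(sentence):
    """Turn a simple '<subject> is in <place>' statement into a question.

    A toy rule-based generator: one regex, one template.
    Returns None when the sentence does not match the pattern.
    """
    m = re.match(
        r"(?:The\s+)?(.+?)\s+is\s+(?:located\s+)?in\s+(.+?)\.?$",
        sentence,
    )
    if not m:
        return None
    subject = m.group(1)
    return f"Where is the {subject} located?"

q = location_question("The Eiffel Tower is in Paris.")
# q == "Where is the Eiffel Tower located?"
```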

Open-ended question generation, however, demands a deeper semantic understanding. These questions aim to elicit reasoning, interpretation, or synthesis rather than recall. To generate them, one must first identify conceptual anchors—themes, causal links, or contrasting ideas—using topic modeling (e.g., LDA or BERTopic), semantic clustering, or graph-based representations. Concept graphs, where nodes represent ideas and edges represent relationships, help pinpoint areas suitable for deeper inquiry.

LLMs play a central role in crafting open-ended questions. By prompting models like GPT-4 or T5 with context-rich passages and directives such as “Generate a question that encourages critical thinking,” one can produce questions that invite analysis or reflection. Few-shot prompting, where examples of open-ended questions are provided, can further guide the model’s output. For instance, from a passage on climate change, the model might generate “How might local communities adapt to the long-term effects of rising sea levels?”—a question that encourages exploration of implications and strategies.
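The few-shot prompting step can be sketched as plain string assembly. The code below only builds the prompt, since the actual model call depends on the provider's API; the example passages and questions are invented for illustration:

```python
# Invented examples that steer the model toward analysis, not recall.
FEW_SHOT_EXAMPLES = [
    ("Photosynthesis converts sunlight into chemical energy.",
     "Why might ecosystems collapse if photosynthesis slowed dramatically?"),
    ("The printing press spread literacy across Europe.",
     "How did cheaper books change who held political power?"),
]

def build_open_ended_prompt(passage):
    """Assemble a few-shot prompt for open-ended question generation.

    The returned string can be sent to any chat-completion endpoint;
    the examples demonstrate the desired question style.
    """
    parts = ["Generate a question that encourages critical thinking.\n"]
    for ctx, question in FEW_SHOT_EXAMPLES:
        parts.append(f"Passage: {ctx}\nQuestion: {question}\n")
    parts.append(f"Passage: {passage}\nQuestion:")
    return "\n".join(parts)

prompt = build_open_ended_prompt(
    "Rising sea levels threaten coastal communities worldwide."
)
```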

Graph-based methods can also be used to refine and diversify the question set. By constructing graphs of generated questions and analyzing their semantic similarity, one can cluster and prune redundant questions. Centrality measures help identify questions that touch on key ideas or bridge different concepts, ensuring broad and meaningful coverage.
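One way to sketch the pruning step is with word-overlap (Jaccard) similarity as a cheap stand-in for embedding distance; the 0.5 threshold is an arbitrary choice for illustration:

```python
import re

def _tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two questions (0 to 1)."""
    wa, wb = _tokens(a), _tokens(b)
    return len(wa & wb) / len(wa | wb)

def prune_redundant(questions, threshold=0.5):
    """Greedy pruning: keep a question only if it is not too similar
    to any question already kept."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

questions = [
    "What causes rising sea levels?",
    "What causes sea levels to rise?",   # near-duplicate, gets pruned
    "How do communities adapt to flooding?",
]
unique = prune_redundant(questions)
```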

Finally, generated questions are evaluated and ranked based on clarity, relevance, and depth. Readability metrics, semantic alignment with the source text, and classification based on Bloom’s taxonomy can be used to assess the quality of each question.

In practice, this integrated approach allows for scalable and context-aware question generation. For example, given a corpus of science articles, one could extract factual sentences, use T5 to generate “What” or “Why” questions, rank them by relevance, and optionally generate distractors for multiple-choice formats. For open-ended questions, concept graphs and LLMs can be used to produce prompts that encourage critical thinking and discussion.

Identifying the difficulty level of questions

The original idea was to pair each question in the set with every other question and judge whether question A was tougher than question B. The tougher question in each pair would earn a point, and the questions could ultimately be ranked by the points they accumulated. However, even for a small set of 100 questions, that amounts to \(\binom{100}{2} = 4{,}950\) pairwise comparisons, which quickly becomes impractical to label by hand.
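The quadratic growth of exhaustive pairwise comparison is easy to check with the standard library:

```python
import math

# Number of unordered pairs among n items: n * (n - 1) / 2.
pairs_100 = math.comb(100, 2)    # 4950 comparisons for 100 questions
pairs_1000 = math.comb(1000, 2)  # 499500 for 1000 questions
```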

The Bradley-Terry Model

The Bradley-Terry model is one way to rank questions based on pairwise comparisons of their difficulty. Suppose you have a set of items (e.g., questions) \(Q_1, Q_2, Q_3, \dots, Q_n\). The Bradley-Terry model assigns each item a difficulty score \(\theta_i\). The probability that item \(i\) is judged more difficult than item \(j\) is:

$$P(i \text{ beats } j) = \frac{\theta_i}{\theta_i + \theta_j}$$

Where:

  • \(\theta_i > 0\) is the latent difficulty of item \(i\)

  • The model assumes that comparisons are independent and based only on the relative scores.
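A minimal sketch of fitting Bradley-Terry scores from a win-count matrix, using the classic minorization-maximization (MM) update; the win counts below are invented for illustration:

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry scores with the MM update
    theta_i <- W_i / sum_j n_ij / (theta_i + theta_j).

    wins[i][j] = times item i was judged harder than item j.
    Returns scores normalized to sum to 1; higher = more difficult.
    Assumes the comparison graph is connected.
    """
    n = len(wins)
    theta = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total "wins" for item i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (theta[i] + theta[j])
                for j in range(n) if j != i
            )
            new.append(w_i / denom if denom > 0 else theta[i])
        total = sum(new)
        theta = [t / total for t in new]
    return theta

# Item 0 usually beats item 1, item 1 usually beats item 2.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
scores = fit_bradley_terry(wins)
# Expect scores[0] > scores[1] > scores[2]
```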

The Thurstone Model

The Thurstone model, specifically the Thurstone Case V model, is another approach for ranking items based on pairwise comparisons, similar in spirit to the Bradley-Terry model but based on psychometric theory and Gaussian assumptions.

Thurstone's model assumes that each item (e.g., question) has a latent difficulty value drawn from a normal distribution. When comparing two items, the probability that one is judged more difficult than the other depends on the difference in their latent values.

Let:

  • \(\delta_i\) be the latent difficulty of item \(i\)

  • \(\delta_j\) be the latent difficulty of item \(j\)

Then the probability that item \(i\) is judged more difficult than item \(j\) is:

$$P(i > j) = \Phi\!\left(\frac{\delta_i - \delta_j}{\sqrt{2}\,\sigma}\right)$$

Where:

  • \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution

  • \(\sigma\) is the standard deviation of the latent values (often assumed equal across items)
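The Case V probability can be computed with nothing beyond the standard library, using the error function for the normal CDF; the \(\delta\) values below are illustrative:

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def thurstone_prob(delta_i, delta_j, sigma=1.0):
    """P(item i judged more difficult than item j) under Case V:
    Phi((delta_i - delta_j) / (sqrt(2) * sigma))."""
    return normal_cdf((delta_i - delta_j) / (math.sqrt(2.0) * sigma))

# Item i one unit harder than item j: judged harder ~76% of the time.
p = thurstone_prob(1.5, 0.5)
```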

| Feature | Thurstone Model | Bradley-Terry Model |
| --- | --- | --- |
| Assumption | Normal distribution of latent traits | Logistic distribution of latent traits |
| Probability function | Uses normal CDF | Uses logistic function |
| Interpretation | Psychometric, used in scaling attitudes | Probabilistic, used in sports, ranking |
| Data requirement | Pairwise comparisons | Pairwise comparisons |
| Extensions | Can model variance across items | Easier to extend with covariates |

Given the above comparison, we will use the Bradley-Terry model, with a subset of manually scored questions to get the difficulty ratings for our entire set of generated questions.

Ensuring the quality of scores

Using a graph to track the distance and connectivity between questions is useful to ensure that the pairs that are manually scored are the right choices for the model to generate unbiased scores.

Each question is treated as a node in a graph, and every pairwise comparison between questions forms an edge connecting two nodes. This graph structure serves multiple purposes.

First, it helps track coverage by ensuring that each question is involved in a sufficient number of comparisons. Questions with fewer comparisons—those with a low degree—should be prioritized when selecting new pairs to compare.

Second, the graph must remain connected to avoid isolated questions or disconnected clusters. Algorithms such as Breadth-First Search (BFS) or Union-Find can be used to verify and maintain this connectivity.
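A Union-Find sketch of this connectivity check, on a toy graph of five questions (the comparisons are invented for illustration):

```python
class UnionFind:
    """Disjoint-set structure for tracking connected components of the
    comparison graph (questions = nodes, comparisons = edges)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def components(self):
        return len({self.find(i) for i in range(len(self.parent))})

# Five questions; comparisons (0,1), (1,2), (3,4) leave two components,
# so questions {3, 4} are still isolated from {0, 1, 2}.
uf = UnionFind(5)
for a, b in [(0, 1), (1, 2), (3, 4)]:
    uf.union(a, b)
connected = uf.components() == 1  # False: a bridging comparison is needed
```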

Third, the selection of pairs should aim to be informative. Comparisons between questions that span distant regions of the graph—such as those differing significantly in difficulty—are more valuable than repeated comparisons between similar or adjacent questions.

Finally, the graph should be updated dynamically as more comparisons are collected. This allows for adaptive sampling, where the next pair to compare is chosen based on criteria such as uncertainty in estimated scores. Techniques like entropy or variance can guide this process to focus on the most informative comparisons.
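One possible adaptive-sampling heuristic, assuming current Bradley-Terry scores are available: pick the uncompared pair whose predicted outcome is closest to 50/50, i.e. where a human judgment carries the most information (a maximum-entropy criterion; the scores below are illustrative):

```python
import itertools

def next_pair(theta, compared):
    """Choose the next pair to judge: the uncompared pair whose
    predicted Bradley-Terry outcome is closest to 0.5."""
    best, best_gap = None, float("inf")
    for i, j in itertools.combinations(range(len(theta)), 2):
        if (i, j) in compared or (j, i) in compared:
            continue
        p = theta[i] / (theta[i] + theta[j])  # P(i beats j)
        gap = abs(p - 0.5)
        if gap < best_gap:
            best, best_gap = (i, j), gap
    return best

theta = [0.5, 0.3, 0.15, 0.05]  # current difficulty estimates
compared = {(0, 1)}             # pairs already judged
pair = next_pair(theta, compared)
```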

Understanding the competence of the student

Once you have a list of questions sorted by difficulty, you can apply Item Response Theory (IRT) to estimate a student's competence level, commonly referred to as ability or proficiency in IRT terminology.

IRT is a family of statistical models that describe how the probability of a correct response to a question depends on both the student's latent ability and the characteristics of the question itself. In its simplest form—the one-parameter logistic model (1PL), also known as the Rasch model—the probability that a student with ability level θ answers a question correctly is given by:

$$P(\text{correct}) = \frac{1}{1 + e^{-(\theta - b_i)}}$$

Here, \(\theta\) represents the student's ability, and \(b_i\) is the difficulty of question \(i\). More complex models extend this by adding parameters: the two-parameter logistic model (2PL) introduces a discrimination factor \(a_i\), which reflects how well a question distinguishes between students of different abilities, and the three-parameter logistic model (3PL) adds a guessing factor \(c_i\), accounting for the chance of answering correctly by guessing.

To use IRT in practice, you begin by assigning difficulty scores to your questions, which can be derived from models like Bradley-Terry or empirical performance data. You then administer a subset of these questions to a student and record their responses. By fitting an IRT model to this data, you can estimate the student's ability level θ.
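A minimal sketch of this estimation step under the 1PL model, using a grid-search maximum-likelihood estimate (production systems would use Newton-Raphson or Bayesian EAP estimation instead); the difficulties and responses below are invented:

```python
import math

def p_correct(theta, b):
    """1PL (Rasch) probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(difficulties, responses, grid=None):
    """Maximum-likelihood estimate of ability theta by grid search.

    difficulties: b_i for each administered question
                  (e.g., from Bradley-Terry scores).
    responses: 1 = correct, 0 = incorrect.
    """
    if grid is None:
        grid = [x / 100.0 for x in range(-400, 401)]  # theta in [-4, 4]

    def log_lik(theta):
        return sum(
            math.log(p_correct(theta, b)) if r
            else math.log(1.0 - p_correct(theta, b))
            for b, r in zip(difficulties, responses)
        )

    return max(grid, key=log_lik)

# Student answers the easier questions correctly, misses the harder ones.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [1, 1, 1, 0, 0]
theta_hat = estimate_ability(difficulties, responses)
```

The estimate lands between the hardest question answered correctly and the easiest one missed, as expected.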

This estimated ability can be used to place the student on a proficiency scale, recommend the next most suitable questions, and make comparisons across students. IRT is particularly powerful because it simultaneously accounts for both question difficulty and student ability, enabling adaptive testing and yielding more accurate and individualized assessments than raw scores alone.
