Monte Carlo Insights on NS&I Premium Bonds

In Part 1 of this series, we explored the characteristics of NS&I Premium Bonds and created a function to simulate the results of a single monthly prize draw for a given bond holding. In this article, we will build on that foundation by using this function in a Monte Carlo experiment to generate the random data for a statistical analysis of outcomes for various bond holdings and holding periods.

As a reminder, we have the following prize_draw function from Part 1:

def prize_draw(holding_size: int) -> list[tuple[int, int]]:
    """ Simulate single prize draw with holding_size bonds
        returns list of winning (bond, prize) """
    # Check each bond for winning condition (value = 0)
    outcomes: np.ndarray = RNG.choice(WIN_CHOICES, size=holding_size)
    winners: np.ndarray = np.where(outcomes == 0)[0]

    # Randomly sample prizes (without replacement) for each winner
    prizes: np.ndarray = RNG.choice(
        PRIZES, replace=False, size=len(winners))
    # Return list of winning bonds and assigned prizes
    return list(zip(winners.astype(np.int32), prizes))

The output of the function is a list of tuples containing bond and prize pairs. To perform multiple simulations efficiently, we should avoid collecting these individual lists within another iterable, as it complicates the analysis. Instead, using a DataFrame is a more effective approach. By organising the data in a DataFrame with columns for simulation number, bond, and prize in a flat structure, we can leverage the powerful grouping, filtering, and aggregation methods available in DataFrames. This allows us to analyse the data for various scenarios, such as grouping simulations by different holding periods.

A small modification to the prize_draw function is therefore needed to allow the separate np.ndarray objects to be directly returned.

def prize_draw(holding_size: int) -> tuple[np.ndarray, np.ndarray]]:
    ...
    # Return list of winning bonds and assigned prizes
    return winners, prizes

We can then define a simulation function that calls prize_draw with a suitable holding size (100,000) and processes the results into a pandas DataFrame.

def monte_carlo_sim(sim: int) -> pd.DataFrame:
    """ Perform one simulation of a monte carlo experiment """
    winners, prizes = prize_draw(100_000)
    df = pd.DataFrame(zip(winners, prizes), columns=('bond', 'prize'))
    df['sim'] = sim
    return df

The sim argument is included to facilitate running the Monte Carlo experiment in parallel, for example, by using the Pool class from the multiprocessing module. We can create a Pool instance and use it to distribute tasks to multiple worker processes as shown below:

from multiprocessing import Pool

pool = Pool(processes=6)
monte_carlo_results = pool.map(monte_carlo_sim, range(60_000))

It is worth experimenting with the number of processes. More isn't always better, as there is a balance between the overhead of additional processes and the benefits of distributing tasks to those extra threads. I found that 6 processes were the sweet spot on my 8 logical cores. Running 60,000 simulations (which could represent 100 simulations for a holding period of 50 years - about 1% of what would be desirable) took my hardware 11.8 seconds.

The output monte_carlo_results is a list of DataFrames that need to be combined into a single DataFrame using pd.concat. This process took 3.2 seconds. These results suggest that a full-scale experiment could take 25 minutes. Therefore, it seems worthwhile to explore if this can be improved.

Polars is an alternative data library written in Rust that has been shown to have considerable performance benefits over Pandas. Adapting the monte_carlo_sim function to use Polars involves some slightly different syntax.

import polars as pl

def monte_carlo_sim(sim: int) -> pl.DataFrame:
    """ Perform one simulation of a monte carlo experiment """
    winners, prizes = prize_draw(100_000)
    schema = ({'bond': pl.Int32, 'prize': pl.Int32})
    df = pl.DataFrame(zip(winners, prizes), schema=schema)
    df = df.with_columns(pl.lit(sim).alias('sim').cast(pl.Int32))
    return df

Now, the same 60,000 simulations execute in 6.3 seconds, and the DataFrame concatenation takes just 0.08 seconds — nearly a three-fold improvement overall.

All that remains is to run a full-scale 6 million simulation experiment, which will hopefully provide enough data to evaluate any reasonable scenario. This process took just under 11 minutes and produced a DataFrame with 28.6 million rows. The estimated_size method suggests it is taking up 327 MB of RAM. Here is a sample of the final dataframe:

In [24]:df
Out[24]: 
shape: (28_552_482, 3)
┌───────┬───────┬─────────┐
│ bond  ┆ prize ┆ sim     │
│ ---   ┆ ---   ┆ ---     │
│ i32   ┆ i32   ┆ i32     │
╞═══════╪═══════╪═════════╡
│ 8694  ┆ 100   ┆ 0       │
│ 12054 ┆ 50    ┆ 0       │
│ 68250 ┆ 25    ┆ 0       │
│ 25242 ┆ 25    ┆ 1       │
│ 31515 ┆ 100   ┆ 1       │
│ …     ┆ …     ┆ …       │
│ 45768 ┆ 50    ┆ 5999999 │
│ 47988 ┆ 25    ┆ 5999999 │
│ 52533 ┆ 25    ┆ 5999999 │
│ 67160 ┆ 25    ┆ 5999999 │
│ 82412 ┆ 100   ┆ 5999999 │
└───────┴───────┴─────────┘

Having generated this valuable data, the next step is to save it for future analysis. Polars DataFrames can be serialised and saved in various formats. Research indicates that the Parquet format is more efficient for storage of large DataFrames compared to Pickle. Therefore, we will save it in Parquet format.

df.write_parquet('premium_bond_6M_sim.parquet',compression_level=22)

This generates a 108 MB output file. When we want to load this file into a DataFrame for analysis, we can use the pl.read_parquet function.

Conclusion

We have now taken the prize_draw function created in Part 1, adapted it for a large Monte Carlo experiment using parallel processing, loaded the resulting data into a Polars DataFrame, and saved it for future analysis. The final Python code is available in the premium_bond_sim repository on my GitHub page. When NS&I updates the winning odds and prize allocations, these can be updated, and the experiment can be rerun.

In the final article of the series, we will analyse this data to understand the distribution of outcomes in various scenarios.

Monte Carlo Simulation of NS&I Premium Bonds: Crunching the numbers

Conclusion

Subscribe to my newsletter

Jason Shiers

Jason Shiers