Learning Objectives

Complex neural network structures
How a convolutional neural network functions
Using neural networks for optical character recognition
Integrating neural networks into red team tooling

Introduction

In today’s task, we’ll get our first look at how red teams can use ML to help them attack systems. But before we can start attacking the admin portal, we’ll need to expand on some of the ML concepts taught in the previous tasks. Let’s dive in!

Convolutional Neural Networks

In the previous tasks, we talked about neural network structures. However, most of these structures were fairly basic in nature. Today, we will cover an interesting structure called a convolutional neural network (CNN).

CNNs are ML structures that can extract the features used to train a neural network. In the previous task, we used the garbage-in, garbage-out principle to explain why our inputs need good features: good features are what make the output from the neural network accurate. But what if we could have the neural network select the important features itself? This is where CNNs come into play!

In essence, CNNs are normal neural networks that simply have the feature-extraction process as part of the network itself. This time, the maths also involves some linear algebra. Again, we won’t dive too deep into the maths here to keep things simple.

We can divide our CNN into three main components:

Feature extraction
Fully connected layers
Classification

We’ve actually already covered the last two components in the previous tasks as a simple neural network structure, so our main focus for today will be on the feature-extraction component.

Feature Extraction

CNNs are often used to classify images. While we can use them with almost any data type, images are the simplest for explaining how a CNN works. This is the CAPTCHA that we are trying to crack:

[Image: a single CAPTCHA]

Since we’ll be using a CNN to crack CAPTCHAs, let’s use a single letter in the CAPTCHA as our image:

[Image: a single CAPTCHA letter]

Image Representation

The first question to answer is how the CNN actually perceives this image. The simplest way for a computer to perceive an image is as a 2D array of pixels. A pixel is the smallest area that can be measured in an image. Together, these pixels are what create the image. A pixel’s value describes the colour that you are seeing. There are two popular formats for pixel values:

RGB: The pixel is represented by three numbers from 0 to 255. These three numbers describe the intensity of the red, green, and blue colours of the pixel.
Greyscale: The pixel is represented by a single number from 0 to 255. 0 means the pixel is fully black, and 255 means the pixel is fully white. Any value in between is a shade of grey.
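
To make the two formats concrete, here is a minimal sketch using NumPy. The 2x2 image is made up for illustration, and the greyscale weights are an assumption (the commonly used ITU-R BT.601 luma coefficients); the task itself does not say which conversion is used.

```python
import numpy as np

# A tiny, made-up 2x2 RGB image: each pixel holds (red, green, blue) values from 0 to 255.
rgb_image = np.array(
    [
        [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
        [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
    ],
    dtype=np.uint8,
)


def to_greyscale(rgb):
    """Collapse a (height, width, 3) RGB array into a (height, width) greyscale array."""
    # Assumed weights: ITU-R BT.601 luma coefficients for red, green, and blue.
    weights = np.array([0.299, 0.587, 0.114])
    grey = rgb.astype(np.float64) @ weights
    return np.round(grey).astype(np.uint8)


grey_image = to_greyscale(rgb_image)
print(rgb_image.shape)   # (2, 2, 3): height, width, three colour channels
print(grey_image.shape)  # (2, 2): height, width, one value per pixel
print(grey_image)
```

The white pixel stays at 255 and the pure colours become different shades of grey, which is why a greyscale image needs only one number per pixel.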

Convolution

There are two steps in the CNN feature extraction process that are performed as many times as needed. The first step is convolution. The maths is about to get slightly hectic here, so take a deep breath and let’s dive in!

During the convolution step of the CNN’s feature extraction, we want to reduce the size of the input. Images often have several thousand pixels, and while we can train a neural network to consider all of these pixels, it will be incredibly slow without really adding any additional accuracy. Therefore, we perform convolution to “summarise” the image. To do this, we move a kernel matrix across the entire image. The kernel is a much smaller 2D array of weights: at each position it covers one patch of the image, and that patch is reduced to a single summary value. The kernel slides across the height and width of the image, and the resulting values form a summary image.
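
The sliding-kernel idea can be sketched in a few lines of NumPy. This is an illustrative version only: it assumes no padding and a step of one pixel, and the 4x4 image and 2x2 kernel values are made up. Strictly speaking it computes a cross-correlation, which is what most CNN libraries call convolution.

```python
import numpy as np


def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the summary image."""
    image_h, image_w = image.shape
    kernel_h, kernel_w = kernel.shape
    out_h = image_h - kernel_h + 1
    out_w = image_w - kernel_w + 1
    summary = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # The patch of the image currently covered by the kernel.
            patch = image[row:row + kernel_h, col:col + kernel_w]
            # Multiply element by element and add up: one summary value per position.
            summary[row, col] = np.sum(patch * kernel)
    return summary


image = np.arange(16, dtype=float).reshape(4, 4)   # made-up 4x4 greyscale image
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # made-up 2x2 kernel
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3): smaller than the 4x4 input
```

In a real CNN the kernel values are not chosen by hand; they are learned during training, which is how the network ends up selecting its own features.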

Pooling

The second step performed in the CNN feature extraction process is pooling. Similar to convolution, the pooling step aims to further summarise the data using a statistical method, such as taking the maximum or the average of each small region.
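
As a rough sketch, here is max pooling, one common choice of statistical method (average pooling is another). The 2x2 window size and the input values are assumptions made for the example.

```python
import numpy as np


def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size block."""
    height, width = feature_map.shape
    out_h, out_w = height // size, width // size
    # Drop any leftover rows or columns that do not fill a whole block.
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Split the array into blocks, then take the maximum inside each block.
    blocks = trimmed.reshape(out_h, size, out_w, size)
    return blocks.max(axis=(1, 3))


feature_map = np.array(
    [
        [1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 6, 2, 2],
    ]
)
pooled = max_pool2d(feature_map)
print(pooled)
# [[4 5]
#  [6 7]]
```

Each 2x2 block of the input is replaced by a single number, so the 4x4 input shrinks to 2x2 while keeping its strongest responses.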

Fully Connected Layers

Now that we have our features, the next stage is very similar to the basic neural network structure that we used back in the introduction to machine learning task. We’ll create a simple neural network that takes inputs (the summary slices from our last pooling layer), runs them through the hidden layers, and finally provides an output. This is called the fully connected layers portion of the CNN, as this is the part of the network where each node is connected to every node in the next layer.
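
A minimal sketch of this stage, assuming a single hidden layer with a ReLU activation. The layer sizes, the random weights, and the shape of the pooled input are all hypothetical; a trained network would have learned its weights.

```python
import numpy as np


def relu(values):
    """Common activation function: negative values become zero."""
    return np.maximum(values, 0.0)


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every node in the next layer."""
    return inputs @ weights + biases


rng = np.random.default_rng(seed=0)

# Hypothetical output of the last pooling layer: a 2x2 map with 3 summary slices.
pooled_slices = rng.random((2, 2, 3))
features = pooled_slices.reshape(-1)  # flatten into 12 input values

# Randomly initialised weights, for illustration only.
hidden_weights = rng.normal(size=(12, 8))
hidden_biases = np.zeros(8)
output_weights = rng.normal(size=(8, 10))
output_biases = np.zeros(10)

hidden = relu(dense(features, hidden_weights, hidden_biases))
outputs = dense(hidden, output_weights, output_biases)
print(features.shape, hidden.shape, outputs.shape)  # (12,) (8,) (10,)
```

The weight matrix of shape (12, 8) is what "fully connected" means in practice: it holds one weight for every pairing of an input value with a hidden node.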

Classification

Lastly, we need to talk about the classification portion of the CNN. This is the output layer from the fully connected layers portion. In the previous tasks, our neural networks only had one output to determine whether or not a toy was defective or whether or not an email was a phishing email. However, to crack CAPTCHAs, a simple binary output won’t do, as we need the network to tell us what the character (and, later, the sequence of characters) is. Therefore, we’ll need an output node for each potential character. Our CAPTCHA example only contains numbers, not letters. So, we need an output node for 0 to 9, totalling 10 output nodes.

Having multiple output nodes creates a new interesting feature for our neural network. Instead of simply getting one answer, all 10 outputs will now have a decimal value between 0 and 1. We’ll then summarise this by taking the highest value as the answer from the network. However, nothing is stopping us from reviewing the top 5 answers, for instance. This can help us identify areas where our neural network might be having issues.
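
Here is a small sketch of reading the output nodes for the ten digits. It assumes a softmax is used to squash the raw outputs into values between 0 and 1 (the task does not name the function), and the raw scores are made up.

```python
import numpy as np

LABELS = "0123456789"  # one output node per digit


def softmax(scores):
    """Turn raw output scores into values between 0 and 1 that add up to 1."""
    shifted = scores - np.max(scores)  # subtract the maximum for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


def top_k(probabilities, k=5):
    """Return the k most likely labels with their values, best first."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(LABELS[index], float(probabilities[index])) for index in order]


# Made-up raw scores from the ten output nodes.
raw_scores = np.array([0.1, 2.5, 0.3, 0.2, 0.0, 0.4, 0.1, 2.1, 0.2, 0.3])
probabilities = softmax(raw_scores)

best_label, best_value = top_k(probabilities, k=1)[0]
print(best_label)  # 1
for label, value in top_k(probabilities, k=5):
    print(label, round(value, 3))
```

In this made-up example the network answers 1, but 7 comes a close second, which is exactly the kind of near miss that reviewing the top few answers reveals.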

For example, if a CAPTCHA contained letters, there could be a little confusion between the characters M and N, as they look fairly similar. Reviewing the output from the top 5 nodes will show us that this might be a problem. While we may not be able to solve this confusion directly, we can use it to our advantage and increase our brute force accuracy: we simply discard any CAPTCHA whose predicted text contains an M or N and request another, avoiding the problem entirely!
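
The discard trick can be sketched as a simple check on the predicted text. The function name and the default set of confusable characters are hypothetical, chosen to match the M and N example.

```python
def should_discard(predicted_text, confusable_characters=frozenset("MN")):
    """Return True when the prediction contains a character the model tends to mix up."""
    return any(character in confusable_characters for character in predicted_text)


print(should_discard("4M7Q"))  # True: contains an M, so request a new CAPTCHA
print(should_discard("4827"))  # False: safe to submit
```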

Training our CNN

Now that we’ve covered the basics, let’s take a look at what will be required to train and use our own CNN to crack the CAPTCHAs. Please note that the following steps have already been performed for you; the steps you need to perform yourself begin in the Hosting the Model section. However, understanding how training works is important, so please follow along and attempt the commands given.

We will use Attention OCR as our CNN model. This CNN structure has a lot more going on, such as LSTMs and sliding windows, but we won’t dive deeper into these in this instance. The only thing to note is that we have a sliding window, which allows us to read one character at a time instead of having to solve the entire CAPTCHA in one go.

We’ll be making use of the same steps followed to create CAPTCHA22, which is a Python Pip package that can be used to host a CAPTCHA-cracking server. If you’re interested in understanding how this works, you can have a read here. While you can try to run all this software yourself, most of the ML component runs on a very specific version of TensorFlow. Therefore, making use of the VM attached to the task is recommended.

In order to crack CAPTCHAs, we will have to go through the following steps:

Gather CAPTCHAs so we can create labelled data
Label the CAPTCHAs to use in a supervised learning model
Train our CAPTCHA-cracking CNN
Verify and test our CAPTCHA-cracking CNN
Export and host the trained model so we can feed it CAPTCHAs to solve
Create and execute a brute force script that will receive the CAPTCHA, pass it on to be solved, and then run the brute force attack
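
The last step in the list above can be sketched as a loop. This is not the room's bruteforce.py: the function names, the result strings, and the retry limit are all hypothetical, and the three helpers are passed in as plain functions so the sketch runs without any network access.

```python
from typing import Callable, Iterable, Optional


def brute_force(
    passwords: Iterable[str],
    fetch_captcha: Callable[[], bytes],
    solve_captcha: Callable[[bytes], str],
    attempt_login: Callable[[str, str], str],
    max_retries: int = 5,
) -> Optional[str]:
    """Try each password in turn; return the first one that works, or None."""
    for password in passwords:
        for _ in range(max_retries):
            captcha_image = fetch_captcha()              # receive the CAPTCHA
            captcha_text = solve_captcha(captcha_image)  # pass it on to be solved
            result = attempt_login(password, captcha_text)
            if result == "success":
                return password
            if result == "bad_password":
                break  # the CAPTCHA was accepted, so move on to the next password
            # Otherwise the CAPTCHA was wrong: retry this password with a new one.
    return None


# Stand-ins so the sketch is runnable; a real script would talk to the
# portal and to the hosted model instead.
def fake_fetch() -> bytes:
    return b"1234"


def fake_solve(image: bytes) -> str:
    return image.decode()


def fake_login(password: str, captcha_text: str) -> str:
    if captcha_text != "1234":
        return "bad_captcha"
    return "success" if password == "hunter2" else "bad_password"


found = brute_force(["letmein", "hunter2", "admin"], fake_fetch, fake_solve, fake_login)
print(found)  # hunter2
```

Keeping the CAPTCHA retry separate from the password loop matters: a wrong CAPTCHA answer says nothing about the password, so the same password is tried again rather than skipped.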

What key process of training a neural network is taken care of by using a CNN?

Answer: feature extraction

What is the name of the process used in the CNN to extract the features?

Answer: convolution

What is the name of the process used to reduce the features down?

Answer: pooling

What off-the-shelf CNN did we use to train a CAPTCHA-cracking OCR model?

Answer: attention ocr

What is the password that McGreedy set on the HQ Admin portal?
```
cd ~/Desktop/bruteforcer && python3 bruteforce.py
```
Answer: ReallyNotGonnaGuessThis

What is the value of the flag that you receive when you successfully authenticate to the HQ Admin portal?

Answer: THM{Captcha.Can't.Hold.Me.Back}

Thanks for reading!

TryHackMe’s Advent of Cyber 2023: Day 16 Writeup

Anuj Singh Chauhan
