Creating a Privacy-Focused Multimodal Recognition System on Edge Devices for Human Activity Recognition


Introduction
As part of my Computer Science dissertation, I'm developing a multi-resident, multimodal Human Activity Recognition (HAR) system tailored for privacy-sensitive environments like shared kitchens. This system is designed to operate on resource-constrained edge devices, using a mix of low-resolution audio, IMU sensors, and RFID tags.
You can find a more detailed literature review and the motivation behind the project at this link.
In this first installment of the series, I'll walk you through the prototype implementation of one of the most challenging features: audio-based activity recognition on a Raspberry Pi.
Prototype Architecture
The feature prototype focuses on processing environmental audio to predict activity using a CNN model trained on the Kitchen20 dataset. Kitchen20 is a rich dataset for environmental audio in kitchen settings.
Key components:
Raspberry Pi 4 Model B (1GB RAM)
- Although this model of the Raspberry Pi is available with more RAM, I chose the lowest-memory version so the project is designed around a genuinely resource-constrained device, aiming for efficient and cost-effective HAR solutions.
USB omnidirectional microphone
ONNX exported model
Google TTS for audio output
Model Training with Kitchen20
The original PyTorch implementation of Kitchen20 relies on outdated torchaudio APIs, as the following excerpt shows:
audio_set = Kitchen20(
    root='/media/data/dataest/kitchen20/',
    folds=[1, 2, 3, 4],
    transforms=transforms.Compose([  # transforms.Compose is no longer available in torchaudio
        transforms.RandomStretch(1.25),
        transforms.Scale(2 ** 16 / 2),
        transforms.Pad(input_length // 2),
        transforms.RandomCrop(input_length),
        transforms.RandomOpposite()]),
    overwrite=False,
    use_bc_learning=False,
    audio_rate=audio_rate)

audio_loader = DataLoader(audio_set, batch_size=2,
                          shuffle=True, num_workers=4)
The Compose function for audio transformations was only available in the very first release of torchaudio; it was removed as a breaking change in the next release, version 0.3.0.
I re-implemented the dataset accessor using modern torch and torchaudio versions, which you can find in the public GitHub repository: Kitchen20 PyTorch Accessor (Updated).
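For reference, here is a minimal sketch of how the old Compose chain can be expressed with current torch and torchaudio APIs. The class name, constructor arguments, and padding/cropping logic are simplified stand-ins for illustration, not the actual code in the repository.

```python
import torch
import torchaudio
from torch.utils.data import Dataset


class Kitchen20Audio(Dataset):
    """Simplified stand-in for the updated Kitchen20 accessor."""

    def __init__(self, file_paths, labels, audio_rate=16000, input_length=64000):
        self.file_paths = file_paths
        self.labels = labels
        self.audio_rate = audio_rate
        self.input_length = input_length

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        waveform, sr = torchaudio.load(self.file_paths[idx])
        # Resample with the modern torchaudio transform instead of the removed Compose chain
        if sr != self.audio_rate:
            waveform = torchaudio.transforms.Resample(sr, self.audio_rate)(waveform)
        # Pad or randomly crop to a fixed input length (replaces Pad + RandomCrop)
        if waveform.shape[1] < self.input_length:
            waveform = torch.nn.functional.pad(
                waveform, (0, self.input_length - waveform.shape[1])
            )
        else:
            start = torch.randint(0, waveform.shape[1] - self.input_length + 1, (1,)).item()
            waveform = waveform[:, start:start + self.input_length]
        return waveform, self.labels[idx]
```

A DataLoader can then wrap this dataset exactly as in the original snippet above.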
I trained a four-layer CNN, then exported the model to ONNX format for lightweight inference on the edge device (a minimal export sketch follows the results below). The current results are:
Approximately 30% accuracy on the training data: the model is learning to some extent, but there is clearly significant room for improvement; accuracy this low even on the training set suggests the model is not yet capturing the relevant patterns in the audio.
Plans for improvement include integrating IMU (Inertial Measurement Unit) and RFID (Radio Frequency Identification) modalities, as well as refining preprocessing: these additional data streams should give the model more context and features to learn from, potentially leading to better accuracy.
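To make the export step concrete, below is a minimal sketch of exporting a small CNN to ONNX with torch.onnx.export. The layer sizes, input shape, and file name are assumptions for illustration only, not the project's actual architecture.

```python
import torch
import torch.nn as nn

# A minimal four-layer CNN stand-in; the real architecture lives in the project repo.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 20),  # Kitchen20 has 20 classes
)
model.eval()

# Dummy input shaped like a single-channel spectrogram feature map
# (batch, channels, n_mels, time_frames) -- the exact shape is an assumption.
dummy_input = torch.randn(1, 1, 64, 126)

torch.onnx.export(
    model, dummy_input, "kitchen20_cnn.onnx",
    input_names=["features"], output_names=["logits"],
    opset_version=13,
    dynamic_axes={"features": {0: "batch"}},
)
```

The resulting .onnx file is what gets copied to the Raspberry Pi and loaded with onnxruntime in the inference loop described later.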
Preprocessing Pipeline
The audio preprocessing pipeline includes:
Sample rate adjustment
```python
import torch
import torchaudio


def preprocess_audio(
    waveform,
    original_sample_rate,
    target_sample_rate=16000,
    target_length=4,
):
    """
    Preprocess the audio waveform to have a consistent length and sample rate.

    Args:
        waveform (Tensor): The audio waveform.
        original_sample_rate (int): The original sample rate of the audio.
        target_sample_rate (int, optional): The target sample rate. Defaults to 16000.
        target_length (int, optional): Target length in seconds. Defaults to 4.

    Returns:
        Tensor: The preprocessed audio waveform.
    """
    # Convert to mono if stereo
    if waveform.size(0) > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Resample the audio if the sample rate is different
    if original_sample_rate != target_sample_rate:
        waveform = torchaudio.transforms.Resample(
            orig_freq=original_sample_rate, new_freq=target_sample_rate
        )(waveform)

    # Trim or pad the waveform to the target number of samples
    # (target_length is in seconds, so derive the sample count from the sample rate)
    num_samples = target_sample_rate * target_length
    current_length = waveform.shape[1]
    if current_length > num_samples:
        waveform = waveform[:, :num_samples]
    else:
        waveform = torch.nn.functional.pad(waveform, (0, num_samples - current_length))

    return waveform
```
Feature extraction using PyTorch (as opposed to Moreaux et al. (2019), who used the Librosa library):
MFCC
MelSpectrogram
```python
if feature_type == 'melspectrogram':
    self.transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
        f_min=f_min,
        f_max=f_max,
    )
    self.db_transform = torchaudio.transforms.AmplitudeToDB()
elif feature_type == 'mfcc':
    self.transform = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=n_mfcc,
        melkwargs={
            'n_fft': n_fft,
            'hop_length': hop_length,
            'n_mels': n_mels,
            'f_min': f_min,
            'f_max': f_max,
        }
    )
    self.db_transform = None
```
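For context, this is roughly how the selected transform is applied to a preprocessed waveform; `extractor`, the example file path, and the batching step are illustrative stand-ins rather than the repository's exact code.

```python
# Illustrative usage: preprocess a clip, then apply the selected transform.
# `extractor` stands for an instance of the class holding self.transform above.
waveform, sr = torchaudio.load("examples/frying.wav")   # path is only an example
waveform = preprocess_audio(waveform, sr)                # mono, 16 kHz, 4 s

features = extractor.transform(waveform)                 # (1, n_mels or n_mfcc, time)
if extractor.db_transform is not None:                   # MelSpectrogram branch only
    features = extractor.db_transform(features)          # convert amplitude to dB
features = features.unsqueeze(0)                         # add batch dim for the CNN
```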
This preprocessing service is shared between the cloud layer (for training) and the edge layer (for prediction) to ensure consistent input formats. Work being considered to improve the preprocessing pipeline includes:
Refactor the existing codebase to function as a standalone service. This involves decoupling the preprocessing logic from the main application, allowing it to operate independently.
Extend the current preprocessing capabilities by adding support for IMU and RFID data. This will involve developing new transformation pipelines tailored to the specific characteristics of IMU and RFID data.
Conduct a thorough parameter-tuning process to optimize the system's performance. This is probably the most important improvement needed during the project's development phase. Due to time limits, the feature prototype was built quickly to show that the ONNX model could run on the edge device in a kitchen environment, without paying much attention to the preprocessing details. The first goal was to reproduce the work of Moreaux et al. (2019) with minimal changes so it would run on modern libraries, mainly fixing compile and runtime errors. I believe this is why the model's accuracy and confidence are low, and I need to explore best practices in audio preprocessing for CNNs to improve this key part of the project.
Edge Device Inference Loop
The Raspberry Pi (edge device layer) performs the following in a loop (a minimal sketch of the loop follows the list):
Records 5 seconds of audio
Runs the shared preprocessing pipeline
Feeds the features to the ONNX model using onnxruntime
Outputs a prediction label and a confidence score
Plays the output as audible feedback using gTTS + playsound
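Putting those steps together, here is a rough sketch of the loop as I would write it with onnxruntime, sounddevice for recording, and gTTS + playsound for feedback. The label list, file names, and feature parameters are placeholders; the actual prototype code lives in the repository.

```python
import numpy as np
import onnxruntime as ort
import sounddevice as sd
import torch
import torchaudio
from gtts import gTTS
from playsound import playsound

SAMPLE_RATE = 16000
RECORD_SECONDS = 5
LABELS = ["boiling", "chopping", "frying"]  # placeholder: replace with the full 20-class list

session = ort.InferenceSession("kitchen20_cnn.onnx")
input_name = session.get_inputs()[0].name

mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()

while True:
    # 1. Record audio from the USB microphone
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    waveform = torch.from_numpy(audio.T).float()

    # 2. Shared preprocessing (defined earlier) + feature extraction
    waveform = preprocess_audio(waveform, SAMPLE_RATE)
    features = to_db(mel(waveform)).unsqueeze(0).numpy()

    # 3. ONNX Runtime inference
    logits = session.run(None, {input_name: features})[0]
    probs = np.exp(logits) / np.exp(logits).sum()
    label, confidence = LABELS[int(probs.argmax())], float(probs.max())

    # 4. Audible feedback via gTTS + playsound
    gTTS(text=f"{label}, confidence {confidence:.0%}").save("prediction.mp3")
    playsound("prediction.mp3")
```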
The following short video showcases the result of the prototype.
This setup confirms the feasibility of real-time HAR prediction on low-power hardware, which is crucial for assisted-living scenarios where privacy and efficiency are essential. The improvements to be made on the edge device layer include:
Replace the full operating system with a lightweight Linux distribution like Alpine Linux or Raspberry Pi OS Lite. This change will use fewer system resources, speed up boot times, and improve overall performance by cutting down on unnecessary background processes and services.
Automate updating the model from the cloud using SSH and ONNX replacement. This will keep the device running the latest model version, improving prediction accuracy and reliability. By securely connecting to the cloud, updates can happen automatically without needing manual work, reducing downtime and maintenance efforts.
Conclusion
Developing a privacy-focused Human Activity Recognition (HAR) system on resource-limited edge devices is a promising way to improve privacy and efficiency in shared spaces. By using low-resolution audio, IMU sensors, and RFID tags, this system aims to recognize activities accurately while keeping user privacy intact. The prototype on a Raspberry Pi shows that real-time HAR prediction is possible on low-power hardware. Although the current model's accuracy is still low, ongoing improvements, such as adding more modalities and refining preprocessing, should boost performance. Future work will focus on optimizing the system design, structuring datasets, and automating deployments to edge devices to make the system practical in real-world situations. This project is a step toward energy-efficient and privacy-aware HAR solutions for edge devices.
Code & Resources
GitHub Repo: fp-audio-service - Modern PyTorch accessor for the Kitchen20 dataset.
Kitchen20 Dataset - Moreaux et al. (2019) original Kitchen20 implementation.
Follow the Series
This post is part of a series documenting my dissertation project on Energy-Efficient and Privacy-Aware Multimodal HAR for Edge Devices. Future posts will cover:
System Design & Architecture: Tiers, sensors, and design principles
Dataset Structuring: From Kitchen20 to custom multi-resident datasets
Edge Device Deployment: Optimization, updates, and latency analysis
Final Evaluation & Results: Precision, recall, and real-world usability
Subscribe to typo.hashnode.dev and follow along!