The Memory-Efficient Guide to Blazing Fast CSV Filtering with PHP

Juan Millan
2 min read

When it comes to processing large CSV files, memory consumption can quickly become a bottleneck. Loading a 10-million-line CSV file into memory? That's a recipe for an out-of-memory error. But what if I told you that you could filter such a file using just 2 MB of RAM and process it in a few seconds?

Before diving into performance testing, you'll need a substantial dataset. Here's how to generate a 10-million-line CSV using the Faker library:

composer require fakerphp/faker

Generate the dataset with a small Faker script.
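
The embedded script isn't reproduced here; a minimal sketch along these lines works (the name, age, and gender columns are an assumption, chosen to match the filters used later):

<?php

require __DIR__.'/vendor/autoload.php';

$faker = Faker\Factory::create();
$file = fopen(__DIR__.'/data.csv', 'w');

// Write 10 million rows: name, age, gender
for ($i = 0; $i < 10_000_000; $i++) {
    fputcsv($file, [
        $faker->name(),
        $faker->numberBetween(18, 80),
        $faker->randomElement(['F', 'M']),
    ]);
}

fclose($file);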

Now that we have a dataset, the first step is to read it. The most efficient way is to use a Generator, which is also an Iterator and provides an elegant approach to memory-efficient CSV processing. Generators are perfect for large datasets because they maintain their state between iterations without keeping the entire dataset in memory.

function readLines(): Generator
{
    $file = fopen(__DIR__.'/data.csv', 'r');

    // Yield one Record at a time; only the current row is ever held in memory
    while (($data = fgetcsv($file)) !== false) {
        yield Record::fromArray($data);
    }

    fclose($file);
}
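
readLines() maps each row to a Record object. That class isn't shown in the article; a minimal version, assuming the name, age, gender column order from the generator sketch above, could look like this:

final class Record
{
    public function __construct(
        public readonly string $name,
        public readonly int $age,
        public readonly string $gender,
    ) {
    }

    public static function fromArray(array $data): self
    {
        // Column order matches the generated CSV: name, age, gender
        return new self($data[0], (int) $data[1], $data[2]);
    }
}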

The next key piece is the filters. PHP provides the built-in FilterIterator class, which implements the filtering logic while staying memory-efficient on large datasets.

Let's create some filters:

class AgeFilter extends FilterIterator {

    public function __construct(
        Iterator $iterator,
        private readonly int $age
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->age == $this->age;
    }
}

class GenderFilter extends FilterIterator {

    public function __construct(
        Iterator $iterator,
        private readonly string $gender
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->gender == $this->gender;
    }
}

Next, we need to create a filter chain to process the previous result. To simplify this task, we can leverage the PipelineFilterIterator library.

composer require millancore/pipeline-iterator

use Millancore\PipelineIterator\PipelineFilterIterator;

$iterator = PipelineFilterIterator::create(readLines())
    ->filter(GenderFilter::class, 'F')
    ->filter(AgeFilter::class, 30);

foreach($iterator as $item) {
   echo $item->name.PHP_EOL;
}

/**
* Records: 10,000,000
* File size: 227 MiB
* Time: 7.65 seconds
* Memory: 2 MB
*/

It's that simple! This approach ensures our filters are reusable, testable, and easily combinable to suit your needs.

Bonus

You can also use the built-in filters, such as CallbackFilterIterator or RegexIterator.
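
For one-off conditions, CallbackFilterIterator can wrap the generator directly, with no custom filter class. A quick sketch (the age condition here is just an example):

$adults = new CallbackFilterIterator(
    readLines(),
    fn (Record $record) => $record->age >= 30
);

foreach ($adults as $record) {
    echo $record->name.PHP_EOL;
}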

RegexIterator, on the other hand, can be dropped straight into the pipeline:

 // Regex filter: names containing 'ana'
 ->filter(RegexIterator::class, '/ana/')

To use RegexIterator, the record class must implement __toString() so the filter has a string to apply the pattern to.
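
A one-liner on the Record class sketched earlier covers that, returning whichever field the pattern should match against, the name in this case:

public function __toString(): string
{
    // RegexIterator matches the pattern against this string form of the record
    return $this->name;
}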

Thanks
