The Memory-Efficient Guide to Blazing Fast CSV Filtering with PHP


When it comes to processing large CSV files, memory consumption can quickly become a bottleneck. Loading a 10-million-line CSV file into memory? That's a recipe for an out-of-memory error. But what if I told you that you could filter such a file using just 2 MB of RAM and process it in a few seconds?
Before diving into performance testing, you'll need a substantial dataset. Here's how to generate a 10-million-line CSV using the Faker library:
composer require fakerphp/faker
Dataset generation script with Faker
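The generation script itself is not reproduced in this article, so the following is only a minimal sketch of what such a script might look like. The writeDataset() helper, the name, age, gender column order, the age range, and the F/M gender codes are all assumptions chosen to match the filters used later; Factory::create(), name(), numberBetween() and randomElement() are standard Faker calls.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: writes $rows CSV lines produced by $makeRow to $path.
function writeDataset(string $path, int $rows, callable $makeRow): void
{
    $file = fopen($path, 'w');
    if ($file === false) {
        throw new RuntimeException("Unable to open $path for writing");
    }
    try {
        for ($i = 0; $i < $rows; $i++) {
            // One row is built and written at a time, so memory stays flat.
            fputcsv($file, $makeRow(), ',', '"', '\\');
        }
    } finally {
        fclose($file);
    }
}

// Only runs when Composer's autoloader (and therefore Faker) is installed.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
    $faker = Faker\Factory::create();

    // Assumed column layout: name, age, gender.
    writeDataset(__DIR__ . '/data.csv', 10_000_000, fn (): array => [
        $faker->name(),
        $faker->numberBetween(18, 90),
        $faker->randomElement(['F', 'M']),
    ]);
}
```

Generating ten million rows this way can take a few minutes; lowering the row count is a reasonable way to try the rest of the article first.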
Now that we have a dataset, we need to read it. The most efficient way is to use a Generator, which is also an Iterator and provides an elegant approach to memory-efficient CSV processing. Generators are perfect for large datasets because they maintain their state between iterations without keeping the entire dataset in memory.
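function readLines(): Generator
{
    $file = fopen(__DIR__.'/data.csv', 'r');
    if ($file === false) {
        throw new RuntimeException('Unable to open data.csv');
    }
    try {
        // Only the current row is held in memory at any time
        while (($data = fgetcsv($file)) !== false) {
            yield Record::fromArray($data);
        }
    } finally {
        fclose($file); // also runs if the consumer stops iterating early
    }
}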
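readLines() relies on a Record class that the article does not show. A minimal sketch of such a class could look like the following; the column order (name, age, gender) and the property names are assumptions inferred from the filters and the output loop used later, and __toString() is included because the RegexIterator example in the Bonus section needs it.

```php
<?php
declare(strict_types=1);

// Hypothetical value object; the column order (name, age, gender) is an assumption.
final class Record
{
    public function __construct(
        public readonly string $name,
        public readonly int $age,
        public readonly string $gender,
    ) {
    }

    /** Builds a Record from one row returned by fgetcsv(). */
    public static function fromArray(array $row): self
    {
        // fgetcsv() returns every field as a string, so cast the age explicitly.
        return new self((string) $row[0], (int) $row[1], (string) $row[2]);
    }

    // Lets string-based iterators such as RegexIterator match against the name.
    public function __toString(): string
    {
        return $this->name;
    }
}

$record = Record::fromArray(['Ana Lopez', '30', 'F']);
echo $record->name, ' is ', $record->age, PHP_EOL; // prints "Ana Lopez is 30"
```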
The next key piece is filters. PHP provides the built-in FilterIterator class, an abstract iterator that yields only the elements for which accept() returns true, so filtering stays memory-efficient even with large datasets.
Let's create some filters:
class AgeFilter extends FilterIterator {
    public function __construct(
        Iterator $iterator,
        private readonly int $age
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->age == $this->age;
    }
}

class GenderFilter extends FilterIterator {
    public function __construct(
        Iterator $iterator,
        private readonly string $gender
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->gender == $this->gender;
    }
}
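Because every FilterIterator is itself an Iterator, filters can already be combined by nesting one inside another, without any extra library. The sketch below repeats compact copies of the two filters so that it runs on its own, and feeds them a few in-memory sample objects (made-up data standing in for the CSV records).

```php
<?php
declare(strict_types=1);

// Compact copies of the two filters so this snippet is self-contained.
class AgeFilter extends FilterIterator
{
    public function __construct(Iterator $iterator, private readonly int $age)
    {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->age == $this->age;
    }
}

class GenderFilter extends FilterIterator
{
    public function __construct(Iterator $iterator, private readonly string $gender)
    {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->gender == $this->gender;
    }
}

// Made-up sample rows standing in for Record objects.
$people = new ArrayIterator([
    (object) ['name' => 'Ana', 'age' => 30, 'gender' => 'F'],
    (object) ['name' => 'Luis', 'age' => 30, 'gender' => 'M'],
    (object) ['name' => 'Marta', 'age' => 41, 'gender' => 'F'],
]);

// The inner filter runs first; the outer one only sees what it lets through.
$women30 = new AgeFilter(new GenderFilter($people, 'F'), 30);

$names = [];
foreach ($women30 as $person) {
    $names[] = $person->name;
}
echo implode(', ', $names), PHP_EOL; // prints "Ana"
```

Nested constructors read inside-out and get harder to follow as more filters are added, which is the problem a pipeline-style builder addresses.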
Next, we need to chain these filters so that each one processes the output of the previous one. To simplify this task, we can leverage the PipelineFilterIterator library.
composer require millancore/pipeline-iterator
use Millancore\PipelineIterator\PipelineFilterIterator;
$iterator = PipelineFilterIterator::create(readLines())
    ->filter(GenderFilter::class, 'F')
    ->filter(AgeFilter::class, 30);

foreach ($iterator as $item) {
    echo $item->name.PHP_EOL;
}
/**
* Records: 10.000.000
* File size: 227 MiB
* Time: 7.65 seconds
* Memory: 2 MB
*/
It's that simple! This approach ensures our filters are reusable, testable, and easily combinable to suit your needs.
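To check figures like the ones in the results comment on your own machine, timing and peak memory can be captured around the loop. How the original numbers were measured is not stated, so this is just one plausible approach; the measure() helper and the numbers() generator are hypothetical, with numbers() standing in for the filtered CSV iterator.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the filtered iterator: yields values lazily.
function numbers(int $count): Generator
{
    for ($i = 1; $i <= $count; $i++) {
        yield $i;
    }
}

// Hypothetical helper: consumes any iterable and reports basic statistics.
function measure(iterable $items): array
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    $count = 0;
    foreach ($items as $_) {
        $count++;
    }

    return [
        'count' => $count,
        'seconds' => (hrtime(true) - $start) / 1e9,
        // Peak memory of the whole script so far, in MiB.
        'peak_mib' => memory_get_peak_usage() / 1024 / 1024,
    ];
}

$stats = measure(numbers(1_000_000));
printf(
    "Items: %d, time: %.2f s, peak memory: %.2f MiB\n",
    $stats['count'],
    $stats['seconds'],
    $stats['peak_mib']
);
```

Passing the pipeline iterator to measure() instead of numbers() would report the same three values for the real dataset; the exact figures will depend on hardware and PHP version.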
Bonus
You can also use built-in filters such as CallbackFilterIterator or RegexIterator:
// Keep only records whose name contains 'ana'
->filter(RegexIterator::class, '/ana/')
RegexIterator matches against the string representation of each item, so the record class must implement __toString() for this filter to work.
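As a standalone illustration of the two built-in classes, independent of the pipeline library, the sketch below applies them directly to a few in-memory objects. The Person class and the sample names are made up for the example. Note that '/ana/' is case-sensitive, so it matches "Diana" but not "Ana"; '/ana/i' would match both.

```php
<?php
declare(strict_types=1);

// Hypothetical record type; __toString() is what RegexIterator matches against.
final class Person
{
    public function __construct(
        public readonly string $name,
        public readonly int $age,
    ) {
    }

    public function __toString(): string
    {
        return $this->name;
    }
}

$people = [
    new Person('Ana Lopez', 30),
    new Person('Diana Ruiz', 17),
    new Person('Mark Smith', 45),
];

// RegexIterator casts each item to a string and keeps those matching the pattern.
$byName = new RegexIterator(new ArrayIterator($people), '/ana/');

// CallbackFilterIterator keeps items for which the callback returns true.
$adults = new CallbackFilterIterator(
    new ArrayIterator($people),
    fn (Person $person): bool => $person->age >= 18
);

$matched = array_map('strval', iterator_to_array($byName, false));
$adultNames = array_map('strval', iterator_to_array($adults, false));

echo implode(', ', $matched), PHP_EOL;    // prints "Diana Ruiz"
echo implode(', ', $adultNames), PHP_EOL; // prints "Ana Lopez, Mark Smith"
```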
Thanks
Written by Juan Millan
