When processing large CSV files, memory consumption can quickly become a bottleneck. Loading a 10-million-line CSV file into memory? That's a recipe for an out-of-memory error. But what if I told you that you could filter such a file using just 2 MB of RAM and process it in a few seconds?

Before diving into performance testing, you'll need a substantial dataset. You can generate a 10-million-line CSV with the Faker library; start by installing it:

composer require fakerphp/faker
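
The generation script itself is not shown here, so the following is only a minimal sketch of one way to write it. The helper names `writeCsv()` and `generateFakeData()` are hypothetical, and the column layout (name, age, gender) is an assumption inferred from the properties the filters read later on. Rows are written one at a time, so generating the file does not need much memory either.

```php
<?php
declare(strict_types=1);

/**
 * Write $rows CSV rows to $path, one row at a time.
 * $makeRow is any callable that returns a single row as an array.
 * (Hypothetical helper, not part of the article's code.)
 */
function writeCsv(string $path, int $rows, callable $makeRow): void
{
    $file = fopen($path, 'w');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    for ($i = 0; $i < $rows; $i++) {
        // Explicit delimiter, enclosure and escape arguments.
        fputcsv($file, $makeRow(), ',', '"', '\\');
    }

    fclose($file);
}

/**
 * Fill $path with fake people using Faker.
 * Assumed column order: name, age, gender.
 * Requires vendor/autoload.php to be loaded before this is called.
 */
function generateFakeData(string $path, int $rows = 10_000_000): void
{
    $faker = Faker\Factory::create();

    writeCsv($path, $rows, fn (): array => [
        $faker->name(),
        $faker->numberBetween(18, 80),
        $faker->randomElement(['F', 'M']),
    ]);
}
```

After `require __DIR__.'/vendor/autoload.php';`, calling `generateFakeData(__DIR__.'/data.csv')` would produce the file that the rest of the article reads.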

Now that we have a dataset, we start by reading it. The most efficient way is to use a Generator, which is also an Iterator and provides an elegant approach to memory-efficient CSV processing. Generators are perfect for processing large datasets because they maintain their state between iterations without keeping the entire dataset in memory.

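function readLines(): Generator
{
    $file = fopen(__DIR__.'/data.csv', 'r');
    if ($file === false) {
        throw new RuntimeException('Unable to open data.csv');
    }

    try {
        // Read and yield one row at a time; only the current row is held in memory.
        while (($data = fgetcsv($file)) !== false) {
            yield Record::fromArray($data);
        }
    } finally {
        // Runs even if the consumer stops iterating early.
        fclose($file);
    }
}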
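
The `Record` class is used above but not shown. As a rough sketch, it could look like the following; the column order (name, age, gender) is an assumption, and only `fromArray()` and the `name`, `age` and `gender` properties are actually taken from the article's code.

```php
<?php
declare(strict_types=1);

// Hypothetical Record class. The article calls Record::fromArray() and reads
// the name, age and gender properties, but does not show the class itself.
final class Record
{
    public function __construct(
        public readonly string $name,
        public readonly int $age,
        public readonly string $gender,
    ) {
    }

    // Assumed column order in the CSV: name, age, gender.
    public static function fromArray(array $row): self
    {
        return new self((string) $row[0], (int) $row[1], (string) $row[2]);
    }

    // Lets string-based iterators such as RegexIterator match against the name.
    public function __toString(): string
    {
        return $this->name;
    }
}
```

Casting the age to `int` in `fromArray()` is a design choice: `fgetcsv()` returns every field as a string, and converting once at the boundary keeps the filters free of type juggling.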

The next key piece is filters. PHP provides the built-in FilterIterator class, an abstract iterator that skips any item for which its accept() method returns false. Because it evaluates items one at a time as they are requested, it stays memory-efficient when working with large datasets.

Let's create some filters:

class AgeFilter extends FilterIterator {

    public function __construct(
        Iterator $iterator,
        private readonly int $age
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->age == $this->age;
    }
}

class GenderFilter extends FilterIterator {

    public function __construct(
        Iterator $iterator,
        private readonly string $gender
    ) {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->gender == $this->gender;
    }
}
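
Since every FilterIterator is itself an Iterator, filters can be combined by hand, simply by wrapping one in another. The sketch below repeats the two filters in compact form so that it runs on its own; `sampleRecords()` is a hypothetical stand-in for `readLines()` that yields plain objects instead of `Record` instances.

```php
<?php
declare(strict_types=1);

// AgeFilter and GenderFilter repeated in compact form so the snippet is self-contained.
class AgeFilter extends FilterIterator
{
    public function __construct(Iterator $iterator, private readonly int $age)
    {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->age == $this->age;
    }
}

class GenderFilter extends FilterIterator
{
    public function __construct(Iterator $iterator, private readonly string $gender)
    {
        parent::__construct($iterator);
    }

    public function accept(): bool
    {
        return $this->current()->gender == $this->gender;
    }
}

// Hypothetical stand-in for readLines().
function sampleRecords(): Generator
{
    yield (object) ['name' => 'Ana', 'age' => 30, 'gender' => 'F'];
    yield (object) ['name' => 'Luis', 'age' => 30, 'gender' => 'M'];
    yield (object) ['name' => 'Eva', 'age' => 25, 'gender' => 'F'];
}

// The outer filter only ever sees the items the inner filter accepted.
$chain = new AgeFilter(new GenderFilter(sampleRecords(), 'F'), 30);

$names = [];
foreach ($chain as $record) {
    $names[] = $record->name;
}

print_r($names); // Only Ana is both female and 30.
```

This works, but the nesting reads inside-out and gets harder to follow with every extra filter.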

Next, we chain the filters so that each one processes the output of the previous one. To simplify this, we can use the PipelineFilterIterator library:

composer require millancore/pipeline-iterator

use Millancore\PipelineIterator\PipelineFilterIterator;

$iterator = PipelineFilterIterator::create(readLines())
    ->filter(GenderFilter::class, 'F')
    ->filter(AgeFilter::class, 30);

foreach ($iterator as $item) {
    echo $item->name.PHP_EOL;
}

/**
* Records: 10.000.000
* File size: 227 MiB
* Time: 7.65 seconds
* Memory: 2 MB
*/
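
The article does not say how these figures were collected. One plausible way to take comparable measurements on your own machine is sketched below: `measure()` is a hypothetical helper that wraps any piece of work and reports elapsed time and peak memory using PHP's built-in `hrtime()` and `memory_get_peak_usage()`. The `numbers()` generator is only a small stand-in workload.

```php
<?php
declare(strict_types=1);

/**
 * Run $work and report wall-clock time and peak memory usage.
 * (Hypothetical helper, not part of the article's code.)
 */
function measure(callable $work): array
{
    $start = hrtime(true); // Monotonic clock, in nanoseconds.
    $work();

    return [
        'seconds' => (hrtime(true) - $start) / 1e9,
        'peak_mb' => memory_get_peak_usage() / 1024 / 1024,
    ];
}

// Stand-in workload: a generator consumed one value at a time.
function numbers(int $count): Generator
{
    for ($i = 0; $i < $count; $i++) {
        yield $i;
    }
}

$stats = measure(function (): void {
    $sum = 0;
    foreach (numbers(100_000) as $n) {
        $sum += $n;
    }
});

printf("Time: %.2f seconds\nMemory: %.1f MB\n", $stats['seconds'], $stats['peak_mb']);
```

To measure the real pipeline, the closure would contain the `foreach` over the filter chain instead of the stand-in loop.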

It's that simple! This approach ensures our filters are reusable, testable, and easily combinable to suit your needs.

Bonus

You can also use built-in filters such as CallbackFilterIterator or RegexIterator:

 # Regex filter names containing 'ana'
 ->filter(RegexIterator::class, '/ana/')

RegexIterator matches against the string value of each item, so the Record class must implement __toString() (for example, returning the name) for the filter to apply.
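
To see how the two built-in filters behave on their own, outside the pipeline library, here is a small sketch using plain SPL. The `Person` class and `people()` generator are hypothetical stand-ins for `Record` and `readLines()`.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for Record, with __toString() so RegexIterator can match on it.
final class Person
{
    public function __construct(
        public readonly string $name,
        public readonly int $age,
    ) {
    }

    public function __toString(): string
    {
        return $this->name;
    }
}

function people(): Generator
{
    yield new Person('Mariana', 30);
    yield new Person('Pedro', 41);
    yield new Person('Diana', 19);
}

// CallbackFilterIterator keeps the items for which the callback returns true.
$olderThan25 = new CallbackFilterIterator(
    people(),
    fn (Person $p): bool => $p->age > 25
);

// RegexIterator casts each item to a string (via __toString) and keeps the matches.
$withAna = new RegexIterator(people(), '/ana/');

$toNames = fn (Iterator $it): array => array_map(
    fn (Person $p): string => $p->name,
    iterator_to_array($it, false)
);

$olderNames = $toNames($olderThan25);
$anaNames = $toNames($withAna);
```

CallbackFilterIterator is handy for one-off conditions that do not justify a dedicated class, while a named filter class is easier to reuse and test.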

The Memory-Efficient Guide to Blazing Fast CSV Filtering with PHP
Juan Millan