Effortless CSV Parsing in Golang: A Hands-On Approach


Handling big CSV files can take a lot of time, but Golang makes the process much quicker. Its optimized performance and concurrency support let developers stream through large datasets instead of loading everything at once, simplifying data tasks and improving overall speed. Using Golang turns the once burdensome job of managing extensive CSV datasets into a smoother, more time-saving process, making it a valuable tool for developers working in data-heavy workflows.
1. Setting Up the Project and Installing the Required Packages
Let's start by setting up the Go project. First, we will initialize a new module in the current directory using the go mod init moduleName
command. The module name can be your repository name. I am going to run the following command.
go mod init github.com/shaileshhb/go-file-parser
Once the command is executed, a go.mod file will be created (a go.sum file will appear once we add our first dependency). Now we will install the package that we are going to use for reading the CSV file.
go get -u github.com/gocarina/gocsv
Now we are ready to start writing our code. Create a main.go file where we will write it.
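To have something compilable from the start, main.go can begin as a minimal placeholder like the sketch below; the greeting line is purely illustrative and will be replaced by the parsing code in the next sections.

package main

import "fmt"

func main() {
	// Placeholder: the CSV reading logic from the next sections goes here.
	fmt.Println("go-file-parser starting up")
}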
2. Opening and Reading a CSV File
To read the CSV file we first have to open it. There are multiple approaches to opening a file; we will be using the OpenFile function from the os package.
readFilePath := "process.csv"

// Open the CSV readFile
readFile, err := os.OpenFile(readFilePath, os.O_RDONLY, os.ModePerm)
if err != nil {
	panic(err)
}
defer readFile.Close()
The OpenFile function takes three parameters:
- readFilePath: the path to the file you want to open.
- os.O_RDONLY: a flag indicating that the file should be opened for reading only.
- os.ModePerm: the file permission mode, indicating that the file should have the default permissions for its type (e.g., 0666 for a regular file).
OpenFile returns a file and an error, so we need to handle the error. Here the program panics and stops execution, but in a real-world scenario you would usually return the error to the caller instead, as sketched below. Finally, we defer closing the file so it is released once our work is done.
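As a quick illustration of that error-handling advice, here is a small hypothetical helper (the openCSV name is mine, not part of any package) that returns the error instead of panicking. It uses os.Open, which is shorthand for opening a file read-only and could replace the os.OpenFile call above.

// openCSV opens a CSV file for reading and hands any error back to the caller.
func openCSV(path string) (*os.File, error) {
	file, err := os.Open(path) // equivalent to os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return nil, fmt.Errorf("opening %s: %w", path, err)
	}
	return file, nil
}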
3. Defining Structs to Match CSV Columns
Since we are using the github.com/gocarina/gocsv package, there are certain rules for defining the struct that will be filled while parsing the file.
When defining the struct, each field gets a csv tag that holds the name of the corresponding column in the CSV file.
For example, say we have a CSV with a column named 'Full Name'. For this we would define the struct as follows:
type User struct {
	FullName string `csv:"Full Name"`
}
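As a side note, gocsv also documents a csv:"-" tag for fields that should be ignored during parsing. A small sketch of that might look like this (the Internal field is purely illustrative):

type User struct {
	FullName string `csv:"Full Name"`
	Internal string `csv:"-"` // not read from or written to the CSV
}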
The CSV file that I am going to use has the following columns:
Organization Name, LinkedIn, Website, Total Funding Amount, Total Funding Amount Currency, Headquarters Location. So my struct will look something like this:
type Industry struct {
	CompanyName                string `csv:"Organization Name"`
	LinkedIn                   string `csv:"LinkedIn"`
	Website                    string `csv:"Website"`
	TotalFundingAmount         int    `csv:"Total Funding Amount"`
	TotalFundingAmountCurrency string `csv:"Total Funding Amount Currency"`
	HeadquartersLocation       string `csv:"Headquarters Location"`
}
4. Parsing CSV Data into Structs using the UnmarshalToChan() Function
UnmarshalToChan enables memory-efficient processing of large datasets by streaming records into a channel instead of loading them all at once.
Calling UnmarshalToChan is very simple: it takes two arguments, the first being the file and the second the channel. The channel has to be a channel of the struct type we just created.
readChannel := make(chan Industry, 1)

err := gocsv.UnmarshalToChan(readFile, readChannel)
if err != nil {
	panic(err)
}
Here we create readChannel, which is where the parsed records will arrive. As each record is read, it is pushed into the channel, and readFile is the file that we opened previously. UnmarshalToChan returns an error, so we need to handle it once again.
The package also lets us supply our own CSV reader, where we can set a few options for how the file is read. It can be done as follows:
gocsv.SetCSVReader(func(r io.Reader) gocsv.CSVReader {
	reader := csv.NewReader(r)
	reader.Comma = ','
	reader.LazyQuotes = true
	reader.FieldsPerRecord = -1
	return reader
})
A breakdown of the function is as follows:
- gocsv.SetCSVReader: sets a custom CSV reader that gocsv will use when parsing CSV data.
- func(r io.Reader) gocsv.CSVReader: a function literal (anonymous function) that takes an io.Reader and returns a gocsv.CSVReader; it defines how the CSV reader should be configured.
- csv.NewReader(r): creates a new CSV reader using the csv package from the Go standard library, taking an io.Reader as its parameter.
- reader.Comma = ',': sets the character used for field separation in the CSV file. In this case it's the standard comma (,).
- reader.LazyQuotes = true: configures the CSV reader to allow lazy quotes, meaning a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
- reader.FieldsPerRecord = -1: FieldsPerRecord is the number of expected fields per record. If it is positive, Read requires each record to have that many fields; if it is 0, Read sets it to the number of fields in the first record, so future records must have the same count; if it is negative, no check is made and records may have a variable number of fields. Setting it to -1 is useful for handling irregular CSV files where records have varying numbers of fields.
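To illustrate why this hook is useful, here is a sketch of the same configuration adapted for a semicolon-delimited file; the semicolon delimiter is only an example and is not something the file used in this post needs.

gocsv.SetCSVReader(func(r io.Reader) gocsv.CSVReader {
	reader := csv.NewReader(r)
	reader.Comma = ';'          // parse semicolon-separated values instead of commas
	reader.LazyQuotes = true    // tolerate stray quotes inside fields
	reader.FieldsPerRecord = -1 // allow records with varying field counts
	return reader
})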
The entire part up till here should look something like this:
package main

import (
	"encoding/csv"
	"io"
	"os"

	"github.com/gocarina/gocsv"
)

type Industry struct {
	CompanyName                string `csv:"Organization Name"`
	LinkedIn                   string `csv:"LinkedIn"`
	Website                    string `csv:"Website"`
	TotalFundingAmount         int    `csv:"Total Funding Amount"`
	TotalFundingAmountCurrency string `csv:"Total Funding Amount Currency"`
	HeadquartersLocation       string `csv:"Headquarters Location"`
}

func main() {
	readChannel := make(chan Industry, 1)
	readFilePath := "process.csv"

	// Open the CSV readFile
	readFile, err := os.OpenFile(readFilePath, os.O_RDONLY, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer readFile.Close()

	readFromCSV(readFile, readChannel)
}

func readFromCSV(file *os.File, c chan Industry) {
	gocsv.SetCSVReader(func(r io.Reader) gocsv.CSVReader {
		reader := csv.NewReader(r)
		reader.Comma = ','
		reader.LazyQuotes = true
		reader.FieldsPerRecord = -1
		return reader
	})

	// Stream the CSV records into the channel
	go func() {
		err := gocsv.UnmarshalToChan(file, c)
		if err != nil {
			panic(err)
		}
	}()
}
Here I have extracted the code that reads from the CSV file into its own function, and UnmarshalToChan
is wrapped inside an anonymous goroutine; apart from that, everything looks the same. One thing to keep in mind is that a panic inside the goroutine cannot be handled by main, so a common alternative is to report the error over a channel, as sketched below.
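Here is a rough sketch of that error-channel variant; the errChan parameter and the changed signature are my own illustration, not part of the original code or the gocsv API.

func readFromCSV(file *os.File, c chan Industry, errChan chan error) {
	gocsv.SetCSVReader(func(r io.Reader) gocsv.CSVReader {
		reader := csv.NewReader(r)
		reader.Comma = ','
		reader.LazyQuotes = true
		reader.FieldsPerRecord = -1
		return reader
	})

	go func() {
		// Report the parsing error (or nil) to the caller instead of panicking.
		errChan <- gocsv.UnmarshalToChan(file, c)
	}()
}

The caller can then drain readChannel as before and receive from errChan once the loop ends to decide how to react.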
Now the only missing part is reading from the channel, which can be done using a for range loop as follows. Once UnmarshalToChan has finished, it closes the channel, so the loop ends on its own after the file has been fully read.
for r := range readChannel {
	fmt.Println("========================================")
	fmt.Println(r)
	fmt.Println("========================================")
	fmt.Println()
}
Here I am just printing out each record that has been read, but in your case you can process the record based on your requirements, for example as sketched below.
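For instance, here is a sketch of what that processing could look like, collecting only well-funded companies into a slice; the funding threshold and the funded variable are purely illustrative and would replace the print loop inside main.

var funded []Industry

for r := range readChannel {
	// Keep only records whose total funding exceeds an arbitrary example threshold.
	if r.TotalFundingAmount > 1_000_000 {
		funded = append(funded, r)
	}
}

fmt.Printf("found %d companies with more than 1M in funding\n", len(funded))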
The entire code is below
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/gocarina/gocsv"
)

type Industry struct {
	CompanyName                string `csv:"Organization Name"`
	LinkedIn                   string `csv:"LinkedIn"`
	Website                    string `csv:"Website"`
	TotalFundingAmount         int    `csv:"Total Funding Amount"`
	TotalFundingAmountCurrency string `csv:"Total Funding Amount Currency"`
	HeadquartersLocation       string `csv:"Headquarters Location"`
}

// 5.57ms -> 600 records (read)
func main() {
	now := time.Now()
	readChannel := make(chan Industry, 1)
	readFilePath := "process.csv"

	// Open the CSV readFile
	readFile, err := os.OpenFile(readFilePath, os.O_RDONLY, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer readFile.Close()

	count := 0
	readFromCSV(readFile, readChannel)

	// Print the records
	for r := range readChannel {
		fmt.Println("========================================")
		fmt.Println(r)
		fmt.Println("========================================")
		fmt.Println()
		count++
	}

	fmt.Println(time.Since(now), count)
}

func readFromCSV(file *os.File, c chan Industry) {
	gocsv.SetCSVReader(func(r io.Reader) gocsv.CSVReader {
		reader := csv.NewReader(r)
		reader.Comma = ','
		reader.LazyQuotes = true
		reader.FieldsPerRecord = -1
		return reader
	})

	// Stream the CSV records into the channel
	go func() {
		err := gocsv.UnmarshalToChan(file, c)
		if err != nil {
			panic(err)
		}
	}()
}
I have pushed this code here; you can use it for reference. The repository also has a sample CSV file which you can use for testing.
Here are the links for the package that I have used:
Go package: https://pkg.go.dev/github.com/gocarina/gocsv
GitHub: https://github.com/gocarina/gocsv
Thanks for reading through the blog. Please let me know if I have missed something or done something incorrectly, so that I can make those changes and learn from my mistakes.