Command Line for Data Analysts

Jon Taylor

Understanding your data is a crucial part of data analysis, and using the right tools can make your job much simpler. By the end of this post, you'll have seen one way I approach these problems with the command line and, hopefully, picked up something new.

Throughout this post, we'll assume we've been sent a file containing a new dataset, and our aim is to understand what it contains ready for ingest into our big data platform. I won't cover the process of data cleansing in detail; however, the tools covered here are useful throughout that process.

Uncompressing the data

We'll assume we've been sent the data as an XZ-compressed archive; however, to analyse the data we need to see it in plain text. It can be uncompressed and stored in a file or sent to stdout. I prefer not to create new files if I can avoid it: it minimises what I keep on my local system and reduces the likelihood of retaining data longer than I should.

tar -xOJf data.xz     # Decompress the data and output to stdout

Option  Description
-x      Extract the compressed file
-O      Output to stdout rather than to a file
-J      Use the XZ compression format
-f      Read the data from the following file
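
If you did want the extracted files on disk, or if the file turned out to be plain XZ data with no tar wrapper, something like the following would cover both cases (a sketch against the same data.xz):

tar -xJf data.xz    # Extract the archive to files on disk instead of stdout
xz -dc data.xz      # If the file is plain XZ with no tar wrapper, stream the decompressed data to stdout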

Other useful commands

The following commands may also be useful when uncompressing files of different types: gunzip, zcat, zgrep and unzip.
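
For example (a sketch with hypothetical file names; pick whichever matches the format you've been sent):

gunzip -c data.json.gz      # Decompress a gzip file to stdout
zcat data.json.gz           # Equivalent to gunzip -c
zgrep CNAME data.json.gz    # Search inside a gzip file without decompressing it to disk
unzip -p data.zip           # Extract a zip archive's contents to stdout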

Parsing JSON data

Running the tar command outputs what looks like JSON data. To verify this, we can use the jq command, which can be installed on Linux with apt-get or yum, or on macOS with brew.
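
For example, one of the following should work depending on your platform:

sudo apt-get install jq    # Debian / Ubuntu
sudo yum install jq        # RHEL / CentOS
brew install jq            # macOS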

tar -xOJf data.xz |  # Uncompress the file
  jq .               # Parse and validate JSON with jq

This will output the JSON data in a pretty-printed and highlighted format. If the file weren't valid JSON, jq would throw an error.

{
  "dns": [
    {
      "type": "A",
      "question": "localhost",
      "answers": [
        {
          "ttl": 60,
          "answer": "127.0.0.1"
        }
      ]
    },
    ...
  ]
}
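
If you only want a quick yes/no validity check, or a peek at the overall shape before pretty-printing everything, jq can do that too. A minimal sketch against the same file:

tar -xOJf data.xz | jq empty && echo "valid JSON"   # Parse only; jq exits non-zero on invalid JSON
tar -xOJf data.xz | jq 'keys'                       # List the property names in the root object
tar -xOJf data.xz | jq '.dns | length'              # Count the entries in the dns array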

We can use jq to help us start to understand the data. There's a dns property in the root object, so let's use jq to explore what's inside the dns array, starting with a list of all the record types.

tar -xOJf data.xz |     # Uncompress the file
  jq '.dns[].type' |    # Pull out the types from dns records
  sort |                # Sort the types
  uniq -c |             # Get unique values and count occurrences
  sort -n               # Sort numerically

We use jq to extract the type field from all elements in the dns property. We then use the sort and uniq commands together to count how many entries of each type there are; uniq works by comparing adjacent rows, so sorting first is required. The -c flag is equivalent to a count and group by in SQL, and the -n flag sorts numerically rather than lexicographically.

    12 "AAAA"
   253 "A"
 45232 "CNAME"

We can see from the above example output that this file contains mostly CNAME records.
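
As an aside, if you want to convince yourself why the first sort matters, here's a toy example: uniq only collapses adjacent duplicates, so without sorting the counts come out wrong.

printf 'A\nCNAME\nA\n' | uniq -c          # Without sorting: 1 A, 1 CNAME, 1 A
printf 'A\nCNAME\nA\n' | sort | uniq -c   # With sorting:    2 A, 1 CNAME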

CNAME records shouldn't ever have more than one entry per fully qualified domain name. However, if this data is historic, it could legitimately have multiple CNAME entries. For this reason, we want to do some further digging to understand the average number of CNAME records per FQDN.

Filtering the data and running aggregations

There are plenty of ways this can be done; I'll show you one of them. First, I'll use jq to filter the records down to only the CNAME records.

tar -xOJf data.xz |                        # Uncompress the file
  jq '.dns[] | select(.type == "CNAME")'   # Filter only CNAME records

Then I'll get the length of the answers array within each CNAME record and output a CSV with the domain name and the number of answers. Given the strings are quoted, we can use the tr command to remove the quotes. jq can do this itself too (see the alternative sketch after the sample output), but we'll use tr for this example.

tar -xOJf data.xz |                          # Uncompress the file
  jq -r '.dns[]                              # Select dns data
         | select(.type == "CNAME")          # Filter CNAME records
         | [.question, (.answers | length)]  # Get length of answers
         | @csv' |                           # Output as CSV
  tr -d '"'                                  # Remove double quotes

The output of this command will look something like this:

w55F0nxbkU4cM.com,4
ODz5TTRENBN4Gp0TyDaxwms.com,1
IIjmjwNpLjruoD.com,1
aZihawQYKUEXzu9YDn.com,2
TO5psWk65lRr4aOhNv2XLyjup3tH.com,3
JwbLTialG6.com,2
...
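
As noted above, jq can drop the quotes itself. Here's a minimal alternative sketch that builds each line with jq string interpolation instead of @csv, so tr isn't needed; the output is the same.

tar -xOJf data.xz |                                # Uncompress the file
  jq -r '.dns[]                                    # Select dns data
         | select(.type == "CNAME")                # Filter CNAME records
         | "\(.question),\(.answers | length)"'    # Interpolate domain and answer count into one line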

Domain names are case-insensitive, so we want to take these mixed-case domains and lowercase them. Note: there are reasons for case-sensitive domain records, but we'll ignore those for now.

For this example, we're going to assume that two requests for the same domain with varying case would likely return the same list of answers, so we'll just pull out the maximum number of records associated with each. We're oversimplifying here, but this is often good enough when initially exploring your data.

Convert uppercase to lowercase

We can convert the uppercase letters in the domain names using the tr command. You can also use commands such as sed to accomplish the same thing (a sed sketch follows the next command).

tar -xOJf data.xz |                          # Uncompress the file
  jq -r '.dns[]                              # Select dns data
         | select(.type == "CNAME")          # Filter CNAME records
         | [.question, (.answers | length)]  # Get length of answers
         | @csv' |                           # Output as CSV
  tr -d '"' |                                # Remove double quotes
  tr '[:upper:]' '[:lower:]'                 # Lowercase the domains
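
If you'd rather use sed, GNU sed can do the lowercasing with the \L modifier in the replacement. A small sketch of just that stage, with a made-up input line:

echo 'MiXeD.ExAmPlE.com,3' | sed 's/.*/\L&/'   # GNU sed only: \L lowercases everything that follows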

Aggregating the data

We can write simple awk commands to run calculations across our data. This enables us to group by fields and calculate averages, minimums, maximums and so on. We'll start by calculating the maximum number of answers per CNAME domain, then we'll calculate the average across them all.

Awk commands are written in the following structure.

awk 'BEGIN { } ;        # Something to do before processing records
     { }                # Process each record
     END { }'           # Something to do after processing records
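
To make the template concrete, here's a tiny example that sums three numbers: BEGIN runs once before any input, the middle block runs for every line, and END runs once at the end.

printf '1\n2\n3\n' |
  awk 'BEGIN { print "summing..." }   # Runs once, before the first line
       { total += $1 }                # Runs for every input line
       END { print total }'           # Runs once at the end; prints 6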

In our case, we'll set the field separator, then calculate the maximum per FQDN.

tar -xOJf data.xz |                          # Uncompress the file
  jq -r '.dns[]                              # Select dns data
         | select(.type == "CNAME")          # Filter CNAME records
         | [.question, (.answers | length)]  # Get length of answers
         | @csv' |                           # Output as CSV
  tr -d '"' |                                # Remove double quotes
  tr '[:upper:]' '[:lower:]' |               # Lowercase the domains
  awk 'BEGIN { FS = "," } ;                  # Set the field separator
       { if ($2 > v[$1]) v[$1] = $2 }        # Keep the max answers per domain
       END { for (k in v) print v[k] }'      # Print the max for each domain

Finally, we need to calculate the average over all the records.

tar -xOJf data.xz |                          # Uncompress the file
  jq -r '.dns[]                              # Select dns data
         | select(.type == "CNAME")          # Filter CNAME records
         | [.question, (.answers | length)]  # Get length of answers
         | @csv' |                           # Output as CSV
  tr -d '"' |                                # Remove double quotes
  tr '[:upper:]' '[:lower:]' |               # Lowercase the domains
  awk 'BEGIN { FS = "," } ;                  # Set the field separator
       { if ($2 > v[$1]) v[$1] = $2 }        # Keep the max answers per domain
       END { for (k in v) print v[k] }' |    # Print the max for each domain
  awk '{ sum += $1; num++ }                  # Sum all the values
       END { avg = sum / num;                # Calculate the average
             print avg }'                    # Print the overall average

Conclusion

There are plenty more questions you'd want to ask of the data, especially if you're trying to understand data integrity and format ready for your big data platform. While we could answer these using Python, Spark, Zeppelin notebooks and the like, those tools often feel like overkill. If the complexity were to increase, or the scale of the data became unreasonable for my local machine, I'd certainly consider other tools.

Getting used to the different commands at your disposal will significantly increase the speed at which you can write these kinds of analytics and will vastly improve how you explore data. Knowing how the tools operate will also help you decide when it's time to move away from the command line to more bespoke data analysis tools.

Other tools I recommend

Finally, I showed you a few commands and examples of how I use them, but there are plenty more that I'd strongly recommend getting used to. Understanding regular expressions, grep searching, sed statements and awk queries has made my time using the command line so much more powerful.

Tool                            Purpose
grep / fgrep / egrep / ripgrep  Searching/filtering the contents of files
sed                             Modifying the contents of files
awk                             Text document manipulation, often used for aggregations across records
tr                              Translate / delete characters
uniq                            Find unique elements. Remember to sort first. Use -c to do a grouped count
sort                            Sort records. Use the -n flag to sort numerically
cut                             Cut a string into pieces. Use -d to define the delimiter and -f to specify the fields you want
wc                              Word count; use the -l option to count lines
jq                              Parse and manipulate JSON data
cat                             Print the contents of one or more files to stdout
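
A couple of quick examples of the flags mentioned above, using a hypothetical records.csv:

cut -d ',' -f 1 records.csv   # Keep only the first comma-separated field of each line
wc -l records.csv             # Count the number of lines in the file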