Colorization Using Perl with Regular Expressions

TesserId

!!! Jump to “The Perl Script for Colorizing with Regular Expression on Syslog” for a quick start.

The script below colorizes each chunk of text captured by a regular expression (regex), so a scripter can see at a glance what each part of the pattern matched. That makes developing regexes a lot easier.

This is part of a series. You may need these for reference and deeper explanation.

About Windows:

The last time I used Perl on Windows was when I could still use Cygwin at work (and I don’t have Windows at home). Also, the example below does minimal parsing of Linux/UNIX syslog, which turns out to simplify the article. For Windows, I have two suggestions about how to cope: 1) try it on something in Windows (it might work), or 2) do some Linux virtualization (it’s not that hard). Sorry if this ruffles some feathers.

Motivation

I had been working on some fairly complex regular expressions on a large body of content (corpus). It was tedious. I found myself using tricks to insert delimiters to see just how much of a line the regex in question was grabbing. Again, it was very tedious.

Out of frustration, I decided I wanted something like syntax highlighting to make it easier to see just what I was matching. I ended up really pleased with how well it worked for me. Naturally, I wanted to document it. But, my first attempt at an article included too many variations and too much detail about how I built it up. I had to give up on that article. Hopefully, this works better.

About These Regular Expressions

The technique used in this script is to have a regular expression populate an array, @F. This is done (usually) with plain parentheses in the regex, which are capture groups: in list context, a successful match returns the captured text. When grouping is needed without capture, you can use (?: …).

As a rule of thumb, the captures populate the array in the order in which their opening parentheses appear. Captures can, with caveats, be nested, though I generally don’t like what I get when I do that. Play with it; see whether you like it or not.
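Here is a minimal sketch of that ordering, including one nested capture and one non-capturing group. The sample string and the pattern are made up for illustration; they are not part of the syslog script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to illustrate group numbering
my $sample = "2024-03-15 error";

# Group 1 opens first (the whole date), groups 2 and 3 open inside it,
# (?: ... ) groups without capturing, and group 4 opens last.
my @F = $sample =~ m{^ ( (\d{4}) - (\d\d) (?: - \d\d ) ) \s+ (\w+) }x;

print join(" | ", @F), "\n";   # prints: 2024-03-15 | 2024 | 03 | error
```

The outer group lands in the array ahead of the groups nested inside it, which is the part I tend not to like; the day of the month never shows up on its own because its group is non-capturing.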

If you don’t recognize all the features in the regex below, there are plenty of tutorials out there to cover them. And if you do find yourself seeking out tutorials, you may find this script useful for playing with what they teach. I personally used a script like this for my very first use of recursive parsing, which was for matched brackets of various types. But, I’ll save that for later. It deserves its own article.

Build your regexes from the top (the beginning of the text to be parsed) down (proceeding into the text to be parsed). Notice the beginning-of-line anchor, ’^’, at the beginning of the regex. You can take this out if you want to do middle-of-the-line matches that are not anchored to the beginning. But, I generally don’t do that.

One great feature of working with the ’x’ regex option (seen at the end of the regex) is the ability to add comments and, this is the kicker, to comment out regular expression elements that you suspect aren’t behaving as you wanted. So, if you’re building top-down, as I recommended above, get the parts that parse the beginning of the line first and build down. When things go wonky, comment out a chunk of your latest additions and work back to figure out what went wrong. From there, you can build it back up with your old, commented-out code as a guide, fixing things along the way. Then delete the wonky code after it’s been properly replaced.
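As a rough sketch of that workflow, here is a pattern with one suspect piece commented out and a catch-all group left in place while it is being repaired. The sample line and the suspect sub-pattern are hypothetical. One caution: a comment inside a regex delimited by braces should not contain an unbalanced brace, and it is safest to keep ’$’ and ’@’ out of such comments as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line in the general shape of a syslog entry
my $line = "Mar 15 10:22:01 myhost sshd[1234]: Accepted publickey for alice";

my @F = $line =~ m{^
    (\w{3} \s+ \d+ \s+ [\d:]+) \s+   # Timestamp (already working)
    (\S+) \s+                        # Host name (already working)
    # (\w+) \[ (\d+) \] : \s+        # Suspect piece, commented out while debugging
    (.*)                             # Everything else, so the line still matches
}x;

print join(" | ", @F), "\n";
```

With the suspect line disabled, the match yields three fields, and the third field shows exactly the text that the next piece of the regex still has to handle.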

The Perl Script for Colorizing with Regular Expression on Syslog

Put this in a file (e.g., parse-syslog.pl):

#!/usr/bin/perl -w

use strict;
use v5.16.3;

# Simple ANSI color code generation (these two lines can be made fancier as desired)
my @nrm = ( "\e[0m", reverse map { "\e[3" . $_ . "m" } (1..7) );
my @clr = map { ( $_, shift @nrm ) } map { "\e[9" . $_ . "m" } (0..7);

# Arrays used in loop
my (@clrs, @hi, @F);

while (<>) {
  chomp; # Drop the newline so the NO MATCH marker stays on the same line
  # Regex
  @F = m{^  # Anchor to beginning of samples

    # Leading date, without being specific about localization
    ([^:]+ \d\d(?::\d\d){1,3}(?:\.\d+)?)\ 

    (\S+)\   # System name
    (\S+):\  # Process name, non-validating

    (.*) # Trailing characters
  }x
  or say($_, " ---<<{NO MATCH}>>---") and next;

  say "undefined" unless defined $F[0]; # Validate

  #Colorize
  @clrs = (@clrs, @clr) while scalar(@F) > scalar(@clrs);
  @hi = map { $_ . (shift(@F) // "") } @clrs[0..$#F]; # // keeps a captured "0"

  say join("\e[90m | ", @hi), "\e[0m";
}

# vi: set ts=8 sw=2:
# vim: set et sta:

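Important: The regex above contains escaped space characters (a backslash followed by a space). Because of the ’x’ option, unescaped whitespace in the pattern is ignored, so if the escaped spaces don’t copy out nicely, you may have to edit them manually to make sure the regex matches a space where a space exists.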
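If the escaped spaces keep getting lost in copy and paste, one possible alternative is to write each literal space as [ ] or \x20, both of which still match a single space under the ’x’ option. This is only a sketch of the same pattern with that substitution and without the optional fractional seconds; the sample line is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical syslog-style line for illustration
my $line = "Mar 15 10:22:01 myhost cron: job started";

# Literal spaces are written as [ ] or \x20 instead of backslash-space
my @F = $line =~ m{^
    ([^:]+ \d\d (?: : \d\d ){1,3}) [ ]   # Date and time, then one literal space
    (\S+) \x20                           # System name, then one literal space
    (\S+) : [ ]                          # Process name, colon, literal space
    (.*)                                 # Trailing characters
}x;

print join(" | ", @F), "\n";   # prints: Mar 15 10:22:01 | myhost | cron | job started
```

Both forms survive trailing-whitespace stripping, which a backslash followed by a space at the end of a line does not.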

This is a relatively minimal example, in that it only grabs the first three chunks (“columns”) of syslog and lumps the rest of the line into a fourth. Hopefully, it’ll play nicely regardless of localization. What I have there for the date format involves a lot of guessing, but it works for en_US.

The point here is that this is a really nice playground for developing more complex regular expressions. You can do some really nice log parsing with this if regexes are your thing. So, go ahead; play.

Set the permissions (however you like, or…):

chmod ug+x parse-syslog.pl

Run this to parse a sample (head) of syslog (using a Bash feature):

./parse-syslog.pl <( head -25 /var/log/syslog )

Experienced users will recognize the <( … ) notation as Bash process substitution. It lets the file argument of one command, parse-syslog.pl, be generated by another command, head. If you see yourself becoming a command-line warrior, you will come to find this useful (e.g. vim <( log-file )).

Then, finally, the command to see the whole log file (if it wasn’t obvious):

./parse-syslog.pl /var/log/syslog

On Red Hat systems, you may have to use /var/log/messages, if you have permissions.

The obligatory sample output:

Optional Colors

The two lines in the above script that generate the colors can be replaced by a variety of fancier schemes. The ones used above were kept minimal on purpose. They weren’t meant to detract from the rest of the script.

Commentary

While the terse coding keeps the space occupied in the script small, there is a drawback to it—one that many just don’t like. You have to be familiar with some very Perl’ish coding styles, which some would regard as obfuscated (sorry), and you really need the skill and familiarity with the idiom to make sense of it. To me, it’s not obfuscated; I can read it. But, I’m in no way claiming it’s easy.
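For anyone less at home with the idiom, here is a longhand sketch of what I understand the two color-generating lines in the script to build. The name @normal is introduced here only for illustration; the resulting @clr should hold the same sixteen codes in the same order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Normal-intensity codes: reset first, then 37 down to 31
my @normal = ("\e[0m");
for my $n (reverse 1 .. 7) {
    push @normal, "\e[3${n}m";
}

# Interleave the bright codes 90..97 with the normal ones
my @clr;
for my $n (0 .. 7) {
    push @clr, "\e[9${n}m", $normal[$n];
}

# Show the resulting order with the escape character made visible
print join(" ", map { s/\e/\\e/r } @clr), "\n";
```

Alternating a bright code with a normal one means neighboring fields never share an intensity, which helps the boundaries stand out.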

Try replacing the two lines that generate the color codes with:

# Names of colors--order matching that of ordinal values for ANSI colors
my @colorNames = qw(Black Red Green Yellow Blue Purple Cyan White);

# Generate color codes from names into hash
my %color;
@color{@colorNames} = map { "\e[3" . $_ . "m" } (0..$#colorNames);
@color{map { "bld$_" } @colorNames} = map { "\e[9" . $_ . "m" } (0..$#colorNames);
@color{map { "rvr$_" } @colorNames} = map { "\e[7;3" . $_ . "m" } (0..$#colorNames);

# Explicitly organize the sequence of colors by name, using a hash slice
my @clr = @color{qw[ bldYellow rvrBlue bldGreen Yellow Blue bldCyan Cyan
                     rvrWhite Purple bldBlue rvrYellow Red Green bldPurple bldRed
                     bldWhite White rvrRed rvrGreen rvrPurple rvrCyan ]};

Niceties

I’ve gotten a lot of mileage out of this technique. I’m expecting there are at least a few of you who will find it as wonderful as I do. And, hopefully, this style of presentation worked for you. And, if you take joy in coding as I do, then I’m glad to be able to share the experience.

Links (Repeated for Convenience):

Enjoy
