Save Big with Logging Designed for Analysis by AI (and Humans)

Darwin Sanoy
7 min read

If you create any kind of software, you can reduce your support costs and improve your AI compatibility profile by enhancing your logging with AI Traceability in mind.

I recently had an interesting experience. During a customer demonstration, I picked, at random, the CI output log of a GitLab CI/CD Component I created that automatically manages semantic version generation, and I asked GitLab Duo to analyze it.

To my delightful surprise, it accurately described about 90% of the component’s functionality.

This was very interesting because GitLab CI/CD Components use the concept of “includes:”, which purposely abstracts away the code they contain for a simpler, lower-cognitive-load experience for developers who just want to use helpful functionality without having to author it from scratch themselves.

However, if you get curious, or need to understand that code better for debugging, having an explanation from AI is great.

Yep, That’s Right! You Could Have Benefitted From This Before AI

As with architecture, there are things that would always have been beneficial to human intelligent agents but that are now non-negotiable in the age of AI. When we don’t incorporate these cues for humans, we over-rely on their expansive ‘agency’ to “figure it out with scant evidence”, keep at it until it is done, and keep at it with background processes if it is taking longer.

The reason AI makes these enhancements even more relevant is that human agency is willing to go higher, longer, and farther to find and resolve a problem. In a very real way, “human agency over a problem persists across very, very long contexts”.

Ever had an idea for improving some code 3 months after you last touched it? Yeah, that’s very long-term subconscious agency. AI doesn’t do that yet, and if we ever invent it, I’m gonna guess it will cost much more than the half-calorie it takes your brain to do it.

AI Value Multiplier From Logs Designed for Intelligent Analysis

This experience intersected with my past experience enhancing debugging traceability for humans through good logging practices. The small DevOps tooling team I led was doing a lot of expensive over-the-shoulder debugging of internal customers’ usage of our tools. A frequent debugging outcome was identifying insufficient or errant cloud networking configuration.

It struck me that these developers would not call us until they had gone through a lot of heartache. It also struck me that many of these situations were detectable by software. So we set about enhancing our logs with specific intelligence that could be sensed right when the problem occurred and emitted to the log. Many developers then happily self-resolved external problems and more accurately identified when we actually did have bugs that needed addressing. They more frequently came with a solution in hand as well.

The simplest non-obvious practice was doing more success logging of “milestones in the code”, because this allows humans to naturally narrow the range of code in which to look for the problem. Obviously, this practice allows AI to do the same when it has access to both the code and the log with an error. If you’ve never pursued explicit logging engineering for traceability, now is a good time to think about it.
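
Here is a minimal sketch of what milestone logging can look like in a shell script; the function name and the milestone messages are illustrative, not from any particular tool:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: emit a timestamped milestone so a reader (human or AI)
# can bracket a failure between the last success line and the error.
milestone() {
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) MILESTONE: $*"
}

milestone "Configuration loaded and validated"
# ... configuration-dependent work ...
milestone "Source repository cloned"
# ... build and publish steps ...
milestone "Artifact published"
```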

Logging Practices to Consider

There are probably better lists of all the things one could log; I am going to articulate the ones I’ve had specific experience with.

  1. Assume “out of context” log analysis.

    Those who author logging content frequently envision the log being accessed and analyzed on the same system that generated it, because that is how we develop the logging itself. Unfortunately, this causes the log data to assume that the troubleshooter has access to metadata about the system (because they are on the system that generated the log). With the rise of massive log aggregation systems and ephemeral compute like containers and serverless, it was already not the case that logs are primarily analyzed on the devices that generated them.
    Assuming our logs are primarily analyzed off the system where they were generated will automatically motivate us to include more system context, and that in turn makes the log much more valuable to AI, which thrives on additional “context”.

  2. Log metadata at the start (a sample startup header appears after this list)

    1. Log a short abstract of what the code does.

    2. Log the full location of the exact code version that was executed. This allows both AI and humans to find and correlate the exact code to the errors. It also facilitates discovery of newer versions that might fix the bugs being experienced.

    3. Log any non-security-related parameter settings. Keep in mind it’s not just secrets that can reveal attack vectors for your code.

    4. Log where this log was/is located on the system that generated it (because analysis is likely external to that system).

    5. Log the execution context of your code:

      1. Logical location, network location, and/or physical execution location of your code: any safe details that make it possible to isolate the environment, such as cloud account and region, machine name, domain name, IP or DNS address.

      2. Log process context: the runtime environment (container, instance, cluster) and execution user names and their permission levels.

  3. Support a verbose output mode for troubleshooting (a combined sketch with item 4 appears after this list)

    1. This mode might also put the utilities you call into verbose mode, or pipe their logs into a log that is more accessible to AI or humans.

    2. If you can’t feasibly copy the logs of external calls, then log where they are located for further in-depth analysis.

  4. Log external calls.

    My career has been nearly 100% automation and IaC. Automation is rife with external calls, so it is especially close to my heart as a source of failure for which my code can end up taking the blame (and the support call). A sketch combining this practice with the verbose mode above appears after this list.

    1. Log that you are “about to attempt” an external call. If safe and applicable, you can log the actual call string with parameters, or do that only during verbose output if more appropriate.

    2. Then log external call success: that the attempt was completed and, if helpful, what was done or obtained successfully, including the data value received. Sometimes everything is working as designed, and you have parameter errors that mean you are not asking for the right thing or in the right way.

    3. Log the timing of completed attempts at calls to external interfaces. Frequently, hidden or “eaten” errors can be detected by unexpectedly long execution times that hit timeouts.

  5. Precheck all known needed network endpoints before starting any execution; external network or cloud configuration problems can be easily recognized when these checks fail. Unlike ping, a TCP connect tests the full stack of possible breakdowns, such as DNS, routing, firewalling, and remote listening-process availability (and, if you complete a TLS handshake, SSL configuration on both sides). Essentially, the application must connect this way, so using it as the primary connection testing method tests everything the application depends on. Interestingly, it frequently results in much more specific errors from the failing sub-component in the chain.

    1. Here is some exceptionally versatile code for endpoint testing and logging that works on a very broad range of older and newer operating systems: Preflight TCP Connect Testing a List of Endpoints (Linux Shell and PowerShell). A minimal bash-only sketch also appears after this list.

  6. Log formatting (a short formatting sketch appears after this list)

    1. When possible, use obvious key-value pairs, such as “Version : 1.0” or “Version = 1.0”, to enable AI to parse metadata and semantics more easily.

    2. Log timing information (a timestamp) at the start of each line.

    3. Official log file formats help with machine readability, including by AI, but don’t go so far as to make them hard for humans to read.
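
To make practices 1 and 2 concrete, here is a minimal sketch of a startup log header for a shell-based tool. Every value shown (the abstract, the repository URL, the paths, and the REGION parameter) is illustrative, not from any real tool:

```bash
#!/usr/bin/env bash
# Hypothetical startup header: emit metadata before doing any work, so the log
# can be analyzed away from the system that generated it.
log_kv() { echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $1 : $2"; }

log_kv "Abstract"     "Builds and publishes the example-widget artifact"  # short abstract of what the code does
log_kv "Code Version" "https://example.com/repo/-/tree/v1.4.2"            # exact executed version (example URL)
log_kv "Log Path"     "/var/log/example-widget/build.log"                 # where this log lives on the source system
log_kv "Hostname"     "$(hostname)"
log_kv "User"         "$(id -un) (uid=$(id -u))"
log_kv "Working Dir"  "$(pwd)"
# Non-secret parameters only; remember it's not just secrets that can reveal attack vectors.
log_kv "Param region" "${REGION:-unset}"
```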
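
Items 3 and 4 combine naturally: a verbose flag controls whether the full call string is echoed, and every external call is bracketed with attempt, success, and timing lines. A minimal sketch, assuming a VERBOSE environment variable convention and an illustrative curl endpoint:

```bash
#!/usr/bin/env bash
set -uo pipefail

VERBOSE="${VERBOSE:-false}"                 # assumed convention for troubleshooting runs
ts() { date -u +%Y-%m-%dT%H:%M:%SZ; }

URL="https://api.example.com/v1/status"     # illustrative endpoint
echo "$(ts) About to attempt : GET external status API"
if [ "$VERBOSE" = "true" ]; then
  echo "$(ts) Full call : curl -fsS $URL"
fi

start=$SECONDS
if response=$(curl -fsS "$URL"); then
  echo "$(ts) External call completed in $((SECONDS - start))s"
  echo "$(ts) Value received : $response"   # confirms we asked for the right thing, the right way
else
  echo "$(ts) External call FAILED (rc=$?) after $((SECONDS - start))s"
fi
```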
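
For item 5, the article linked above has a far more portable implementation; what follows is only a minimal bash-specific sketch using the /dev/tcp device, with illustrative endpoints:

```bash
#!/usr/bin/env bash
# Minimal preflight TCP connect check (bash-specific sketch; endpoints are illustrative).
endpoints="api.example.com:443 registry.example.com:443 git.example.com:22"

failed=0
for ep in $endpoints; do
  host="${ep%%:*}"; port="${ep##*:}"
  # /dev/tcp is a bash feature; timeout keeps a silent firewall drop from hanging the check.
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "PASS TCP connect : $ep"
  else
    echo "FAIL TCP connect : $ep (check DNS, routing, firewalls, listener)"
    failed=1
  fi
done
exit $failed
```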
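
Finally, for item 6, one possible line format that stays friendly to grep, to legacy tooling, and to AI parsers: a timestamp first, a level, then an obvious key-value body. The convention shown is mine, not an official standard:

```bash
#!/usr/bin/env bash
# One formatting convention: ISO-8601 UTC timestamp, level, then "key : value".
log() { printf '%s %-5s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"; }

log INFO  "Version : 1.0"
log INFO  "Stage : publish"
log ERROR "Upload failed : HTTP 403"
```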

Human and AI Collaboration Synergies

It is interesting how many times what is good for AI is also good for old-fashioned human intelligence. Frequently, we over-rely on the resourcefulness of human intelligence and fail to reduce cognitive load through simple practices. Fortunately, when we make our logs contain more context for AI, they also become much more valuable to human intelligence and to legacy log analysis tooling, so it is a win all around.

Photo by Pramod Tiwari on Unsplash
