The hidden cost of newlines

Earlier this week I faced an interesting issue. I was developing a prototype for a Nextflow pipeline that will be used to manage a database. The prototype script did the following three things:

1. Get today's date in an ISO-like year.month.day format (e.g. 2022.07.30)
2. Create a folder named after today's date, where today is the day the script is run.
3. Finally, query the data API and create a file for adding data to the database.

The Script:

nextflow.enable.dsl=2

workflow {
   ch_infiles = Channel.fromPath(params.infiles,checkIfExists:true)
   ch_api = Channel.fromPath(params.api,checkIfExists:true)

   today() | view
   mk_today(ch_infiles,today.out) 
   query_api(ch_api,today.out,ch_infiles)
}

process today {
   tag "Get todays date"

   output:
   stdout emit: day

   script:
   """
   echo \$(date +"%Y.%m.%d")
   """
}

process mk_today {
   tag "Make todays folder"

   input:
   path seq_dir
   val day

   output:
   stdout

   script:
   """
   mkdir -p ${seq_dir}/${day}
   """
}

process query_api {

   input:
   path api
   val day
   val seq_dir

   output:
   file("${day}.data_source.csv")

   script:
   """
   python ${api} > ${seq_dir}/${day}/${day}.data_source.csv
   """
}

Problem: Running this Nextflow script gave the following output:

N E X T F L O W  ~  version 22.05.0-edge
Launching `core_bioinfo.nf` [angry_montalcini] DSL2 - revision: a74f4aaa3a
executor >  local (3)
[d3/8df354] process > today (Get todays date)            [100%] 1 of 1 
[60/029801] process > mk_today (Make todays folder)        [100%] 1 of 1 
[d9/846551] process > query_api (Get data from API) [  0%] 0 of 1

Error executing process > 'query_api (Get data from API)'

Caused by:
  Process `query_api (Get data from API)` terminated with an error exit status (1)

Command executed:

executor >  local (3)
[d3/8df354] process > today (Get todays folder)            [100%] 1 of 1 
[60/029801] process > mk_today (Make todays folder)        [100%] 1 of 1 
[d9/846551] process > query_api (Get data from API) [100%] 1 of 1, failed: 1 

Error executing process > 'query_api (Get data from API)'

Caused by:
  Process `query_api (Get data from API)` terminated with an error exit status (1)

Command executed:

  python get_data_from_api.py > /home/bkutambe/data/infiles/2022.09.01
  /2022.09.01
  .data_source.csv

Command exit status:  1

Command output:
  (empty)

Command error:
  .command.sh: line 2: /home/bkutambe/data/infiles/2022.09.01: Is a directory

Work dir:
  /home/bkutambe/Documents/Core_Bioinfo/work/d9/846551a8367ec0c0ce724e83d91edb

Forensics

In the background, Nextflow generates a script called .command.sh in the task's work directory, and a look into it revealed something interesting.

#!/bin/bash -ue
python get_covid_cases_from_api.py > /home/bkutambe/data/seqbox/infiles/2022.09.01
/2022.09.01
.sample_source_sample_pcrs.csv

The command to query the data API was split across three lines, and running it as is caused the error: Bash interpreted and ran them as three separate commands. The first of these tried to redirect output to a path that was a directory, hence the "Is a directory" message. This was supposedly the offending code!

Now the question was how to solve this. Enter observation bias. My preliminary assumption was that Nextflow was splitting the command into three separate commands. What was causing this remained a mystery. A Google search led to a dead end. I then fired up Slack and asked on both the official Nextflow and microbial bioinformatics channels for colleagues in the bioinformatics space to give their input.

The arrival of the cavalry

Colleagues in the microbial bioinformatics Slack channel did help. After examining the code, they pointed out that one of my variables ended with '\n', the newline character, which pushes anything that comes after it onto a new line.
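A hidden trailing newline is easy to confirm from a plain shell. The sketch below is not part of the pipeline; it uses a hard-coded example date and simply counts bytes, on the assumption that `wc`, `tr` and `od` are available (they are standard POSIX tools).

```shell
# The date string has 10 visible characters; echo appends one newline byte.
with_newline=$(echo "2022.09.01" | wc -c | tr -d ' ')
without_newline=$(printf '%s' "2022.09.01" | wc -c | tr -d ' ')

echo "echo:   ${with_newline} bytes"
echo "printf: ${without_newline} bytes"

# od -c prints every byte, so the trailing newline shows up as \n
echo "2022.09.01" | od -c
```

The one-byte difference between the two counts is the newline that ended up in the variable.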

The true offending code:

process today {
   tag "Get todays folder"

   output:
   stdout emit: day

   script:
   """
   echo \$(date +"%Y.%m.%d")
   """
}

Specifically this line:

echo \$(date +"%Y.%m.%d")

The intention was to capture the date in a variable like this:

day="2022.09.01"

BUT this happened:

day="2022.09.01\n"

This is the value that was propagated to downstream processes and triggered the runtime error when querying the data API.
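To see why a single trailing newline turns one command into three lines, the substitution can be imitated in a plain shell. This is only a rough sketch of what happens when the value is pasted into the script text: `seq_dir`, `api.py` and the variable names are hypothetical stand-ins, not the real pipeline values.

```shell
# Hypothetical stand-ins for the Nextflow variables.
seq_dir="/tmp/infiles"
nl='
'
day_raw="2022.09.01${nl}"   # the captured stdout, trailing newline included

# Imitate the substitution: the value is pasted into the script text as-is.
script_text="python api.py > ${seq_dir}/${day_raw}/${day_raw}.data_source.csv"

# Show the rendered script, then count its lines.
printf '%s\n' "$script_text"
line_count=$(printf '%s\n' "$script_text" | wc -l | tr -d ' ')
echo "rendered lines: ${line_count}"
```

Each use of the value breaks the line once, which matches the three-line command seen in .command.sh.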

Solution:

It was quite simple. Adding -n to the echo command suppressed the trailing newline, and that worked like a charm!

echo -n \$(date +"%Y.%m.%d")
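For comparison, here is a small sketch of two other ways to drop the newline in plain shell; these are alternatives I am assuming would work equally well, not what the pipeline above uses. Note that inside a Nextflow script block each `$` would need escaping as `\$`, as in the line above.

```shell
fmt="+%Y.%m.%d"

# Byte counts for three ways of emitting the date.
plain=$(date "$fmt" | wc -c | tr -d ' ')                      # date's own output ends in a newline
via_printf=$(printf '%s' "$(date "$fmt")" | wc -c | tr -d ' ') # printf '%s' adds nothing
via_tr=$(date "$fmt" | tr -d '\n' | wc -c | tr -d ' ')         # strip the newline explicitly

echo "date: ${plain}, printf: ${via_printf}, tr: ${via_tr}"
```

`printf '%s'` tends to be the more portable choice, since the behaviour of `echo -n` varies between shells. Trimming the value on the Nextflow side (for example with Groovy's `trim()` on the string) would be another option.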

Written by

Belson Malcolm Kutambe