The hidden cost of newlines
Earlier this week I faced an interesting issue. I was developing a prototype for a Nextflow pipeline that will be used to manage a database. The prototype script did the following three things:
1. Get today's date in year.month.day format (e.g. 2022.07.30)
2. Create a folder named after today's date, where today is the day the script is run.
3. Finally, query the data API and create a file for adding data to the database.
The Script:
nextflow.enable.dsl=2

workflow {
    ch_infiles = Channel.fromPath(params.infiles, checkIfExists: true)
    ch_api = Channel.fromPath(params.api, checkIfExists: true)
    today() | view
    mk_today(ch_infiles, today.out)
    query_api(ch_api, today.out, ch_infiles)
}

// Print today's date (e.g. 2022.09.01) and emit the captured stdout as `day`
process today {
    tag "Get todays date"
    output:
    stdout emit: day
    script:
    """
    echo \$(date +"%Y.%m.%d")
    """
}

// Create a folder named after the date inside the input directory
process mk_today {
    tag "Make todays folder"
    input:
    path seq_dir
    val day
    output:
    stdout
    script:
    """
    mkdir -p ${seq_dir}/${day}
    """
}

// Query the data API and write the result into today's folder
process query_api {
    tag "Get data from API"
    input:
    path api
    val day
    val seq_dir
    output:
    file("${day}.data_source.csv")
    script:
    """
    python ${api} > ${seq_dir}/${day}/${day}.data_source.csv
    """
}
Problem: Running this Nextflow script gave the following output:
N E X T F L O W ~ version 22.05.0-edge
Launching `core_bioinfo.nf` [angry_montalcini] DSL2 - revision: a74f4aaa3a
executor > local (3)
[d3/8df354] process > today (Get todays date) [100%] 1 of 1 ✔
[60/029801] process > mk_today (Make todays folder) [100%] 1 of 1 ✔
[d9/846551] process > query_api (Get data from API) [ 0%] 0 of 1
executor > local (3)
[d3/8df354] process > today (Get todays folder) [100%] 1 of 1 ✔
[60/029801] process > mk_today (Make todays folder) [100%] 1 of 1 ✔
[d9/846551] process > query_api (Get data from API) [100%] 1 of 1, failed: 1 ✘
Error executing process > 'query_api (Get data from API)'
Caused by:
Process `query_api (Get data from API)` terminated with an error exit status (1)
Command executed:
python get_data_from_api.py > /home/bkutambe/data/infiles/2022.09.01
/2022.09.01
.data_source.csv
Command exit status: 1
Command output:
(empty)
Command error:
.command.sh: line 2: /home/bkutambe/data/infiles/2022.09.01: Is a directory
Work dir:
/home/bkutambe/Documents/Core_Bioinfo/work/d9/846551a8367ec0c0ce724e83d91edb
Forensics
Behind the scenes, Nextflow generates a script called .command.sh in the work directory, and a look into it revealed something interesting.
#!/bin/bash -ue
python get_covid_cases_from_api.py > /home/bkutambe/data/seqbox/infiles/2022.09.01
/2022.09.01
.sample_source_sample_pcrs.csv
The command to query the data API had been split across three lines, so Bash interpreted and ran them as three separate commands. The first one redirected output to a path that was a directory, which is the "Is a directory" error shown above. This was supposedly the offending code!
Now the question was how to solve this. Enter observation bias. My preliminary assumption was that Nextflow itself was splitting the command into three separate commands. What was causing this remained a mystery. A Google search led to a dead end. I then fired up Slack and asked on both the official Nextflow and the microbial bioinformatics channels for colleagues in the bioinformatics space to give their input.
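The splitting is easy to reproduce outside Nextflow. The sketch below is only an illustration: the date is hard-coded, and the base path /tmp/infiles is a made-up stand-in for the real input directory. It interpolates a value that ends in a newline into the command text, much as the script block interpolates ${day}, and counts the lines of the result. Nothing is executed; only the text is built.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real values. The closing quote sits on the
# next line on purpose, so that day ends with a newline character.
day='2022.09.01
'
seq_dir="/tmp/infiles"   # assumed path, for illustration only

# Build the command text the way the interpolated script block would.
rendered="python get_data_from_api.py > ${seq_dir}/${day}/${day}.data_source.csv"

# Count how many lines the rendered command occupies.
line_count=$(printf '%s\n' "$rendered" | wc -l | tr -d ' ')
echo "rendered command spans ${line_count} lines:"
printf '%s\n' "$rendered"
```

The rendered text comes out as three lines, with the same shape as the .command.sh above, even though the template itself is a single line.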
The arrival of the cavalry
Colleagues in the microbial bioinformatics Slack channel did help. After examining the code, they pointed out that one of my variables ended with '\n', the newline character, which starts a new line after whatever comes before it.
The true offending code:
process today {
    tag "Get todays folder"
    output:
    stdout emit: day
    script:
    """
    echo \$(date +"%Y.%m.%d")
    """
}
Specifically this line:
echo \$(date +"%Y.%m.%d")
The intention was to capture the date in a variable like this:
day="2022.09.01"
BUT this happened:
day="2022.09.01\n"
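echo ends its output with a newline, and the stdout output of the process captured that output as is, newline included. This is the value that was propagated to downstream processes and triggered the runtime error when querying the data API.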
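A trailing newline is invisible when a value is simply printed, so it helps to count bytes or dump the characters. A minimal sketch, using a fixed example date instead of calling date so that the numbers are reproducible:

```shell
#!/bin/bash
# What echo writes: the 10 characters of the date plus a newline.
echo_bytes=$(echo "2022.09.01" | wc -c | tr -d ' ')

# The date on its own, written without a trailing newline.
date_bytes=$(printf '%s' "2022.09.01" | wc -c | tr -d ' ')

echo "echo wrote ${echo_bytes} bytes, the date itself has ${date_bytes}"

# od -c dumps every character, so the newline shows up as \n.
echo "2022.09.01" | od -c
```

The one-byte difference between the two counts is the newline that ended up inside the day value.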
Solution:
It was quite simple. Adding -n to the echo command suppresses the trailing newline, and that worked like a charm!
echo -n \$(date +"%Y.%m.%d")
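For comparison, here is a small sketch of two other ways to get the date without a trailing newline. These are alternatives I am suggesting, not what the pipeline uses, and the fixed example date is an assumption made for reproducibility. Support for echo -n varies between shells, while printf '%s' and tr -d '\n' behave the same everywhere; since .command.sh runs under bash, echo -n is fine in this case.

```shell
#!/bin/bash
# Fixed example date for the first two variants; the last one calls date.
today="2022.09.01"

# Variant 1: printf writes exactly the string it is given, with no newline.
printf_bytes=$(printf '%s' "$today" | wc -c | tr -d ' ')

# Variant 2: let echo add the newline, then delete it with tr.
tr_bytes=$(echo "$today" | tr -d '\n' | wc -c | tr -d ' ')

# Variant 3: pipe date straight into tr, no echo or command substitution.
live_bytes=$(date +"%Y.%m.%d" | tr -d '\n' | wc -c | tr -d ' ')

echo "printf: ${printf_bytes}, tr: ${tr_bytes}, date | tr: ${live_bytes}"
```

All three produce the 10 characters of the date and nothing else. The last variant contains no dollar sign in the command itself, so inside a Nextflow script block it would not need the backslash escape.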
Written by Belson Malcolm Kutambe