Task runner#
Having stressed the importance of automation, we now turn to task runners. Their goal is to run a set of pre-defined commands in the correct order. The archetypal use case is compiling software, where dependencies dictate the order in which the libraries must be built before they are all linked into the final executable. Similar dependencies also arise in data science pipelines, where the input data must first be cleaned, transformed and merged before the actual analysis can take place. Given the dependency tree, the task runner can watch for changes to the input files and execute only the steps needed to regenerate the output files. Not having to run the whole pipeline each time saves time, especially when long-running computations are re-run only when absolutely needed.
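To illustrate how an execution order follows from a dependency tree, here is a minimal sketch in Python using the standard library's graphlib module (Python 3.9 or later). The step names in the pipeline dictionary are hypothetical and only mirror the data science example above. Real task runners do considerably more than this.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "clean": {"raw_data"},
    "transform": {"clean"},
    "merge": {"transform", "reference_table"},
    "analysis": {"merge"},
}


def execution_order(dependencies):
    """Return all steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(pipeline)
print(order)
```

Every step appears in the resulting list after all of the steps it depends on. The relative order of independent steps (here raw_data and reference_table) is not fixed by the dependencies.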
Make#
The GNU make software (and its derivatives) is probably the most widely used build automation tool in Unix software development.
When running make, it searches the current directory for a file named Makefile, which contains the rules to build the individual targets and describes how they depend on each other. Schematically:
<target>: <prerequisites>
<TAB><commands to build target>
where it’s important to use a TAB for indentation and not spaces.
Consider the following example Makefile, which combines several targets.
# Declaration of variables
INPUT_FILES := data1 data2
INTERMEDIATE_FILE := merged_data
# Targets that do not correspond to file names
.PHONY: all clean
# First target is the default goal when calling `make`
all: $(INPUT_FILES) $(INTERMEDIATE_FILE)
clean:
	rm -f $(INPUT_FILES) $(INTERMEDIATE_FILE)
data1:
	echo "1" > data1
data2:
	echo "2" > data2
# The intermediate file depends on the input files
$(INTERMEDIATE_FILE): $(INPUT_FILES)
	cat data1 data2 > $(INTERMEDIATE_FILE)
It’s sufficient to call make (or, equivalently, make all) to generate the input and intermediate files in the correct order.
$ make
echo "1" > data1
echo "2" > data2
cat data1 data2 > merged_data
When calling make again, nothing will be done, as all targets already exist and are newer than their prerequisites. However, after modifying one of the input files, the intermediate file will be regenerated, as make notices from the file modification times that it’s out of date.
$ make
make: Nothing to be done for `all'.
$ echo "3" > data2
$ make
cat data1 data2 > merged_data
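The decision behind this behaviour can be sketched in a few lines of Python. This is a simplified model of the timestamp comparison, not make's actual implementation. The function name is_out_of_date is hypothetical, and the sketch ignores phony targets and chained rules.

```python
import os


def is_out_of_date(target, prerequisites):
    """Return True if target must be rebuilt, using a make-like timestamp rule."""
    # A missing target always has to be built.
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild if any prerequisite was modified more recently than the target.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

With the files from the example, is_out_of_date("merged_data", ["data1", "data2"]) would be expected to return True right after data2 is rewritten, and False again once merged_data has been regenerated.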
Finally, make clean will delete all generated files.
$ make clean
rm -f data1 data2 merged_data
See also
There are many online resources and tutorials available to help you dive deeper into Makefiles, for instance the official make manual or the makefile tutorial.
Alternatives#
There are many alternatives to Make, depending on personal taste, the programming language or surrounding community. We only mention a few.
snakemake#
The snakemake workflow management system is written in Python and targeted at data science pipelines.
doit#
Another task runner implemented in Python is doit.
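As a rough illustration, the merge step of the Makefile example might look as follows in a dodo.py file, the file doit reads by default. This sketch assumes that data1 and data2 already exist. The task name merge is hypothetical, while actions, file_dep and targets are task keys from the doit documentation.

```python
# dodo.py: a sketch of the merge step from the Makefile example as a doit task

INPUT_FILES = ["data1", "data2"]
INTERMEDIATE_FILE = "merged_data"


def task_merge():
    """Concatenate the input files into the intermediate file."""
    return {
        # A string action is run as a shell command.
        "actions": [f"cat {' '.join(INPUT_FILES)} > {INTERMEDIATE_FILE}"],
        # Files this task reads; changes to them trigger a re-run.
        "file_dep": INPUT_FILES,
        # Files this task produces.
        "targets": [INTERMEDIATE_FILE],
    }
```

Running doit in the directory containing this file should execute the task, and skip it on later runs as long as the input files are unchanged.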
task#
Task is implemented in Go and parses a Taskfile.yml file.