Reading and writing files
Scientific computing relies on reading input parameters and exporting computed results, but several things must be kept in mind when doing file input/output. So let’s start by looking at some general aspects of reading and writing files.
First off is performance. Disk I/O is orders of magnitude slower than working in memory. One distinguishes between the latency of accessing a given file and the throughput of the actual data transfer. In HPC it is commonly recommended to collect data in memory and write it to disk in bulk, instead of appending small amounts of data at each computational step. In addition, avoid writing millions of small files: file system access is often much slower for many small files than for a small number of larger ones.
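To illustrate the bundling advice, here is a minimal sketch that contrasts reopening a file at every step with collecting the lines in memory and writing them once. The function names, file names and the placeholder data are made up for this example; the actual speed difference depends on your file system and is not measured here.

```python
import tempfile
from pathlib import Path


def write_per_step(path, values):
    """Open, append to and close the file once per value (the slow pattern)."""
    for value in values:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(f"{value}\n")


def write_in_bulk(path, values):
    """Collect all lines in memory, then write them while the file is open once."""
    lines = [f"{value}\n" for value in values]
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(lines)


# Placeholder data standing in for results of a computation.
results = [step * 0.5 for step in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    per_step_file = Path(tmp) / "per_step.txt"
    bulk_file = Path(tmp) / "bulk.txt"
    write_per_step(per_step_file, results)
    write_in_bulk(bulk_file, results)
    # Both approaches produce the same file content.
    same_content = (
        per_step_file.read_text(encoding="utf-8")
        == bulk_file.read_text(encoding="utf-8")
    )
```

Both functions produce identical files; the bulk variant simply touches the file system once instead of a thousand times, which matters most on network or parallel file systems.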
Next is portability. When collaborating across operating systems, additional care must be taken to make the code cross-platform compatible. Many aspects may depend on the platform: text encoding, line endings, path separators. Even simple cases, such as accessing data from a particular directory, may need some thought. In most cases you should prefer relative paths (with respect to the code location). However, when accessing a network share, use the well-defined absolute path instead. Whenever possible, use code that abstracts away platform differences. Otherwise, try to keep as many parameters as possible customizable, so that others can adapt them to their needs.
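As a small sketch of abstracting away path separators, the standard library module pathlib can join path components without hard-coding "/" or "\". The helper data_file and the file name input.txt are hypothetical and only serve as an illustration.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def data_file(base_dir, *parts):
    """Join path components without hard-coding a separator."""
    return Path(base_dir).joinpath(*parts)


# In a script, the code location is typically Path(__file__).resolve().parent.
# Path.cwd() is used here so that the sketch also runs in a notebook.
base_dir = Path.cwd()
input_path = data_file(base_dir, "data", "input.txt")  # hypothetical file name

# The pure path classes show how the same components render on each platform.
posix_style = str(PurePosixPath("data", "input.txt"))      # data/input.txt
windows_style = str(PureWindowsPath("data", "input.txt"))  # data\input.txt
```

Keeping base_dir as a parameter, rather than a constant buried in the code, also leaves it customizable for collaborators with a different directory layout.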
Last but not least comes error handling. Writing data to a file may fail, for instance if the connection to the network share is unstable or the parent folder has unexpected permissions. Especially in long-running jobs, try to react to such errors, for example by retrying after some time, instead of losing your results.
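One possible way to react to a failed write is a simple retry loop, sketched below. The function write_with_retry, its default values and the file name are assumptions made for this example, not a recommended policy; a real job might also log each failure or fall back to a local directory.

```python
import tempfile
import time
from pathlib import Path


def write_with_retry(path, text, attempts=3, delay=1.0):
    """Write text to path, retrying when an OSError occurs.

    Returns the number of attempts that were needed. If every attempt
    fails, the last error is re-raised so it does not go unnoticed.
    """
    for attempt in range(1, attempts + 1):
        try:
            Path(path).write_text(text, encoding="utf-8")
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # wait before the next attempt


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.txt"  # hypothetical output file
    attempts_used = write_with_retry(target, "energy = 1.0\n", delay=0.0)
    written = target.read_text(encoding="utf-8")
```

Catching OSError covers the failure modes mentioned above, since both PermissionError and FileNotFoundError are subclasses of it.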
Having mentioned some of the common obstacles, we now turn back to Python. We start by discussing text encoding, then explain how to interact with the file system (via os, shutil and pathlib) and finally how to handle files.