Text formats#

A first choice, when printing results to the console is no longer sufficient, is to export them to some variant of text files. The advantage of text formats is that they are human-readable, non-proprietary, cross-platform and future-proof. We will discuss how Python can interact with some common text formats.

CSV#

The canonical choice is to dump all numbers with a chosen separator between them, resulting in either Comma-Separated Values (CSV) or Tab-Separated Values (TSV). This format works best for flat, tabular data structures. It’s best practice to include a header with the column names as the first line.

A,B,C
1,2,3

Standard Library#

The csv module provides reader and writer objects for CSV files.

Tip

The example below uses an io.StringIO object as a proxy for a real CSV file. In a nutshell, this is a string-like object that behaves like a file, but lives in memory and not on the filesystem. This allows our snippets to contain the data directly inside the code, thereby making the examples self-contained and hopefully easier to grasp. Feel free to adapt the with csv_file as f: contexts to with open("file.csv") as f: in order to work with actual files.

import csv
import io

csv_file = io.StringIO(
    "A,B,C\n"
    "1,2,3\n"
    "a,\"b,b\nb\",c\n"
)

with csv_file as f:
    csv_reader = csv.reader(f)
    # Save the first row with the headers
    csv_header = next(csv_reader)
    # Save all other rows as a nested list
    data = list(csv_reader)

assert data == [["1", "2", "3"], ["a", "b,b\nb", "c"]]

# Export again as CSV (same rows as csv_file); newline="" lets the csv module control line endings
with open("file.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(csv_header)
    csv_writer.writerows(data)

Note how the second element of the last row is quoted in the CSV file, in order to support embedded newlines and commas.
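
The same reader and writer objects can also handle Tab-Separated Values when given a different delimiter. The following is a minimal sketch of that idea; the sample data and the variable names are made up for illustration, and an in-memory buffer again stands in for a real file.

```python
import csv
import io

tsv_file = io.StringIO(
    "A\tB\tC\n"
    "1\t2\t3\n"
)

with tsv_file as f:
    # delimiter="\t" switches the reader from commas to tabs
    tsv_rows = list(csv.reader(f, delimiter="\t"))

# Write the rows back as TSV into another in-memory buffer
tsv_out = io.StringIO()
tsv_writer = csv.writer(tsv_out, delimiter="\t", lineterminator="\n")
tsv_writer.writerows(tsv_rows)
tsv_text = tsv_out.getvalue()
```

The lineterminator argument is set here only to keep the output identical to the input; by default the writer ends lines with "\r\n".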

There are also dedicated DictReader and DictWriter classes to work with rows as dictionaries instead of lists.

import csv

data = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 4, "B": 5, "C": 6},
]

with open("file_dict.csv", "w", newline="") as f:
    # Use the dict keys of the first row as field names
    csv_writer = csv.DictWriter(f, fieldnames=data[0].keys())
    # Write header with field names and the data
    csv_writer.writeheader()
    csv_writer.writerows(data)
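
Reading works analogously with DictReader. The sketch below assumes a small in-memory CSV file with made-up contents. Note that every value arrives as a string, so the conversion to integers is an explicit extra step.

```python
import csv
import io

csv_file = io.StringIO(
    "A,B,C\n"
    "1,2,3\n"
    "4,5,6\n"
)

with csv_file as f:
    # By default the first line provides the field names
    rows = list(csv.DictReader(f))

# All values are read as strings; convert them explicitly if needed
numbers = [{key: int(value) for key, value in row.items()} for row in rows]
```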

NumPy#

When working with scientific libraries, it’s recommended to use their helper methods to import/export text data, instead of the csv standard library module. The numpy library offers np.loadtxt() as a faster reader for simple text files. In addition there’s np.genfromtxt() with more sophisticated handling of missing values. Conversely np.savetxt() can export data to a text file.

import numpy as np
import io

csv_file = io.StringIO(
    "value_x,value_y\n"
    "2.0,3.0\n"
    "2.2,3.3\n"
)

data = np.loadtxt(
    csv_file,       # csv file to import as ndarray
    delimiter=",",  # use comma as separator
    skiprows=1,     # skip the one line with the header
)
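
To illustrate the other two functions, here is a hedged sketch that reads a file with a missing value via np.genfromtxt() and exports the result again with np.savetxt(). The sample data and the chosen number format are assumptions for this example.

```python
import io

import numpy as np

csv_file = io.StringIO(
    "value_x,value_y\n"
    "2.0,3.0\n"
    "2.2,\n"  # the second value is missing
)

data = np.genfromtxt(
    csv_file,
    delimiter=",",   # use comma as separator
    skip_header=1,   # skip the one line with the header
)
# Missing float values are filled with nan by default

out = io.StringIO()
np.savetxt(
    out,
    data,
    delimiter=",",
    fmt="%.1f",                # one decimal place per value
    header="value_x,value_y",  # first line of the output
    comments="",               # do not prefix the header with "# "
)
text = out.getvalue()
```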

Pandas#

The data analysis library pandas offers read_csv() and to_csv() to import/export CSV files. These can offer better performance and more configuration options than the numpy equivalents. They can also readily parse datetime values.

import pandas as pd
import io

csv_file = io.StringIO(
    "date , value_x, value_y\n"
    "1-2-2023, 2.0, 3.0\n"
    "2-2-2023, 2.2, 3.3\n"
)

df = pd.read_csv(
    csv_file,                           # csv file to import as dataframe
    header=0,                           # treat the first csv line as header
    names=["datecol", "col1", "col2"],  # override header and assign custom names
    skipinitialspace=True,              # ignore space after comma in csv
    engine="c",                         # use faster C engine for parsing
    parse_dates=["datecol"],            # parse given column as date
    dayfirst=True,                      # dates start with day (eg DD-MM-YYYY)
    compression=None,                   # no decompression (e.g. "zip" or "gzip" for compressed files)
)
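
The export direction could look as follows. This is a minimal sketch with a hand-built dataframe; the column names mirror the ones used above, and the date format is an assumption chosen to match the DD-MM-YYYY input.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "datecol": pd.to_datetime(["2023-02-01", "2023-02-02"]),
        "col1": [2.0, 2.2],
        "col2": [3.0, 3.3],
    }
)

csv_out = io.StringIO()
df.to_csv(
    csv_out,                 # file path or file-like object to write to
    index=False,             # do not write the row index as an extra column
    date_format="%d-%m-%Y",  # write dates as DD-MM-YYYY
)
csv_lines = csv_out.getvalue().splitlines()
```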

JSON#

The JavaScript Object Notation (JSON) can encode more complicated and nested data structures. Its notation is very close to the basic data types in Python. Moreover, it has become the standard data exchange format of most HTTP and REST APIs, for instance for submitting measurement data to the InfluxDB time series database.

{
    "firstname": "John",
    "surname": "Doe",
    "age": 42,
    "married": false,
    "children": null,
    "emails": [
        {
            "type": "private",
            "email": "johndoe@example.com"
        }
    ]
}

The following Python snippet uses the built-in json module to import the above JSON, modify it, and export the changes.

import json
from pprint import pprint

with open("johndoe.json") as f:
    johndoe = json.load(f)

pprint(johndoe)
#  {'age': 42,
#   'children': None,
#   'emails': [{'email': 'johndoe@example.com', 'type': 'private'}],
#   'firstname': 'John',
#   'married': False,
#   'surname': 'Doe'}

johndoe["married"] = True
johndoe["spouse"] = "Jane Doe"

with open("johndoe.json", "w") as f:
    json.dump(johndoe, f, sort_keys=False, indent=4)

Note that not all data types can be serialized to JSON directly. In particular, when naively extending the above example with a date, json.dump fails with TypeError: Object of type date is not JSON serializable. A simple fix is to pass the default=str argument, so that str() is applied to otherwise non-serializable objects, yielding the “YYYY-MM-DD” string representation of the date.

import json
from datetime import date

johndoe = {}
johndoe["weddingdate"] = date.today()

with open("johndoe.json", "w") as f:
    json.dump(johndoe, f, sort_keys=False, indent=4, default=str)
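
Keep in mind that this conversion is one-way: JSON has no date type, so the value comes back as a plain string when loading. The sketch below shows one possible round trip using json.dumps() and json.loads() on strings instead of files; the fixed example date is an assumption for illustration.

```python
import json
from datetime import date

johndoe = {"weddingdate": date(2023, 2, 1)}

# default=str turns the date into its "YYYY-MM-DD" string representation
encoded = json.dumps(johndoe, default=str)

decoded = json.loads(encoded)
# The loaded value is a string, so convert it back to a date explicitly
restored = date.fromisoformat(decoded["weddingdate"])
```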

XML#

The Extensible Markup Language (XML) is a close relative to the well-known HTML of the world wide web.

<?xml version="1.0" encoding="utf-8"?>
<data>
 <row>
   <shape>square</shape>
   <sides>4</sides>
 </row>
 <row>
   <shape>triangle</shape>
   <sides>3</sides>
 </row>
 <row>
   <shape>circle</shape>
   <sides/>
 </row>
</data>

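Different XML libraries exist in the Python world, for example the xml.etree.ElementTree module of the standard library and the third-party lxml package.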
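
As a minimal sketch, the sample document above could be read with xml.etree.ElementTree from the standard library. The XML declaration is left out of the string for brevity, and mapping the empty sides element to None is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
 <row>
   <shape>square</shape>
   <sides>4</sides>
 </row>
 <row>
   <shape>triangle</shape>
   <sides>3</sides>
 </row>
 <row>
   <shape>circle</shape>
   <sides/>
 </row>
</data>"""

root = ET.fromstring(xml_text)

shapes = {}
for row in root.findall("row"):
    name = row.findtext("shape")
    sides = row.findtext("sides")  # no text content for the empty <sides/> element
    # Element text is always a string; convert it and map empty values to None
    shapes[name] = int(sides) if sides else None
```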

Others#

  • The Python Standard Library includes the configparser module to read/write Initialization (INI) config files.

  • Since Python 3.11, the standard library includes tomllib to parse Tom’s Obvious Minimal Language (TOML) files. It only reads TOML and cannot write it.

  • The third-party ruamel.yaml provides a powerful interface to YAML Ain’t Markup Language (YAML) files.
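
For the INI case, a short sketch with configparser could look like this. The section and option names are hypothetical, and the configuration is kept in memory instead of on disk.

```python
import configparser
import io

ini_text = """
[measurement]
device = sensor1
interval = 10
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# Options are stored as strings; getint() converts on access
device = config["measurement"]["device"]
interval = config["measurement"].getint("interval")

# Modify a value and export the configuration again
config["measurement"]["interval"] = "20"
ini_out = io.StringIO()
config.write(ini_out)
ini_result = ini_out.getvalue()
```

To work with an actual file, config.read("file.ini") and config.write(f) inside a with open("file.ini", "w") as f: context can be used instead.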