Binary formats

Text formats do not scale well to large data sets, so binary formats are worth considering instead. Binary formats are designed to be machine-readable and can offer better input/output performance because they are closer to the machine representation of the data. Many of them also have built-in support for sparse data and compression. Many different formats exist, some proprietary, some restricted to specific research domains. We will discuss some open-source formats that are commonly found in scientific computing.

Pickle

The pickle module serializes Python objects into binary files and restores them later by de-serialization. Only unpickle data that you trust, because loading a pickle can execute arbitrary code. Even though it’s important to know that pickle exists, there are better formats for scientific computing.

import pickle

some_object = {"A": 1, "B": [2, 3]}

with open("object.dump", "wb") as f:
    pickle.dump(some_object, f)

with open("object.dump", "rb") as f:
    imported_object = pickle.load(f)

assert imported_object == some_object
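To see why the warning about untrusted data matters, here is a minimal and deliberately harmless sketch. The class name Payload is hypothetical. It uses the `__reduce__` hook to tell pickle which callable to invoke when the data is loaded. In this sketch that callable is the built-in sorted(), but a malicious file could just as well name a function that runs shell commands.

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form names a callable to run on load."""

    def __reduce__(self):
        # Instructs pickle: "to rebuild this object, call sorted([3, 1, 2])".
        # Any importable callable could be named here instead.
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())

# Loading does not give back a Payload instance. It calls sorted() and
# returns whatever that call produces.
result = pickle.loads(data)
print(type(result).__name__, result)
```

The loaded object is whatever the named callable returns, which is why a pickle file from an unknown source should be treated like an executable.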

Sqlite3

The sqlite3 module provides an interface to SQLite databases. An SQLite database lives in a single file, needs no separate server process, and performs remarkably well. SQLite is typically used as a relational database but also supports JSON data. The query language is a dialect of Structured Query Language (SQL).

See also

If you need to access other SQL databases, have a look at pandas.read_sql(), or at SQLAlchemy for a full-fledged object-relational mapper (ORM).

import sqlite3

# Initialize connection to the SQLite DB
con = sqlite3.connect("measurements.db")

# Let queries return Row objects accessible by name
con.row_factory = sqlite3.Row

# Initialize cursor to execute SQL statements
cur = con.cursor()

# Create table in empty database
cur.execute("""
    CREATE TABLE IF NOT EXISTS
    measurements(device, temperature)
""")

# Prepare insertion of measurement data
measurements = [("dev1", 77.2), ("dev2", 294.5)]
cur.executemany(
    "INSERT INTO measurements VALUES(?, ?)",
    measurements,
)

# Commit transaction to write data to DB
con.commit()

# Retrieve all measurements from the DB
for row in cur.execute("SELECT * FROM measurements"):
    print(f"{row['device']} -> {row['temperature']}")

# Close connection to DB
con.close()
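The JSON support mentioned above can be sketched as follows. This is an illustration under assumptions, not part of the example above: the table name readings and its payload column are made up, the database is kept in memory, and the SQLite library bundled with your Python must include the JSON functions (they are built in by default since SQLite 3.38).

```python
import json
import sqlite3

# In-memory database, so no file is left behind
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical table that stores each measurement as a JSON document
cur.execute("CREATE TABLE readings(payload TEXT)")

records = [
    {"device": "dev1", "temperature": 77.2},
    {"device": "dev2", "temperature": 294.5},
]
cur.executemany(
    "INSERT INTO readings VALUES(?)",
    [(json.dumps(record),) for record in records],
)
con.commit()

# json_extract() pulls individual fields out of the stored documents
rows = cur.execute(
    """
    SELECT json_extract(payload, '$.device'),
           json_extract(payload, '$.temperature')
    FROM readings
    WHERE json_extract(payload, '$.temperature') > ?
    """,
    (100,),
).fetchall()

for device, temperature in rows:
    print(f"{device} -> {temperature}")

con.close()
```

Storing JSON as text keeps the schema flexible, at the cost of parsing the document in every query that touches it.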

NumPy

NumPy has its own binary serialization format. A single array is stored in a .npy file using np.save(), while multiple arrays are combined into one .npz file with np.savez(). Both file types are read back in with np.load().
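A minimal round trip might look like the sketch below. The array contents and file names are made up for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Single array -> .npy file
    np.save(tmp / "a.npy", a)
    a_loaded = np.load(tmp / "a.npy")

    # Several arrays -> one .npz archive; keyword names become the keys
    np.savez(tmp / "arrays.npz", a=a, b=b)
    with np.load(tmp / "arrays.npz") as archive:
        names = sorted(archive.files)
        b_loaded = archive["b"]

print(names)
print(np.array_equal(a, a_loaded), np.array_equal(b, b_loaded))
```

Loading a .npz file returns a dictionary-like archive that reads each array on access, so it is opened in a with block here to make sure the file is closed again.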

Excel

If you are handed an Excel .xlsx spreadsheet with lab data, consider using Python for the further data analysis instead of the proprietary Office software.

Pandas provides read_excel() and to_excel() as a powerful interface to various spreadsheet file formats. Under the hood, it typically uses the openpyxl engine to read and write .xlsx files natively from Python.
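As a rough sketch, a round trip through an .xlsx file could look like this. It assumes that pandas and openpyxl are installed. The sheet name "lab", the file name, and the sample values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {"device": ["dev1", "dev2"], "temperature": [77.2, 294.5]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "measurements.xlsx"

    # index=False keeps the row index out of the spreadsheet
    df.to_excel(path, sheet_name="lab", index=False)

    # Read the same sheet back into a new DataFrame
    df_back = pd.read_excel(path, sheet_name="lab")

print(df_back)
```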

HDF

The Hierarchical Data Format (HDF) is designed to store and organize large amounts of scientific data and aims to provide fast I/O. The current version is HDF5, and libraries exist for most widely used programming languages. In addition, the Java-based HDFView software offers a GUI to visually navigate and modify the data.

Once again, pandas offers the convenient helpers read_hdf() and to_hdf(), which build on the lower-level PyTables library. Alternatively, there is the h5py Python package.
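The hierarchical layout is perhaps easiest to see with h5py. The sketch below assumes that h5py and NumPy are installed. The group name run1, the dataset name, and the devices attribute are illustrative choices, not a prescribed schema.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

temperatures = np.array([77.2, 294.5])

with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "measurements.h5")

    with h5py.File(path, "w") as f:
        # Groups behave like folders, datasets like array-valued files
        run = f.create_group("run1")
        dset = run.create_dataset(
            "temperature", data=temperatures, compression="gzip"
        )
        # Attributes attach small pieces of metadata to a dataset
        dset.attrs["devices"] = "dev1,dev2"

    with h5py.File(path, "r") as f:
        dset = f["run1/temperature"]
        loaded = dset[:]  # read the whole dataset into memory
        devices = dset.attrs["devices"]

print(devices, loaded)
```

Datasets are addressed by path-like keys and only read from disk when sliced, which is what makes the format practical for data that does not fit into memory.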