Concurrency#

The ability to execute multiple tasks in parallel is often crucial in scientific computing, as it can significantly reduce the overall runtime of many algorithms. Scientific libraries like numpy do a lot of the heavy lifting under the hood and let the user run their computations on multiple cores without having to worry about the implementation details. Nevertheless, it can be instructive to know how to write simple multi-threading or multi-processing code directly in Python. Besides the asyncio approach, the Standard Library also contains concurrent.futures, which is probably easier to get started with.

HTTP requests#

Let’s consider the scenario where you need to download multiple internet pages and parse their contents. This could be any HTTP REST API where you need to collect some data, or, as in our case, simply the first few lines of a Wikipedia article.

import requests


def wiki_extract_for_title(wiki_title):
    """Query the Wikipedia API and return the first 5 sentences of the text with given title."""

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php"
        "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
        + wiki_title
    )
    # Convert the JSON into a Python dict
    resp_dict = resp.json()
    # Get the extract text from the Python dict
    page_id = list(resp_dict["query"]["pages"].keys())[0]
    extract = resp_dict["query"]["pages"][page_id]["extract"]

    return extract


wiki_titles = ["Albert_Einstein", "James_Clerk_Maxwell"]
wiki_extracts = []

for wiki_title in wiki_titles:
    wiki_extracts.append(wiki_extract_for_title(wiki_title))

print(wiki_extracts)

Note

We use the requests package instead of the built-in urllib, as it is a very popular HTTP library with a more concise syntax. Another alternative would be httpx.
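
For comparison, here is roughly what the same request looks like with the built-in urllib (a minimal sketch for a single, hard-coded title; error handling omitted):

import json
import urllib.request

url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
    + "Albert_Einstein"
)
# urlopen() returns a file-like response object that json.load() can read directly
with urllib.request.urlopen(url) as resp:
    resp_dict = json.load(resp)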

Now assume you need to make hundreds or thousands of such HTTP requests. With sequential execution as above, the processor spends most of its time idle, waiting for the web server's response. This is a perfect use case where concurrency provides a considerable speedup.
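
To see this for yourself, you can time the sequential loop, for example with time.perf_counter (a small sketch reusing wiki_extract_for_title and wiki_titles from above; the actual numbers depend on your network connection and the number of titles):

import time

start = time.perf_counter()
# One request after the other: each iteration waits for the previous response
for wiki_title in wiki_titles:
    wiki_extract_for_title(wiki_title)
print(f"Sequential requests took {time.perf_counter() - start:.2f} s")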

concurrent.futures#

The concurrent.futures module provides a high-level interface for concurrency via multiple threads or processes. In a nutshell, you define a pool of workers (either a ThreadPoolExecutor for threads or a ProcessPoolExecutor for processes) and submit your tasks to its queue. Each submitted task returns an instance of concurrent.futures.Future, which is conceptually similar to a promise in other languages: it does not immediately contain the result of your queued task, but promises to have it at some point in the future.
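
The following small sketch (using a hypothetical slow_square helper) illustrates this behaviour: submit() returns the Future immediately, while result() blocks until the task has actually finished.

import concurrent.futures
import time


def slow_square(x):
    """Simulate a slow task by sleeping for a second."""
    time.sleep(1)
    return x * x


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(slow_square, 3)
    print(future.done())    # most likely False, the task is still running
    print(future.result())  # blocks until the task finishes, then prints 9
    print(future.done())    # now True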

import requests
import concurrent.futures


def wiki_extract_for_title(wiki_title):
    """Query the Wikipedia API and return the first 5 sentences of the text with given title."""

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php"
        "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
        + wiki_title
    )
    # Convert the JSON into a Python dict
    resp_dict = resp.json()
    # Get the extract text from the Python dict
    page_id = list(resp_dict["query"]["pages"].keys())[0]
    extract = resp_dict["query"]["pages"][page_id]["extract"]

    return extract


wiki_titles = ["Albert_Einstein", "James_Clerk_Maxwell"]
wiki_extracts = []

# Use a pool of 10 threads for concurrent execution
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    queued_futures = []
    # Submit tasks to the worker queue
    for wiki_title in wiki_titles:
        future = executor.submit(wiki_extract_for_title, wiki_title)
        queued_futures.append(future)
    # Gather the results as the tasks complete (not necessarily in the order of wiki_titles)
    for future in concurrent.futures.as_completed(queued_futures):
        wiki_extracts.append(future.result())

print(wiki_extracts)
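
If you do not need to handle the individual Future objects yourself, Executor.map offers a more compact alternative; unlike as_completed, it also returns the results in the same order as the inputs (a sketch under the same setup as above):

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # map() yields the results in the order of wiki_titles
    wiki_extracts = list(executor.map(wiki_extract_for_title, wiki_titles))

print(wiki_extracts)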