Concurrency#
The ability to execute multiple tasks in parallel is often crucial in scientific computing, as it can significantly improve the overall runtime of many algorithms. Scientific libraries like numpy do a lot of the heavy lifting under the hood and let the user run computations on multiple cores without worrying about the implementation details. Nevertheless, it can be instructive to know how to write simple multi-threading or multi-processing code directly in Python. Besides the asyncio approach, the Standard Library also contains concurrent.futures, which is probably easier to get started with.
HTTP requests#
Let’s consider the scenario where you need to download multiple internet pages and parse their contents. This could be any HTTP REST API where you need to collect some data, or, as in our case, simply the first few lines of a Wikipedia article.
import requests
def wiki_extract_for_title(wiki_title):
    """Query the Wikipedia API and return the first 5 sentences of the article with the given title."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php"
        "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
        + wiki_title
    )
    # Convert the JSON into a Python dict
    resp_dict = resp.json()
    # Get the extract text from the Python dict
    page_id = list(resp_dict["query"]["pages"].keys())[0]
    extract = resp_dict["query"]["pages"][page_id]["extract"]
    return extract
wiki_titles = ["Albert_Einstein", "James_Clerk_Maxwell"]
wiki_extracts = []
for wiki_title in wiki_titles:
    wiki_extracts.append(wiki_extract_for_title(wiki_title))
print(wiki_extracts)
Note
We use the requests package instead of the built-in urllib, as it is a very popular HTTP library with a more concise syntax. Another alternative would be httpx.
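For comparison, here is a rough sketch of the same request with the built-in urllib (assuming the same wiki_title variable as above; error handling omitted):

from urllib.request import urlopen
import json

url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
    + wiki_title
)
# urlopen returns a raw HTTP response; decoding and JSON parsing are manual steps
with urlopen(url) as resp:
    resp_dict = json.loads(resp.read().decode("utf-8"))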
Now assume you need to make hundreds or thousands of such HTTP requests. With sequential execution as above, the processor spends most of its time idle, waiting for the web server's response. This is a perfect use case where concurrency provides a considerable speedup.
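To see how much time is spent waiting, you can wrap the sequential loop in a simple timer (a sketch, reusing wiki_extract_for_title and wiki_titles from above):

import time

start = time.perf_counter()
wiki_extracts = [wiki_extract_for_title(title) for title in wiki_titles]
print(f"Sequential download took {time.perf_counter() - start:.2f} s")

The runtime grows roughly linearly with the number of titles, even though the CPU is mostly idle during each request.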
concurrent.futures#
The concurrent.futures module provides a high-level interface for concurrency via multiple threads or processes. In a nutshell, you define a pool of workers (either ThreadPoolExecutor for threads, or ProcessPoolExecutor for processes) and submit your tasks to its queue. The returned object is an instance of concurrent.futures.Future, which is conceptually similar to a promise: it does not immediately contain the result of your queued task, but promises to have it in the future.
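As a minimal illustration of these Future mechanics, here is a toy example independent of the Wikipedia code:

import concurrent.futures
import time

def slow_square(x):
    time.sleep(1)  # simulate a slow task
    return x * x

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(slow_square, 3)
    print(future.done())    # most likely False: the task is still running
    print(future.result())  # blocks until the task finishes, then prints 9
    print(future.done())    # True: the result is now available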
import requests
import concurrent.futures
def wiki_extract_for_title(wiki_title):
    """Query the Wikipedia API and return the first 5 sentences of the article with the given title."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php"
        "?action=query&prop=extracts&exsentences=5&explaintext=1&format=json&titles="
        + wiki_title
    )
    # Convert the JSON into a Python dict
    resp_dict = resp.json()
    # Get the extract text from the Python dict
    page_id = list(resp_dict["query"]["pages"].keys())[0]
    extract = resp_dict["query"]["pages"][page_id]["extract"]
    return extract
wiki_titles = ["Albert_Einstein", "James_Clerk_Maxwell"]
wiki_extracts = []
# Use a pool of 10 threads for concurrent execution
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    queued_futures = []
    # Submit tasks to the worker queue
    for wiki_title in wiki_titles:
        future = executor.submit(wiki_extract_for_title, wiki_title)
        queued_futures.append(future)
    # Wait for all tasks to complete and gather the results
    for future in concurrent.futures.as_completed(queued_futures):
        wiki_extracts.append(future.result())
print(wiki_extracts)
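Note that as_completed yields the futures in the order in which they finish, so wiki_extracts may not match the order of wiki_titles. If you want the results in input order and do not need per-future control, executor.map is a more compact alternative (a sketch using the same function and titles as above):

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # map preserves the order of wiki_titles and blocks until all results are in
    wiki_extracts = list(executor.map(wiki_extract_for_title, wiki_titles))

For CPU-bound work, ProcessPoolExecutor offers the same interface but sidesteps the global interpreter lock by distributing the tasks over separate processes.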