Zonca, Andrea; Weakley, Le Mai; Standish, Matthew
Dask is an open-source Python library for parallel computing that is widely used by data scientists. It can scale Python code from multi-core local machines to large distributed clusters in the cloud. This tutorial covers how to use Dask to provide distributed computing capabilities in Python to Jupyter Notebook users on a JupyterHub instance deployed on top of Kubernetes on Jetstream2. We will explain how Dask works and show how easy it is to process data in parallel using its high-level API. We will also rely on the Jetstream2 object store to save and load data in parallel using the cloud-native file format Zarr. During the tutorial we will also detail how the whole infrastructure is deployed and where each service runs within it. Pointers to tutorials on how to deploy each component to Jetstream2 will also be provided.