13. dask: Parallel Computations for Large Datasets#
Some people think that large datasets, or "big data", belong mainly to machine learning and artificial intelligence, but they matter just as much in atmospheric science.
Big data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them.
Because modern reanalysis data and model outputs span long periods at fine grid resolutions, processing them carelessly can easily overload your system. You may see the following error message:
MemoryError: Unable to allocate 52.2 GiB for an array with shape (365, 37, 721, 1440) and data type float32
This error message appears because the data size has exceeded the RAM capacity. How should we avoid this situation?
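Before reaching for new tools, it helps to estimate how much memory an array will need. A quick back-of-the-envelope check, using the shape and dtype from the error message above, without allocating anything:
import numpy as np
# Estimate the memory footprint of the array in the error message.
shape = (365, 37, 721, 1440)                 # (time, level, lat, lon)
itemsize = np.dtype("float32").itemsize      # 4 bytes per element
nbytes = np.prod(shape, dtype="int64") * itemsize
print(f"{nbytes / 2**30:.1f} GiB")           # ~52.2 GiB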
Dask Arrays#
dask
is a flexible library for parallel computing in Python. It can scale up to operate on large datasets and perform computations that cannot fit into memory. dask achieves this by breaking down large computations into smaller tasks, which are then executed in parallel, so only a small part of the data has to be held in RAM at any time.
To understand how dask works, we first demonstrate with a 1000 × 4000 array.
1. Numpy Array:
import numpy as np
shape = (1000, 4000)
ones_np = np.ones(shape)
ones_np
array([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]])
2. Dask Array:
import dask.array as da
ones = da.ones(shape)
ones
(Dask array repr showing the array's shape, dtype, and default chunk layout)
Dask divides the entire array into sub-arrays called "chunks". In dask, we can specify the size of each chunk.
chunk_shape = (1000, 1000)
ones = da.ones(shape, chunks=chunk_shape)
ones
(Dask array repr: the 1000 × 4000 array is now split into four 1000 × 1000 chunks)
We can do some arithmetic calculations, such as multiplication and averaging.
ones_mean = (ones * ones[::-1, ::-1]).mean()
ones_mean
(Dask array repr: a lazy 0-d result)
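At this point ones_mean is still lazy: Dask has only recorded the tasks needed to produce it. Calling .compute() executes those tasks and returns the actual number (a minimal check):
# Trigger the chunked computation and collect the result.
result = ones_mean.compute()
print(result)   # 1.0, since every element of ones * ones[::-1, ::-1] is 1
# Optionally, ones_mean.visualize() draws the task graph (requires graphviz).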
The calculation proceeds as follows:
Dask computes each chunk on a separate core and then combines the per-chunk results into the final answer. Dask also integrates with commonly used numpy and xarray functions, which makes it well suited to processing climate data. So how does dask help with large datasets? In the following sections, we demonstrate two types of workflow that use dask to improve computational efficiency.
Dask Environment Setup#
We add the following code before proceeding to the main computation.
from dask import delayed, compute
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(threads_per_worker=1, memory_limit=0)
client = Client(cluster)
client
Client: Client-bb3778a9-805e-11f0-afc5-e43d1aa7f14b
Connection method: Cluster object | Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
LocalCluster: 64 workers, 1 thread per worker (64 threads total), memory limit 0 B (unrestricted), using processes
(The full repr also lists each of the 64 workers' addresses, dashboard ports, nannies, and local scratch directories.)
LocalCluster: starts multiple workers on the local machine for parallel computation.
Client: connects to the scheduler and handles communication between your Python session and the cluster. Every computation job is submitted to the workers through the client.
cluster = LocalCluster(threads_per_worker=1, memory_limit=0): creates a local cluster where each worker runs on a single thread. Here the memory limit is set to 0 (no restriction). Be cautious: if the dataset is too large or the workflow is poorly designed, this may overload the available RAM. In practice, you can set the memory limit explicitly (e.g., "4GB", "8GB", "16GB") to protect your system.
client = Client(cluster): initializes a client object that communicates with the cluster. This ensures all Dask operations (e.g., loading large NetCDF/GRIB files with xarray, or delayed computations) are distributed to the workers via the scheduler.
There is also a link to the Client Dashboard. You can click to monitor your LocalCluster in real time.
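If your machine has limited RAM, you may prefer a smaller cluster with an explicit per-worker memory limit. A conservative sketch; the worker count and limit below are placeholders, so adjust them to your hardware:
from dask.distributed import Client, LocalCluster
# Four single-threaded workers, each capped at 4 GB of memory.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)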
Large Climate Dataset Processing#
In Unit 2, we introduced the parallel=True
option in xarray.open_mfdataset
. This option allows xarray
to read the files using dask
.
import xarray as xr
u = xr.open_mfdataset('./data/ncep_r2_uv850/u850.*.nc',
combine = "by_coords",
parallel=True,
).uwnd
v = xr.open_mfdataset('data/ncep_r2_uv850/v850.*.nc',
combine = "by_coords",
parallel=True,
).vwnd
u
<xarray.DataArray 'uwnd' (time: 8766, level: 1, lat: 73, lon: 144)> Size: 369MB dask.array<concatenate, shape=(8766, 1, 73, 144), dtype=float32, chunksize=(366, 1, 73, 144), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 70kB 1998-01-01 1998-01-02 ... 2021-12-31 * lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0 * level (level) float32 4B 850.0 Attributes: (12/14) standard_name: eastward_wind long_name: Daily U-wind on Pressure Levels units: m/s unpacked_valid_range: [-140. 175.] actual_range: [-78.96 110.35] precision: 2 ... ... var_desc: u-wind dataset: NCEP/DOE AMIP-II Reanalysis (Reanalysis-2) Daily A... level_desc: Pressure Levels statistic: Mean parent_stat: Individual Obs cell_methods: time: mean (of 4 6-hourly values in one day)
At this point, u
is a Dask-backed DataArray, not a full in-memory NumPy array: no actual data has been read into RAM yet. The DataArray only stores metadata: dimensions, coordinates, data type, and chunking info. Hence, the u
object itself is very small (only a few MB, regardless of how big your NetCDF files are).
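You can confirm this yourself: the DataArray wraps a Dask array, and its reported size is the logical size of the full dataset, not what is currently held in RAM.
# u.data is a dask array, so nothing has been read from disk yet.
print(type(u.data))                      # dask.array.core.Array
print(f"{u.nbytes / 2**20:.0f} MiB")     # logical size of the data, not RAM usage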
Although xarray automatically chunks the dataset when using xr.open_mfdataset()
, you can rechunk manually as follows:
from xarray.groupers import TimeResampler
u_rechunk = u.chunk({'time': TimeResampler("YS"), 'lon': 36, 'lat': 24})
u_rechunk
<xarray.DataArray 'uwnd' (time: 8766, level: 1, lat: 73, lon: 144)> Size: 369MB dask.array<rechunk-p2p, shape=(8766, 1, 73, 144), dtype=float32, chunksize=(366, 1, 24, 36), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 70kB 1998-01-01 1998-01-02 ... 2021-12-31 * lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0 * level (level) float32 4B 850.0 Attributes: (12/14) standard_name: eastward_wind long_name: Daily U-wind on Pressure Levels units: m/s unpacked_valid_range: [-140. 175.] actual_range: [-78.96 110.35] precision: 2 ... ... var_desc: u-wind dataset: NCEP/DOE AMIP-II Reanalysis (Reanalysis-2) Daily A... level_desc: Pressure Levels statistic: Mean parent_stat: Individual Obs cell_methods: time: mean (of 4 6-hourly values in one day)
Manual rechunking is particularly useful when the dataset is very large and the automatic chunking produces too many small chunks, which can lead to high scheduling overhead in Dask.
Practical guidelines for chunk sizing:
Aim for chunk sizes roughly 50–200 MB each. This is large enough to reduce overhead but small enough to fit in RAM comfortably.
Avoid having too many chunks per dimension (e.g., hundreds or thousands), because Dask has to manage each chunk as a separate task, which can slow down the computation.
Read more details on how to choose good chunk sizes here.
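To check how your chunking compares with these guidelines, you can inspect the chunk layout and the average chunk size, e.g. for the u_rechunk array above:
# Chunk lengths along each dimension (time, level, lat, lon).
print(u_rechunk.chunks)
# Total number of chunks and their average size in MB.
n_chunks = int(np.prod(u_rechunk.data.numblocks))
print(f"{n_chunks} chunks, ~{u_rechunk.nbytes / n_chunks / 1e6:.1f} MB per chunk")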
Note
You can use ds.chunk(time=TimeResampler())
to rechunk according to a specified unit of time. ds.chunk(time=TimeResampler("MS"))
, for example, will set the chunks so that a month of data is contained in one chunk.
.compute()
#
The data is actually loaded into RAM only when .compute()
is called. Therefore, it is a good practice to slice or subset your datasets before calling .compute()
, especially when working with large NetCDF or GRIB files, to avoid memory overload.
u_slice = u.sel(lat=slice(90,0)).compute()
v_slice = v.sel(lat=slice(90,0)).compute()
u_slice
<xarray.DataArray 'uwnd' (time: 8766, level: 1, lat: 37, lon: 144)> Size: 187MB array([[[[ -7.9900055 , -7.9600067 , -7.9400024 , ..., -7.9799957 , -7.9600067 , -8.0099945 ], [ -3.5800018 , -3.2900085 , -3.0099945 , ..., -4.5099945 , -4.2100067 , -3.8600006 ], [ 4.1900024 , 4.669998 , 5.050003 , ..., 2.4199982 , 3.069992 , 3.699997 ], ..., [ -2.5400085 , -3.6100006 , -4.1600037 , ..., 0.6199951 , -0.21000671, -1.3399963 ], [ -4.1100006 , -4.630005 , -4.3099976 , ..., -0.29000854, -1.4900055 , -2.9100037 ], [ -7.4600067 , -7.3899994 , -6.1900024 , ..., -3.6600037 , -5.2299957 , -6.6100006 ]]], [[[ -5.6100006 , -5.1600037 , -4.7100067 , ..., -6.8099976 , -6.4100037 , -6.0099945 ], [ 2.2200012 , 2.6699982 , 3.069992 , ..., 0.56999207, 1.199997 , 1.6900024 ], [ 8.319992 , 8.789993 , 9.139999 , ..., 6.369995 , ... 1.6250005 , 1.9250002 ], [ -1.2750001 , -0.7999997 , -0.2249999 , ..., -2.7249997 , -2.2 , -1.7499998 ], [ -3.9249995 , -3.4999998 , -2.9250002 , ..., -5.4 , -4.975 , -4.5 ]]], [[[ -6.7499995 , -6.775 , -6.8250003 , ..., -6.6 , -6.625 , -6.7249994 ], [ -6.8749995 , -6.925 , -6.9499993 , ..., -6.575 , -6.675 , -6.775 ], [ -7.05 , -7.175 , -7.225 , ..., -6.5000005 , -6.725 , -6.8749995 ], ..., [ -0.625 , -0.8500004 , -0.5750003 , ..., -2.6 , -1.5500002 , -0.7249999 ], [ -1.6749997 , -1.3499999 , -1.2750001 , ..., -5.9500003 , -4.4749994 , -2.775 ], [ -3.15 , -2.8249998 , -2.8250003 , ..., -7.700001 , -6.125 , -4.3 ]]]], dtype=float32) Coordinates: * time (time) datetime64[ns] 70kB 1998-01-01 1998-01-02 ... 2021-12-31 * lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 148B 90.0 87.5 85.0 82.5 80.0 ... 7.5 5.0 2.5 0.0 * level (level) float32 4B 850.0 Attributes: (12/14) standard_name: eastward_wind long_name: Daily U-wind on Pressure Levels units: m/s unpacked_valid_range: [-140. 175.] actual_range: [-78.96 110.35] precision: 2 ... ... var_desc: u-wind dataset: NCEP/DOE AMIP-II Reanalysis (Reanalysis-2) Daily A... level_desc: Pressure Levels statistic: Mean parent_stat: Individual Obs cell_methods: time: mean (of 4 6-hourly values in one day)
Since Dask is well integrated with numpy
and xarray
functions, it is also a good idea to call .compute()
after using intrinsic functions. For example, to calculate the time mean of the wind field:
u_mean = u.mean(axis=0).compute()
v_mean = v.mean(axis=0).compute()
u_mean
<xarray.DataArray 'uwnd' (level: 1, lat: 73, lon: 144)> Size: 42kB array([[[-0.633123 , -0.647495 , -0.660538 , ..., -0.58317655, -0.60082316, -0.61727756], [-0.25697124, -0.2630875 , -0.26836646, ..., -0.23004168, -0.24033782, -0.24917528], [ 0.04172542, 0.02860714, 0.02494916, ..., 0.14124975, 0.0976802 , 0.06451849], ..., [-5.6017437 , -5.16407 , -4.7215905 , ..., -6.8395767 , -6.444315 , -6.0301476 ], [-4.247105 , -3.8827536 , -3.5130305 , ..., -5.2855477 , -4.949826 , -4.6035438 ], [ 0.21703638, 0.56360835, 0.9087125 , ..., -0.82172674, -0.47649175, -0.12961623]]], dtype=float32) Coordinates: * lon (lon) float32 576B 0.0 2.5 5.0 7.5 10.0 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0 * level (level) float32 4B 850.0
.persist()
#
Every .compute()
triggers the entire task graph from the beginning: reading the original NetCDF/GRIB files from disk, chunking the data, performing intermediate computations, and producing the final result. Suppose we'd like to do various statistical computations on u
, such as
umean = u.mean(axis=0).compute()
uvar = u.var(axis=0).compute()
umed = u.median(axis=0).compute()
umax = u.max(axis=0).compute()
umin = u.min(axis=0).compute()
The whole process from reading the original NetCDF files to producing u
will be repeated once per line of computation that uses u
, leading to redundant work and inefficiency. Instead, .persist()
triggers computation of parts of the task graph and stores the resulting chunks in the cluster workers’ memory (RAM). Subsequent computations based on the persisted variables will directly use the in-memory chunks, avoiding repeated reading and intermediate computations.
u_persist = u.persist()
v_persist = v.persist()
To sum up, use .persist()
when you have a core variable (like u
in this example) that will be used multiple times. But also be mindful of worker memory usage: if the dataset is very large, persisting may fill up RAM. In that case, consider rechunking or persisting only a subset (e.g., a time slice).
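For instance, instead of persisting the full record, you could persist only the years you actually need (the date range here is just an example):
# Persist only a time slice to keep worker memory usage under control.
u_sub = u.sel(time=slice("2015-01-01", "2020-12-31")).persist()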
Example 1: Calculate the daily climatology of u
and v
.
We will compute the daily climatology with the flox.xarray.xarray_reduce
function. The input wind fields are the persisted variables.
from flox.xarray import xarray_reduce
uDayClm = xarray_reduce(u_persist,
u_persist.time.dt.dayofyear,
func='mean',dim='time')
vDayClm = xarray_reduce(v_persist,
v_persist.time.dt.dayofyear,
func='mean',dim='time')
uDayClm
<xarray.DataArray 'uwnd' (dayofyear: 366, level: 1, lat: 73, lon: 144)> Size: 15MB dask.array<transpose, shape=(366, 1, 73, 144), dtype=float32, chunksize=(366, 1, 73, 144), chunktype=numpy.ndarray> Coordinates: * lon (lon) float32 576B 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -85.0 -87.5 -90.0 * level (level) float32 4B 850.0 * dayofyear (dayofyear) int64 3kB 1 2 3 4 5 6 7 ... 361 362 363 364 365 366 Attributes: (12/14) standard_name: eastward_wind long_name: Daily U-wind on Pressure Levels units: m/s unpacked_valid_range: [-140. 175.] actual_range: [-78.96 110.35] precision: 2 ... ... var_desc: u-wind dataset: NCEP/DOE AMIP-II Reanalysis (Reanalysis-2) Daily A... level_desc: Pressure Levels statistic: Mean parent_stat: Individual Obs cell_methods: time: mean (of 4 6-hourly values in one day)
The computation so far has been lazy; we now actually trigger it and obtain the final result with .compute()
.
uDayClm_fn = uDayClm.compute()
vDayClm_fn = vDayClm.compute()
uDayClm_fn
<xarray.DataArray 'uwnd' (dayofyear: 366, level: 1, lat: 73, lon: 144)> Size: 15MB array([[[[-1.5247937e+00, -1.4514604e+00, -1.3754172e+00, ..., -1.7235428e+00, -1.6562518e+00, -1.5960444e+00], [-5.6333429e-01, -4.1833475e-01, -2.7458504e-01, ..., -9.6874976e-01, -8.4125155e-01, -6.9416922e-01], [ 7.5583190e-01, 9.2770594e-01, 1.0993727e+00, ..., 3.0166462e-01, 4.4770643e-01, 6.0145718e-01], ..., [-2.1635435e+00, -1.9181260e+00, -1.6775018e+00, ..., -2.8645859e+00, -2.6393764e+00, -2.4029195e+00], [-2.3662536e+00, -2.1679192e+00, -1.9677110e+00, ..., -2.9314604e+00, -2.7418773e+00, -2.5595868e+00], [-7.0416862e-01, -5.3041750e-01, -3.5291770e-01, ..., -1.2106274e+00, -1.0472922e+00, -8.7396073e-01]]], [[[ 1.3624781e-01, 1.8458217e-01, 2.3916422e-01, ..., -1.3961315e-02, 3.6873043e-02, 8.3957411e-02], [ 2.3895724e-01, 3.3770636e-01, 4.3708053e-01, ..., -5.7292778e-02, 4.5207102e-02, 1.4082973e-01], [ 2.3937146e-01, 3.5499719e-01, 4.8520657e-01, ..., ... [-2.1402094e+00, -1.9437499e+00, -1.7400008e+00, ..., -2.7075014e+00, -2.5300019e+00, -2.3356268e+00], [-2.5000751e-02, 1.4437424e-01, 3.1728974e-01, ..., -5.3333282e-01, -3.6312675e-01, -1.9583400e-01]]], [[[-4.1091676e+00, -3.9441681e+00, -3.7783356e+00, ..., -4.5566688e+00, -4.4050002e+00, -4.2674980e+00], [-3.2116699e+00, -3.1075017e+00, -2.9991691e+00, ..., -3.5141671e+00, -3.4233353e+00, -3.3258324e+00], [-3.0250015e+00, -3.1033335e+00, -3.1783364e+00, ..., -2.5975020e+00, -2.7550004e+00, -2.8883379e+00], ..., [-2.2550049e+00, -1.9391688e+00, -1.6375014e+00, ..., -3.1650000e+00, -2.8641689e+00, -2.5591679e+00], [-1.5691684e+00, -1.3050017e+00, -1.0416707e+00, ..., -2.3041666e+00, -2.0675030e+00, -1.8150015e+00], [ 1.0366653e+00, 1.2341684e+00, 1.4316640e+00, ..., 4.0583149e-01, 6.1833334e-01, 8.4499711e-01]]]], dtype=float32) Coordinates: * lon (lon) float32 576B 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5 * lat (lat) float32 292B 90.0 87.5 85.0 82.5 ... -85.0 -87.5 -90.0 * level (level) float32 4B 850.0 * dayofyear (dayofyear) int64 3kB 1 2 3 4 5 6 7 ... 361 362 363 364 365 366 Attributes: (12/14) standard_name: eastward_wind long_name: Daily U-wind on Pressure Levels units: m/s unpacked_valid_range: [-140. 175.] actual_range: [-78.96 110.35] precision: 2 ... ... var_desc: u-wind dataset: NCEP/DOE AMIP-II Reanalysis (Reanalysis-2) Daily A... level_desc: Pressure Levels statistic: Mean parent_stat: Individual Obs cell_methods: time: mean (of 4 6-hourly values in one day)
The computed result is now saved in a DataArray.
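If you want to reuse the climatology later without recomputing it, you can write the computed DataArrays to disk (the file names below are just examples):
# Save the computed daily climatologies for later reuse.
uDayClm_fn.to_netcdf("./data/u850_dayclim.nc")
vDayClm_fn.to_netcdf("./data/v850_dayclim.nc")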
Takeaway#
In this example, our dataset is relatively small, so the advantage of using persist()
and lazy computation may not be obvious. However, the workflow we practiced:
Open data lazily with Dask
Decide chunking strategy
Use .persist() for frequently reused core variables
Use .compute() only when results are truly needed
This is exactly the workflow that scales up to very large climate datasets (e.g., high-resolution ERA5 reanalysis, multi-decade CMIP6 data, or high-resolution GEFSv12 ensemble reforecast data).
Loops with Parallel Computation#
Sometimes, intrinsic functions cannot be applied directly to our analysis, so we have to loop over different indices. We can wrap the per-iteration work into a function and then run the loop in parallel using Dask. For example, if we have 8 iterations in a loop and 8 available cores (workers), Dask can distribute each iteration to a separate core. Under ideal conditions, this reduces the total loop runtime to roughly the time of a single iteration, since the 8 iterations are computed simultaneously. This approach lets us take full advantage of multiple cores without manually managing threads or processes.
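Before applying this to the MJO composites below, here is a minimal, self-contained sketch of the pattern; slow_square is just a stand-in for a heavier per-iteration workload:
import time
from dask import delayed, compute

@delayed
def slow_square(x):
    # Stand-in for an expensive per-iteration computation.
    time.sleep(1)
    return x ** 2

toy_tasks = [slow_square(i) for i in range(8)]   # builds 8 lazy tasks; nothing runs yet
toy_results = compute(*toy_tasks)                # all 8 tasks run in parallel on the workers
print(toy_results)                               # (0, 1, 4, 9, 16, 25, 36, 49)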
Example 2: Plot precipitation and 850-hPa wind composite for each MJO phase in DJF months.
Step 1: Read the BoM RMM index, which can be accessed from the IRI data library.
import pandas as pd
# Read MJO data
mjo_ds = xr.open_dataset('http://iridl.ldeo.columbia.edu/SOURCES/.BoM/.MJO/.RMM/dods',
decode_times=False)
T = mjo_ds.T.values
mjo_ds['T'] = pd.date_range("1974-06-01", periods=len(T)) # The data record starts on 1974-06-01
mjo_sig_phase = xr.where(mjo_ds.amplitude>=1, mjo_ds.phase, 0)
mjo_slice = mjo_sig_phase.sel(T=slice("1998-01-01","2020-12-31"))
mjo_djf = mjo_slice.sel(T=mjo_slice['T'].dt.month.isin([12, 1, 2]))
(Opening this OPeNDAP URL prints two parser warnings, "syntax error, unexpected WORD_WORD, expecting ';' or ','" and "Illegal attribute", while the dataset's global attributes are being read; the variables themselves load correctly.)
Step 2: Read data.
def calc_anml_slice(da):
da_slice = da.sel(time=slice("1998-01-01","2020-12-31"),lat=slice(20,-20),lon=slice(40,180)).chunk({'time': TimeResampler("YS")})
da_anml = da_slice.groupby('time.dayofyear') - da_slice.groupby('time.dayofyear').mean('time') # Anomaly
da_djf = da_anml.sel(time=da_anml.time.dt.month.isin([12, 1, 2]))
return da_djf
pcp = xr.open_dataset('./data/cmorph_sample.nc').cmorph
pcp = pcp[:,::-1,:] # Re-order latitude from north to south (same as u and v)
# Calculate precipitation and wind anomalies
pcpa = calc_anml_slice(pcp)
ua = calc_anml_slice(u.isel(level=0))
va = calc_anml_slice(v.isel(level=0))
pcpa
<xarray.DataArray 'cmorph' (time: 2076, lat: 160, lon: 560)> Size: 744MB dask.array<getitem, shape=(2076, 160, 560), dtype=float32, chunksize=(216, 160, 560), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 17kB 1998-01-01 1998-01-02 ... 2020-12-31 * lon (lon) float32 2kB 40.12 40.38 40.62 40.88 ... 179.4 179.6 179.9 * lat (lat) float32 640B 19.88 19.62 19.38 ... -19.38 -19.62 -19.88 dayofyear (time) int64 17kB 1 2 3 4 5 6 7 8 ... 360 361 362 363 364 365 366
Serial Procedure#
Step 3: Compute composite MJO precipitation and wind fields.
We will first demonstrate the serial computation:
mjo_pcp, mjo_u, mjo_v = [], [], []
for i in range(1,9):
mjo_phase = xr.where(mjo_djf==i, mjo_djf, np.nan).dropna(dim='T')
time = mjo_phase['T']
pcp_compo_mean = (pcpa.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[i],axis=0))
u_compo_mean = (ua.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[i],axis=0))
v_compo_mean = (va.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[i],axis=0))
mjo_pcp.append(pcp_compo_mean)
mjo_u.append(u_compo_mean)
mjo_v.append(v_compo_mean)
mjo_pcp_fn = xr.concat(mjo_pcp, dim='phase').compute()
mjo_u_fn = xr.concat(mjo_u, dim='phase').compute()
mjo_v_fn = xr.concat(mjo_v, dim='phase').compute()
mjo_pcp_fn
<xarray.DataArray 'cmorph' (phase: 8, lat: 160, lon: 560)> Size: 3MB array([[[ 1.1642029 , 1.7967764 , 1.6354021 , ..., -0.49765617, -0.52265364, -0.3197551 ], [ 0.4289805 , 0.5675261 , 0.7982411 , ..., -0.37138933, -0.37802598, -0.2736282 ], [ 0.34693143, 0.09691156, 0.4446028 , ..., -0.274023 , -0.31316844, -0.3483858 ], ..., [-0.26494256, 0.6725488 , 0.36411816, ..., -3.3324733 , -2.380455 , -2.077746 ], [ 0.2235834 , 0.666082 , 0.08555231, ..., -2.9516845 , -2.7592201 , -2.5683365 ], [ 1.033728 , 1.054208 , 0.6711898 , ..., -2.9339333 , -2.7459626 , -2.7602105 ]], [[-0.2819479 , -0.35855076, -0.27372763, ..., -0.42099413, -0.44489563, -0.30963767], [-0.22117391, -0.26832473, -0.27131033, ..., -0.4018956 , -0.39008692, -0.34509283], [-0.20414495, -0.23193039, -0.24091023, ..., -0.12682611, -0.26710433, -0.2954261 ], ... [ 2.318648 , 1.3489722 , 0.8619659 , ..., 1.837919 , 1.5916772 , 0.8387357 ], [ 2.0905535 , 1.4776336 , 1.1018339 , ..., 1.8853862 , 1.351057 , 0.982385 ], [ 1.9134518 , 1.6720352 , 1.401461 , ..., 1.6543014 , 1.0584089 , 1.0382332 ]], [[ 0.9606286 , 0.6415049 , 0.3561331 , ..., -0.59736055, -0.61483186, -0.6769132 ], [ 0.6706844 , 0.565837 , 0.70489025, ..., -0.6471602 , -0.67367876, -0.69084775], [ 0.47901937, 0.6756892 , 0.75941104, ..., -0.7043601 , -0.71212757, -0.65451425], ..., [ 1.3432965 , 1.0880604 , 0.6569935 , ..., -1.3948476 , -1.3313782 , -1.1290971 ], [ 1.608773 , 1.8569999 , 1.5715383 , ..., -1.7065433 , -0.95375925, -0.62605035], [ 2.4226344 , 2.9898245 , 2.7392473 , ..., -2.171187 , -1.9004679 , -1.6571506 ]]], dtype=float32) Coordinates: * phase (phase) int64 64B 1 2 3 4 5 6 7 8 * lon (lon) float32 2kB 40.12 40.38 40.62 40.88 ... 179.4 179.6 179.9 * lat (lat) float32 640B 19.88 19.62 19.38 19.12 ... -19.38 -19.62 -19.88
In the serial version, the computation proceeds phase by phase, so only one worker is active at a time.
Parallel Procedure#
Since each phase composite is independent, we can parallelize the loop with Dask. The idea is:
Wrap the composite calculation for one phase into a function.
Use dask.delayed to build tasks for all eight phases without executing them immediately.
Call compute() once, letting Dask schedule and run all phase composites in parallel across multiple workers.
from dask import delayed, compute
@delayed
def compute_phase_composite(phase, mjo_djf, pcpa, ua, va):
"""Compute composite mean for one MJO phase"""
mjo_phase = xr.where(mjo_djf == phase, mjo_djf, np.nan).dropna(dim='T')
time = mjo_phase['T']
pcp_compo_mean = (pcpa.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[phase], axis=0))
u_compo_mean = (ua.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[phase], axis=0))
v_compo_mean = (va.loc[time,:,:]
.mean(axis=0)
.expand_dims(phase=[phase], axis=0))
return pcp_compo_mean, u_compo_mean, v_compo_mean
# build delayed tasks for each phase
tasks = [compute_phase_composite(i, mjo_djf, pcpa, ua, va) for i in range(1, 9)]
# execute all phases in parallel
results = compute(*tasks)
# unpack results
mjo_pcp_list, mjo_u_list, mjo_v_list = zip(*results)
# concat along phase dimension
mjo_pcp_fn = xr.concat(mjo_pcp_list, dim='phase')
mjo_u_fn = xr.concat(mjo_u_list, dim='phase')
mjo_v_fn = xr.concat(mjo_v_list, dim='phase')
mjo_pcp_fn
<xarray.DataArray 'cmorph' (phase: 8, lat: 160, lon: 560)> Size: 3MB array([[[ 1.1642029 , 1.7967764 , 1.6354021 , ..., -0.49765617, -0.52265364, -0.3197551 ], [ 0.4289805 , 0.5675261 , 0.7982411 , ..., -0.37138933, -0.37802598, -0.2736282 ], [ 0.34693143, 0.09691156, 0.4446028 , ..., -0.274023 , -0.31316844, -0.3483858 ], ..., [-0.26494256, 0.6725488 , 0.36411816, ..., -3.3324733 , -2.380455 , -2.077746 ], [ 0.2235834 , 0.666082 , 0.08555231, ..., -2.9516845 , -2.7592201 , -2.5683365 ], [ 1.033728 , 1.054208 , 0.6711898 , ..., -2.9339333 , -2.7459626 , -2.7602105 ]], [[-0.2819479 , -0.35855076, -0.27372763, ..., -0.42099413, -0.44489563, -0.30963767], [-0.22117391, -0.26832473, -0.27131033, ..., -0.4018956 , -0.39008692, -0.34509283], [-0.20414495, -0.23193039, -0.24091023, ..., -0.12682611, -0.26710433, -0.2954261 ], ... [ 2.3186479 , 1.3489723 , 0.8619657 , ..., 1.8379189 , 1.5916773 , 0.8387354 ], [ 2.0905535 , 1.4776337 , 1.1018339 , ..., 1.8853862 , 1.351057 , 0.9823849 ], [ 1.913452 , 1.6720352 , 1.4014612 , ..., 1.6543014 , 1.0584089 , 1.0382335 ]], [[ 0.9606286 , 0.6415049 , 0.3561331 , ..., -0.59736055, -0.61483186, -0.6769132 ], [ 0.6706844 , 0.565837 , 0.70489025, ..., -0.6471602 , -0.67367876, -0.69084775], [ 0.47901937, 0.6756892 , 0.75941104, ..., -0.7043601 , -0.71212757, -0.65451425], ..., [ 1.3432965 , 1.0880604 , 0.6569935 , ..., -1.3948476 , -1.3313782 , -1.1290971 ], [ 1.608773 , 1.8569999 , 1.5715383 , ..., -1.7065433 , -0.95375925, -0.62605035], [ 2.4226344 , 2.9898245 , 2.7392473 , ..., -2.171187 , -1.9004679 , -1.6571506 ]]], dtype=float32) Coordinates: * phase (phase) int64 64B 1 2 3 4 5 6 7 8 * lon (lon) float32 2kB 40.12 40.38 40.62 40.88 ... 179.4 179.6 179.9 * lat (lat) float32 640B 19.88 19.62 19.38 19.12 ... -19.38 -19.62 -19.88
The @delayed
decorator before the compute_phase_composite()
function is used to turn a normal Python function into a lazy task. Normally, when you call a function in Python, it executes immediately and returns a result. But with @delayed
, the function call does not execute right away. Instead, it returns a Delayed object that represents a task in a computation graph.
When you collect many Delayed objects (e.g., inside a loop), Dask can build a full task graph. The actual computation happens only when you explicitly call .compute()
(or dask.compute(...)
), at which point Dask will schedule the tasks across multiple workers and run them in parallel.
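You can see this laziness directly by inspecting one of the task objects in the tasks list built above:
# Each element of `tasks` is a Delayed placeholder, not a computed result.
print(type(tasks[0]))   # <class 'dask.delayed.Delayed'>
print(tasks[0])         # Delayed('compute_phase_composite-...')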
Note
In this MJO composite example, the computation itself is relatively light. Each task only selects a subset of time indices, takes a simple mean, and returns a result.
Because the operations are small, the overhead from Dask’s scheduler (building the task graph, coordinating workers, handling chunk shuffles) can be comparable to, or even larger than, the actual computation time. As a result, you may not see a big speed-up compared to the serial version.
The real advantage of using Dask’s @delayed
(or Dask arrays in general) appears when:
The dataset is very large (tens or hundreds of GBs).
Each task is computationally heavy.
So while this example is only a demonstration of how to structure loops for parallel computation, the performance benefits will be more obvious in larger, more demanding workflows.
Step 4: Plotting
import matplotlib as mpl
from matplotlib import pyplot as plt
import cmaps
from cartopy import crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
mpl.rcParams['figure.dpi'] = 150
fig, axes = plt.subplots(4,2,
subplot_kw={'projection': ccrs.PlateCarree()},
figsize=(12,10))
ax = axes.flatten()
lon_formatter = LONGITUDE_FORMATTER
lat_formatter = LATITUDE_FORMATTER
clevs = [-12,-9,-6,-3,-1,1,3,6,9,12]
porder = [0,2,4,6,1,3,5,7]
for i in range(0,8):
cf = (mjo_pcp_fn[i,:,:].plot.contourf(x='lon',y='lat', ax=ax[porder[i]],
levels=clevs,
add_colorbar=False,
cmap=cmaps.CBR_drywet,
extend='both',
transform=ccrs.PlateCarree()))
wnd = xr.merge([mjo_u_fn[i,::2,::3], mjo_v_fn[i,::2,::3]])
qv = wnd.plot.quiver(ax=ax[porder[i]],
transform=ccrs.PlateCarree(),
x='lon', y='lat',
u='uwnd', v='vwnd',
add_guide=False,
width=0.0025 ,headaxislength=3,headlength=6,headwidth=7,
scale=70, colors="black"
)
ax[porder[i]].coastlines()
ax[porder[i]].set_extent([40,180,-20,20],crs=ccrs.PlateCarree())
ax[porder[i]].set_xticks(np.arange(60,210,30), crs=ccrs.PlateCarree())
    ax[porder[i]].set_yticks(np.arange(-20,30,10), crs=ccrs.PlateCarree()) # Set the x/y axis extent and the longitude/latitude intervals for tick marks
ax[porder[i]].xaxis.set_major_formatter(lon_formatter)
ax[porder[i]].yaxis.set_major_formatter(lat_formatter)
ax[porder[i]].set_xlabel(' ')
ax[porder[i]].set_ylabel(' ')
ax[porder[i]].set_title(' ')
ax[porder[i]].set_title('Phase '+str(i+1), loc='left')
# Add a colorbar axis at the bottom of the graph
cbar_ax = fig.add_axes([0.2, 0.07, 0.6, 0.015])
# Draw the colorbar on the cbar_ax axis
cbar = fig.colorbar(cf, cax=cbar_ax,
orientation='horizontal',
ticks=clevs,
label=r'[mm hr$^{-1}$]')
plt.subplots_adjust(hspace=0.15)
plt.suptitle('MJO Composite in DJF',y=0.92,size='large',weight='bold')
plt.show()
