Commit 5511156

Merge pull request #16 from janelia-cellmap/multiscale
Extension of #15. This PR consolidates several feature branches that support conversion of light imagery:

- Builds on top of the tif_volume branch
- Moves to a Pixi build
- Allows LSF directives on the command line
- Creates `.zattrs` before writing data
- Uses the Z axis for slabs
- Optionally turns off read optimizations, because they don't work in all cases
- Doesn't write empty units (they break Neuroglancer)

Added:

- Integrated multiscale pyramid generation
- Added a config click option to specify all the necessary parameters in a YAML file
- Added a pydantic model to validate input parameters
- Rewrote read optimization for TIFF file conversion: read the smallest TIFF segment that is still big enough to write into chunks only once (aligned with the chunk grid)
2 parents a8030b6 + 77170ea commit 5511156

File tree

15 files changed (+3508 / −145 lines)

.github/workflows/ci.yaml

Lines changed: 2 additions & 1 deletion

```diff
@@ -12,7 +12,8 @@ jobs:
     strategy:
       matrix:
-        python-version: ['3.12'] #['3.10', '3.11', '3.12']
+        python-version: ['3.12']
+
 
     steps:
       - uses: actions/checkout@v4
```

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -7,4 +7,6 @@ dist/
 .python-version
 *.egg-info/
 build/
-.vscode/
+.vscode/
+.claude
+CLAUDE.md
```

README.md

Lines changed: 58 additions & 10 deletions

````diff
@@ -1,24 +1,72 @@
-# zarrify
+# Zarrify
 
-Convert TIFF/TIFF Stacks, MRC, and N5 files to Zarr (v2) format with ome-ngff (0.4) metadata.
+Convert TIFF/TIFF Stacks, MRC, and N5 files to Zarr (v2) format with OME-NGFF 0.4 metadata.
 
-## Install
+## Prerequisites
+
+You must [install Pixi](https://pixi.sh/latest/installation/) to run this tool.
+
+## Install into local env
+
+After you check out the repo, run Pixi to download dependencies and create the local environment:
 
 ```bash
-pip install zarrify
+pixi run dev-install
 ```
 
 ## Usage
 
-### Local processing
-```bash
-zarrify --src input.tiff --dest output.zarr --cluster local --workers 20
+### Step 1: Convert image to OME-Zarr format
+
+To convert an OME-TIFF image with axes ZCYX:
+
+```
+pixi run zarrify --src=/path/to/input --dest=/path/to/output.zarr --axes z c y x --units nanometer '' nanometer nanometer --zarr_chunks 64 3 128 128
 ```
 
-### LSF cluster processing
-```bash
-bsub -n 1 -J to_zarr 'zarrify --src input.tiff --dest output.zarr --cluster lsf --workers 20'
+### Step 2: Add multiscale pyramid
+
+To add multiscale pyramids to the above OME-Zarr:
+
 ```
+pixi run zarr-multiscale --config=/path/to/config.yaml
+```
+
+Where `config.yaml` contains the parameters, e.g.:
+
+```yaml
+src: "/path/to/output.zarr"
+workers: 50
+data_origin: "raw"
+cluster: "lsf"
+log_dir: "/path/to/logs/workers"
+antialiasing: false
+high_aspect_ratio: false
+custom_scale_factors:
+- [1, 1, 1, 1]
+- [1, 1, 2, 2]
+- [1, 1, 4, 4]
+- [1, 1, 8, 8]
+- [1, 2, 16, 16]
+- [1, 4, 32, 32]
+- [1, 8, 64, 64]
+- [1, 16, 128, 128]
+- [1, 32, 256, 256]
+- [1, 64, 512, 512]
+- [1, 128, 1024, 1024]
+```
+
+### IBM Platform LSF
+
+You can run the pipeline on an LSF cluster by specifying the `--cluster=lsf` parameter. For example:
+
+```
+export INPUT=/path/to/input
+export OUTPUT=/path/to/output.zarr
+export LOGDIR=`pwd`
+bsub -n 4 -P myproject -o $LOGDIR/jobout.log -e $LOGDIR/joberr.log "/misc/sc/pixi run zarrify --src=$INPUT --dest=$OUTPUT --cluster $CLUSTER --log_dir $LOGDIR/workers --workers 500 --zarr_chunks 3 128 128 128 --extra_directives='-P myproject' --scale 1.0 1.0 0.116 0.116 --optimize_reads"
+```
 
 
 ## Python API
````
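For orientation, the OME-NGFF 0.4 metadata that Step 1 writes into the container's group-level `.zattrs` takes roughly this shape. This is a hand-written sketch with illustrative axis names and scales, not output captured from the tool:

```python
# Illustrative OME-NGFF 0.4 "multiscales" metadata for a z/y/x volume with
# 8 nm voxels; values mirror the README example, paths are hypothetical.
multiscales = {
    "multiscales": [
        {
            "version": "0.4",
            "axes": [
                {"name": "z", "type": "space", "unit": "nanometer"},
                {"name": "y", "type": "space", "unit": "nanometer"},
                {"name": "x", "type": "space", "unit": "nanometer"},
            ],
            "datasets": [
                {
                    "path": "s0",
                    "coordinateTransformations": [
                        {"type": "scale", "scale": [8.0, 8.0, 8.0]},
                        {"type": "translation", "translation": [0.0, 0.0, 0.0]},
                    ],
                }
            ],
        }
    ]
}

# Sanity check: every scale/translation must have one entry per axis.
ms = multiscales["multiscales"][0]
ndim = len(ms["axes"])
for ds in ms["datasets"]:
    for tf in ds["coordinateTransformations"]:
        values = tf.get("scale", tf.get("translation"))
        assert len(values) == ndim
```

Note the PR's "Don't write empty units" fix: a unit key with an empty string (as for a channel axis) would need to be omitted rather than written, since Neuroglancer rejects empty units.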

config_example.yaml

Lines changed: 25 additions & 0 deletions

```yaml
src : "/path_to_file/input_file.tiff"     # image source path
dest : "/path_to_output/output_file.zarr" # zarr container path

# Dask configs
cluster : local           # "local"/"lsf" cluster
workers : 20              # number of Dask workers
log_dir :                 # log directory; if unset, you get an email from each worker
extra_directives : null   # pass extra Dask options

# Output zarr array configs
zarr_chunks : [128, 128, 128]  # output chunk size
axes : [z, y, x]               # axis order
translation : [0.0, 0.0, 0.0]  # array offset
scale : [8.0, 8.0, 8.0]        # voxel size
units : [nanometer, nanometer, nanometer]  # physical units
optimize_reads : True

# Multiscale configs
multiscale : True           # whether to create a multiscale pyramid from the converted zarr array
ms_workers : 20             # number of Dask workers
data_origin : "raw"         # "raw"/"labels"
antialiasing : False        # blur the dataset with a Gaussian filter before downsampling to reduce aliasing
normalize_voxel_size : False  # True/False
custom_scale_factors : null   # custom multipliers for each level, e.g. [[1,1,1], [2,4,8], [4,8,16], ...]
```
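The PR description mentions a pydantic model that validates these input parameters. A rough stdlib-only sketch of the same validation idea (the class name and defaults here are illustrative; the actual model in the repo is pydantic-based):

```python
from dataclasses import dataclass, field

@dataclass
class ZarrifyConfig:
    """Hypothetical mirror of the config keys above, with cross-field checks."""
    src: str
    dest: str
    cluster: str = "local"
    workers: int = 20
    zarr_chunks: list = field(default_factory=lambda: [128, 128, 128])
    axes: list = field(default_factory=lambda: ["z", "y", "x"])
    scale: list = field(default_factory=lambda: [8.0, 8.0, 8.0])
    units: list = field(default_factory=lambda: ["nanometer"] * 3)

    def __post_init__(self):
        if self.cluster not in ("local", "lsf"):
            raise ValueError(f"unknown cluster type: {self.cluster!r}")
        # Per-axis parameters must agree in length with the axis list.
        for name in ("zarr_chunks", "scale", "units"):
            if len(getattr(self, name)) != len(self.axes):
                raise ValueError(f"{name} must have one entry per axis")

cfg = ZarrifyConfig(src="input.tiff", dest="output.zarr")
```

The benefit over reading raw YAML into a dict is that a typo like `cluster: slurm` or a 3-entry `scale` for 4 axes fails loudly at startup instead of deep inside a Dask worker.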

pixi.lock

Lines changed: 2712 additions & 0 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 36 additions & 12 deletions

```diff
@@ -1,12 +1,12 @@
 [project]
 name = "zarrify"
-version = "0.0.8"
+version = "0.1.0"
 description = "Convert scientific image formats (TIFF, MRC, N5) to OME-Zarr"
 authors = [
     { name = "Yurii Zubov", email = "[email protected]" }
 ]
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.12"
 license = { text = "MIT" }
 keywords = ["zarr", "ome-zarr", "tiff", "mrc", "n5", "scientific-imaging"]
 classifiers = [
@@ -17,32 +17,56 @@ classifiers = [
     "Programming Language :: Python :: 3.11",
     "Topic :: Scientific/Engineering :: Image Processing",
 ]
-
 dependencies = [
+    "bokeh>=3.3.2",
     "click>=8.1.8,<9.0.0",
-    "colorama>=0.4.6,<0.5.0",
+    "colorama>=0.4.6,<0.5.0",
     "dask>=2024.12.1,<2025.0.0",
-    "dask-jobqueue==0.8.2",
+    "dask-jobqueue==0.9.0",
     "imagecodecs>=2024.12.30,<2025.0.0",
     "mrcfile>=1.5.3,<2.0.0",
     "natsort>=8.4.0,<9.0.0",
-    "numcodecs>=0.13.0,<0.16",
-    "pydantic-zarr>=0.4.0,<0.5.0",
+    "numcodecs>=0.15.1,<0.16.1",
+    "numpy>=1.26.2",
+    "pydantic>=2.11.7,<3",
+    "pydantic-zarr>=0.4.0,<0.9.0",
     "pint>=0.20.0,<1.0.0",
+    "scipy>=1.16.1",
     "tifffile>=2024.1.0,<2025.5.10",
-    "zarr==2.18.3",
-    "bokeh>=3.1.0"
+    "xarray-multiscale>=2.1.0",
+    "zarr>=2.18.3,<3.0"
 ]
 
 [project.scripts]
 zarrify = "zarrify.to_zarr:cli"
+zarr-multiscale = "zarrify.multiscale.to_multiscale:cli"
 
 [project.optional-dependencies]
 test = [
     "pytest"
 ]
 
+[tool.pixi.project]
+channels = ["conda-forge"]
+platforms = ["osx-64", "osx-arm64", "linux-64"]
+[tool.pixi.tasks]
+dev-install = "pip install -e ."
+zarrify = { cmd = "zarrify" }
+zarr-multiscale = { cmd = "zarr-multiscale" }
+test = "pytest"
+clean = "rm -rf build/ dist/ *.egg-info/ .pytest_cache/ __pycache__/ dask_dashboard_link.txt"
+
+[tool.pixi.feature.test.dependencies]
+pytest = "*"
+
+[tool.pixi.feature.test.tasks]
+test-install = "pip install -e ."
+test = { cmd = "pytest", depends-on = ["test-install"] }
+
+[tool.pixi.environments]
+default = { solve-group = "default" }
+test = { features = ["test"], solve-group = "default" }
 
-[build-system]
-requires = ["setuptools>=61.0"]
-build-backend = "setuptools.build_meta"
+[tool.pixi.dependencies]
+python = "3.12.*"
+pip = ">=25.2,<26"
```

src/zarrify/formats/mrc.py

Lines changed: 2 additions & 7 deletions

```diff
@@ -38,10 +38,8 @@
 
     def write_to_zarr(
         self,
-        dest: str,
+        zarr_array: zarr.Array,
         client: Client,
-        zarr_chunks : list[int],
-        comp : ABCMeta = Zstd(level=6),
     ):
         """Use mrcfile memmap to access small parts of the mrc file and write them into zarr chunks.
@@ -55,10 +53,7 @@ def write_to_zarr(
 
         src_path = copy.copy(self.src_path)
 
-        if len(zarr_chunks) != len(self.shape):
-            zarr_chunks = self.reshape_to_arr_shape(zarr_chunks, self.shape)
-
-        z_arr = self.get_output_array(dest, zarr_chunks, comp)
+        z_arr = zarr_array
 
         out_slices = slices_from_chunks(
             normalize_chunks(z_arr.chunks, shape=z_arr.shape)
```
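The converters now iterate chunk-aligned slices of the pre-created output array via Dask's `normalize_chunks` and `slices_from_chunks`. A stdlib-only sketch of the equivalent computation (the helper name is mine, not from the repo):

```python
from itertools import product

def chunk_slices(shape, chunks):
    """Yield tuples of slices that tile `shape` in blocks of `chunks`,
    clipping the final block on each axis -- what slices_from_chunks
    produces after normalize_chunks has regularized the chunking."""
    per_axis = []
    for dim, chunk in zip(shape, chunks):
        per_axis.append(
            [slice(start, min(start + chunk, dim))
             for start in range(0, dim, chunk)]
        )
    # Cartesian product over axes: one slice tuple per output chunk.
    return list(product(*per_axis))

# A 5x4 array with 2x3 chunks tiles into 3 * 2 = 6 blocks.
slices = chunk_slices((5, 4), (2, 3))
```

Each worker then writes exactly one output chunk region, so no two tasks touch the same Zarr chunk and writes need no locking.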

src/zarrify/formats/tiff.py

Lines changed: 62 additions & 34 deletions

```diff
@@ -4,15 +4,21 @@
 import os
 from dask.distributed import Client, wait
 import time
-import dask.array as da
 import copy
 from zarrify.utils.volume import Volume
 from abc import ABCMeta
 from numcodecs import Zstd
+from dask.array.core import slices_from_chunks, normalize_chunks
 import logging
 
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(name)s: %(message)s"
+)
+logger = logging.getLogger(__name__)
 
-class Tiff3D(Volume):
+
+class Tiff(Volume):
 
     def __init__(
         self,
@@ -21,6 +27,7 @@ def __init__(
         scale: list[float],
         translation: list[float],
         units: list[str],
+        optimize_reads: bool = False,
     ):
         """Construct all the necessary attributes for the proper conversion of tiff to OME-NGFF Zarr.
@@ -35,57 +42,78 @@
         self.shape = self.zarr_arr.shape
         self.dtype = self.zarr_arr.dtype
         self.ndim = self.zarr_arr.ndim
-
+        self.optimize_reads = optimize_reads
+
         # Scale metadata parameters to match data dimensionality
-        self.metadata["axes"] = self.metadata["axes"][-self.ndim:]
+        self.metadata["axes"] = list(self.metadata["axes"])[-self.ndim:]
         self.metadata["scale"] = self.metadata["scale"][-self.ndim:]
         self.metadata["translation"] = self.metadata["translation"][-self.ndim:]
         self.metadata["units"] = self.metadata["units"][-self.ndim:]
 
     def write_to_zarr(self,
-        dest: str,
+        zarr_array: zarr.Array,
         client: Client,
-        zarr_chunks : list[int],
-        comp : ABCMeta = Zstd(level=6),
     ):
 
-        # reshape chunk shape to align with arr shape
-        if len(zarr_chunks) != self.shape:
-            zarr_chunks = self.reshape_to_arr_shape(zarr_chunks, self.shape)
-
-        z_arr = self.get_output_array(dest, zarr_chunks, comp)
-        chunks_list = np.arange(0, z_arr.shape[0], z_arr.chunks[0])
+        # Find slab axis based on metadata axes - use z axis for slabbing
+        axes = self.metadata["axes"]
+        slab_axis = axes.index('z') if 'z' in axes else 0
+
+        z_arr = zarr_array
+
+        slice_chunks = z_arr.chunks
+        if self.optimize_reads:
+            logger.info("Optimizing read chunking...")
+            # TODO: this works for some cases, doesn't work in others, need to understand why
+            # slicing: (c, z, y, x) or (z, y, x) - combine (c, z) from zarr_chunks
+            # and (y, x) from tiff chunking
+            logger.info(f"Output Zarr array chunks: {z_arr.chunks}")
+            logger.info(f"Input Tiff array chunks: {self.zarr_arr.chunks}")
+            slice_chunks = list(z_arr.chunks[:slab_axis+1]).copy()
+            logger.info(f"Slice chunks: {slice_chunks}")
+
+            # cast slab size to write into zarr:
+            for zarr_chunkdim, tiff_chunkdim, tiff_dim in zip(z_arr.chunks[slab_axis+1:], self.zarr_arr.chunks[slab_axis+1:], self.zarr_arr.shape[slab_axis+1:]):
+                if tiff_chunkdim < zarr_chunkdim:
+                    slice_chunks.append(zarr_chunkdim)
+                elif tiff_chunkdim/tiff_dim < 0.5:
+                    slice_chunks.append(int(tiff_chunkdim/zarr_chunkdim)*zarr_chunkdim)
+                else:
+                    slice_chunks.append(tiff_dim)
+
+            logger.info(f"Slice chunks extended: {slice_chunks}")
+
+            # compute size of the slab
+            slab_size_bytes = np.prod(slice_chunks) * np.dtype(self.dtype).itemsize
+
+            # get dask worker allocated memory size
+            dask_worker_memory_bytes = next(iter(client.scheduler_info()["workers"].values()))["memory_limit"]
+
+            logger.info(f"Slab size: {slab_size_bytes / 1e9} GB")
+            logger.info(f"Dask memory limit: {dask_worker_memory_bytes / 1e9} GB")
+            if slab_size_bytes > dask_worker_memory_bytes:
+                raise ValueError("Tiff segment size exceeds Dask worker memory limit. Please reduce the chunksize of the output array.")
+
+        logger.info(f"Zarr array shape: {self.zarr_arr.shape}")
+        normalized_chunks = normalize_chunks(slice_chunks, shape=self.zarr_arr.shape)
+        slice_tuples = slices_from_chunks(normalized_chunks)
 
         src_path = copy.copy(self.src_path)
 
         start = time.time()
         fut = client.map(
-            lambda v: write_volume_slab_to_zarr(v, z_arr, src_path), chunks_list
-        )
-        logging.info(
-            f"Submitted {len(chunks_list)} tasks to the scheduler in {time.time()- start}s"
+            lambda v: write_volume_slab_to_zarr(v, z_arr, src_path), slice_tuples
         )
+        logger.info(f"Submitted {len(slice_tuples)} tasks to the scheduler in {round(time.time() - start, 4)}s")
 
         # wait for all the futures to complete
         result = wait(fut)
-        logging.info(f"Completed {len(chunks_list)} tasks in {time.time() - start}s")
+        logger.info(f"Completed {len(slice_tuples)} tasks in {round(time.time() - start, 2)}s")
 
         return 0
 
 
-def write_volume_slab_to_zarr(chunk_num: int, zarray: zarr.Array, src_path: str):
-
-    # check if the slab is at the array boundary or not
-    if chunk_num + zarray.chunks[0] > zarray.shape[0]:
-        slab_thickness = zarray.shape[0] - chunk_num
-    else:
-        slab_thickness = zarray.chunks[0]
-
-    slab_shape = [slab_thickness] + list(zarray.shape[-2:])
-    np_slab = np.empty(slab_shape, zarray.dtype)
-
-    tiff_slab = imread(src_path, key=range(chunk_num, chunk_num + slab_thickness, 1))
-    np_slab[0 : zarray.chunks[0], :, :] = tiff_slab
-
-    # write a tiff stack slab into zarr array
-    zarray[chunk_num : chunk_num + zarray.chunks[0], :, :] = np_slab
+def write_volume_slab_to_zarr(slice: slice, zarray: zarr.Array, src_path: str):
+    tiff_store = imread(src_path, aszarr=True)
+    src_tiff_arr = zarr.open(tiff_store, mode='r')
+    zarray[slice] = src_tiff_arr[slice]
```
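The read-optimization loop above picks, for each axis after the slab axis, a read extent that lands on the output chunk grid: one output chunk if the TIFF's native chunks are smaller, a chunk-aligned multiple of the TIFF chunk if it covers less than half the axis, or the whole axis otherwise. Extracted into a standalone function (names lightly changed, logic copied from the diff):

```python
def plan_slice_chunks(zarr_chunks, tiff_chunks, tiff_shape, slab_axis):
    """Choose read extents aligned with the output chunk grid.

    Keeps the output chunking up to and including the slab axis, then for
    each trailing axis reads either one output chunk, a chunk-aligned
    multiple of it, or the full axis, depending on the TIFF's chunking.
    """
    slice_chunks = list(zarr_chunks[:slab_axis + 1])
    for zc, tc, dim in zip(zarr_chunks[slab_axis + 1:],
                           tiff_chunks[slab_axis + 1:],
                           tiff_shape[slab_axis + 1:]):
        if tc < zc:
            slice_chunks.append(zc)               # TIFF chunk smaller: read one output chunk
        elif tc / dim < 0.5:
            slice_chunks.append((tc // zc) * zc)  # round TIFF chunk down to the output grid
        else:
            slice_chunks.append(dim)              # TIFF chunk spans most of the axis: read it all
    return slice_chunks

# 512-wide TIFF tiles over a 2048-wide axis, 128-wide output chunks:
# each read covers 4 output chunks per trailing axis.
plan = plan_slice_chunks((64, 128, 128), (1, 512, 512), (100, 2048, 2048), 0)
```

The subsequent slab-size check in the diff multiplies these extents by the dtype's item size and refuses to proceed if one read would exceed a Dask worker's memory limit, since each task materializes a full slab.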

src/zarrify/formats/tiff_stack.py

Lines changed: 2 additions & 8 deletions

```diff
@@ -67,17 +67,11 @@ def write_tile_slab_to_zarr(
 
     # parallel writing of tiff stack into zarr array
     def write_to_zarr(self,
-        dest: str,
+        zarr_array: zarr.Array,
         client: Client,
-        zarr_chunks : list[int],
-        comp : ABCMeta = Zstd(level=6),
     ):
 
-        # reshape chunk shape to align with arr shape
-        if len(zarr_chunks) != len(self.shape):
-            zarr_chunks = self.reshape_to_arr_shape(zarr_chunks, self.shape)
-
-        z_arr = self.get_output_array(dest, zarr_chunks, comp)
+        z_arr = zarr_array
         chunks_list = np.arange(0, z_arr.shape[0], z_arr.chunks[0])
 
         start = time.time()
```
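Unlike `tiff.py`, the TIFF-stack converter still slabs along axis 0 at output-chunk boundaries. A quick stdlib sketch of the offsets that `np.arange(0, z_arr.shape[0], z_arr.chunks[0])` produces, plus the boundary clipping each worker must apply on the final slab (helper names are mine):

```python
def slab_offsets(dim, chunk):
    """Start offsets of axis-0 slabs, one per chunk row along that axis
    (equivalent to np.arange(0, dim, chunk) in the diff)."""
    return list(range(0, dim, chunk))

def slab_thickness(start, dim, chunk):
    """Thickness of the slab beginning at `start`, clipped at the array
    boundary so the last slab isn't read or written past the end."""
    return min(chunk, dim - start)

# 10 planes with 4-plane chunks: slabs start at 0, 4, 8; the last is 2 thick.
offsets = slab_offsets(10, 4)
last = slab_thickness(offsets[-1], 10, 4)
```

This clipping is exactly the boundary check that the old `write_volume_slab_to_zarr` in `tiff.py` performed by hand before the slice-based rewrite made it unnecessary there.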
