During the construction of the step graph (here), upstream steps in the DAG are duplicated in memory. This leads to a large memory footprint and often to OOM errors. I tested this by modifying the step_graph.from_params method to print out the memory location of each step's config values. The code was inserted here:
step_dict[step_name] = Step.from_params(step_params, step_name=step_name)
print(f"Size of step {step_name}: {getsize(step_dict[step_name])} mb")
step_data = step_dict[step_name]
for key, value in step_data.config.items():
    print(f"{key}: {id(value)}")
print()
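The check above works because id() returns a distinct value for every object that is alive at the same time, so two downstream steps that print different ids for the same upstream dependency are holding separate copies. Here is a minimal, self-contained sketch of that idea. It does not use Tango and is not how Tango builds its graph; FakeStep, SPEC, build_duplicating, build_shared, and count_distinct_upstreams are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for a step; this is not Tango's Step class."""
    name: str
    config: Dict[str, object] = field(default_factory=dict)


# Toy graph spec (an assumption for this sketch): "b" and "c" both depend on "a".
# Upstream steps must be listed before the steps that use them.
SPEC: Dict[str, Optional[str]] = {"a": None, "b": "a", "c": "a"}


def build_duplicating(spec: Dict[str, Optional[str]]) -> Dict[str, FakeStep]:
    """Re-create the upstream step for every downstream step that uses it."""
    steps: Dict[str, FakeStep] = {}
    for name, upstream in spec.items():
        config: Dict[str, object] = {}
        if upstream is not None:
            config["input"] = FakeStep(upstream)  # fresh object every time
        steps[name] = FakeStep(name, config)
    return steps


def build_shared(spec: Dict[str, Optional[str]]) -> Dict[str, FakeStep]:
    """Create each step once and hand out references to the same object."""
    steps: Dict[str, FakeStep] = {}
    for name, upstream in spec.items():
        config: Dict[str, object] = {}
        if upstream is not None:
            config["input"] = steps[upstream]  # reuse the existing object
        steps[name] = FakeStep(name, config)
    return steps


def count_distinct_upstreams(steps: Dict[str, FakeStep]) -> int:
    """Count distinct objects used as 'input' across all steps, by identity."""
    return len({id(s.config["input"]) for s in steps.values() if "input" in s.config})
```

With this toy spec, count_distinct_upstreams(build_duplicating(SPEC)) is 2 while count_distinct_upstreams(build_shared(SPEC)) is 1. The first case is the pattern the id() printout is meant to expose: each extra copy of an upstream step costs memory in proportion to that step's size.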
We know of this issue, and there is a fix in the latest main branch. We just haven't been able to make a release with it because of a problem with Torch 2 that I'm tracking in #560.
🐛 Describe the bug
This behavior can be reproduced with this simple example.
config.jsonnet
ex_steps.py
The output is
Versions
Python 3.9.5
absl-py==1.4.0
ai2-tango==1.2.0
aiohttp==3.8.4
aiosignal==1.3.1
alembic==1.8.0
appdirs==1.4.4
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.2
attrs==22.2.0
backcall==0.2.0
base58==2.1.1
beaker-py==1.18.1
boto3==1.24.26
botocore==1.27.96
cached-path==1.1.6
cached-property==1.5.2
cachetools==5.3.0
certifi==2022.12.7
charset-normalizer==2.1.1
chex==0.1.6
click==8.1.3
click-help-colors==0.9.1
cloudpickle==2.2.1
comm==0.1.3
commonmark==0.9.1
cycler==0.11.0
dask==2022.8.1
datasets==2.10.1
debugpy==1.6.6
decorator==5.1.1
dill==0.3.6
distributed==2022.8.1
dm-tree==0.1.8
docker==6.0.1
docker-pycreds==0.4.0
etils==1.1.0
exceptiongroup==1.1.1
executing==1.2.0
fairscale==0.4.9
filelock==3.8.2
flatbuffers==23.3.3
flax==0.6.7
fonttools==4.39.2
frozenlist==1.3.3
fsspec==2023.3.0
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.31
glob2==0.7
glog==0.3.1
google-api-core==2.8.2
google-auth==2.16.2
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.2
google-cloud-storage==2.7.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.4.1
googleapis-common-protos==1.56.4
gprof2dot==2022.7.29
greenlet==1.1.3
grpcio==1.51.3
h5py==3.8.0
HeapDict==1.0.1
heartpy==1.2.7
huggingface-hub==0.10.1
idna==3.4
importlib-metadata==6.0.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipdb==0.13.13
ipykernel==6.22.0
ipython==8.11.0
jax==0.4.6
jaxlib==0.4.6
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
jsonnet-binary==0.17.0
jupyter_client==8.1.0
jupyter_core==5.3.0
keras==2.11.0
kiwisolver==1.4.4
libclang==15.0.6.1
locket==1.0.0
Mako==1.2.4
Markdown==3.4.1
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.5.2
matplotlib-inline==0.1.6
mdurl==0.1.2
more-itertools==8.14.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.14
nest-asyncio==1.5.6
neurokit2==0.2.0
numexpr==2.8.4
numpy==1.23.0
oauthlib==3.2.2
opencv-python==4.6.0.66
opt-einsum==3.3.0
optax==0.1.4
orbax==0.1.4
packaging==23.0
pandas==1.4.3
parso==0.8.3
partd==1.3.0
pathtools==0.1.2
patsy==0.5.3
pendulum==2.1.2
petname==2.6
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==3.1.1
pluggy==1.0.0
prompt-toolkit==3.0.38
protobuf==3.19.4
psutil==5.9.1
psycopg2-binary==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyarrow==11.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.10.6
pyDeprecate==0.3.2
Pygments==2.14.0
pyparsing==3.0.9
pytest==7.2.2
python-dateutil==2.8.2
python-gflags==3.1.2
pytorch-lightning==1.7.7
pytz==2022.7.1
pytzdata==2020.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.2
regex==2022.10.31
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
retry==0.9.2
rich==12.6.0
rjsonnet==0.5.2
rsa==4.9
s3transfer==0.6.0
sacremoses==0.0.53
scikit-learn==1.2.2
scipy==1.8.1
seaborn==0.12.0
sentencepiece==0.1.97
sentry-sdk==1.17.0
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
snakeviz==2.1.1
sortedcontainers==2.4.0
SQLAlchemy==1.4.39
sqlitedict==2.1.0
stack-data==0.6.2
statsmodels==0.13.2
tables==3.7.0
tblib==1.7.0
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-cpu==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.31.0
tensorstore==0.1.33
termcolor==2.2.0
threadpoolctl==3.1.0
tokenizers==0.13.2
tomli==2.0.1
toolz==0.12.0
torch==1.12.1
torchaudio==0.12.1
torchmetrics==0.11.4
torchvision==0.13.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15
validators==0.20.0
wandb==0.13.11
wcwidth==0.2.6
websocket-client==1.5.1
Werkzeug==2.2.3
wrapt==1.15.0
xarray==2022.3.0
xgboost==1.6.2
xxhash==3.2.0
yarl==1.8.2
zict==2.2.0
zipp==3.15.0