Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid hardcoding /scratch/ #25

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

152334H
Copy link

@152334H 152334H commented Oct 23, 2024

on slurm jobs, a /scratch/slurm_tmpdir/ folder is attempted to be used for the TMP_DIR env var.

this is strange, given that

  • TMP_DIR is not used anywhere else in the repository, nor is it a common env var name
  • /scratch/ does not exist in all slurm setups, and is an artefact of the author's personal cluster having their NFS at /scratch/ specifically

therefore I propose deleting the code

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 23, 2024
@BadrYoubiIdrissi
Copy link
Contributor

Hi ! Thank you for your PR :) we've had some really annoying issues when leaving tmp to be the system default. Especially on clusters used by many people where /tmp fills up fast. Although the code doesn't directly use TMP_DIR, there are many packages beneath that do, most importantly multiprocessing and triton. A full tmp makes spawning processes extremely slow and painful and makes the whole code slower silently. How about making that environment variable configurable ?

@152334H
Copy link
Author

152334H commented Oct 25, 2024

Ah, I see. Forgive me then, I didn't realize the code was harmless when the folder was missing.

@152334H 152334H closed this Oct 25, 2024
@152334H
Copy link
Author

152334H commented Oct 25, 2024

wait, where in multiprocessing/triton do you see the use of TMP_DIR? I am unable to find hits for it via Github Search.

image

image

@152334H 152334H reopened this Oct 25, 2024
@152334H
Copy link
Author

152334H commented Oct 25, 2024

if the intended target is TMPDIR/TMP, it may be worthwhile to add it to EnvironmentArgs, with the hardcoded default as a special case

@BadrYoubiIdrissi
Copy link
Contributor

My bad ! Triton doesn't directly use it, we have this in our env setup also

# When using Triton, it attempts to locate prebuilt kernels in a cache
# located at ~/.triton/cache, but when that's backed by NFS this can fail
# with a "OSError: [Errno 116] Stale file handle" error. If we were to set
# it to a local directory it would belong to the first user who created it
# and it would fail for the job of any other successive user assigned to
# that machine. To avoid all this mess we use a temporary per-process cache.
triton_cache_dir = tempfile.mkdtemp()
atexit.register(shutil.rmtree, triton_cache_dir, ignore_errors=True)
env_vars["TRITON_CACHE_DIR"] = triton_cache_dir

But we do that before setting temp_dir so triton is probably not affected by that ! Let me see how to make this better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants