FIPS issue submitting DDPJobDefinition job from the CodeFlare Notebook #357

Open
jbusche opened this issue Sep 25, 2023 · 2 comments
jbusche commented Sep 25, 2023

Describe the Bug

On a non-FIPS cluster, when you submit the guided-demos/2_basic_jobs DDPJobDefinition mnisttest job, it is scheduled as pending, switches to running, and then completes.

On a FIPS-enabled cluster, I'm seeing the following error instead (the full output is posted in a comment below):

Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory
....
ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK: 0.8.0

pip list | grep codeflare-sdk
codeflare-sdk            0.8.0

MCAD: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Instascale: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Codeflare Operator: v1.0.0-rc.1
Other: OpenShift 4.12.22 with FIPS enabled:
All master and worker nodes report FIPS enabled, for example:

ssh [email protected] cat /proc/sys/crypto/fips_enabled
1
and
ssh [email protected] cat /proc/sys/crypto/fips_enabled
1
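
(The same flag can also be checked from inside the notebook pod without ssh; a minimal Python sketch, assuming the standard procfs path is visible in the container:)

# Read the kernel FIPS flag; "1" means FIPS mode is enforced on the node.
with open("/proc/sys/crypto/fips_enabled") as f:
    print(f.read().strip())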

Steps to Reproduce the Bug

  1. Create a FIPS cluster
  2. Install ODH 1.9.0 and CodeFlare v1.0.0-rc1 as usual
  3. Install the kfdefs as usual
  4. Launch the codeflare notebook as usual
  5. Run the guided-demos/2_basic_jobs.ipynb notebook - it works up to the point where you submit the job, and then reports the Issue with path: /tmp/torchx_workspacel83oit3q error (the failing cell is sketched just below this list).
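
For reference, the failing cell (Cell [6] in the full traceback below) looks roughly like this - a minimal sketch assuming the names used by the guided demo, where cluster is the Ray cluster object created earlier in the notebook:

from codeflare_sdk.job.jobs import DDPJobDefinition

# Sketch of the submission cell from guided-demos/2_basic_jobs.ipynb;
# `cluster` is assumed to come from the cluster-creation cell earlier
# in the notebook.
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"},
)
job = jobdef.submit(cluster)  # on a FIPS cluster this raises the path error above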

What Have You Already Tried to Debug the Issue?

I tried it on a non-FIPS cluster and it worked fine. I also tried a second FIPS cluster to make sure it wasn't just a bad cluster.

Expected Behavior

I expected the job to be scheduled, run and complete successfully.

Screenshots, Console Output, Logs, etc.

The full codeflare-notebook error message is posted in a comment below.

Affected Releases

main

Additional Context

Add as applicable and when known:

  • Cloud: 1) AWS, 2) IBM Cloud, 3) Other (describe), or 4) on-premise: [1 - 4 + description?]
  • Kubernetes: OpenShift
  • OpenShift or K8s version: 4.12.22
  • Other relevant info: FIPS enabled
jbusche commented Sep 25, 2023

Full message:

/opt/app-root/lib64/python3.8/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.jimfips.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
The Ray scheduler does not support port mapping.
Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:474, in get_uri_for_directory(directory, excludes)
    472     raise ValueError(f"directory {directory} must be an existing directory")
--> 474 hash_val = _hash_directory(directory, directory, _get_excludes(directory, excludes))
    476 return "{protocol}://{pkg_name}.zip".format(
    477     protocol=Protocol.GCS.value, pkg_name=RAY_PKG_PREFIX + hash_val.hex()
    478 )

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:175, in _hash_directory(root, relative_path, excludes, logger)
    174 excludes = [] if excludes is None else [excludes]
--> 175 _dir_travel(root, excludes, handler, logger=logger)
    176 return hash_val

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:128, in _dir_travel(path, excludes, handler, logger)
    127     logger.error(f"Issue with path: {path}")
--> 128     raise e
    129 if path.is_dir():

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:125, in _dir_travel(path, excludes, handler, logger)
    124 try:
--> 125     handler(path)
    126 except Exception as e:

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:152, in _hash_directory.<locals>.handler(path)
    151 def handler(path: Path):
--> 152     md5 = hashlib.md5()
    153     md5.update(str(path.relative_to(relative_path)).encode())

ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In [6], line 6
      1 jobdef = DDPJobDefinition(
      2     name="mnisttest",
      3     script="mnist.py",
      4     scheduler_args={"requirements": "requirements.txt"}
      5 )
----> 6 job = jobdef.submit(cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:166, in DDPJobDefinition.submit(self, cluster)
    165 def submit(self, cluster: "Cluster" = None) -> "Job":
--> 166     return DDPJob(self, cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:174, in DDPJob.__init__(self, job_definition, cluster)
    172 self.cluster = cluster
    173 if self.cluster:
--> 174     self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
    175 else:
    176     self._app_handle = torchx_runner.schedule(
    177         job_definition._dry_run_no_cluster()
    178     )

File /opt/app-root/lib64/python3.8/site-packages/torchx/runner/api.py:278, in Runner.schedule(self, dryrun_info)
    271 with log_event(
    272     "schedule",
    273     scheduler,
    274     app_image=app_image,
    275     runcfg=json.dumps(cfg) if cfg else None,
    276 ) as ctx:
    277     sched = self._scheduler(scheduler)
--> 278     app_id = sched.schedule(dryrun_info)
    279     app_handle = make_app_handle(scheduler, self._name, app_id)
    280     app = none_throws(dryrun_info._app)

File /opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py:239, in RayScheduler.schedule(self, dryrun_info)
    237 # 1. Submit Job via the Ray Job Submission API
    238 try:
--> 239     job_id: str = client.submit_job(
    240         submission_id=cfg.app_id,
    241         # we will pack, hash, zip, upload, register working_dir in GCS of ray cluster
    242         # and use it to configure your job execution.
    243         entrypoint="python3 ray_driver.py",
    244         runtime_env=runtime_env,
    245     )
    247 finally:
    248     if dirpath.startswith(tempfile.gettempdir()):

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/job/sdk.py:203, in JobSubmissionClient.submit_job(self, entrypoint, job_id, runtime_env, metadata, submission_id, entrypoint_num_cpus, entrypoint_num_gpus, entrypoint_resources)
    200 metadata = metadata or {}
    201 metadata.update(self._default_metadata)
--> 203 self._upload_working_dir_if_needed(runtime_env)
    204 self._upload_py_modules_if_needed(runtime_env)
    206 # Run the RuntimeEnv constructor to parse local pip/conda requirements files.

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py:398, in SubmissionClient._upload_working_dir_if_needed(self, runtime_env)
    390 def _upload_fn(working_dir, excludes, is_file=False):
    391     self._upload_package_if_needed(
    392         working_dir,
    393         include_parent_dir=False,
    394         excludes=excludes,
    395         is_file=is_file,
    396     )
--> 398 upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:68, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     66 package_path = Path(working_dir)
     67 if not package_path.exists() or package_path.suffix != ".zip":
---> 68     raise ValueError(
     69         f"directory {package_path} must be an existing "
     70         "directory or a zip package"
     71     )
     73 pkg_uri = get_uri_for_package(package_path)
     74 try:

ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package
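
The root cause is the first ValueError above: Ray's working-directory packaging fingerprints each file with hashlib.md5(), and on a FIPS-enabled host OpenSSL refuses to initialize MD5, so get_uri_for_directory() fails before the workspace is uploaded. The misleading "must be an existing directory or a zip package" message is just the fallback branch in upload_working_dir_if_needed() reacting to that failure. Below is a minimal sketch of the usual workaround - an assumption about what a patched build would do, relying on the usedforsecurity keyword that is standard in Python 3.9+ and in RHEL's FIPS-patched interpreters:

import hashlib

def fingerprint(data: bytes) -> str:
    # Plain hashlib.md5() raises "[digital envelope routines:
    # EVP_DigestInit_ex] disabled for FIPS" on FIPS-enabled builds.
    # usedforsecurity=False marks the digest as a non-cryptographic
    # fingerprint, which is all the directory hashing needs.
    try:
        md5 = hashlib.md5(data, usedforsecurity=False)
    except TypeError:
        # Interpreters without the keyword only accept the plain call.
        md5 = hashlib.md5(data)
    return md5.hexdigest()

print(fingerprint(b"/tmp/example/path"))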

jbusche commented Oct 20, 2023

@KPostOffice created a special ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl that, when installed in my CodeFlare Notebook running on the FIPS cluster, allows the job to run successfully:

While on the FIPS cluster:

  1. In the CodeFlare notebook from the ODH dashboard, install Kevin's custom wheel file:
pip install ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
  2. Run the regular 2_basic_jobs guided demo; the job submission to Ray now succeeds without the /tmp error messages:
The Ray scheduler does not support port mapping.
2023-10-20 19:20:56,766	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_2fd8353f5ce24329.zip.
2023-10-20 19:20:56,767	INFO packaging.py:530 -- Creating a file package for local directory '/tmp/torchx_workspacem7zl9kwn'.
  3. The job completes as expected:
AppStatus:
  msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
  - SUCCEEDED
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 4
      structured_error_msg: <NONE>
    role: ray
  state: SUCCEEDED (4)
  structured_error_msg: <NONE>
  ui_url: null
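
For completeness, the AppStatus block above is what the SDK reports when polling the submitted job - a minimal sketch, assuming the job object from the submission cell earlier and the status/logs helpers as they exist in codeflare-sdk 0.8.0:

# Poll the submitted DDPJob; status() returns the AppStatus shown above.
print(job.status())

# Optionally pull the driver logs once the job has finished.
print(job.logs())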
