FIPS issue submitting DDPJobDefinition job from the CodeFlare Notebook #357

jbusche · 2023-09-25T19:08:49Z

Describe the Bug

On a non-FIPS cluster, when you submit the guided-demos/2_basic_jobs DDPJobDefinition mnisttest, the job is scheduled as pending, switches to running, and then completes.

On a FIPS cluster, I'm seeing the following error (I'll post the entire output in a comment below):

Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory
....
ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Codeflare SDK:

pip list |grep codeflare-sdk
codeflare-sdk            0.8.0

MCAD: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Instascale: Unknown, integrated into CodeFlare Operator v1.0.0-rc.1
Codeflare Operator: v1.0.0-rc.1
Other: OpenShift 4.12.22 with FIPS enabled:
All master and worker nodes report FIPS enabled, for example:

ssh [email protected] cat /proc/sys/crypto/fips_enabled
1
and
ssh [email protected] cat /proc/sys/crypto/fips_enabled
1
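
The same flag can also be read from inside the notebook instead of over ssh. This is a minimal sketch, assuming the notebook pod exposes the kernel's /proc/sys/crypto/fips_enabled file; the helper name `fips_enabled` is hypothetical and not part of any library.

```python
from pathlib import Path


def fips_enabled(flag_path: str = "/proc/sys/crypto/fips_enabled") -> bool:
    """Return True if the kernel flag file contains "1".

    A missing or unreadable file is treated as "not enabled".
    """
    try:
        return Path(flag_path).read_text().strip() == "1"
    except OSError:
        return False


print("FIPS enabled:", fips_enabled())
```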

Steps to Reproduce the Bug

Create a FIPS cluster
Install ODH 1.9.0 and CodeFlare v1.0.0-rc.1 as usual
Install the kfdefs as usual
Launch the codeflare notebook as usual
Run guided-demos/2_basic_jobs.ipynb. It works up to the point where you submit the job, then reports the error "Issue with path: /tmp/torchx_workspacel83oit3q".

What Have You Already Tried to Debug the Issue?

I tried it on non-FIPS and it worked fine. I also tried a second FIPS cluster to make sure it wasn't just a bad cluster.

Expected Behavior

I expected the job to be scheduled, run and complete successfully.

Screenshots, Console Output, Logs, etc.

More detail of the codeflare-notebook error message will be posted below.

Affected Releases

main

Additional Context

Add as applicable and when known:

Cloud: 1) AWS, 2) IBM Cloud, 3) Other (describe), or 4) on-premise: [1 - 4 + description?]
Kubernetes: 1) OpenShift
OpenShift or K8s version: 4.12.22
Other relevant info
Enabled with FIPS


jbusche · 2023-09-25T19:09:21Z

Full message:

/opt/app-root/lib64/python3.8/site-packages/urllib3/connectionpool.py:1045: InsecureRequestWarning: Unverified HTTPS request is being made to host 'api.jimfips.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
The Ray scheduler does not support port mapping.
Issue with path: /tmp/torchx_workspacel83oit3q
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:64, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     63 try:
---> 64     working_dir_uri = get_uri_for_directory(working_dir, excludes=excludes)
     65 except ValueError:  # working_dir is not a directory

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:474, in get_uri_for_directory(directory, excludes)
    472     raise ValueError(f"directory {directory} must be an existing directory")
--> 474 hash_val = _hash_directory(directory, directory, _get_excludes(directory, excludes))
    476 return "{protocol}://{pkg_name}.zip".format(
    477     protocol=Protocol.GCS.value, pkg_name=RAY_PKG_PREFIX + hash_val.hex()
    478 )

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:175, in _hash_directory(root, relative_path, excludes, logger)
    174 excludes = [] if excludes is None else [excludes]
--> 175 _dir_travel(root, excludes, handler, logger=logger)
    176 return hash_val

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:128, in _dir_travel(path, excludes, handler, logger)
    127     logger.error(f"Issue with path: {path}")
--> 128     raise e
    129 if path.is_dir():

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:125, in _dir_travel(path, excludes, handler, logger)
    124 try:
--> 125     handler(path)
    126 except Exception as e:

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/packaging.py:152, in _hash_directory.<locals>.handler(path)
    151 def handler(path: Path):
--> 152     md5 = hashlib.md5()
    153     md5.update(str(path.relative_to(relative_path)).encode())

ValueError: [digital envelope routines: EVP_DigestInit_ex] disabled for FIPS

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In [6], line 6
      1 jobdef = DDPJobDefinition(
      2     name="mnisttest",
      3     script="mnist.py",
      4     scheduler_args={"requirements": "requirements.txt"}
      5 )
----> 6 job = jobdef.submit(cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:166, in DDPJobDefinition.submit(self, cluster)
    165 def submit(self, cluster: "Cluster" = None) -> "Job":
--> 166     return DDPJob(self, cluster)

File /opt/app-root/lib64/python3.8/site-packages/codeflare_sdk/job/jobs.py:174, in DDPJob.__init__(self, job_definition, cluster)
    172 self.cluster = cluster
    173 if self.cluster:
--> 174     self._app_handle = torchx_runner.schedule(job_definition._dry_run(cluster))
    175 else:
    176     self._app_handle = torchx_runner.schedule(
    177         job_definition._dry_run_no_cluster()
    178     )

File /opt/app-root/lib64/python3.8/site-packages/torchx/runner/api.py:278, in Runner.schedule(self, dryrun_info)
    271 with log_event(
    272     "schedule",
    273     scheduler,
    274     app_image=app_image,
    275     runcfg=json.dumps(cfg) if cfg else None,
    276 ) as ctx:
    277     sched = self._scheduler(scheduler)
--> 278     app_id = sched.schedule(dryrun_info)
    279     app_handle = make_app_handle(scheduler, self._name, app_id)
    280     app = none_throws(dryrun_info._app)

File /opt/app-root/lib64/python3.8/site-packages/torchx/schedulers/ray_scheduler.py:239, in RayScheduler.schedule(self, dryrun_info)
    237 # 1. Submit Job via the Ray Job Submission API
    238 try:
--> 239     job_id: str = client.submit_job(
    240         submission_id=cfg.app_id,
    241         # we will pack, hash, zip, upload, register working_dir in GCS of ray cluster
    242         # and use it to configure your job execution.
    243         entrypoint="python3 ray_driver.py",
    244         runtime_env=runtime_env,
    245     )
    247 finally:
    248     if dirpath.startswith(tempfile.gettempdir()):

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/job/sdk.py:203, in JobSubmissionClient.submit_job(self, entrypoint, job_id, runtime_env, metadata, submission_id, entrypoint_num_cpus, entrypoint_num_gpus, entrypoint_resources)
    200 metadata = metadata or {}
    201 metadata.update(self._default_metadata)
--> 203 self._upload_working_dir_if_needed(runtime_env)
    204 self._upload_py_modules_if_needed(runtime_env)
    206 # Run the RuntimeEnv constructor to parse local pip/conda requirements files.

File /opt/app-root/lib64/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py:398, in SubmissionClient._upload_working_dir_if_needed(self, runtime_env)
    390 def _upload_fn(working_dir, excludes, is_file=False):
    391     self._upload_package_if_needed(
    392         working_dir,
    393         include_parent_dir=False,
    394         excludes=excludes,
    395         is_file=is_file,
    396     )
--> 398 upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)

File /opt/app-root/lib64/python3.8/site-packages/ray/_private/runtime_env/working_dir.py:68, in upload_working_dir_if_needed(runtime_env, scratch_dir, logger, upload_fn)
     66 package_path = Path(working_dir)
     67 if not package_path.exists() or package_path.suffix != ".zip":
---> 68     raise ValueError(
     69         f"directory {package_path} must be an existing "
     70         "directory or a zip package"
     71     )
     73 pkg_uri = get_uri_for_package(package_path)
     74 try:

ValueError: directory /tmp/torchx_workspacel83oit3q must be an existing directory or a zip package
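
Reading the traceback above: the first failure is the hashlib.md5() call at packaging.py:152, which raises ValueError ("disabled for FIPS"). That ValueError is then caught by the `except ValueError:  # working_dir is not a directory` branch at working_dir.py:65, so the final message about /tmp is misleading; the directory exists, but MD5 is blocked. The following is a minimal sketch for confirming this from a notebook cell. The function name `probe_md5` is hypothetical. The `usedforsecurity` keyword is only accepted on Python 3.9 and later, so on the python3.8 environment shown here the second probe is expected to report that it is unsupported unless the distribution has patched it in.

```python
import hashlib


def probe_md5() -> dict:
    """Try to construct an MD5 hash object in two ways and report each outcome."""
    results = {}

    # Plain constructor: this is the call that fails in the traceback.
    try:
        hashlib.md5()
        results["md5()"] = "ok"
    except ValueError as exc:
        results["md5()"] = f"blocked: {exc}"

    # Non-security use: accepted as a keyword on Python 3.9+ only.
    try:
        hashlib.md5(usedforsecurity=False)
        results["md5(usedforsecurity=False)"] = "ok"
    except TypeError:
        results["md5(usedforsecurity=False)"] = "keyword not supported by this Python"
    except ValueError as exc:
        results["md5(usedforsecurity=False)"] = f"blocked: {exc}"

    return results


for call, outcome in probe_md5().items():
    print(f"{call}: {outcome}")
```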

jbusche · 2023-10-20T19:45:57Z

@KPostOffice created a special ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl. With it installed in my CodeFlare Notebook running on the FIPS cluster, the job ran successfully.

While on the FIPS cluster:

In the CodeFlare notebook launched from the ODH dashboard, install Kevin's custom wheel file:

pip install ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl

Run the regular 2_basic_jobs guided demo. The step that submits to Ray now succeeds without the /tmp error messages:

The Ray scheduler does not support port mapping.
2023-10-20 19:20:56,766	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_2fd8353f5ce24329.zip.
2023-10-20 19:20:56,767	INFO packaging.py:530 -- Creating a file package for local directory '/tmp/torchx_workspacem7zl9kwn'.

The job completes as expected:

AppStatus:
  msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
  - SUCCEEDED
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 4
      structured_error_msg: <NONE>
    role: ray
  state: SUCCEEDED (4)
  structured_error_msg: <NONE>
  ui_url: null
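
This comment does not say what the custom wheel changed. For illustration only, here is a sketch of how a working directory could be fingerprinted without MD5, following the same idea as the `_hash_directory` handler in the earlier traceback (hash each relative path, then the file contents). It assumes SHA-256 is an acceptable substitute because it is FIPS-approved. The function name `hash_directory_sha256` is hypothetical and this is not Ray's code.

```python
import hashlib
from pathlib import Path


def hash_directory_sha256(root: str) -> str:
    """Return a hex digest covering relative paths and file contents under root."""
    root_path = Path(root)
    digest = hashlib.sha256()
    # Sort entries so the digest does not depend on filesystem ordering.
    for path in sorted(root_path.rglob("*")):
        digest.update(str(path.relative_to(root_path)).encode())
        if path.is_file():
            with path.open("rb") as f:
                # Read in chunks so large files are not loaded into memory at once.
                for chunk in iter(lambda: f.read(8192), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_directory_sha256("/tmp/some_workspace")` on the same unchanged directory twice gives the same digest, which is the property a package URI derived from the hash relies on.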

github-actions bot added the triage/needs-triage label Sep 25, 2023

dimakis removed the triage/needs-triage label Sep 29, 2023

KPostOffice mentioned this issue Oct 20, 2023

[Core] MD5 not supported as is for FIPS enabled machines ray-project/ray#40534

Closed
