GEDS Spilling prematurely when unable to access S3 (MinIO) storage. #40

@ojundi03

Description

Describe the bug

  • This bug was encountered while using GEDS-HDFS as tier-2 storage for Pravega.
  • While configured to spill to MinIO (S3), GEDS spills earlier than expected during a MinIO outage. With the working directory set to a drive with 20GB of storage space, the expected behaviour is that GEDS fills up to ~70% of its capacity (~14GB) during the outage before throttling begins and errors are encountered.
  • In reality, only ~2.4GB is written to GEDS before throttling occurs.
  • In the logs, cURL errors 7 (Couldn't connect to server) and 28 (Timeout was reached) appear repeatedly. In particular, the first instance of error 28 coincides with the start of throttling.
  • I believe GEDS may be able to last significantly longer during a MinIO outage, and that this is being hindered by some sort of timeout.
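To make the gap between expected and observed behaviour concrete, here is a small arithmetic sketch using only the figures quoted above (20GB drive, ~70% fill target, ~2.4GB observed). The function name `spill_summary` and its fields are illustrative and not part of GEDS.

```python
def spill_summary(capacity_gb: float, expected_fill_ratio: float, observed_gb: float) -> dict:
    """Compare the expected fill level with what was actually written."""
    expected_gb = capacity_gb * expected_fill_ratio
    return {
        "expected_gb": expected_gb,                  # fill level at which throttling was expected
        "observed_ratio": observed_gb / capacity_gb, # fraction of the drive actually used
        "shortfall_gb": expected_gb - observed_gb,   # capacity left unused before throttling
    }


# Figures taken from the report: 20GB drive, ~70% target, ~2.4GB observed.
summary = spill_summary(capacity_gb=20.0, expected_fill_ratio=0.70, observed_gb=2.4)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

With these inputs, throttling starts at roughly 12% of the drive rather than the expected ~70%, leaving about 11.6GB unused.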

To Reproduce

  1. Follow steps 1 and 2 at https://github.com/cloudskin-eu/pravega-geds to set up the Pravega-GEDS deployment.
  2. Run /setup-scripts/pravega-geds-install.sh to install the GEDS-integrated Pravega deployment on Kubernetes.
  3. Navigate to /experiment and run run-experiment.sh.
  4. Logs for the Pravega segment-store pod can be viewed through kubectl logs pravega-pravega-segmentstore-0. The error(s) should be visible in the logs.
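To find where the first timeout appears relative to the connection errors, the saved segment-store log can be scanned for the `curlCode: N, message` pattern that appears in the traces in this report. This is a minimal sketch: the helper `scan_curl_errors` is hypothetical, and it reports line positions only, since the timestamp format of the log is not shown here.

```python
import re
from collections import Counter

# Matches the suffix seen in the GEDS IOException messages,
# e.g. "curlCode: 28, Timeout was reached".
CURL_RE = re.compile(r"curlCode: (\d+), (.+)$")


def scan_curl_errors(lines):
    """Return (occurrence count per curl code, index of first line per curl code)."""
    counts = Counter()
    first_seen = {}
    for index, line in enumerate(lines):
        match = CURL_RE.search(line.rstrip("\n"))
        if match is None:
            continue
        code = int(match.group(1))
        counts[code] += 1
        first_seen.setdefault(code, index)  # keep only the earliest position
    return counts, first_seen


# Sample lines copied from the traces in this report.
sample = [
    "Caused by: java.io.IOException: Unable to file status: "
    "_system/containers/_sysjournal.container4.snapshot_info: "
    "curlCode: 7, Couldn't connect to server",
    "\tat com.ibm.geds.GEDS.nativeStatus(Native Method)",
    "Caused by: java.io.IOException: Unable to file status: "
    "_system/containers/_sysjournal.container7.snapshot_info: "
    "curlCode: 28, Timeout was reached",
]
counts, first_seen = scan_curl_errors(sample)
print(dict(counts), first_seen)
```

For a real run, the output of `kubectl logs pravega-pravega-segmentstore-0` could be redirected to a file and its lines passed to `scan_curl_errors` in place of `sample`.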

Additional information

Configuration Used:
GEDS is configured using environment variables:

        options:
          pravegaservice.storage.layout: "CHUNKED_STORAGE"
          pravegaservice.storage.impl.name: "HDFS"
          hdfs.connect.uri: "hdfs://tier-2-geds"
          hdfs.fs.impl: "com.ibm.geds.hdfs.GEDSHadoopFileSystem"
        env:
          GEDS_METADATASERVER: "geds-metadataserver:4381"
          GEDS_LOCAL_STORAGE_PATH: "/tmp/pravega/cache"
          AWS_ACCESS_KEY_ID: "miniostorage"
          AWS_SECRET_ACCESS_KEY: "miniostorage"
          AWS_ENDPOINT_URL: "http://minio.pravega.svc.cluster.local:80"
          GEDS_CONFIGURE_S3_USING_ENV: "1"
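When reproducing the outage, it can help to check from inside the pod whether the configured `AWS_ENDPOINT_URL` refuses connections or silently times out, since those two cases roughly correspond to cURL errors 7 and 28. The sketch below uses only the Python standard library. The helper names and the 2-second timeout are assumptions, and the mapping to cURL codes is approximate (for example, cURL reports a DNS failure with a different code).

```python
import socket
from urllib.parse import urlparse


def endpoint_host_port(url: str):
    """Split an endpoint URL into (host, port), defaulting the port by scheme."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def probe(host: str, port: int, timeout_s: float = 2.0) -> str:
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "reachable"
    except socket.timeout:
        return "timeout"      # roughly analogous to cURL error 28
    except OSError:
        return "unreachable"  # roughly analogous to cURL error 7


# Endpoint value taken from the configuration above.
host, port = endpoint_host_port("http://minio.pravega.svc.cluster.local:80")
print(host, port)
```

Calling `probe(host, port)` before and during the MinIO outage would show which failure mode the segment store is likely to see at that point.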

Curl Code 7

java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.chunklayer.ChunkStorageException: checkExists
	at io.pravega.segmentstore.storage.chunklayer.AsyncBaseChunkStorage.lambda$execute$13(AsyncBaseChunkStorage.java:751)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at io.pravega.common.concurrent.ThreadPoolScheduledExecutorService$ScheduledRunnable.run(ThreadPoolScheduledExecutorService.java:209)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.pravega.segmentstore.storage.chunklayer.ChunkStorageException: checkExists
	at io.pravega.storage.hdfs.HDFSChunkStorage.convertException(HDFSChunkStorage.java:367)
	at io.pravega.storage.hdfs.HDFSChunkStorage.checkExists(HDFSChunkStorage.java:169)
	at io.pravega.segmentstore.storage.chunklayer.BaseChunkStorage.lambda$checkExistsAsync$3(BaseChunkStorage.java:89)
	at io.pravega.segmentstore.storage.chunklayer.AsyncBaseChunkStorage.lambda$execute$13(AsyncBaseChunkStorage.java:747)
	... 6 common frames omitted
Caused by: java.io.IOException: Unable to file status: _system/containers/_sysjournal.container4.snapshot_info: curlCode: 7, Couldn't connect to server
	at com.ibm.geds.GEDS.nativeStatus(Native Method)
	at com.ibm.geds.GEDS.status(GEDS.java:260)
	at com.ibm.geds.hdfs.GEDSHadoopFileSystem.getFileStatus(GEDSHadoopFileSystem.java:154)
	at io.pravega.storage.hdfs.HDFSChunkStorage.checkExists(HDFSChunkStorage.java:164)
	... 8 common frames omitted
_system/containers/_sysjournal.container4.snapshot_info

Curl Code 28

	at io.pravega.segmentstore.storage.chunklayer.AsyncBaseChunkStorage.lambda$execute$13(AsyncBaseChunkStorage.java:751)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at io.pravega.common.concurrent.ThreadPoolScheduledExecutorService$ScheduledRunnable.run(ThreadPoolScheduledExecutorService.java:209)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.pravega.segmentstore.storage.chunklayer.ChunkStorageException: checkExists
	at io.pravega.storage.hdfs.HDFSChunkStorage.convertException(HDFSChunkStorage.java:367)
	at io.pravega.storage.hdfs.HDFSChunkStorage.checkExists(HDFSChunkStorage.java:169)
	at io.pravega.segmentstore.storage.chunklayer.BaseChunkStorage.lambda$checkExistsAsync$3(BaseChunkStorage.java:89)
	at io.pravega.segmentstore.storage.chunklayer.AsyncBaseChunkStorage.lambda$execute$13(AsyncBaseChunkStorage.java:747)
	... 6 common frames omitted
Caused by: java.io.IOException: Unable to file status: _system/containers/_sysjournal.container7.snapshot_info: curlCode: 28, Timeout was reached
	at com.ibm.geds.GEDS.nativeStatus(Native Method)
	at com.ibm.geds.GEDS.status(GEDS.java:260)
	at com.ibm.geds.hdfs.GEDSHadoopFileSystem.getFileStatus(GEDSHadoopFileSystem.java:154)
	at io.pravega.storage.hdfs.HDFSChunkStorage.checkExists(HDFSChunkStorage.java:164)
	... 8 common frames omitted
_system/containers/_sysjournal.container7.snapshot_info
