Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some files randomally have appended null on read #1586

Closed
orgads opened this issue Dec 4, 2024 · 16 comments
Closed

Some files randomally have appended null on read #1586

orgads opened this issue Dec 4, 2024 · 16 comments
Assignees
Milestone

Comments

@orgads
Copy link

orgads commented Dec 4, 2024

I have 2 Node.JS services on kubernetes - one writes json files to Azure blob, and the other reads them.

I use blob.csi.azure.com driver, which uses blobfuse2 version 2.3.2.

Mount options are -o allow_other --file-cache-timeout-in-seconds=120.

The read and write are straightforward:

await fs.writeFile(path, JSON.stringify(obj, null, 4), 'utf8');
const data = JSON.parse(await fs.readFile(path, 'utf8'));

We read up to 10 files concurrently.

Sometimes some of the files are read with an extra null, although on the storage they don't have it. When I read the same file directly from the storage, or from a pod on another node, it doesn't have the extra null.

Example:

{
  "foo": "bar",
  "blubb": 17
}\0

This results in failure on JSON.parse.

This is not a race between read and write, because the first failing read happened ~30 minutes after the file was written (it is only written once).

After this failure, the same files keep failing, even after several hours (when blobfuse cache is already invalidated).

I can see this null also when running hexdump -C on the file from the command line, both in the pod and on the k8s node that hosts it.

The only way I found to solve this was to drop OS cache using echo 3 > /proc/sys/vm/drop_caches.

Any hints how to debug this further?

@vibhansa-msft
Copy link
Member

How are these files originally created ? Was it using blobfuse only or some other tool.
If there is a consistent way to reproduce this, kindly share the steps.

@orgads
Copy link
Author

orgads commented Dec 9, 2024

Created with blobfuse. It's random, I have no idea how to reproduce this.

@vibhansa-msft
Copy link
Member

There were some issues with blobfuse writing using block-cache which were resolved with Blobfuse 2.3.2 version. Were these files created with a prior version?

@ashruti-msft
Copy link
Collaborator

Additionally, could you confirm whether you are using file-cache or block-cache for file write/read operations? I noticed that you were using the --file-cache-timeout-in-seconds=120 flag and just wanted to verify

@orgads
Copy link
Author

orgads commented Dec 9, 2024

@vibhansa-msft The writer is also version 2.3.2. It runs on arm64 if it matters. The reader is on amd64.

@ashruti-msft Looks like file-cache. This is the blobfuse2 command from the node:

blobfuse2 mount /var/lib/kubelet/plugins/kubernetes.io/csi/blob.csi.azure.com/<id>/globalmount --file-cache-timeout-in-seconds=120 --log-level LOG_DEBUG -o allow_other --cancel-list-on-mount-seconds=10 --empty-dir-check=false --tmp-path=/mnt/mystorage --container-name=container --pre-mount-validate=true --use-https=true --ignore-open-flags=true

@ashruti-msft
Copy link
Collaborator

This issue hasn't been reported before for file-cache. Could you let me know how frequently you encounter this scenario? I'll attempt to reproduce the issue, but since it occurs randomly, it might be a bit tricky. Some hints to reproduce this issue would be to focus on the number of reads/writes it takes to start showing this problem. Additionally, if you have the debug logs from the last time you encountered this issue, please share them as they might help.

@orgads
Copy link
Author

orgads commented Dec 15, 2024

It happened again 2 days ago, and now I have trace logs from both the writer and reader. How can I share them privately?

@vibhansa-msft
Copy link
Member

You can share them via email: [email protected].

@orgads
Copy link
Author

orgads commented Dec 16, 2024

@vibhansa-msft I tried, but my mail was rejected:

The group blobfusedev only accepts messages from people in its organization or on its allowed senders list, and your email address isn't on the list.

@orgads
Copy link
Author

orgads commented Dec 16, 2024

I tried to inspect the logs. I compared Dec 13 which was the first failure vs Dec 15 which succeeded. Found no difference except Request-Id and the date in the requests.

@orgads
Copy link
Author

orgads commented Dec 16, 2024

I sent a link to your email. If you didn't get it, let me know and I'll upload it to a storage.

@ashruti-msft
Copy link
Collaborator

I received the logs thanks

@ashruti-msft
Copy link
Collaborator

Hi @orgads,
I can see in the log file for Dec13 "blobfuse2.log-20241214" that we are getting delete and open calls at the exact same time "Dec 13 02:00:04" which might create inconsistencies since we cache the data for fast retrieval.
One workaround can possibly be to use direct-io mode and disable caching:
Use these flags:
-o direct_io
--attr-timeout=0
--entry-timeout=0
--negative-timeout=0
--attr-cache-timeout=0

Try to find the steps to reproduce this issue as without that we won't be able to debug further.

@orgads
Copy link
Author

orgads commented Dec 17, 2024

Hi @ashruti-msft, thanks for your response.

I don't fully understand.

  1. Do you mean we delete and open the same file at the same time?
  2. The issue happened at 02:00:00, and what you refer to happened 4 seconds later. How can this be related?

@ashruti-msft
Copy link
Collaborator

Hi @orgads,

At the 02:00:00 time mark I can see we are downloading the file multiple times as its not found in local cache. I do not see an explicit delete call but after that timemark I see the behaviour I described later. But like I said it's tough to debug the issue from our side. It would best if there were a set of steps reproducing the issue consistently.

@vibhansa-msft
Copy link
Member

Closing this as there is no update for a month. Feel free to reopen once you have required information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants