FWUP via nerves_hub_link can be run while fwup is performed via ssh, resulting in corruption #98

Open
dognotdog opened this issue Nov 29, 2021 · 2 comments

Comments

@dognotdog

I've been wondering how I corrupted an SD card before, and now I've found at least one way: with an active deployment running and its fwup in progress, I also ran a local fwup via ssh. It looks like the two update processes wrote to the "unused" partition simultaneously, resulting in complete garbage.

As an aside, is the on-disk checksum not verified on firmware updates?

@fhunleth
Contributor

fwup uses checksums to verify that the .fw file is correct, and by default it reads back what it writes. It does the latter progressively as it applies the update. This means that if one firmware update writes block 1 and another firmware update changes it after the first one has verified the write, you can get corruption. Even if fwup were to verify everything at the very end, this could still happen, but the window of time would be smaller.
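The race described above can be illustrated with a small simulation. This is only a sketch in Python, not fwup's actual code: the "device" is a `bytearray`, and `write_and_verify`, `second_updater`, and the 4-byte block size are hypothetical names and values chosen for the example.

```python
BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


def write_and_verify(device, image, interfere=None):
    """Write `image` block by block, reading each block back right after writing it."""
    for offset in range(0, len(image), BLOCK_SIZE):
        block = image[offset:offset + BLOCK_SIZE]
        device[offset:offset + len(block)] = block
        # Progressive verification: only this block is checked, and only now.
        if bytes(device[offset:offset + len(block)]) != block:
            return False
        # Hook that lets a simulated second updater run between our blocks.
        if interfere is not None:
            interfere(device, offset)
    return True


def second_updater(device, offset):
    """Hypothetical concurrent writer: clobbers block 0 after it has been verified."""
    if offset == BLOCK_SIZE:
        device[0:BLOCK_SIZE] = b"XXXX"


image = b"AAAABBBBCCCC"
device = bytearray(len(image))

reported_ok = write_and_verify(device, image, interfere=second_updater)
actually_ok = bytes(device) == image

print(reported_ok, actually_ok, bytes(device))
```

Every per-block read-back passes, so the first updater reports success, yet the final contents no longer match the image because block 0 was changed after it had been checked.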

What you're looking for is a mutex to only allow one firmware update to happen at a time. That doesn't exist yet. Given that we can change nerves_hub_link, fwup, and ssh_subsystem_fwup, it seems possible to make something. Could you make a proposal on how to implement it?

@dognotdog
Author

Having both a final checksum and some mutex mechanism seems worthwhile to me, but I know too little about each of those subsystems to propose an implementation. There are also probably some caveats: e.g. when "updating" an SD card outside of the device, we'd probably want that not to be blocked by an aborted update process, and similarly a crashed subsystem should not block a new update attempt.

It does seem like a robust mutex mechanism would be somewhat complicated if it has to account for such edge cases, and right now I'm unsure what could work. The only condition is that a partition is not marked good unless it has been checked in its entirety; how a retry is done is another matter.
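For the crashed-subsystem case specifically, one candidate is an advisory file lock, since the kernel drops an `flock` when the holding process exits or closes the descriptor. The following is a minimal sketch in Python for illustration only; it is not how nerves_hub_link, fwup, or ssh_subsystem_fwup work today, and `firmware_update_lock`, `UpdateInProgress`, and the lock file name are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location; a real device would need an agreed-upon path.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "firmware_update_example.lock")


class UpdateInProgress(Exception):
    """Raised when another firmware update already holds the lock."""


@contextmanager
def firmware_update_lock(path=LOCK_PATH):
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise UpdateInProgress(path) from None
        yield
    finally:
        # Closing the descriptor releases the lock. The kernel does the same
        # if the process crashes, so a dead updater cannot block a new attempt.
        os.close(fd)


# Usage: a second update attempt is rejected while the first holds the lock.
with firmware_update_lock():
    try:
        with firmware_update_lock():
            second_attempt = "acquired"
    except UpdateInProgress:
        second_attempt = "rejected"

# Once the first update has finished, the lock can be taken again.
with firmware_update_lock():
    third_attempt = "acquired"

print(second_attempt, third_attempt)
```

Because the lock lives in the running system rather than on the card, it would not affect an SD card that is written outside the device. It also only helps if every update path agrees to take the same lock before writing.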

Maybe writing a checksum before the data, another one after it, and then doing a final check of the pre and post checksums against the actual data? That way, assuming another sequential write interferes, the pre checksum would necessarily have been changed before final validation, thus blocking the validation. This way, the last one to complete wins, provided there was no interference.
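To make the idea concrete, here is a rough simulation of the pre/post checksum scheme. It is a sketch under assumptions, not an existing fwup feature: the `Partition` class, its `pre`/`post` fields, and the `begin_update`, `write_data`, and `finish_update` helpers are all hypothetical, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib


def digest(data):
    return hashlib.sha256(bytes(data)).hexdigest()


class Partition:
    """Hypothetical on-disk layout: a pre checksum, the data, and a post checksum."""

    def __init__(self):
        self.pre = None
        self.data = bytearray()
        self.post = None


def begin_update(part, image):
    # Step 1: record the checksum of the image about to be written.
    part.pre = digest(image)


def write_data(part, image):
    # Step 2: write the image itself.
    part.data = bytearray(image)


def finish_update(part, image):
    # Step 3: record the post checksum, then check both against the real data.
    part.post = digest(image)
    actual = digest(part.data)
    return part.pre == actual and part.post == actual


image_a = b"A" * 16
image_b = b"B" * 16
part = Partition()

# Updater A starts and writes its data.
begin_update(part, image_a)
write_data(part, image_a)

# Updater B starts before A has validated, overwriting the pre checksum.
begin_update(part, image_b)

# A's validation fails because the pre checksum is no longer A's.
a_valid = finish_update(part, image_a)

# B carries on undisturbed and validates successfully.
write_data(part, image_b)
b_valid = finish_update(part, image_b)

print(a_valid, b_valid)
```

In this interleaving, A refuses to mark the partition good and B, the last to complete, wins. The final check is still not atomic with respect to another writer, so a scheme like this would narrow the window for undetected corruption rather than close it.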
