Commit 29a0765

Fix PIL image hashing to use actual bytes instead of object repr (#3331)
The convert_pil_to_hash function was hashing str(BytesIO), which includes the memory address, causing different hashes for the same image across runs. This made task hashes non-deterministic for tasks with images.

Reproducer:

```python
import hashlib
from io import BytesIO
from PIL import Image

img = Image.new('RGB', (2, 2), color='red')
img_bytes = BytesIO()
img.save(img_bytes, format="PNG")

# Buggy: hashes "<_io.BytesIO object at 0x...>"
print(str(img_bytes))  # <_io.BytesIO object at 0x1023d8bd0>
buggy_hash = hashlib.sha256(str(img_bytes).encode()).hexdigest()

# Fixed: hashes actual PNG bytes
print(img_bytes.getvalue()[:30])  # b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x02...'
fixed_hash = hashlib.sha256(img_bytes.getvalue()).hexdigest()
```

Running the same image twice:
- Buggy approach: different hashes each time due to memory address
- Fixed approach: consistent hash de33ddc09a0ba9b8...

This fix ensures deterministic task hashes for evaluations with images.
1 parent 7ddd966 commit 29a0765

1 file changed, +1 -1 lines changed

lm_eval/utils.py

Lines changed: 1 addition & 1 deletion
@@ -576,7 +576,7 @@ def convert_pil_to_hash(value):
 
     img_bytes = BytesIO()
     value.save(img_bytes, format="PNG")
-    return hashlib.sha256(str(img_bytes).encode()).hexdigest()
+    return hashlib.sha256(img_bytes.getvalue()).hexdigest()
 
 
 def convert_bytes_to_hash(value):
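
The behavior change in this diff can also be checked without Pillow. The following is a minimal sketch, not code from the repository: `buggy_hash`, `fixed_hash`, and `payload` are hypothetical names, and `payload` is a stand-in for encoded image data rather than a valid PNG. It compares the two hashing approaches on two separate buffers that hold identical bytes.

```python
import hashlib
from io import BytesIO


def buggy_hash(buf: BytesIO) -> str:
    # Mirrors the removed line: hashes the object's repr,
    # e.g. "<_io.BytesIO object at 0x...>", not its contents.
    return hashlib.sha256(str(buf).encode()).hexdigest()


def fixed_hash(buf: BytesIO) -> str:
    # Mirrors the added line: hashes the bytes held in the buffer.
    return hashlib.sha256(buf.getvalue()).hexdigest()


# Stand-in bytes (assumption): any identical payload shows the effect.
payload = b"\x89PNG\r\n\x1a\n stand-in for encoded image data"

# Two live buffers with the same contents but different memory addresses.
first, second = BytesIO(payload), BytesIO(payload)

same_fixed = fixed_hash(first) == fixed_hash(second)
same_buggy = buggy_hash(first) == buggy_hash(second)

print("fixed hashes match:", same_fixed)
print("buggy hashes match:", same_buggy)
```

Because `getvalue()` returns the full buffer contents regardless of the current stream position, the fixed approach does not need a `seek(0)` after `save()` has written to the buffer.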
