-
-
Notifications
You must be signed in to change notification settings - Fork 150
Description
This is along the same lines as #220 in that it manifests as trying to read a value as a 4GB value and getting unexpected EOF.
For a DICOM file with an explicit transfer syntax set, a value that is of a byte or string type (anything that falls to readBytes or readString - this includes e.g. OB, OW, and LT) and has length Undefined Length (i.e. 0xffffffff), the library will allocate a 4GB []byte and try and read to it, producing an error like:
readElement: error when reading value for element (0049,1001): unexpected EOF
I've included a trivial file that can trigger this case - it consists only of the DICOM header, a transfer syntax, patient name (JOHN^DOE) and a tag 0049,1001 (a tag we encountered in the wild that triggered this error) with value type OB and undefined length, with an actual length of 18 characters (I've had to put it in a zip archive, as github doesn't allow uploading of .dcm files): exemplar.dcm.zip
Getting an error in this case is expected. DCMTK will similarly reject this file:
❯ dcm2json exemplar.dcm
E: DcmItem: Parse error in (0049,1001): Illegal element with OB or OW Value Representation and undefined length encountered
E: dcm2json: error (Illegal element with OB or OW Value Representation and undefined length encountered) reading file: exemplar.dcm
The issue is that the library is allocating the 4GB []byte, even though it's discarded pretty quickly (and theoretically garbage collected). With our workflow, we can be reading and parsing several files concurrently, and normally we can use the size of the file to keep memory usage under control. However, because this can trigger even on small files, allocating (and subsequently attempting to write to, so it can't just be handwaved away to virtual memory) the large []byte can pretty quickly cause out-of-memory issues as the allocations & reads can all happen before the memory is freed in garbage collection.
I confirmed that the allocation is happening by adding a log line to readBytes right before the allocation here:
Lines 643 to 648 in 0fbaef5
| func (r *reader) readBytes(t tag.Tag, vr string, vl uint32) (Value, error) { | |
| // TODO: add special handling of PixelData | |
| if vr == vrraw.OtherByte || vr == vrraw.Unknown { | |
| data := make([]byte, vl) | |
| _, err := io.ReadFull(r.rawReader, data) | |
| return &bytesValue{value: data}, err |
I'm not entirely sure what the correct behaviour here is per the spec. From 7.1.1 Data Element Fields:
Undefined Lengths may be used for Data Elements having the Value Representation (VR) Sequence of Items (SQ) and Unknown (UN). For Data Elements with Value Representation OW or OB Undefined Length may be used depending on the negotiated Transfer Syntax
My interpretation of this is that Undefined Length should not be permitted for OB or OW, unless the transfer syntax specifies an exception (e.g. pixel data may be encoded as OB with Undefined Length).
Given that exceptions like pixel data should already be handled, I think a pretty safe fix would be to add a check to both readBytes and readString to preemptively reject the read if the vl is tag.VLUndefinedLength. I'll make a PR along those lines, but would appreciate your input on my interpretation of the spec in this instance.