Parsing OB/OW element of undefined length fails and temporarily allocates 4GB of memory

This is along the same lines as #220 in that it manifests as trying to read a value as a 4GB value and getting `unexpected EOF`.

For a DICOM file with an explicit transfer syntax set, a value that is of a byte or string type (anything that falls to `readBytes` or `readString` - this includes e.g. `OB`, `OW`, and `LT`) and has length Undefined Length (i.e. `0xffffffff`), the library will allocate a 4GB `[]byte` and try and read to it, producing an error like:
```
readElement: error when reading value for element (0049,1001): unexpected EOF
```

I've included a trivial file that can trigger this case - it consists only of the DICOM header, a transfer syntax, patient name (`JOHN^DOE`) and a tag `0049,1001` (a tag we encountered in the wild that triggered this error) with value type `OB` and undefined length, with an actual length of 18 characters (I've had to put it in a zip archive, as github doesn't allow uploading of .dcm files): [exemplar.dcm.zip](https://github.com/user-attachments/files/20350419/exemplar.dcm.zip)

Getting an error in this case is expected. DCMTK will similarly reject this file:
```
❯ dcm2json exemplar.dcm
E: DcmItem: Parse error in (0049,1001): Illegal element with OB or OW Value Representation and undefined length encountered
E: dcm2json: error (Illegal element with OB or OW Value Representation and undefined length encountered) reading file: exemplar.dcm
```

The issue is that the library is allocating the 4GB `[]byte`, even though it's discarded pretty quickly (and theoretically garbage collected). With our workflow, we can be reading and parsing several files concurrently, and normally we can use the size of the file to keep memory usage under control. However, because this can trigger even on small files, allocating (and subsequently attempting to write to, so it can't just be handwaved away to virtual memory) the large `[]byte` can pretty quickly cause out-of-memory issues as the allocations & reads can all happen before the memory is freed in garbage collection.

I confirmed that the allocation is happening by adding a log line to `readBytes` right before the allocation here:

https://github.com/suyashkumar/dicom/blob/0fbaef53037ef4c533a9e6004dfc5b7b224ecc39/read.go#L643-L648

---

I'm not entirely sure what the correct behaviour here is per the spec. From 7.1.1 Data Element Fields:
> Undefined Lengths may be used for Data Elements having the Value Representation (VR) Sequence of Items (SQ) and Unknown (UN). For Data Elements with Value Representation OW or OB Undefined Length may be used depending on the negotiated Transfer Syntax

My interpretation of this is that Undefined Length should _not_ be permitted for OB or OW, unless the transfer syntax specifies an exception (e.g. pixel data may be encoded as OB with Undefined Length).

Given that exceptions like pixel data should already be handled, I think a pretty safe fix would be to add a check to both `readBytes` and `readString` to preemptively reject the read if the `vl` is `tag.VLUndefinedLength`. I'll make a PR along those lines, but would appreciate your input on my interpretation of the spec in this instance.

	func (r *reader) readBytes(t tag.Tag, vr string, vl uint32) (Value, error) {
	// TODO: add special handling of PixelData
	if vr == vrraw.OtherByte \|\| vr == vrraw.Unknown {
	data := make([]byte, vl)
	_, err := io.ReadFull(r.rawReader, data)
	return &bytesValue{value: data}, err

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Parsing OB/OW element of undefined length fails and temporarily allocates 4GB of memory #349

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Parsing OB/OW element of undefined length fails and temporarily allocates 4GB of memory #349

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions