R serialization refactor #240

aholmes · 2025-06-30T16:41:00Z

This PR is a refactor of the Ligare.programming R serialization methods. This is to support converting Python types and data into types and data that R can ingest with the corresponding R package found here.

Some key considerations for review:

serializers.py is effectively the public interface of the type conversion and serialization. Note that nothing from type_converters.py is "exported."
The serializer's default behaviors are a combination of "best guess" and practical requirements for CEV. For example, from_parts attempts to convert data via a few simple type checks. This can produce wrong results, but is generally what CEV needs and so is assumed to be correct behavior.
- This can be controlled in applications where these assumptions are wrong. For example, while it is not immediately clear that this works (this is a gap in documentation and tests), this check allows for usage of this library similar to how this test instantiates data to be serialized. I do something similar to this in CEV to construct a list( c(seq(...)), c(seq(...)) )

def get(value: Any):
    x = _number(value, comma_separated=True)
    return Vector(Seq(x))

# some code to get the values from request args, followed by ...

from_parts(data, "key_dest", ["seq1", "seq2"], composite_type=List)

This results in a var: List[Vector[Seq[int]]] for which serialize() returns list(c(seq(1,2,3)),c(seq(4,5,6))).

Currently the tests best describe usage and expectations. There are some gaps in specific usages, e.g., direct tests for _convert(...)'s behavior, but there should be sufficient tests to get an understanding of how this PR works.
The docblocks are a little out-of-date. make Sphinx-doctest does not pass.
CEV can be reviewed for usage in a real application.
This library does some sanitization of input data to avoid any kind of injection, but pointing out areas that I missed will be helpful.

nwiltsie

I've poked around and made some comments, although I haven't managed to wrap my head around everything yet.

nwiltsie · 2025-06-30T16:59:35Z

src/programming/Ligare/programming/R/type_conversion.py

+        if comma_separated or vector:
+            if not (safe_str := safe_comma_separated_string_regex.sub("", value)):
+                # if the safe_str is empty but value is not,
+                # then all characters were stripped and the
+                # supplied value is not "safe." Return None
+                # because this is invalid.
+                return String(None)

-       >>> string_from_csv(",") is None
-       True
+            items = (item for item in safe_str.split(","))
+            if vector:
+                return Vector([String(item) for item in items])
+            else:
+                return Composite([String(item) for item in items])


question (blocking): Do you ever expect to have quoted values? Your SAFE_COMMA_SEPARATED_STRING_PATTERN regex will specifically exclude quotation marks, and using .split(",") wouldn't respect them anyway.

For CEV's use case, no, not at the moment. While someone may want to use quotations for, e.g., xlab.label, they are explicitly stripped out as a brute force way to avoid breaking out of eval here.

nwiltsie · 2025-06-30T17:01:30Z

src/programming/Ligare/programming/R/type_conversion.py

 NULL = "__NULL__"
 FALSE = "FALSE"
 TRUE = "TRUE"


question (non-blocking): Do you expect to never have to pass-through the literal values "TRUE" or "FALSE"? ("__NULL__" seems a little less likely.)

TRUE and FALSE are a unique case with regards to these consts. R can convert them directly. R cannot convert NULL directly, which is why I use __NULL__. The use of TRUE and FALSE in this way, in Python, is partly for consistency in code and the way the values are handled "syntactically." Here is where I've documented this in the R package (although I think my comment is a little out-of-date there; the idea remains the same).

I have given a little thought to someone using these values as string types; they should remain strings because only the exact literal value of __NULL__ (not, e.g., '__NULL__', which is what string(val) would return) is converted to an R NULL value, but I have not explicitly tested this. I will likely address this soon, as I now have a need to convert __NULL__ as part of more complex composite types, e.g., list(__NULL__, c(seq(1,2,3))).

Scratch some of the above, I need to re-examine my thinking around this. Pretty sure I got my head turned around somewhere.

src/programming/Ligare/programming/R/type_conversion.py

nwiltsie · 2025-06-30T17:08:27Z

src/programming/Ligare/programming/R/type_conversion.py

+        else:
+            try:
+                super().__init__(int(value))
+            except:


suggestion (non-blocking): It's probably worth restricting this to ValueError.

I think I agree, but I could use some thought exercises around this maybe. My original thinking was that, however int(value) fails, always then trying float(value) is fine and, if that fails, that is what will completely give up.
So, I guess the question is what are valid float values that cause int(value) to fail without ValueError?

I haven't thought of any answers to your ValueError question, but I can authoritatively state that you should at least have except Exception:. Otherwise you'll catch SystemExit and KeyboardInterrupt:

$ cat test.py import sys try: sys.exit(0) except: print("Haha, can't quit me!") print("Logic continues") $ python3 test.py Haha, can't quit me! Logic continues

As an exercise ... this shows VaueError, TypeError, and OverflowError.

>>> x=["0","1","False","True","abc","hello",(),(i for i in ()),(lambda o: False),(lambda o: True),(lambda x: x),-((sys.maxsize - 1) * 2),-sys.maxsize - 1,0 ... ,1 + 2j,1,3.14,42,Ellipsis,False,None,NotImplemented,True,[],b"",b"0",b"1\n... ",b"abc",b"bytes",bytearray(b"ba"),enumerate(()),filter(bool, ()),frozen ... set({3, 4}),map(int, ()),memoryview(b"x"),object(),range(0),reversed([]),slice(0, 1, 1),sys.maxsize * 2,sys.maxsize,type,zip((), ()),{"k": "v"},{1, 2}, ... float('inf'),float('1e400')] >>> for a in x: ... try: ... print(f"Testing int({a})", end=': ') ... int(a) ... print("SUCCESS") ... except Exception as e: ... print(f"FAIL={type(e)}") ... try: ... print(f"Testing float({a})", end=': ') ... float(a) ... print("SUCCESS") ... except Exception as e2: ... print(f"FAIL={type(e2)}") Testing int(0): SUCCESS Testing int(1): SUCCESS Testing int(False): FAIL=<class 'ValueError'> Testing float(False): FAIL=<class 'ValueError'> Testing int(True): FAIL=<class 'ValueError'> Testing float(True): FAIL=<class 'ValueError'> Testing int(abc): FAIL=<class 'ValueError'> Testing float(abc): FAIL=<class 'ValueError'> Testing int(hello): FAIL=<class 'ValueError'> Testing float(hello): FAIL=<class 'ValueError'> Testing int(()): FAIL=<class 'TypeError'> Testing float(()): FAIL=<class 'TypeError'> Testing int(<generator object <genexpr> at 0x79d69f208940>): FAIL=<class 'TypeError'> Testing float(<generator object <genexpr> at 0x79d69f208940>): FAIL=<class 'TypeError'> Testing int(<function <lambda> at 0x79d69dc6e700>): FAIL=<class 'TypeError'> Testing float(<function <lambda> at 0x79d69dc6e700>): FAIL=<class 'TypeError'> Testing int(<function <lambda> at 0x79d69dc2b880>): FAIL=<class 'TypeError'> Testing float(<function <lambda> at 0x79d69dc2b880>): FAIL=<class 'TypeError'> Testing int(<function <lambda> at 0x79d69dc2a520>): FAIL=<class 'TypeError'> Testing float(<function <lambda> at 0x79d69dc2a520>): FAIL=<class 'TypeError'> Testing int(-18446744073709551612): SUCCESS Testing int(-9223372036854775808): SUCCESS Testing int(0): SUCCESS Testing int((1+2j)): FAIL=<class 'TypeError'> Testing float((1+2j)): FAIL=<class 'TypeError'> Testing int(1): SUCCESS Testing int(3.14): SUCCESS Testing int(42): SUCCESS Testing int(Ellipsis): FAIL=<class 'TypeError'> Testing float(Ellipsis): FAIL=<class 'TypeError'> Testing int(False): SUCCESS Testing int(None): FAIL=<class 'TypeError'> Testing float(None): FAIL=<class 'TypeError'> Testing int(NotImplemented): FAIL=<class 'TypeError'> Testing float(NotImplemented): FAIL=<class 'TypeError'> Testing int(True): SUCCESS Testing int([]): FAIL=<class 'TypeError'> Testing float([]): FAIL=<class 'TypeError'> Testing int(b''): FAIL=<class 'ValueError'> Testing float(b''): FAIL=<class 'ValueError'> Testing int(b'0'): SUCCESS Testing int(b'1\n... '): FAIL=<class 'ValueError'> Testing float(b'1\n... '): FAIL=<class 'ValueError'> Testing int(b'abc'): FAIL=<class 'ValueError'> Testing float(b'abc'): FAIL=<class 'ValueError'> Testing int(b'bytes'): FAIL=<class 'ValueError'> Testing float(b'bytes'): FAIL=<class 'ValueError'> Testing int(bytearray(b'ba')): FAIL=<class 'ValueError'> Testing float(bytearray(b'ba')): FAIL=<class 'ValueError'> Testing int(<enumerate object at 0x79d69daa7ec0>): FAIL=<class 'TypeError'> Testing float(<enumerate object at 0x79d69daa7ec0>): FAIL=<class 'TypeError'> Testing int(<filter object at 0x79d69dc3f910>): FAIL=<class 'TypeError'> Testing float(<filter object at 0x79d69dc3f910>): FAIL=<class 'TypeError'> Testing int(frozenset({3, 4})): FAIL=<class 'TypeError'> Testing float(frozenset({3, 4})): FAIL=<class 'TypeError'> Testing int(<map object at 0x79d69c58e830>): FAIL=<class 'TypeError'> Testing float(<map object at 0x79d69c58e830>): FAIL=<class 'TypeError'> Testing int(<memory at 0x79d69f208580>): FAIL=<class 'ValueError'> Testing float(<memory at 0x79d69f208580>): FAIL=<class 'ValueError'> Testing int(<object object at 0x79d6a51f3f60>): FAIL=<class 'TypeError'> Testing float(<object object at 0x79d6a51f3f60>): FAIL=<class 'TypeError'> Testing int(range(0, 0)): FAIL=<class 'TypeError'> Testing float(range(0, 0)): FAIL=<class 'TypeError'> Testing int(<list_reverseiterator object at 0x79d69dafeb00>): FAIL=<class 'TypeError'> Testing float(<list_reverseiterator object at 0x79d69dafeb00>): FAIL=<class 'TypeError'> Testing int(slice(0, 1, 1)): FAIL=<class 'TypeError'> Testing float(slice(0, 1, 1)): FAIL=<class 'TypeError'> Testing int(18446744073709551614): SUCCESS Testing int(9223372036854775807): SUCCESS Testing int(<class 'type'>): FAIL=<class 'TypeError'> Testing float(<class 'type'>): FAIL=<class 'TypeError'> Testing int(<zip object at 0x79d69f117f00>): FAIL=<class 'TypeError'> Testing float(<zip object at 0x79d69f117f00>): FAIL=<class 'TypeError'> Testing int({'k': 'v'}): FAIL=<class 'TypeError'> Testing float({'k': 'v'}): FAIL=<class 'TypeError'> Testing int({1, 2}): FAIL=<class 'TypeError'> Testing float({1, 2}): FAIL=<class 'TypeError'> Testing int(inf): FAIL=<class 'OverflowError'> Testing float(inf): SUCCESS Testing int(inf): FAIL=<class 'OverflowError'> Testing float(inf): SUCCESS

I haven't thought of any answers to your ValueError question, but I can authoritatively state that you should at least have except Exception:. Otherwise you'll catch SystemExit and KeyboardInterrupt:

$ cat test.py import sys try: sys.exit(0) except: print("Haha, can't quit me!") print("Logic continues") $ python3 test.py Haha, can't quit me! Logic continues

Ah good point. I will fix this.

nwiltsie · 2025-06-30T17:13:25Z

src/programming/Ligare/programming/R/type_conversion.py

+    def serialize(self) -> str | None:
+        if self.value is None:
+            return NULL
+        return f"'{self.value}'"


question (blocking): Isn't this vulnerable to single-quoted strings?

In [36]: mystr = "'Hello world'"; print(f"'{mystr}'") ''Hello world''

I think repr(self.value) might be more appropriate.

Isn't this vulnerable to single-quoted strings?

Yes. Thanks for calling this out, I didn't consider this issue when String is used directly. My assumption had been that these types are always behind the sanitization that occurs with the serialization methods, but of course that is incorrect because:

It's a "publicly"-named type of the module, so anyone should expect they can use them directly

I explicitly stated in this PR and demonstrate in tests that using these types directly is appropriate and sometimes necessary

I think repr(self.value) might be more appropriate.

Can you think of any issue with base Python considering __repr__ overrides? I am not overriding the method in this lib, however, I also didn't mark these classes as @final, so someone could end up doing that. Are there any other repr(...) considerations to be aware of?

I think if someone has the ability to subclass the serialization types you've already lost, right? There's no need to do code injection if you already control the process.

if someone has the ability to subclass the serialization types

In this case I meant developers, not attackers. :)

Ah! Then no, I don't think there are big issues with overriding __repr__. It's always possible for developers to mess it up, but that's on them.

I looked into using repr for this and found that its main purpose is to return a "printable representation" of some object. That causes problems for the sort of serialization I'm doing here because I'm not concerned with printable representations.

For example, the string a'b'c should be one of (ignoring the questionable behavior of whether sanitizing occurs or not):

'abc' if sanitized (with string)

or 'a\\'b\\'c' if not sanitized (with String - but which currently returns a'b'c)

repr returns "a\'b\'c"

Note that I've excluded the enclosing quotes from Python's REPL for all cases, to reduce noise.

Here's the approach I've adopted that appears to do what I need, and removes the problem of quotes. csv.DictWriter (cause I send the data as CSV, because dataframes) then serializes the data transfer format, which uses additional backslashes for the backslashes.

class String(SerializedType): @override def serialize(self) -> str | None: if self.value is None: return NULL if getattr(self.value, "translate", None): translated = self.value.translate(str.maketrans({"'": r"\'"})) return f"'{translated}'" else: return f"'{self.value}'"

Thoughts?

src/programming/Ligare/programming/R/type_conversion.py

This is for when `string(...)` is _not_ used, which otherwise handles quotes. This is for the use of `String` directly.

aholmes added 10 commits June 20, 2025 15:30

Add conversion methods for R lists.

972802d

Add "token" types for serializing R types.

4747fe4

Progress on R type conversion refactor.

337864b

Fix type errors.

34bd4b0

Clean up some left over cruft from refactor.

fc7d835

Fix issues with wrong values for empty inputs.

3401f3a

Add additional tests.

5075869

Consolidate type conversion methods.

b688e5c

Refactor R serialization.

d4e88a1

Make it possible to nest composite types.

132d1b0

nwiltsie reviewed Jun 30, 2025

View reviewed changes

aholmes added 2 commits June 30, 2025 14:45

Serialize Strings for R ingestion to handle quotes.

098a5d9

This is for when `string(...)` is _not_ used, which otherwise handles quotes. This is for the use of `String` directly.

Avoid catching application-ending exceptions.

a65897e

R serialization refactor #240

Are you sure you want to change the base?

R serialization refactor #240

Uh oh!

Conversation

aholmes commented Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nwiltsie left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aholmes Jun 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

aholmes commented Jun 30, 2025 •

edited

Loading

nwiltsie left a comment •

edited

Loading

aholmes Jun 30, 2025 •

edited

Loading