Skip to content

Conversation

@aholmes
Copy link
Contributor

@aholmes aholmes commented Jun 30, 2025

This PR is a refactor of the Ligare.programming R serialization methods. This is to support converting Python types and data into types and data that R can ingest with the corresponding R package found here.

Some key considerations for review:

  • serializers.py is effectively the public interface of the type conversion and serialization. Note that nothing from type_converters.py is "exported."
  • The serializer's default behaviors are a combination of "best guess" and practical requirements for CEV. For example, from_parts attempts to convert data via a few simple type checks. This can produce wrong results, but is generally what CEV needs and so is assumed to be correct behavior.
    • This can be controlled in applications where these assumptions are wrong. For example, while it is not immediately clear that this works (this is a gap in documentation and tests), this check allows for usage of this library similar to how this test instantiates data to be serialized. I do something similar to this in CEV to construct a list( c(seq(...)), c(seq(...)) )
def get(value: Any):
    x = _number(value, comma_separated=True)
    return Vector(Seq(x))

# some code to get the values from request args, followed by ...

from_parts(data, "key_dest", ["seq1", "seq2"], composite_type=List)

This results in a var: List[Vector[Seq[int]]] for which serialize() returns list(c(seq(1,2,3)),c(seq(4,5,6))).

  • Currently the tests best describe usage and expectations. There are some gaps in specific usages, e.g., direct tests for _convert(...)'s behavior, but there should be sufficient tests to get an understanding of how this PR works.
  • The docblocks are a little out-of-date. make Sphinx-doctest does not pass.
  • CEV can be reviewed for usage in a real application.
  • This library does some sanitization of input data to avoid any kind of injection, but pointing out areas that I missed will be helpful.

Copy link
Contributor

@nwiltsie nwiltsie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've poked around and made some comments, although I haven't managed to wrap my head around everything yet.

Comment on lines +150 to +162
if comma_separated or vector:
if not (safe_str := safe_comma_separated_string_regex.sub("", value)):
# if the safe_str is empty but value is not,
# then all characters were stripped and the
# supplied value is not "safe." Return None
# because this is invalid.
return String(None)

>>> string_from_csv(",") is None
True
items = (item for item in safe_str.split(","))
if vector:
return Vector([String(item) for item in items])
else:
return Composite([String(item) for item in items])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (blocking): Do you ever expect to have quoted values? Your SAFE_COMMA_SEPARATED_STRING_PATTERN regex will specifically exclude quotation marks, and using .split(",") wouldn't respect them anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CEV's use case, no, not at the moment. While someone may want to use quotations for, e.g., xlab​.label, they are explicitly stripped out as a brute force way to avoid breaking out of eval here.

Comment on lines 14 to 16
NULL = "__NULL__"
FALSE = "FALSE"
TRUE = "TRUE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): Do you expect to never have to pass-through the literal values "TRUE" or "FALSE"? ("__NULL__" seems a little less likely.)

Copy link
Contributor Author

@aholmes aholmes Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TRUE and FALSE are a unique case with regards to these consts. R can convert them directly. R cannot convert NULL directly, which is why I use __NULL__. The use of TRUE and FALSE in this way, in Python, is partly for consistency in code and the way the values are handled "syntactically." Here is where I've documented this in the R package (although I think my comment is a little out-of-date there; the idea remains the same).

I have given a little thought to someone using these values as string types; they should remain strings because only the exact literal value of __NULL__ (not, e.g., '__NULL__', which is what string(val) would return) is converted to an R NULL value, but I have not explicitly tested this. I will likely address this soon, as I now have a need to convert __NULL__ as part of more complex composite types, e.g., list(__NULL__, c(seq(1,2,3))).

Scratch some of the above, I need to re-examine my thinking around this. Pretty sure I got my head turned around somewhere.

else:
try:
super().__init__(int(value))
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): It's probably worth restricting this to ValueError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree, but I could use some thought exercises around this maybe. My original thinking was that, however int(value) fails, always then trying float(value) is fine and, if that fails, that is what will completely give up.
So, I guess the question is what are valid float values that cause int(value) to fail without ValueError?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought of any answers to your ValueError question, but I can authoritatively state that you should at least have except Exception:. Otherwise you'll catch SystemExit and KeyboardInterrupt:

$ cat test.py
import sys

try:
    sys.exit(0)
except:
    print("Haha, can't quit me!")

print("Logic continues")

$ python3 test.py
Haha, can't quit me!
Logic continues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an exercise ... this shows VaueError, TypeError, and OverflowError.

>>> x=["0","1","False","True","abc","hello",(),(i for i in ()),(lambda o: False),(lambda o: True),(lambda x: x),-((sys.maxsize - 1) * 2),-sys.maxsize - 1,0
... ,1 + 2j,1,3.14,42,Ellipsis,False,None,NotImplemented,True,[],b"",b"0",b"1\n... ",b"abc",b"bytes",bytearray(b"ba"),enumerate(()),filter(bool, ()),frozen
... set({3, 4}),map(int, ()),memoryview(b"x"),object(),range(0),reversed([]),slice(0, 1, 1),sys.maxsize * 2,sys.maxsize,type,zip((), ()),{"k": "v"},{1, 2},
... float('inf'),float('1e400')]
>>> for a in x:
...     try:
...         print(f"Testing int({a})", end=': ')
...         int(a)
...         print("SUCCESS")
...     except Exception as e:
...         print(f"FAIL={type(e)}")
...         try:
...             print(f"Testing float({a})", end=': ')
...             float(a)
...             print("SUCCESS")
...         except Exception as e2:
...             print(f"FAIL={type(e2)}")
Testing int(0): SUCCESS
Testing int(1): SUCCESS
Testing int(False): FAIL=<class 'ValueError'>
Testing float(False): FAIL=<class 'ValueError'>
Testing int(True): FAIL=<class 'ValueError'>
Testing float(True): FAIL=<class 'ValueError'>
Testing int(abc): FAIL=<class 'ValueError'>
Testing float(abc): FAIL=<class 'ValueError'>
Testing int(hello): FAIL=<class 'ValueError'>
Testing float(hello): FAIL=<class 'ValueError'>
Testing int(()): FAIL=<class 'TypeError'>
Testing float(()): FAIL=<class 'TypeError'>
Testing int(<generator object <genexpr> at 0x79d69f208940>): FAIL=<class 'TypeError'>
Testing float(<generator object <genexpr> at 0x79d69f208940>): FAIL=<class 'TypeError'>
Testing int(<function <lambda> at 0x79d69dc6e700>): FAIL=<class 'TypeError'>
Testing float(<function <lambda> at 0x79d69dc6e700>): FAIL=<class 'TypeError'>
Testing int(<function <lambda> at 0x79d69dc2b880>): FAIL=<class 'TypeError'>
Testing float(<function <lambda> at 0x79d69dc2b880>): FAIL=<class 'TypeError'>
Testing int(<function <lambda> at 0x79d69dc2a520>): FAIL=<class 'TypeError'>
Testing float(<function <lambda> at 0x79d69dc2a520>): FAIL=<class 'TypeError'>
Testing int(-18446744073709551612): SUCCESS
Testing int(-9223372036854775808): SUCCESS
Testing int(0): SUCCESS
Testing int((1+2j)): FAIL=<class 'TypeError'>
Testing float((1+2j)): FAIL=<class 'TypeError'>
Testing int(1): SUCCESS
Testing int(3.14): SUCCESS
Testing int(42): SUCCESS
Testing int(Ellipsis): FAIL=<class 'TypeError'>
Testing float(Ellipsis): FAIL=<class 'TypeError'>
Testing int(False): SUCCESS
Testing int(None): FAIL=<class 'TypeError'>
Testing float(None): FAIL=<class 'TypeError'>
Testing int(NotImplemented): FAIL=<class 'TypeError'>
Testing float(NotImplemented): FAIL=<class 'TypeError'>
Testing int(True): SUCCESS
Testing int([]): FAIL=<class 'TypeError'>
Testing float([]): FAIL=<class 'TypeError'>
Testing int(b''): FAIL=<class 'ValueError'>
Testing float(b''): FAIL=<class 'ValueError'>
Testing int(b'0'): SUCCESS
Testing int(b'1\n... '): FAIL=<class 'ValueError'>
Testing float(b'1\n... '): FAIL=<class 'ValueError'>
Testing int(b'abc'): FAIL=<class 'ValueError'>
Testing float(b'abc'): FAIL=<class 'ValueError'>
Testing int(b'bytes'): FAIL=<class 'ValueError'>
Testing float(b'bytes'): FAIL=<class 'ValueError'>
Testing int(bytearray(b'ba')): FAIL=<class 'ValueError'>
Testing float(bytearray(b'ba')): FAIL=<class 'ValueError'>
Testing int(<enumerate object at 0x79d69daa7ec0>): FAIL=<class 'TypeError'>
Testing float(<enumerate object at 0x79d69daa7ec0>): FAIL=<class 'TypeError'>
Testing int(<filter object at 0x79d69dc3f910>): FAIL=<class 'TypeError'>
Testing float(<filter object at 0x79d69dc3f910>): FAIL=<class 'TypeError'>
Testing int(frozenset({3, 4})): FAIL=<class 'TypeError'>
Testing float(frozenset({3, 4})): FAIL=<class 'TypeError'>
Testing int(<map object at 0x79d69c58e830>): FAIL=<class 'TypeError'>
Testing float(<map object at 0x79d69c58e830>): FAIL=<class 'TypeError'>
Testing int(<memory at 0x79d69f208580>): FAIL=<class 'ValueError'>
Testing float(<memory at 0x79d69f208580>): FAIL=<class 'ValueError'>
Testing int(<object object at 0x79d6a51f3f60>): FAIL=<class 'TypeError'>
Testing float(<object object at 0x79d6a51f3f60>): FAIL=<class 'TypeError'>
Testing int(range(0, 0)): FAIL=<class 'TypeError'>
Testing float(range(0, 0)): FAIL=<class 'TypeError'>
Testing int(<list_reverseiterator object at 0x79d69dafeb00>): FAIL=<class 'TypeError'>
Testing float(<list_reverseiterator object at 0x79d69dafeb00>): FAIL=<class 'TypeError'>
Testing int(slice(0, 1, 1)): FAIL=<class 'TypeError'>
Testing float(slice(0, 1, 1)): FAIL=<class 'TypeError'>
Testing int(18446744073709551614): SUCCESS
Testing int(9223372036854775807): SUCCESS
Testing int(<class 'type'>): FAIL=<class 'TypeError'>
Testing float(<class 'type'>): FAIL=<class 'TypeError'>
Testing int(<zip object at 0x79d69f117f00>): FAIL=<class 'TypeError'>
Testing float(<zip object at 0x79d69f117f00>): FAIL=<class 'TypeError'>
Testing int({'k': 'v'}): FAIL=<class 'TypeError'>
Testing float({'k': 'v'}): FAIL=<class 'TypeError'>
Testing int({1, 2}): FAIL=<class 'TypeError'>
Testing float({1, 2}): FAIL=<class 'TypeError'>
Testing int(inf): FAIL=<class 'OverflowError'>
Testing float(inf): SUCCESS
Testing int(inf): FAIL=<class 'OverflowError'>
Testing float(inf): SUCCESS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought of any answers to your ValueError question, but I can authoritatively state that you should at least have except Exception:. Otherwise you'll catch SystemExit and KeyboardInterrupt:

$ cat test.py
import sys

try:
    sys.exit(0)
except:
    print("Haha, can't quit me!")

print("Logic continues")

$ python3 test.py
Haha, can't quit me!
Logic continues

Ah good point. I will fix this.

def serialize(self) -> str | None:
if self.value is None:
return NULL
return f"'{self.value}'"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (blocking): Isn't this vulnerable to single-quoted strings?

In [36]: mystr = "'Hello world'"; print(f"'{mystr}'")
''Hello world''

I think repr(self.value) might be more appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this vulnerable to single-quoted strings?

Yes. Thanks for calling this out, I didn't consider this issue when String is used directly. My assumption had been that these types are always behind the sanitization that occurs with the serialization methods, but of course that is incorrect because:

  1. It's a "publicly"-named type of the module, so anyone should expect they can use them directly
  2. I explicitly stated in this PR and demonstrate in tests that using these types directly is appropriate and sometimes necessary

I think repr(self.value) might be more appropriate.

Can you think of any issue with base Python considering __repr__ overrides? I am not overriding the method in this lib, however, I also didn't mark these classes as @final, so someone could end up doing that. Are there any other repr(...) considerations to be aware of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if someone has the ability to subclass the serialization types you've already lost, right? There's no need to do code injection if you already control the process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone has the ability to subclass the serialization types

In this case I meant developers, not attackers. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Then no, I don't think there are big issues with overriding __repr__. It's always possible for developers to mess it up, but that's on them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into using repr for this and found that its main purpose is to return a "printable representation" of some object. That causes problems for the sort of serialization I'm doing here because I'm not concerned with printable representations.

For example, the string a'b'c should be one of (ignoring the questionable behavior of whether sanitizing occurs or not):

  • 'abc' if sanitized (with string)
  • or 'a\\'b\\'c' if not sanitized (with String - but which currently returns a'b'c)

repr returns "a\'b\'c"

Note that I've excluded the enclosing quotes from Python's REPL for all cases, to reduce noise.

Here's the approach I've adopted that appears to do what I need, and removes the problem of quotes. csv.DictWriter (cause I send the data as CSV, because dataframes) then serializes the data transfer format, which uses additional backslashes for the backslashes.

class String(SerializedType):
    @override
    def serialize(self) -> str | None:
        if self.value is None:
            return NULL
        if getattr(self.value, "translate", None):
            translated = self.value.translate(str.maketrans({"'": r"\'"}))
            return f"'{translated}'"
        else:
            return f"'{self.value}'"

Thoughts?

aholmes added 2 commits June 30, 2025 14:45
This is for when `string(...)` is _not_ used, which otherwise
handles quotes. This is for the use of `String` directly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants