`NA` in datar is set to `numpy.nan`, which is a float. This causes problems for data of other dtypes, because assigning `NA` (a float) into an array of another dtype is not compatible. Unlike R, Python does not have a missing-value type for other dtypes.

pandas has introduced its own `NA` and some `NA`-compatible dtypes. However, `numpy` is still not aware of them, which causes problems for internal computations.
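
The incompatibility can be reproduced with plain `numpy` and `pandas`, independent of datar. The following is a minimal sketch: the variable names are illustrative only, and it assumes a pandas version that provides the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

ints = np.array([1, 2, 3])

# Assigning a float NaN into an integer array is rejected by numpy.
try:
    ints[0] = np.nan
    assigned = True
except ValueError:
    assigned = False

# Converting to float first makes room for NaN, at the cost of the dtype.
floats = ints.astype(float)
floats[0] = np.nan

# pandas' nullable integer dtype stores its own NA instead of a float NaN.
nullable = pd.array([1, None, 3], dtype="Int64")

print(assigned)              # False
print(floats.dtype)          # float64
print(nullable[1] is pd.NA)  # True
```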
- string

When a string array is initialized with `NA`, as in `numpy.array(['a', NA])`, the `NA` is converted to the string `'nan'`, which may not be what we want. To avoid that, use `None` or `NULL` instead:
```python
>>> numpy.array(['a', None])
array(['a', None], dtype=object)
```
Note that the dtype falls back to `object`.
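
The two outcomes can be compared side by side. In this small sketch, `np.nan` stands in for datar's `NA` (which the text says is set to `numpy.nan`), and the variable names are illustrative.

```python
import numpy as np

with_nan = np.array(["a", np.nan])   # NaN is coerced to the string 'nan'
with_none = np.array(["a", None])    # None is kept, dtype becomes object

print(with_nan.dtype.kind)     # U (a unicode string array)
print(with_nan[1] == "nan")    # True
print(with_none.dtype)         # object
print(with_none[1] is None)    # True
```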
- `NaN`

Since `NA` is already a float, `NaN` here is equivalent to `NA`.

Most APIs from tidyverse packages ignore/reset the index (row names) of data frames, and so do the APIs from `datar`. So when selecting rows, row indices are always used. With most APIs, the indices of the data frames are dropped, so they actually range from 0 to `nrow(df) - 1`.

!!! Note

    When using 1-based indexing (the default), 1 selects the first row, even though the first row shows index 0 when it's printed.

`MultiIndex` indices/column names are not supported by the APIs that select or manipulate data frames, and the data frames generated by the APIs will not have `MultiIndex` indices/column names. However, since the result is still a pandas `DataFrame`, you can always do it the pandas way:
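
As one possible illustration (not taken from the original text), multi-level column names can be attached and used with plain pandas. The frame and the labels here are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Attach two-level column names using pandas directly.
df.columns = pd.MultiIndex.from_tuples([("left", "x"), ("right", "y")])

# Select through the MultiIndex with pandas' own indexing.
left_x = df[("left", "x")]

print(df.columns.nlevels)  # 2
print(left_x.tolist())     # [1, 2]
```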

`datar` doesn't use `pandas`' `DataFrameGroupBy`/`SeriesGroupBy` classes. Instead, we have our own `DataFrameGroupBy` class, which is actually a subclass of `DataFrame` with 3 extra properties: `_group_data`, `_group_vars` and `_group_drop`, carrying the grouping data, the grouping variables/columns, and whether to drop the non-observable values. This is very similar to `grouped_df` from `dplyr`.

The reasons that we implement this are:

1. Pandas `DataFrameGroupBy` cannot handle multiple categorical columns as
   groupby variables with non-observable values
2. It is very hard to retrieve group indices and data when doing `apply`
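
The idea of carrying grouping information on a `DataFrame` subclass can be sketched with pandas' documented subclassing hooks (`_metadata` and `_constructor`). This is only an illustration: `GroupedFrame` and `group_by` below are hypothetical names, only the attribute name `_group_vars` is borrowed from the text, and datar's actual class may be built differently.

```python
import pandas as pd


class GroupedFrame(pd.DataFrame):
    """Hypothetical stand-in, not datar's actual DataFrameGroupBy."""

    # Extra attributes that pandas should treat as regular metadata.
    _metadata = ["_group_vars"]

    @property
    def _constructor(self):
        # Ask pandas to build results of operations as GroupedFrame.
        return GroupedFrame


def group_by(df, *columns):
    """Wrap a frame and remember which columns define the groups."""
    out = GroupedFrame(df)
    out._group_vars = list(columns)
    return out


df = pd.DataFrame({"g": ["a", "b", "a"], "x": [1, 2, 3]})
gdf = group_by(df, "g")

print(isinstance(gdf, pd.DataFrame))  # True
print(gdf._group_vars)                # ['g']
```

Because the grouped object is still a `DataFrame`, it can be passed to any code that expects one, while the grouping columns travel with it.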

`%in%` in R is a shortcut for `is.element()` to test if the elements are in a container.
```r
r$> c(1,3,5) %in% 1:4
[1]  TRUE  TRUE FALSE

r$> is.element(c(1,3,5), 1:4)
[1]  TRUE  TRUE FALSE
```
However, `in` in Python acts differently:
```python
>>> import numpy as np
>>>
>>> arr = np.array([1,2,3,4])
>>> elts = np.array([1,3,5])
>>>
>>> elts in arr
/.../bin/bpython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/.../bin/python
False
>>> [1,2] in [1,2,3]
False
```
It simply tests whether the object on the left side of `in` is equal to any of the elements on the right side, regardless of whether the left-hand object is a scalar or not.
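
A pure-Python sketch makes the difference concrete: the left operand is compared as a whole, so an element-wise test has to be written out explicitly. The names below are illustrative.

```python
# A list on the left is compared as one object, not element by element.
whole_object_miss = [1, 2] in [1, 2, 3]      # no element equals [1, 2]
whole_object_hit = [1, 2] in [[1, 2], 3]     # the first element equals [1, 2]

# Element-wise membership needs an explicit loop or comprehension.
elementwise = [x in [1, 2, 3, 4] for x in [1, 3, 5]]

print(whole_object_miss)  # False
print(whole_object_hit)   # True
print(elementwise)        # [True, True, False]
```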

Yes, we can redefine this behavior by writing our own `__contains__()` method for the object on the right side. For example:
```python
>>> class MyList(list):
...     def __contains__(self, key):
...         # Just an example to let it return the reversed result
...         return not super().__contains__(key)
...
>>> 1 in MyList([1,2,3])
False
>>> 4 in MyList([1,2,3])
True
```
But the problem is that the result of `__contains__()` is forced to be a scalar bool by Python. In this sense, we cannot let `x in y` be evaluated as a bool array or even a pipda `Expression` object.
```python
>>> class MyList(list):
...     def __contains__(self, key):
...         # Just an example
...         return [True, False, True]  # logically True in python
...
>>> 1 in MyList([1,2,3])
True
>>> 4 in MyList([1,2,3])
True
```
So instead, we ported `is.element()` from R:
```python
>>> import numpy as np
>>> from datar.base import is_element
>>>
>>> arr = np.array([1,2,3,4])
>>> elts = np.array([1,3,5])
>>>
>>> is_element(elts, arr)
array([ True,  True, False])
```
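
For readers without datar at hand, a comparable element-wise test can be sketched with `numpy.isin`. The helper name `is_element_sketch` is hypothetical, and the text does not say how datar's `is_element` is implemented, so this is only an approximation of the behavior shown above.

```python
import numpy as np


def is_element_sketch(elements, container):
    """Hypothetical stand-in; datar's is_element may be implemented differently."""
    # np.isin tests each element of `elements` for membership in `container`.
    return np.isin(elements, container)


arr = np.array([1, 2, 3, 4])
elts = np.array([1, 3, 5])
result = is_element_sketch(elts, arr)

print(result.tolist())  # [True, True, False]
```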
So, as @rleyvasal pointed out in https://github.com/pwwang/datar/issues/31#issuecomment-877499212,