generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 18
/
05_notes.qmd
413 lines (264 loc) · 8.56 KB
/
05_notes.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
# Notes {.unnumbered}
## Introduction
:::{.callout-note}
This is a long chapter, these notes are intended as a tour of main ideas!
:::
![Panda bus tour!](images/pandabus.jpg){width=300px}
* Pandas is a major tool in Python data analysis
* Works with Numpy, adding support for tabular / heterogenous data
## Import conventions:
```{python}
import numpy as np
import pandas as pd
```
## Panda's primary data structures
* Series: One dimensional object containing a sequence of values of the same type.
* DataFrame: Tabular data, similar (and inspired by) R dataframe.
* Other structures will be introduced as they arise, e.g. Index and Groupby objects.
### Series
```{python}
obj = pd.Series([4,7,-4,3], index = ["A","B","C","D"])
obj
```
The `index` is optional, if not specified it will default to 0 through N-1
#### Selection
Select elements or sub-Series by labels, sets of labels, boolean arrays ...
```{python}
obj['A']
```
```{python}
obj[['A','C']]
```
```{python}
obj[obj > 3]
```
#### Other things you can do
* Numpy functions and Numpy-like operations work as expected:
```{python}
obj*3
```
```{python}
np.exp(obj)
```
* Series can be created from and converted to a dictionary
```{python}
obj.to_dict()
```
* Series can be converted to numpy array:
```{python}
obj.to_numpy()
```
### DataFrame
* Represents table of data
* Has row index *index* and column index *column*
* Common way to create is from a dictionary, but see *Table 5.1* for more!
```{python}
test = pd.DataFrame({"cars":['Chevy','Ford','Dodge','BMW'],'MPG':[14,15,16,12], 'Year':[1979, 1980, 2001, 2020]})
test
```
* If you want a non-default index, it can be specified just like with Series.
* `head(n)` / `tail(n)` - return the first / last n rows, 5 by default
#### Selecting
* Can retrieve columns or sets of columns by using `obj[...]`:
```{python}
test['cars']
```
Note that we got a `Series` here.
```{python}
test[['cars','MPG']]
```
* Dot notation can also be used (`test.cars`) as long as the column names are valid identifiers
* *Rows* can be retrieved with `iloc[...]` and `loc[...]`:
- `loc` retrieves by index
- `iloc` retrieves by position.
#### Modifying / Creating Columns
* Columns can be modified (and created) by assignment:
```{python}
test['MPG^2'] = test['MPG']**2
test
```
* `del` keyword can be used to drop columns, or `drop` method can be used to do so non-destructively
### Index object
* Index objects are used for holding axis labels and other metadata
```{python}
test.index
```
* Can change the index, in this case replacing the default:
```{python}
# Create index from one of the columns
test.index = test['cars']
# remove 'cars' column since i am using as an index now. s
test=test.drop('cars', axis = "columns") # or axis = 1
test
```
* Note the `axis` keyword argument above, many DataFrame methods use this.
* Above I changed a column into an index. Often you want to go the other way, this can be done with `reset_index`:
```{python}
test.reset_index() # Note this doesn't actually change test
```
* Columns are an index as well:
```{python}
test.columns
```
* Indexes act like immutable sets, see *Table 5.2* in book for Index methods and properties
## Essential Functionality
### Reindexing and dropping
* `reindex` creats a *new* object with the values arranged according to the new index. Missing values are used if necessary, or you can use optional fill methods. You can use `iloc` and `loc` to reindex as well.
```{python}
s = pd.Series([1,2,3,4,5], index = list("abcde"))
s2 = s.reindex(list("abcfu")) # not a song by GAYLE
s2
```
* Missing values and can be tested for with `isna` or `notna` methods
```{python}
pd.isna(s2)
```
* `drop` , illustrated above can drop rows or columns. In addition to using `axis` you can use `columns` or `index`. Again these make copies.
```{python}
test.drop(columns = 'MPG')
```
```{python}
test.drop(index = ['Ford', 'BMW'])
```
### Indexing, Selection and Filtering
#### Series
* For Series, indexing is similar to Numpy, except you can use the index as well as integers.
```{python}
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj[0:3]
```
```{python}
obj['a':'c']
```
```{python}
obj[obj<2]
```
```{python}
obj[['a','d']]
```
* *However*, preferred way is to use `loc` for selection by *index* and `iloc` for selection by position. This is to avoid the issue where the `index` is itself integers.
```{python}
obj.loc[['a','d']]
```
```{python}
obj.iloc[1]
```
:::{.callout-note}
Note if a range or a set of indexes is used, a Series is returned. If a single item is requested, you get just that item.
:::
#### DataFrame
* Selecting with `df[...]` for a DataFrame retrieves one or more columns as we have seen, if you select a single column you get a Series
* There are some special cases, indexing with a boolean selects *rows*, as does selecting with a slice:
```{python}
test[0:1]
```
```{python}
test[test['MPG'] < 15]
```
* `iloc` and `loc` can be used to select rows as illustrated before, but can also be used to select columns or subsets of rows/columns
```{python}
test.loc[:,['Year','MPG']]
```
```{python}
test.loc['Ford','MPG']
```
* These work with slices and booleans as well! The following says "give me all the rows with MPG more then 15, and the columns starting from Year"
```{python}
test.loc[test['MPG'] > 15, 'Year':]
```
* Indexing options are fully illustrated in the book and *Table 5.4*
* Be careful with *chained indexing*:
```{python}
test[test['MPG']> 15].loc[:,'MPG'] = 18
```
Here we are assigning to a 'slice', which is probably not what is intended. You will get a warning and a recommendation to fix it by using one `loc`:
```{python}
test.loc[test['MPG']> 15 ,'MPG'] = 18
test
```
:::{.callout-tip}
### Rule of Thumb
Avoid chained indexing when doing assignments
:::
### Arithmetic and Data Alignment
* Pandas can make it simpler to work with objects that have different indexes, usually 'doing the right thing'
```{python}
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s1+s2
```
* Fills can be specified by using methods:
```{python}
s1.add(s2, fill_value = 0)
```
* See *Table 5.5* for list of these methods.
* You can also do arithmetic between *DataFrame*s and *Series* in a way that is similar to Numpy.
### Function Application and Mapping
* Numpy *ufuncs* also work with Pandas objects.
```{python}
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
columns=list("bde"),
index=["Utah", "Ohio", "Texas", "Oregon"])
frame
```
```{python}
np.abs(frame)
```
* `apply` can be used to apply a function on 1D arrays to each column or row:
```{python}
frame.apply(np.max, axis = 'rows') #'axis' is optional here, default is rows
```
Applying accross columns is common, especially to combine different columns in some way:
```{python}
frame['max'] = frame.apply(np.max, axis = 'columns')
frame
```
* Many more examples of this in the book.
### Sorting and Ranking
* `sort_index` will sort with the index (on either axis for *DataFrame*)
* `sort_values` is used to sort by values or a particular column
```{python}
test.sort_values('MPG')
```
* `rank` will assign ranks from on through the number of data points.
## Summarizing and Computing Descriptive Statistics
```{python}
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=["a", "b", "c", "d"],
columns=["one", "two"])
df
```
Some Examples:
Sum over rows:
```{python}
df.sum()
```
Sum over columns:
```{python}
# Sum Rows
df.sum(axis="columns")
```
Extremely useful is `describe`:
```{python}
df.describe()
```
**Book chapter contains *many* more examples and a full list of summary statistics and related methods.**
## Summary
* Primary Panda's data structures:
- Series
- DataFrame
* Many ways to access and transform these objects. Key ones are:
- `[]` : access an element(s) of a `Series` or columns(s) of a `DataFrame`
- `loc[r ,c]` : access a row / column / cell by the `index`.
- `iloc[i, j]` : access ar row / column / cell by the integer position.
* [Online reference.](https://pandas.pydata.org/docs/reference/index.html)
:::{.callout-tip}
## Suggestion
Work though the chapter's code and try stuff!
:::
## References
* [Chapter's code.](https://nbviewer.org/github/pydata/pydata-book/blob/3rd-edition/ch05.ipynb)
* [Panda reference.](https://pandas.pydata.org/docs/reference/index.html)
## Next Chapter
* Loading and writing data sets!