Merge pull request #10 from posit-conf-2024/combine-setup
Combine + finish Workbench setup slides
stephhazlitt authored Aug 9, 2024
2 parents bd50467 + b7f781c commit 501be15
Showing 26 changed files with 285 additions and 201 deletions.
8 changes: 3 additions & 5 deletions _freeze/materials/0_housekeeping/execute-results/html.json
@@ -1,9 +1,11 @@
{
"hash": "d3eac455cb5520916f95814d16fd2f5c",
"hash": "9b2f8504afc7cf55e4e47830692dcd35",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(\"data/nyc-taxi/\")\n```\n:::\n\n\n\n\n\n\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## First dplyr pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1. Calculate the longest trip distance for every month in 2019\n\n2. How long did this query take to run?\n\n## Solution 1\n\nLongest trip distance for every month in 2019:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n\n\n:::\n:::\n\n\n\n\n## Solution 2\n\nCompute time:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n\n\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.392 sec elapsed\n```\n\n\n:::\n:::\n\n\n\n\nor \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n user system elapsed \n 2.565 0.148 0.376 \n```\n\n\n:::\n:::\n\n\n\n\n:::\n:::\n",
"supporting": [],
"markdown": "---\ntitle: \"Hello Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source\n---\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(tictoc)\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi <- open_dataset(\"data/nyc-taxi/\")\n```\n:::\n\n\n\n\n\n\n::: {#exercise-hello-nyc-taxi .callout-tip}\n## First dplyr pipeline with Arrow\n\n::: panel-tabset\n## Problems\n\n1. Calculate the longest trip distance for every month in 2019\n\n2. How long did this query take to run?\n\n## Solution 1\n\nLongest trip distance for every month in 2019:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nCompute time:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 12 × 2\n month longest_trip\n <int> <dbl>\n 1 1 832.\n 2 2 702.\n 3 3 237.\n 4 4 831.\n 5 5 401.\n 6 6 45977.\n 7 7 312.\n 8 8 602.\n 9 9 604.\n10 10 308.\n11 11 701.\n12 12 19130.\n```\n\n\n:::\n\n```{.r .cell-code}\ntoc()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n0.56 sec elapsed\n```\n\n\n:::\n:::\n\n\nor \n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n filter(year == 2019) |>\n group_by(month) |>\n summarize(longest_trip = max(trip_distance, na.rm = TRUE)) |>\n arrange(month) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n user system elapsed \n 2.532 0.316 0.452 \n```\n\n\n:::\n:::\n\n\n:::\n:::\n",
"supporting": [
"1_hello_arrow-exercises_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
8 changes: 5 additions & 3 deletions _freeze/materials/1_hello_arrow/execute-results/html.json
@@ -1,9 +1,11 @@
{
"hash": "52b72c4ea2bf073068a11a6b2fd0050c",
"hash": "eaac9c74be87e906e161433327bd4053",
"result": {
"engine": "knitr",
"markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Hello Arrow {#hello-arrow}\n\n\n## Kick-off Q&A\n\n<br>\n\n- What brings you to this workshop?\n- What challenges have you faced related to larger-than-memory data in R?\n- What is one thing you want to learn or achieve from today's workshop?\n- ...?\n\n\n## Poll: Arrow\n\n<br>\n\n**Have you used or experimented with Arrow before today?**\n\nVote using emojis on the #workshop-arrow discord channel! <br> \n\n1️⃣ Not yet\n\n2️⃣ Not yet, but I have read about it!\n\n3️⃣ A little\n\n4️⃣ A lot\n\n\n## Hello Arrow<br>Demo\n\n<br>\n\n![](images/logo.png){.absolute top=\"0\" left=\"250\" width=\"600\" height=\"800\"}\n\n## Some \"Big\" Data\n\n![](images/nyc-taxi-homepage.png){.absolute left=\"200\" width=\"600\"}\n\n::: {style=\"font-size: 60%; margin-top: 550px; margin-left: 200px;\"}\n<https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>\n:::\n\n## NYC Taxi Data\n\n- *big* NYC Taxi data set (\\~40GBs on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n filter(year %in% 2012:2021) |>\n write_dataset(\"data/nyc-taxi\", partitioning = c(\"year\", \"month\"))\n```\n:::\n\n\n- *tiny* NYC Taxi data set (\\<1GB on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload.file(url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip\",\n destfile = \"data/nyc-taxi-tiny.zip\")\n\nunzip(\n zipfile = \"data/nyc-taxi-tiny.zip\",\n exdir = \"data/\"\n)\n```\n:::\n\n\n## Posit Workbench 🛠️\n\n- Join Workbench via URL in the #workshop-arrow Discord channel\n- You can use your GitHub credentials to log in\n\n![](images/wb-signin.png){.absolute left=\"200\" width=\"300\"}\n![](images/use-gh-creds.png){.absolute left=\"500\" width=\"300\"}\n\n\n## Larger-Than-Memory 
Data\n\n<br>\n\n`arrow::open_dataset()`\n\n<br>\n\n::: notes\nArrow Datasets allow you to query against data that has been split across multiple files. This division of data into multiple files may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.\n:::\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(\"data/nyc-taxi\")\n```\n:::\n\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1150352666\n```\n\n\n:::\n:::\n\n\n<br>\n\n1.15 billion rows 🤯\n\n## NYC Taxi Dataset: A question\n\n<br>\n\nWhat percentage of taxi rides each year had more than 1 passenger?\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nnyc_taxi |>\n group_by(year) |>\n summarise(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 4\n year all_trips shared_trips pct_shared\n <int> <int> <int> <dbl>\n 1 2012 178544324 53313752 29.9\n 2 2013 173179759 51215013 29.6\n 3 2014 165114361 48816505 29.6\n 4 2015 146112989 43081091 29.5\n 5 2016 131165043 38163870 29.1\n 6 2017 113495512 32296166 28.5\n 7 2018 102797401 28796633 28.0\n 8 2019 84393604 23515989 27.9\n 9 2020 24647055 5837960 23.7\n10 2021 30902618 7221844 23.4\n```\n\n\n:::\n:::\n\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |>\n group_by(year) |>\n summarise(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\ntoc()\n```\n:::\n\n\n> 6.077 sec 
elapsed\n\n## Your Turn\n\n1. Calculate the longest trip distance for every month in 2019\n\n2. How long did this query take to run?\n\n➡️ [Hello Arrow Exercises Page](1_hello_arrow-exercises.html)\n\n## What is Apache Arrow?\n\n::: columns\n::: {.column width=\"50%\"}\n> A multi-language toolbox for accelerated data interchange and in-memory processing\n:::\n\n::: {.column width=\"50%\"}\n> Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another\n:::\n:::\n\n::: {style=\"font-size: 70%;\"}\n<https://arrow.apache.org/overview/>\n:::\n\n## Apache Arrow Specification\n\nIn-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.\n\n<br>\n\n![](images/arrow-rectangle.png){.absolute left=\"200\"}\n\n## A Multi-Language Toolbox\n\n![](images/arrow-libraries-structure.png)\n\n## Accelerated Data Interchange\n\n![](images/data-interchange-with-arrow.png)\n\n## Accelerated In-Memory Processing\n\nArrow's Columnar Format is Fast\n\n![](images/columnar-fast.png){.absolute top=\"120\" left=\"200\" height=\"600\"}\n\n::: notes\nThe contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.\n:::\n\n## arrow 📦\n\n<br>\n\n![](images/arrow-r-pkg.png){.absolute top=\"0\" left=\"300\" width=\"700\" height=\"900\"}\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Today\n\n- Module 1: Larger-than-memory data manipulation with Arrow---Part I\n- Module 2: Data engineering with Arrow\n- Module 3: In-memory workflows in R with Arrow\n- Module 4: Larger-than-memory data manipulation with Arrow---Part II\n\n",
"supporting": [],
"markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Hello Arrow {#hello-arrow}\n\n\n## Kick-off Q&A\n\n<br>\n\n- What brings you to this workshop?\n- What challenges have you faced related to larger-than-memory data in R?\n- What is one thing you want to learn or achieve from today's workshop?\n- ...?\n\n\n## Poll: Arrow\n\n<br>\n\n**Have you used or experimented with Arrow before today?**\n\nVote using emojis on the #workshop-arrow discord channel! <br> \n\n1️⃣ Not yet\n\n2️⃣ Not yet, but I have read about it!\n\n3️⃣ A little\n\n4️⃣ A lot\n\n\n## Hello Arrow<br>Demo\n\n<br>\n\n![](images/logo.png){.absolute top=\"0\" left=\"250\" width=\"600\" height=\"800\"}\n\n## Some \"Big\" Data\n\n![](images/nyc-taxi-homepage.png){.absolute left=\"200\" width=\"600\"}\n\n::: {style=\"font-size: 60%; margin-top: 550px; margin-left: 200px;\"}\n<https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>\n:::\n\n## NYC Taxi Data\n\n- *big* NYC Taxi data set (\\~40GBs on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"s3://voltrondata-labs-datasets/nyc-taxi\") |>\n filter(year %in% 2012:2021) |>\n write_dataset(\"data/nyc-taxi\", partitioning = c(\"year\", \"month\"))\n```\n:::\n\n\n- *tiny* NYC Taxi data set (\\<1GB on disk)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndownload.file(url = \"https://github.com/posit-conf-2023/arrow/releases/download/v0.1.0/nyc-taxi-tiny.zip\",\n destfile = \"data/nyc-taxi-tiny.zip\")\n\nunzip(\n zipfile = \"data/nyc-taxi-tiny.zip\",\n exdir = \"data/\"\n)\n```\n:::\n\n\n## Larger-Than-Memory Data\n\n<br>\n\n`arrow::open_dataset()`\n\n<br>\n\n::: notes\nArrow Datasets allow you to query against data that has been split across multiple files. This division of data into multiple files may indicate partitioning, which can accelerate queries that only touch some partitions (files). 
Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.\n:::\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nnyc_taxi <- open_dataset(\"data/nyc-taxi\")\n```\n:::\n\n\n## NYC Taxi Dataset\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnyc_taxi |> \n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1150352666\n```\n\n\n:::\n:::\n\n\n<br>\n\n1.15 billion rows 🤯\n\n## NYC Taxi Dataset: A question\n\n<br>\n\nWhat percentage of taxi rides each year had more than 1 passenger?\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\nnyc_taxi |>\n group_by(year) |>\n summarise(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 10 × 4\n year all_trips shared_trips pct_shared\n <int> <int> <int> <dbl>\n 1 2012 178544324 53313752 29.9\n 2 2013 173179759 51215013 29.6\n 3 2014 165114361 48816505 29.6\n 4 2015 146112989 43081091 29.5\n 5 2017 113495512 32296166 28.5\n 6 2018 102797401 28796633 28.0\n 7 2019 84393604 23515989 27.9\n 8 2020 24647055 5837960 23.7\n 9 2021 30902618 7221844 23.4\n10 2016 131165043 38163870 29.1\n```\n\n\n:::\n:::\n\n\n## NYC Taxi Dataset: A dplyr pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tictoc)\n\ntic()\nnyc_taxi |>\n group_by(year) |>\n summarise(\n all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)\n ) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\ntoc()\n```\n:::\n\n\n> 6.077 sec elapsed\n\n## Your Turn\n\n1. Calculate the longest trip distance for every month in 2019\n\n2. 
How long did this query take to run?\n\n➡️ [Hello Arrow Exercises Page](1_hello_arrow-exercises.html)\n\n## What is Apache Arrow?\n\n::: columns\n::: {.column width=\"50%\"}\n> A multi-language toolbox for accelerated data interchange and in-memory processing\n:::\n\n::: {.column width=\"50%\"}\n> Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another\n:::\n:::\n\n::: {style=\"font-size: 70%;\"}\n<https://arrow.apache.org/overview/>\n:::\n\n## Apache Arrow Specification\n\nIn-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.\n\n<br>\n\n![](images/arrow-rectangle.png){.absolute left=\"200\"}\n\n## A Multi-Language Toolbox\n\n![](images/arrow-libraries-structure.png)\n\n## Accelerated Data Interchange\n\n![](images/data-interchange-with-arrow.png)\n\n## Accelerated In-Memory Processing\n\nArrow's Columnar Format is Fast\n\n![](images/columnar-fast.png){.absolute top=\"120\" left=\"200\" height=\"600\"}\n\n::: notes\nThe contiguous columnar layout enables vectorization using the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors.\n:::\n\n## arrow 📦\n\n<br>\n\n![](images/arrow-r-pkg.png){.absolute top=\"0\" left=\"300\" width=\"700\" height=\"900\"}\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Today\n\n- Module 1: Larger-than-memory data manipulation with Arrow---Part I\n- Module 2: Data engineering with Arrow\n- Module 3: In-memory workflows in R with Arrow\n- Module 4: Larger-than-memory data manipulation with Arrow---Part II\n\n",
"supporting": [
"1_hello_arrow_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
