diff --git a/_freeze/materials/0_housekeeping/execute-results/html.json b/_freeze/materials/0_housekeeping/execute-results/html.json index aa195bc..9211f02 100644 --- a/_freeze/materials/0_housekeeping/execute-results/html.json +++ b/_freeze/materials/0_housekeeping/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "73e91f1bd3eb564aac096eb7f28e2119", + "hash": "98eca0fd351522cd2f53c0e060df7006", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Big Data in R with Arrow\"\nsubtitle: \"posit::conf(2024) 1-day workshop\"\nauthor: \"Nic Crane + Steph Hazlitt\"\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n\n\n# Welcome 👋\n\n## \n\n### WiFi ``{=html}\n\n- Username: **Posit Conf 2024**\n- Password: **conf2024**\n\n
\n\n### Workshop ``{=html}\n\n- Website: [pos.it/arrow-conf24](https://pos.it/arrow-conf24)\n- GitHub: [github.com/posit-conf-2024/arrow](https://github.com/posit-conf-2024/arrow)\n\n## Housekeeping\n\n
\n\n### Gender Neutral Bathrooms ``{=html}\n\n- Located on levels 3, 4, 5, 6 & 7\n\n### Specialty Rooms ``{=html}\n\n- Meditation/Prayer Room (503)\n- Lactation Room (509)\n\n*Available Mon & Tues 7am - 7pm, and Wed 7am - 5pm\n\n\n## Photos\n\n
\n\n### Red Lanyards ``{=html}``{=html} **NO** ``{=html}\n\n
\n\nPlease note everyone’s lanyard colors before taking a photo and respect their choices.\n\n## Code of Conduct\n\n
\n\n### ``{=html} [posit.co/code-of-conduct/](https://posit.co/code-of-conduct/)\n\n- Contact any posit::conf staff member, identifiable by their staff t-shirt, or visit the conference general information desk.\n- Send a message to conf\\@posit.com; event organizers will respond promptly.\n- Call +1-844-448-1212; this phone number will be monitored for the duration of the event.\n\n## Meet Your Teaching Team ``{=html}\n\n
\n\n### Co-Instructors\n\n- Nic Crane\n- Steph Hazlitt\n\n### Teaching Assistant\n\n- Jonathan Keane\n\n## Meet Each Other ``{=html}\n\n
\n\n- When did you use R for the first time?\n- What is your favorite R package?\n- Which package hex sticker would you like to find the most during posit::conf(2024)?\n\n## Getting Help Today ``{=html}\n\n
\n\n[TEAL]{style=\"color:teal;\"} sticky note: I am OK / I am done\n\n[PINK]{style=\"color:pink;\"} sticky note: I need support / I am working\n\n
\n\n``{=html} You can ask questions at any time during the workshop\n\n## Discord ``{=html}\n\n- [pos.it/conf-event-portal](http://pos.it/conf-event-portal) (login)\n- Click on \"Join Discord, the virtual networking platform!\"\n- Browse Channels -> `#workshop-arrow`\n\n## We Assume\n\n- You know ``{=html}\n- You are familiar with the [dplyr](https://dplyr.tidyverse.org/) package for data manipulation ``{=html}\n- You have data in your life that is too large to fit into memory or sluggish in memory\n- You want to learn how to engineer your data storage for more performant access and analysis\n", - "supporting": [], + "markdown": "---\ntitle: \"Big Data in R with Arrow\"\nsubtitle: \"posit::conf(2024) 1-day workshop\"\nauthor: \"Nic Crane + Steph Hazlitt\"\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Welcome 👋\n\n## \n\n### WiFi ``{=html}\n\n- Username: **Posit Conf 2024**\n- Password: **conf2024**\n\n
\n\n### Workshop ``{=html}\n\n- Website: [pos.it/arrow-conf24](https://pos.it/arrow-conf24)\n- GitHub: [github.com/posit-conf-2024/arrow](https://github.com/posit-conf-2024/arrow)\n\n## Housekeeping\n\n
\n\n### Gender-Neutral Bathrooms ``{=html}\n\n- Located on levels 3, 4, 5, 6 & 7\n\n### Specialty Rooms ``{=html}\n\n- Meditation/Prayer Room (503)\n- Lactation Room (509)\n\n*Available Mon & Tues 7am - 7pm, and Wed 7am - 5pm\n\n\n## Photos\n\n
\n\n### Red Lanyards ``{=html}``{=html} **NO** ``{=html}\n\n
\n\nPlease note everyone’s lanyard colors before taking a photo and respect their choices.\n\n## Code of Conduct\n\n
\n\n### ``{=html} [posit.co/code-of-conduct/](https://posit.co/code-of-conduct/)\n\n- Contact any posit::conf staff member, identifiable by their staff t-shirt, or visit the conference general information desk.\n- Send a message to conf\\@posit.com; event organizers will respond promptly.\n- Call +1-844-448-1212; this phone number will be monitored for the duration of the event.\n\n## Meet Your Teaching Team ``{=html}\n\n
\n\n### Co-Instructors\n\n- Nic Crane\n- Steph Hazlitt\n\n### Teaching Assistant\n\n- Jonathan Keane\n\n## Meet Each Other ``{=html}\n\n
\n\n- When did you use R for the first time?\n- What is your favorite R package?\n- Which package hex sticker would you like to find the most during posit::conf(2024)?\n\n## Getting Help Today ``{=html}\n\n
\n\n[TEAL]{style=\"color:teal;\"} sticky note: I am OK / I am done\n\n[PINK]{style=\"color:pink;\"} sticky note: I need support / I am working\n\n
\n\n``{=html} You can ask questions at any time during the workshop\n\n## Discord ``{=html}\n\n- [pos.it/conf-event-portal](http://pos.it/conf-event-portal) (login)\n- Click on \"Join Discord, the virtual networking platform!\"\n- Browse Channels -> `#workshop-arrow`\n\n## We Assume\n\n- You know ``{=html}\n- You are familiar with the [dplyr](https://dplyr.tidyverse.org/) package for data manipulation ``{=html}\n- You have data in your life that is too large to fit into memory or sluggish in memory\n- You want to learn how to engineer your data storage for more performant access and analysis\n\n## Setup\n\n- Log onto Workbench at the following URL: \n- Create a new session; **select \"Resource Profile: Large\"**\n- Run `usethis::use_course(\"posit-conf-2024/arrow\")`\n- Open `data/setup.R` and run the script\n", + "supporting": [ + "0_housekeeping_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json index 2b558ae..a4b8edc 100644 --- a/_freeze/materials/3_data_engineering-exercises/execute-results/html.json +++ b/_freeze/materials/3_data_engineering-exercises/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "949ad5f0c58f263cb46500cfd640fc1d", + "hash": "fa36122b964d2adb9ad5f21d7c58c8cc", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n\n\n# Schemas\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\"\n)\n```\n:::\n\n\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `` (or the alias ``) instead of the `` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n\n\nor\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) #utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n\n\nor\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n\n\nTiming the query:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 11.474 1.084 11.003 \n```\n\n\n:::\n:::\n\n\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. 
Did you notice a difference in compute time?\n\n## Solution 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 2.076 0.287 0.646 \n```\n\n\n:::\n:::\n\n\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.965 0.160 0.409 \n```\n\n\n:::\n:::\n\n\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.058 0.006 0.052 \n```\n\n\n:::\n:::\n\n\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", + "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `` (or the alias ``) instead of the `` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.651 1.091 10.333 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. 
Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.634 0.345 0.558 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.777 0.072 0.296 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.034 0.005 0.030 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/materials/3_data_engineering/execute-results/html.json b/_freeze/materials/3_data_engineering/execute-results/html.json index c7ca5bd..74eb5f1 100644 --- a/_freeze/materials/3_data_engineering/execute-results/html.json +++ b/_freeze/materials/3_data_engineering/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "6a1fef61889b21e92adbcf1ca896a4f9", + "hash": "bccca98d2edd2bc29031ab127391891b", "result": { "engine": "knitr", - "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n::: {.cell}\n\n:::\n\n\n\n\n# Data Engineering with Arrow {#data-eng-storage}\n\n## Data Engineering\n\n
\n\n![](images/data-engineering.png)\n\n
\n\n::: {style=\"font-size: 70%;\"}\n\n:::\n\n## .NORM Files\n\n![](images/norm_normal_file_format_2x.png){.absolute top=\"0\" left=\"400\"}\n\n
\n\n::: {style=\"font-size: 70%;\"}\n\n:::\n\n## Poll: Formats\n\n
\n\n**Which file formats do you use most often?**\n\n
\n\n- 1️⃣ CSV (.csv)\n- 2️⃣ MS Excel (.xls and .xlsx)\n- 3️⃣ Parquet (.parquet)\n- 4️⃣ Something else\n\n\n## Arrow & File Formats\n\n![](images/arrow-read-write-updated.png)\n\n## Seattle
Checkouts
Big CSV\n\n![](images/seattle-checkouts.png){.absolute top=\"0\" left=\"300\"}\n\n::: {style=\"font-size: 60%; margin-top: 440px; margin-left: 330px;\"}\n\n:::\n\n## Dataset contents\n\n![](images/datapreview.png){height=\"550\"}\n\n## arrow::open_dataset() with a CSV\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n\n\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n## arrow::schema()\n\n> Create a schema or extract one from an object.\n\n
\n\nLet's extract the schema:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(seattle_csv)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n## Arrow Data Types\n\nArrow has a rich data type system, including direct analogs of many R data types\n\n- `` == ``\n- `` == `` OR `` (aliases)\n- `` == ``\n\n
\n\n\n\n## Parsing the Metadata\n\n
\n\nArrow scans 👀 1MB of data to impute or \"guess\" the data types\n\n::: {style=\"font-size: 80%; margin-top: 200px;\"}\n📚 arrow vs readr blog post: \n:::\n\n## Parsers Are Not Always Right\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(seattle_csv)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n![](images/data-dict.png){.absolute top=\"300\" left=\"330\" width=\"700\"}\n\n::: notes\nInternational Standard Book Number (ISBN) is a 13-digit number that uniquely identifies books and book-like products published internationally.\n\nData Dictionaries, metadata in data catalogues should provide this info.\n\nThe number or rows used to infer the schema will vary depending on the data in each column, total number of columns, and how many bytes each value takes up in memory. \n\nIf all of the values in a column that lie within the first 1MB of the file are missing values, arrow will classify this data as null type. ISBN! Phone numbers, zip codes, leading zeros...\n\nRecommended specifying a schema when working with CSV datasets to avoid potential issues like this\n:::\n\n## Let's Control the Schema\n\nCreating a schema manually:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n)\n```\n:::\n\n\n\n\n
\n\nThis will take a lot of typing with 12 columns 😢\n\n## Let's Control the Schema\n\nUse the code() method to extract the code from the schema:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv$schema$code() \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nschema(UsageClass = utf8(), CheckoutType = utf8(), MaterialType = utf8(), \n CheckoutYear = int64(), CheckoutMonth = int64(), Checkouts = int64(), \n Title = utf8(), ISBN = null(), Creator = utf8(), Subjects = utf8(), \n Publisher = utf8(), PublicationYear = utf8())\n```\n\n\n:::\n:::\n\n\n\n\n
\n\n🤩\n\n## Let's Control the Schema\n\nSchema defines column names and types, so we need to skip the first row (skip = 1):\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|12\"}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n skip = 1,\n schema = schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n## Let's Control the Schema\n\nSupply column types for a subset of columns by providing a partial schema:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(\n sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) #utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n\n## Your Turn\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `` instead of the `` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n## 9GB CSV file + arrow + dplyr\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n\n\n## 9GB CSV file + arrow + dplyr\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 11.581 1.136 11.117 \n```\n\n\n:::\n:::\n\n\n\n\n42 million rows -- not bad, but could be faster....\n\n## File Format: Apache Parquet\n\n![](images/apache-parquet.png){.absolute top=\"100\" left=\"200\" width=\"700\"}\n\n::: {style=\"font-size: 60%; margin-top: 450px;\"}\n\n:::\n\n\n## Parquet Files: \"row-chunked\"\n\n![](images/parquet-chunking.png)\n\n## Parquet Files: \"row-chunked & column-oriented\"\n\n![](images/parquet-columnar.png)\n\n## Parquet\n\n- compression and encoding == usually much smaller than equivalent CSV file, less data to move from disk to memory\n- rich type system & stores the schema along with the data == more robust pipelines\n- \"row-chunked & column-oriented\" == work on different parts of the file at the same time or skip some chunks all together, better performance than row-by-row\n\n::: notes\n- efficient encodings to keep file size down, and supports file compression, less data to move from disk to memory\n- CSV has no info about data types, inferred by each parser\n:::\n\n\n## Writing to Parquet\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n\n\n## Storage: Parquet vs CSV\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile <- list.files(seattle_parquet)\nfile.size(file.path(seattle_parquet, file)) / 10**9\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 4.423348\n```\n\n\n:::\n:::\n\n\n\n\n
\n\nParquet about half the size of the CSV file on-disk 💾\n\n## Your Turn\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n## 4.5GB Parquet file + arrow + dplyr\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 2.018 0.265 0.595 \n```\n\n\n:::\n:::\n\n\n\n\n42 million rows -- much better! But could be *even* faster....\n\n## File Storage:
Partitioning\n\n
\n\n::: columns\n::: {.column width=\"50%\"}\nDividing data into smaller pieces, making it more easily accessible and manageable\n:::\n\n::: {.column width=\"50%\"}\n![](images/partitions.png){.absolute top=\"0\"}\n:::\n:::\n\n::: notes\nalso called multi-files or sometimes shards\n:::\n\n## Poll: Partitioning?\n\nHave you partitioned your data or used partitioned data before today?\n\n
\n\n- 1️⃣ Yes\n- 2️⃣ No\n- 3️⃣ Not sure, the data engineers sort that out!\n\n## Art & Science of Partitioning\n\n
\n\n- avoid files \\< 20MB and \\> 2GB\n- avoid \\> 10,000 files (🤯)\n- partition on variables used in `filter()`\n\n::: notes\n- guidelines not rules, results vary\n- experiment, especially with cloud\n- arrow suggests avoid files smaller than 20MB and larger than 2GB\n- avoid partitions that produce more than 10,000 files\n- partition by variables that you filter by, allows arrow to only read relevant files\n:::\n\n## Rewriting the Data Again\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n\n\n## What Did We \"Engineer\"?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nsizes <- tibble(\n files = list.files(seattle_parquet_part, recursive = TRUE),\n size_GB = file.size(file.path(seattle_parquet_part, files)) / 10**9\n)\n\nsizes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n files size_GB\n \n 1 CheckoutYear=2005/part-0.parquet 0.114\n 2 CheckoutYear=2006/part-0.parquet 0.172\n 3 CheckoutYear=2007/part-0.parquet 0.186\n 4 CheckoutYear=2008/part-0.parquet 0.204\n 5 CheckoutYear=2009/part-0.parquet 0.224\n 6 CheckoutYear=2010/part-0.parquet 0.233\n 7 CheckoutYear=2011/part-0.parquet 0.250\n 8 CheckoutYear=2012/part-0.parquet 0.261\n 9 CheckoutYear=2013/part-0.parquet 0.282\n10 CheckoutYear=2014/part-0.parquet 0.296\n11 CheckoutYear=2015/part-0.parquet 0.308\n12 CheckoutYear=2016/part-0.parquet 0.315\n13 CheckoutYear=2017/part-0.parquet 0.319\n14 CheckoutYear=2018/part-0.parquet 0.306\n15 CheckoutYear=2019/part-0.parquet 0.302\n16 CheckoutYear=2020/part-0.parquet 0.158\n17 CheckoutYear=2021/part-0.parquet 0.240\n18 CheckoutYear=2022/part-0.parquet 0.252\n```\n\n\n:::\n:::\n\n\n\n\n## 4.5GB partitioned Parquet files + arrow + dplyr\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nopen_dataset(sources = seattle_parquet_part,\n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.640 0.220 0.267 \n```\n\n\n:::\n:::\n\n\n\n\n
\n\n42 million rows -- not too shabby!\n\n## Your Turn\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n\n## Partitions & NA Values\n\nADD content\n\n## Partition Design\n\n::: columns\n::: {.column width=\"50%\"}\n- Partitioning on variables commonly used in `filter()` often faster\n- Number of partitions also important (Arrow reads the metadata of each file)\n:::\n\n::: {.column width=\"50%\"}\n![](images/partitions.png){.absolute top=\"0\"}\n:::\n:::\n\n## Performance Review: Single CSV\n\nHow long does it take to calculate the number of books checked out in each month of 2021?\n\n
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts.csv\", \n format = \"csv\") |> \n\n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 13.362 1.763 12.438 \n```\n\n\n:::\n:::\n\n\n\n\n## Performance Review: Partitioned Parquet\n\nHow long does it take to calculate the number of books checked out in each month of 2021?\n\n
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts\",\n format = \"parquet\") |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.330 0.039 0.091 \n```\n\n\n:::\n:::\n\n\n\n\n## Engineering Data Tips for Improved Storage & Performance\n\n
\n\n- consider \"column-oriented\" file formats like Parquet\n- consider partitioning, experiment to get an appropriate partition design 🗂️\n- watch your schemas 👀\n\n## R for Data Science (2e)\n\n::: columns\n::: {.column width=\"50%\"}\n![](images/r4ds-cover.jpg){.absolute top=\"100\" width=\"400\"}\n:::\n\n::: {.column width=\"50%\"}\n
\n\n[Chapter 23: Arrow](https://r4ds.hadley.nz/arrow.html)\n\n
\n\n\n:::\n:::\n", + "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n::: {.cell}\n\n:::\n\n\n# Data Engineering with Arrow {#data-eng-storage}\n\n## Data Engineering\n\n
\n\n![](images/data-engineering.png)\n\n
\n\n::: {style=\"font-size: 70%;\"}\n\n:::\n\n## .NORM Files\n\n![](images/norm_normal_file_format_2x.png){.absolute top=\"0\" left=\"400\"}\n\n
\n\n::: {style=\"font-size: 70%;\"}\n\n:::\n\n## Poll: Formats\n\n
\n\n**Which file formats do you use most often?**\n\n
\n\n- 1️⃣ CSV (.csv)\n- 2️⃣ MS Excel (.xls and .xlsx)\n- 3️⃣ Parquet (.parquet)\n- 4️⃣ Something else\n\n## Arrow & File Formats\n\n![](images/arrow-read-write-updated.png)\n\n## Seattle
Checkouts
Big CSV\n\n![](images/seattle-checkouts.png){.absolute top=\"0\" left=\"300\"}\n\n::: {style=\"font-size: 60%; margin-top: 440px; margin-left: 330px;\"}\n\n:::\n\n## Dataset contents\n\n![](images/datapreview.png){height=\"550\"}\n\n## arrow::open_dataset() with a CSV\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n\n\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## arrow::schema()\n\n> Create a schema or extract one from an object.\n\n
\n\nLet's extract the schema:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(seattle_csv)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Arrow Data Types\n\nArrow has a rich data type system, including direct analogs of many R data types\n\n- `` == ``\n- `` == `` OR `` (aliases)\n- `` == ``\n\n
\n\n\n\n## Parsing the Metadata\n\n
\n\nArrow scans 👀 1MB of data to impute or \"guess\" the data types\n\n::: {style=\"font-size: 80%; margin-top: 200px;\"}\n📚 arrow vs readr blog post: \n:::\n\n## Parsers Are Not Always Right\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(seattle_csv)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: null\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n![](images/data-dict.png){.absolute top=\"300\" left=\"330\" width=\"700\"}\n\n::: notes\nInternational Standard Book Number (ISBN) is a 13-digit number that uniquely identifies books and book-like products published internationally.\n\nData Dictionaries, metadata in data catalogues should provide this info.\n\nThe number of rows used to infer the schema will vary depending on the data in each column, total number of columns, and how many bytes each value takes up in memory. \n\nIf all of the values in a column that lie within the first 1MB of the file are missing values, arrow will classify this data as null type. ISBN! Phone numbers, zip codes, leading zeros...\n\nRecommended specifying a schema when working with CSV datasets to avoid potential issues like this\n:::\n\n## Let's Control the Schema\n\nCreating a schema manually:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(\n  UsageClass = utf8(),\n  CheckoutType = utf8(),\n  MaterialType = utf8(),\n  CheckoutYear = int64(),\n  CheckoutMonth = int64(),\n  Checkouts = int64(),\n  Title = utf8(),\n  ISBN = string(), #utf8()\n  Creator = utf8(),\n  Subjects = utf8(),\n  Publisher = utf8(),\n  PublicationYear = utf8()\n)\n```\n:::\n\n\n
\n\nThis will take a lot of typing with 12 columns 😢\n\n## Let's Control the Schema\n\nUse the `code()` method to extract the code from the schema:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv$schema$code() \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nschema(UsageClass = utf8(), CheckoutType = utf8(), MaterialType = utf8(), \n    CheckoutYear = int64(), CheckoutMonth = int64(), Checkouts = int64(), \n    Title = utf8(), ISBN = null(), Creator = utf8(), Subjects = utf8(), \n    Publisher = utf8(), PublicationYear = utf8())\n```\n\n\n:::\n:::\n\n\n
\n\n🤩\n\n## Let's Control the Schema\n\nSchema defines column names and types, so we need to skip the first row (skip = 1):\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|11|17\"}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema = schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Let's Control the Schema\n\nSupply column types for a subset of columns by providing a partial schema:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(\n sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) #utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n## Your Turn\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `` (or the alias ``) instead of the `` interpreted by Arrow.\n\n2. 
Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n## 9GB CSV file + arrow + dplyr\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n \n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n## 9GB CSV file + arrow + dplyr\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.688 1.099 10.451 \n```\n\n\n:::\n:::\n\n\n42 million rows -- not bad, but could be faster....\n\n## File Format: Apache Parquet\n\n![](images/apache-parquet.png){.absolute top=\"100\" left=\"200\" width=\"700\"}\n\n::: {style=\"font-size: 60%; margin-top: 450px;\"}\n\n:::\n\n## Parquet Files: \"row-chunked\"\n\n![](images/parquet-chunking.png)\n\n## Parquet Files: \"row-chunked & column-oriented\"\n\n![](images/parquet-columnar.png)\n\n## Parquet\n\n- compression and encoding == usually much smaller than equivalent CSV file, less data to move from disk to memory\n- rich type system & stores the schema along with the data == more robust pipelines\n- \"row-chunked & column-oriented\" == work on different parts of the file at the same time or skip some chunks all together, better performance than row-by-row\n\n::: notes\n- efficient encodings to keep file size down, and supports file compression, less data to move from disk to memory\n- CSV has no info about data types, inferred by each parser\n:::\n\n## Writing to Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n## Storage: Parquet vs CSV\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfile <- list.files(seattle_parquet)\nfile.size(file.path(seattle_parquet, file)) / 10**9\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 4.424267\n```\n\n\n:::\n:::\n\n\n
\n\nParquet about half the size of the CSV file on-disk 💾\n\n## Your Turn\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n## 4.5GB Parquet file + arrow + dplyr\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.771 0.431 0.568 \n```\n\n\n:::\n:::\n\n\n42 million rows -- much better! But could be *even* faster....\n\n## File Storage:
Partitioning\n\n
\n\n::: columns\n::: {.column width=\"50%\"}\nDividing data into smaller pieces, making it more easily accessible and manageable\n:::\n\n::: {.column width=\"50%\"}\n![](images/partitions.png){.absolute top=\"0\"}\n:::\n:::\n\n::: notes\nalso called multi-files or sometimes shards\n:::\n\n## Poll: Partitioning?\n\nHave you partitioned your data or used partitioned data before today?\n\n
\n\n- 1️⃣ Yes\n- 2️⃣ No\n- 3️⃣ Not sure, the data engineers sort that out!\n\n## Art & Science of Partitioning\n\n
\n\n- avoid files \\< 20MB and \\> 2GB\n- avoid \\> 10,000 files (🤯)\n- partition on variables used in `filter()`\n\n::: notes\n- guidelines not rules, results vary\n- experiment, especially with cloud\n- arrow suggests avoid files smaller than 20MB and larger than 2GB\n- avoid partitions that produce more than 10,000 files\n- partition by variables that you filter by, allows arrow to only read relevant files\n:::\n\n## Rewriting the Data Again\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n## What Did We \"Engineer\"?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nsizes <- tibble(\n files = list.files(seattle_parquet_part, recursive = TRUE),\n size_GB = file.size(file.path(seattle_parquet_part, files)) / 10**9\n)\n\nsizes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n files size_GB\n \n 1 CheckoutYear=2005/part-0.parquet 0.115\n 2 CheckoutYear=2006/part-0.parquet 0.172\n 3 CheckoutYear=2007/part-0.parquet 0.186\n 4 CheckoutYear=2008/part-0.parquet 0.204\n 5 CheckoutYear=2009/part-0.parquet 0.224\n 6 CheckoutYear=2010/part-0.parquet 0.233\n 7 CheckoutYear=2011/part-0.parquet 0.250\n 8 CheckoutYear=2012/part-0.parquet 0.261\n 9 CheckoutYear=2013/part-0.parquet 0.282\n10 CheckoutYear=2014/part-0.parquet 0.296\n11 CheckoutYear=2015/part-0.parquet 0.308\n12 CheckoutYear=2016/part-0.parquet 0.315\n13 CheckoutYear=2017/part-0.parquet 0.319\n14 CheckoutYear=2018/part-0.parquet 0.306\n15 CheckoutYear=2019/part-0.parquet 0.303\n16 CheckoutYear=2020/part-0.parquet 0.158\n17 CheckoutYear=2021/part-0.parquet 0.240\n18 CheckoutYear=2022/part-0.parquet 0.252\n```\n\n\n:::\n:::\n\n\n## 4.5GB partitioned Parquet files + arrow + dplyr\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nopen_dataset(sources = seattle_parquet_part,\n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.742 0.385 0.366 \n```\n\n\n:::\n:::\n\n\n
\n\n42 million rows -- not too shabby!\n\n## Your Turn\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n➡️ [Data Storage Engineering Exercises Page](3_data_engineering-exercises.html)\n\n## Partition Design\n\n::: columns\n::: {.column width=\"50%\"}\n- Partitioning on variables commonly used in `filter()` often faster\n- Number of partitions also important (Arrow reads the metadata of each file)\n:::\n\n::: {.column width=\"50%\"}\n![](images/partitions.png){.absolute top=\"0\"}\n:::\n:::\n\n## Partitions & NA Values\n\nDefault:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npartition_na_default_path <- \"data/na-partition-default\"\n\nwrite_dataset(starwars,\n partition_na_default_path,\n partitioning = \"hair_color\")\n\nlist.files(partition_na_default_path)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"hair_color=__HIVE_DEFAULT_PARTITION__\"\n [2] \"hair_color=auburn\" \n [3] \"hair_color=auburn%2C%20grey\" \n [4] \"hair_color=auburn%2C%20white\" \n [5] \"hair_color=black\" \n [6] \"hair_color=blond\" \n [7] \"hair_color=blonde\" \n [8] \"hair_color=brown\" \n [9] \"hair_color=brown%2C%20grey\" \n[10] \"hair_color=grey\" \n[11] \"hair_color=none\" \n[12] \"hair_color=white\" \n```\n\n\n:::\n:::\n\n\n## Partitions & NA Values\n\nCustom:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npartition_na_custom_path <- \"data/na-partition-custom\"\n\nwrite_dataset(starwars,\n partition_na_custom_path,\n partitioning = hive_partition(hair_color = string(),\n null_fallback = \"no_color\"))\n\nlist.files(partition_na_custom_path)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"hair_color=auburn\" \"hair_color=auburn%2C%20grey\" \n [3] \"hair_color=auburn%2C%20white\" \"hair_color=black\" \n [5] \"hair_color=blond\" \"hair_color=blonde\" \n [7] \"hair_color=brown\" \"hair_color=brown%2C%20grey\" \n [9] \"hair_color=grey\" \"hair_color=no_color\" \n[11] \"hair_color=none\" \"hair_color=white\" \n```\n\n\n:::\n:::\n\n\n## Performance Review: Single CSV\n\nHow long does it take to calculate the number of books checked out in each month of 2021?\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts.csv\", \n format = \"csv\") |> \n\n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 11.718 1.106 11.250 \n```\n\n\n:::\n:::\n\n\n## Performance Review: Partitioned Parquet\n\nHow long does it take to calculate the number of books checked out in each month of 2021?\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts\",\n format = \"parquet\") |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.224 0.040 0.068 \n```\n\n\n:::\n:::\n\n\n## Engineering Data Tips for Improved Storage & Performance\n\n
\n\n- consider \"column-oriented\" file formats like Parquet\n- consider partitioning, experiment to get an appropriate partition design 🗂️\n- watch your schemas 👀\n\n## R for Data Science (2e)\n\n::: columns\n::: {.column width=\"50%\"}\n![](images/r4ds-cover.jpg){.absolute top=\"100\" width=\"400\"}\n:::\n\n::: {.column width=\"50%\"}\n
\n\n[Chapter 23: Arrow](https://r4ds.hadley.nz/arrow.html)\n\n
\n\n\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/materials/5_arrow_single_file/execute-results/html.json b/_freeze/materials/5_arrow_single_file/execute-results/html.json index e2238ec..1ad4ad9 100644 --- a/_freeze/materials/5_arrow_single_file/execute-results/html.json +++ b/_freeze/materials/5_arrow_single_file/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "635c69d2b9051934482ef06f39f1cc44", + "hash": "8e987b3003a6c0a3bf5337526f5d6598", "result": { "engine": "knitr", - "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n\n\n# Arrow in R: In-Memory Workflows {#single-file-api}\n\n\n\n\n::: {.cell}\n\n:::\n\n\n\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Arrow & Single Files\n\n
\n\n`library(arrow)`\n\n- `read_parquet()`\n- `read_csv_arrow()`\n- `read_feather()`\n- `read_json_arrow()`\n\n**Value**: `tibble` (the default), or an Arrow Table if `as_data_frame = FALSE` --- both *in-memory*\n\n## Your Turn\n\n1. Read in a single NYC Taxi parquet file using `read_parquet()` as an Arrow Table\n\n2. Convert your Arrow Table object to a `data.frame` or a `tibble`\n\n## Read a Parquet File (`tibble`)\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nparquet_file <- \"data/nyc-taxi/year=2019/month=9/part-0.parquet\"\n\ntaxi_df <- read_parquet(file = parquet_file)\ntaxi_df\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6,567,396 × 22\n vendor_name pickup_datetime dropoff_datetime passenger_count\n \n 1 VTS 2019-09-01 06:14:09 2019-09-01 06:31:52 2\n 2 VTS 2019-09-01 06:36:17 2019-09-01 07:12:44 1\n 3 VTS 2019-09-01 06:29:19 2019-09-01 06:54:13 1\n 4 CMT 2019-09-01 06:33:09 2019-09-01 06:52:14 2\n 5 VTS 2019-09-01 06:57:43 2019-09-01 07:26:21 1\n 6 CMT 2019-09-01 06:59:16 2019-09-01 07:28:12 1\n 7 CMT 2019-09-01 06:20:06 2019-09-01 06:52:19 1\n 8 CMT 2019-09-01 06:27:54 2019-09-01 06:32:56 0\n 9 CMT 2019-09-01 06:35:08 2019-09-01 06:55:51 0\n10 CMT 2019-09-01 06:19:37 2019-09-01 06:30:52 1\n# ℹ 6,567,386 more rows\n# ℹ 18 more variables: trip_distance , pickup_longitude ,\n# pickup_latitude , rate_code , store_and_fwd ,\n# dropoff_longitude , dropoff_latitude , payment_type ,\n# fare_amount , extra , mta_tax , tip_amount ,\n# tolls_amount , total_amount , improvement_surcharge ,\n# congestion_surcharge , pickup_location_id , …\n```\n\n\n:::\n:::\n\n\n\n\n## Read a Parquet File (`Table`)\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_table <- read_parquet(file = parquet_file, as_data_frame = FALSE)\ntaxi_table\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTable\n6567396 rows x 22 columns\n$vendor_name \n$pickup_datetime \n$dropoff_datetime \n$passenger_count \n$trip_distance \n$pickup_longitude \n$pickup_latitude \n$rate_code \n$store_and_fwd \n$dropoff_longitude \n$dropoff_latitude \n$payment_type \n$fare_amount \n$extra \n$mta_tax \n$tip_amount \n$tolls_amount \n$total_amount \n$improvement_surcharge \n$congestion_surcharge \n...\n2 more columns\nUse `schema()` to see entire schema\n```\n\n\n:::\n:::\n\n\n\n\n## `tibble` \\<-\\> `Table` \\<-\\> `data.frame`\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\n#change a df to a table\narrow_table(taxi_df)\n\n#change a table to a tibble\ntaxi_table |> collect()\nas_tibble(taxi_table)\n\n#change a table to a data.frame\nas.data.frame(taxi_table)\n```\n:::\n\n\n\n\n
\n\n- `data.frame` & `tibble` are R objects *in-memory*\n- `Table` is an Arrow object *in-memory*\n\n## Data frames\n\n![](images/tabular-structures-r.png)\n\n## Arrow Tables\n\n![](images/tabular-structures-arrow-1.png)\n\n::: notes\nArrow Tables are collections of chunked arrays\n:::\n\n## Table \\| Dataset: A `dplyr` pipeline\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 4\n vendor_name all_trips shared_trips pct_shared\n \n1 VTS 4238808 1339478 31.6\n2 CMT 2294473 470344 20.5\n3 34115 0 0 \n```\n\n\n:::\n:::\n\n\n\n\n
\n\nFunctions available in Arrow dplyr queries: \n\n::: notes\nAll the same capabilities as you practiced with Arrow Dataset\n:::\n\n## Arrow for Efficient In-Memory Processing\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6567396\n```\n\n\n:::\n:::\n\n\n\n\n
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,8\"}\nparquet_file |>\n read_parquet() |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 2.214 0.575 0.814 \n```\n\n\n:::\n:::\n\n\n\n\n## Arrow for Efficient In-Memory Processing\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6567396\n```\n\n\n:::\n:::\n\n\n\n\n
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,8\"}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.995 0.343 0.366 \n```\n\n\n:::\n:::\n\n\n\n\n## Read a Parquet File Selectively\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(\n col_select = c(\"vendor_name\", \"passenger_count\"),\n as_data_frame = FALSE\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTable\n6567396 rows x 2 columns\n$vendor_name \n$passenger_count \n```\n\n\n:::\n:::\n\n\n\n\n## Selective Reads Are Faster\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,3,4,11\"}\nparquet_file |>\n read_parquet(\n col_select = c(\"vendor_name\", \"passenger_count\"),\n as_data_frame = FALSE\n ) |> \n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.323 0.088 0.234 \n```\n\n\n:::\n:::\n\n\n\n\n\n:::notes\nNotes: row-based format readers often allow you to specify which columns to read in but the entire row must be read in and the unwanted columns discarded. Parquet’s columnar format allows you to read in only the columns you need, which is faster when you only need a subset of the data.\n:::\n\n## Arrow Table or Dataset?\n\n![](images/2022-09-decision-map.png){.absolute left=\"200\" height=\"550\"}\n\n::: {style=\"font-size: 60%; margin-top: 575px; margin-left: 250px;\"}\n\n:::\n\n## Arrow for Improving Those Sluggish Worklows\n\n- a \"drop-in\" for many dplyr workflows (Arrow Table or Dataset)\n- works when your tabular data get too big for your RAM (Arrow Dataset)\n- provides tools for re-engineering data storage for better performance (`arrow::write_dataset()`)\n\n::: notes\nLot's of ways to speed up sluggish workflows e.g. [writing more performant tidyverse code](https://www.tidyverse.org/blog/2023/04/performant-packages/), use other data frame libraries like data.table or polars, use duckDB or other databases, Spark + splarklyr ... However, Arrow offers some attractive features for tackling this challenge, especially for dplyr users.\n:::\n", - "supporting": [], + "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Arrow in R: In-Memory Workflows {#single-file-api}\n\n\n::: {.cell}\n\n:::\n\n\n## arrow 📦\n\n![](images/arrow-read-write-updated.png)\n\n## Arrow & Single Files\n\n
\n\n`library(arrow)`\n\n- `read_parquet()`\n- `read_csv_arrow()`\n- `read_feather()`\n- `read_json_arrow()`\n\n**Value**: `tibble` (the default), or an Arrow Table if `as_data_frame = FALSE` --- both *in-memory*\n\n## Your Turn\n\n1. Read in a single NYC Taxi parquet file using `read_parquet()` as an Arrow Table\n\n2. Convert your Arrow Table object to a `data.frame` or a `tibble`\n\n## Read a Parquet File (`tibble`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\n\nparquet_file <- \"data/nyc-taxi/year=2019/month=9/part-0.parquet\"\n\ntaxi_df <- read_parquet(file = parquet_file)\ntaxi_df\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6,567,396 × 22\n vendor_name pickup_datetime dropoff_datetime passenger_count\n \n 1 CMT 2019-08-31 18:09:30 2019-08-31 18:15:42 1\n 2 CMT 2019-08-31 18:26:30 2019-08-31 18:44:31 1\n 3 CMT 2019-08-31 18:39:35 2019-08-31 19:15:55 2\n 4 VTS 2019-08-31 18:12:26 2019-08-31 18:15:17 4\n 5 VTS 2019-08-31 18:43:16 2019-08-31 18:53:50 1\n 6 VTS 2019-08-31 18:26:13 2019-08-31 18:45:35 1\n 7 CMT 2019-08-31 18:34:52 2019-08-31 18:42:03 1\n 8 CMT 2019-08-31 18:50:02 2019-08-31 18:58:16 1\n 9 CMT 2019-08-31 18:08:02 2019-08-31 18:14:44 0\n10 VTS 2019-08-31 18:11:38 2019-08-31 18:26:47 1\n# ℹ 6,567,386 more rows\n# ℹ 18 more variables: trip_distance , pickup_longitude ,\n# pickup_latitude , rate_code , store_and_fwd ,\n# dropoff_longitude , dropoff_latitude , payment_type ,\n# fare_amount , extra , mta_tax , tip_amount ,\n# tolls_amount , total_amount , improvement_surcharge ,\n# congestion_surcharge , pickup_location_id , …\n```\n\n\n:::\n:::\n\n\n## Read a Parquet File (`Table`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_table <- read_parquet(file = parquet_file, as_data_frame = FALSE)\ntaxi_table\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTable\n6567396 rows x 22 columns\n$vendor_name \n$pickup_datetime \n$dropoff_datetime \n$passenger_count \n$trip_distance \n$pickup_longitude \n$pickup_latitude \n$rate_code \n$store_and_fwd \n$dropoff_longitude \n$dropoff_latitude \n$payment_type \n$fare_amount \n$extra \n$mta_tax \n$tip_amount \n$tolls_amount \n$total_amount \n$improvement_surcharge \n$congestion_surcharge \n...\n2 more columns\nUse `schema()` to see entire schema\n```\n\n\n:::\n:::\n\n\n## `tibble` \\<-\\> `Table` \\<-\\> `data.frame`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\n#change a df to a table\narrow_table(taxi_df)\n\n#change a table to a tibble\ntaxi_table |> collect()\nas_tibble(taxi_table)\n\n#change a table to a data.frame\nas.data.frame(taxi_table)\n```\n:::\n\n\n
\n\n- `data.frame` & `tibble` are R objects *in-memory*\n- `Table` is an Arrow object *in-memory*\n\n## Watch Your Schemas 👀\n\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(taxi_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[us, tz=America/Vancouver]\ndropoff_datetime: timestamp[us, tz=America/Vancouver]\npassenger_count: int32\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int32\ndropoff_location_id: int32\n```\n\n\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nschema(taxi_table)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSchema\nvendor_name: string\npickup_datetime: timestamp[ms]\ndropoff_datetime: timestamp[ms]\npassenger_count: int64\ntrip_distance: double\npickup_longitude: double\npickup_latitude: double\nrate_code: string\nstore_and_fwd: string\ndropoff_longitude: double\ndropoff_latitude: double\npayment_type: string\nfare_amount: double\nextra: double\nmta_tax: double\ntip_amount: double\ntolls_amount: double\ntotal_amount: double\nimprovement_surcharge: double\ncongestion_surcharge: double\npickup_location_id: int64\ndropoff_location_id: int64\n```\n\n\n:::\n:::\n\n:::\n\n::::\n\n## Data frames\n\n![](images/tabular-structures-r.png)\n\n## Arrow Tables\n\n![](images/tabular-structures-arrow-1.png)\n\n::: notes\nArrow Tables are collections of chunked arrays\n:::\n\n## Table \\| Dataset: A `dplyr` pipeline\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 3 × 4\n vendor_name all_trips shared_trips pct_shared\n \n1 VTS 4238808 1339478 31.6\n2 CMT 2294473 470344 20.5\n3 34115 0 0 \n```\n\n\n:::\n:::\n\n\n
\n\nFunctions available in Arrow dplyr queries: \n\n::: notes\nAll the same capabilities as you practiced with Arrow Dataset\n:::\n\n## Arrow for Efficient In-Memory Processing\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet() |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6567396\n```\n\n\n:::\n:::\n\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,8\"}\nparquet_file |>\n read_parquet() |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.157 0.261 0.509 \n```\n\n\n:::\n:::\n\n\n## Arrow for Efficient In-Memory Processing\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n nrow()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 6567396\n```\n\n\n:::\n:::\n\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,8\"}\nparquet_file |>\n read_parquet(as_data_frame = FALSE) |>\n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.047 0.203 0.220 \n```\n\n\n:::\n:::\n\n\n## Read a Parquet File Selectively\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparquet_file |>\n read_parquet(\n col_select = c(\"vendor_name\", \"passenger_count\"),\n as_data_frame = FALSE\n )\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTable\n6567396 rows x 2 columns\n$vendor_name \n$passenger_count \n```\n\n\n:::\n:::\n\n\n## Selective Reads Are Faster\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"|2,3,4,11\"}\nparquet_file |>\n read_parquet(\n col_select = c(\"vendor_name\", \"passenger_count\"),\n as_data_frame = FALSE\n ) |> \n group_by(vendor_name) |>\n summarise(all_trips = n(),\n shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>\n mutate(pct_shared = shared_trips / all_trips * 100) |>\n collect() |>\n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.258 0.011 0.131 \n```\n\n\n:::\n:::\n\n\n\n:::notes\nNotes: row-based format readers often allow you to specify which columns to read in but the entire row must be read in and the unwanted columns discarded. Parquet’s columnar format allows you to read in only the columns you need, which is faster when you only need a subset of the data.\n:::\n\n## Arrow Table or Dataset?\n\n![](images/2022-09-decision-map.png){.absolute left=\"200\" height=\"550\"}\n\n::: {style=\"font-size: 60%; margin-top: 575px; margin-left: 250px;\"}\n\n:::\n\n## Arrow for Improving Those Sluggish Worklows\n\n- a \"drop-in\" for many dplyr workflows (Arrow Table or Dataset)\n- works when your tabular data get too big for your RAM (Arrow Dataset)\n- provides tools for re-engineering data storage for better performance (`arrow::write_dataset()`)\n\n::: notes\nLot's of ways to speed up sluggish workflows e.g. [writing more performant tidyverse code](https://www.tidyverse.org/blog/2023/04/performant-packages/), use other data frame libraries like data.table or polars, use duckDB or other databases, Spark + splarklyr ... 
However, Arrow offers some attractive features for tackling this challenge, especially for dplyr users.\n:::\n", + "supporting": [ + "5_arrow_single_file_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/materials/6_wrapping_up/execute-results/html.json b/_freeze/materials/6_wrapping_up/execute-results/html.json index 98dcfb7..b202311 100644 --- a/_freeze/materials/6_wrapping_up/execute-results/html.json +++ b/_freeze/materials/6_wrapping_up/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "9d920c068e646165f1cf8a7318ab23f8", + "hash": "295b9a54170accc4d78ea58e63476007", "result": { "engine": "knitr", - "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n\n\n# Wrapping Up: 'Big' Data Analysis Pipelines with R {#wrapping-up}\n\n## Arrow\n\n- efficiently read + filter + join + summarise 1.15 billion rows\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(janitor)\nlibrary(stringr)\n\nnyc_taxi_zones <- read_csv_arrow(\"data/taxi_zone_lookup.csv\",\n as_data_frame = FALSE) |>\n clean_names()\n \nairport_zones <- nyc_taxi_zones |>\n filter(str_detect(zone, \"Airport\")) |>\n pull(location_id, as_vector = TRUE)\n\ndropoff_zones <- nyc_taxi_zones |>\n select(dropoff_location_id = location_id,\n dropoff_zone = zone) |> \n compute() # run the query but don't pull results into R session\n\nairport_pickups <- open_dataset(\"data/nyc-taxi/\") |>\n filter(pickup_location_id %in% airport_zones) |>\n select(\n matches(\"datetime\"),\n matches(\"location_id\")\n ) |>\n left_join(dropoff_zones) |>\n count(dropoff_zone) |>\n arrange(desc(n)) |>\n collect()\n```\n:::\n\n\n\n\n## R\n\n- read + wrangle spatial data + 🤩 graphics\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(sf)\nlibrary(ggplot2)\nlibrary(ggrepel)\nlibrary(stringr)\nlibrary(scales)\n\nmap <- read_sf(\"data/taxi_zones/taxi_zones.shp\") |>\n clean_names() |>\n left_join(airport_pickups,\n by = c(\"zone\" = \"dropoff_zone\")) |>\n arrange(desc(n))\n\narrow_r_together <- ggplot(data = map, aes(fill = n)) +\n geom_sf(size = .1) +\n scale_fill_distiller(\n name = \"Number of trips\",\n labels = label_comma(),\n palette = \"Reds\",\n direction = 1\n ) +\n geom_label_repel(\n stat = \"sf_coordinates\",\n data = map |>\n mutate(zone_label = case_when(\n str_detect(zone, \"Airport\") ~ zone,\n str_detect(zone, \"Times\") ~ zone,\n .default = \"\"\n )),\n mapping = aes(label = zone_label, geometry = geometry),\n max.overlaps = 60,\n label.padding = .3,\n fill = \"white\"\n ) +\n theme_void()\n```\n:::\n\n\n\n\n## Arrow + R Together: {arrow}\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\narrow_r_together\n```\n\n::: {.cell-output-display}\n![](6_wrapping_up_files/figure-revealjs/arrow_r_together-1.png){width=960}\n:::\n:::\n", + "markdown": "---\nfooter: \"[🔗 pos.it/arrow-conf24](https://pos.it/arrow-conf24)\"\nlogo: \"images/logo.png\"\nexecute:\n echo: true\nformat:\n revealjs: \n theme: default\nengine: knitr\neditor: source\n---\n\n\n# Wrapping Up: 'Big' Data Analysis Pipelines with R {#wrapping-up}\n\n## Arrow\n\n- efficiently read + filter + join + summarise 1.15 billion rows\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\nlibrary(janitor)\nlibrary(stringr)\n\nnyc_taxi_zones <- read_csv_arrow(\"data/taxi_zone_lookup.csv\",\n as_data_frame = FALSE) |>\n clean_names()\n \nairport_zones <- nyc_taxi_zones |>\n 
filter(str_detect(zone, \"Airport\")) |>\n pull(location_id, as_vector = TRUE)\n\ndropoff_zones <- nyc_taxi_zones |>\n select(dropoff_location_id = location_id,\n dropoff_zone = zone) |> \n collect(as_data_frame = FALSE)\n\nairport_pickups <- open_dataset(\"data/nyc-taxi/\") |>\n filter(pickup_location_id %in% airport_zones) |>\n select(\n matches(\"datetime\"),\n matches(\"location_id\")\n ) |>\n left_join(dropoff_zones) |>\n count(dropoff_zone) |>\n arrange(desc(n)) |>\n collect()\n```\n:::\n\n\n## R\n\n- read + wrangle spatial data + 🤩 graphics\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(sf)\nlibrary(ggplot2)\nlibrary(ggrepel)\nlibrary(stringr)\nlibrary(scales)\n\nmap <- read_sf(\"data/taxi_zones/taxi_zones.shp\") |>\n clean_names() |>\n left_join(airport_pickups,\n by = c(\"zone\" = \"dropoff_zone\")) |>\n arrange(desc(n))\n\narrow_r_together <- ggplot(data = map, aes(fill = n)) +\n geom_sf(size = .1) +\n scale_fill_distiller(\n name = \"Number of trips\",\n labels = label_comma(),\n palette = \"Reds\",\n direction = 1\n ) +\n geom_label_repel(\n stat = \"sf_coordinates\",\n data = map |>\n mutate(zone_label = case_when(\n str_detect(zone, \"Airport\") ~ zone,\n str_detect(zone, \"Times\") ~ zone,\n .default = \"\"\n )),\n mapping = aes(label = zone_label, geometry = geometry),\n max.overlaps = 60,\n label.padding = .3,\n fill = \"white\"\n ) +\n theme_void()\n```\n:::\n\n\n## Arrow + R Together: {arrow}\n\n\n::: {.cell}\n\n```{.r .cell-code}\narrow_r_together\n```\n\n::: {.cell-output-display}\n![](6_wrapping_up_files/figure-revealjs/arrow_r_together-1.png){width=960}\n:::\n:::\n", "supporting": [ "6_wrapping_up_files" ], diff --git a/_freeze/materials/6_wrapping_up/figure-revealjs/arrow_r_together-1.png b/_freeze/materials/6_wrapping_up/figure-revealjs/arrow_r_together-1.png index e27b280..358c16e 100644 Binary files a/_freeze/materials/6_wrapping_up/figure-revealjs/arrow_r_together-1.png and b/_freeze/materials/6_wrapping_up/figure-revealjs/arrow_r_together-1.png differ diff --git a/_site/materials/0_housekeeping.html b/_site/materials/0_housekeeping.html index c76174c..c12d2f6 100644 --- a/_site/materials/0_housekeeping.html +++ b/_site/materials/0_housekeeping.html @@ -434,6 +434,15 @@

We Assume

  • You have data in your life that is too large to fit into memory or sluggish in memory
  • You want to learn how to engineer your data storage for more performant access and analysis
+
+
+    Setup
+
+      • Log onto Workbench at the following URL:
+      • Create a new session; select “Resource Profile: Large”
+      • Run usethis::use_course("posit-conf-2024/arrow")
+      • Open data/setup.R and run the script
+
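A minimal R sketch of the setup steps above (the Workbench URL is handed out in the room; sourcing the script is an assumption about how you choose to run it):

```r
# Download and open the workshop project (usethis is assumed to be installed).
usethis::use_course("posit-conf-2024/arrow")

# From inside the opened project, run the data download script.
source("data/setup.R")
```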

    diff --git a/_site/materials/3_data_engineering-exercises.html b/_site/materials/3_data_engineering-exercises.html index 5c768cb..0fe945e 100644 --- a/_site/materials/3_data_engineering-exercises.html +++ b/_site/materials/3_data_engineering-exercises.html @@ -267,8 +267,7 @@

    Schemas

    seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
    -  format = "csv"
    -)
    + format = "csv")
    @@ -293,28 +292,28 @@

    Schemas

    seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
       format = "csv",
    -  skip = 1,
    -  schema(
    -    UsageClass = utf8(),
    -    CheckoutType = utf8(),
    -    MaterialType = utf8(),
    -    CheckoutYear = int64(),
    -    CheckoutMonth = int64(),
    -    Checkouts = int64(),
    -    Title = utf8(),
    -    ISBN = string(), #or utf8()
    -    Creator = utf8(),
    -    Subjects = utf8(),
    -    Publisher = utf8(),
    -    PublicationYear = utf8()
    -  )
    +  schema(
    +    UsageClass = utf8(),
    +    CheckoutType = utf8(),
    +    MaterialType = utf8(),
    +    CheckoutYear = int64(),
    +    CheckoutMonth = int64(),
    +    Checkouts = int64(),
    +    Title = utf8(),
    +    ISBN = string(), #or utf8()
    +    Creator = utf8(),
    +    Subjects = utf8(),
    +    Publisher = utf8(),
    +    PublicationYear = utf8()
    +  ),
    +    skip = 1,
     )

    or

    seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
       format = "csv",
    -  col_types = schema(ISBN = string()) #utf8()
    +  col_types = schema(ISBN = string()) # or utf8()
     )
     seattle_csv
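A quick check that the override took effect (a sketch; `seattle_csv` is the Dataset created just above):

```r
# ISBN should now be reported as string rather than null.
seattle_csv$schema
```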
    @@ -407,7 +406,7 @@

    Schemas

    system.time()
       user  system elapsed 
    - 11.474   1.084  11.003 
    + 10.651 1.091 10.333

    Querying 42 million rows of data stored in a CSV on disk in ~10 seconds, not too bad.
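The timing above comes from piping an Arrow dplyr query straight into `system.time()`. A sketch of that pattern, assuming `seattle_csv` from the earlier slides and using the exercise's grouped summary (not necessarily the exact query that was timed):

```r
library(arrow)
library(dplyr)

seattle_csv |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect() |>      # pulls the result into R, which forces the query to run
  system.time()     # elapsed time includes scanning the ~9GB CSV
```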

    @@ -457,7 +456,7 @@

    Parquet

    system.time()
       user  system elapsed 
    -  2.076   0.287   0.646 
    + 1.634 0.345 0.558

    A much faster compute time for the query when the on-disk data is stored in the Parquet format.
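A sketch of how that Parquet copy can be produced and then queried the same way (the output path is an assumption; later slides refer to it as `seattle_parquet`):

```r
seattle_parquet <- "data/seattle-library-checkouts-parquet"  # assumed path

seattle_csv |>
  write_dataset(path = seattle_parquet, format = "parquet")

open_dataset(seattle_parquet) |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect() |>
  system.time()
```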

    @@ -517,7 +516,7 @@

    Partitioning

    system.time()
       user  system elapsed 
    -  0.965   0.160   0.409 
    + 0.777 0.072 0.296

    Total number of Checkouts in September of 2019 using partitioned Parquet data by CheckoutYear and CheckoutMonth:
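A sketch of the partitioned layout and the query being timed here (the path matches the later Performance Review slide; the write step and exact pipeline are assumptions):

```r
seattle_partitioned <- "data/seattle-library-checkouts"

# Grouping before write_dataset() gives Hive-style partitions per group.
seattle_csv |>
  group_by(CheckoutYear, CheckoutMonth) |>
  write_dataset(path = seattle_partitioned, format = "parquet")

open_dataset(seattle_partitioned) |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```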

    @@ -529,7 +528,7 @@

    Partitioning

    system.time()
       user  system elapsed 
    -  0.058   0.006   0.052 
    + 0.034 0.005 0.030

    Faster compute time because the filter() call is based on the partitions.

    diff --git a/_site/materials/3_data_engineering.html b/_site/materials/3_data_engineering.html index 5259795..757fada 100644 --- a/_site/materials/3_data_engineering.html +++ b/_site/materials/3_data_engineering.html @@ -447,7 +447,7 @@

    arrow::open_dataset() with a CSV

seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
-  format = "csv")
+  format = "csv")
seattle_csv
    FileSystemDataset with 1 csv file
    @@ -592,23 +592,23 @@ 

    Let’s Control the Schema

    Let’s Control the Schema

    Schema defines column names and types, so we need to skip the first row (skip = 1):

    -
    seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
    +
    seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
       format = "csv",
    -  skip = 1,
    -  schema = schema(
    -    UsageClass = utf8(),
    -    CheckoutType = utf8(),
    -    MaterialType = utf8(),
    -    CheckoutYear = int64(),
    -    CheckoutMonth = int64(),
    -    Checkouts = int64(),
    -    Title = utf8(),
    -    ISBN = string(), #utf8()
    -    Creator = utf8(),
    -    Subjects = utf8(),
    -    Publisher = utf8(),
    -    PublicationYear = utf8()
    -  )
    +  schema = schema(
    +    UsageClass = utf8(),
    +    CheckoutType = utf8(),
    +    MaterialType = utf8(),
    +    CheckoutYear = int64(),
    +    CheckoutMonth = int64(),
    +    Checkouts = int64(),
    +    Title = utf8(),
    +    ISBN = string(), #utf8()
    +    Creator = utf8(),
    +    Subjects = utf8(),
    +    Publisher = utf8(),
    +    PublicationYear = utf8()
    +  ),
    +    skip = 1,
     )
     seattle_csv
    @@ -660,7 +660,7 @@

    Let’s Control the Schema

    Your Turn

      -
    1. The first few thousand rows of ISBN are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with open_dataset() and ensure the correct data type for ISBN is <string> instead of the <null> interpreted by Arrow.

+  1. The first few thousand rows of ISBN are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with open_dataset() and ensure the correct data type for ISBN is <string> (or the alias <utf8>) instead of the <null> interpreted by Arrow.

  2. Once you have a Dataset object with the metadata you are after, count the number of Checkouts by CheckoutYear and arrange the result by CheckoutYear.

    ➡️ Data Storage Engineering Exercises Page

    @@ -709,7 +709,7 @@

    9GB CSV file + arrow + dplyr

    system.time()
       user  system elapsed 
    - 11.581   1.136  11.117 
    + 10.688 1.099 10.451

    42 million rows – not bad, but could be faster….

    @@ -769,7 +769,7 @@

    Storage: Parquet vs CSV

    file <- list.files(seattle_parquet)
     file.size(file.path(seattle_parquet, file)) / 10**9
    -
    [1] 4.423348
    +
    [1] 4.424267


    @@ -794,7 +794,7 @@

    4.5GB Parquet file + arrow + dplyr

    system.time()
       user  system elapsed 
    -  2.018   0.265   0.595 
    + 1.771 0.431 0.568

    42 million rows – much better! But could be even faster….

    @@ -887,7 +887,7 @@

    What Did We “Engineer”?

    # A tibble: 18 × 2
        files                            size_GB
        <chr>                              <dbl>
    - 1 CheckoutYear=2005/part-0.parquet   0.114
    + 1 CheckoutYear=2005/part-0.parquet   0.115
      2 CheckoutYear=2006/part-0.parquet   0.172
      3 CheckoutYear=2007/part-0.parquet   0.186
      4 CheckoutYear=2008/part-0.parquet   0.204
    @@ -901,7 +901,7 @@ 

    What Did We “Engineer”?

12 CheckoutYear=2016/part-0.parquet   0.315
13 CheckoutYear=2017/part-0.parquet   0.319
14 CheckoutYear=2018/part-0.parquet   0.306
-15 CheckoutYear=2019/part-0.parquet   0.302
+15 CheckoutYear=2019/part-0.parquet   0.303
16 CheckoutYear=2020/part-0.parquet   0.158
17 CheckoutYear=2021/part-0.parquet   0.240
18 CheckoutYear=2022/part-0.parquet   0.252
    @@ -922,7 +922,7 @@

    4.5GB partitioned Parquet files + arrow + dplyr

    system.time()
       user  system elapsed 
    -  1.640   0.220   0.267 
    + 1.742 0.385 0.366


    @@ -936,10 +936,6 @@

    Your Turn

    ➡️ Data Storage Engineering Exercises Page

    -
    -

    Partitions & NA Values

    -

    ADD content

    -

    Partition Design

    @@ -953,23 +949,72 @@

    Partition Design

    +
    +

    Partitions & NA Values

    +

    Default:

    +
    +
    partition_na_default_path <- "data/na-partition-default"
    +
    +write_dataset(starwars,
    +              partition_na_default_path,
    +              partitioning = "hair_color")
    +
    +list.files(partition_na_default_path)
    +
    +
     [1] "hair_color=__HIVE_DEFAULT_PARTITION__"
    + [2] "hair_color=auburn"                    
    + [3] "hair_color=auburn%2C%20grey"          
    + [4] "hair_color=auburn%2C%20white"         
    + [5] "hair_color=black"                     
    + [6] "hair_color=blond"                     
    + [7] "hair_color=blonde"                    
    + [8] "hair_color=brown"                     
    + [9] "hair_color=brown%2C%20grey"           
    +[10] "hair_color=grey"                      
    +[11] "hair_color=none"                      
    +[12] "hair_color=white"                     
    +
    +
    +
    +
    +
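A quick way to see how that `__HIVE_DEFAULT_PARTITION__` directory round-trips (an assumption about read-back behaviour: arrow maps it back to `NA` for hive-style partitions):

```r
open_dataset(partition_na_default_path) |>
  count(hair_color) |>
  collect()
# expect an NA hair_color row coming from the __HIVE_DEFAULT_PARTITION__ files
```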

    Partitions & NA Values

    +

    Custom:

    +
    +
    partition_na_custom_path <- "data/na-partition-custom"
    +
    +write_dataset(starwars,
    +              partition_na_custom_path,
    +              partitioning = hive_partition(hair_color = string(),
    +                                            null_fallback = "no_color"))
    +
    +list.files(partition_na_custom_path)
    +
    +
     [1] "hair_color=auburn"            "hair_color=auburn%2C%20grey" 
    + [3] "hair_color=auburn%2C%20white" "hair_color=black"            
    + [5] "hair_color=blond"             "hair_color=blonde"           
    + [7] "hair_color=brown"             "hair_color=brown%2C%20grey"  
    + [9] "hair_color=grey"              "hair_color=no_color"         
    +[11] "hair_color=none"              "hair_color=white"            
    +
    +
    +

    Performance Review: Single CSV

    How long does it take to calculate the number of books checked out in each month of 2021?


    -
    open_dataset(sources = "data/seattle-library-checkouts.csv", 
    -  format = "csv") |> 
    -
    -  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    -  group_by(CheckoutMonth) |>
    -  summarise(TotalCheckouts = sum(Checkouts)) |>
    -  arrange(desc(CheckoutMonth)) |>
    -  collect() |>
    -  system.time()
    +
    open_dataset(sources = "data/seattle-library-checkouts.csv", 
    +  format = "csv") |> 
    +
    +  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    +  group_by(CheckoutMonth) |>
    +  summarise(TotalCheckouts = sum(Checkouts)) |>
    +  arrange(desc(CheckoutMonth)) |>
    +  collect() |>
    +  system.time()
       user  system elapsed 
    - 13.362   1.763  12.438 
    + 11.718 1.106 11.250
    @@ -978,17 +1023,17 @@

    Performance Review: Partitioned Parquet

    How long does it take to calculate the number of books checked out in each month of 2021?


    -
    open_dataset(sources = "data/seattle-library-checkouts",
    -             format = "parquet") |> 
    -  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    -  group_by(CheckoutMonth) |>
    -  summarise(TotalCheckouts = sum(Checkouts)) |>
    -  arrange(desc(CheckoutMonth)) |>
    -  collect() |> 
    -  system.time()
    +
    open_dataset(sources = "data/seattle-library-checkouts",
    +             format = "parquet") |> 
    +  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
    +  group_by(CheckoutMonth) |>
    +  summarise(TotalCheckouts = sum(Checkouts)) |>
    +  arrange(desc(CheckoutMonth)) |>
    +  collect() |> 
    +  system.time()
       user  system elapsed 
    -  0.330   0.039   0.091 
    + 0.224 0.040 0.068
    diff --git a/_site/materials/5_arrow_single_file.html b/_site/materials/5_arrow_single_file.html index 06a9782..bbc5840 100644 --- a/_site/materials/5_arrow_single_file.html +++ b/_site/materials/5_arrow_single_file.html @@ -431,16 +431,16 @@

    Read a Parquet File (tibble)

    # A tibble: 6,567,396 × 22
        vendor_name pickup_datetime     dropoff_datetime    passenger_count
        <chr>       <dttm>              <dttm>                        <int>
    - 1 VTS         2019-09-01 06:14:09 2019-09-01 06:31:52               2
    - 2 VTS         2019-09-01 06:36:17 2019-09-01 07:12:44               1
    - 3 VTS         2019-09-01 06:29:19 2019-09-01 06:54:13               1
    - 4 CMT         2019-09-01 06:33:09 2019-09-01 06:52:14               2
    - 5 VTS         2019-09-01 06:57:43 2019-09-01 07:26:21               1
    - 6 CMT         2019-09-01 06:59:16 2019-09-01 07:28:12               1
    - 7 CMT         2019-09-01 06:20:06 2019-09-01 06:52:19               1
    - 8 CMT         2019-09-01 06:27:54 2019-09-01 06:32:56               0
    - 9 CMT         2019-09-01 06:35:08 2019-09-01 06:55:51               0
    -10 CMT         2019-09-01 06:19:37 2019-09-01 06:30:52               1
    + 1 CMT         2019-08-31 18:09:30 2019-08-31 18:15:42               1
    + 2 CMT         2019-08-31 18:26:30 2019-08-31 18:44:31               1
    + 3 CMT         2019-08-31 18:39:35 2019-08-31 19:15:55               2
    + 4 VTS         2019-08-31 18:12:26 2019-08-31 18:15:17               4
    + 5 VTS         2019-08-31 18:43:16 2019-08-31 18:53:50               1
    + 6 VTS         2019-08-31 18:26:13 2019-08-31 18:45:35               1
    + 7 CMT         2019-08-31 18:34:52 2019-08-31 18:42:03               1
    + 8 CMT         2019-08-31 18:50:02 2019-08-31 18:58:16               1
    + 9 CMT         2019-08-31 18:08:02 2019-08-31 18:14:44               0
    +10 VTS         2019-08-31 18:11:38 2019-08-31 18:26:47               1
     # ℹ 6,567,386 more rows
     # ℹ 18 more variables: trip_distance <dbl>, pickup_longitude <dbl>,
     #   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,
    @@ -506,6 +506,70 @@ 

tibble <-> Table <-> data.frame
  • Table is an Arrow object in-memory
  • +
    +

    Watch Your Schemas 👀

    +
    +
    +
    +
    schema(taxi_df)
    +
    +
    Schema
    +vendor_name: string
    +pickup_datetime: timestamp[us, tz=America/Vancouver]
    +dropoff_datetime: timestamp[us, tz=America/Vancouver]
    +passenger_count: int32
    +trip_distance: double
    +pickup_longitude: double
    +pickup_latitude: double
    +rate_code: string
    +store_and_fwd: string
    +dropoff_longitude: double
    +dropoff_latitude: double
    +payment_type: string
    +fare_amount: double
    +extra: double
    +mta_tax: double
    +tip_amount: double
    +tolls_amount: double
    +total_amount: double
    +improvement_surcharge: double
    +congestion_surcharge: double
    +pickup_location_id: int32
    +dropoff_location_id: int32
    +
    +
    +
    +
    +
    schema(taxi_table)
    +
    +
    Schema
    +vendor_name: string
    +pickup_datetime: timestamp[ms]
    +dropoff_datetime: timestamp[ms]
    +passenger_count: int64
    +trip_distance: double
    +pickup_longitude: double
    +pickup_latitude: double
    +rate_code: string
    +store_and_fwd: string
    +dropoff_longitude: double
    +dropoff_latitude: double
    +payment_type: string
    +fare_amount: double
    +extra: double
    +mta_tax: double
    +tip_amount: double
    +tolls_amount: double
    +total_amount: double
    +improvement_surcharge: double
    +congestion_surcharge: double
    +pickup_location_id: int64
    +dropoff_location_id: int64
    +
    +
    +
    +
    +
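The two schemas differ because the tibble was converted to R types on read (POSIXct picks up the session time zone, integers become int32), while the Table keeps the types stored in the Parquet file. A small sketch to compare one field (extracting a field by name with `[[` is assumed to work here):

```r
schema(taxi_df)[["pickup_datetime"]]     # timestamp[us, tz=America/Vancouver]
schema(taxi_table)[["pickup_datetime"]]  # timestamp[ms]
```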

    Data frames

    @@ -530,13 +594,13 @@

    Arrow Tables

    Table | Dataset: A dplyr pipeline

    -
    parquet_file |>
    -  read_parquet(as_data_frame = FALSE) |>
    -  group_by(vendor_name) |>
    -  summarise(all_trips = n(),
    -            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    -  mutate(pct_shared = shared_trips / all_trips * 100) |>
    -  collect()
    +
    parquet_file |>
    +  read_parquet(as_data_frame = FALSE) |>
    +  group_by(vendor_name) |>
    +  summarise(all_trips = n(),
    +            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    +  mutate(pct_shared = shared_trips / all_trips * 100) |>
    +  collect()
    # A tibble: 3 × 4
       vendor_name all_trips shared_trips pct_shared
    @@ -565,63 +629,63 @@ 

    Table | Dataset: A dplyr pipeline

    Arrow for Efficient In-Memory Processing

    -
    parquet_file |>
    -  read_parquet() |>
    -  nrow()
    +
    parquet_file |>
    +  read_parquet() |>
    +  nrow()
    [1] 6567396


    -
    parquet_file |>
    -  read_parquet() |>
    -  group_by(vendor_name) |>
    -  summarise(all_trips = n(),
    -            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    -  mutate(pct_shared = shared_trips / all_trips * 100) |>
    -  collect() |>
    -  system.time()
    +
    parquet_file |>
    +  read_parquet() |>
    +  group_by(vendor_name) |>
    +  summarise(all_trips = n(),
    +            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    +  mutate(pct_shared = shared_trips / all_trips * 100) |>
    +  collect() |>
    +  system.time()
       user  system elapsed 
    -  2.214   0.575   0.814 
    + 1.157 0.261 0.509

    Arrow for Efficient In-Memory Processing

    -
    parquet_file |>
    -  read_parquet(as_data_frame = FALSE) |>
    -  nrow()
    +
    parquet_file |>
    +  read_parquet(as_data_frame = FALSE) |>
    +  nrow()
    [1] 6567396


    -
    parquet_file |>
    -  read_parquet(as_data_frame = FALSE) |>
    -  group_by(vendor_name) |>
    -  summarise(all_trips = n(),
    -            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    -  mutate(pct_shared = shared_trips / all_trips * 100) |>
    -  collect() |>
    -  system.time()
    +
    parquet_file |>
    +  read_parquet(as_data_frame = FALSE) |>
    +  group_by(vendor_name) |>
    +  summarise(all_trips = n(),
    +            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    +  mutate(pct_shared = shared_trips / all_trips * 100) |>
    +  collect() |>
    +  system.time()
       user  system elapsed 
    -  1.995   0.343   0.366 
    + 1.047 0.203 0.220

    Read a Parquet File Selectively

    -
    parquet_file |>
    -  read_parquet(
    -    col_select = c("vendor_name", "passenger_count"),
    -    as_data_frame = FALSE
    -  )
    +
    parquet_file |>
    +  read_parquet(
    +    col_select = c("vendor_name", "passenger_count"),
    +    as_data_frame = FALSE
    +  )
    Table
     6567396 rows x 2 columns
    @@ -633,20 +697,20 @@ 

    Read a Parquet File Selectively

    Selective Reads Are Faster

    -
    parquet_file |>
    -  read_parquet(
    -    col_select = c("vendor_name", "passenger_count"),
    -    as_data_frame = FALSE
    -  ) |> 
    -  group_by(vendor_name) |>
    -  summarise(all_trips = n(),
    -            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    -  mutate(pct_shared = shared_trips / all_trips * 100) |>
    -  collect() |>
    -  system.time()
    +
    parquet_file |>
    +  read_parquet(
    +    col_select = c("vendor_name", "passenger_count"),
    +    as_data_frame = FALSE
    +  ) |> 
    +  group_by(vendor_name) |>
    +  summarise(all_trips = n(),
    +            shared_trips = sum(passenger_count > 1, na.rm = TRUE)) |>
    +  mutate(pct_shared = shared_trips / all_trips * 100) |>
    +  collect() |>
    +  system.time()
       user  system elapsed 
    -  0.323   0.088   0.234 
    + 0.258 0.011 0.131