Merge pull request #7 from posit-conf-2024/review-tweaks
Minor tweaks from practice review
stephhazlitt authored Aug 8, 2024
2 parents 160ab27 + 7a11c39 commit aa439f8
Showing 17 changed files with 366 additions and 181 deletions.
8 changes: 5 additions & 3 deletions _freeze/materials/0_housekeeping/execute-results/html.json


@@ -1,8 +1,8 @@
 {
-  "hash": "949ad5f0c58f263cb46500cfd640fc1d",
+  "hash": "fa36122b964d2adb9ad5f21d7c58c8cc",
   "result": {
     "engine": "knitr",
-    "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n\n\n# Schemas\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\"\n)\n```\n:::\n\n\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n skip = 1,\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n )\n)\n```\n:::\n\n\n\n\nor\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) #utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n\n\nor\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\n\n\nTiming the query:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 11.474 1.084 11.003 \n```\n\n\n:::\n:::\n\n\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 2.076 0.287 0.646 \n```\n\n\n:::\n:::\n\n\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.965 0.160 0.409 \n```\n\n\n:::\n:::\n\n\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.058 0.006 0.052 \n```\n\n\n:::\n:::\n\n\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
+    "markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\n```\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.651 1.091 10.333 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.634 0.345 0.558 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.777 0.072 0.296 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.034 0.005 0.030 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
     "supporting": [],
     "filters": [
       "rmarkdown/pagebreak.lua"
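The fast final timings in the rendered results come from partition pruning, not Parquet alone: `write_dataset()` on a grouped table writes Hive-style directories named after the partition values, so a `filter()` on those columns opens only the matching files. A sketch of the pattern from the exercises, assuming arrow's default partition layout (the example path in the comment is illustrative):

```r
library(arrow)
library(dplyr)

# Writing from a grouped table partitions by the grouping column,
# producing paths such as:
#   data/seattle-library-checkouts/CheckoutYear=2019/part-0.parquet
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = "data/seattle-library-checkouts",
                format = "parquet")

# The CheckoutYear filter is resolved against directory names, so only
# the 2019 files are scanned before the sum is computed.
open_dataset("data/seattle-library-checkouts") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect()
```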
