Merge branch 'main' into nic-plane-updates

thisisnic authored Aug 12, 2024
2 parents 78feeb2 + e0d88d5 commit 62eaad4
Showing 9 changed files with 196 additions and 104 deletions.
@@ -1,8 +1,8 @@
{
"hash": "f7719c7779de7c3a13a10a213a5cadda",
"hash": "509cb6a2aff452bfa64feede8e3063c0",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 
6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.696 1.227 10.449 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.576 0.377 0.548 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.779 0.080 0.301 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.035 0.005 0.029 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the correct data types, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 
6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.660 1.244 10.399 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.758 0.445 0.557 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.781 0.085 0.297 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.033 0.006 0.031 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
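The freeze file above caches the executed exercise notebook; the pattern it records is overriding a single column's inferred type when opening a CSV-backed dataset. A minimal sketch of that pattern, assuming the checkouts CSV sits at `data/seattle-library-checkouts.csv` as in the course materials:

```r
library(arrow)
library(dplyr)

# Arrow infers column types from the first rows of the file; the leading
# ISBN values are blank, so ISBN comes back as <null> unless overridden.
seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
                            format = "csv",
                            col_types = schema(ISBN = string()))  # or utf8()

# The aggregation is pushed down to Arrow; collect() pulls the result into R.
seattle_csv |>
  count(CheckoutYear, wt = Checkouts) |>
  arrange(CheckoutYear) |>
  collect()
```

The diff's other solution spells out all twelve columns with `schema()` and passes `skip = 1` so the CSV's header row is not read as data.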

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions _site/materials/3_data_engineering-exercises.html
@@ -285,7 +285,7 @@ <h1>Schemas</h1>
<div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab">
<ol type="1">
<li><p>The first few thousand rows of <code>ISBN</code> are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with <code>open_dataset()</code> and ensure the correct data type for <code>ISBN</code> is <code>&lt;string&gt;</code> (or the alias <code>&lt;utf8&gt;</code>) instead of the <code>&lt;null&gt;</code> interpreted by Arrow.</p></li>
<li><p>Once you have a <code>Dataset</code> object with the metadata you are after, count the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arrange the result by <code>CheckoutYear</code>.</p></li>
<li><p>Once you have a <code>Dataset</code> object with the correct data types, count the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arrange the result by <code>CheckoutYear</code>.</p></li>
</ol>
</div>
<div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab">
@@ -423,7 +423,7 @@ <h1>Schemas</h1>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
10.696 1.227 10.449 </code></pre>
10.660 1.244 10.399 </code></pre>
</div>
</div>
<p>Querying 42 million rows of data stored in a CSV on disk in ~10 seconds, not too bad.</p>
@@ -457,7 +457,7 @@ <h1>Parquet</h1>
<div class="tab-content">
<div id="tabset-2-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-2-1-tab">
<ol type="1">
<li>Re-run the query counting the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arranging the result by <code>CheckoutYear</code>, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?</li>
<li>Re-run the query counting the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arranging the result by <code>CheckoutYear</code>, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?</li>
</ol>
</div>
<div id="tabset-2-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-2-2-tab">
@@ -473,7 +473,7 @@ <h1>Parquet</h1>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
1.576 0.377 0.548 </code></pre>
1.758 0.445 0.557 </code></pre>
</div>
</div>
<p>A <em>much</em> faster compute time for the query when the on-disk data is stored in the Parquet format.</p>
@@ -533,7 +533,7 @@ <h1>Partitioning</h1>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
0.779 0.080 0.301 </code></pre>
0.781 0.085 0.297 </code></pre>
</div>
</div>
<p>Total number of Checkouts in September of 2019 using partitioned Parquet data by <code>CheckoutYear</code> and <code>CheckoutMonth</code>:</p>
@@ -545,7 +545,7 @@ <h1>Partitioning</h1>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
0.035 0.005 0.029 </code></pre>
0.033 0.006 0.031 </code></pre>
</div>
</div>
<p>Faster compute time because the <code>filter()</code> call is based on the partitions.</p>
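The timing changes above come from re-rendering the partitioning exercise. The technique being timed is Hive-style partitioning: grouping before `write_dataset()` determines the directory layout, and a `filter()` on the partition keys lets Arrow skip every non-matching file. A sketch of the pattern from the materials, assuming `seattle_csv` was opened as shown earlier:

```r
library(arrow)
library(dplyr)

# Grouping keys become partition directories, e.g.
# data/seattle-library-checkouts/CheckoutYear=2019/part-0.parquet
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = "data/seattle-library-checkouts",
                format = "parquet")

# Filtering on the partition key prunes every other year's files, which is
# why the freeze file records ~0.03 s elapsed here versus ~0.3 s against
# the dataset partitioned by CheckoutType.
open_dataset("data/seattle-library-checkouts") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```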
