Merge branch 'main' into nic-plane-updates

thisisnic authored Aug 12, 2024
2 parents 78feeb2 + e0d88d5 commit 62eaad4
Showing 9 changed files with 196 additions and 104 deletions.
@@ -1,8 +1,8 @@
{
"hash": "f7719c7779de7c3a13a10a213a5cadda",
"hash": "509cb6a2aff452bfa64feede8e3063c0",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 
6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.696 1.227 10.449 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.576 0.377 0.548 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.779 0.080 0.301 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.035 0.005 0.029 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"markdown": "---\ntitle: \"Data Engineering with Arrow Exercises\"\nexecute:\n echo: true\n messages: false\n warning: false\neditor: source \n---\n\n\n# Schemas\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(arrow)\nlibrary(dplyr)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\")\n```\n:::\n\n\n::: {#exercise-schema .callout-tip}\n# Data Types & Controlling the Schema\n\n::: panel-tabset\n## Problems\n\n1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` (or the alias `<utf8>`) instead of the `<null>` interpreted by Arrow.\n\n2. Once you have a `Dataset` object with the correct data types, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n schema(\n UsageClass = utf8(),\n CheckoutType = utf8(),\n MaterialType = utf8(),\n CheckoutYear = int64(),\n CheckoutMonth = int64(),\n Checkouts = int64(),\n Title = utf8(),\n ISBN = string(), #or utf8()\n Creator = utf8(),\n Subjects = utf8(),\n Publisher = utf8(),\n PublicationYear = utf8()\n ),\n skip = 1,\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(sources = \"data/seattle-library-checkouts.csv\",\n format = \"csv\",\n col_types = schema(ISBN = string()) # or utf8()\n)\nseattle_csv\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nFileSystemDataset with 1 csv file\n12 columns\nUsageClass: string\nCheckoutType: string\nMaterialType: string\nCheckoutYear: int64\nCheckoutMonth: int64\nCheckouts: int64\nTitle: string\nISBN: string\nCreator: string\nSubjects: string\nPublisher: string\nPublicationYear: string\n```\n\n\n:::\n:::\n\n\n## Solution 2\n\nThe number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear `sum(Checkouts)`\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nor\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 18 × 2\n CheckoutYear n\n <int> <int>\n 1 2005 3798685\n 2 2006 6599318\n 3 2007 7126627\n 4 2008 8438486\n 5 2009 9135167\n 6 2010 8608966\n 7 2011 8321732\n 8 2012 8163046\n 9 2013 9057096\n10 2014 9136081\n11 2015 9084179\n12 2016 9021051\n13 2017 9231648\n14 2018 9149176\n15 2019 9199083\n16 2020 
6053717\n17 2021 7361031\n18 2022 7001989\n```\n\n\n:::\n:::\n\n\nTiming the query:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 10.660 1.244 10.399 \n```\n\n\n:::\n:::\n\n\nQuerying 42 million rows of data stored in a CSV on disk in \\~10 seconds, not too bad.\n:::\n:::\n\n# Parquet\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nseattle_csv |>\n write_dataset(path = seattle_parquet,\n format = \"parquet\")\n```\n:::\n\n\n::: {#exercise-dataset .callout-tip}\n# Parquet\n\n::: panel-tabset\n## Problem\n\n1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?\n\n## Solution 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet <- \"data/seattle-library-checkouts-parquet\"\n\nopen_dataset(sources = seattle_parquet, \n format = \"parquet\") |>\n group_by(CheckoutYear) |>\n summarise(sum(Checkouts)) |>\n arrange(CheckoutYear) |> \n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 1.758 0.445 0.557 \n```\n\n\n:::\n:::\n\n\nA *much* faster compute time for the query when the on-disk data is stored in the Parquet format.\n:::\n:::\n\n# Partitioning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_parquet_part <- \"data/seattle-library-checkouts\"\n\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = seattle_parquet_part,\n format = \"parquet\")\n```\n:::\n\n\n::: callout-tip\n# Partitioning\n\n::: panel-tabset\n## Problems\n\n1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.\n\n2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. 
Did you find a difference in compute time?\n\n## Solution 1\n\nWriting the data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_checkouttype <- \"data/seattle-library-checkouts-type\"\n\nseattle_csv |>\n group_by(CheckoutType) |>\n write_dataset(path = seattle_checkouttype,\n format = \"parquet\")\n```\n:::\n\n\n## Solution 2\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(sources = \"data/seattle-library-checkouts-type\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.781 0.085 0.297 \n```\n\n\n:::\n:::\n\n\nTotal number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nopen_dataset(\"data/seattle-library-checkouts\") |> \n filter(CheckoutYear == 2019, CheckoutMonth == 9) |> \n summarise(TotalCheckouts = sum(Checkouts)) |>\n collect() |> \n system.time()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n user system elapsed \n 0.033 0.006 0.031 \n```\n\n\n:::\n:::\n\n\nFaster compute time because the `filter()` call is based on the partitions.\n:::\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
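The freeze file above caches the executed exercise notebook; the pattern it records is overriding a single column's inferred type when opening a CSV-backed dataset. A minimal sketch of that pattern, assuming the checkouts CSV sits at `data/seattle-library-checkouts.csv` as in the course materials:

```r
library(arrow)
library(dplyr)

# Arrow infers column types from the first rows of the file; the leading
# ISBN values are blank, so ISBN comes back as <null> unless overridden.
seattle_csv <- open_dataset(sources = "data/seattle-library-checkouts.csv",
                            format = "csv",
                            col_types = schema(ISBN = string()))  # or utf8()

# The aggregation is pushed down to Arrow; collect() pulls the result into R.
seattle_csv |>
  count(CheckoutYear, wt = Checkouts) |>
  arrange(CheckoutYear) |>
  collect()
```

The diff's other solution spells out all twelve columns with `schema()` and passes `skip = 1` so the CSV's header row is not read as data.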

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions _site/materials/3_data_engineering-exercises.html
@@ -285,7 +285,7 @@ <h1>Schemas</h1>
<div id="tabset-1-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-1-1-tab">
<ol type="1">
<li><p>The first few thousand rows of <code>ISBN</code> are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with <code>open_dataset()</code> and ensure the correct data type for <code>ISBN</code> is <code>&lt;string&gt;</code> (or the alias <code>&lt;utf8&gt;</code>) instead of the <code>&lt;null&gt;</code> interpreted by Arrow.</p></li>
<li><p>Once you have a <code>Dataset</code> object with the metadata you are after, count the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arrange the result by <code>CheckoutYear</code>.</p></li>
<li><p>Once you have a <code>Dataset</code> object with the correct data types, count the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arrange the result by <code>CheckoutYear</code>.</p></li>
</ol>
</div>
<div id="tabset-1-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-1-2-tab">
@@ -423,7 +423,7 @@ <h1>Schemas</h1>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
10.696 1.227 10.449 </code></pre>
10.660 1.244 10.399 </code></pre>
</div>
</div>
<p>Querying 42 million rows of data stored in a CSV on disk in ~10 seconds, not too bad.</p>
@@ -457,7 +457,7 @@ <h1>Parquet</h1>
<div class="tab-content">
<div id="tabset-2-1" class="tab-pane active" role="tabpanel" aria-labelledby="tabset-2-1-tab">
<ol type="1">
<li>Re-run the query counting the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arranging the result by <code>CheckoutYear</code>, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time?</li>
<li>Re-run the query counting the number of <code>Checkouts</code> by <code>CheckoutYear</code> and arranging the result by <code>CheckoutYear</code>, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?</li>
</ol>
</div>
<div id="tabset-2-2" class="tab-pane" role="tabpanel" aria-labelledby="tabset-2-2-tab">
@@ -473,7 +473,7 @@ <h1>Parquet</h1>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
1.576 0.377 0.548 </code></pre>
1.758 0.445 0.557 </code></pre>
</div>
</div>
<p>A <em>much</em> faster compute time for the query when the on-disk data is stored in the Parquet format.</p>
@@ -533,7 +533,7 @@ <h1>Partitioning</h1>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
0.779 0.080 0.301 </code></pre>
0.781 0.085 0.297 </code></pre>
</div>
</div>
<p>Total number of Checkouts in September of 2019 using partitioned Parquet data by <code>CheckoutYear</code> and <code>CheckoutMonth</code>:</p>
@@ -545,7 +545,7 @@ <h1>Partitioning</h1>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">system.time</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code> user system elapsed
0.035 0.005 0.029 </code></pre>
0.033 0.006 0.031 </code></pre>
</div>
</div>
<p>Faster compute time because the <code>filter()</code> call is based on the partitions.</p>
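The timing changes above come from re-rendering the partitioning exercise. The technique being timed is Hive-style partitioning: grouping before `write_dataset()` determines the directory layout, and a `filter()` on the partition keys lets Arrow skip every non-matching file. A sketch of the pattern from the materials, assuming `seattle_csv` was opened as shown earlier:

```r
library(arrow)
library(dplyr)

# Grouping keys become partition directories, e.g.
# data/seattle-library-checkouts/CheckoutYear=2019/part-0.parquet
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = "data/seattle-library-checkouts",
                format = "parquet")

# Filtering on the partition key prunes every other year's files, which is
# why the freeze file records ~0.03 s elapsed here versus ~0.3 s against
# the dataset partitioned by CheckoutType.
open_dataset("data/seattle-library-checkouts") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(TotalCheckouts = sum(Checkouts)) |>
  collect() |>
  system.time()
```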
