Merge pull request #4290 from szarnyasg/nits-20241207b
Nits 20241207b
szarnyasg authored Dec 7, 2024
2 parents 5cb477d + 76ff600 commit 9e95429
Showing 9 changed files with 22 additions and 17 deletions.
2 changes: 1 addition & 1 deletion _posts/2023-04-14-h2oai.md
@@ -47,7 +47,7 @@ The queries have not changed since the benchmark went dormant. The data is gener
| advanced groupby #2 | `SELECT id3, max(v1)-min(v2) AS range_v1_v2 FROM tbl GROUP BY id3` | Range selection over small cardinality groups, grouped by integer |
| advanced groupby #3 | `SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2` |Advanced group by query |
| advanced groupby #4 | `SELECT id2, id4, pow(corr(v1, v2), 2) AS r2 FROM tbl GROUP BY id2, id4` | Arithmetic over medium sized groups, grouped by varchar, integer. |
-| advanced groupby #5 | `SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6` | Many many small groups, the number of groups is the cardinality of the dataset |
+| advanced groupby #5 | `SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6` | Many small groups, the number of groups is the cardinality of the dataset |
| join #1 |`SELECT x.*, small.id4 AS small_id4, v2 FROM x JOIN small USING (id1)` | Joining a large table (x) with a small-sized table on integer type |
| join #2 |`SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x JOIN medium USING (id2)` | Joining a large table (x) with a medium-sized table on integer type |
| join #3 |`SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x LEFT JOIN medium USING (id2)` | Left join a large table (x) with a medium-sized table on integer type|
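The `row_number()`-based pattern in "advanced groupby #3" above is standard window-function SQL, so it can be sketched outside DuckDB as well. A minimal illustration using Python's built-in SQLite, which accepts the same syntax (the table contents below are invented, not the benchmark data):

```python
import sqlite3

# Build a tiny stand-in for table x with the two columns the query touches.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (id6 INTEGER, v3 REAL)")
con.executemany("INSERT INTO x VALUES (?, ?)",
                [(1, 10.0), (1, 30.0), (1, 20.0), (2, 5.0), (2, 7.0)])

# The "advanced groupby #3" shape: top-2 v3 values per id6 group.
rows = con.execute("""
    SELECT id6, v3 AS largest2_v3
    FROM (
        SELECT id6, v3,
               row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3
        FROM x
        WHERE v3 IS NOT NULL
    ) sub_query
    WHERE order_v3 <= 2
""").fetchall()
print(rows)
```

Each `id6` group keeps only its two largest `v3` values.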
2 changes: 1 addition & 1 deletion _posts/2024-09-27-sql-only-extensions.md
@@ -131,7 +131,7 @@ git push

#### Write Your SQL Macros

-It it likely a bit faster to iterate if you test your macros directly in DuckDB.
+It is likely a bit faster to iterate if you test your macros directly in DuckDB.
After you have written your SQL, we will move it into the extension.
The example we will use demonstrates how to pull a dynamic set of columns from a dynamic table name (or a view name!).

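The macro itself is not shown in this excerpt. As a rough, hypothetical analogue of the "dynamic set of columns from a dynamic table name" idea, here is the same effect built with string assembly in Python over SQLite (the function name, validation, and sample data are illustrative only, not the extension's code):

```python
import sqlite3

def select_columns(con, table, columns):
    # Crude identifier guard for the sketch; a real macro would rely on the
    # database engine to resolve and validate names.
    assert all(name.isidentifier() for name in [table, *columns])
    cols = ", ".join(f'"{c}"' for c in columns)
    return con.execute(f'SELECT {cols} FROM "{table}"').fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
con.execute("INSERT INTO people VALUES ('Ada', 36, 'London')")

# Both the table name and the column list are chosen at call time.
rows = select_columns(con, "people", ["name", "age"])
print(rows)
```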
2 changes: 1 addition & 1 deletion _posts/2024-11-29-duckdb-tricks-part-3.md
@@ -179,7 +179,7 @@ We have now a table with all the data from January to October, amounting to almo
## Reordering Parquet Files

Suppose we want to analyze the average delay of the [Intercity Direct trains](https://en.wikipedia.org/wiki/Intercity_Direct) operated by the [Nederlandse Spoorwegen (NS)](https://en.wikipedia.org/wiki/Nederlandse_Spoorwegen), measured at the final destination of the train service.
-While we can run this analysis directly on the the `.csv` files, the lack of metadata (such as schema and min-max indexes) will limit the performance.
+While we can run this analysis directly on the `.csv` files, the lack of metadata (such as schema and min-max indexes) will limit the performance.
Let's measure this in the CLI client by turning on the [timer]({% link docs/api/cli/dot_commands.md %}):

2 changes: 1 addition & 1 deletion _posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
@@ -64,7 +64,7 @@ Furthermore, the reader became one of the fastest CSV readers in analytical syst

## Comparing CSV and Parquet

-With the large boost boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?
+With the large boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?

To find out, we will run a few examples using both CSV and Parquet files containing TPC-H data to shed light on their differences. All scripts used to generate the benchmarks of this blogpost can be found in a [repository](https://github.com/pdet/csv_vs_parquet).

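One structural reason the formats differ: a CSV file carries no schema, so a reader must infer column types from text samples, while Parquet stores the types in its metadata. A toy illustration of such inference (not DuckDB's actual sniffer; the column names are TPC-H-style but the samples are invented):

```python
def infer_type(samples):
    # Try progressively wider types; fall back to text if nothing parses.
    for caster, name in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            for s in samples:
                caster(s)
            return name
        except ValueError:
            continue
    return "VARCHAR"

column_samples = {
    "l_orderkey": ["1", "2", "3"],
    "l_extendedprice": ["17954.55", "34850.16"],
    "l_shipmode": ["TRUCK", "MAIL"],
}
schema = {col: infer_type(vals) for col, vals in column_samples.items()}
print(schema)
```

A real sniffer also has to detect delimiters, quoting, and headers, which is part of why text formats cost more to read than self-describing binary ones.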
2 changes: 1 addition & 1 deletion _posts/2024-12-06-duckdb-tpch-sf100-on-mobile.md
@@ -78,7 +78,7 @@ The table contains a summary of the DuckDB benchmark results.

## Historical Context

-So why did we set out to run these these experiments in the first place?
+So why did we set out to run these experiments in the first place?

Just a few weeks ago, [CWI](https://cwi.nl/), the birthplace of DuckDB, held a ceremony for the [Dijkstra Fellowship](https://www.cwi.nl/en/events/dijkstra-awards/cwi-lectures-dijkstra-fellowship/).
The fellowship was awarded to Marcin Żukowski for his pioneering role in the development of database management systems and his successful entrepreneurial career that resulted in systems such as [VectorWise](https://en.wikipedia.org/wiki/Actian_Vector) and [Snowflake](https://en.wikipedia.org/wiki/Snowflake_Inc.).
2 changes: 1 addition & 1 deletion docs/extensions/spatial/functions.md
@@ -1702,7 +1702,7 @@ VARCHAR ST_QuadKey (col0 GEOMETRY, col1 INTEGER)
#### Description

Compute the [quadkey](https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system) for a given lon/lat point at a given level.
-Note that the the parameter order is **longitude**, **latitude**.
+Note that the parameter order is **longitude**, **latitude**.

`level` has to be between 1 and 23, inclusive.

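The computation follows the Bing Maps tile-system scheme linked above. A sketch, assuming the standard Web Mercator projection and bit interleaving (this is not the spatial extension's implementation; note the longitude-first parameter order):

```python
import math

def quadkey(lon, lat, level):
    assert 1 <= level <= 23
    # Project lon/lat to [0, 1) map coordinates (Web Mercator).
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    # Tile coordinates at this zoom level.
    n = 1 << level
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    # Interleave the X/Y bits into base-4 digits, most significant first.
    digits = []
    for i in range(level - 1, -1, -1):
        digits.append(str(((tile_x >> i) & 1) + 2 * ((tile_y >> i) & 1)))
    return "".join(digits)

print(quadkey(-87.65, 41.85, 3))  # a point in Chicago, north-west quadrant
```

The quadkey always has exactly `level` digits, and each prefix names the containing tile at a coarser level.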
2 changes: 1 addition & 1 deletion docs/extensions/spatial/r-tree_indexes.md
@@ -109,7 +109,7 @@ EXPLAIN SELECT count(*) FROM t1 WHERE ST_Within(geom, ST_MakeEnvelope(45, 45, 65

Creating R-trees on top of an already populated table is much faster than first creating the index and then inserting the data. This is because the R-tree will have to periodically rebalance itself and perform a somewhat costly splitting operation when a node reaches max capacity after an insert, potentially causing additional splits to cascade up the tree. However, when the R-tree index is created on an already populated table, a special bottom up "bulk loading algorithm" (Sort-Tile-Recursive) is used, which divides all entries into an already balanced tree as the total number of required nodes can be computed from the beginning.

-Additionally, using the bulk loading algorithm tends to create a R-tree with a better structure (less overlap between bounding boxes), which usually leads to better query performance. If you find that the performance of querying the R-tree starts to deteriorate after a large number of of updates or deletions, dropping and re-creating the index might produce a higher quality R-tree.
+Additionally, using the bulk loading algorithm tends to create a R-tree with a better structure (less overlap between bounding boxes), which usually leads to better query performance. If you find that the performance of querying the R-tree starts to deteriorate after a large number of updates or deletions, dropping and re-creating the index might produce a higher quality R-tree.
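A toy sketch of the Sort-Tile-Recursive packing idea described above, simplified to a single leaf-level pass (names and parameters are illustrative, not the spatial extension's internals): sort all points by x, slice them into vertical strips, sort each strip by y, and cut the strips into full leaves, yielding a packed tree with little bounding-box overlap.

```python
import math

def str_pack(points, leaf_capacity):
    n = len(points)
    leaves_needed = math.ceil(n / leaf_capacity)
    strips = math.ceil(math.sqrt(leaves_needed))   # tiles per axis
    per_strip = strips * leaf_capacity
    xs = sorted(points)                            # sort by x (then y)
    leaves = []
    for s in range(0, n, per_strip):
        # Within each vertical strip, sort by y and cut into full leaves.
        strip = sorted(xs[s:s + per_strip], key=lambda p: p[1])
        for l in range(0, len(strip), leaf_capacity):
            leaves.append(strip[l:l + leaf_capacity])
    return leaves

points = [(x, y) for x in range(4) for y in range(4)]  # a 4x4 grid, 16 points
leaves = str_pack(points, 4)
print(len(leaves))
```

Because the number of leaves is known up front, every leaf comes out full and spatially compact, which is exactly why bulk loading beats repeated inserts with splits.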

### Memory Usage

7 changes: 5 additions & 2 deletions single-file-document/concatenate_to_single_file.py
@@ -105,7 +105,7 @@ def adjust_links_in_doc_body(doc_body):
"]({% link docs/python/overview.md %})"
)

-# replace "`, `" (with its typical surroundings) with "`,` " to allow line breaking
+# replace "`, `" (with the surrounding characters used for emphasis) with "`,` " to allow line breaking
# see https://stackoverflow.com/questions/76951040/pandoc-preserve-whitespace-in-inline-code
doc_body = doc_body.replace("`*`, `*`", "`*`,` *`")

@@ -115,8 +115,11 @@ def adjust_links_in_doc_body(doc_body):
# replace links to data sets to point to the website
doc_body = doc_body.replace("](/data/", "](https://duckdb.org/data/")

+# remove '<div>' HTML tags
+doc_body = re.sub(r'<div[^>]*?>[\n ]*([^§]*?)[\n ]*</div>', r'\1', doc_body, flags=re.MULTILINE)
+
# replace '<img>' HTML tags with Markdown's '![]()' construct
-doc_body = re.sub(r'<img src="([^"]*)"[^§]*?/>', r'![](\1)', doc_body, flags=re.MULTILINE)
+doc_body = re.sub(r'<img src="([^"]*)"[^§]*?/>', r'![](\1)\n', doc_body, flags=re.MULTILINE)

# use relative path for images in Markdown
doc_body = doc_body.replace("](/images", "](../images")
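Applied to a small invented sample, the two substitutions touched by this commit behave as follows: the first unwraps `<div>` tags, the second turns `<img>` tags into Markdown images (now followed by a newline).

```python
import re

doc_body = '<div class="center">\n<img src="/images/duck.png" alt="a duck"/>\n</div>'

# remove '<div>' HTML tags (added by this commit)
doc_body = re.sub(r'<div[^>]*?>[\n ]*([^§]*?)[\n ]*</div>', r'\1', doc_body, flags=re.MULTILINE)
# replace '<img>' HTML tags with Markdown's '![]()' construct, plus the
# trailing newline introduced by this commit
doc_body = re.sub(r'<img src="([^"]*)"[^§]*?/>', r'![](\1)\n', doc_body, flags=re.MULTILINE)

print(doc_body)
```

(The `[^§]` character class is a trick for "any character including newlines", since `§` is assumed not to appear in the docs.)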
18 changes: 10 additions & 8 deletions single-file-document/templates/eisvogel2.tex
@@ -389,15 +389,17 @@
$if(graphics)$
\usepackage{graphicx}
\makeatletter
-\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
-\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
-\makeatother
-% Scale images if necessary, so that they will not overflow the page
-% margins by default, and it is still possible to overwrite the defaults
-% using explicit options in \includegraphics[width, height, ...]{}
-\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
+\newsavebox\pandoc@box
+\newcommand*\pandocbounded[1]{% scales image to fit in text height/width
+  \sbox\pandoc@box{#1}%
+  \Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
+  \Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
+  \ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi% select the smaller of both
+  \ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
+  \else\usebox{\pandoc@box}%
+  \fi%
+}
-% Set default figure placement to htbp
\makeatletter
% Make use of float-package and set default placement for figures to H.
% The option H means 'PUT IT HERE' (as opposed to the standard h option which means 'You may put it here if you like').
\usepackage{float}
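What the new `\pandocbounded` macro computes can be restated as a small sketch (function name and dimensions invented): take the smaller of the height and width ratios, and scale down only, never up.

```python
def bounded_scale(img_w, img_h, line_width, text_height):
    # Smaller of the two fit ratios wins, mirroring the two \Gscale@div
    # results in the template...
    scale = min(text_height / img_h, line_width / img_w)
    # ...and scaling is applied only when the image would overflow.
    return min(scale, 1.0)

print(bounded_scale(800, 600, 400, 700))   # wide image: shrink to fit width
print(bounded_scale(100, 100, 400, 700))   # small image: keep at natural size
```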
