Commit 91c9ec0

Add synapse specific notebooks
1 parent bff9cd3 commit 91c9ec0

38 files changed: +12862 −17 lines changed

docs/.idea/.gitignore (+10)

docs/spark-internals (-17): This file was deleted.

docs/spark-internals.md (+83)

@@ -0,0 +1,83 @@

## Bloom Filter Data Structure

A Bloom filter is a probabilistic data structure that allows you to identify whether an item belongs to a data set or not.

- It outputs either "definitely not present" or "maybe present" for every lookup.
- It can return false positive matches (it says an element is in the set when it isn't), but never false negatives (if it says an element is not in the set, it definitely isn't).
- Bloom filters use multiple hash functions to map elements to a fixed-size array of bits, which makes them space efficient compared to, say, a list.
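
To make the mechanics concrete, here is a minimal Bloom filter sketch in Python. The bit-array size, the number of hash functions, and the salted `hashlib` hashing are illustrative choices, not tied to any particular library or to Spark's own implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted hashes over a fixed-size bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False => definitely not present; True => maybe present.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("order_123")
print(bf.might_contain("order_123"))  # True (maybe present)
print(bf.might_contain("order_999"))  # Usually False (definitely not present)
```

Because different items can set the same bits, `might_contain` may answer "maybe present" for an item that was never added (a false positive), but it can never wrongly answer "definitely not present".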

Bloom filters are widely used in database systems to reduce expensive disk lookups. In a data lake, we can use them to efficiently skip over large portions of Parquet files that are irrelevant to our query, reducing the amount of data that needs to be read and processed.

- They are adopted by lakehouse table formats such as Delta and Apache Hudi to skip non-relevant row groups in data files.
- This can be very valuable for improving query performance and reducing I/O operations when dealing with large-scale data.
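
As one illustration of this, recent Spark builds whose Parquet version supports column Bloom filters (Parquet 1.12+) let you ask the Parquet writer to build a per-column Bloom filter at write time through writer options. The column name, expected distinct-value count, and output path below are hypothetical; this is a sketch, not the only way table formats wire up Bloom filters.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000).withColumnRenamed("id", "customer_id")

# Build a Bloom filter for customer_id while serializing to Parquet.
(
    df.write
      .option("parquet.bloom.filter.enabled#customer_id", "true")
      .option("parquet.bloom.filter.expected.ndv#customer_id", "10000000")
      .mode("overwrite")
      .parquet("/tmp/customers_parquet")  # hypothetical output path
)
```

A point lookup such as `spark.read.parquet("/tmp/customers_parquet").where("customer_id = 42")` can then consult the stored Bloom filters to skip row groups that definitely do not contain the value, provided the reader's Parquet version supports them.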

Using these statistics together with Hudi's robust multi-modal index subsystem can provide a significant edge in query performance.

![bloom filter](./bloom-filter-spark.jpeg)

## Steps to write data to Parquet in cloud object storage

1. Compute data in memory (may involve spilling to disk)
2. Serialize the result set to Parquet format
   a. encode pages (e.g. dictionary encoding)
   b. compress pages (e.g. Snappy)
   c. index pages (calculate min-max stats)
3. Transfer serialized data over the wire: compute node(s) ➜ storage node
4. Write serialized data to disk
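
A minimal sketch of steps 2 and 4 using pyarrow (an assumed client; any Parquet writer goes through the same motions): pages are dictionary-encoded, compressed with Snappy, and min-max statistics are collected before the bytes land on storage. The path and data are made up.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Step 1 happened upstream: the data is already computed in memory.
table = pa.table({
    "order_id": list(range(1_000)),
    "category": [i % 10 for i in range(1_000)],
})

# Step 2: serialize to Parquet, then step 4: write the bytes to disk/object storage.
pq.write_table(
    table,
    "/tmp/orders.parquet",     # hypothetical path
    use_dictionary=True,       # 2a: encode pages
    compression="snappy",      # 2b: compress pages
    write_statistics=True,     # 2c: index pages (min-max stats)
)
```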

## Steps to read data from Parquet in cloud object storage

1. Read metadata
2. Read serialized data from disk (use metadata for predicate/projection pushdown to fetch only the pages needed for the query at hand)
3. Transfer serialized data over the wire: storage node ➜ compute node(s)
4. Deserialize to the in-memory format:
   a. decompress
   b. decode
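
And a matching read-side sketch, again with pyarrow as an assumed client (a reasonably recent version, so that `filters=` applies row-group filtering on plain files): first only the footer metadata is fetched, then the column projection and predicate limit which pages are actually read, decompressed, and decoded.

```python
import pyarrow.parquet as pq

# Step 1: read metadata only (schema, row groups, per-column min-max stats).
pf = pq.ParquetFile("/tmp/orders.parquet")
print(pf.metadata.num_row_groups)
print(pf.metadata.row_group(0).column(0).statistics)

# Steps 2-4: projection (columns=) and predicate (filters=) pushdown decide
# which pages are fetched; matching pages are then decompressed and decoded.
table = pq.read_table(
    "/tmp/orders.parquet",
    columns=["order_id"],
    filters=[("category", "=", 3)],
)
print(table.num_rows)
```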

Some writes (UPDATE/DELETE/MERGE) require a read. In that case, both the read and write steps are executed in sequence.

**It's hard to achieve low latency when using Parquet (or Parquet-based table formats) on cloud object stores because of all these steps.**

## Medallion Architecture

The "Medallion Architecture" is a way to organize data in a lakehouse. It's done differently everywhere, but the general idea is always the same.

1. Data is loaded "as-is" into the "bronze layer" (also often called "raw"). An ingestion pipeline extracts data from source systems and loads it into tables in the lake, without transformations or schema control; this is "Extract and Load" (EL).
2. Data then moves to the "silver layer" (also often called "refined" or "curated"). A transformation pipeline applies standard technical adjustments (e.g. column naming conventions, type casting, deduplication, ...) and schema control (enforcement or managed evolution).
3. Data finally arrives at the "gold layer". A transformation pipeline applies fit-for-purpose changes to prepare data for consumption (e.g. interactive analytics, reports/dashboards, machine learning); One Big Table (OBT) or Kimball star schema data models make sense here (see the sketch after this list).
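
To make the flow concrete, here is a minimal PySpark sketch of the three layers. The table names, paths, and columns are hypothetical, and a real pipeline would typically use a lakehouse table format (Delta, Hudi, Iceberg) rather than plain Parquet.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: load source data "as-is" (Extract and Load, no transformations).
bronze = spark.read.json("/lake/landing/orders/")        # hypothetical source
bronze.write.mode("append").parquet("/lake/bronze/orders/")

# Silver: standard technical adjustments and deduplication ("t").
silver = (
    spark.read.parquet("/lake/bronze/orders/")
    .withColumnRenamed("OrderTS", "order_ts")
    .withColumn("order_ts", F.col("order_ts").cast("timestamp"))
    .dropDuplicates(["order_id"])
)
silver.write.mode("overwrite").parquet("/lake/silver/orders/")

# Gold: heavy, fit-for-purpose transformations ("T"), e.g. a daily aggregate
# so dashboards don't have to join/aggregate at read time.
gold = (
    spark.read.parquet("/lake/silver/orders/")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("total_amount"))
)
gold.write.mode("overwrite").parquet("/lake/gold/daily_order_totals/")
```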

As data moves through the layers, it changes

- from dirty to clean
- from normalized to denormalized
- from granular to aggregated
- from source-specific to domain-specific

Key points

- The flow is often ELtT-like: smaller transformations in silver ("t"), heavy transformations in gold ("T").
- Bronze and silver are optimized for write performance. Data resembles its format in the source systems, so no big (slow) transformations are needed ➜ writes are fast.
- Gold is optimized for read performance. Big transformations (joins, aggregations, ...) are applied at write time so they don't need to be applied on-the-fly at read time (e.g. when a user runs an interactive query or when a dashboard queries the gold table) ➜ reads are fast.

> Medallion Architectures come in many shapes and forms. It's common to find more than three layers, or tables that do not fit neatly into any one of them.
