## Bloom Filter Data Structure

A Bloom filter is a probabilistic data structure that lets you test whether an item belongs to a set.

- For every lookup, it outputs either "definitely not present" or "maybe present".
- It can return false positives (it may say an element is in the set when it isn't), but never false negatives (if it
  says an element is not in the set, it definitely isn't).
- Bloom filters use multiple hash functions to map elements to a fixed-size array of bits, which makes them far more
  space-efficient than, say, a list.
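
To make the mechanics concrete, here is a minimal Bloom filter sketch in Python. The size, hash count, and salting
scheme are illustrative choices, not a canonical implementation:

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: a fixed-size bit array plus k salted hash functions."""

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size  # the fixed-size array of bits

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True only means "maybe present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("user_42")
print(bf.might_contain("user_42"))   # True: "maybe present"
print(bf.might_contain("user_999"))  # almost certainly False: "definitely not present"
```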

They are widely used in database systems to reduce expensive disk lookups. In a data lake, we can use Bloom filters to
efficiently skip over large portions of Parquet files that are irrelevant to our query, reducing the amount of data
that needs to be read and processed.

- Lakehouse table formats such as Delta Lake and Apache Hudi have adopted Bloom filters to skip non-relevant row
  groups in data files.
- This can be very valuable for improving query performance and reducing I/O when dealing with large-scale data.

Using these statistics together with Hudi's robust multi-modal index subsystem can provide a significant edge in query
performance (see the configuration sketch below).
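
As a rough illustration, this is one way to enable Hudi's Bloom index (used to locate records during upserts) when
writing from PySpark. The table name, fields, and path are placeholders, and the sketch assumes a Spark session with
the Hudi bundle on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("t1", "2024-01-01 08:00", 3.2)], ["trip_id", "ts", "fare"])

hudi_options = {
    "hoodie.table.name": "trips",                          # placeholder table name
    "hoodie.datasource.write.recordkey.field": "trip_id",  # record key used by the index
    "hoodie.datasource.write.precombine.field": "ts",      # pick the latest version on upsert
    "hoodie.index.type": "BLOOM",                          # bloom-filter-based record lookup
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/lake/trips"))  # placeholder object-store path
```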

## Steps to write data to Parquet in cloud object storage

1. Compute data in memory (may involve spilling to disk)
2. Serialize the result set to Parquet format (see the sketch after this list)
   a. encode pages (e.g. dictionary encoding)
   b. compress pages (e.g. Snappy)
   c. index pages (calculate min-max stats)
3. Transfer serialized data over the wire: compute node(s) ➜ storage node
4. Write serialized data to disk
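
Steps 2a-2c map directly onto PyArrow's writer options. A small sketch (file name and data are illustrative; with an
fsspec filesystem the destination could just as well be an object-store URI):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": ["Oslo", "Oslo", "Bergen"], "temp": [3, 5, 7]})

pq.write_table(
    table,
    "weather.parquet",
    use_dictionary=True,    # step 2a: dictionary-encode pages
    compression="snappy",   # step 2b: compress pages
    write_statistics=True,  # step 2c: min/max stats per column chunk
)
```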

## Steps to read data from Parquet in cloud object storage

1. Read metadata
2. Read serialized data from disk (use metadata for predicate/projection pushdown to only fetch the pages needed for
   the query at hand; see the sketch after this list)
3. Transfer serialized data over the wire: storage node ➜ compute node(s)
4. Deserialize to the in-memory format:
   a. decompress
   b. decode
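
In PyArrow terms, `columns` gives projection pushdown and `filters` gives predicate pushdown: footer metadata lets the
reader skip row groups whose statistics cannot satisfy the predicate. A sketch, reading the file from the write
example above:

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "weather.parquet",
    columns=["city"],            # projection pushdown: only fetch this column's pages
    filters=[("temp", ">", 4)],  # predicate pushdown: skip row groups where max(temp) <= 4
)
print(table.to_pydict())
```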

Some writes (UPDATE/DELETE/MERGE) require a read. In that case, the read and write steps are executed in sequence.

**It's hard to achieve low latency when using Parquet (or Parquet-based table formats) on cloud object stores because
of all these steps.**

## Medallion Architecture

The "Medallion Architecture" is a way to organize data in a lakehouse. It's done differently everywhere, but the
general idea is always the same.

1. Data is loaded "as-is" into the "bronze layer" (also often called "raw"). An ingestion pipeline extracts data from
   source systems and loads it into tables in the lake, without transformations or schema control—this is "Extract and
   Load" (EL).
2. Data then moves to the "silver layer" (also often called "refined" or "curated"). A transformation pipeline applies
   standard technical adjustments (e.g. column naming conventions, type casting, deduplication, ...) and schema
   control (enforcement or managed evolution); see the sketch after this list.
3. Data finally arrives at the "gold layer". A transformation pipeline applies fit-for-purpose changes to prepare data
   for consumption (e.g. interactive analytics, reports/dashboards, machine learning)—One Big Table (OBT) or Kimball
   star schema data models make sense here.
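
As a minimal illustration of the bronze ➜ silver step, a pandas sketch (column names and rules are invented for the
example; a real pipeline would typically run in Spark or a similar engine):

```python
import pandas as pd

# Bronze: raw data exactly as extracted from the source system.
bronze = pd.DataFrame({
    "Trip ID": ["t1", "t1", "t2"],
    "PickupTime": ["2024-01-01 08:00", "2024-01-01 08:00", "2024-01-01 09:30"],
})

# Silver: standard technical adjustments.
silver = (
    bronze
    .rename(columns={"Trip ID": "trip_id", "PickupTime": "pickup_time"})  # naming convention
    .assign(pickup_time=lambda df: pd.to_datetime(df["pickup_time"]))     # type casting
    .drop_duplicates(subset=["trip_id"])                                  # deduplication
)
print(silver)
```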

As data moves through the layers, it changes:

- from dirty to clean
- from normalized to denormalized
- from granular to aggregated
- from source-specific to domain-specific

Key points:

- The flow is often ELtT-like: smaller transformations in silver ("t"), heavy transformations in gold ("T").
- Bronze and silver are optimized for write performance. Data resembles the format in the source systems, so no big
  (slow) transformations are needed ➜ writes are fast.
- Gold is optimized for read performance. Big transformations (joins, aggregations, ...) are applied at write time so
  they don't need to be applied on-the-fly at read time (e.g. when a user runs an interactive query or when a
  dashboard queries the gold table) ➜ reads are fast. A small aggregation sketch follows below.
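
A matching pandas sketch of a gold-layer table, pre-aggregated at write time so reads stay cheap (all names are
illustrative):

```python
import pandas as pd

# Silver-like input: one row per trip.
silver = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "pickup_time": pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:30", "2024-01-02 10:15"]),
})

# Gold: aggregate once at write time; dashboards then read one row per day.
gold_daily_trips = (
    silver
    .assign(pickup_date=silver["pickup_time"].dt.date)
    .groupby("pickup_date", as_index=False)
    .agg(trips=("trip_id", "count"))
)
print(gold_daily_trips)
```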

> Medallion Architectures come in many shapes and forms. It's common to find more than three layers, or tables that do
> not perfectly fit in any of them.