docs/CONTRIBUTING.md (29 additions & 5 deletions)

@@ -76,7 +76,7 @@ npm run watch-local-api-ingest-watch
### Record storage strategy

Events/logs/records are stored in S3 in a partitioned manner. The partitioning is dynamic, so all that is left is to store
the data correctly, and that is done by Kinesis Firehose in the format of: site and day (`2023-08-23`). The records are buffered
and stored in parquet format. We are currently using an Append Only Log (AOL) pattern. This means that we are never
updating the logs, we are only adding new ones.
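
As a purely illustrative example (the key names below are placeholders, not the exact prefixes defined by the Firehose dynamic partitioning configuration in the repo), the resulting S3 layout looks something like:

```text
s3://<analytics-bucket>/page_views/site=my-blog/page_opened_at_date=2023-08-23/part-0000.snappy.parquet
s3://<analytics-bucket>/page_views/site=my-blog/page_opened_at_date=2023-08-24/part-0000.snappy.parquet
```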
@@ -89,10 +89,6 @@ This means we store almost twice as many records as opposed to updating the reco

a single record inside a parquet file is not possible, or even if it is we would have to rewrite the whole file. We also
do not want to do individual PUTs for each record as that would be too expensive.

The reason for writing the first record is that we cannot assume the second record will reach us when the user navigates
away from the page. This is because the user might close the tab or browser before the record is sent. This is mitigated
by the client, which uses the `navigator.sendBeacon(...)` function instead of a normal HTTP request.

@@ -102,6 +98,34 @@ POST server side), it also does not send preflight `OPTION` calls even if it is

it slightly differently and the probability of it being delivered is greater than fetch with `keepalive` set.
More on the [topic](https://medium.com/fiverr-engineering/benefits-of-sending-analytical-information-with-sendbeacon-a959cb206a7a).
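
Not from the PR, but a minimal sketch of the client-side behaviour described above; the endpoint URL, payload shape, and values are placeholders rather than the tracking client's real API:

```ts
// Hypothetical payload and endpoint; the real client's shapes live in the repo.
type PageViewEvent = {
  site: string;
  page_id: string;
  page_url: string;
  time_on_page: number; // seconds
};

const INGEST_URL = 'https://example.com/api-ingest/v1/page/view'; // placeholder

function sendEvent(event: PageViewEvent): void {
  const body = JSON.stringify(event);
  // sendBeacon queues the request so it can survive the tab/browser closing,
  // and a text/plain Blob keeps it a "simple" request with no CORS preflight.
  if (navigator.sendBeacon) {
    navigator.sendBeacon(INGEST_URL, new Blob([body], { type: 'text/plain' }));
    return;
  }
  // Fallback: fetch with keepalive, which is less likely to be delivered on unload.
  void fetch(INGEST_URL, { method: 'POST', body, keepalive: true });
}

// First record when the page is opened...
sendEvent({ site: 'my-blog', page_id: 'uuid-1', page_url: '/post/1', time_on_page: 0 });
// ...and the second one when the user leaves, with the accumulated time_on_page.
window.addEventListener('pagehide', () => {
  sendEvent({ site: 'my-blog', page_id: 'uuid-1', page_url: '/post/1', time_on_page: 42 });
});
```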

### Vacuum and record format

As mentioned above, we are using an AOL pattern. The `page_id` is used to group these records: the first record is written when
the page is navigated to and the second one when the user leaves. The only value that differs between the two is that the last
one will have a higher `time_on_page`. So when we do queries we need to group by the `page_id`, order by the
biggest `time_on_page`, and select the first record.
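
As a rough sketch (not taken from the repo; only `page_id` and `time_on_page` come from this doc, the table and date column names are assumed), the dedup described above can be expressed as a window-function query, built here as a string the way the backend would hand it to Athena:

```ts
// Sketch only: 'page_views' and 'page_opened_at_date' are assumed names, not the repo's schema.
export function latestPageViewsQuery(site: string, fromDate: string, toDate: string): string {
  return `
    SELECT *
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) AS rn
      FROM page_views
      WHERE site = '${site}'
        AND page_opened_at_date BETWEEN DATE '${fromDate}' AND DATE '${toDate}'
    ) t
    WHERE t.rn = 1
  `;
}
```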

Depending on the Firehose buffer time, we will have multiple records in a single parquet file. If the buffer time is set
to say 1 minute and we get 1 view every 10 seconds, we will have 1440 parquet files per day. Athena reading S3 files
can get expensive considering that we do about 8 queries per dashboard load. Let's also assume we look at 30 days of data
and that we view the dashboard 100 times that month. That means we will be doing 1440 * 8 * 30 * 100 = 34,560,000 S3 reads.
Let's use one of the cheapest regions, `us-east-1`, where 1000 S3 reads cost $0.0004. This means that we will be paying
about $13.82 for the S3 reads, which, as we know, is the biggest cost driver; see [COST](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/COST.md)
for more info on cost estimations.
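
For clarity, the arithmetic above works out as follows (this snippet just replays the numbers from the paragraph, it is not code from the repo):

```ts
// Back-of-the-envelope S3 read cost from the paragraph above.
const filesPerDay = 24 * 60;            // 1-minute Firehose buffer => 1440 files/day
const queriesPerDashboardLoad = 8;
const daysQueried = 30;
const dashboardLoadsPerMonth = 100;

const s3Reads = filesPerDay * queriesPerDashboardLoad * daysQueried * dashboardLoadsPerMonth;
const costUsd = (s3Reads / 1000) * 0.0004; // us-east-1 GET pricing used in the text

console.log(s3Reads.toLocaleString()); // 34,560,000
console.log(costUsd.toFixed(2));       // 13.82
```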

This is why we run a vacuum function that, for each site, runs the group by and order query mentioned before.
This results in fewer parquet files (it should be around 4), and they will also be orders of magnitude smaller.

The vacuum function runs at 01:00 UTC, when all of the previous day's data has been written to S3. It runs as a CTAS query
and stores the data back into the original S3 location the Athena query selected from. This is by design, as we
then delete all the S3 files that existed before the CTAS query ran, making the process idempotent.
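
Not part of the PR, but roughly, the orchestration described above could look like the sketch below; the bucket, prefix, and workgroup are placeholders, the CTAS SQL is passed in, and the repo's actual vacuum implementation will differ (pagination and >1000-key delete batching are also omitted for brevity):

```ts
import { AthenaClient, StartQueryExecutionCommand, GetQueryExecutionCommand } from '@aws-sdk/client-athena';
import { S3Client, ListObjectsV2Command, DeleteObjectsCommand } from '@aws-sdk/client-s3';

const athena = new AthenaClient({});
const s3 = new S3Client({});

// Sketch of the vacuum orchestration described above; names are assumptions, not the repo's code.
export async function vacuumSiteDay(bucket: string, prefix: string, ctasSql: string): Promise<void> {
  // 1. Remember which parquet files exist *before* the CTAS query runs (the Firehose output).
  const before = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
  const staleKeys = (before.Contents ?? []).map((o) => ({ Key: o.Key! }));

  // 2. Run the CTAS query that writes the deduplicated, compacted parquet for that site/day.
  const start = await athena.send(new StartQueryExecutionCommand({
    QueryString: ctasSql,
    WorkGroup: 'primary', // assumption
  }));

  // 3. Poll until the query finishes.
  for (;;) {
    const state = (await athena.send(new GetQueryExecutionCommand({ QueryExecutionId: start.QueryExecutionId! })))
      .QueryExecution?.Status?.State;
    if (state === 'SUCCEEDED') break;
    if (state === 'FAILED' || state === 'CANCELLED') throw new Error(`Vacuum CTAS ${state}`);
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }

  // 4. Delete only the keys captured in step 1, so the new CTAS output is never touched
  //    and re-running the whole function stays idempotent.
  if (staleKeys.length > 0) {
    await s3.send(new DeleteObjectsCommand({ Bucket: bucket, Delete: { Objects: staleKeys } }));
  }
}
```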

There will be a brief period when the Firehose parquet files and the CTAS parquet files are both in S3. This is fine
because of how we query the data (group by `page_id` and then order by `time_on_page`); there will just be slightly more
records scanned until the Firehose parquet files are deleted.

See the benchmarks for this in https://github.com/rehanvdm/serverless-website-analytics/pull/43.