Commit dce58c7

Merge pull request #45 from rehanvdm/release/v1.2.0
feat: release v1.2.0
2 parents caeaed3 + 07ef81f commit dce58c7

18 files changed: +547 −103 lines
README.md

Lines changed: 5 additions & 1 deletion
```diff
@@ -198,6 +198,10 @@ app.mount('#app');
 
 **SEE THE FULL COST BREAKDOWN AND SPREAD SHEET > [HERE](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/COST.md)**
 
+> [!IMPORTANT]
+> We make calculations without considering the daily vacuum cron process, which reduces the number of S3 files stored by orders of magnitude.
+> Real costs will be 10x to 100x lower than the worst case costs.
+
 The worst case projected costs are:
 
 | Views | Cost($) |
@@ -212,7 +216,7 @@ The worst case projected costs are:
 
 The architecture consists of four components: frontend, backend, ingestion API and the client JS library.
 
-![serverless-website-analytics.drawio.png](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs%2Fimgs%2Fserverless-website-analytics.drawio.png)
+![serverless-website-analytics.drawio-2023-09-10.png](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs%2Fimgs%2Fserverless-website-analytics.drawio-2023-09-10.png)
 
 See the [highlights](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/CONTRIBUTING.md#highlights)
 and [design decisions](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/CONTRIBUTING.md#design-decisions)
```

docs/API.md

Lines changed: 5 additions & 1 deletion
```diff
@@ -198,6 +198,10 @@ app.mount('#app');
 
 **SEE THE FULL COST BREAKDOWN AND SPREAD SHEET > [HERE](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/COST.md)**
 
+> [!IMPORTANT]
+> We make calculations without considering the daily vacuum cron process, which reduces the number of S3 files stored by orders of magnitude.
+> Real costs will be 10x to 100x lower than the worst case costs.
+
 The worst case projected costs are:
 
 | Views | Cost($) |
@@ -212,7 +216,7 @@ The worst case projected costs are:
 
 The architecture consists of four components: frontend, backend, ingestion API and the client JS library.
 
-![serverless-website-analytics.drawio.png](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs%2Fimgs%2Fserverless-website-analytics.drawio.png)
+![serverless-website-analytics.drawio-2023-09-10.png](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs%2Fimgs%2Fserverless-website-analytics.drawio-2023-09-10.png)
 
 See the [highlights](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/CONTRIBUTING.md#highlights)
 and [design decisions](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/CONTRIBUTING.md#design-decisions)
```

docs/CONTRIBUTING.md

Lines changed: 29 additions & 5 deletions
```diff
@@ -76,7 +76,7 @@ npm run watch-local-api-ingest-watch
 
 ### Record storage strategy
 Events/logs/records are stored in S3 in a partitioned manner. The partitioning is dynamic, so all that is left is to store
-the data correctly and that is done by Kinesis Firehose in the format of: site, month and day. The records are buffered
+the data correctly and that is done by Kinesis Firehose in the format of: site and day (`2023-08-23`). The records are buffered
 and stored in parquet format. We are currently using an Append Only Log (AOL) pattern. This means that we are never
 updating the logs, we are only adding new ones.
 
```
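To illustrate the site-and-day layout described in the hunk above, a minimal sketch; the `site/day/` key prefix and the helper function are assumptions for illustration, not the stack's actual Firehose prefix template:

```ts
// Hypothetical illustration of the "site and day" partition layout described above.
// The real Firehose dynamic-partitioning prefix template may differ.
function partitionPrefix(site: string, eventDate: Date): string {
  const day = eventDate.toISOString().slice(0, 10); // e.g. "2023-08-23"
  return `${site}/${day}/`; // Firehose writes the buffered parquet objects under this prefix
}

console.log(partitionPrefix("my-blog", new Date("2023-08-23T14:00:00Z")));
// => "my-blog/2023-08-23/"
```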
```diff
@@ -89,10 +89,6 @@ This means we store almost twice as many records as opposed to updating the reco
 a single record inside a parquet file is not possible, or even if it is we would have to rewrite the whole file. We also
 do not want to do individual PUTs for each record as that would be too expensive.
 
-Technically we can do a CTAS query to remove the duplicate records, but my quick tests did not show much improvement on
-query speed. This is why we rather do the group by the page_id and then take the record with the largest time_on_page as
-the latest record for that page view.
-
 The reason for writing the first record is that we can not assume the second record will reach us when the user navigates
 away from the page. This is because the user might close the tab or browser before the record is sent. This is mitigated
 by the client that uses the `navigator.sendBeacon(...)` function instead of a normal HTTP request.
```
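A minimal sketch of the client-side pattern referred to above; the ingest URL, payload shape and `pagehide` hook are assumptions rather than the actual client library code:

```ts
// Hypothetical client-side sketch: queue the final page-view record on page hide.
// navigator.sendBeacon() hands the request off to the browser, so it is far more
// likely to be delivered when the tab or browser closes than a normal fetch call.
function sendFinalPageView(ingestUrl: string, pageId: string, timeOnPage: number): void {
  const payload = JSON.stringify({ page_id: pageId, time_on_page: timeOnPage });
  navigator.sendBeacon(ingestUrl, payload); // string body is sent as text/plain, no preflight
}

window.addEventListener("pagehide", () => {
  sendFinalPageView("https://analytics.example.com/api-ingest/v1/page/view", "a1b2c3", 42);
});
```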
```diff
@@ -102,6 +98,34 @@ POST server side), it also does not send preflight `OPTION` calls even if it is
 it slightly differently and the probability of it being delivered is greater than fetch with `keepalive` set.
 More on the [topic](https://medium.com/fiverr-engineering/benefits-of-sending-analytical-information-with-sendbeacon-a959cb206a7a).
 
+### Vacuum and record format
+
+As mentioned above, we are using an AOL pattern. The `page_id` is used to group these records; the first record is written when
+the page is navigated to and the second when the user leaves. The only value that differs between the two is that the last
+one will have a higher `time_on_page`. So when we do queries we need to group by the `page_id`, order by the
+biggest `time_on_page` and select the first record.
+
+Depending on the Firehose buffer time, we will have multiple records in a single parquet file. If the buffer time is set
+to say 1 minute and we get 1 view every 10 seconds, we will have 1440 parquet files per day. Athena reading S3 files
+can get expensive considering that we do about 8 queries per dashboard load. Let's also assume we look at 30 days of data
+and that we view the dashboard 100 times that month. That means we will be doing 1440 * 8 * 30 * 100 = 34,560,000 S3 reads.
+Let's use one of the cheapest regions, `us-east-1`, where 1000 S3 reads cost $0.0004. This means that we will be paying
+$13.82 for the S3 reads, which, as we know, is the biggest cost driver; see [COST](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/COST.md)
+for more info on cost estimations.
+
+This is why we run a vacuum function that, for each site, runs the group-by and order query mentioned before.
+This results in fewer parquet files, around 4, and they will also be orders of magnitude smaller.
+
+The vacuum function runs at 01:00 UTC when all the previous day's data has been written to S3. It runs as a CTAS query
+and stores the data back into the original S3 location the Athena query used to select from. This is by design, as we
+then delete all the S3 files that existed before the CTAS query ran, making the process idempotent.
+
+There will be a brief period when the Firehose parquet files and the CTAS parquet files are both in S3. This is fine
+because of how we query data: group by `page_id` and then order by `time_on_page`. There will just be slightly more
+records scanned before the Firehose parquet files are deleted.
+
+See the benchmarks in https://github.com/rehanvdm/serverless-website-analytics/pull/43
+
 ### Other
 
 CloudFront has multiple behaviours setup:
```
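To make the group-by/order-by query and the vacuum CTAS step described in the hunk above concrete, here is a hedged sketch of the Athena SQL, wrapped in TypeScript strings the way a Lambda would submit them; the database objects (`page_views`, `page_id`, `time_on_page`, `site`, `page_opened_at`, `page_url`) and the S3 location are illustrative assumptions, not the repository's actual queries:

```ts
// 1. Query-time deduplication: per page_id, keep only the record with the largest time_on_page.
const dedupQuery = `
  SELECT page_id, page_url, time_on_page, page_opened_at
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) AS rn
    FROM page_views
    WHERE site = 'my-blog'
      AND page_opened_at >= date_add('day', -30, current_date)
  ) AS t
  WHERE rn = 1
`;

// 2. Daily vacuum: rewrite the previous day's records as a few compacted parquet files
//    via CTAS; the doc above notes the output goes back to the S3 location the query
//    selected from, after which the pre-existing Firehose objects are deleted.
const vacuumCtasQuery = `
  CREATE TABLE page_views_vacuum_2023_08_23
  WITH (
    format = 'PARQUET',
    external_location = 's3://analytics-bucket/page_views/my-blog/2023-08-23/'
  ) AS
  SELECT page_id, page_url, time_on_page, page_opened_at
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) AS rn
    FROM page_views
    WHERE site = 'my-blog'
      AND page_opened_at = date '2023-08-23'
  ) AS t
  WHERE rn = 1
`;
```

In the real stack the new `cron-vacuum` Lambda (added to the build list in `scripts/index.ts` below) would presumably issue such a CTAS per site and then delete the pre-existing objects, per the sequence described in the hunk above.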

docs/COST.md

Lines changed: 5 additions & 14 deletions
```diff
@@ -9,6 +9,10 @@ A **worst case cost analysis** is done in this [Google Sheet](https://docs.googl
 _To use this sheet, ONLY change the values in the yellow cells, these are the hyperparameters that you control. The rest of
 the values are calculated based on these hyperparameters or are constants._
 
+> [!IMPORTANT]
+> We make calculations without considering the daily vacuum cron process, which reduces the number of S3 files stored by orders of magnitude.
+> Real costs will be 10x to 100x lower than the worst case costs.
+
 With the hyperparameters set to:
 - 1 site
 - 15 minutes of firehose buffer interval
@@ -37,20 +41,7 @@ sites this might be worth it, but on high volume sites it will be too expensive.
 The number of sites also influences the cost, because each site is a multiplier to the number of partitions written. In
 other words if we have 100 views and 1 site, then all 100 views are written into 1 partition. If we have 100 views and 2
 sites, assuming an equal split in views, then 50 views are written into 1 partition and 50 views are written into another,
-resulting in 2 partitions.
-
-## Improvements
-
-Currently, we are partitioning by month. This was because we were adding partitions manually and didn't want to burden the
-user with having to do this through the frontend on a daily basis. Since we switched to dynamic partitioning, we can change this to be
-daily. This will result in more partitions, but not more S3 files written because the firehose buffer interval is still
-less than a day. With this change we will limit the amount of data scanned by Athena significantly, which will reduce the
-cost. Given the assumption that on average, you do not query the full month's data, but only today or a few days, this will be a
-significant cost reduction.
-
-We will also create janitors/background workers that run daily to optimize the partitions. This will use the CTAS query
-of Athena to optimize the multiple daily Firehose parquet files into just a few files. This will also see a big reduction
-in the amount of data scanned by Athena, which will reduce the cost.
+resulting in 2 partitions
 
 ## FAQ
 
```
Two binary image files changed (194 KB added, 170 KB removed); image diff not shown.
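As a cross-check on the worst-case framing in the note above, the sketch below redoes the S3 read arithmetic quoted in the CONTRIBUTING.md hunk earlier in this commit, using the same assumed numbers (1-minute buffer, 8 queries per dashboard load, 30 days scanned, 100 loads per month, us-east-1 GET pricing):

```ts
// Reproduce the worst-case (pre-vacuum) S3 read cost from docs/CONTRIBUTING.md.
const filesPerDay = 1440;            // 1-minute Firehose buffer => up to 1440 parquet files/day
const queriesPerDashboardLoad = 8;   // queries issued per dashboard load
const daysQueried = 30;              // each load scans 30 days of files
const dashboardLoadsPerMonth = 100;

const s3Reads = filesPerDay * daysQueried * queriesPerDashboardLoad * dashboardLoadsPerMonth;
const costPerThousandReads = 0.0004; // us-east-1 GET price used in the doc

console.log(s3Reads);                                  // 34560000 (34.56 million reads)
console.log((s3Reads / 1000) * costPerThousandReads);  // 13.824 => ~$13.82 before vacuum compaction
```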

scripts/index.ts

Lines changed: 1 addition & 0 deletions
```diff
@@ -107,6 +107,7 @@ async function buildTsLambdas()
 const tsLambdaDirectories = [
   "api-front",
   "api-ingest",
+  "cron-vacuum",
 ];
 
 for( let lambdaDir of tsLambdaDirectories)
```
