docs/CONTRIBUTING.md (29 additions & 5 deletions)

@@ -76,7 +76,7 @@ npm run watch-local-api-ingest-watch
### Record storage strategy

Events/logs/records are stored in S3 in a partitioned manner. The partitioning is dynamic, so all that is left is to store
the data correctly, and that is done by Kinesis Firehose in the format of: site and day (`2023-08-23`). The records are buffered
and stored in parquet format. We are currently using an Append Only Log (AOL) pattern. This means that we are never
updating the logs, we are only adding new ones.
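
As a purely illustrative example (the key names below are placeholders, not the exact prefixes defined by the Firehose dynamic partitioning configuration in the repo), the resulting S3 layout looks something like:

```text
s3://<analytics-bucket>/page_views/site=my-blog/page_opened_at_date=2023-08-23/part-0000.snappy.parquet
s3://<analytics-bucket>/page_views/site=my-blog/page_opened_at_date=2023-08-24/part-0000.snappy.parquet
```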
@@ -89,10 +89,6 @@ This means we store almost twice as many records as opposed to updating the reco

a single record inside a parquet file is not possible, or even if it is we would have to rewrite the whole file. We also
do not want to do individual PUTs for each record as that would be too expensive.

The reason for writing the first record is that we cannot assume the second record will reach us when the user navigates
away from the page. This is because the user might close the tab or browser before the record is sent. This is mitigated
by the client, which uses the `navigator.sendBeacon(...)` function instead of a normal HTTP request.

@@ -102,6 +98,34 @@ POST server side), it also does not send preflight `OPTION` calls even if it is

it slightly differently and the probability of it being delivered is greater than fetch with `keepalive` set.
More on the [topic](https://medium.com/fiverr-engineering/benefits-of-sending-analytical-information-with-sendbeacon-a959cb206a7a).
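
Not from the PR, but a minimal sketch of the client-side behaviour described above; the endpoint URL, payload shape, and values are placeholders rather than the tracking client's real API:

```ts
// Hypothetical payload and endpoint; the real client's shapes live in the repo.
type PageViewEvent = {
  site: string;
  page_id: string;
  page_url: string;
  time_on_page: number; // seconds
};

const INGEST_URL = 'https://example.com/api-ingest/v1/page/view'; // placeholder

function sendEvent(event: PageViewEvent): void {
  const body = JSON.stringify(event);
  // sendBeacon queues the request so it can survive the tab/browser closing,
  // and a text/plain Blob keeps it a "simple" request with no CORS preflight.
  if (navigator.sendBeacon) {
    navigator.sendBeacon(INGEST_URL, new Blob([body], { type: 'text/plain' }));
    return;
  }
  // Fallback: fetch with keepalive, which is less likely to be delivered on unload.
  void fetch(INGEST_URL, { method: 'POST', body, keepalive: true });
}

// First record when the page is opened...
sendEvent({ site: 'my-blog', page_id: 'uuid-1', page_url: '/post/1', time_on_page: 0 });
// ...and the second one when the user leaves, with the accumulated time_on_page.
window.addEventListener('pagehide', () => {
  sendEvent({ site: 'my-blog', page_id: 'uuid-1', page_url: '/post/1', time_on_page: 42 });
});
```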

### Vacuum and record format

As mentioned above, we are using an AOL pattern. The `page_id` is used to group these records: the first record is written when
the page is navigated to and the second one when the user leaves. The only value that differs between the two is that the last
one will have a higher `time_on_page`. So when we do queries we need to group by the `page_id`, order by the
biggest `time_on_page`, and select the first record.
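
As a rough sketch (not taken from the repo; only `page_id` and `time_on_page` come from this doc, the table and date column names are assumed), the dedup described above can be expressed as a window-function query, built here as a string the way the backend would hand it to Athena:

```ts
// Sketch only: 'page_views' and 'page_opened_at_date' are assumed names, not the repo's schema.
export function latestPageViewsQuery(site: string, fromDate: string, toDate: string): string {
  return `
    SELECT *
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) AS rn
      FROM page_views
      WHERE site = '${site}'
        AND page_opened_at_date BETWEEN DATE '${fromDate}' AND DATE '${toDate}'
    ) t
    WHERE t.rn = 1
  `;
}
```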

Depending on the Firehose buffer time, we will have multiple records in a single parquet file. If the buffer time is set
to say 1 minute and we get 1 view every 10 seconds, we will have 1440 parquet files per day. Athena reading S3 files
can get expensive considering that we do about 8 queries per dashboard load. Let's also assume we look at 30 days of data
and that we view the dashboard 100 times that month. That means we will be doing 1440 * 8 * 30 * 100 = 34,560,000 S3 reads.
Let's use one of the cheapest regions, `us-east-1`, where 1000 S3 reads cost $0.0004. This means that we will be paying
about $13.82 for the S3 reads, which, as we know, is the biggest cost driver; see [COST](https://github.com/rehanvdm/serverless-website-analytics/blob/main/docs/COST.md)
for more info on cost estimations.
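
For clarity, the arithmetic above works out as follows (this snippet just replays the numbers from the paragraph, it is not code from the repo):

```ts
// Back-of-the-envelope S3 read cost from the paragraph above.
const filesPerDay = 24 * 60;            // 1-minute Firehose buffer => 1440 files/day
const queriesPerDashboardLoad = 8;
const daysQueried = 30;
const dashboardLoadsPerMonth = 100;

const s3Reads = filesPerDay * queriesPerDashboardLoad * daysQueried * dashboardLoadsPerMonth;
const costUsd = (s3Reads / 1000) * 0.0004; // us-east-1 GET pricing used in the text

console.log(s3Reads.toLocaleString()); // 34,560,000
console.log(costUsd.toFixed(2));       // 13.82
```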

This is why we run a vacuum function that, for each site, runs the group by and order query mentioned before.
This results in fewer parquet files (it should be around 4), and they will also be orders of magnitude smaller.

The vacuum function runs at 01:00 UTC, when all of the previous day's data has been written to S3. It runs as a CTAS query
and stores the data back into the original S3 location the Athena query selected from. This is by design, as we
then delete all the S3 files that existed before the CTAS query ran, making the process idempotent.
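
Not part of the PR, but roughly, the orchestration described above could look like the sketch below; the bucket, prefix, and workgroup are placeholders, the CTAS SQL is passed in, and the repo's actual vacuum implementation will differ (pagination and >1000-key delete batching are also omitted for brevity):

```ts
import { AthenaClient, StartQueryExecutionCommand, GetQueryExecutionCommand } from '@aws-sdk/client-athena';
import { S3Client, ListObjectsV2Command, DeleteObjectsCommand } from '@aws-sdk/client-s3';

const athena = new AthenaClient({});
const s3 = new S3Client({});

// Sketch of the vacuum orchestration described above; names are assumptions, not the repo's code.
export async function vacuumSiteDay(bucket: string, prefix: string, ctasSql: string): Promise<void> {
  // 1. Remember which parquet files exist *before* the CTAS query runs (the Firehose output).
  const before = await s3.send(new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix }));
  const staleKeys = (before.Contents ?? []).map((o) => ({ Key: o.Key! }));

  // 2. Run the CTAS query that writes the deduplicated, compacted parquet for that site/day.
  const start = await athena.send(new StartQueryExecutionCommand({
    QueryString: ctasSql,
    WorkGroup: 'primary', // assumption
  }));

  // 3. Poll until the query finishes.
  for (;;) {
    const state = (await athena.send(new GetQueryExecutionCommand({ QueryExecutionId: start.QueryExecutionId! })))
      .QueryExecution?.Status?.State;
    if (state === 'SUCCEEDED') break;
    if (state === 'FAILED' || state === 'CANCELLED') throw new Error(`Vacuum CTAS ${state}`);
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }

  // 4. Delete only the keys captured in step 1, so the new CTAS output is never touched
  //    and re-running the whole function stays idempotent.
  if (staleKeys.length > 0) {
    await s3.send(new DeleteObjectsCommand({ Bucket: bucket, Delete: { Objects: staleKeys } }));
  }
}
```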

There will be a brief period when the Firehose parquet files and the CTAS parquet files are both in S3. This is fine
because of how we query the data (group by `page_id` and then order by `time_on_page`); there will just be slightly more
records scanned until the Firehose parquet files are deleted.

See the benchmarks for this in https://github.com/rehanvdm/serverless-website-analytics/pull/43.