# Cost

## Current WORST case projected cost

A **worst case cost analysis** is done in this [Google Sheet](https://docs.google.com/spreadsheets/d/1-UFCtBt8HJ0CY7iVGGKzJ6jKGM9GugE6sAfwspobOI8/edit#gid=0).

_To use this sheet, ONLY change the values in the yellow cells; these are the hyperparameters that you control. The rest of the values are calculated from these hyperparameters or are constants._

With the hyperparameters set to:
- 1 site
- a 15 minute Firehose buffer interval
- 200 dashboard views per month

The worst case projected costs are:

| Views       | Cost ($) |
|-------------|----------|
| 10,000      | 2.01     |
| 100,000     | 3.24     |
| 1,000,000   | 14.64    |
| 10,000,000  | 128.74   |
| 100,000,000 | 1,288.39 |

## Cost breakdown

The majority of the cost comes from Athena S3 reads. We optimize for this by using Kinesis Firehose and setting the buffer interval as long as possible (15 minutes). The downside is that the data is not available in near real-time but is delayed by 15 ±1 minutes. This is a trade-off we are willing to make, as you rarely need this information in real time.

You can shorten the Firehose buffer interval to as little as 1 minute, but this increases the cost. On low volume sites this might be worth it; on high volume sites it becomes too expensive.
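
To make the knob concrete, here is a minimal CDK sketch (TypeScript) of where the buffer interval lives on the Firehose delivery stream. The stream, bucket, and role names are placeholders, not this project's actual resources:

```ts
import * as cdk from 'aws-cdk-lib';
import { CfnDeliveryStream } from 'aws-cdk-lib/aws-kinesisfirehose';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'AnalyticsCostDemo');

// Placeholder ARNs; substitute your own bucket and Firehose role.
new CfnDeliveryStream(stack, 'PageViewStream', {
  deliveryStreamType: 'DirectPut',
  extendedS3DestinationConfiguration: {
    bucketArn: 'arn:aws:s3:::my-analytics-bucket',
    roleArn: 'arn:aws:iam::123456789012:role/my-firehose-role',
    // The cost knob: buffer records for up to 15 minutes (or 128 MB,
    // whichever fills first) so fewer, larger objects land in S3.
    bufferingHints: {
      intervalInSeconds: 900, // lower towards 60 for fresher data at a higher cost
      sizeInMBs: 128,
    },
  },
});
```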

The number of sites also influences the cost, because each site is a multiplier on the number of partitions written. In other words, if we have 100 views and 1 site, all 100 views are written into 1 partition. If we have 100 views and 2 sites, assuming an equal split of views, 50 views are written into one partition and 50 into another, resulting in 2 partitions.
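
The multiplier is easy to sanity check with back-of-the-envelope arithmetic. This is not the spreadsheet's exact model, just the intuition behind it:

```ts
// Worst case: every site keeps its Firehose buffer busy, so each site
// flushes one S3 object per buffer interval.
function worstCaseS3ObjectsPerMonth(sites: number, bufferIntervalMinutes: number): number {
  const minutesPerMonth = 30 * 24 * 60; // 43,200
  return sites * (minutesPerMonth / bufferIntervalMinutes);
}

console.log(worstCaseS3ObjectsPerMonth(1, 15)); // 2,880 objects
console.log(worstCaseS3ObjectsPerMonth(2, 15)); // 5,760 objects
```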

## Improvements

Currently, we partition by month. This was because we used to add partitions manually and did not want to burden the user with doing this through the frontend on a daily basis. Since we switched to dynamic partitioning, we can change this to daily. This will result in more partitions, but not more S3 files written, because the Firehose buffer interval is still less than a day. With this change we will limit the amount of data scanned by Athena significantly, which will reduce the cost. Assuming that, on average, you do not query the full month's data but only today or a few days, this will be a significant cost reduction.
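
A hedged sketch of what daily, per-site dynamic partitioning could look like on the Firehose S3 destination. It assumes each record is JSON with a `site` field; the event shape and prefix layout are assumptions, not this project's final schema:

```ts
import { CfnDeliveryStream } from 'aws-cdk-lib/aws-kinesisfirehose';

// Firehose expands the !{...} namespaces at delivery time, so every
// site/day combination lands under its own S3 key and can be exposed
// as its own Athena partition.
const dailyPartitioning: Partial<CfnDeliveryStream.ExtendedS3DestinationConfigurationProperty> = {
  dynamicPartitioningConfiguration: { enabled: true },
  prefix:
    'site=!{partitionKeyFromQuery:site}/' +
    'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
  errorOutputPrefix: 'errors/!{firehose:error-output-type}/',
  processingConfiguration: {
    enabled: true,
    processors: [
      {
        type: 'MetadataExtraction',
        parameters: [
          // Pull the partition key out of each JSON record with jq.
          { parameterName: 'MetadataExtractionQuery', parameterValue: '{site: .site}' },
          { parameterName: 'JsonParsingEngine', parameterValue: 'JQ-1.6' },
        ],
      },
    ],
  },
};
```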

We will also create janitors/background workers that run daily to optimize the partitions. They will use Athena's CTAS (CREATE TABLE AS SELECT) queries to compact the many small daily Firehose Parquet files into just a few files. This will also significantly reduce the amount of data scanned by Athena, which will reduce the cost.
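
A rough sketch of what such a janitor could run, using the AWS SDK for JavaScript v3. The database, table, column, and bucket names are all placeholders; the real compaction query will depend on the final schema:

```ts
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';

// CTAS rewrites one day of small Firehose files into a handful of
// larger Parquet files at a new S3 location.
const compactionQuery = `
CREATE TABLE analytics.page_views_2023_01_01_compacted
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-analytics-bucket/compacted/2023/01/01/'
) AS
SELECT * FROM analytics.page_views
WHERE year = '2023' AND month = '01' AND day = '01'`;

async function compactDay(): Promise<void> {
  const athena = new AthenaClient({});
  await athena.send(
    new StartQueryExecutionCommand({
      QueryString: compactionQuery,
      ResultConfiguration: { OutputLocation: 's3://my-analytics-bucket/athena-results/' },
    }),
  );
}

compactDay().catch(console.error);
```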

## FAQ

### Why use the Lambda Function URL and CloudFront?

CloudFront + the Function URL is the cheapest method of exposing your Lambda function to the internet while keeping some form of control. By control, we mean that you can give it a custom domain name and add a WAF for security purposes, should we ever want to protect against DDoS attacks. We would have reverse proxied through CloudFront anyway to avoid CORS issues, so adding two API Gateways would have been more expensive.
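
A minimal CDK sketch of the pattern, assuming a hypothetical ingest function (the handler code and names are illustrative only):

```ts
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import { HttpOrigin } from 'aws-cdk-lib/aws-cloudfront-origins';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'IngestDemo');

const ingestFn = new lambda.Function(stack, 'IngestFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => ({ statusCode: 200 });'),
});

// Function URLs cost nothing to create; you pay only for invocations.
const fnUrl = ingestFn.addFunctionUrl({
  authType: lambda.FunctionUrlAuthType.NONE,
});

// CloudFront in front of the URL is where a custom domain and WAF can
// be attached. Fn.select/Fn.split strip "https://" from the URL token
// to get the bare domain name the origin expects.
new cloudfront.Distribution(stack, 'IngestDistribution', {
  defaultBehavior: {
    origin: new HttpOrigin(cdk.Fn.select(2, cdk.Fn.split('/', fnUrl.url))),
    allowedMethods: cloudfront.AllowedMethods.ALLOW_ALL,
    cachePolicy: cloudfront.CachePolicy.CACHING_DISABLED,
  },
});
```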

Let's consider the ingestion cost of 1 million views per month:
- REST API GW: $3.50
- HTTP API GW: $1.00
- CloudFront: $1.00 to $1.20

The per-request pricing of HTTP API GW and CloudFront is actually very similar. In hindsight, we could have used HTTP API GW, but because we are already using CloudFront for all our domains, it just makes things easier.

### Why not use SQS or EventBridge?

Having an `ingest => buffer => process => store` pipeline is nice, but it just adds extra cost. Instead, we assume that if you exceed the rate limit, something anomalous is happening, and rather than trying to ingest the data we simply drop it (via Lambda reserved concurrency). Website analytics is not mission-critical data; your wallet is. When the rate limit is exceeded, you will get an alarm, and you can choose to increase the rate limit if it is legitimate traffic.
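
A sketch of how this rate limit and alarm could be wired up in CDK. The concurrency value of 5 is illustrative, not the project's actual setting:

```ts
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'RateLimitDemo');

const ingestFn = new lambda.Function(stack, 'IngestFn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => ({ statusCode: 200 });'),
  // The rate limit: requests beyond this concurrency are throttled,
  // i.e. the data is dropped instead of buffered.
  reservedConcurrentExecutions: 5,
});

// Alarm on throttles so a human can decide whether the traffic is
// legitimate and the limit should be raised.
new cloudwatch.Alarm(stack, 'IngestThrottleAlarm', {
  metric: ingestFn.metricThrottles({ period: cdk.Duration.minutes(5) }),
  threshold: 1,
  evaluationPeriods: 1,
});
```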

Also, SQS adds $0.40 per million requests and EventBridge adds $1.00 per million events.

### Why not use DynamoDB?

DynamoDB (DDB) is the only other serverless database that we could potentially use to store the data while staying true to our objectives. However, DDB is not a good fit for this use case because it is not optimized for analytics; lookups on DDB would be much more expensive than on S3.