S3 Log Parser

This project was an attempt at rigorously parsing the full information contained in raw S3 logs.

This remains an unsolved problem that others have also attempted to tackle in the past.

This work has transitioned to s3-log-extraction, which instead focuses on developing efficient heuristics that extract only the minimal fields needed for reporting summary activity to the public.

As such, this repository will be left open to allow others to request its revival by opening an issue.

Developed for the DANDI Archive.

Read more about S3 logging on AWS.

Installation

pip install dandi_s3_log_parser
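
A quick way to confirm the installation worked is to import the package in the target environment:

# If this import succeeds, the package is available in the current environment.
import dandi_s3_log_parser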

Workflow

The process comprises three modular steps.

1. Reduction

Filter out:

  • Non-success status codes.
  • Excluded IP addresses.
  • Operation types other than the one specified (REST.GET.OBJECT by default).

Then, extract only a handful of specified fields from each full line of the raw logs: by default, object_key, timestamp, ip_address, and bytes_sent.

The process is designed to be easily parallelized and interruptible: you can safely kill running processes and restart them later without losing most of the progress.
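
As a rough illustration of the reduction idea only (not the package's actual implementation; the function name, pattern, and simplified AWS S3 server-access-log layout are assumptions of this sketch), a single raw line might be reduced like this:

import re

# Simplified pattern for a raw S3 server-access-log line; real lines carry many
# more trailing fields, which this sketch ignores.
LINE_PATTERN = re.compile(
    r'^\S+ \S+ \[(?P<timestamp>[^\]]+)\] (?P<ip_address>\S+) \S+ \S+ '
    r'(?P<operation>\S+) (?P<object_key>\S+) "[^"]*" (?P<status>\d{3}) \S+ (?P<bytes_sent>\S+)'
)

def reduce_line(raw_line, excluded_ips=frozenset(), operation_type="REST.GET.OBJECT"):
    """Return (object_key, timestamp, ip_address, bytes_sent), or None if the line is filtered out."""
    match = LINE_PATTERN.match(raw_line)
    if match is None:
        return None
    if not match["status"].startswith("2"):   # non-success status codes
        return None
    if match["ip_address"] in excluded_ips:   # excluded IP addresses
        return None
    if match["operation"] != operation_type:  # operation types other than the one specified
        return None
    bytes_sent = 0 if match["bytes_sent"] == "-" else int(match["bytes_sent"])
    return match["object_key"], match["timestamp"], match["ip_address"], bytes_sent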

2. Binning

To make the mapping to Dandisets more efficient, the reduced logs are binned by their object keys (asset blob IDs) for fast lookup. Zarr assets are grouped by their parent blob ID; e.g., a request for zarr/abcdefg/group1/dataset1/0 is binned under zarr/abcdefg.

This step reduces the total file sizes from step (1) even further by deduplicating repeated object keys, though it does create a large number of small files.
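
A minimal sketch of the binning rule described above (the helper name and example keys are hypothetical, not part of the package's API):

def bin_key_for(object_key):
    """Map an object key to the key it is binned under."""
    parts = object_key.split("/")
    if parts[0] == "zarr":
        # Zarr assets are grouped by the parent blob ID,
        # e.g. "zarr/abcdefg/group1/dataset1/0" -> "zarr/abcdefg"
        return "/".join(parts[:2])
    return object_key

print(bin_key_for("zarr/abcdefg/group1/dataset1/0"))  # zarr/abcdefg
print(bin_key_for("blobs/abc/def/0123456789abcdef"))  # binned under its own key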

3. Mapping

The final step, which should be run periodically to keep the per-Dandiset usage logs up to date, scans through all currently known Dandisets and their versions, maps the asset blob IDs to their filenames, and generates the most recently parsed usage logs in a form that can be shared publicly.
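
Conceptually, the mapping step joins the binned activity against a per-Dandiset lookup of asset blob IDs to filenames; the data structures and values below are purely hypothetical placeholders for that idea, not the package's internals:

# Hypothetical inputs: binned activity keyed by blob ID, and each Dandiset's
# mapping from blob ID to asset path (in practice these come from steps 1-2 and the DANDI API).
binned_activity = {
    "blobs/abc/def/0123456789abcdef": [("2023-01-02T03:04:05", 1024)],  # (timestamp, bytes_sent)
}
dandiset_assets = {
    "000003": {"blobs/abc/def/0123456789abcdef": "sub-01/sub-01_ses-01_ecephys.nwb"},
}

mapped_logs = {}
for dandiset_id, assets in dandiset_assets.items():
    for blob_id, asset_path in assets.items():
        for timestamp, bytes_sent in binned_activity.get(blob_id, []):
            mapped_logs.setdefault(dandiset_id, []).append((asset_path, timestamp, bytes_sent))

print(mapped_logs)  # per-Dandiset usage records ready to be written out for public sharing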

Usage

Reduction

To reduce:

reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path < base raw S3 logs folder > \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --maximum_number_of_workers < number of workers to use > \
  --maximum_buffer_size_in_mb < approximate amount of RAM to use > \
  --excluded_ips < comma-separated list of known IPs to exclude >
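
As a concrete illustration (the raw-logs path simply mirrors the folder layout used in the binning example further below, and the worker, buffer, and IP values are arbitrary):

reduce_all_dandi_raw_s3_logs \
  --raw_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --maximum_number_of_workers 6 \
  --maximum_buffer_size_in_mb 5000 \
  --excluded_ips 192.0.2.1,192.0.2.2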

Binning

To bin:

bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path < reduced S3 logs folder path > \
  --binned_s3_logs_folder_path < binned S3 logs folder path >

This process is not as friendly to random interruption as the reduction step is. If corruption is detected, the target binning folder will have to be cleaned before re-attempting.

The --file_processing_limit < integer > flag can be used to limit the number of files processed in a single run, which can be useful for breaking the process up into smaller pieces, such as:

bin_all_reduced_s3_logs_by_object_key \
  --reduced_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-reduced \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --file_processing_limit < integer >

Mapping to Dandisets

Required Environment Variables

The map_binned_s3_logs_to_dandisets command requires two environment variables to be set:

  1. IPINFO_CREDENTIALS: An access token for the ipinfo.io service
  • We use this service to extract general geographic region information (not exact physical addresses) for anonymized geographic statistics.
  • We extract country/region information (e.g., "US/California"), while also specially categorizing requests from known services (GitHub, AWS, GCP, VPN).
  2. IP_HASH_SALT: A salt value for hashing IP addresses
  • We use hashing to anonymize IP addresses in the logs while still allowing for unique identification.
  • The hashed values are used as keys in our caching system to track regions without storing actual IP addresses.
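
A minimal sketch of the salted hashing idea described above; the exact hash function and key scheme used by the package may differ:

import hashlib
import os

# The salt comes from the IP_HASH_SALT environment variable (see below for how to set it).
ip_hash_salt = os.environ.get("IP_HASH_SALT", "placeholder-salt")

def hash_ip(ip_address):
    """Anonymize an IP address while keeping a stable key for region caching."""
    return hashlib.sha1((ip_hash_salt + ip_address).encode("utf-8")).hexdigest()

print(hash_ip("192.0.2.3"))  # the same IP + salt always yields the same key; the raw IP is never stored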

To set IPINFO_CREDENTIALS:

  1. Register at ipinfo.io to get an API access token
  2. After registration, obtain your access token from your account dashboard
  3. Set the IPINFO_CREDENTIALS environment variable to this value
export IPINFO_CREDENTIALS="your_token_here"
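
To sanity-check the token, you can query the ipinfo.io REST API directly; this is independent of how the package itself calls the service, and assumes the requests library is installed:

import os
import requests

token = os.environ["IPINFO_CREDENTIALS"]
response = requests.get("https://ipinfo.io/8.8.8.8", params={"token": token}, timeout=10)
response.raise_for_status()
details = response.json()
print(details.get("country"), details.get("region"))  # e.g. "US" and "California"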

To set IP_HASH_SALT:

  1. Use the built-in get_hash_salt function (requires access to the original raw log files)
from dandi_s3_log_parser.testing._helpers import get_hash_salt

# Path to the folder containing the raw log files
raw_logs_path = "/path/to/raw/logs"
salt = get_hash_salt(base_raw_s3_log_folder_path=raw_logs_path)
print(f"Generated IP_HASH_SALT: {salt}")
  2. Set the IP_HASH_SALT environment variable to this generated value
export IP_HASH_SALT="hash_salt_here"

To map:

map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path < binned S3 logs folder path > \
  --mapped_s3_logs_folder_path < mapped Dandiset logs folder > \
  --excluded_dandisets < comma-separated list of six-digit IDs to exclude > \
  --restrict_to_dandisets < comma-separated list of six-digit IDs to restrict mapping to >
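
For example, continuing the folder layout used above and assuming no Dandisets need to be excluded or restricted (the mapped output path here is an arbitrary choice):

map_binned_s3_logs_to_dandisets \
  --binned_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-binned \
  --mapped_s3_logs_folder_path /mnt/backup/dandi/dandiarchive-logs-mapped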

Submit line decoding errors

Please email line decoding errors collected from your local config file (located in ~/.dandi_s3_log_parser/errors) to the core maintainer before raising issues or submitting PRs that contribute them as examples; this makes it easier to correct any aspects that might require anonymization.
