2 changes: 1 addition & 1 deletion Gemfile.lock
@@ -159,4 +159,4 @@ DEPENDENCIES
 webrick
 
 BUNDLED WITH
-   2.6.5
+   2.7.2
17 changes: 0 additions & 17 deletions _data/fellowships/classifying-user-contributed-images.yaml

This file was deleted.

27 changes: 0 additions & 27 deletions _data/fellowships/expanding-pelican-with-globus.yaml

This file was deleted.

21 changes: 0 additions & 21 deletions _data/fellowships/high-throughput-inference.yaml

This file was deleted.

15 changes: 0 additions & 15 deletions _data/fellowships/measuring-throughput-in-chtc.yaml

This file was deleted.

@@ -1,5 +1,5 @@
 title: Classifying User Contributed Images
-type: Facilitation
+type: Research Facilitation
 summary: |
 CHTC’s High Throughput Computing (HTC) system supports hundreds of users and thousands of jobs each day. It is optimized for workloads or sets of jobs, where many jobs can run in parallel as computational capacity becomes available. This project aims to better understand the impact of workload size and requirements on overall throughput through empirical measurement of workloads in CHTC. A key component of the project will be developing tools to a) submit sample workloads and b) gather metrics about their performance. Once these tools are developed, they can be used to run experiments with different workload types.
@@ -14,6 +14,4 @@ summary: |
 
 - Familiarity with unix and Python
 - Familiarity with git
-Familiarity with HTCondor]
-
-sort: 0
+- Familiarity with HTCondor
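
The summary in the diff above calls for tooling to a) submit sample workloads and b) gather metrics about their performance. A minimal sketch of those two pieces, assuming the htcondor Python bindings; the sleep workload and the attribute choices are illustrative assumptions, not the project's actual tooling:

```python
# A sketch, not the project's tooling: submit a sample workload to a
# schedd and compute a simple runtime metric for it afterwards.
import time

import htcondor

# (a) Submit a sample workload: 100 identical short jobs.
sub = htcondor.Submit({
    "executable": "/bin/sleep",
    "arguments": "60",
    "request_cpus": "1",
    "request_memory": "128MB",
    "log": "workload.log",
})
schedd = htcondor.Schedd()
cluster = schedd.submit(sub, count=100).cluster()

# (b) Gather metrics: wait for the cluster to leave the queue, then read
# per-job timing attributes from the schedd history.
while schedd.query(constraint=f"ClusterId == {cluster}", projection=["ProcId"]):
    time.sleep(30)

ads = schedd.history(
    constraint=f"ClusterId == {cluster}",
    projection=["QDate", "JobStartDate", "CompletionDate"],
    match=100,
)
runtimes = [ad["CompletionDate"] - ad["JobStartDate"] for ad in ads]
print(f"jobs: {len(runtimes)}  mean runtime: {sum(runtimes) / len(runtimes):.1f}s")
```

Varying the job count, resource requests, or executable would then give the different workload types the summary mentions.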
15 changes: 0 additions & 15 deletions _data/fellowships/monitoring-chtc.yaml

This file was deleted.

24 changes: 0 additions & 24 deletions _data/fellowships/pelican-cache-monitoring.yaml

This file was deleted.

21 changes: 0 additions & 21 deletions _data/fellowships/pelican-client-request-tracking.yaml

This file was deleted.

29 changes: 29 additions & 0 deletions _data/fellowships/pelican-request-lifecycle.yaml
@@ -0,0 +1,29 @@
title: Distributed Tracing and Log Aggregation for Pelican Request Lifecycle
type: Software Development
sort: 0
summary: |
In a distributed system with multiple services communicating with one another, a key challenge is correlating logging information from different services that handle a single job or client request. This project aims to design and implement a method for aggregating all logs generated during a client request by introducing a unique identifier that acts as a foreign key to link every log entry together. This focused approach will ensure administrators can precisely trace the path of a request through the system, identifying the services involved and pinpointing the exact location of errors or performance-related events recorded in the logs.

The primary objective of this project is to implement a system for auto-aggregation and tracing of request data across [Pelican’s](https://pelicanplatform.org/) distributed architecture. The goal is to move beyond siloed log files so that administrators have a complete picture of job execution. The core solution involves determining how to aggregate the logging data and creating a unique identifier that is generated and propagated throughout the system, acting as a foreign key to link every log entry together. The fellow will be responsible for defining the tracing methodology, propagating the request ID through the application layers, and making the necessary adjustments in the Pelican code. The fellow will also develop client tooling that uses this trace ID for diagnostics, and will learn to inject diagnostic information back into the result ad for retrospective analysis via HTCondor.

Questions the fellow will have to answer over the course of the project: How do we define the foreign key when a single `pelican` command can translate into multiple transfers or jobs? How best can we aggregate the logs into a searchable system? How can the system handle the continuously growing size of the logs?

By the end of the fellowship, the fellow will acquire a comprehensive understanding of distributed data systems and gain hands-on experience designing and implementing a tracing system for log correlation. They will be responsible for defining the auto-aggregation and tracing methodology using this unique identifier, and for propagating the request ID through all layers of the Pelican code. This work will include adjusting selected places in the Pelican code and developing client tooling to utilize the trace ID. Additionally, the fellow will solidify their practical skills in Python and Go programming.
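
A minimal sketch of the core mechanism described above, assuming names that are not Pelican's own (the `X-Pelican-Trace-Id` header and the log format are hypothetical): one UUID is minted per request, carried on the HTTP request, and stamped on every log line, acting as the foreign key the summary describes.

```python
# A sketch, not Pelican's implementation: propagate a per-request UUID
# via an HTTP header and stamp it on every log record.
import logging
import uuid

import requests

TRACE_HEADER = "X-Pelican-Trace-Id"  # hypothetical header name

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s trace=%(trace_id)s %(message)s")
log = logging.getLogger("pelican-client")


def traced_get(url: str) -> requests.Response:
    trace_id = str(uuid.uuid4())
    extra = {"trace_id": trace_id}  # stamps the ID onto each log record
    log.info("starting transfer of %s", url, extra=extra)
    resp = requests.get(url, headers={TRACE_HEADER: trace_id})
    log.info("finished transfer, status=%s", resp.status_code, extra=extra)
    return resp

# A receiving service would reuse the incoming ID instead of minting its
# own, e.g. trace_id = headers.get(TRACE_HEADER) or str(uuid.uuid4()),
# so every hop logs under the same key.
```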

#### Project Objectives:

The project's specific objectives are broken down to reflect both the high-level design and the necessary low-level implementation:

- Implement UUID-based Tracing: Establish the methodology for UUID generation/propagation and use it as a foreign key for log correlation across all services.
- Augment Service Logs: Adjust selected places in the Pelican code to ensure the UUID is consistently captured.
- Develop Client Tooling: Create tools that run on the client or service hosts to leverage the UUID for direct log retrieval and diagnostics (a sketch follows this list).
- System Integration: Create a system for client-side request tracking that leverages the aggregated data.
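
A hedged sketch of the client-tooling objective, assuming the aggregated logs land as newline-delimited JSON with `trace_id`, `time`, `service`, and `msg` fields (a format invented here for illustration, not Pelican's):

```python
# A sketch: pull every entry recorded under one trace ID out of an
# aggregated newline-delimited JSON log file, in time order.
import json
import sys


def entries_for_trace(log_path: str, trace_id: str) -> list[dict]:
    entries = []
    with open(log_path) as fh:
        for line in fh:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate non-JSON lines in the stream
            if entry.get("trace_id") == trace_id:
                entries.append(entry)
    return sorted(entries, key=lambda e: e.get("time", ""))


if __name__ == "__main__":
    # usage: python trace_logs.py aggregated.log <trace-id>
    for e in entries_for_trace(sys.argv[1], sys.argv[2]):
        print(e.get("time"), e.get("service"), e.get("msg"))
```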

#### Prerequisite skills or education that would help the Fellow work on the project:

- Python and Golang (required)
- Linux/CLI (required)
- HTTP development (preferred)
- Distributed Computing (preferred)
- Git/GitHub/GitHub Actions (preferred)
- Docker/Kubernetes (preferred)