Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Job Performance Metrics #3596

Open
rajadain opened this issue Feb 23, 2023 · 4 comments
Open

Add Job Performance Metrics #3596

rajadain opened this issue Feb 23, 2023 · 4 comments

Comments

@rajadain
Copy link
Member

rajadain commented Feb 23, 2023

Add a CloudWatch (or other) dashboard to monitor app-specific metrics. We run these manually from time to time, but should be available on demand to monitor ongoing health of the application.

Once such metrics are in place, we can also consider adding alarms for when they exceed some bounds.

Examples of current metrics:

  • Number of Jobs per Week:

    WITH week_list AS (
        SELECT DATE_PART('week', week_start_date) AS week
        FROM GENERATE_SERIES('2023-01-02'::DATE, '2024-01-01'::DATE, '1 week'::INTERVAL) AS week_start_date
    ), started_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS started_count
        FROM core_job
        WHERE status = 'started'
          AND DATE_PART('year', created_at) = '2023'
        GROUP BY week
    ), complete_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS complete_count
        FROM core_job
        WHERE status = 'complete'
          AND DATE_PART('year', created_at) = '2023'
        GROUP BY week
    ), failed_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS failed_count
        FROM core_job
        WHERE status = 'failed'
          AND DATE_PART('year', created_at) = '2023'
        GROUP BY week
    )
    
    SELECT DISTINCT wl.week, s.started_count, c.complete_count, f.failed_count
    FROM week_list wl
        LEFT OUTER JOIN started_count s ON wl.week = s.week
        LEFT OUTER JOIN complete_count c ON wl.week = c.week
        LEFT OUTER JOIN failed_count f ON wl.week = f.week
    ORDER BY wl.week;

    image

  • Job Success Rate (completed jobs / all jobs) per Week:

    WITH week_list AS (
        SELECT DATE_PART('week', week_start_date) AS week
        FROM GENERATE_SERIES('2022-01-02'::DATE, '2023-01-01'::DATE, '1 week'::INTERVAL) AS week_start_date
    ), started_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS started_count
        FROM core_job
        WHERE status = 'started'
          AND DATE_PART('year', created_at) = '2022'
        GROUP BY week
    ), complete_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS complete_count
        FROM core_job
        WHERE status = 'complete'
          AND DATE_PART('year', created_at) = '2022'
        GROUP BY week
    ), failed_count AS (
        SELECT DATE_PART('week', created_at) AS week, COUNT(*) AS failed_count
        FROM core_job
        WHERE status = 'failed'
          AND DATE_PART('year', created_at) = '2022'
        GROUP BY week
    )
    
    SELECT DISTINCT wl.week, 100.0 * COALESCE(c.complete_count, 0) / (COALESCE(s.started_count, 0) + COALESCE(c.complete_count, 0) + COALESCE(f.failed_count, 0)) AS success_rate
    FROM week_list wl
        LEFT OUTER JOIN started_count s ON wl.week = s.week
        LEFT OUTER JOIN complete_count c ON wl.week = c.week
        LEFT OUTER JOIN failed_count f ON wl.week = f.week
    ORDER BY wl.week;

    image

  • Average Job Completion Time (delivered_at - created_at, in seconds):

    SELECT DATE(created_at), EXTRACT(EPOCH FROM AVG(delivered_at - created_at)) AS avg_seconds
    FROM core_job
    WHERE created_at >= '2023-02-01'
    GROUP BY DATE(created_at)
    ORDER BY DATE(created_at);

    image

@rajadain
Copy link
Member Author

Noticing that the average job runtime is going up, seems to be reset with every production deployment:

SELECT DATE(created_at), EXTRACT(EPOCH FROM AVG(delivered_at - created_at)) AS avg_seconds
FROM core_job
WHERE created_at >= '2023-01-01'
GROUP BY DATE(created_at)
HAVING EXTRACT(EPOCH FROM AVG(delivered_at - created_at)) < 700 -- exclude anomalies
ORDER BY DATE(created_at);

image

@rajadain
Copy link
Member Author

Another version of the query above, but looking at p75 and p95 metrics instead of the average:

WITH query_durations AS (
    SELECT DATE(created_at), EXTRACT(EPOCH FROM delivered_at - created_at) AS duration
    FROM core_job
    WHERE created_at >= '2023-01-01'
)

SELECT date,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration) AS p75,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) AS p95
FROM query_durations
WHERE date NOT IN (
    -- Exclude dates with exceptional values to better see trend
    '2023-01-16',
    '2023-02-16',
    '2023-02-20',
    '2023-02-22')
GROUP BY date
ORDER BY date;

image

@rajadain
Copy link
Member Author

Another version of the above

image

@rajadain
Copy link
Member Author

rajadain commented Sep 6, 2024

Latest version of the above:

image

Interesting that after the last production deployment, the line was flat for a bit before it started going up again. Now it's higher than it has been in record. Should likely do another production deploy to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant