NiWaRe commented on Nov 25, 2025

Summary

Implements analytics tracking for the MCP server to measure:

  1. Unique users - Track organizational adoption
  2. MCP tool call distribution - Understand feature usage
  3. Weekly active users (WAU) & retention - Measure engagement

Architecture: Cloud Run → Cloud Logging → BigQuery → Hex

Changes

New Module: analytics.py

  • Structured JSON logging for Cloud Logging integration (sketched after this list)
  • Tracks 3 event types: user_session, tool_call, request
  • Automatic parameter sanitization (API keys/tokens redacted)
  • Graceful failure (wrapped in try/except so analytics never breaks the server)
  • Can be disabled via MCP_ANALYTICS_DISABLED=true
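
A minimal sketch of the emission core, assuming a track_event() helper (the actual analytics.py API may differ): each event goes out as one JSON line on stdout, which Cloud Run parses into jsonPayload, so the fields land under jsonPayload.json_fields exactly as the log-sink filter in the setup section expects.

```python
import json
import os
import sys

def analytics_disabled() -> bool:
    # Opt-out is checked at call time, so flipping the env var needs no code change.
    return os.environ.get("MCP_ANALYTICS_DISABLED", "").lower() == "true"

def track_event(event_type: str, **fields) -> None:
    if analytics_disabled():
        return
    try:
        # One JSON object per line; Cloud Run ingests it as structured jsonPayload.
        line = json.dumps({"json_fields": {"event_type": event_type, **fields}})
        print(line, file=sys.stdout, flush=True)
    except Exception:
        pass  # graceful failure: analytics must never break a request
```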

Auth Middleware: auth.py

  • Tracks user sessions in mcp_auth_middleware() (see the sketch below)
  • Captures: session_id, user_id, email_domain, API key hash (16 chars)
  • Does NOT log: Full API keys, full email addresses
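
A hedged sketch of the session-tracking step, reusing track_event() from the analytics.py sketch above; the helper name and exact field derivations are assumptions based on the list above.

```python
import hashlib
import uuid

def track_user_session(user_id: str, email: str, api_key: str) -> str:
    session_id = uuid.uuid4().hex  # generated identifier, carries no PII
    track_event(
        "user_session",
        session_id=session_id,
        user_id=user_id,                    # W&B username (already public)
        email_domain=email.split("@")[-1],  # domain only, never the full address
        api_key_hash=hashlib.sha256(api_key.encode()).hexdigest()[:16],
    )
    return session_id
```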

Tool Call Tracking: tools_utils.py

  • Enhanced log_tool_call() with analytics tracking (sketched below)
  • Captures: tool_name, session_id, user_id, success, duration_ms
  • Parameters sanitized automatically
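
A hedged sketch of the enhanced path, reusing track_event() from above and the sanitize_params() helper sketched under Security Features; the real tools_utils.py signature may differ. The finally block ensures an event is emitted whether the tool succeeds or raises.

```python
import time
from typing import Any, Callable

def log_tool_call(tool_name: str, session_id: str, user_id: str,
                  fn: Callable[..., Any], **params) -> Any:
    start = time.monotonic()
    success = True
    try:
        return fn(**params)
    except Exception:
        success = False
        raise
    finally:
        # Emitted on success and failure alike; duration covers the full call.
        track_event(
            "tool_call",
            tool_name=tool_name,
            session_id=session_id,
            user_id=user_id,
            success=success,
            duration_ms=round((time.monotonic() - start) * 1000),
            params=sanitize_params(params),
        )
```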

Security & Privacy

What is Logged

  • ✅ Email domains (e.g., "anthropic.com") - NOT full emails
  • ✅ W&B usernames (already public)
  • ✅ Session IDs (generated hashes)
  • ✅ Tool names and sanitized parameters
  • ✅ API key hash prefix (16 chars, for debugging)

What is NOT Logged

  • ❌ Full email addresses
  • ❌ API keys or Bearer tokens
  • ❌ Passwords
  • ❌ Large data payloads (parameter values truncated to 200 chars)
  • ❌ Sensitive parameters (redacted)

Security Features

  • Parameter sanitization in track_tool_call() (illustrated below)
  • Graceful degradation (never fails requests)
  • Opt-out via environment variable
  • No direct PII collection
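
A minimal sketch of the sanitization pass that track_tool_call() applies; the helper name sanitize_params() and the exact deny-list are illustrative assumptions.

```python
SENSITIVE_KEYS = {"api_key", "apikey", "token", "authorization", "password"}
MAX_VALUE_LEN = 200  # large payloads are cut off, never logged in full

def sanitize_params(params: dict) -> dict:
    clean = {}
    for key, value in params.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"  # sensitive values never reach the logs
        else:
            text = str(value)
            clean[key] = text[:MAX_VALUE_LEN] + ("..." if len(text) > MAX_VALUE_LEN else "")
    return clean
```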

Setup Instructions

1. Create BigQuery Dataset & Log Sink

```bash
# Create dataset
bq --project_id=wandb-mcp-production mk \
  --dataset --location=us-central1 mcp_analytics

# Create log sink
gcloud logging sinks create mcp-analytics-sink \
  bigquery.googleapis.com/projects/wandb-mcp-production/datasets/mcp_analytics \
  --log-filter='
    resource.type="cloud_run_revision"
    resource.labels.service_name="wandb-mcp-server"
    jsonPayload.json_fields.event_type=~"user_session|tool_call|request"
  ' \
  --project=wandb-mcp-production

# Grant permissions
SERVICE_ACCOUNT=$(gcloud logging sinks describe mcp-analytics-sink \
  --project=wandb-mcp-production --format='value(writerIdentity)')
gcloud projects add-iam-policy-binding wandb-mcp-production \
  --member="$SERVICE_ACCOUNT" --role="roles/bigquery.dataEditor"
```

2. Create Analytics View

After logs start flowing (5-10 minutes), create a view:

```sql
CREATE OR REPLACE VIEW `wandb-mcp-production.mcp_analytics.analytics_events` AS
SELECT
  timestamp,
  jsonPayload.json_fields.event_type as event_type,
  jsonPayload.json_fields.session_id as session_id,
  jsonPayload.json_fields.user_id as user_id,
  jsonPayload.json_fields.email_domain as email_domain,
  jsonPayload.json_fields.tool_name as tool_name,
  jsonPayload.json_fields.success as success,
  jsonPayload.json_fields.duration_ms as duration_ms
FROM `wandb-mcp-production.mcp_analytics.cloudrun_googleapis_com_stdout_*`
WHERE jsonPayload.json_fields.event_type IS NOT NULL;
```

3. Query Metrics

Unique users by domain:

```sql
SELECT email_domain, COUNT(DISTINCT user_id) as unique_users
FROM `wandb-mcp-production.mcp_analytics.analytics_events`
WHERE event_type = 'user_session' AND email_domain IS NOT NULL
GROUP BY email_domain ORDER BY unique_users DESC;
```

Tool call distribution:

```sql
SELECT tool_name, COUNT(*) as call_count, COUNT(DISTINCT user_id) as unique_users
FROM `wandb-mcp-production.mcp_analytics.analytics_events`
WHERE event_type = 'tool_call' GROUP BY tool_name ORDER BY call_count DESC;
```

Weekly active users:

```sql
SELECT DATE_TRUNC(DATE(timestamp), WEEK) as week, COUNT(DISTINCT user_id) as wau
FROM `wandb-mcp-production.mcp_analytics.analytics_events`
WHERE event_type IN ('user_session', 'tool_call')
GROUP BY week ORDER BY week DESC;
```

Testing

  1. Deploy to Cloud Run with this change
  2. Make some API calls to the MCP server
  3. Check Cloud Logging for analytics events:
    ```bash
    gcloud logging read 'jsonPayload.json_fields.event_type:*' --limit 10
    ```
  4. Wait 5-10 minutes, then check BigQuery:
    ```bash
    bq query --use_legacy_sql=false \
      'SELECT * FROM `wandb-mcp-production.mcp_analytics.analytics_events` LIMIT 10'
    ```

Review Checklist

  • Security Team: Data collection and privacy implications reviewed
  • Data Science Team: Metrics meet requirements
  • Engineering Team: Code quality and error handling reviewed
  • Legal Team: GDPR/privacy compliance confirmed (if needed)

Questions for Reviewers

  1. Is logging email domains (not full emails) acceptable?
  2. Should we add opt-in consent mechanism for users?
  3. Is 90-day data retention appropriate? (needs configuration)
  4. Are these metrics sufficient or should we track additional events?

Cost Estimate

  • Storage: ~30 MB/month at BigQuery's ~$0.02/GB-month active-storage rate ≈ $0.0006/month (negligible)
  • Queries: <$5/month (first 1TB free)

Rollback Plan

If needed, disable analytics without redeploying code (updating the env var rolls out a new revision automatically):
```bash
gcloud run services update wandb-mcp-server --region us-central1 \
  --set-env-vars MCP_ANALYTICS_DISABLED=true
```

NiWaRe marked this pull request as draft on November 25, 2025 at 00:45