@hors hors commented Sep 18, 2025

K8SPXC-1621

CHANGE DESCRIPTION

Implement Fluent-bit buffer settings and smart configuration management

This commit adds configurable Fluent-bit buffer settings, template-based and
deterministic configuration generation, and PXC pod restarts when the
LogCollector configuration changes.

Problems Fixed

Fluent-bit Buffer Issues

  • Fixed "file requires a larger buffer size, lines are too long. Skipping file" errors
  • Resolved "buffer_max_size must be >= buffer_chunk" validation errors
  • Addressed insufficient default buffer sizes (32K) for long log lines
  • Fixed Fluent-bit failing to process large log entries

Configuration Management Issues

  • Fixed hardcoded [INPUT] sections in Go code making configuration inflexible
  • Removed the need for users to define multiple [INPUT] sections in cr.yaml just to set Fluent-bit options
  • Resolved lack of configurable buffer settings in LogCollectorSpec

Pod Restart Issues

  • Fixed LogCollector container not restarting when buffer settings changed
  • Addressed need for pod restarts when LogCollector configuration is modified

New Features

Fluent-bit Buffer Settings

  • Add configurable buffer settings (BufferChunkSize, BufferMaxSize, MemBufLimit)
  • Version-based defaults (BufferChunkSize/BufferMaxSize/MemBufLimit): 128k/512k/20MB for CR version >= 1.19.0, 64k/256k/10MB for older versions
  • Automatic validation to ensure BufferMaxSize >= BufferChunkSize
  • Buffer settings applied only to tail input plugins
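
The defaults and the size check listed above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: the BufferSettings struct, the newCR flag (standing in for the "CR version >= 1.19.0" check), the parseSize helper, and the choice to return an error rather than adjust the value are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// BufferSettings mirrors the three options named in the PR description.
// The struct itself is hypothetical; the real LogCollectorSpec may differ.
type BufferSettings struct {
	BufferChunkSize string
	BufferMaxSize   string
	MemBufLimit     string
}

// defaultBufferSettings returns the defaults from the description.
// newCR is an assumed stand-in for "CR version >= 1.19.0".
func defaultBufferSettings(newCR bool) BufferSettings {
	if newCR {
		return BufferSettings{BufferChunkSize: "128k", BufferMaxSize: "512k", MemBufLimit: "20MB"}
	}
	return BufferSettings{BufferChunkSize: "64k", BufferMaxSize: "256k", MemBufLimit: "10MB"}
}

// parseSize converts values such as "64k", "512K", or "20MB" to bytes.
func parseSize(s string) (int64, error) {
	v := strings.TrimSuffix(strings.ToUpper(strings.TrimSpace(s)), "B")
	mult := int64(1)
	switch {
	case strings.HasSuffix(v, "K"):
		mult, v = 1<<10, strings.TrimSuffix(v, "K")
	case strings.HasSuffix(v, "M"):
		mult, v = 1<<20, strings.TrimSuffix(v, "M")
	case strings.HasSuffix(v, "G"):
		mult, v = 1<<30, strings.TrimSuffix(v, "G")
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse size %q: %w", s, err)
	}
	return n * mult, nil
}

// validate checks that BufferMaxSize >= BufferChunkSize.
// Returning an error (instead of silently fixing the value) is an assumption.
func (b BufferSettings) validate() error {
	chunk, err := parseSize(b.BufferChunkSize)
	if err != nil {
		return err
	}
	maxSize, err := parseSize(b.BufferMaxSize)
	if err != nil {
		return err
	}
	if maxSize < chunk {
		return fmt.Errorf("BufferMaxSize (%s) must be >= BufferChunkSize (%s)",
			b.BufferMaxSize, b.BufferChunkSize)
	}
	return nil
}

func main() {
	s := defaultBufferSettings(true)
	fmt.Println(s, s.validate())
}
```

Comparing parsed byte counts rather than raw strings matters here, because "512k" and "1M" do not compare correctly as text.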

Smart Configuration Management

  • Template-based configuration using embedded fluentbit_template.conf
  • Intelligent section merging that handles duplicate INPUT/OUTPUT sections
  • Custom configuration takes precedence over template settings
  • Hybrid approach: template + custom config + buffer settings
  • Support for custom [INPUT], [OUTPUT], and other Fluent-bit sections

Deterministic Configuration Generation

  • Fixed non-deterministic hash calculation causing restart loops
  • Sorted map keys in hash functions for consistent output
  • Content-based ConfigMap optimization prevents unnecessary updates
  • Deterministic section ordering in merged configurations
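
Go randomizes map iteration order, so hashing a map by ranging over it directly can yield a different digest on every reconcile. The sorted-key approach mentioned above might look like the sketch below; hashMap is a hypothetical helper, and the use of SHA-256 is an assumption, not a detail taken from the PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashMap returns a digest that depends only on the map's contents,
// not on Go's randomized iteration order.
func hashMap(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order => stable hash input

	h := sha256.New()
	for _, k := range keys {
		// Length prefixes keep ("ab", "c") distinct from ("a", "bc").
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"fluentbit_pxc.conf": "[INPUT]", "other": "x"}
	b := map[string]string{"other": "x", "fluentbit_pxc.conf": "[INPUT]"}
	fmt.Println(hashMap(a) == hashMap(b))
}
```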

Pod Restart Logic

  • LogCollector configuration changes trigger PXC pod restarts
  • ConfigMap hash included in StatefulSet configuration hash
  • Proper debounce mechanism prevents restart loops
  • Early return optimization for non-LogCollector changes
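
One common way to make ConfigMap changes roll pods is to fold the ConfigMap hash into a hash stored as a pod template annotation, so the StatefulSet spec changes only when the content does. The sketch below illustrates that pattern under stated assumptions: the annotation key, configHash, and needsRollout are hypothetical names, not the operator's actual identifiers.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashAnnotation is a made-up annotation key used only in this sketch.
const hashAnnotation = "example.com/config-hash"

// configHash combines the StatefulSet spec hash with the LogCollector
// ConfigMap hash, so a change in either one yields a new value.
func configHash(stsSpecHash, logCollectorHash string) string {
	sum := sha256.Sum256([]byte(stsSpecHash + "\n" + logCollectorHash))
	return hex.EncodeToString(sum[:])
}

// needsRollout reports whether the stored annotation differs from the
// freshly computed hash. Equal hashes allow an early return with no update.
func needsRollout(annotations map[string]string, newHash string) bool {
	return annotations[hashAnnotation] != newHash
}

func main() {
	current := map[string]string{hashAnnotation: configHash("sts-v1", "cm-v1")}

	// Unrelated reconcile: nothing changed, so no restart is triggered.
	fmt.Println(needsRollout(current, configHash("sts-v1", "cm-v1")))

	// Buffer settings changed, so the ConfigMap hash and combined hash change.
	fmt.Println(needsRollout(current, configHash("sts-v1", "cm-v2")))
}
```

This only avoids restart loops if both input hashes are themselves deterministic, which is why the sorted-key hashing matters.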

Technical Improvements

Configuration Processing

  • Smart merging by section identifiers (Path for INPUT, Name+Match for OUTPUT)
  • Indentation normalization for custom configurations
  • Environment variable handling (POD_NAMESPACE vs POD_NAMESPASE based on CR version)
  • Fallback to minimal configuration when template loading fails
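
Merging by section identifier could work along these lines. This is a simplified sketch, not the PR's implementation: it replaces a whole template section when a custom section has the same identifier, whereas the real code may merge key by key, and it treats every other section type (for example several [FILTER] blocks) as a single slot. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// section is one Fluent-bit block, e.g. [INPUT], with its "Key Value" lines.
type section struct {
	Name  string
	Lines []string
}

// parseSections splits a config into sections, trimming indentation.
func parseSections(conf string) []section {
	var out []section
	for _, raw := range strings.Split(conf, "\n") {
		line := strings.TrimSpace(raw)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			out = append(out, section{Name: strings.ToUpper(strings.Trim(line, "[]"))})
			continue
		}
		if len(out) > 0 {
			out[len(out)-1].Lines = append(out[len(out)-1].Lines, line)
		}
	}
	return out
}

// value returns the value for a key (case-insensitive), or "" if absent.
func (s section) value(key string) string {
	for _, l := range s.Lines {
		f := strings.Fields(l)
		if len(f) >= 2 && strings.EqualFold(f[0], key) {
			return strings.Join(f[1:], " ")
		}
	}
	return ""
}

// id identifies a section: Path for INPUT, Name+Match for OUTPUT.
func (s section) id() string {
	switch s.Name {
	case "INPUT":
		return "INPUT|" + s.value("Path")
	case "OUTPUT":
		return "OUTPUT|" + s.value("Name") + "|" + s.value("Match")
	}
	return s.Name
}

// merge keeps template order, lets custom sections override matching
// template sections, and appends new custom sections in their own order.
func merge(template, custom string) []section {
	merged := parseSections(template)
	index := map[string]int{}
	for i, s := range merged {
		index[s.id()] = i
	}
	for _, s := range parseSections(custom) {
		if i, ok := index[s.id()]; ok {
			merged[i] = s // custom configuration takes precedence
			continue
		}
		index[s.id()] = len(merged)
		merged = append(merged, s)
	}
	return merged
}

// render writes sections back out with normalized four-space indentation.
func render(sections []section) string {
	var b strings.Builder
	for _, s := range sections {
		fmt.Fprintf(&b, "[%s]\n", s.Name)
		for _, l := range s.Lines {
			fmt.Fprintf(&b, "    %s\n", l)
		}
	}
	return b.String()
}

func main() {
	tmpl := "[INPUT]\n  Name tail\n  Path /var/log/a.log\n[OUTPUT]\n  Name stdout\n  Match *\n"
	custom := "[INPUT]\n Name tail\n Path /var/log/a.log\n Buffer_Chunk_Size 128k\n"
	fmt.Print(render(merge(tmpl, custom)))
}
```

Because the result is a slice built in template order followed by custom order, the rendered output is the same on every run for the same inputs.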

Performance Optimizations

  • Content-based ConfigMap update prevention
  • Deterministic hash calculation eliminates unnecessary StatefulSet updates
  • Efficient section parsing and merging algorithms
  • Reduced reconciliation overhead

Error Handling

  • Graceful fallback when template cannot be loaded
  • Proper error handling for hash calculation failures
  • Validation for buffer size relationships
  • Comprehensive error wrapping and logging

New Template File

pkg/controller/pxc/fluentbit_template.conf

  • Embedded Fluent-bit configuration template based on Percona Docker patterns
  • Pre-configured OUTPUT section with stdout format for log forwarding
  • Uses environment variables (POD_NAMESPACE, POD_NAME, LOG_DATA_DIR) for dynamic configuration
  • Template serves as base configuration that gets enhanced with buffer settings and custom configs

Testing

Added comprehensive unit tests covering:

  • Deterministic configuration generation
  • Smart section merging with duplicate handling
  • Hash calculation determinism
  • ConfigMap content optimization
  • Complex custom configuration scenarios

Breaking Changes

None. All changes are backward compatible with existing configurations.

Migration

Existing clusters will automatically get the new buffer settings based on their
CR version. Custom configurations will be merged intelligently with the new
template-based approach.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported PXC versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label Sep 18, 2025
@JNKPercona

Test Name                             Result   Time
affinity-8-0                          passed   00:00:00
auto-tuning-8-0                       passed   00:00:00
cross-site-8-0                        passed   00:00:00
custom-users-8-0                      passed   00:10:20
demand-backup-cloud-8-0               passed   00:00:00
demand-backup-encrypted-with-tls-8-0  passed   00:41:58
demand-backup-8-0                     passed   00:00:00
demand-backup-flow-control-8-0        passed   00:00:00
demand-backup-parallel-8-0            passed   00:00:00
demand-backup-without-passwords-8-0   passed   00:00:00
haproxy-5-7                           passed   00:00:00
haproxy-8-0                           passed   00:00:00
init-deploy-5-7                       failure  00:08:06
init-deploy-8-0                       failure  00:07:55
limits-8-0                            passed   00:00:00
monitoring-2-0-8-0                    passed   00:00:00
monitoring-pmm3-8-0                   passed   00:00:00
one-pod-5-7                           passed   00:00:00
one-pod-8-0                           passed   00:00:00
pitr-8-0                              passed   00:00:00
pitr-gap-errors-8-0                   passed   00:00:00
proxy-protocol-8-0                    passed   00:00:00
proxysql-sidecar-res-limits-8-0       passed   00:00:00
pvc-resize-5-7                        passed   00:00:00
pvc-resize-8-0                        passed   00:00:00
recreate-8-0                          failure  00:15:36
restore-to-encrypted-cluster-8-0      passed   00:00:00
scaling-proxysql-8-0                  passed   00:00:00
scaling-8-0                           passed   00:00:00
scheduled-backup-5-7                  passed   00:00:00
scheduled-backup-8-0                  passed   00:00:00
security-context-8-0                  passed   00:00:00
smart-update1-8-0                     passed   00:00:00
smart-update2-8-0                     passed   00:00:00
storage-8-0                           passed   00:00:00
tls-issue-cert-manager-ref-8-0        passed   00:00:00
tls-issue-cert-manager-8-0            passed   00:00:00
tls-issue-self-8-0                    passed   00:00:00
upgrade-consistency-8-0               failure  00:16:08
upgrade-haproxy-5-7                   failure  00:15:15
upgrade-haproxy-8-0                   failure  00:13:55
upgrade-proxysql-5-7                  failure  00:13:30
upgrade-proxysql-8-0                  failure  00:08:57
users-5-7                             failure  00:18:19
users-8-0                             passed   00:24:57
validation-hook-8-0                   passed   00:00:00
We run 46 out of 46                            03:15:00

commit: 68dd39b
image: perconalab/percona-xtradb-cluster-operator:PR-2192-68dd39b0
