Skip to content

feat: Enhancements to ClickHouse property_definitions ingestion#32086

Merged
tkaemming merged 44 commits intoproperty-definitions-table-and-jobfrom
property-definitions-tidy
May 20, 2025
Merged

feat: Enhancements to ClickHouse property_definitions ingestion#32086
tkaemming merged 44 commits intoproperty-definitions-table-and-jobfrom
property-definitions-tidy

Conversation

@tkaemming
Copy link
Contributor

Problem

Extends #30807 with fixes, improvements, and adds some missing functionality.

Changes

  • Cleans up job configuration so that it only is defined in run configuration once, consolidates methods of specifying time range bounds, uses ClickHouse functions for more flexible format parsing, changes from fully open time interval to half-open
  • Switches queries to ClickhouseCluster as it provides default retry policy, consistent configuration, and simplifies test integration
  • Aligns property type detection and filtering with existing implementation and adds unit tests
  • Adds group property type ingestion
  • Adds end to end test for job, updates migrations to support test setup and teardown

Does this work well for both Cloud and self-hosted?

Yes, currently off any critical paths

How did you test this code?

Added them

JSONType(value) IN ('Int64', 'UInt64', 'Double'), 'Numeric',
NULL
)),
JSONExtractKeysAndValuesRaw({self.source_column})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we want to preserve this behavior or not

if updates.len() > skip_threshold {
warn!(
"Event {} for team {} has more than {} properties, skipping",
event, team_id, skip_threshold
);
metrics::counter!(EVENTS_SKIPPED, &[("reason", "too_many_properties")]).increment(1);
return vec![];
}

event,
(arrayJoin({DetectPropertyTypeExpression('properties')}) as property).1 as name,
property.2 as property_type,
replaceAll(event, '\\0', '\ufffd') as event, -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L172
Copy link
Contributor Author

@tkaemming tkaemming May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this matters for ClickHouse but I kept it for compatibility purposes, since I'd expect we backfill this table with the existing rows to bootstrap it

I'm not sure if this replacement is consistent with other parts of the application, though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question - just since I've been at PH we've seen folks supply crafted and corrupt keys and values in events, at times causing incidents for the team. We definitely see fresh attempts at this sort of thing since landing defensive changes.

I've seen this one (strip null bytes from event name) in several post-capture services so I suspect it's legit. Just my two cents, but unless there are concerns about perf on the CH side, I'd be inclined to leave this in

Comment on lines 118 to 141
insert_query = f"""
INSERT INTO property_definitions
SELECT
team_id,
team_id as project_id,
(arrayJoin(JSONExtractKeysAndValuesRaw(properties)) as x).1 as name,
map(
34, 'String',
98, 'Boolean',
100, 'Numeric',
105, 'Numeric',
117, 'Numeric',
0, NULL
)[JSONType(x.2)] as property_type,
event,
(arrayJoin({DetectPropertyTypeExpression('properties')}) as property).1 as name,
property.2 as property_type,
replaceAll(event, '\\0', '\ufffd') as event, -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L172
NULL as group_type_index,
1 as type,
{int(PropertyDefinition.Type.EVENT)} as type,
-- NOTE: not floored, need to check if needed https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L175
max(timestamp) as last_seen_at
FROM events_recent
WHERE {time_filter}
WHERE
{time_range.get_expression("timestamp")}
-- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L13-L14C52
AND event NOT IN ('$$plugin_metrics')
-- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L187-L191
AND length(event) <= 200
AND {COMMON_PROPERTY_FILTERS}
GROUP BY team_id, event, name, property_type
ORDER BY team_id, event, name, property_type NULLS LAST
LIMIT 1 by team_id, event, name
SETTINGS max_execution_time = {config.max_execution_time}
"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and other similar SQL queries below) will likely be refactored out so they can be reused on the read path, if not replaced my materialized views entirely

Comment on lines -311 to -349
# Add a config for running a backfill for a specific day
def run_backfill_for_day(date_str: str):
"""
Helper function to create a run config for a backfill for a specific day.

Args:
date_str: Date string in YYYY-MM-DD format

Returns:
Dagster run config for the backfill
"""
start_date = f"{date_str} 00:00:00"
end_date = f"{date_str} 23:59:59"

return {
"ops": {
"ingest_event_properties": {"config": {"start_date": start_date, "end_date": end_date}},
"ingest_person_properties": {"config": {"start_date": start_date, "end_date": end_date}},
}
}


# Helper to run backfill for a specific hour
def run_backfill_for_hour(hour_str: str):
"""
Helper function to create a run config for a backfill for a specific hour.

Args:
hour_str: ISO format datetime string for the hour (e.g., "2023-05-15T14:00:00+00:00")

Returns:
Dagster run config for the backfill
"""
return {
"ops": {
"ingest_event_properties": {"config": {"target_hour": hour_str}},
"ingest_person_properties": {"config": {"target_hour": hour_str}},
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these as the configuration is now a bit less repetitive and it's a bit easier to specify the time range as well

expected: list[tuple[str, str | None]]


def test_detect_property_type_expression(cluster: ClickhouseCluster) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would probably be good to refactor this into a parameterized test instead of one big mega test but that will require a bit of faffing with fixture setup

Comment on lines +39 to +42
PropertyTypeTestData(
json.dumps({"timestamp": time.time()}),
[("timestamp", "DateTime")], # XXX: can't be parsed back out correctly!
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is consistent with the existing implementation as I'd expect a float can't be converted to an unsigned it - but I could be wrong about this

if let Some(value) = n.as_u64() {
// we could go more conservative here, but you get the idea
let threshold: u64 = (Utc::now().timestamp_millis() as u64 / 1000u64) - SIX_MONTHS_AGO_SECS;
if value >= threshold {
return true;
}

It does seem a little strange to not be able to accept timestamps with subsecond precision, but ClickHouse also rejects these values during date parsing (as used in query-time casting back to the configured property types) so changing this behavior would probably actually need to happen in plugin-server, not here - or the read side behavior would need to be more sophisticated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +54 to +57
PropertyTypeTestData(
'{"a": "true\\n"}',
[("a", "String")], # XXX: inconsistent with existing implementation
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let s = &s.trim();

ClickHouse trimBoth isn't substitutable with Rust's trim https://posthog.slack.com/archives/C08KJMUG1HP/p1745539308195499

Not sure if this is significant enough to be special cased with a slower regex replace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as long as we're ignoring/removing whitespace from left and right sides as part of the parsing process, we should be gtg here 👍

Comment on lines +58 to +79
PropertyTypeTestData( # weird formats clickhouse allows
'{"a": "23:21:04", "b": "September", "c": "2024", "d": "1"}',
[("a", "String"), ("b", "String"), ("c", "String"), ("d", "String")],
),
PropertyTypeTestData('{"a": "2025-04-24"}', [("a", "DateTime")]),
PropertyTypeTestData('{"a": "04/24/2025"}', [("a", "DateTime")]),
PropertyTypeTestData(
'{"a": "2025-04-24 23:21:04"}', # ISO-8601, space separator, no tz
[("a", "DateTime")],
),
PropertyTypeTestData(
'{"a": "2025-04-24T23:21:04+0000"}', # ISO-8601, t separator, with tz
[("a", "DateTime")],
),
PropertyTypeTestData(
'{"a": "2025-04-24T23:21:04+00:00"}', # RFC 3339
[("a", "DateTime")],
),
PropertyTypeTestData(
'{"a": "Thu, 24 Apr 2025 23:21:04 +0000"}', # RFC 2822
[("a", "DateTime")],
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generally consistent with

static DATETIME_PREFIX_REGEX: LazyLock<Regex> = LazyLock::new(|| {
regex::Regex::new(
r#"^(([0-9]{4}[/-][0-2][0-9][/-][0-3][0-9])|([0-2][0-9][/-][0-3][0-9][/-][0-9]{4}))([ T][0-2][0-9]:[0-6][0-9]:[0-6][0-9].*)?$"#
).unwrap()
});
but in some cases a little stricter, in others a little looser

But whatever is parsed into a DateTime here can actually be read as a DateTime by ClickHouse which wasn't necessarily true previously

@tkaemming tkaemming marked this pull request as ready for review May 8, 2025 22:02
@tkaemming tkaemming requested a review from a team as a code owner May 8, 2025 22:02
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

Here's my concise review of the recent changes to the property definitions ingestion system:

This PR enhances the ClickHouse property definitions system with improved type detection, configuration management, and comprehensive test coverage.

  • Adds robust property type detection in dags/property_definitions.py aligned with existing Rust implementation, handling edge cases for timestamps, booleans, and various date formats
  • Introduces TimeRange class for consistent time window handling with half-open intervals and flexible format parsing
  • Implements group property ingestion support with proper type handling and deduplication
  • Adds extensive test suite in test_property_definitions.py covering edge cases like null bytes, property name length limits, and special event types
  • Improves code organization by moving SQL definitions from migration to model file for better maintainability

5 file(s) reviewed, 2 comment(s)
Edit PR Review Bot Settings | Greptile

tkaemming and others added 2 commits May 8, 2025 15:05
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@posthog-bot

This comment was marked as outdated.

Copy link
Contributor

@eli-r-ph eli-r-ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice updates! May want another pair of eyes just to verify any in-the-weeds CH details I'm not fully read into yet, but this LGTM 👍

@tkaemming tkaemming merged commit a8074e6 into property-definitions-table-and-job May 20, 2025
93 of 96 checks passed
@tkaemming tkaemming deleted the property-definitions-tidy branch May 20, 2025 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants