feat: Enhancements to ClickHouse property_definitions ingestion#32086
feat: Enhancements to ClickHouse property_definitions ingestion#32086tkaemming merged 44 commits intoproperty-definitions-table-and-jobfrom
Conversation
dags/property_definitions.py
Outdated
| JSONType(value) IN ('Int64', 'UInt64', 'Double'), 'Numeric', | ||
| NULL | ||
| )), | ||
| JSONExtractKeysAndValuesRaw({self.source_column}) |
There was a problem hiding this comment.
Not sure if we want to preserve this behavior or not
posthog/rust/property-defs-rs/src/types.rs
Lines 197 to 204 in 052f4ea
| event, | ||
| (arrayJoin({DetectPropertyTypeExpression('properties')}) as property).1 as name, | ||
| property.2 as property_type, | ||
| replaceAll(event, '\\0', '\ufffd') as event, -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L172 |
There was a problem hiding this comment.
I don't think this matters for ClickHouse but I kept it for compatibility purposes, since I'd expect we backfill this table with the existing rows to bootstrap it
I'm not sure if this replacement is consistent with other parts of the application, though
There was a problem hiding this comment.
good question - just since I've been at PH we've seen folks supply crafted and corrupt keys and values in events, at times causing incidents for the team. We definitely see fresh attempts at this sort of thing since landing defensive changes.
I've seen this one (strip null bytes from event name) in several post-capture services so I suspect it's legit. Just my two cents, but unless there are concerns about perf on the CH side, I'd be inclined to leave this in
| insert_query = f""" | ||
| INSERT INTO property_definitions | ||
| SELECT | ||
| team_id, | ||
| team_id as project_id, | ||
| (arrayJoin(JSONExtractKeysAndValuesRaw(properties)) as x).1 as name, | ||
| map( | ||
| 34, 'String', | ||
| 98, 'Boolean', | ||
| 100, 'Numeric', | ||
| 105, 'Numeric', | ||
| 117, 'Numeric', | ||
| 0, NULL | ||
| )[JSONType(x.2)] as property_type, | ||
| event, | ||
| (arrayJoin({DetectPropertyTypeExpression('properties')}) as property).1 as name, | ||
| property.2 as property_type, | ||
| replaceAll(event, '\\0', '\ufffd') as event, -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L172 | ||
| NULL as group_type_index, | ||
| 1 as type, | ||
| {int(PropertyDefinition.Type.EVENT)} as type, | ||
| -- NOTE: not floored, need to check if needed https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L175 | ||
| max(timestamp) as last_seen_at | ||
| FROM events_recent | ||
| WHERE {time_filter} | ||
| WHERE | ||
| {time_range.get_expression("timestamp")} | ||
| -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L13-L14C52 | ||
| AND event NOT IN ('$$plugin_metrics') | ||
| -- https://github.com/PostHog/posthog/blob/052f4ea40c5043909115f835f09445e18dd9727c/rust/property-defs-rs/src/types.rs#L187-L191 | ||
| AND length(event) <= 200 | ||
| AND {COMMON_PROPERTY_FILTERS} | ||
| GROUP BY team_id, event, name, property_type | ||
| ORDER BY team_id, event, name, property_type NULLS LAST | ||
| LIMIT 1 by team_id, event, name | ||
| SETTINGS max_execution_time = {config.max_execution_time} | ||
| """ |
There was a problem hiding this comment.
This (and other similar SQL queries below) will likely be refactored out so they can be reused on the read path, if not replaced my materialized views entirely
| # Add a config for running a backfill for a specific day | ||
| def run_backfill_for_day(date_str: str): | ||
| """ | ||
| Helper function to create a run config for a backfill for a specific day. | ||
|
|
||
| Args: | ||
| date_str: Date string in YYYY-MM-DD format | ||
|
|
||
| Returns: | ||
| Dagster run config for the backfill | ||
| """ | ||
| start_date = f"{date_str} 00:00:00" | ||
| end_date = f"{date_str} 23:59:59" | ||
|
|
||
| return { | ||
| "ops": { | ||
| "ingest_event_properties": {"config": {"start_date": start_date, "end_date": end_date}}, | ||
| "ingest_person_properties": {"config": {"start_date": start_date, "end_date": end_date}}, | ||
| } | ||
| } | ||
|
|
||
|
|
||
| # Helper to run backfill for a specific hour | ||
| def run_backfill_for_hour(hour_str: str): | ||
| """ | ||
| Helper function to create a run config for a backfill for a specific hour. | ||
|
|
||
| Args: | ||
| hour_str: ISO format datetime string for the hour (e.g., "2023-05-15T14:00:00+00:00") | ||
|
|
||
| Returns: | ||
| Dagster run config for the backfill | ||
| """ | ||
| return { | ||
| "ops": { | ||
| "ingest_event_properties": {"config": {"target_hour": hour_str}}, | ||
| "ingest_person_properties": {"config": {"target_hour": hour_str}}, | ||
| } | ||
| } |
There was a problem hiding this comment.
Removed these as the configuration is now a bit less repetitive and it's a bit easier to specify the time range as well
| expected: list[tuple[str, str | None]] | ||
|
|
||
|
|
||
| def test_detect_property_type_expression(cluster: ClickhouseCluster) -> None: |
There was a problem hiding this comment.
Would probably be good to refactor this into a parameterized test instead of one big mega test but that will require a bit of faffing with fixture setup
| PropertyTypeTestData( | ||
| json.dumps({"timestamp": time.time()}), | ||
| [("timestamp", "DateTime")], # XXX: can't be parsed back out correctly! | ||
| ), |
There was a problem hiding this comment.
I think this is consistent with the existing implementation as I'd expect a float can't be converted to an unsigned it - but I could be wrong about this
posthog/rust/property-defs-rs/src/types.rs
Lines 404 to 409 in 052f4ea
It does seem a little strange to not be able to accept timestamps with subsecond precision, but ClickHouse also rejects these values during date parsing (as used in query-time casting back to the configured property types) so changing this behavior would probably actually need to happen in plugin-server, not here - or the read side behavior would need to be more sophisticated
There was a problem hiding this comment.
Good callout! I think you're all clear wrt accidental float conversion in this case 👍
| PropertyTypeTestData( | ||
| '{"a": "true\\n"}', | ||
| [("a", "String")], # XXX: inconsistent with existing implementation | ||
| ), |
There was a problem hiding this comment.
posthog/rust/property-defs-rs/src/types.rs
Line 356 in 052f4ea
ClickHouse trimBoth isn't substitutable with Rust's trim https://posthog.slack.com/archives/C08KJMUG1HP/p1745539308195499
Not sure if this is significant enough to be special cased with a slower regex replace
There was a problem hiding this comment.
I think as long as we're ignoring/removing whitespace from left and right sides as part of the parsing process, we should be gtg here 👍
| PropertyTypeTestData( # weird formats clickhouse allows | ||
| '{"a": "23:21:04", "b": "September", "c": "2024", "d": "1"}', | ||
| [("a", "String"), ("b", "String"), ("c", "String"), ("d", "String")], | ||
| ), | ||
| PropertyTypeTestData('{"a": "2025-04-24"}', [("a", "DateTime")]), | ||
| PropertyTypeTestData('{"a": "04/24/2025"}', [("a", "DateTime")]), | ||
| PropertyTypeTestData( | ||
| '{"a": "2025-04-24 23:21:04"}', # ISO-8601, space separator, no tz | ||
| [("a", "DateTime")], | ||
| ), | ||
| PropertyTypeTestData( | ||
| '{"a": "2025-04-24T23:21:04+0000"}', # ISO-8601, t separator, with tz | ||
| [("a", "DateTime")], | ||
| ), | ||
| PropertyTypeTestData( | ||
| '{"a": "2025-04-24T23:21:04+00:00"}', # RFC 3339 | ||
| [("a", "DateTime")], | ||
| ), | ||
| PropertyTypeTestData( | ||
| '{"a": "Thu, 24 Apr 2025 23:21:04 +0000"}', # RFC 2822 | ||
| [("a", "DateTime")], | ||
| ), |
There was a problem hiding this comment.
This is generally consistent with
posthog/rust/property-defs-rs/src/types.rs
Lines 47 to 51 in 052f4ea
But whatever is parsed into a DateTime here can actually be read as a DateTime by ClickHouse which wasn't necessarily true previously
There was a problem hiding this comment.
PR Summary
Here's my concise review of the recent changes to the property definitions ingestion system:
This PR enhances the ClickHouse property definitions system with improved type detection, configuration management, and comprehensive test coverage.
- Adds robust property type detection in
dags/property_definitions.pyaligned with existing Rust implementation, handling edge cases for timestamps, booleans, and various date formats - Introduces
TimeRangeclass for consistent time window handling with half-open intervals and flexible format parsing - Implements group property ingestion support with proper type handling and deduplication
- Adds extensive test suite in
test_property_definitions.pycovering edge cases like null bytes, property name length limits, and special event types - Improves code organization by moving SQL definitions from migration to model file for better maintainability
5 file(s) reviewed, 2 comment(s)
Edit PR Review Bot Settings | Greptile
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
This comment was marked as outdated.
This comment was marked as outdated.
eli-r-ph
left a comment
There was a problem hiding this comment.
Nice updates! May want another pair of eyes just to verify any in-the-weeds CH details I'm not fully read into yet, but this LGTM 👍
a8074e6
into
property-definitions-table-and-job
Problem
Extends #30807 with fixes, improvements, and adds some missing functionality.
Changes
ClickhouseClusteras it provides default retry policy, consistent configuration, and simplifies test integrationDoes this work well for both Cloud and self-hosted?
Yes, currently off any critical paths
How did you test this code?
Added them