DT-32 Histogram of clicks and impressions #46
base: main
dashboards/clicks_and_impressions.py
Outdated
# helper to extract industry
def extract_industry(params):
    industries = params.get("industries", [])
    if isinstance(industries, list) and industries:
        return industries[0]
    return None

df_processed["industries"] = df_processed["parsed_params"].apply(extract_industry)
From what I understand of this code, a click/impression on a mentor is attributed to Industry X if the search query parameters include Industry X. However, this isn't necessarily the case: the URL alone isn't sufficient, and one needs to use the Elasticsearch data on individual mentors in order to obtain their industry (see #42).
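To illustrate the point about attribution, here is a minimal sketch that looks up each event's industries from the mentor's own profile instead of from the search filter. Everything in it is hypothetical: the profile dictionary stands in for data exported from Elasticsearch, and the field names `doc_id`, `searched_industry`, and `industries` are assumptions, not the project's actual schema.

```python
from collections import Counter

# Hypothetical mentor profiles, keyed by document ID, standing in for
# data pulled from Elasticsearch. Field names are assumptions.
MENTOR_PROFILES = {
    "doc-1": {"name": "A", "industries": ["Finance"]},
    "doc-2": {"name": "B", "industries": ["Healthcare", "Technology"]},
}

# Hypothetical events: the document ID that was clicked, plus the
# industry filter (if any) that was present in the search URL.
EVENTS = [
    {"doc_id": "doc-1", "searched_industry": "Technology"},
    {"doc_id": "doc-2", "searched_industry": None},
    {"doc_id": "doc-3", "searched_industry": "Finance"},  # no matching profile
]


def industries_for_event(event, profiles):
    """Return the industries of the mentor the event refers to."""
    profile = profiles.get(event["doc_id"])
    if profile is None:
        return []  # unknown document ID: attribute to no industry
    return profile.get("industries", [])


def count_by_industry(events, profiles):
    """Count events per industry using the profile, not the search filter."""
    counts = Counter()
    for event in events:
        for industry in industries_for_event(event, profiles):
            counts[industry] += 1
    return counts


industry_counts = count_by_industry(EVENTS, MENTOR_PROFILES)
```

Note that the event on `doc-1` is counted under Finance even though the search filter was Technology, which is the discrepancy described above.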
Grouping by industries is useful, but would require some additional code to take in Elasticsearch data and then perform a JOIN query to match the recorded document IDs with the mentor profiles.
For now, I would like to just have the following implemented:
- Filter only production events
  Each tracked event has a unique ID website_event_id with two fields, data_key and string_value (for now the other ..._value fields aren't used). Each pair corresponds to a single row in the table.
  Filtering by env = 'production' is required to remove development events from the database.
- Calculate the number of clicks and impressions on a per-record basis
- Histogram for these clicks/impressions (to see if there are particular mentors that receive a very high number of clicks/impressions)
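The three steps above could be sketched roughly as follows. This is plain Python rather than the pandas/DuckDB code in the PR, so that it runs without dependencies. The row layout is an assumption: it supposes the event type is available as an `event_name` column and that `env` and the record `id` are stored as `data_key` values, which may not match the real table.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (website_event_id, event_name, data_key, string_value).
# The event_name column and the "id" data_key are assumptions.
ROWS = [
    ("e1", "impression", "env", "production"),
    ("e1", "impression", "id", "mentor-1"),
    ("e2", "click", "env", "production"),
    ("e2", "click", "id", "mentor-1"),
    ("e3", "impression", "env", "development"),
    ("e3", "impression", "id", "mentor-2"),
    ("e4", "impression", "env", "production"),
    ("e4", "impression", "id", "mentor-2"),
]


def pivot_events(rows):
    """Collapse the key/value rows into one dict per website_event_id."""
    events = defaultdict(dict)
    for event_id, event_name, key, value in rows:
        events[event_id]["event_name"] = event_name
        events[event_id][key] = value
    return events


def count_per_record(rows):
    """Count production clicks and impressions for each record id."""
    counts = defaultdict(Counter)
    for event in pivot_events(rows).values():
        # Drop development events and events without a record id.
        if event.get("env") != "production" or "id" not in event:
            continue
        counts[event["id"]][event["event_name"]] += 1
    return counts


def histogram(values, bin_width=1):
    """Map each bin's lower edge to how many values fall in that bin."""
    return Counter((v // bin_width) * bin_width for v in values)


counts = count_per_record(ROWS)
impressions = [c["impression"] for c in counts.values()]
impression_hist = histogram(impressions)
```

A wider `bin_width` would make a long tail of heavily clicked mentors easier to spot once real data is plugged in.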
dashboards/clicks_and_impressions.py
Outdated
df = conn.query("select * from website_event;")

df_processed = df.copy(deep=True)
#
df_processed["url"] = "/?" + df["url_query"].astype(str)
df_processed["query_params"] = df_processed["url"].apply(extract_query_params)
df_processed["parsed_params"] = df_processed["query_params"].apply(process_query_params)
As mentioned above, I updated the query to use DuckDB SQL. It now groups by mentor_name, because grouping by the id field creates 2 different rows for some mentors.
4f47d81 to 57e8060
Assigning to @JaCh23 for review.
@vanesssalai I'd much rather we look into this more deeply. We should join by ID for data integrity purposes, otherwise we may run into issues now or later on (e.g. 2 different mentors with the same name accidentally grouped together). Can you revise the query to use id? Also, can you attach some screenshots of the visual outputs in this thread too? Thanks!
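A small sketch of the name-collision concern, using the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies (the `GROUP BY` shown is ordinary SQL). The `mentors` and `events` tables, their columns, and the sample name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mentors (mentor_id TEXT PRIMARY KEY, mentor_name TEXT);
    CREATE TABLE events (mentor_id TEXT, event_name TEXT);
    -- Two distinct mentors who happen to share a name.
    INSERT INTO mentors VALUES ('m1', 'Alex Tan'), ('m2', 'Alex Tan');
    INSERT INTO events VALUES
        ('m1', 'click'), ('m1', 'impression'), ('m2', 'impression');
    """
)

# Grouping by name merges the two mentors into a single row.
by_name = conn.execute(
    """
    SELECT m.mentor_name, COUNT(*)
    FROM events e JOIN mentors m ON e.mentor_id = m.mentor_id
    GROUP BY m.mentor_name
    """
).fetchall()

# Grouping by ID keeps them apart; the name is carried along for display.
by_id = conn.execute(
    """
    SELECT e.mentor_id, m.mentor_name, COUNT(*)
    FROM events e JOIN mentors m ON e.mentor_id = m.mentor_id
    GROUP BY e.mentor_id, m.mentor_name
    ORDER BY e.mentor_id
    """
).fetchall()

conn.close()
```

If one mentor really does appear under two IDs, as the earlier comment suggests, that would need a separate deduplication step rather than a switch to grouping by name.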
LGTM
LGTM
Added a query for clicks and impressions for mentors (grouped by industry). Currently displayed as a table with the top and bottom 5 on the 'Clicks & Impressions' page.