DT-32 Histogram of clicks and impressions #46

Open: vanesssalai wants to merge 6 commits into main

vanesssalai

Added a query for clicks and impressions for mentors (grouped by industry). The results are currently displayed as a table with the top and bottom 5 on the 'Clicks & Impressions' page.

linear bot commented Dec 24, 2024

wei2912 linked an issue Dec 24, 2024 that may be closed by this pull request
wei2912 self-requested a review December 24, 2024 09:09
wei2912 assigned wei2912 and vanesssalai and unassigned wei2912 Dec 24, 2024
Comment on lines 18 to 25
# helper to extract industry
def extract_industry(params):
    industries = params.get("industries", [])
    if isinstance(industries, list) and industries:
        return industries[0]
    return None

df_processed["industries"] = df_processed["parsed_params"].apply(extract_industry)

From what I understand of this code, a click/impression on a mentor is considered to be of Industry X if the search query parameter includes Industry X. However, this isn't necessarily the case - in fact, the URL visit isn't sufficient and one needs to use the Elasticsearch data on individual mentors in order to obtain their industry (see #42).

wei2912 (Member) left a comment

Grouping by industries is useful, but would require some additional code to take in Elasticsearch data and then perform a JOIN query to match the recorded document IDs with the mentor profiles.
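
A rough sketch of what that join could look like, written in plain Python so it runs without an Elasticsearch cluster. The `mentor_profiles` mapping stands in for data exported from Elasticsearch, and the field names `document_id`, `event_type`, and `industries` are assumptions for illustration, not the project's actual schema.

```python
from collections import Counter

# Hypothetical mentor profiles keyed by document ID (stand-in for Elasticsearch data).
mentor_profiles = {
    "m1": {"name": "Mentor A", "industries": ["Finance"]},
    "m2": {"name": "Mentor B", "industries": ["Healthcare", "Technology"]},
}

# Hypothetical tracked events, each recording the document ID of the mentor involved.
events = [
    {"event_type": "click", "document_id": "m1"},
    {"event_type": "impression", "document_id": "m2"},
    {"event_type": "click", "document_id": "m2"},
    {"event_type": "click", "document_id": "m3"},  # no matching profile
]


def count_by_industry(events, profiles):
    """Inner-join events to profiles on document ID, then count per (industry, event type)."""
    counts = Counter()
    for event in events:
        profile = profiles.get(event["document_id"])
        if profile is None:
            continue  # inner join: drop events without a known mentor profile
        for industry in profile["industries"]:
            counts[(industry, event["event_type"])] += 1
    return counts


industry_counts = count_by_industry(events, mentor_profiles)
```

In this sketch a mentor listed under several industries contributes to each of them, so the per-industry totals can add up to more than the number of events. Whether that is the desired behaviour is a separate design decision.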

For now, I would like to just have the following implemented:

  1. Filter only production events

Each tracked event has a unique ID, website_event_id, along with pairs of two fields, data_key and string_value (for now the other ..._value fields aren't used). Each pair corresponds to a single row in the table:

image

Filtering by env = 'production' is required to remove development events from the database.

  2. Calculate number of clicks and impressions on a per-record basis

  3. Histogram for these clicks/impressions (to see if there are particular mentors that receive a very high number of clicks/impressions)
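
The three steps above could be sketched roughly as follows. This uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies; the query sticks to plain SQL (conditional aggregation and `GROUP BY`) that should carry over. Only `website_event_id`, `data_key`, `string_value`, and the `env = 'production'` filter come from this thread. The table name `event_data` and the keys `event_type` and `document_id` are assumptions made for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical key-value rows: (website_event_id, data_key, string_value).
rows = [
    (1, "env", "production"), (1, "event_type", "click"), (1, "document_id", "m1"),
    (2, "env", "production"), (2, "event_type", "impression"), (2, "document_id", "m1"),
    (3, "env", "development"), (3, "event_type", "click"), (3, "document_id", "m1"),
    (4, "env", "production"), (4, "event_type", "click"), (4, "document_id", "m2"),
    (5, "env", "production"), (5, "event_type", "click"), (5, "document_id", "m1"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_data (website_event_id INTEGER, data_key TEXT, string_value TEXT)"
)
conn.executemany("INSERT INTO event_data VALUES (?, ?, ?)", rows)

QUERY = """
WITH events AS (
    -- Collapse the key-value rows into one row per tracked event.
    SELECT
        website_event_id,
        MAX(CASE WHEN data_key = 'env' THEN string_value END) AS env,
        MAX(CASE WHEN data_key = 'event_type' THEN string_value END) AS event_type,
        MAX(CASE WHEN data_key = 'document_id' THEN string_value END) AS document_id
    FROM event_data
    GROUP BY website_event_id
)
-- Step 1: keep production events only. Step 2: count per record and event type.
SELECT document_id, event_type, COUNT(*) AS n
FROM events
WHERE env = 'production'
GROUP BY document_id, event_type
ORDER BY document_id, event_type
"""
per_record = conn.execute(QUERY).fetchall()


def histogram(values, bin_width):
    """Step 3: map each bin's lower edge to the number of values falling in that bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


click_counts = [n for _, event_type, n in per_record if event_type == "click"]
click_histogram = histogram(click_counts, bin_width=2)
```

Here `per_record` holds one `(document_id, event_type, n)` tuple per mentor and event type, and `click_histogram` shows how many mentors fall into each range of click counts. A long right tail in that histogram would point to mentors receiving an unusually high number of clicks.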

Comment on lines 10 to 16
df = conn.query("select * from website_event;")

df_processed = df.copy(deep=True)
#
df_processed["url"] = "/?" + df["url_query"].astype(str)
df_processed["query_params"] = df_processed["url"].apply(extract_query_params)
df_processed["parsed_params"] = df_processed["query_params"].apply(process_query_params)

Since 9ef75bd has been merged, please use DuckDB SQL queries instead to retrieve the initial dataframe. Some sample code is available at #48.

@vanesssalai
Author

As mentioned above, I updated the query to use DuckDB SQL. It now groups by mentor_name, as the id field creates 2 different rows for some mentors.

wei2912 requested a review from JaCh23 January 4, 2025 13:40
@wei2912
Member

wei2912 commented Jan 4, 2025

Assigning to @JaCh23 for review.

@JaCh23

JaCh23 commented Jan 7, 2025

> As mentioned above, I updated the query to use DuckDB SQL. It now groups by mentor_name, as the id field creates 2 different rows for some mentors.

@vanesssalai I'd much rather we look into this more deeply. We should join and group by ID for data integrity purposes, otherwise we may run into issues now or later on (e.g. 2 different mentors with the same name accidentally grouped together). Can you revise the query to use id?

Also, could you attach some screenshots of the visual outputs in this thread? Thanks!
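
To illustrate the concern, here is a small sketch with made-up data in which two distinct mentors share a name. The field names `mentor_id` and `mentor_name` and all the values are hypothetical.

```python
from collections import Counter

# Hypothetical click events; two distinct mentors (m1 and m2) happen to share a name.
clicks = [
    {"mentor_id": "m1", "mentor_name": "Alex Tan"},
    {"mentor_id": "m1", "mentor_name": "Alex Tan"},
    {"mentor_id": "m2", "mentor_name": "Alex Tan"},
    {"mentor_id": "m3", "mentor_name": "Jamie Lee"},
]

# Grouping by name merges m1 and m2 into a single bucket.
by_name = Counter(c["mentor_name"] for c in clicks)

# Grouping by ID keeps the two mentors separate.
by_id = Counter(c["mentor_id"] for c in clicks)
```

Grouping by name reports 3 clicks for a single "Alex Tan", while grouping by ID shows 2 clicks for m1 and 1 for m2. The name can still be shown as a display label once the counts have been computed per ID.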

@vanesssalai
Author

Updated the SQL query to group by mentor data. Here are the screenshots:
Screenshot 2025-01-13 201443
Screenshot 2025-01-13 201451

@JaCh23

JaCh23 commented Jan 27, 2025

LGTM

JaCh23 left a comment

LGTM

Successfully merging this pull request may close these issues.

Histogram of clicks and impressions
3 participants