Skip to content

Create new user_domain_memberships field in ES user index #36397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

minhaminha
Copy link
Contributor

@minhaminha minhaminha commented May 12, 2025

Technical Summary

Jira Ticket

The ES related changes/migration to enable filtering for the current domain active status.

Adds the user_domain_memberships field to act as unified field for both domain_membership and domain_memberships. This new field is exactly the same as the 2 others but adds the new is_active property in DomainMembership and is of type nested, to make filtering for specific DomainMemberships possible.

Additionally updates all filtering functions to use this new field instead of either domain_membership or domain_memberships.

The migration and the initial population command will be run on a private release after approval and before merging to avoid conflicts during the deploy process.

Safety Assurance

Safety story

Tested locally (with supporting changes that'll be found in another PR) to deactivate/reactivate and show deactivated/active users. Will test on staging after initial approval since these migrations are difficult to reverse.

Migrations

  • The migrations in this code can be safely applied first independently of the code

Rollback instructions

I'm not sure how the migration rollback will work but reverting this would require remaking the user index.

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@minhaminha minhaminha requested review from MartinRiese and Jtang-1 May 12, 2025 19:05
@dimagimon dimagimon added the reindex/migration Reindex or migration will be required during or before deploy label May 12, 2025
@minhaminha
Copy link
Contributor Author

As it is, this works fine but the filtering function is a bit inefficient since domain_memberships isn’t a nested field, so it’ll search across all domain_memberships a user might have, instead of just the current one. Reindexing to turn it into a nested type I feel like more trouble than its worth vs my current solution, which is to check in the returned list of users whether any have multiple domain_memberships and do another check specifically for the current domain’s domain_membership if so, but I was wondering if anyone else had more thoughts on this.

@minhaminha minhaminha requested review from esoergel and removed request for esoergel May 12, 2025 19:18
Copy link
Contributor

@Jtang-1 Jtang-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes make sense. I'm not sure about the performance implications that you mention in your comment, but I also think it's fine as is.

Comment on lines 60 to 61
is_active,
is_active_in_domain,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also try to combine these two filters into one, so it works regardless of user type. The location and role_id filters do similarly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm getting a bit confused with how these filters typically stack. If active and is_active_in_domain were combined similarly to how location is handled, on its own, that'll return both mobile workers and web users that are active/deactivated, right? So is it expected that the full query should always look like UserES().web_users().active(True) / UserES().mobile_users.active(True) to limit it to the doc_types we're trying to query for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think so - otherwise .active() implicitly filters to mobile workers, and is_active_in_domain likewise filters to web users.

In my opinion, UserES().active(True) should return all active users, of both types. If you want to limit to a particular user type, then you can always add a separate filter for that.

@minhaminha minhaminha added the Open for review: do not merge A work in progress label May 19, 2025
…h will eventually replace domain_membership and domain_memberships as one combined field

Also adds function to populate new combined field
…mbership/domain_memberships changes to user_domain_memberships
@minhaminha
Copy link
Contributor Author

@AmitPhulera soliciting your feedback here since I’m unsure about this approach. I assumed we don't actually want a unified domain_membership couch doc (mostly since I don't know how That migration would work, but assumed it would be a huge hassle) and opted to just copy both domain_membership and domain_memberships data into the third user_domain_memberships (feel free to suggest a better name!). Since this is a transitionary PR (and because I’d like to get this out before going offline next week) I thought I'd focus solely on getting this third field populated and usable but I feel like I'm missing something here, even though this seems to get the results I want. Is there anything else you'd suggest I put into this initial PR?

@minhaminha minhaminha changed the title Add is_active field to DomainMemberships (web users) Create new user_domain_memberships field in ES user index May 20, 2025
Comment on lines 239 to 240
if not domain:
return filters.term("is_active", active)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall, how does WebUser.is_active fit in here?

)


def is_active(active=True):
return filters.term("is_active", active)
def is_active(active=True, domain=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to use the is_active filter without passing in a domain? I worry that making optional makes it too easy to forget to support web users

domain_membership_filter
)
)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#36397 (comment)

Reviving this discussion - I was imagining this filter being something that works for all users, like:

  • UserES().is_active() returns all active users
  • UserES().is_active().mobile_users() returns all active mobile workers
  • UserES().is_active().web_users() returns all active mobile workers

Instead, it behaves like:

  • UserES().is_active() returns all active mobile workers
  • UserES().is_active(domain=domain) returns all active web users
  • UserES().OR(is_active(), is_active(domain=domain)) returns both

This behavior seems unintuitive to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I've been trying to think of a better solution to this for a while too. I'm going to address your other comments about the need for the domain arg and the how WebUser.is_active fits into all of this here too.

So because of how web user domain active status works, we must have a domain arg to filter for domain-scoped deactivated/active web users (the whole reason we wanted to create this new nested field). This unfortunately does mean that not passing in a domain implies we only really care about getting mobile workers, and to a lesser extent web users (for their SSO active status), but technically that is included here. I just pushed a slight refactor so now the behavior looks like:

  • UserES().is_active() returns all user.is_active=True users. This will usually be filtered for mobile users (all usage of this filter previously already did)
  • UserES().is_active(domain=domain) returns all user.is_active=True and user.domain_membership/s.is_active=True users on the domain. This will usually be filtered for web users (which I'll do in the other PR).

For active=True filters, it must return users that have their user.is_active status and user.domain_membership/s.is_active equal to True because if either is False, it's deactivated in one way or another (by an IDP or by the domain). This is a bit confusing to put down in words honestly, I'd be up to chat briefly if this still looks concerning/confusing in any way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unfortunately does mean that not passing in a domain implies we only really care about getting mobile workers

This is why I think we should not allow callers to exclude domain. If it's always required, then there's no ambiguity. It doesn't look like there are too many references to update. You can define the function like this to make domain a required keyword-only argument that comes after an optional positional argument:

def is_active(active=True, *, domain):

Or of course you could make domain the first arg:

def is_active(domain, active=True):

Otherwise I just think it's too easy for someone without the context to accidentally make a change that breaks a feature for projects like ours that use web users for data entry. It is suprising that adding .is_active() to a query excludes all web users - I wouldn't expect people to anticipate that or notice it in code review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

domain is now a required parameter in is_active but the commit I pushed has a bunch of comments (for Martin) about places where I'm unsure what is being returned is appropriate for the context. I think there's going to have to be an additional exclude-or-include domain deactivated users parameter to either include domain deactivated when active is True (users that are domain deactivated but still active as users) or exclude them when active is False (same type of users), since we're still following the philosophy that a domain deactivated user is still a valid user in the domain. I can't tell without knowing exactly what the returned users will be used for unfortunately so i left a few guesses. @MartinRiese

@minhaminha minhaminha mentioned this pull request May 22, 2025
3 tasks
Copy link
Contributor

@Jtang-1 Jtang-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good to me, but going to hold off approval until Ethan has a chance to respond about the filtering.

return memberships


def populate_user_domain_memberships():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just going to be ran once in a private release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

twice - before deploy, to get the bulk of the data populated and after deploy, to fill in gaps (user changes/new users) that might've occurred between the first run and the code is deployed.

Copy link
Contributor

@esoergel esoergel May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll also need to be included in a Django migration to account for third party environments where we can't run this code manually. That migration needs to happen in a subsequent deploy if we're to worry about gaps. However we'd probably want to add @once_off_migration or similar to this code now so that the later migration noops on prod.

An alternative approach is to move the query changes to a subsequent deploy and follow the approach described here: https://commcare-hq.readthedocs.io/migrations_in_practice.html#multiple-deploys
(I think I already linked to that doc, but I can't find where I did, so apologies if I'm repeating myself)

filters.AND(
filters.term('user_domain_memberships.domain.exact', domain),
domain_membership_filter
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if this is true but instead of doing and_or would this be equivalent?

filter = filters.AND(
        user_active_filter,
        queries.nested(
            'user_domain_memberships',
            filters.AND(
                filters.term('user_domain_memberships.domain.exact', domain),
                filters.NOT(filters.term("user_domain_memberships.is_active", False))
            )
 if active:
   return filter
else:
   return filters.NOT(filter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not a 100% sure in my reading of this modified logic but doing a filters.NOT on an AND I think means it'll only filter for things where both user_active_filter and the nested filter don't match, which is not the intention. If I'm filtering for "inactive" users I want to return users that either have their is_active as False or their domain_membership.is_active as False.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, that's faintly how i remember it working too

Copy link
Contributor

@esoergel esoergel May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Jonathan's read is correct - it should be a complete disjunction between the two clauses. I also find the and_or logic to be hard to think about. You could also do it with a recursive call if you like the way that reads:

def is_active(active=True, *, domain):
    if not active:
        return filters.NOT(is_active(domain=domain))
    return filters.AND(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha this made it click in my branch - changed to follow Jonathan's original suggestion.

@dimagimon dimagimon added the Risk: High Change affects files that have been flagged as high risk. label May 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Open for review: do not merge A work in progress reindex/migration Reindex or migration will be required during or before deploy Risk: High Change affects files that have been flagged as high risk.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants