Skip to content

Enterprise Form Submissions Iterators #35295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Jan 14, 2025
Merged

Conversation

mjriley
Copy link
Contributor

@mjriley mjriley commented Oct 29, 2024

Product Description

Technical Summary

This PR is now ready for review. This PR creates iterators to handle enterprise report requests in a scalable manner. Previously, both our enterprise reports and the enterprise report APIs needed to generate the entire enterprise report in order to deliver any results. With this PR, the form submissions report API has been modified to instead source that information from an iterator that will only fetch data up until a page boundary. If more data than a page boundary is needed, the request will be paginated.

Feature Flag

No feature flag.

Safety Assurance

Safety story

I've done local testing, created multiple test suites, and verified this PR on staging.

Automated test coverage

New test suites created in test_iterators.py and test_apis.py

QA Plan

As there is no user-facing component to this other than the API results, I don't think this needs to be run through QA -- the same process I'm using would be duplicated by QA.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@mjriley mjriley requested a review from jingcheng16 October 29, 2024 16:17
@dimagimon dimagimon added the Risk: High Change affects files that have been flagged as high risk. label Oct 29, 2024

# if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
padded_limit = limit + 1 if limit else None
self.original_it = iter(sequence_factory_fn(padded_limit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass padded_limit to sequence_factory_fn? Based on the test cases, sequence_factory_fn ignores the parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we find it is being used in create_multi_domain_form_generator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're asking why we need a limit, the issue is that Tastypie needs some way to communicate the 'limit' it receives with the underlying limit used by Elasticsearch. If we can't communicate this to elasticsearch, it leads to situations like the API asking for 100 records and Elasticsearch fetching 5000, leading to 4900 unused records. Or the reverse, where the API asks for 5000 records,and Elasticsearch fetches them 100 at a time, leading to 50 calls to Elasticsearch rather than 1.

Ideally, we'd just pass the limit all the way down, but we don't have control over how the Paginator is instantiated or how it is called -- we can tell Tastypie to use a certain Paginator class, but it controls the instatiation and the call. What we can control is the 'objects' object that gets passed to the paginator. That object can't know about the limit yet, because the paginator is responsible for setting that limit. So the object needs a way to receive the limit prior to retrieving the results.

The important part here is that the iterator needs to be able to receive a limit parameter, even if it does nothing with it. In the case of the tests, they ignore that limit, but if the underlying sequence was an elasticsearch query, it would need access to that limit

Copy link
Contributor

@jingcheng16 jingcheng16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't go through all the commits due to the time


# if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
padded_limit = limit + 1 if limit else None
self.original_it = iter(sequence_factory_fn(padded_limit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we find it is being used in create_multi_domain_form_generator

objects = SequenceWrapper(range(5), lambda ele: {'next': ele})
paginator = KeysetPaginator(request_data, objects, resource_uri='http://test.com/')
page = paginator.page()
self.assertEqual(page['meta']['next'], 'http://test.com/?limit=3&next=2')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused why the next uri have limit=3 if we will delete limit key in request_data

class KeysetPaginator(Paginator):
'''
An alternate paginator meant to support paginating by keyset rather than by index/offset.
`objects` is expected to represent a query object that exposes an `.execute(limit)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not having a base QueryObject class, and having a function execute that will raise NotImplemented error. Then have a docstring to explain what do you expect from the execute function or even give an example. Anyone who wants to use this KeysetPaginator should pass a object inherits from the base QueryObject


num_fetched = len(results.hits)

if num_fetched >= results.total or (remaining and num_fetched >= remaining):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is num_fetched different from results.total?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this is how elasticsearch works" -- Matt cc @gherceg

limit=limit
)

xform_converter = RawFormConverter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize this in __init__ and make it something that can be passed into this query and can be swapped out for something else depending on use case

return start_date, end_date


class RawFormConverter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to something like EnterpriseFormReportConverter or something better? more specific name re: usecase


def _create_enterprise_account_covering_domains(self, domains):
billing_account = generator.billing_account('[email protected]', '[email protected]')
billing_account.enterprise_admin_emails = ['[email protected]']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a customer billing account because it is mapped to multiple domains

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in bff5fac

def _create_enterprise_admin(self, email, domain):
user = WebUser.create(
domain, email, 'test123', None, None, email)
user.is_superuser = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be avoided in tests. please add to enterprise_admin_emails in related BillingAccount

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved as part of bff5fac

Comment on lines 83 to 89
role = Role.objects.create(slug="test_role")
UserRole.objects.create(user=user.get_django_user(), role=role)
accounting_admin_role = Role.objects.get_or_create(
name="Accounting Admin",
slug=privileges.ACCOUNTING_ADMIN,
)[0]
Grant.objects.create(from_role=role, to_role=accounting_admin_role)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't needed when the user is actually an enterprise admin and not a superuser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as part of bff5fac

encoded_auth = base64.b64encode(auth_string.encode()).decode()
request = factory.get(
'/',
{'startdate': '2004-10-10', 'enddate': '2004-11-10'},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will push to a different level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included as part of bff5fac


@es_test(requires=[form_adapter])
class FormSubmissionResourceTests(TestCase):
def test_happy_path(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename to test_resource_is_accessible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly additional test to ensure permissions restrict users that need to be restricted?

return self.domain_lookup_tables[domain].get(app_id, None)


def loop_over_domains(domains, query_factory, limit=None, last_domain=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this run_query_over_domains? and maybe switch the order of query_factory and domains

current_iterator = _get_domain_iterator(**next_args)


def loop_over_domain(domain, query_factory, limit=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, loop_query_over_domain and switch order of args

@mjriley mjriley changed the title Enterprise Iterators Draft -- Early Feedback Enterprise Form Submissions Iterators Nov 12, 2024
@mjriley mjriley marked this pull request as ready for review November 12, 2024 14:30
@mjriley mjriley mentioned this pull request Nov 12, 2024
4 tasks
the previous progress arguments
- start_date: a date to start the date range. Can be None
- end_date: the inclusive date to finish the date range. Can be None
last_domain, last_time, and last_id are intended to represent the last result from a previous query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does sphinx support back-ticks (`) for denoting variables?

}

@classmethod
def get_query_paraams(cls, fetched_object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_query_params instead of paraaaaaaaams

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests were passing with this incorrect name. good to verify if tests are covering this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified that the reason this was not flagged on tests was because get_query_params is only called for queries that extend beyond page boundaries. The test in test_apis is intended to just be a happy path test, and so doesn't try to verify more than the base case -- here, it constructs just 2 forms. A test does exist to ensure that the paginator handles page boundaries correctly, but its generic (in keyset_paginator_tests). I feel like this is mostly a typo error on my part, and we probably don't need multiple integration tests to verify each functionality of the APIs. If I were to add a test for form submission page boundaries, that would mean I'd want to do the same for every new API added.

return start_date, end_date


class EnterpriseFormReportConverter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly use ABC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e55664e

@classmethod
def get_kwargs_from_map(cls, map):
'''
Takes a map-like object from a continuation request (generally GET/POST) and extracts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be good to include a note here about where it's used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the suggestion, but I feel that pointing to exactly where a class is used is brittle, as that could change. I think its appropriate to describe what a class does and then let the user choose to use it however they want in-line with that. Hopefully the EnterpriseDataConverter's docstring for get_kwargs_from_map addresses this

next_query_args = query_factory.get_next_query_args(next_query_args, last_hit)


class ReportQueryFactoryInterface:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe good to use ABC here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e55664e


if limit and has_more:
last_fetched = objects[-1]
next_page_params = self.objects.get_query_params(last_fetched)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if there is a missing test for this? didn't fail for misnamed method name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned in another comment, the test is generic

}


class PageableQueryInterface:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was unused. Removed

request_data,
objects,
resource_uri=resource_uri,
limit=limit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially rename limit to indicate page size as it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in eb67336

from corehq.apps.api.keyset_paginator import KeysetPaginator


class SequenceQuery:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what model is this supposed to emulate in the real code? can you add a comment there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The KeysetPaginator contains a docstring indicating:

objects is expected to represent a query object that exposes an .execute(limit)
method that returns an iterable, and a get_query_params(object) method to retrieve the parameters
for the next query

SequenceQuery is a a simple version of that interface where it wraps a basic sequence. The closest class it represents is currently IterableEnterpriseFormQuery within the iterators module, but I don't want to make that link explicit. The tests were written against the docstring. If you provide an objects parameter that contains an execute method that returns an iterable, KeysetPaginator should work correctly. I'm having a hard time coming up with a more succinct term for that interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking on this further, would it help to make IterableEnterpriseFormQuery a child of some abstract base class, so KeysetPaginator's objects needs to inherit from that specific base class rather than matching through duck-typing?

My impression is that Python prefers to work with duck-typing rather than explicit class inheritance, but I can switch if that would make it easier to understand this code

@mjriley mjriley merged commit 6070069 into master Jan 14, 2025
13 checks passed
@mjriley mjriley deleted the mjr/enterprise_iterators_draft branch January 14, 2025 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Risk: High Change affects files that have been flagged as high risk.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants