Enterprise Form Submissions Iterators #35295

mjriley · 2024-10-29T16:17:51Z

Product Description

Technical Summary

This PR is now ready for review. It creates iterators to handle enterprise report requests in a scalable manner. Previously, both our enterprise reports and the enterprise report APIs needed to generate the entire enterprise report in order to deliver any results. With this PR, the form submissions report API instead sources that information from an iterator that only fetches data up to a page boundary. If more data is needed than fits in one page, the response is paginated.

Feature Flag

No feature flag.

Safety Assurance

Safety story

I've done local testing, created multiple test suites, and verified this PR on staging.

Automated test coverage

New test suites created in test_iterators.py and test_apis.py

QA Plan

As there is no user-facing component to this other than the API results, I don't think this needs to be run through QA -- the same process I'm using would be duplicated by QA.

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change

jingcheng16 · 2024-10-29T17:26:40Z

corehq/apps/enterprise/resumable_iterator_wrapper.py

+
+        # if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
+        padded_limit = limit + 1 if limit else None
+        self.original_it = iter(sequence_factory_fn(padded_limit))


Why pass padded_limit to sequence_factory_fn? Based on the test cases, sequence_factory_fn ignores the parameter.

Oh, we found that it is used in create_multi_domain_form_generator

If you're asking why we need a limit, the issue is that Tastypie needs some way to pass the 'limit' it receives down to the underlying limit used by Elasticsearch. If we can't communicate this to Elasticsearch, it leads to situations like the API asking for 100 records and Elasticsearch fetching 5000, leaving 4900 unused records. Or the reverse, where the API asks for 5000 records and Elasticsearch fetches them 100 at a time, leading to 50 calls to Elasticsearch rather than 1.

Ideally, we'd just pass the limit all the way down, but we don't have control over how the Paginator is instantiated or how it is called -- we can tell Tastypie to use a certain Paginator class, but it controls the instantiation and the call. What we can control is the 'objects' object that gets passed to the paginator. That object can't know about the limit yet, because the paginator is responsible for setting that limit. So the object needs a way to receive the limit prior to retrieving the results.

The important part here is that the iterator needs to be able to receive a limit parameter, even if it does nothing with it. In the case of the tests, they ignore that limit, but if the underlying sequence were an Elasticsearch query, it would need access to that limit.
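
The padded-limit idea from the hunk above can be sketched roughly as follows. This is a minimal illustration and not the PR's ResumableIteratorWrapper: `fetch_page`, `ignore_limit`, and `honor_limit` are hypothetical names, and the only thing assumed about `sequence_factory_fn` is that it accepts a limit it is free to ignore.

```python
from itertools import islice


def fetch_page(sequence_factory_fn, limit=None):
    """Return (items, has_more) for one page.

    Hypothetical helper: asks the factory for one item more than the page
    size, so the extra item reveals whether another page exists.
    """
    # Same padding rule as the diff: only pad when a limit was given.
    padded_limit = limit + 1 if limit else None
    iterator = iter(sequence_factory_fn(padded_limit))

    if not limit:
        # No limit: everything is returned, so nothing can remain.
        return list(iterator), False

    # The factory may ignore the limit, so cap what we read ourselves.
    fetched = list(islice(iterator, padded_limit))
    has_more = len(fetched) > limit
    return fetched[:limit], has_more


def ignore_limit(limit):
    # A factory that ignores the limit, as the test cases do.
    return range(5)


def honor_limit(limit):
    # A factory that honors the limit, as an Elasticsearch-backed one would.
    return range(5) if limit is None else range(min(5, limit))
```

Both factories give the same page here; the difference is only in how much work the underlying source does before the wrapper trims the result.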

jingcheng16

Didn't go through all the commits due to time constraints.

jingcheng16 · 2024-10-29T17:33:03Z

corehq/apps/enterprise/resumable_iterator_wrapper.py

+
+        # if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
+        padded_limit = limit + 1 if limit else None
+        self.original_it = iter(sequence_factory_fn(padded_limit))


Oh, we found that it is used in create_multi_domain_form_generator

jingcheng16 · 2024-10-29T18:18:25Z

corehq/apps/enterprise/tests/api/keyset_paginator_tests.py

+        objects = SequenceWrapper(range(5), lambda ele: {'next': ele})
+        paginator = KeysetPaginator(request_data, objects, resource_uri='http://test.com/')
+        page = paginator.page()
+        self.assertEqual(page['meta']['next'], 'http://test.com/?limit=3&next=2')


Confused why the next URI has limit=3 if we delete the limit key in request_data.
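
The thread does not record an answer to this question. One plausible reading, offered as an assumption rather than a description of the PR's code, is that the page size is written back into the query string when the next URI is built, independently of what happens to the incoming request data. `build_next_uri` below is a hypothetical helper that reproduces the URI asserted in the test.

```python
from urllib.parse import urlencode


def build_next_uri(resource_uri, limit, next_params):
    """Build the URI for the following page.

    Hypothetical helper: the page size is re-added to the query string
    alongside the keyset parameters taken from the last fetched object.
    """
    query = {'limit': limit, **next_params}
    return f'{resource_uri}?{urlencode(query)}'


# Mirrors the test: limit=3 over range(5), so the last item on the
# first page is 2 and the converter yields {'next': 2}.
next_uri = build_next_uri('http://test.com/', 3, {'next': 2})
```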

jingcheng16 · 2024-10-29T18:27:40Z

corehq/apps/enterprise/api/keyset_paginator.py

+class KeysetPaginator(Paginator):
+    '''
+    An alternate paginator meant to support paginating by keyset rather than by index/offset.
+    `objects` is expected to represent a query object that exposes an `.execute(limit)`


Why not have a base QueryObject class with an execute method that raises NotImplementedError? Its docstring could explain what you expect from execute, or even give an example. Anyone who wants to use this KeysetPaginator would then pass an object that inherits from the base QueryObject.
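
A rough sketch of what this suggestion could look like. `QueryObject` is the name proposed in the comment; `ListQuery` and the exact method signatures are assumptions made for illustration, based only on the `.execute(limit)` and `get_query_params(object)` methods mentioned elsewhere in this thread.

```python
class QueryObject:
    """Hypothetical base class for objects passed to KeysetPaginator."""

    def execute(self, limit=None):
        """Return an iterable of results, fetching at most `limit` items
        when a limit is given."""
        raise NotImplementedError

    def get_query_params(self, fetched_object):
        """Return a dict of parameters that resume the query after
        `fetched_object`."""
        raise NotImplementedError


class ListQuery(QueryObject):
    """Example subclass (also hypothetical) backed by an in-memory list."""

    def __init__(self, items):
        self.items = items

    def execute(self, limit=None):
        return self.items if limit is None else self.items[:limit]

    def get_query_params(self, fetched_object):
        return {'next': fetched_object}
```

With this shape, a subclass that forgets to implement one of the methods fails with NotImplementedError at call time rather than at construction.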

jingcheng16 · 2024-10-29T18:38:45Z

corehq/apps/enterprise/iterators.py

+
+        num_fetched = len(results.hits)
+
+        if num_fetched >= results.total or (remaining and num_fetched >= remaining):


When is num_fetched different from results.total?

"this is how elasticsearch works" -- Matt cc @gherceg

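For context on the answer above: in an Elasticsearch response, the total counts every document that matches the query, while the hits list holds only the documents returned for that request, capped by the requested size. The sketch below imitates that with an in-memory stand-in. `FakeResults`, `fake_search`, and `is_last_fetch` are hypothetical names; only the condition inside `is_last_fetch` comes from the hunk.

```python
from dataclasses import dataclass


@dataclass
class FakeResults:
    """Stand-in for a search response (not the real results wrapper)."""
    hits: list   # documents returned for this request, capped by size
    total: int   # number of documents matching the query overall


def fake_search(docs, size):
    # Like a search with a size: one page of hits plus the full match count.
    return FakeResults(hits=docs[:size], total=len(docs))


def is_last_fetch(results, remaining=None):
    # Same condition as the hunk above.
    num_fetched = len(results.hits)
    return bool(
        num_fetched >= results.total
        or (remaining and num_fetched >= remaining)
    )


docs = list(range(250))
partial = fake_search(docs, size=100)   # 100 hits out of 250 matches
complete = fake_search(docs, size=500)  # all 250 matches returned
```

Here `partial` is not the last fetch unless the caller only wanted 100 more records, which is what the `remaining` half of the condition covers.
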
biyeun · 2024-10-30T16:47:59Z

corehq/apps/enterprise/iterators.py

+            limit=limit
+        )
+
+        xform_converter = RawFormConverter()


initialize this in __init__ and make it something that can be passed into this query, so it can be swapped out for something else depending on the use case
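
A minimal sketch of this suggestion, under assumptions: the `convert` method, the `IterableFormQuery` class, and the shape of a hit are all hypothetical, since the real converter's interface is not shown in this hunk.

```python
class RawFormConverter:
    """Stand-in converter (the real class's methods are not shown here)."""

    def convert(self, hit):
        return {'form_id': hit['_id']}


class IterableFormQuery:
    """Hypothetical query object that receives its converter on construction."""

    def __init__(self, hits, converter=None):
        self.hits = hits
        # Default to the existing converter, but let callers swap in another.
        self.converter = converter or RawFormConverter()

    def execute(self, limit=None):
        hits = self.hits if limit is None else self.hits[:limit]
        return (self.converter.convert(hit) for hit in hits)
```

Passing the converter in this way keeps the query logic unchanged when a different report needs a different output shape.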

biyeun · 2024-10-30T16:48:30Z

corehq/apps/enterprise/iterators.py

+    return start_date, end_date
+
+
+class RawFormConverter:


rename to something like EnterpriseFormReportConverter or something better? more specific name re: usecase

biyeun · 2024-11-06T16:22:01Z

corehq/apps/enterprise/tests/test_apis.py

+
+    def _create_enterprise_account_covering_domains(self, domains):
+        billing_account = generator.billing_account('[email protected]', '[email protected]')
+        billing_account.enterprise_admin_emails = ['[email protected]']


this should be a customer billing account because it is mapped to multiple domains

Resolved in bff5fac

biyeun · 2024-11-06T16:23:57Z

corehq/apps/enterprise/tests/test_apis.py

+    def _create_enterprise_admin(self, email, domain):
+        user = WebUser.create(
+            domain, email, 'test123', None, None, email)
+        user.is_superuser = True


this should be avoided in tests. please add to enterprise_admin_emails in related BillingAccount

Resolved as part of bff5fac

biyeun · 2024-11-06T16:29:42Z

corehq/apps/enterprise/tests/test_apis.py

+        role = Role.objects.create(slug="test_role")
+        UserRole.objects.create(user=user.get_django_user(), role=role)
+        accounting_admin_role = Role.objects.get_or_create(
+            name="Accounting Admin",
+            slug=privileges.ACCOUNTING_ADMIN,
+        )[0]
+        Grant.objects.create(from_role=role, to_role=accounting_admin_role)


this isn't needed when the user is actually an enterprise admin and not a superuser

Changed as part of bff5fac

biyeun · 2024-11-06T16:31:06Z

corehq/apps/enterprise/tests/test_apis.py

+        encoded_auth = base64.b64encode(auth_string.encode()).decode()
+        request = factory.get(
+            '/',
+            {'startdate': '2004-10-10', 'enddate': '2004-11-10'},


will push to a different level

Included as part of bff5fac

biyeun · 2024-11-06T16:32:46Z

corehq/apps/enterprise/tests/test_apis.py

+
+@es_test(requires=[form_adapter])
+class FormSubmissionResourceTests(TestCase):
+    def test_happy_path(self):


maybe rename to test_resource_is_accessible?

possibly additional test to ensure permissions restrict users that need to be restricted?

biyeun · 2024-11-06T16:41:25Z

corehq/apps/enterprise/iterators.py

+        return self.domain_lookup_tables[domain].get(app_id, None)
+
+
+def loop_over_domains(domains, query_factory, limit=None, last_domain=None, **kwargs):


maybe call this run_query_over_domains? and maybe switch the order of query_factory and domains

biyeun · 2024-11-06T16:42:00Z

corehq/apps/enterprise/iterators.py

+        current_iterator = _get_domain_iterator(**next_args)
+
+
+def loop_over_domain(domain, query_factory, limit=None, **kwargs):


similarly, loop_query_over_domain and switch order of args

biyeun · 2024-11-13T16:30:45Z

corehq/apps/enterprise/iterators.py

+      the previous progress arguments
+    - start_date: a date to start the date range. Can be None
+    - end_date: the inclusive date to finish the date range. Can be None
+    last_domain, last_time, and last_id are intended to represent the last result from a previous query


nit: does sphinx support back-ticks (`) for denoting variables?

biyeun · 2024-11-13T16:35:22Z

corehq/apps/enterprise/iterators.py

+        }
+
+    @classmethod
+    def get_query_paraams(cls, fetched_object):


get_query_params instead of paraaaaaaaams

tests were passing with this incorrect name. good to verify if tests are covering this?

Verified that the tests did not flag this because get_query_params is only called for queries that extend beyond page boundaries. The test in test_apis is intended to be just a happy path test, so it doesn't try to verify more than the base case -- here, it constructs just 2 forms. A test does exist to ensure that the paginator handles page boundaries correctly, but it's generic (in keyset_paginator_tests). I feel like this is mostly a typo on my part, and we probably don't need multiple integration tests to verify each piece of API functionality. If I were to add a test for form submission page boundaries, I'd want to do the same for every new API added.
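
The effect described above can be reproduced with a small stand-in. Everything here is hypothetical (`MisnamedQuery` and the simplified `page` function are not the PR's classes); the sketch only shows why a result set smaller than the page size never reaches the misnamed method.

```python
class MisnamedQuery:
    """Query object carrying the typo from the diff (hypothetical stand-in)."""

    def __init__(self, items):
        self.items = items

    def execute(self, limit=None):
        return self.items if limit is None else self.items[:limit]

    def get_query_paraams(self, fetched_object):  # deliberate typo
        return {'next': fetched_object}


def page(query, limit):
    """Simplified paginator: only asks for next-page params when needed."""
    fetched = list(query.execute(limit + 1))
    has_more = len(fetched) > limit
    items = fetched[:limit]
    meta = {'limit': limit, 'next': None}
    if has_more:
        # Only reached when results spill past the page boundary.
        meta['next'] = query.get_query_params(items[-1])
    return {'meta': meta, 'objects': items}


# Two results, page size three: the typo is never exercised.
small = page(MisnamedQuery([1, 2]), limit=3)

# Five results, page size three: the typo surfaces as an AttributeError.
try:
    page(MisnamedQuery([1, 2, 3, 4, 5]), limit=3)
    boundary_error = None
except AttributeError as exc:
    boundary_error = exc
```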

biyeun · 2024-11-13T16:38:02Z

corehq/apps/enterprise/iterators.py

+    return start_date, end_date
+
+
+class EnterpriseFormReportConverter:


possibly use ABC?

Addressed in e55664e

biyeun · 2024-11-13T16:39:44Z

corehq/apps/enterprise/iterators.py

+    @classmethod
+    def get_kwargs_from_map(cls, map):
+        '''
+        Takes a map-like object from a continuation request (generally GET/POST) and extracts


might be good to include a note here about where it's used?

Appreciate the suggestion, but I feel that pointing to exactly where a class is used is brittle, as that could change. I think it's appropriate to describe what a class does and then let the user choose to use it however they want, in line with that. Hopefully the EnterpriseDataConverter's docstring for get_kwargs_from_map addresses this.

biyeun · 2024-11-13T16:53:07Z

corehq/apps/enterprise/iterators.py

+            next_query_args = query_factory.get_next_query_args(next_query_args, last_hit)
+
+
+class ReportQueryFactoryInterface:


maybe good to use ABC here?

Addressed in e55664e

biyeun · 2024-11-13T16:58:18Z

corehq/apps/api/keyset_paginator.py

+
+        if limit and has_more:
+            last_fetched = objects[-1]
+            next_page_params = self.objects.get_query_params(last_fetched)


check if there is a missing test for this? didn't fail for misnamed method name

Mentioned in another comment, the test is generic

biyeun · 2024-11-13T16:58:53Z

corehq/apps/api/keyset_paginator.py

+        }
+
+
+class PageableQueryInterface:


is this still being used?

It was unused. Removed

biyeun · 2024-11-13T17:04:14Z

corehq/apps/api/keyset_paginator.py

+            request_data,
+            objects,
+            resource_uri=resource_uri,
+            limit=limit,


potentially rename limit to indicate page size as it is used?

Addressed in eb67336

biyeun · 2024-11-26T18:27:38Z

corehq/apps/api/tests/keyset_paginator_tests.py

+from corehq.apps.api.keyset_paginator import KeysetPaginator
+
+
+class SequenceQuery:


what model is this supposed to emulate in the real code? can you add a comment there?

The KeysetPaginator contains a docstring indicating:

objects is expected to represent a query object that exposes an .execute(limit)
method that returns an iterable, and a get_query_params(object) method to retrieve the parameters
for the next query

SequenceQuery is a simple version of that interface that wraps a basic sequence. The closest class it represents is currently IterableEnterpriseFormQuery within the iterators module, but I don't want to make that link explicit. The tests were written against the docstring. If you provide an objects parameter that has an execute method returning an iterable, KeysetPaginator should work correctly. I'm having a hard time coming up with a more succinct term for that interface.

Thinking on this further, would it help to make IterableEnterpriseFormQuery a child of some abstract base class, so KeysetPaginator's objects needs to inherit from that specific base class rather than matching through duck-typing?

My impression is that Python prefers to work with duck-typing rather than explicit class inheritance, but I can switch if that would make it easier to understand this code
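
One possible middle ground between the two options, sketched here as an assumption rather than anything the PR adopted, is `typing.Protocol`: the interface gets a name and a place for documentation, while implementations still match by duck typing. `KeysetQuery` is a hypothetical name, and the `SequenceQuery` constructor shown here is guessed from the earlier test hunk.

```python
from typing import Any, Iterable, Optional, Protocol, runtime_checkable


@runtime_checkable
class KeysetQuery(Protocol):
    """Hypothetical name for the interface described in the docstring."""

    def execute(self, limit: Optional[int] = None) -> Iterable[Any]:
        ...

    def get_query_params(self, fetched_object: Any) -> dict:
        ...


class SequenceQuery:
    """Wraps a plain sequence; note it does not inherit from KeysetQuery."""

    def __init__(self, sequence, to_params):
        self.sequence = sequence
        self.to_params = to_params

    def execute(self, limit=None):
        return self.sequence if limit is None else self.sequence[:limit]

    def get_query_params(self, fetched_object):
        return self.to_params(fetched_object)


query = SequenceQuery(range(5), lambda ele: {'next': ele})
```

With `runtime_checkable`, an `isinstance` check against the protocol only verifies that the methods exist, not their signatures, so it is a documentation aid more than an enforcement mechanism.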

mjriley added 6 commits October 23, 2024 11:00

Apply sorting to enterprise domain list

b9d3b6d

Add resumable iterator wrapper

463d97b

Add KeysetPaginator

3eccde7

Added enterprise form iterators

5154360

Rewire FormSubmissionResource to use iterators

4112456

Moved generic API classes into the API application

399b013

mjriley requested a review from jingcheng16 October 29, 2024 16:17

dimagimon added the "Risk: High" label (change affects files that have been flagged as high risk) Oct 29, 2024

jingcheng16 reviewed Oct 29, 2024

mjriley added 3 commits October 30, 2024 10:57

Removed ResumableIteratorWrapper

185a143

Switched received filter to inserted

05eaa9a

Rename domain forms generator

2504668

biyeun reviewed Oct 30, 2024

mjriley added 4 commits October 30, 2024 14:52

Make enterprise form api timezone aware

dd334de

Rename mobile_user field to username

080d837

Made enterprise form submission report iteration generic

09c104b

Moved from `received_on` to `inserted_at`

Added happy path test for form resource api

409f725

biyeun reviewed Nov 6, 2024

mjriley added 3 commits November 7, 2024 15:52

isort

2d9d74b

Additional clarifying comments/structures

dedb429

Allow the iterable query to use a generic converter

d489a36

mjriley changed the title ~~Enterprise Iterators Draft -- Early Feedback~~ Enterprise Form Submissions Iterators Nov 12, 2024

mjriley marked this pull request as ready for review November 12, 2024 14:30

mjriley requested review from esoergel, AmitPhulera, dannyroberts and nospame as code owners November 12, 2024 14:30

mjriley mentioned this pull request Nov 12, 2024

Enterprise Form Report Iterators #35253

Closed

4 tasks

Changed "test-domain" to "testing-domain" to try to isolate a testing failure

c5d96fb

biyeun reviewed Nov 13, 2024

mjriley added 6 commits November 20, 2024 10:43

Fixed typo

e65970d

Added variable highlighting for iterator documentation

7d0e897

Removed unused class

18faad3

Add abstract base classes

e55664e

Merge branch 'master' into mjr/enterprise_iterators_draft

4cce055

Added page_size support and turned limit into a real limit

eb67336

biyeun reviewed Nov 21, 2024

biyeun reviewed Nov 26, 2024

biyeun approved these changes Jan 8, 2025

mjriley merged commit 6070069 into master Jan 14, 2025
13 checks passed

mjriley deleted the mjr/enterprise_iterators_draft branch January 14, 2025 18:32

