
Enterprise Form Report Iterators #35253

Closed
wants to merge 5 commits

Conversation

@mjriley (Contributor) commented Oct 23, 2024

Product Description

Technical Summary

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

Migrations

  • The migrations in this code can be safely applied first independently of the code

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change



class ResumableIteratorWrapperTests(SimpleTestCase):
    def test_can_iterate_through_a_wrapped_iterator(self):
Contributor:

Suggested change
def test_can_iterate_through_a_wrapped_iterator(self):
def test_can_iterate_through_an_iterator(self):

}

if limit:
    next_params = self.objects.get_next_query_params()
Contributor:

What will list.get_next_query_params() return?

Comment on lines +28 to +43
domain_index = domains.index(last_domain) if last_domain else 0

def _get_domain_iterator(last_time=None, last_id=None):
    if domain_index >= len(domains):
        return None
    domain = domains[domain_index]
    return domain_form_generator(domain, start_date, end_date, last_time, last_id)

current_iterator = _get_domain_iterator(last_time, last_id)

while current_iterator:
    yield from current_iterator
    domain_index += 1
    if domain_index >= len(domains):
        break
    current_iterator = _get_domain_iterator()
Contributor:

The while loop with domain_index was hard to follow. Here's a suggestion of how itertools.dropwhile could be used with a for loop.

Suggested change
domain_index = domains.index(last_domain) if last_domain else 0

def _get_domain_iterator(last_time=None, last_id=None):
    if domain_index >= len(domains):
        return None
    domain = domains[domain_index]
    return domain_form_generator(domain, start_date, end_date, last_time, last_id)

current_iterator = _get_domain_iterator(last_time, last_id)

while current_iterator:
    yield from current_iterator
    domain_index += 1
    if domain_index >= len(domains):
        break
    current_iterator = _get_domain_iterator()

from itertools import dropwhile  # TODO move to top of module
last_plus_remains = dropwhile(lambda d: d != last_domain, domains)
remaining_domains = dropwhile(lambda d: d == last_domain, last_plus_remains)
for domain in remaining_domains:
    yield from domain_form_generator(domain, start_date, end_date, last_time, last_id)
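For readers unfamiliar with itertools.dropwhile, here is a toy, self-contained sketch (the domain names are made up) of how the two chained calls skip past last_domain. Note it assumes last_domain actually appears in domains; if it did not, the first dropwhile would consume the entire list.

```python
from itertools import dropwhile

domains = ['alpha', 'beta', 'gamma', 'delta']
last_domain = 'beta'

# First call drops everything before last_domain; second call drops
# last_domain itself, leaving only the domains still to be processed.
last_plus_remains = dropwhile(lambda d: d != last_domain, domains)
remaining = list(dropwhile(lambda d: d == last_domain, last_plus_remains))
print(remaining)  # ['gamma', 'delta']
```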


def domain_form_generator(domain, start_date, end_date, last_time=None, last_id=None):
    if not last_time:
        last_time = datetime.now()
Contributor:

I think this should use utcnow (or whatever the equivalent non-deprecated version of that is).

If last_time is passed in it will be a string because it was retrieved from the request. Does it matter to be mixing types like that?
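For reference, the non-deprecated equivalent the comment alludes to is datetime.now(timezone.utc); datetime.utcnow() is deprecated as of Python 3.12. A minimal sketch:

```python
from datetime import datetime, timezone

# Timezone-aware "now" in UTC; utcnow() returned a naive datetime
# and is deprecated since Python 3.12.
last_time = datetime.now(timezone.utc)
print(last_time.tzinfo)  # timezone.utc
```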

Comment on lines +53 to +61
for form in results.hits:
    last_form_fetched = form
    yield last_form_fetched

if len(results.hits) >= results.total:
    break
else:
    last_time = last_form_fetched['received_on']
    last_id = last_form_fetched['_id']
Contributor:

Will line 60 raise NameError or use the previous value of last_form_fetched (which is probably wrong and I think will result in an infinite loop) if results.hits is empty?

Suggested change
for form in results.hits:
    last_form_fetched = form
    yield last_form_fetched

if len(results.hits) >= results.total:
    break
else:
    last_time = last_form_fetched['received_on']
    last_id = last_form_fetched['_id']

yield from results.hits
if not results.hits or len(results.hits) >= results.total:
    break
last_form_fetched = results.hits[-1]
last_time = last_form_fetched['received_on']
last_id = last_form_fetched['_id']

The >= operation implies that it is possible for len(results.hits) to be greater than results.total. Out of curiosity, what would that mean? It sounds like ES would have returned more results than it thought were available, which seems like a contradiction.
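To illustrate the concern, here is a minimal sketch of the suggested loop driven by a stubbed results object (FakeResults, iterate_pages, and the page sequence are invented for the demo): an empty page yields nothing and breaks out cleanly, instead of re-reading a stale last_form_fetched and looping forever.

```python
class FakeResults:
    def __init__(self, hits, total):
        self.hits = hits
        self.total = total

def iterate_pages(pages):
    # Mirrors the suggested change: yield the page, then stop on an
    # empty or final page before ever touching hits[-1].
    for results in pages:
        yield from results.hits
        if not results.hits or len(results.hits) >= results.total:
            break
        last_form_fetched = results.hits[-1]
        # last_time / last_id for the next query would be taken from here
        last_time = last_form_fetched['received_on']
        last_id = last_form_fetched['_id']

pages = [
    FakeResults([{'_id': 'a', 'received_on': 't1'}], 2),
    FakeResults([], 2),  # empty page: no NameError, the loop just ends
]
print(list(iterate_pages(pages)))  # [{'_id': 'a', 'received_on': 't1'}]
```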

last_id = last_form_fetched['_id']


def create_domain_query(domain, start_date, end_date, last_time, last_id):
Contributor:

This name made me think it was building a query that returned domains.

Suggested change
def create_domain_query(domain, start_date, end_date, last_time, last_id):
def create_domain_forms_query(domain, start_date, end_date, last_time, last_id):

corehq/apps/enterprise/iterators.py (resolved thread)
include_form_id=True,
)

return get_enterprise_form_iterator(account, start_date, end_date, last_domain, last_time, last_id)
Contributor:

Could the page size limit be passed in here? Might be able to retrieve that with something like

limit = self.paginator.get_limit()


query.es_query['sort'] = [
    {'received_on': {'order': 'desc'}},
    {'form.meta.instanceID': 'asc'}
Contributor:

I'm not sure how it works in ES, but in SQL this could result in an inefficient query if there is no index that can be used with this sort criteria. Have you looked into the performance of ES with sorting large result sets?

Contributor (Author):

@esoergel mentioned that Elasticsearch essentially indexes all fields. I believe this follows how we retrieve cases.

Contributor:

Yep, that's right; this is the same approach we use there. ES stores everything you send it, and it makes everything in the mapping file available for querying. Its indexing and filter-caching mechanisms are kinda black-boxy as best as I can tell, but this is for sure in line with how querying and sorting are expected to work.

Comment on lines +26 to +29
if self.is_complete:
    return None
if not self.iteration_started:
    return {}
Contributor:

Is there a meaningful difference between these return values on the client? If not, would it work to return an empty dict if the iteration is complete? It would simplify the return type (always returns a dict).

Contributor (Author):

The goal was to be able to use this value to determine whether the iterator was complete or not, but as you recommended, I've removed this class now.

    """
    limit = self.get_limit()

    objects = list(islice(self.objects, limit if limit else None))
Contributor:

What does it mean if limit evaluates to false? I assume that would mean no limit. Should that not be impossible, given that this class has a max_limit parameter?

Suggested change
objects = list(islice(self.objects, limit if limit else None))
assert limit, f"page limit is required (got {limit!r})"
objects = list(islice(self.objects, limit))
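For context on the limit if limit else None expression: itertools.islice treats a stop of None as unbounded, so a falsy limit means "take everything". A quick sketch:

```python
from itertools import islice

# stop=None means "no limit"; an integer stop caps the page size.
everything = list(islice(iter(range(5)), None))
first_two = list(islice(iter(range(5)), 2))
print(everything)  # [0, 1, 2, 3, 4]
print(first_two)   # [0, 1]
```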

Contributor (Author):

This code is trying to mimic what Tastypie does. You can see in its Paginator code that it allows the possibility of no limit. It mentions that here

Comment on lines +79 to +80
next_params = self.objects.get_next_query_params()
if next_params:
Contributor:

Is this assuming that self.objects is an instance of ResumableIteratorWrapper? Could ResumableIteratorWrapper be eliminated or dramatically simplified (remove all iteration logic) since we are doing the iteration above in this method (list(islice(self.objects, ...)))?

Suggested change
next_params = self.objects.get_next_query_params()
if next_params:
if objects:
    next_params = self.objects.get_element_properties_fn(objects[-1])

Put another way, does anything other than this method iterate over self.objects? What happens if the limit is not applied as it is here?

Contributor (Author):

Working on this. It does seem to simplify the logic if the paginator is responsible for resolving next parameters.

Comment on lines +76 to +77
{'received_on': {'order': 'desc'}},
{'form.meta.instanceID': 'asc'}
@esoergel (Contributor) commented Oct 29, 2024:

Here I'd actually recommend inserted_at rather than received_on. received_on is the timestamp of when we receive the form, and inserted_at is a timestamp applied just before it gets sent to ES. The risk with using received_on is that delays in processing mean you might see documents with an earlier received_on date appearing in ES after others with a later date. If you're paginating over the data as it's being amended, this could cause some of the most recent submissions to get skipped. To be sure, that's a bit of an edge case, but there's an easy enough mitigation.

You can also use doc_id instead of form.meta.instanceID - not sure if it's any faster, but less nesting seems nice.
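If that recommendation is adopted, the sort criteria from the diff above would become something like the following (a sketch only; the field names are those suggested in the comment, and the surrounding query object is assumed to be unchanged):

```python
# inserted_at avoids skipping forms whose processing was delayed;
# doc_id is a top-level tiebreaker in place of form.meta.instanceID.
sort_criteria = [
    {'inserted_at': {'order': 'desc'}},
    {'doc_id': 'asc'},
]
print(sort_criteria)
```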

@mjriley mjriley closed this Nov 12, 2024
mjriley commented Nov 12, 2024

Closed this PR with the intention of moving the 'active' PR to #35295

4 participants