Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enterprise Form Submissions Iterators #35295

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

mjriley
Copy link
Contributor

@mjriley mjriley commented Oct 29, 2024

Product Description

Technical Summary

This PR is now ready for review. This PR creates iterators to handle enterprise report requests in a scalable manner. Previously, both our enterprise reports and the enterprise report APIs needed to generate the entire enterprise report in order to deliver any results. With this PR, the form submissions report API has been modified to instead source that information from an iterator that will only fetch data up until a page boundary. If more data than a page boundary is needed, the request will be paginated.

Feature Flag

No feature flag.

Safety Assurance

Safety story

I've done local testing, created multiple test suites, and verified this PR on staging.

Automated test coverage

New test suites created in test_iterators.py and test_apis.py

QA Plan

As there is no user-facing component to this other than the API results, I don't think this needs to be run through QA -- the same process I'm using would be duplicated by QA.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@dimagimon dimagimon added the Risk: High Change affects files that have been flagged as high risk. label Oct 29, 2024

# if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
padded_limit = limit + 1 if limit else None
self.original_it = iter(sequence_factory_fn(padded_limit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass padded_limit to sequence_factory_fn? Based on the test cases, sequence_factory_fn ignores the parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we find it is being used in create_multi_domain_form_generator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're asking why we need a limit, the issue is that Tastypie needs some way to communicate the 'limit' it receives with the underlying limit used by Elasticsearch. If we can't communicate this to elasticsearch, it leads to situations like the API asking for 100 records and Elasticsearch fetching 5000, leading to 4900 unused records. Or the reverse, where the API asks for 5000 records,and Elasticsearch fetches them 100 at a time, leading to 50 calls to Elasticsearch rather than 1.

Ideally, we'd just pass the limit all the way down, but we don't have control over how the Paginator is instantiated or how it is called -- we can tell Tastypie to use a certain Paginator class, but it controls the instatiation and the call. What we can control is the 'objects' object that gets passed to the paginator. That object can't know about the limit yet, because the paginator is responsible for setting that limit. So the object needs a way to receive the limit prior to retrieving the results.

The important part here is that the iterator needs to be able to receive a limit parameter, even if it does nothing with it. In the case of the tests, they ignore that limit, but if the underlying sequence was an elasticsearch query, it would need access to that limit

Copy link
Contributor

@jingcheng16 jingcheng16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't go through all the commits due to the time


# if a limit exists, increase it by 1 to allow us to check whether additional items remain at the end
padded_limit = limit + 1 if limit else None
self.original_it = iter(sequence_factory_fn(padded_limit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we find it is being used in create_multi_domain_form_generator

if self.limit and not self.is_complete:
# the end of the limited sequence was reached, check if items beyond the limit remain
try:
next(self.original_it)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [3]:         initial_it = iter(range(10))
   ...:         it = ResumableIteratorWrapper(lambda _: initial_it,limit =2)

In [4]: next(it)
Out[4]: 0

In [5]: next(it)
Out[5]: 1

In [6]: next(it)
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[6], line 1
----> 1 next(it)

File ~/commcare-hq/corehq/apps/api/resumable_iterator_wrapper.py:27, in ResumableIteratorWrapper.__next__(self)
     24 self.iteration_started = True
     26 try:
---> 27     self.prev_element = next(self.it)
     28 except StopIteration:
     29     if self.limit and not self.is_complete:
     30         # the end of the limited sequence was reached, check if items beyond the limit remain

StopIteration: 

In [7]: next(initial_it)
Out[7]: 3

Calling next on original_it will result in us skipping an element, do you have later commit to overcome it?

objects = SequenceWrapper(range(5), lambda ele: {'next': ele})
paginator = KeysetPaginator(request_data, objects, resource_uri='http://test.com/')
page = paginator.page()
self.assertEqual(page['meta']['next'], 'http://test.com/?limit=3&next=2')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused why the next uri have limit=3 if we will delete limit key in request_data

class KeysetPaginator(Paginator):
'''
An alternate paginator meant to support paginating by keyset rather than by index/offset.
`objects` is expected to represent a query object that exposes an `.execute(limit)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not having a base QueryObject class, and having a function execute that will raise NotImplemented error. Then have a docstring to explain what do you expect from the execute function or even give an example. Anyone who wants to use this KeysetPaginator should pass a object inherits from the base QueryObject


num_fetched = len(results.hits)

if num_fetched >= results.total or (remaining and num_fetched >= remaining):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is num_fetched different from results.total?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this is how elasticsearch works" -- Matt cc @gherceg

limit=limit
)

xform_converter = RawFormConverter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialize this in __init__ and make it something that can be passed into this query and can be swapped out for something else depending on use case

return start_date, end_date


class RawFormConverter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to something like EnterpriseFormReportConverter or something better? more specific name re: usecase


def _create_enterprise_account_covering_domains(self, domains):
billing_account = generator.billing_account('[email protected]', '[email protected]')
billing_account.enterprise_admin_emails = ['[email protected]']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a customer billing account because it is mapped to multiple domains

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in bff5fac

def _create_enterprise_admin(self, email, domain):
user = WebUser.create(
domain, email, 'test123', None, None, email)
user.is_superuser = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be avoided in tests. please add to enterprise_admin_emails in related BillingAccount

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved as part of bff5fac

Comment on lines 83 to 89
role = Role.objects.create(slug="test_role")
UserRole.objects.create(user=user.get_django_user(), role=role)
accounting_admin_role = Role.objects.get_or_create(
name="Accounting Admin",
slug=privileges.ACCOUNTING_ADMIN,
)[0]
Grant.objects.create(from_role=role, to_role=accounting_admin_role)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't needed when the user is actually an enterprise admin and not a superuser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed as part of bff5fac

encoded_auth = base64.b64encode(auth_string.encode()).decode()
request = factory.get(
'/',
{'startdate': '2004-10-10', 'enddate': '2004-11-10'},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will push to a different level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included as part of bff5fac


@es_test(requires=[form_adapter])
class FormSubmissionResourceTests(TestCase):
def test_happy_path(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename to test_resource_is_accessible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly additional test to ensure permissions restrict users that need to be restricted?

return self.domain_lookup_tables[domain].get(app_id, None)


def loop_over_domains(domains, query_factory, limit=None, last_domain=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this run_query_over_domains? and maybe switch the order of query_factory and domains

current_iterator = _get_domain_iterator(**next_args)


def loop_over_domain(domain, query_factory, limit=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, loop_query_over_domain and switch order of args

@mjriley mjriley changed the title Enterprise Iterators Draft -- Early Feedback Enterprise Form Submissions Iterators Nov 12, 2024
@mjriley mjriley marked this pull request as ready for review November 12, 2024 14:30
@mjriley mjriley mentioned this pull request Nov 12, 2024
4 tasks
the previous progress arguments
- start_date: a date to start the date range. Can be None
- end_date: the inclusive date to finish the date range. Can be None
last_domain, last_time, and last_id are intended to represent the last result from a previous query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does sphinx support back-ticks (`) for denoting variables?

}

@classmethod
def get_query_paraams(cls, fetched_object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_query_params instead of paraaaaaaaams

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests were passing with this incorrect name. good to verify if tests are covering this?

return start_date, end_date


class EnterpriseFormReportConverter:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly use ABC?

@classmethod
def get_kwargs_from_map(cls, map):
'''
Takes a map-like object from a continuation request (generally GET/POST) and extracts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be good to include a note here about where it's used?

next_query_args = query_factory.get_next_query_args(next_query_args, last_hit)


class ReportQueryFactoryInterface:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe good to use ABC here?


if limit and has_more:
last_fetched = objects[-1]
next_page_params = self.objects.get_query_params(last_fetched)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if there is a missing test for this? didn't fail for misnamed method name

}


class PageableQueryInterface:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still being used?

request_data,
objects,
resource_uri=resource_uri,
limit=limit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially rename limit to indicate page size as it is used?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Risk: High Change affects files that have been flagged as high risk.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants