This pull request implements comprehensive extraction, propagation, and storage of Access Control List (ACL) information for documents ingested from Google Drive, OneDrive, and SharePoint connectors. It introduces connector-specific logic to fetch detailed user and group permissions from each provider's API and ensures that this ACL data is consistently passed through the document processing pipeline and indexed in OpenSearch. This enables fine-grained access control and auditing for ingested documents.
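As a rough illustration of the extraction step, the sketch below collapses Google Drive permission entries into a flat ACL. The `type` and `emailAddress` fields are part of the Drive API permission resource. The function name `normalize_drive_permissions` and the output keys `allowed_users` and `allowed_groups` are assumptions made for this example and are not taken from the pull request.

```python
from typing import Any


def normalize_drive_permissions(permissions: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Collapse Drive permission entries into allowed users and groups.

    Hypothetical helper: the output schema is an assumption, not the PR's.
    """
    users: set[str] = set()
    groups: set[str] = set()
    for perm in permissions:
        email = perm.get("emailAddress")
        if not email:
            # "domain" and "anyone" permissions carry no emailAddress;
            # this sketch simply skips them.
            continue
        if perm.get("type") == "user":
            users.add(email.lower())
        elif perm.get("type") == "group":
            groups.add(email.lower())
    # Sorted lists keep the result deterministic for later comparison.
    return {"allowed_users": sorted(users), "allowed_groups": sorted(groups)}
```

Lowercasing and sorting are design choices for this sketch: they make two equivalent permission sets compare equal regardless of the order or casing the API returns.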
The most important changes are:
Connector-specific ACL Extraction:
- Added `_extract_google_drive_acl` to `google_drive/connector.py` to fetch and parse user/group permissions from the Google Drive API for each file, and propagate this ACL into `ConnectorDocument` instances.
- Added `_extract_onedrive_acl` to `onedrive/connector.py` to retrieve permissions from the Microsoft Graph API for OneDrive items, and use this ACL in document creation.
- Added `_extract_sharepoint_acl` to `sharepoint/connector.py` to obtain permissions from the Microsoft Graph API for SharePoint files, and use this ACL in document creation.
Pipeline and Metadata Propagation:
- Updated the document processing pipeline (`service.py` and `processors.py`) to accept and propagate the `acl` field from connectors through to chunk indexing, ensuring ACLs are stored with each chunk in OpenSearch.
Efficient ACL Indexing and Updates:
- Updated `_update_connector_metadata` in `service.py` to call a dedicated `update_document_acl` utility, which hashes ACLs to skip unchanged ones and writes updates only when necessary. Other metadata is now updated via a single `update_by_query` call for efficiency.
These changes collectively provide end-to-end support for extracting, storing, and updating document-level ACLs from external storage providers, improving security and compliance in the document indexing pipeline.
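The hash-based skip mentioned for ACL updates could look something like the sketch below. It is not the `update_document_acl` utility from this pull request. The names `acl_hash` and `needs_acl_update`, the choice of SHA-256, and the ACL shape are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Optional


def acl_hash(acl: dict) -> str:
    """Return a stable digest of an ACL (hypothetical helper).

    Lists are sorted and keys are serialized in sorted order, so two ACLs
    with the same members hash identically regardless of ordering.
    """
    canonical = {
        key: sorted(value) if isinstance(value, list) else value
        for key, value in acl.items()
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_acl_update(stored_hash: Optional[str], new_acl: dict) -> bool:
    """True when the stored digest is missing or differs from the new ACL's."""
    return stored_hash != acl_hash(new_acl)
```

A caller would store the digest alongside the indexed document and issue a write only when `needs_acl_update` returns `True`, which avoids rewriting every chunk when permissions have not changed.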