Replies: 6 comments 17 replies
-
Thanks for starting this discussion, @tanwarsh! Federated Analytics is a promising field, but it presents a new set of challenges, especially regarding the wide spectrum of possible analytics queries. I have a couple of questions from my first read-through of this proposal:
-
Regarding supplementing FA with SecureAggregation, you should take into account that SecAgg mainly works for cases where the local query results are summed arithmetically (as in FedAvg). It will likely not be applicable in cases where other types of query aggregation are used, such as concatenation, finding the maximum or minimum, or any non-linear aggregation operation. I would therefore suggest that you add a section in the design where you illustrate the types of federated queries that will be supported and, more importantly, how they will be aggregated.
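To make the limitation concrete, here is a toy sketch (with invented numbers) of the pairwise additive masking that underlies SecAgg-style protocols: masks cancel under summation, but not under non-linear aggregations such as max.

```python
# Toy pairwise-masking sketch (made-up values): client A adds a shared mask m,
# client B subtracts it, so the server only ever sees masked values.
true_values = [3.0, 5.0]   # local query results at two clients
m = 7.0                    # pairwise mask agreed between the two clients
masked = [true_values[0] + m, true_values[1] - m]

# Summation: the masks cancel, so the aggregate is exact.
assert sum(masked) == sum(true_values)

# Maximum: the masks do NOT cancel, so the aggregate is wrong.
print(max(masked), max(true_values))  # 10.0 vs 5.0
```

This is why a design section enumerating query types and their aggregation operators matters: it determines where SecAgg can and cannot be applied.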
-
I would like to raise an additional question/suggestion. In the current proposal, the query is defined as part of the FL plan. This entails that for each new query, the FL plan would have to be re-distributed to, and re-approved by, the participants. For a more streamlined experience, could we consider an approach where the query is passed as an aggregator CLI parameter? This could follow a similar pattern to how we can currently switch the aggregator's mode, e.g. `fx aggregator start --task_group evaluation`. In the case of FA, the query could be removed from the FL plan entirely and passed via the command line, for example `fx aggregator start --task_group analytics --query query.json` (or directly inline). Now, although convenient, this means that the aggregator would be able to run pretty much any query without preliminary approval by the collaborators, so such queries may need to be additionally restricted. @ishaileshpant, what would be your thoughts and suggestions in this regard?
-
@tanwarsh Thanks for the proposal. I'm generally okay with a dedicated Federated Analytics task runner and the creation of a concrete example for it. This is the sequence we've followed for every prior DL framework implementation. The task runner will have no model to set up ahead of time and no tensorkey dependencies; just a statistical function defined by the user. As a note, I see in the discussion above a query example where users deal with tensorkeys directly. We should try to avoid this if we can for the basic example; instead, they can return Metrics in the same way we recommend for other quickstart examples.

The place that is far less clear is the data loader. Unlike PyTorch / TensorFlow / JAX, where the model / dataloader usually come from the framework itself (or from another closely related data science package like numpy), we won't be able to rely on standard functions in the same way.

The sequence diagram looks good to me. In Phase 2, you mention adding unique aggregation functions. I think there's an opportunity here to demonstrate what kind of custom code users can insert into the plan and finally document this. Specifically, what I would recommend here is actually defining a custom statistical aggregation function in the workspace and modifying the plan to point to it.

Also see my in-line responses above for other areas that need consideration.
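A custom statistical aggregation function of the kind suggested here could be sketched as below. The function name, file location, and simplified signature are all hypothetical; the exact interface for plugging a custom aggregation into the plan should be checked against OpenFL's documentation.

```python
import numpy as np

# Hypothetical custom aggregation kept inside the workspace
# (e.g. src/custom_aggregation.py) and referenced from plan.yaml.
# The signature is a simplification of an aggregation-function interface.
def median_aggregation(local_results):
    """Aggregate per-collaborator arrays by elementwise median,
    which is more robust to outlier collaborators than a plain mean."""
    stacked = np.stack(local_results)
    return np.median(stacked, axis=0)

collab_a = np.array([1.0, 10.0, 3.0])
collab_b = np.array([2.0, 20.0, 4.0])
collab_c = np.array([9.0, 30.0, 5.0])
print(median_aggregation([collab_a, collab_b, collab_c]))  # [ 2. 20.  4.]
```

The plan would then point at this function by its import path (the exact plan key for this is an assumption to verify against the OpenFL docs), which is the workspace-customization pattern worth documenting in Phase 2.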
-
Thanks for this draft. Could you please help me understand:
^: By algorithmically, I am referring to changes in the way communication/computation is done within the existing taskrunner code. This does not include novel data loaders.
-
As discussed, we will proceed with the initial implementation for FA with OpenFL. The first iteration will include the following changes:
Areas for further investigation and discussion:
-
Summary
Federated Analytics is a data analysis approach in which a querier answers a query through collaboration with multiple data owners (clients) who retain their local raw data. Instead of exchanging raw data, intermediate query responses are transferred and aggregated by the querier to answer the query. This document provides a high-level design for implementing Federated Analytics using OpenFL.
Motivation
FA is a distributed approach to data analysis that enables multiple parties to collaboratively analyze data without sharing the actual data. This approach is particularly valuable in scenarios where data privacy and security are critical, such as in healthcare, finance, and other sensitive fields. Key points about FA include:
High-Level Design
Scope
Technical Details
TaskRunner API Workspace

```
LocalTensor(col_name='collaborator1_sepal length (cm)', tensor=array([2.2 , 2.42, 2.64, 2.86, 3.08, 3.3 , 3.52, 3.74, 3.96, 4.18, 4.4 ], dtype=float32))
```
Types of Supported Queries
1. Statistical queries
2. Set-based queries
3. Matrix transformation queries
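To illustrate two of these query types, the following toy example (with invented data) shows how per-collaborator partial results could be aggregated without any raw rows leaving a collaborator:

```python
# Statistical query: global mean from per-collaborator (sum, count) pairs.
partials = [(120.0, 40), (90.0, 30)]   # (local sum, local row count)
total, count = map(sum, zip(*partials))
global_mean = total / count            # 210.0 / 70 = 3.0

# Set-based query: distinct values via a union of local distinct sets.
local_sets = [{"setosa", "versicolor"}, {"versicolor", "virginica"}]
distinct = set().union(*local_sets)
print(global_mean, sorted(distinct))
```

Note that both aggregations here are simple reductions; matrix transformation queries would generally need their own aggregation logic, which is part of what Phase 2's custom aggregation functions would cover.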
Documentation
Open Questions
Phase 1
Phase 2
Dynamic queries/parameters are passed via the command line. This will support custom information/settings related to loading data for different collaborators, e.g. different types of data storage.
To support flexible data loading, the design has been updated so that a query.json file can be passed directly to the collaborator at startup. This JSON file is then propagated through the system: from the collaborator to the task runner, and finally to the data loader. The data loader uses the configuration in query.json to determine the appropriate backend (e.g., MySQL, PostgreSQL, or CSV) and executes the specified query or data access logic for the task. This approach enables dynamic, runtime selection of data sources and query strategies, improving extensibility and decoupling data access logic from the core federated learning workflow.
Enable support for different databases/sources for different collaborators.
The UML class diagram above illustrates the extensible data loading architecture using the Strategy design pattern. The central DataLoader class is responsible for orchestrating data access and query execution. Rather than implementing database-specific logic directly, DataLoader delegates these responsibilities to a QueryStrategy interface, which defines the contract for connecting to a data source and executing queries.
Concrete implementations of QueryStrategy—such as MySQLDataLoader, CSVDataLoader, and PostgreSQLDataLoader—provide the actual logic for connecting to and querying their respective data sources. At runtime, DataLoader selects the appropriate strategy based on the configuration provided in query.json. This allows the system to support multiple data backends (e.g., MySQL, CSV, PostgreSQL) without modifying the core logic of DataLoader.
This design offers several benefits:
- Easy extension to new data sources by simply implementing new QueryStrategy classes.
- Runtime flexibility, as the backend can be chosen dynamically based on user or system configuration.
- Clean separation of concerns, with DataLoader handling orchestration while each strategy encapsulates backend-specific access logic.
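A minimal sketch of the Strategy pattern described above could look like the following. The class names mirror the diagram, but the interfaces are illustrative rather than OpenFL's actual data loader API; only a CSV strategy is implemented (over an in-memory file) so the example stays self-contained, and the config dict stands in for a parsed query.json.

```python
import csv
import io
from abc import ABC, abstractmethod

class QueryStrategy(ABC):
    """Contract for connecting to a data source and executing a query."""
    @abstractmethod
    def execute(self, query: dict):
        ...

class CSVQueryStrategy(QueryStrategy):
    """CSV backend; MySQL/PostgreSQL strategies would follow the same contract."""
    def __init__(self, file_obj):
        self.file_obj = file_obj

    def execute(self, query: dict):
        reader = csv.DictReader(self.file_obj)
        return [float(row[query["column"]]) for row in reader]

class DataLoader:
    """Selects the appropriate strategy at runtime from a query.json-style config."""
    def __init__(self, config: dict, strategies: dict):
        self.strategy = strategies[config["backend"]]
        self.query = config["query"]

    def load(self):
        return self.strategy.execute(self.query)

# Hypothetical query.json content, already parsed into a dict.
config = {"backend": "csv", "query": {"column": "sepal_length"}}
csv_source = io.StringIO("sepal_length\n2.2\n2.42\n2.64\n")
loader = DataLoader(config, {"csv": CSVQueryStrategy(csv_source)})
values = loader.load()
print(values)  # [2.2, 2.42, 2.64]
```

Adding a new backend then means implementing one new QueryStrategy subclass and registering it, with no change to DataLoader itself.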
Sample query.json
Workspace showcasing the DataLoader's capability to support CSV and MySQL databases.
A way to define modifiable parameters and to alert collaborators of changes in these parameters.
Phase 3
MySQL, MongoDB, CSV, Amazon S3, InfluxDB, etc.
Phase 4