Replies: 6 comments 17 replies
-
Thanks for starting this discussion, @tanwarsh! Federated Analytics is a promising field, but it presents a new set of challenges, especially regarding the wide spectrum of possible analytics queries. I have a couple of questions from my first read-through of this proposal:
-
Regarding supplementing FA with SecureAggregation, you should take into account that SecAgg mainly works for cases where the local query results are summed arithmetically (as in FedAvg). It will likely not be applicable in cases where other types of query aggregation are used, such as concatenation, finding the maximum or minimum, or any non-linear aggregation operation. I would therefore suggest that you add a section in the design where you illustrate the types of federated queries that will be supported and, more importantly, how they will be aggregated.
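To make the limitation concrete, here is a toy sketch (with invented numbers) of the pairwise additive masking that underlies SecAgg-style protocols: masks cancel under summation, but not under non-linear aggregations such as max.

```python
# Toy pairwise-masking sketch (made-up values): client A adds a shared mask m,
# client B subtracts it, so the server only ever sees masked values.
true_values = [3.0, 5.0]   # local query results at two clients
m = 7.0                    # pairwise mask agreed between the two clients
masked = [true_values[0] + m, true_values[1] - m]

# Summation: the masks cancel, so the aggregate is exact.
assert sum(masked) == sum(true_values)

# Maximum: the masks do NOT cancel, so the aggregate is wrong.
print(max(masked), max(true_values))  # 10.0 vs 5.0
```

This is why a design section enumerating query types and their aggregation operators matters: it determines where SecAgg can and cannot be applied.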
-
I would like to raise an additional question/suggestion. In the current proposal, the query is defined as part of the FL plan. This entails that for each new query, the FL plan would have to be re-distributed to, and re-approved by, the participants. For a more streamlined experience, could we consider an approach where the query is passed as an aggregator CLI parameter? This could follow a similar pattern to how we can currently switch the aggregator's mode, e.g. `fx aggregator start --task_group evaluation`. In the case of FA, the query could be removed from the FL plan entirely and passed via the command line, for example `fx aggregator start --task_group analytics --query query.json` (or directly inline). Now, although convenient, this means that the aggregator would be able to run pretty much any query without preliminary approval by the collaborators, so such queries may need to be additionally restricted. @ishaileshpant, what would be your thoughts and suggestions in this regard?
-
@tanwarsh Thanks for the proposal. I'm generally okay with a dedicated Federated Analytics task runner and the creation of a concrete example for it. This is the sequence we've followed for every prior DL framework implementation. The task runner will have no model to set up ahead of time and no tensorkey dependencies; just a statistical function defined by the user. As a note, I see in the discussion above a query example where users deal with tensorkeys directly. We should try to avoid this if we can for the basic example; instead, they can return Metrics in the same way we recommend for other quickstart examples.

The place that is far less clear is the data loader. Unlike PyTorch / TensorFlow / JAX, where the model / dataloader usually come from the framework itself (or from another closely related data science package like numpy), we won't be able to rely on standard functions in the same way.

The sequence diagram looks good to me. In Phase 2, you mention adding unique aggregation functions. I think there's an opportunity here to demonstrate what kind of custom code users can insert into the plan and finally document this. Specifically, what I would recommend here is actually defining a custom statistical aggregation function in the workspace and modifying the plan to point to it.

Also see my in-line responses above for other areas that need consideration.
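A custom statistical aggregation function of the kind suggested here could be sketched as below. The function name, file location, and simplified signature are all hypothetical; the exact interface for plugging a custom aggregation into the plan should be checked against OpenFL's documentation.

```python
import numpy as np

# Hypothetical custom aggregation kept inside the workspace
# (e.g. src/custom_aggregation.py) and referenced from plan.yaml.
# The signature is a simplification of an aggregation-function interface.
def median_aggregation(local_results):
    """Aggregate per-collaborator arrays by elementwise median,
    which is more robust to outlier collaborators than a plain mean."""
    stacked = np.stack(local_results)
    return np.median(stacked, axis=0)

collab_a = np.array([1.0, 10.0, 3.0])
collab_b = np.array([2.0, 20.0, 4.0])
collab_c = np.array([9.0, 30.0, 5.0])
print(median_aggregation([collab_a, collab_b, collab_c]))  # [ 2. 20.  4.]
```

The plan would then point at this function by its import path (the exact plan key for this is an assumption to verify against the OpenFL docs), which is the workspace-customization pattern worth documenting in Phase 2.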
-
Thanks for this draft. Could you please help me understand:
^: By algorithmically, I am referring to changes in the way communication/computation is done within the existing taskrunner code. This does not include novel data loaders.
-
As discussed, we will proceed with the initial implementation for FA with OpenFL. The first iteration will include the following changes:
Areas for further investigation and discussion:
-
Summary
Federated Analytics is a data analysis approach in which a querier answers a query through collaboration with multiple data owners (clients) who retain their local raw data. Instead of exchanging raw data, intermediate query responses are transferred and aggregated by the querier to answer the query. This document provides a high-level design for implementing Federated Analytics using OpenFL.
Motivation
FA is a distributed approach to data analysis that enables multiple parties to collaboratively analyze data without sharing the actual data. This approach is particularly valuable in scenarios where data privacy and security are critical, such as in healthcare, finance, and other sensitive fields. Key points about FA include:
High-Level Design
Scope
Technical Details
TaskRunner API Workspace

```
LocalTensor(col_name='collaborator1_sepal length (cm)', tensor=array([2.2 , 2.42, 2.64, 2.86, 3.08, 3.3 , 3.52, 3.74, 3.96, 4.18, 4.4 ], dtype=float32))
```
Types of Supported Queries
1. Statistical queries
2. Set-based queries
3. Matrix transformation queries
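To illustrate two of these query types, the following toy example (with invented data) shows how per-collaborator partial results could be aggregated without any raw rows leaving a collaborator:

```python
# Statistical query: global mean from per-collaborator (sum, count) pairs.
partials = [(120.0, 40), (90.0, 30)]   # (local sum, local row count)
total, count = map(sum, zip(*partials))
global_mean = total / count            # 210.0 / 70 = 3.0

# Set-based query: distinct values via a union of local distinct sets.
local_sets = [{"setosa", "versicolor"}, {"versicolor", "virginica"}]
distinct = set().union(*local_sets)
print(global_mean, sorted(distinct))
```

Note that both aggregations here are simple reductions; matrix transformation queries would generally need their own aggregation logic, which is part of what Phase 2's custom aggregation functions would cover.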
Documentation
Open Questions
Phase 1
Phase 2
Dynamic queries/parameters are passed via the command line. This will support custom information/settings related to loading data for different collaborators, e.g. different types of data storage.
To support flexible data loading, the design has been updated so that a query.json file can be passed directly to the collaborator at startup. This JSON file is then propagated through the system: from the collaborator to the task runner, and finally to the data loader. The data loader uses the configuration in query.json to determine the appropriate backend (e.g., MySQL, PostgreSQL, or CSV) and executes the specified query or data access logic for the task. This approach enables dynamic, runtime selection of data sources and query strategies, improving extensibility and decoupling data access logic from the core federated learning workflow.
Enable support for different databases/sources for different collaborators.
The UML class diagram above illustrates the extensible data loading architecture using the Strategy design pattern. The central DataLoader class is responsible for orchestrating data access and query execution. Rather than implementing database-specific logic directly, DataLoader delegates these responsibilities to a QueryStrategy interface, which defines the contract for connecting to a data source and executing queries.
Concrete implementations of QueryStrategy—such as MySQLDataLoader, CSVDataLoader, and PostgreSQLDataLoader—provide the actual logic for connecting to and querying their respective data sources. At runtime, DataLoader selects the appropriate strategy based on the configuration provided in query.json. This allows the system to support multiple data backends (e.g., MySQL, CSV, PostgreSQL) without modifying the core logic of DataLoader.
This design offers several benefits:
- Easy extension to new data sources by simply implementing new QueryStrategy classes.
- Runtime flexibility, as the backend can be chosen dynamically based on user or system configuration.
- Clean separation of concerns, with DataLoader handling orchestration while each strategy encapsulates backend-specific access logic.
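A minimal sketch of the Strategy pattern described above could look like the following. The class names mirror the diagram, but the interfaces are illustrative rather than OpenFL's actual data loader API; only a CSV strategy is implemented (over an in-memory file) so the example stays self-contained, and the config dict stands in for a parsed query.json.

```python
import csv
import io
from abc import ABC, abstractmethod

class QueryStrategy(ABC):
    """Contract for connecting to a data source and executing a query."""
    @abstractmethod
    def execute(self, query: dict):
        ...

class CSVQueryStrategy(QueryStrategy):
    """CSV backend; MySQL/PostgreSQL strategies would follow the same contract."""
    def __init__(self, file_obj):
        self.file_obj = file_obj

    def execute(self, query: dict):
        reader = csv.DictReader(self.file_obj)
        return [float(row[query["column"]]) for row in reader]

class DataLoader:
    """Selects the appropriate strategy at runtime from a query.json-style config."""
    def __init__(self, config: dict, strategies: dict):
        self.strategy = strategies[config["backend"]]
        self.query = config["query"]

    def load(self):
        return self.strategy.execute(self.query)

# Hypothetical query.json content, already parsed into a dict.
config = {"backend": "csv", "query": {"column": "sepal_length"}}
csv_source = io.StringIO("sepal_length\n2.2\n2.42\n2.64\n")
loader = DataLoader(config, {"csv": CSVQueryStrategy(csv_source)})
values = loader.load()
print(values)  # [2.2, 2.42, 2.64]
```

Adding a new backend then means implementing one new QueryStrategy subclass and registering it, with no change to DataLoader itself.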
Sample query.json
Workspace showcasing the DataLoader's capability to support CSV and MySQL databases.
A way to define modifiable parameters and to alert collaborators of changes in these parameters.
Phase 3
MySQL, MongoDB, CSV, Amazon S3, InfluxDB, etc.
Phase 4