Introduction
Briefly describe the project in plain language.
- Problem and domain:
- Why federated analytics suits this problem:
- Who will use the results:
Scope and objectives
Define the project boundaries and goals.
- Primary objective:
- Population and setting (describe subjects, patients, or data sources):
- Outputs to produce:
  - Predictive model, or
  - Descriptive / inferential statistics, or
  - Privacy-preserving dashboards or studies
- Out of scope:
Assumptions and constraints
Make assumptions explicit to avoid hidden risks.
- Key assumptions:
- Constraints (technical, organizational, regulatory):
- How assumptions will be validated or monitored:
Governance
Policies and roles guiding responsible data and model use. (Note: governance covers oversight and policy; technical controls belong in the Privacy, security, and risk section.)
- Stakeholders and roles:
- Approvals and oversight:
- Legal basis, consent, and agreements:
- Ethics status (institutional approval, N/A for public datasets, or secondary analysis notes):
- Access and sharing policy for data, models, or results:
- Publication and dissemination rules:
Data landscape
Describe the data available and site level differences.
- Federation mode (simulated, live, hybrid):
  - If simulated: describe how clients are defined (e.g., temporal cross-validation, geographic partitions, synthetic splits)
  - If live: describe institutional boundaries and participation
- Clients and data sources:
- Inclusion and exclusion rules:
- Feature families and link to data dictionary:
- Label or outcome definitions (if applicable):
- Dataset size, class balance, and known biases:
- Data quality issues:
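Where the federation mode is simulated, the "clients" are partitions of one centralized dataset. A minimal sketch of a geographic-partition approach, assuming a hypothetical `site_id` field on each record (the field name and the record layout are illustrative, not prescribed by this template):

```python
from collections import defaultdict

def partition_by_site(records, site_key="site_id"):
    """Split a centralized record list into per-client shards keyed by site.

    Each shard then acts as one simulated federated client; records
    never move between shards after this step.
    """
    shards = defaultdict(list)
    for rec in records:
        shards[rec[site_key]].append(rec)
    return dict(shards)

# Toy example with a hypothetical `site_id` field.
records = [
    {"site_id": "A", "x": 1.0},
    {"site_id": "B", "x": 2.0},
    {"site_id": "A", "x": 3.0},
]
shards = partition_by_site(records)
```

Temporal or synthetic splits follow the same shape: only the grouping key changes, which is why documenting the partition rule in this section is enough to reproduce the simulated federation.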
Standards and harmonization
List conventions that ensure semantic alignment.
- Vocabularies, ontologies, or coding systems:
- Unit conventions and mapping rules:
- Versioning and updates:
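Unit conventions and mapping rules are easiest to audit when expressed as a single versioned table. A minimal sketch, assuming linear conversions into one canonical unit per measure (the table entries here are illustrative examples, not a mandated vocabulary):

```python
# Hypothetical unit-mapping table: each entry converts a source unit
# into the federation's canonical unit via (scale, offset).
UNIT_MAP = {
    ("glucose", "mg/dL"): ("mmol/L", 1 / 18.016, 0.0),
    ("temperature", "F"): ("C", 5 / 9, -160 / 9),  # C = (F - 32) * 5/9
}

def to_canonical(measure, unit, value):
    """Convert a raw value to the canonical unit; pass it through
    unchanged when no mapping entry exists (already canonical)."""
    key = (measure, unit)
    if key not in UNIT_MAP:
        return unit, value
    canonical, scale, offset = UNIT_MAP[key]
    return canonical, value * scale + offset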
Infrastructure
How the federation is run and secured.
- Federation topology and orchestration:
- Frameworks and libraries:
- Client participation policy:
- Compute, storage, and networking:
- Monitoring and failure recovery:
- Simulation or live clients (if simulated, describe how this approximates real-world federation):
- Security baseline for transport and authentication:
Wrangling
How data are prepared locally.
- Preprocessing steps and provenance:
- Train, validation, and test splits (if modeling):
- Normalization strategy and source of stats:
- Missing data handling:
- Class imbalance handling (if modeling):
- Validation checks and data QA:
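The "source of stats" prompt matters because normalizing each client with purely local statistics and normalizing with federation-wide statistics give different models. A minimal sketch of the federated option, in which clients share only counts and sums rather than raw values (function names are illustrative):

```python
import math

def local_moments(values):
    """Each client computes only count, sum, and sum of squares locally."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def global_mean_std(client_moments):
    """The aggregator combines per-client summaries into global
    normalization statistics; raw values never leave a client."""
    n = sum(m[0] for m in client_moments)
    s = sum(m[1] for m in client_moments)
    ss = sum(m[2] for m in client_moments)
    mean = s / n
    var = max(ss / n - mean * mean, 0.0)  # guard tiny negative round-off
    return mean, math.sqrt(var)

# Two simulated clients.
moments = [local_moments([1.0, 2.0, 3.0]), local_moments([5.0, 7.0])]
mean, std = global_mean_std(moments)
```

Note that even these summary statistics leak some information; whether that is acceptable should be recorded against the threat model in the privacy section.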
Computation plan
Describe methods to be run.
- If predictive modeling:
  - Baselines and algorithms (include centralized baselines if available):
  - Personalization or adaptation strategy:
  - Model architectures:
  - Training schedule and early stopping:
  - Hyperparameters and search plan:
- If analytics without modeling:
  - Statistical methods and estimators:
  - Aggregations and query design:
  - Hypothesis tests and assumptions:
- Privacy budgets (if using differential privacy):
- Random seeds and reproducibility notes:
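For the modeling branch, the "baselines and algorithms" field often resolves to federated averaging (FedAvg). A minimal sketch of the server-side aggregation step, with plain float lists standing in for flattened model weights (a simplification; real frameworks aggregate tensors per layer):

```python
def fed_avg(client_updates):
    """Sample-size-weighted average of client parameters (FedAvg).

    client_updates: list of (num_examples, params), where params is a
    list of floats standing in for flattened model weights.
    """
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    agg = [0.0] * dim
    for n, params in client_updates:
        w = n / total  # clients with more data pull the average harder
        for i, p in enumerate(params):
            agg[i] += w * p
    return agg

# Toy round: two clients with 10 and 30 local examples.
global_params = fed_avg([(10, [1.0, 0.0]), (30, [3.0, 4.0])])
```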
Evaluation and success criteria
How results will be judged.
- If modeling:
  - Primary and secondary metrics:
  - Client-side evaluation plan:
  - Aggregation across clients:
  - Calibration and threshold selection:
  - Runtime and cost reporting:
  - Statistical tests and uncertainty:
  - Comparison to centralized baseline (if available, report performance difference and privacy trade-offs):
- If analytics without modeling:
  - Estimator accuracy and precision:
  - Coverage or confidence intervals:
  - Agreement with a centralized reference (if feasible):
  - Sensitivity analyses for assumptions:
  - Robustness checks across clients:
  - Runtime and cost reporting:
- Fairness and subgroup checks (demographic, per-site, per-outcome, or other relevant subgroups):
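The "aggregation across clients" field hides a real choice: a macro (per-client) average and a sample-weighted (pooled) average can disagree sharply when sites differ in size. A minimal sketch that reports both, with accuracy as an illustrative metric:

```python
def aggregate_metric(client_results):
    """Aggregate a per-client metric two common ways.

    client_results: list of (num_examples, metric_value).
    Returns (macro, weighted): macro treats every client equally;
    weighted reflects the pooled population. Reporting both exposes
    cases where small sites diverge from large ones.
    """
    macro = sum(m for _, m in client_results) / len(client_results)
    total = sum(n for n, _ in client_results)
    weighted = sum(n * m for n, m in client_results) / total
    return macro, weighted

# A small site at 0.60 accuracy and a large site at 0.90.
macro, weighted = aggregate_metric([(100, 0.60), (900, 0.90)])
```

A large gap between the two numbers is itself a per-site fairness signal worth recording under the subgroup checks above.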
Privacy, security, and risk
Technical and procedural safeguards. (Note: this section describes how governance policies are implemented.)
- Threat model (distinguish simulation vs deployment if applicable):
- Controls in use:
  - Secure aggregation
  - Encryption in transit and at rest
  - Differential privacy or k-anonymity if applicable
  - Access logging and audit trail
- Privacy budget accounting for repeated queries:
- Incident response and contacts:
- Simulation-specific notes (if applicable, describe how simulation differs from deployment privacy risks):
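The "privacy budget accounting for repeated queries" field implies a running ledger. A minimal sketch using basic sequential composition, the most conservative accounting rule (total epsilon is the sum over all queries); tighter accountants such as advanced composition or RDP exist, and the class name here is illustrative:

```python
class PrivacyAccountant:
    """Minimal epsilon ledger under basic sequential composition."""

    def __init__(self, epsilon_budget):
        self.budget = epsilon_budget
        self.spent = 0.0
        self.log = []  # audit trail of (query_name, epsilon)

    def charge(self, query_name, epsilon):
        """Refuse any query that would push total spend past the budget."""
        if self.spent + epsilon > self.budget:
            raise RuntimeError(f"budget exceeded: refusing {query_name}")
        self.spent += epsilon
        self.log.append((query_name, epsilon))

acct = PrivacyAccountant(epsilon_budget=1.0)
acct.charge("mean_age", 0.3)
acct.charge("count_by_site", 0.3)
```

The `log` attribute doubles as the access/audit trail called for under controls in use.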
Reproducibility and sharing
Make it possible for others to rerun or extend the work.
- Code repository and commit tag:
- Environment capture and seeds:
- Artifacts to release (configs, metrics, models if allowed):
- Artifact registry or index for traceability:
- Data availability (public dataset with URL, restricted access with application process, synthetic samples):
- Known limitations and caveats:
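"Environment capture and seeds" can be made concrete with a small run manifest written alongside every experiment. A minimal sketch, assuming the Python standard library only; a real project would extend the manifest with the git commit tag and pinned library versions from the sections above:

```python
import json
import platform
import random
import sys

def capture_run_manifest(seed, path=None):
    """Record the minimum needed to rerun: seed, interpreter, platform."""
    random.seed(seed)  # seed every RNG the pipeline actually uses
    manifest = {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    if path:  # optionally persist next to the run's other artifacts
        with open(path, "w") as f:
            json.dump(manifest, f, indent=2)
    return manifest

manifest = capture_run_manifest(seed=42)
```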
Operationalization and maintenance
Plan for use beyond the study.
- Deployment target and owner (or “intended use case” for proof-of-concept work without concrete deployment):
- If modeling:
  - Monitoring for drift and performance:
  - Update and retraining policy:
- If analytics without modeling:
  - Schedule for recurring queries or dashboards:
  - Change control for query definitions:
- Site playbooks and operator training:
- Sunset or rollback plan:
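One common concrete answer to "monitoring for drift" is the population stability index (PSI) over binned feature distributions, which each site can compute locally without sharing raw data. A minimal sketch (the thresholds in the comment are a widely used rule of thumb, not a standard):

```python
import math

def psi(expected, actual):
    """Population stability index between two binned distributions
    (lists of proportions over the same bins). Rule of thumb:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 investigate.
    """
    score = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # guard against empty bins
        a = max(a, 1e-6)
        score += (a - e) * math.log(a / e)
    return score

baseline = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
current = [0.10, 0.20, 0.30, 0.40]   # distribution observed in production
drift = psi(baseline, current)
```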
Technology readiness level (TRL)
Describe maturity and supporting evidence.
- Claimed TRL:
- Evidence and references:
- Gaps to reach the next TRL:
- Target deployment setting:
Wrap up
Summarize the key outcomes and next steps.
- Key learning:
- Decisions made and why:
- Next step to raise TRL:
Appendix: Changes from Template v1.1
Added fields
- Governance section: “Ethics status” field to handle public datasets, secondary analysis, and N/A cases
- Data landscape section: “Federation mode” field (simulated, live, hybrid) with guidance on documenting simulation approaches
- Infrastructure section: Explicit prompt to describe how simulation approximates real-world federation
- Computation plan section: Explicit request for centralized baselines in modeling subsection
- Evaluation section: “Comparison to centralized baseline” field to quantify privacy-preserving trade-offs
- Privacy section: “Simulation-specific notes” to distinguish current vs. future deployment threat models
- Reproducibility section: “Data availability” field with three options (public, restricted, synthetic)
- Operationalization section: Allow “intended use case” for proof-of-concept work without concrete deployment owner
Clarified language
- Scope section: Changed “Population and setting” description to explicitly include non-clinical subjects (e.g., “describe subjects, patients, or data sources”)
- Evaluation section: Clarified “Fairness and subgroup checks” to include non-demographic subgroups (per-site, per-outcome, or other task-relevant subgroups)
- Throughout: Added simulation considerations where relevant to support early-stage work
Design principles preserved from v1.1
- Prose-first approach (minimize bullet points in completed examples)
- Separation of governance (policy) from privacy/security (technical controls)
- Explicit TRL assessment with gap analysis
- Reproducibility focus with artifact tracking
Appendix: Potential Future Sections for Template v3.0
The following sections were identified as potentially valuable but require more community feedback before inclusion:
- Algorithm Development & Validation: For projects introducing novel federated methods, document mathematical formulation, algorithmic innovations, and validation separate from application performance (distinct from “Computation plan”, which focuses on using existing methods)
- Communication & Bandwidth Analysis: Quantify bytes transmitted per round, total bandwidth requirements, compression strategies, network latency tolerance, and communication efficiency techniques (gradient sparsification, quantization)
- Heterogeneity Analysis: Systematically document data heterogeneity (distribution differences), system heterogeneity (compute/memory/network variations), and statistical heterogeneity (non-IID effects on convergence) in a dedicated section rather than scattered across Infrastructure and Evaluation
- Client Selection & Sampling Strategy: Document selection criteria (random, stratified, active learning), minimum participation requirements per round, dropout tolerance policies, and strategies for handling partial client participation
- Interpretability & Explainability: For clinical or high-stakes applications, document model interpretability for domain experts, feature importance analysis, failure mode characterization, and explanation generation strategies
- Cost-Benefit Analysis: Quantify resource costs (compute, storage, personnel), opportunity costs of federated constraints, and benefit quantification to justify the federated approach versus centralized alternatives (include acceptable performance trade-offs)