Enabling NVIDIA Flare POC Mode#
This part of the How-To assumes you have completed Part 1: Setting up a basic Data Federation for local testing. You have a working federation with federation.dfm.yaml, adapter code in myfed/myfed/lib/, a registered federation, generated code, and a notebook that runs with target="local".
Here we add NVIDIA Flare POC (Proof of Concept) mode so you can run the same federation over Flare’s simulated distributed infrastructure on your machine.
What You will Do#
Add a server site to federation.dfm.yaml (required so DFM generates the server runtime for Flare).
Add a project.yaml file (Flare’s infrastructure descriptor).
Update your federation registration to point to the project.
Regenerate federation code and reinstall the package.
Start POC and run your existing notebook with target="flare".
Your adapter code stays the same. You will add one server site to federation.dfm.yaml (so the Flare controller can load its runtime module), add project.yaml, then change the notebook connection.
Step 1: Add the Server Site to the Federation (Required for POC)#
In POC mode, the Flare server participant runs the DFM controller, which loads a runtime module for the site named server. That module is only generated when server is listed under sites in your federation config. Part 1 only defines homesite, loader, and plotter, so you must add a server site before running POC.
Open myfed/configs/federation.dfm.yaml and add a server entry under sites with an empty interface (the server does not run pipeline operations):
sites:
  homesite:
    # ... existing homesite entry unchanged ...
  server:
    info:
      description: "Flare server (controller); no operations"
    interface: {}
  loader:
    # ... existing loader entry unchanged ...
  plotter:
    # ...
Then regenerate the code. Step 4 runs the same command once project.yaml is in place, so you can either run it now and again later, or skip it here and run it only once in Step 4:
dfm fed gen code myfed --output-dir myfed
This creates myfed/fed/runtime/server/ so the controller can load myfed.fed.runtime.server.
Step 2: Add the NVIDIA Flare Project File#
Flare needs a project file that defines participants (server and clients) and how to build startup kits. DFM uses this for POC and for later provisioning.
What project.yaml Is For#
federation.dfm.yaml defines your operations and sites (including the server site you added in Step 1, plus loader, plotter, etc.).
project.yaml defines the Flare infrastructure: one server, one admin, and one client per executing site.
Client names in project.yaml must match the site names in federation.dfm.yaml (except homesite and server; the server is the Flare server participant, not a client).
Create myfed/configs/project.yaml#
Create myfed/configs/project.yaml with the following. Use the same workspace directory as in Part 1 (for example, ~/zero-to-thirty).
api_version: 3
name: myfed
description: "Array subsetting federation"

participants:
  # Admin: required by Flare. Name must be email format; use admin@nvidia.com (DFM default).
  - name: admin@nvidia.com
    type: admin
    org: myorg
    role: project_admin
  # Server: central coordinator. Needs fed_learn_port and admin_port.
  - name: server
    type: server
    org: myorg
    fed_learn_port: 8002
    admin_port: 8003
  # Clients: one per executing site in federation.dfm.yaml (loader, plotter in Part 1).
  - name: loader
    type: client
    org: myorg
  - name: plotter
    type: client
    org: myorg

builders:
  - path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    args:
      template_file:
        - master_template.yml
        - aws_template.yml
        - azure_template.yml
  - path: nvflare.lighter.impl.template.TemplateBuilder
  - path: nvflare.lighter.impl.static_file.StaticFileBuilder
    args:
      config_folder: config
      overseer_agent:
        path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
        overseer_exists: false
        args:
          sp_end_point: server:8002:8003
  - path: nvflare.lighter.impl.cert.CertBuilder
  - path: nvflare.lighter.impl.signature.SignatureBuilder
  - path: nv_dfm_core.targets.flare.builder.WorkspaceArchiveBuilder
Important
If your Part 1 federation has different site names (for example, if you added a slicer site), add a matching - name: slicer client under participants and keep client names in sync with federation.dfm.yaml.
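The naming rule can be checked mechanically before you start POC. The following is a minimal sketch, not part of DFM or NVIDIA Flare: check_client_names is a hypothetical helper that takes both configs as already-parsed dictionaries (for example, loaded with a YAML parser of your choice) and reports mismatches. It assumes only the structure shown in this guide: a top-level sites mapping in federation.dfm.yaml and a participants list in project.yaml.

```python
# Hypothetical helper (not part of DFM or NVIDIA Flare): compares site names
# with Flare client names using already-parsed config dictionaries.

# Per this guide, homesite and server are not Flare clients.
NON_CLIENT_SITES = {"homesite", "server"}


def check_client_names(federation_cfg: dict, project_cfg: dict) -> dict:
    """Return sites that lack a client and clients that lack a site."""
    sites = set(federation_cfg.get("sites", {})) - NON_CLIENT_SITES
    clients = {
        p["name"]
        for p in project_cfg.get("participants", [])
        if p.get("type") == "client"
    }
    return {
        "missing_clients": sorted(sites - clients),
        "unknown_clients": sorted(clients - sites),
    }


# Inline stand-ins for the two parsed YAML files from this guide.
federation_cfg = {
    "sites": {"homesite": {}, "server": {}, "loader": {}, "plotter": {}}
}
project_cfg = {
    "participants": [
        {"name": "admin@nvidia.com", "type": "admin"},
        {"name": "server", "type": "server"},
        {"name": "loader", "type": "client"},
        {"name": "plotter", "type": "client"},
    ]
}

result = check_client_names(federation_cfg, project_cfg)
print(result)
```

Both lists are empty when the two files agree. If you add a slicer site without a matching client, slicer shows up under missing_clients.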
Step 3: Point DFM at the Project File#
Register the project path so DFM (and dfm poc) can find it. From your workspace root (for example, ~/zero-to-thirty):
dfm fed config set myfed \
    --federation-dir myfed \
    --config-path configs/federation.dfm.yaml \
    --project-path configs/project.yaml
If you already had myfed registered without a project path, this updates it. Paths are relative to the federation directory: configs/project.yaml means myfed/configs/project.yaml.
Step 4: Regenerate Code and Reinstall#
Because you added the server site and project.yaml, regenerate the code and reinstall the package so all participants have the right runtime:
dfm fed gen code myfed --output-dir myfed
pip install -e myfed/
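After reinstalling, you can confirm that the server runtime module from Step 1 is visible to Python before starting POC. This is a small sketch using only the standard library; module_available is a hypothetical helper name, and the check only confirms that the module can be located in the current environment, not that the Flare controller will load it successfully.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located.

    Parent packages are imported as a side effect of the lookup.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example, myfed itself) is missing.
        return False


# Module name taken from Step 1 of this guide.
if module_available("myfed.fed.runtime.server"):
    print("server runtime found")
else:
    print("server runtime missing: regenerate code and reinstall")
```

Run it with the same Python environment that you use for dfm poc start, since an editable install is only visible in the environment where pip ran.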
Step 5: Start POC#
From the workspace root:
dfm poc start -f myfed
Optional: check status and logs:
dfm poc status -f myfed
dfm poc logs
If something goes wrong (for example, port in use or certificate errors), stop and clean up, then try again:
dfm poc stop
dfm poc cleanup -f myfed
dfm poc start -f myfed
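If you suspect a port conflict, a quick probe of the two server ports from project.yaml can confirm it before you restart. The sketch below uses only the Python standard library; port_is_free is a hypothetical helper, and probing the loopback address is an assumption about how POC binds on your machine, so treat the result as a hint and not a guarantee.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True


# Ports from the server participant in project.yaml
# (fed_learn_port and admin_port).
for port in (8002, 8003):
    state = "free" if port_is_free(port) else "in use"
    print(f"port {port}: {state}")
```

Run the probe after dfm poc stop. If a port is still reported as in use, another process holds it, and you can either stop that process or change the ports in project.yaml (including sp_end_point).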
Step 6: Update Your Notebook for Flare#
Your pipeline and the rest of the notebook stay the same. Only the connection cell changes: use target="flare" and pass the admin startup kit path.
After POC has started, Flare creates the admin startup kit under the workspace, for example:
workspace/myfed_poc/myfed/prod_00/admin@nvidia.com
From a notebook under myfed/apps/, the workspace root is typically two levels up (for example, Path.cwd().parent.parent). Use that to build the path.
Replace your existing “Connect to federation” cell with:
from pathlib import Path

# POC creates the admin startup kit here (relative to workspace root).
# If the notebook is in myfed/apps/, workspace root is parent.parent.
admin_startup_kit = (
    Path.cwd().parent.parent
    / "workspace"
    / "myfed_poc"
    / "myfed"
    / "prod_00"
    / "admin@nvidia.com"
)

session = get_session(target="flare", admin_package=admin_startup_kit)
session.connect()
print("Connected to federation!")
Keep all other cells as in Part 1 (same pipeline, prepare, execute, callback, wait, display, close). Run the notebook from the workspace (or ensure the path to admin_startup_kit is correct for your directory layout).
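Because a wrong path only surfaces when the session tries to connect, it can help to validate the startup kit location first. The following is a minimal sketch: admin_kit_path is a hypothetical helper, and generalizing the directory layout to other federation names (the federation name followed by _poc) is an assumption based on the single myfed example in this guide.

```python
from pathlib import Path


def admin_kit_path(
    workspace_root: Path,
    federation: str = "myfed",
    admin: str = "admin@nvidia.com",
) -> Path:
    """Build the POC admin startup kit path and fail early if it is missing."""
    kit = (
        workspace_root
        / "workspace"
        / f"{federation}_poc"  # assumed naming pattern, seen here as myfed_poc
        / federation
        / "prod_00"
        / admin
    )
    if not kit.is_dir():
        raise FileNotFoundError(
            f"Admin startup kit not found at {kit}; has POC been started?"
        )
    return kit


# From a notebook in myfed/apps/, the workspace root is two levels up.
try:
    kit = admin_kit_path(Path.cwd().parent.parent)
    print(f"Using admin startup kit: {kit}")
except FileNotFoundError as err:
    print(err)
```

The returned path can be passed as admin_package in the connection cell. The explicit error message points at the two likely causes: POC has not been started, or the notebook is running from a different directory than expected.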
Step 7: Run the Notebook#
Start Jupyter from the workspace root:
cd ~/zero-to-thirty
jupyter lab
Open myfed/apps/application.ipynb (or your notebook).
Run all cells in order.
You should see the same behavior as in Part 1, but with jobs running over Flare POC instead of the local target.
NOTE: After these changes, you can still run in local target mode. If you change your Cell 2 back to use get_session(target="local"), it should still work!
Cleanup#
When you are done with POC:
dfm poc stop
dfm poc cleanup -f myfed
Troubleshooting (POC-specific)#
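| Issue | What to do |
|---|---|
| Port already in use / certificate errors | Run dfm poc stop and dfm poc cleanup -f myfed, then dfm poc start -f myfed again (see Step 5). |
| “Startup kit does not exist” | Ensure POC has finished starting (check with dfm poc status -f myfed) and that the admin_startup_kit path in the notebook matches your directory layout (see Step 6). |
| “Cannot find module myfed” on POC sites | From workspace root: pip install -e myfed/ (see Step 4). |
| “Could not locate a Python module for name myfed.fed.runtime.server” | Add the server site to federation.dfm.yaml, then regenerate code and reinstall (see Steps 1 and 4). |
| Site names in project.yaml | Client names under participants must match the site names in federation.dfm.yaml, except homesite and server (see Step 2). |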
For general federation or pipeline issues, see Part 1.