Progressive FT Launcher Attribution
Summary
Progressive FT launcher attribution reduces restart-decision latency by starting
log analysis when a fault-tolerance cycle starts, while the workload is still
running. When the cycle ends and ft_launcher asks attrsvc for a stop/restart
decision, attrsvc should reuse the analysis work already completed for the same
per-cycle application log and only perform the final missing work needed to
return an authoritative decision.
The feature is aimed at the ft_launcher integration path. The cluster-wide
service path may expose progressive behavior for validation, but it is not a
user-facing requirement there because service-mode attribution is post-mortem
and is not latency sensitive.
Design conclusion: attrsvc should stay mostly as plumbing. The progressive
option is offered at the NVRx-owned loganalysis tool boundary, and that tool can
later decide whether to delegate progressive state to LogSage or drive multiple
LogSage calls over the file flow. Until the LogSage contract is available,
POST /logs can carry the intent and the loganalysis tool can report
unsupported without changing GET /logs behavior.
Problem
Today, ft_launcher submits a per-cycle log path near cycle start with
POST /logs and later requests attribution with GET /logs after the
workload has stopped. The POST path only records and tracks the log, while the
GET path triggers the full LogSage attribution pipeline. For large
application logs this makes ft_launcher wait for parsing and LLM analysis
after the workload has already died, increasing the time before the launcher can
act on a stop/restart decision.
The per-cycle log path is already a stable correlation key. Both calls refer to
the same path produced by the cycle log naming convention, for example
<applog>_cycle<id>.log.
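As a minimal sketch of how both calls can arrive at the same correlation key, the helpers below build a per-cycle path from the naming convention and normalize it. The function names, the assumption that <applog> is the application log path without its .log suffix, and the use of os.path.abspath as the normalization step are illustrative and are not taken from the NVRx code.

```python
import os


def cycle_log_path(applog_base: str, cycle_id: int) -> str:
    # Assumption: <applog> is the application log path without the ".log" suffix.
    return f"{applog_base}_cycle{cycle_id}.log"


def normalize_log_path(path: str) -> str:
    # Assumption: normalization means an absolute path with "." / ".." and
    # duplicate separators collapsed; symlinks are not resolved here.
    return os.path.abspath(path)


# The POST and GET callers spell the path differently but share one key.
post_key = normalize_log_path(cycle_log_path("/var/log/job/train", 3))
get_key = normalize_log_path("/var/log/job/./train_cycle3.log")
```

Whatever normalization the service actually uses, the point is that it must be applied identically on the POST and GET sides.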
Goals
- Start attribution pre-work from the ft_launcher cycle-start POST.
- Preserve the existing GET /logs decision contract for ft_launcher.
- Make the final GET result equivalent to terminal analysis of the complete per-cycle log.
- Avoid enabling expensive progressive analysis by default for cluster-wide service-mode submissions.
- Keep progressive analysis an optimization: if pre-analysis is unavailable, failed, stale, or unsupported, GET must fall back to the existing terminal analysis behavior.
Non-Goals
- Do not require progressive analysis for service-mode cluster scans.
- Do not introduce a user-facing live-attribution workflow outside ft_launcher.
- Do not change the stop/restart policy consumed by ft_launcher.
- Do not make POST /logs block on analysis completion.
- Do not make POST /logs generate or return an attribution result.
User Flows
ft_launcher mode
- At cycle start, the launcher computes the per-cycle log path and sends POST /logs for that path.
- Attrsvc recognizes the submission as an ft_launcher progressive-analysis request and starts non-blocking pre-analysis for the growing file.
- The workload runs and continues writing to the same log file.
- At cycle end, ft_launcher sends GET /logs for the same path.
- Attrsvc uses the progressive state to complete the normal stop/end analysis and returns the normal attribution payload with a normalized recommendation.
- If progressive state is missing or unusable, attrsvc computes the result with the existing terminal pipeline.
Service mode
Service mode may continue using POST /logs for job/file tracking and
GET /logs for post-mortem attribution. Progressive analysis should remain
disabled unless explicitly requested by a test or diagnostic client.
Functional Requirements
- POST /logs must accept an explicit signal that the caller wants progressive analysis for a single growing log.
- The ft_launcher client must send that signal when submitting a per-cycle log.
- Attrsvc must forward progressive intent to the loganalysis tool boundary and return from POST without waiting for analyzer completion.
- The loganalysis tool may delegate progressive state to LogSage or implement the file-flow orchestration itself. Attrsvc should not need to know which model was chosen.
- Attrsvc must not infer progressive analysis only from the absence of job_id because non-ft_launcher callers can also submit single files.
- GET /logs must continue to run the normal terminal analysis path before returning a decision. Once LogSage/tool support exists, that terminal call can reuse progressive work when available.
- GET /logs must fall back to the existing full terminal analysis path when progressive analysis is unsupported, incomplete, stale, failed, or disabled.
- Progressive state, if any, must be correlated by normalized log path because this is the stable key shared by POST and GET.
- POST /logs must remain a notification/early-start path. GET /logs remains the result-producing path.
- The existing result cache behavior does not change: POST does not populate the final analysis cache; GET remains the path that computes and records final attribution results.
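The fallback requirement can be sketched as a small decision function on the GET path. This is a hedged illustration only: analyze_for_get, the callables it takes, and the set of statuses treated as reusable are assumptions, and the real reuse mechanism depends on the LogSage contract that is still open.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumption: only these statuses mean progressive work can be reused.
REUSABLE_STATUSES = {"running", "ready_to_complete"}


def analyze_for_get(
    path: str,
    progressive_status: Dict[str, str],
    run_terminal: Callable[[str], dict],
    run_with_reuse: Callable[[str], dict],
) -> Tuple[dict, Optional[str]]:
    """Return (result, fallback_reason); fallback_reason is None when reuse worked."""
    status = progressive_status.get(path)
    if status is None:
        return run_terminal(path), "missing"
    if status not in REUSABLE_STATUSES:
        return run_terminal(path), status
    try:
        return run_with_reuse(path), None
    except Exception:
        # Reuse is only an optimization, so any failure degrades to the full path.
        return run_terminal(path), "reuse_error"
```

Returning the fallback reason alongside the result keeps the decision payload unchanged while still letting the service log why reuse was skipped.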
API and Data Contract
Handoff Points
The feature crosses three handoff points. Each handoff should be documented separately so NVRx-owned plumbing is not confused with the shared LogSage capability contract.
- Service HTTP boundary
Owned by NVRx. POST /logs accepts the optional progressive intent and GET /logs remains the stop/end decision API. This boundary should stay backward compatible for existing attrsvc clients. The service does not own progressive parsing, tailing, or final-result caching.
- MCP / loganalysis boundary
Owned by NVRx. This is the product feature boundary: it exposes the progressive option regardless of whether the implementation is an MCP tool, an in-process adapter, repeated LogSage calls, or a future LogSage-native progressive session. This is where NVRx should hide implementation ownership from attrsvc.
- LogSage API boundary
Shared with LogSage. This contract defines whether LogSage can start work early, preserve progressive state, and let the terminal analysis call reuse that state while producing a result equivalent to full terminal analysis. It remains the main open design item.
HTTP API
Extend POST /logs with optional fields:
- analysis_intent
Optional analysis behavior requested by the client. Proposed values are "track_only" and "progressive". Default is "track_only" for backward compatibility.
The existing log_path, user, and job_id fields stay compatible.
Existing clients that omit analysis_intent retain current behavior.
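A minimal sketch of how the request body could be parsed with the new field follows. SubmitRequestSketch and parse_submit are hypothetical names, the optionality of user and job_id is assumed for illustration, and rejecting unknown intent values is a design assumption rather than a stated requirement.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

VALID_INTENTS = ("track_only", "progressive")


@dataclass(frozen=True)
class SubmitRequestSketch:
    log_path: str
    user: Optional[str] = None
    job_id: Optional[str] = None
    analysis_intent: str = "track_only"  # default keeps older clients unchanged


def parse_submit(body: Mapping[str, Any]) -> SubmitRequestSketch:
    intent = body.get("analysis_intent", "track_only")
    if intent not in VALID_INTENTS:
        # Assumption: unknown values are rejected rather than silently downgraded.
        raise ValueError(f"unsupported analysis_intent: {intent!r}")
    return SubmitRequestSketch(
        log_path=body["log_path"],
        user=body.get("user"),
        job_id=body.get("job_id"),
        analysis_intent=intent,
    )
```

A body without analysis_intent parses to track_only, which is the backward-compatibility property described above.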
GET /logs should keep the current response shape. It may optionally include
diagnostic metadata in the future, but ft_launcher must continue to consume
the normalized recommendation field without understanding progressive
internals.
Python API
Extend the internal submit boundary from the HTTP adapter down to the analyzer
with an optional progressive intent. The NVRx-side lifecycle should remain
simple: POST starts early work when requested, and the existing
GET/analyze path remains the stop/end activity that returns the final
decision.
- submit_log(..., analysis_intent="track_only")
- analyzer-level delegation to LogAnalyzer.start_progressive_analysis(path, user, job_id)
- loganalysis runner delegation to the selected lib/MCP tool adapter
The current plumbing does not change Analyzer.analyze or the GET result
path. When the LogSage/tool contract supports reuse, terminal analysis can add a
use_progressive-style option at the loganalysis layer while preserving the
existing caller-facing GET shape.
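The delegation chain can be sketched as follows. Only the method name start_progressive_analysis(path, user, job_id) and the analysis_intent parameter come from this document; the classes, the progressive_enabled flag, and the returned dictionaries are hypothetical stand-ins and do not describe the existing response shape.

```python
from typing import List, Optional, Set


class LogAnalyzerStub:
    """Stand-in for the loganalysis boundary; records progressive-start calls."""

    def __init__(self) -> None:
        self.started: List[str] = []

    def start_progressive_analysis(
        self, path: str, user: Optional[str], job_id: Optional[str]
    ) -> dict:
        self.started.append(path)
        return {"status": "unsupported"}  # placeholder until a backend exists


class AnalyzerSketch:
    def __init__(self, log_analyzer: LogAnalyzerStub, progressive_enabled: bool = True) -> None:
        self._log_analyzer = log_analyzer
        self._progressive_enabled = progressive_enabled
        self.tracked: Set[str] = set()

    def submit(self, path, user=None, job_id=None, analysis_intent="track_only") -> dict:
        self.tracked.add(path)  # tracking happens for every submission
        if analysis_intent == "progressive" and self._progressive_enabled:
            # Delegation only; no attribution result is produced or cached here.
            status = self._log_analyzer.start_progressive_analysis(path, user, job_id)
            return {"tracked": True, "progressive": status["status"]}
        return {"tracked": True, "progressive": "not_started"}
```

Keeping the analyze path out of this sketch mirrors the plan: submit gains an optional argument, while the GET side is untouched.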
MCP / Loganalysis Contract
The MCP/loganalysis boundary is NVRx-owned and should mirror the in-process library adapter. The boundary needs two concepts:
- Progressive start
A non-result-producing operation used by the POST /logs path when analysis_intent="progressive". The initial code exposes this as log_analyzer_progressive_start. It accepts the normalized log_path, is_per_cycle=True for ft_launcher cycle logs, optional observability fields, and any runtime settings needed by LogSage to bind a future progressive session. It returns status metadata such as accepted/unsupported/failed and, if useful, a handle or session id. It must not return a final attribution result or create a cached MCP result resource.
- Terminal run with progressive reuse
The existing GET /logs path should still invoke terminal log analysis and receive the current LogSage-shaped result. Once the LogSage API is settled, the loganalysis boundary needs a way to ask the backend to reuse progressive state for the same path, for example an optional use_progressive=True argument on log_analyzer. If the backend cannot reuse state, the call should fall back to normal terminal analysis. The current plumbing leaves terminal GET unchanged.
Flight-recorder analysis is unchanged by this feature. POST /logs does not
notify FR. On GET /logs, attrsvc can continue to run log analysis and FR
analysis with the existing terminal orchestration; only the log-analysis call
needs a way to reuse progressive LogSage state when it exists.
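A placeholder progressive-start operation might look like the sketch below. The tool name log_analyzer_progressive_start comes from this document, but the exact parameter list and the keys of the returned metadata are assumptions, and the body simply reports unsupported because no LogSage progressive backend is assumed to exist.

```python
from typing import Any, Dict, Optional


def log_analyzer_progressive_start(
    log_path: str,
    is_per_cycle: bool = True,
    user: Optional[str] = None,
    job_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Placeholder progressive start: reports status only, never a result."""
    return {
        "status": "unsupported",
        "log_path": log_path,
        "is_per_cycle": is_per_cycle,
        "session_id": None,  # no session can be bound without a backend
        "error": {"reason": "progressive backend not available"},
    }
```

The returned dictionary deliberately carries no recommendation or attribution fields, so a caller cannot mistake it for a terminal result.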
Configuration
Add a service-side policy switch for progressive analysis. The default honors explicit progressive requests because ft_launcher is the expected caller and the POST path remains non-blocking even while the LogSage progressive API is being implemented.
- NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS
all_explicit by default. Honors explicit progressive POST /logs requests. Set it to off to disable progressive start. A stricter ft_launcher-only policy would require a caller identity or another server-side way to identify the submitter.
No attrsvc polling, cache, or concurrency settings are required in the current plumbing. If the loganalysis tool later owns repeated LogSage calls over a growing file, those operational controls should live with that tool implementation rather than in the HTTP service wrapper.
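The policy switch could be read along these lines. The environment variable name and its two values come from this document; the helper names, the case-insensitive parsing, and the choice to raise on an unrecognized value are assumptions made for the sketch.

```python
import os
from typing import Mapping

ENV_VAR = "NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS"
POLICIES = ("all_explicit", "off")


def progressive_policy(environ: Mapping[str, str] = os.environ) -> str:
    value = environ.get(ENV_VAR, "all_explicit").strip().lower()
    if value not in POLICIES:
        # Assumption: a misconfigured value fails loudly instead of being guessed.
        raise ValueError(f"{ENV_VAR} must be one of {POLICIES}, got {value!r}")
    return value


def should_start_progressive(analysis_intent: str, policy: str) -> bool:
    # Only an explicit request is ever honored; "off" disables it entirely.
    return policy == "all_explicit" and analysis_intent == "progressive"
```

Note that the default policy still never starts progressive work for a track_only submission, which keeps service-mode callers unaffected.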
Observability
Expose enough state to tell whether the latency optimization is working:
- Count progressive POST requests and whether they were accepted, ignored, or rejected by policy.
- Count started, unsupported, failed, canceled, completed, and fallback progressive analyses.
- Track final GET latency and, where possible, time saved by progressive work.
- Include active progressive paths in status/debug output without dumping log contents.
- Log when GET falls back to terminal analysis, including the fallback reason.
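One lightweight way to collect these signals is sketched below with the standard library only. The counter names, the logger name, and the helper functions are illustrative assumptions; a real deployment would presumably use whatever metrics facility attrsvc already has.

```python
import logging
import time
from collections import Counter

log = logging.getLogger("attrsvc.progressive")
counters: Counter = Counter()


def record_post(outcome: str) -> None:
    # outcome is an illustrative label such as "accepted", "ignored", or "rejected".
    counters[f"progressive_post.{outcome}"] += 1


def record_fallback(path: str, reason: str) -> None:
    counters[f"progressive_fallback.{reason}"] += 1
    # Log the path and reason only, never log contents.
    log.info("GET fell back to terminal analysis: path=%s reason=%s", path, reason)


def timed_get(analyze, path: str):
    """Run the GET-side analysis and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = analyze(path)
    elapsed = time.monotonic() - start
    counters["get.count"] += 1
    return result, elapsed
```

Comparing the elapsed time of GET calls that reused progressive state against those that fell back gives a rough view of the latency saved.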
Compatibility
Existing attrsvc callers must continue to work without sending new fields.
Service-mode POST calls with job_id continue to support splitlog
detection and tracking. Single-file POST calls from older clients remain
track-only.
The final GET result must remain semantically compatible with the current
terminal analysis result. A progressive implementation may improve latency, but
must not weaken final attribution correctness.
NVRx-Side Implementation Plan
- Add an explicit progressive intent field to the shared HTTP helper and attrsvc SubmitRequest.
- Update the attribution-owned in-job HTTP client to send analysis_intent="progressive" on its normal POST /logs notification. ft_launcher should continue using the existing attribution submit hook and should not own the HTTP payload detail.
- Thread the intent through AttributionHttpAdapter, AttributionController, and Analyzer.submit.
- In attrsvc, keep default POST behavior as track-only. When progressive intent is requested and the service feature gate allows it, initiate progressive analysis through LogAnalyzer.start_progressive_analysis.
- In MCP mode, expose log_analyzer_progressive_start as a non-result-producing tool. Until LogSage support exists, it returns unsupported status metadata.
- On success, return the existing response shape. GET /logs remains the path that produces and records final attribution results.
- Once the LogSage contract exists, extend the terminal loganalysis call so it can request progressive reuse while keeping the existing GET response shape and fallback behavior.
Progressive Execution Model
Attrsvc owns request plumbing and policy. The NVRx loganalysis tool owns the progressive option exposed to attrsvc. The implementation behind that tool is still a design choice and should be settled with the LogSage implementation.
- LogSage-owned progression
The loganalysis tool calls a LogSage start operation and returns. LogSage owns any background watching, tailing, checkpointing, or incremental state. The normal GET path calls terminal analysis with progressive reuse enabled. Attrsvc only sees accepted/unsupported/failure status.
- Tool-owned progression
The NVRx loganalysis tool advances analysis as the file grows, possibly by calling LogSage multiple times or by using a smaller LogSage advance primitive. Polling/tailing configuration, concurrency limits, lifecycle cleanup, and state tracking belong to that tool implementation.
The requirement does not choose between these models. It requires that
POST can request an early start, GET remains the stop/end decision path,
and terminal correctness is preserved by fallback to the existing full analysis.
LogSage API Proposal
The current LogSage-style API is effectively terminal:
- input: complete log_path
- output: final LogSage-shaped result and recommendation
For this feature, the NVRx loganalysis tool needs a way for LogSage to preserve and reuse work for a growing per-cycle log. The shared behavior can be expressed with two operations:
- start(path, *, session_id=None, is_per_cycle=True, metadata=None)
Create or return progressive state for a log. This should be idempotent for the same normalized path or supplied session id.
- run(path, *, use_progressive=True, final=True)
Return the normal terminal LogSage result for the complete log, reusing any progressive state that was started for the same path. This is invoked from the existing GET /logs stop/end path.
LogSage may internally expose an advance or completion primitive if that is
the cleanest implementation, or the NVRx loganalysis tool may drive repeated
LogSage calls. Attrsvc does not need a separate public
finalize_progressive operation. The important part is that the terminal
run can consume any remaining tail bytes, validate the final log state, and
produce the same result shape used today.
- cancel(handle)
Release resources if the job ends without a final attribution request or the service is shutting down.
Minimum status metadata returned through the NVRx boundary:
- handle or session_id
Stable identifier for the progressive analysis.
- consumed_offset
Last byte or line offset included in progressive state.
- status
pending, running, ready_to_complete, completed, failed, or unsupported.
- error
Structured failure reason when status is failed or unsupported.
The important semantic requirement is that the stop/end run must produce a
result equivalent to running terminal LogSage on the complete final file.
Attrsvc should not need to understand LogSage’s internal summaries, LLM prompt
state, or whether reuse was LogSage-owned or tool-owned.
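The proposed operations and status metadata can be written down as a typed interface. This is a sketch of the proposal only, not an existing LogSage API: the start, run, and cancel signatures follow the text above, while ProgressiveStatus, ProgressiveBackend, InMemoryBackend, the generated session ids, and the placeholder recommendation value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class ProgressiveStatus:
    session_id: str
    status: str  # pending, running, ready_to_complete, completed, failed, unsupported
    consumed_offset: int = 0
    error: Optional[dict] = None


class ProgressiveBackend(Protocol):
    def start(self, path: str, *, session_id: Optional[str] = None,
              is_per_cycle: bool = True, metadata: Optional[dict] = None) -> ProgressiveStatus: ...
    def run(self, path: str, *, use_progressive: bool = True, final: bool = True) -> dict: ...
    def cancel(self, handle: str) -> None: ...


class InMemoryBackend:
    """Toy backend that only demonstrates idempotent start, reuse, and cancel."""

    def __init__(self) -> None:
        self._by_path: Dict[str, ProgressiveStatus] = {}

    def start(self, path, *, session_id=None, is_per_cycle=True, metadata=None):
        if path not in self._by_path:  # repeated start for a path returns the same state
            sid = session_id or f"session-{len(self._by_path) + 1}"
            self._by_path[path] = ProgressiveStatus(session_id=sid, status="running")
        return self._by_path[path]

    def run(self, path, *, use_progressive=True, final=True):
        reused = use_progressive and path in self._by_path
        # Placeholder result; a real backend would analyze the complete log here.
        return {"recommendation": "restart", "reused_progressive": reused}

    def cancel(self, handle):
        self._by_path = {p: s for p, s in self._by_path.items() if s.session_id != handle}
```

Because run returns the same result shape whether or not state was reused, a caller such as attrsvc does not need to know which path was taken.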
Validation
- Unit-test that existing POST /logs requests remain track-only by default.
- Unit-test that ft_launcher POST sends progressive intent.
- Unit-test that Analyzer.submit delegates progressive start to the loganalysis boundary only when the feature gate is enabled.
- Unit-test that the MCP/loganalysis progressive-start operation returns status metadata and does not run terminal attribution.
- Once LogSage/tool reuse exists, unit-test that the terminal MCP/loganalysis call can request progressive reuse while preserving the existing result shape.
- Unit-test that service-mode POST with job_id does not initiate progressive analysis by default.
- Once reuse exists, unit-test GET fallback when progressive state is missing, unsupported, failed, stale, or cannot be used to complete analysis.
- Unit-test that POST does not return a completed attribution result.
- Integration-test the full ft_launcher flow with a fake progressive analyzer: POST starts state, log grows, GET returns a normal recommendation through the existing stop/end path.
- End-to-end validate latency improvement with real LogSage progressive support once available.
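The integration flow with a fake progressive analyzer could be exercised roughly as below. Everything here is a test double under stated assumptions: FakeProgressiveService, its method names, the explicit advance step, and the rule that a line containing ERROR yields a restart recommendation are invented for the sketch and are not the real attribution logic.

```python
import os
import tempfile


class FakeProgressiveService:
    """In-memory stand-in for attrsvc plus a fake progressive analyzer."""

    def __init__(self) -> None:
        self.state = {}  # normalized path -> {"offset": int, "saw_error": bool}

    def _consume(self, key: str, st: dict) -> int:
        # Read only bytes past the stored offset (assumes writes are whole lines).
        with open(key, "rb") as f:
            f.seek(st["offset"])
            chunk = f.read()
        st["offset"] += len(chunk)
        st["saw_error"] = st["saw_error"] or b"ERROR" in chunk
        return len(chunk)

    def post_logs(self, log_path: str, analysis_intent: str = "track_only") -> dict:
        if analysis_intent == "progressive":
            self.state.setdefault(os.path.abspath(log_path), {"offset": 0, "saw_error": False})
        return {"status": "ok"}  # never an attribution result

    def advance(self, log_path: str) -> None:
        key = os.path.abspath(log_path)
        if key in self.state:
            self._consume(key, self.state[key])

    def get_logs(self, log_path: str) -> dict:
        key = os.path.abspath(log_path)
        st = self.state.pop(key, None)
        fallback = st is None
        if fallback:  # no progressive state: analyze the whole file from offset 0
            st = {"offset": 0, "saw_error": False}
        tail = self._consume(key, st)
        rec = "restart" if st["saw_error"] else "stop"  # placeholder decision rule
        return {"recommendation": rec, "tail_bytes": tail, "fallback": fallback}


def run_flow(intent: str) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app_cycle0.log")
        open(path, "wb").close()
        svc = FakeProgressiveService()
        svc.post_logs(path, analysis_intent=intent)  # cycle start
        with open(path, "ab") as f:
            f.write(b"step 1 ok\n")
        svc.advance(path)  # progressive work while the workload runs
        with open(path, "ab") as f:
            f.write(b"ERROR: rank 3 died\n")
        return svc.get_logs(path)  # cycle end
```

Running the flow once with the progressive intent and once with track_only gives a direct check that both paths return the same recommendation, with the progressive run reading only the unread tail at GET time.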
Open Questions
- What exact LogSage API should back log_analyzer_progressive_start and terminal reuse?
- Should progressive advancement be LogSage-owned after start, or should the NVRx loganalysis tool drive repeated LogSage calls over the growing file?
- If the NVRx loganalysis tool owns advancement, what polling and concurrency policy should it use?