Progressive FT Launcher Attribution

Summary

Progressive FT launcher attribution reduces restart-decision latency by starting log analysis when a fault-tolerance cycle starts, while the workload is still running. When the cycle ends and ft_launcher asks attrsvc for a stop/restart decision, attrsvc should reuse the analysis work already completed for the same per-cycle application log and only perform the final missing work needed to return an authoritative decision.

The feature targets the ft_launcher integration path. The cluster-wide service path may expose progressive behavior for validation, but it is not a user-facing requirement there because service-mode attribution is post-mortem and not latency-sensitive.

Design conclusion: attrsvc should stay mostly as plumbing. The progressive option is offered at the NVRx-owned loganalysis tool boundary, and that tool can later decide whether to delegate progressive state to LogSage or to drive multiple LogSage calls itself as the file grows. Until the LogSage contract is available, POST /logs can carry the intent and the loganalysis tool can report an unsupported status without changing GET /logs behavior.

Problem

Today, ft_launcher submits a per-cycle log path near cycle start with POST /logs and later requests attribution with GET /logs after the workload has stopped. The POST path records/tracks the log, but the GET path triggers the full LogSage attribution pipeline. For large application logs this makes ft_launcher wait for parsing and LLM analysis after the workload has already died, increasing the time before the launcher can act on a stop/restart decision.

The per-cycle log path is already a stable correlation key. Both calls refer to the same path produced by the cycle log naming convention, for example <applog>_cycle<id>.log.
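POST and GET may spell the same file differently, so the path is only a reliable key after normalization. The following is a minimal sketch, assuming normalization means making the path absolute and collapsing redundant separators and dot segments. The helper name normalize_log_path and the example paths are hypothetical; the actual attrsvc rule (for example, whether symlinks are resolved) is not specified in this document.

```python
import os


def normalize_log_path(log_path: str) -> str:
    """Return a canonical form of log_path usable as a dictionary key.

    Assumption: an absolute path with '.' segments and duplicate
    separators collapsed. Symlinks are not resolved in this sketch.
    """
    return os.path.normpath(os.path.abspath(log_path))


# Two spellings of the same per-cycle log map to one key.
post_key = normalize_log_path("/logs/job42//train_cycle3.log")
get_key = normalize_log_path("/logs/job42/./train_cycle3.log")
same_key = post_key == get_key
```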

Goals

  • Start attribution pre-work from the ft_launcher cycle-start POST.

  • Preserve the existing GET /logs decision contract for ft_launcher.

  • Make the final GET result equivalent to terminal analysis of the complete per-cycle log.

  • Avoid enabling expensive progressive analysis by default for cluster-wide service-mode submissions.

  • Keep progressive analysis an optimization: if pre-analysis is unavailable, failed, stale, or unsupported, GET must fall back to the existing terminal analysis behavior.

Non-Goals

  • Do not require progressive analysis for service-mode cluster scans.

  • Do not introduce a user-facing live-attribution workflow outside ft_launcher.

  • Do not change the stop/restart policy consumed by ft_launcher.

  • Do not make POST /logs block on analysis completion.

  • Do not make POST /logs generate or return an attribution result.

User Flows

ft_launcher mode

  1. At cycle start, the launcher computes the per-cycle log path and sends POST /logs for that path.

  2. Attrsvc recognizes the submission as an ft_launcher progressive-analysis request and starts non-blocking pre-analysis for the growing file.

  3. The workload runs and continues writing to the same log file.

  4. At cycle end, ft_launcher sends GET /logs for the same path.

  5. Attrsvc uses the progressive state to complete the normal stop/end analysis and returns the normal attribution payload with a normalized recommendation.

  6. If progressive state is missing or unusable, attrsvc computes the result with the existing terminal pipeline.

Service mode

Service mode may continue using POST /logs for job/file tracking and GET /logs for post-mortem attribution. Progressive analysis should remain disabled unless explicitly requested by a test or diagnostic client.

Functional Requirements

  • POST /logs must accept an explicit signal that the caller wants progressive analysis for a single growing log.

  • The ft_launcher client must send that signal when submitting a per-cycle log.

  • Attrsvc must forward progressive intent to the loganalysis tool boundary and return from POST without waiting for analyzer completion.

  • The loganalysis tool may delegate progressive state to LogSage or implement the file-flow orchestration itself. Attrsvc should not need to know which model was chosen.

  • Attrsvc must not infer progressive analysis only from the absence of job_id because non-ft_launcher callers can also submit single files.

  • GET /logs must continue to run the normal terminal analysis path before returning a decision. Once LogSage/tool support exists, that terminal call can reuse progressive work when available.

  • GET /logs must fall back to the existing full terminal analysis path when progressive analysis is unsupported, incomplete, stale, failed, or disabled.

  • Progressive state, if any, must be correlated by normalized log path because this is the stable key shared by POST and GET.

  • POST /logs must remain a notification/early-start path. GET /logs remains the result-producing path.

  • The existing result cache behavior does not change: POST does not populate the final analysis cache; GET remains the path that computes and records final attribution results.

API and Data Contract

Handoff Points

The feature crosses three handoff points. Each handoff should be documented separately so NVRx-owned plumbing is not confused with the shared LogSage capability contract.

Service HTTP boundary

Owned by NVRx. POST /logs accepts the optional progressive intent and GET /logs remains the stop/end decision API. This boundary should stay backward compatible for existing attrsvc clients. The service does not own progressive parsing, tailing, or final-result caching.

MCP / loganalysis boundary

Owned by NVRx. This is the product feature boundary: it exposes the progressive option regardless of whether the implementation is an MCP tool, an in-process adapter, repeated LogSage calls, or a future LogSage-native progressive session. This is where NVRx should hide implementation ownership from attrsvc.

LogSage API boundary

Shared with LogSage. This contract defines whether LogSage can start work early, preserve progressive state, and let the terminal analysis call reuse that state while producing a result equivalent to full terminal analysis. It remains the main open design item.

HTTP API

Extend POST /logs with optional fields:

analysis_intent

Optional analysis behavior requested by the client. Proposed values are "track_only" and "progressive". Default is "track_only" for backward compatibility.

The existing log_path, user, and job_id fields stay compatible. Existing clients that omit analysis_intent retain current behavior.

GET /logs should keep the current response shape. It may optionally include diagnostic metadata in the future, but ft_launcher must continue to consume the normalized recommendation field without understanding progressive internals.
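As an illustration of the backward-compatible default, the sketch below parses a decoded POST /logs body into a request object. The SubmitRequest dataclass here is a stand-in rather than the real attrsvc type, parse_submit_request is a hypothetical helper, and rejecting unknown analysis_intent values (instead of silently treating them as track_only) is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

VALID_INTENTS = ("track_only", "progressive")


@dataclass(frozen=True)
class SubmitRequest:
    """Stand-in for the attrsvc request model (illustrative only)."""

    log_path: str
    user: Optional[str] = None
    job_id: Optional[str] = None
    analysis_intent: str = "track_only"  # default keeps older clients unchanged


def parse_submit_request(body: Mapping[str, Any]) -> SubmitRequest:
    """Build a SubmitRequest from a decoded JSON body."""
    intent = body.get("analysis_intent", "track_only")
    if intent not in VALID_INTENTS:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported analysis_intent: {intent!r}")
    return SubmitRequest(
        log_path=body["log_path"],
        user=body.get("user"),
        job_id=body.get("job_id"),
        analysis_intent=intent,
    )


# An older client omits the field; ft_launcher sends it explicitly.
legacy = parse_submit_request({"log_path": "/logs/train_cycle3.log"})
progressive = parse_submit_request(
    {"log_path": "/logs/train_cycle3.log", "analysis_intent": "progressive"}
)
```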

Python API

Extend the internal submit boundary from the HTTP adapter down to the analyzer with an optional progressive intent. The NVRx-side lifecycle should remain simple: POST starts early work when requested, and the existing GET/analyze path remains the stop/end activity that returns the final decision.

  • submit_log(..., analysis_intent="track_only")

  • analyzer-level delegation to LogAnalyzer.start_progressive_analysis(path, user, job_id)

  • loganalysis runner delegation to the selected lib/MCP tool adapter

The current plumbing does not change Analyzer.analyze or the GET result path. When the LogSage/tool contract supports reuse, terminal analysis can add a use_progressive-style option at the loganalysis layer while preserving the existing caller-facing GET shape.

MCP / Loganalysis Contract

The MCP/loganalysis boundary is NVRx-owned and should mirror the in-process library adapter. The boundary needs two concepts:

Progressive start

A non-result-producing operation used by the POST /logs path when analysis_intent="progressive". The initial code exposes this as log_analyzer_progressive_start. It accepts the normalized log_path, is_per_cycle=True for ft_launcher cycle logs, optional observability fields, and any runtime settings needed by LogSage to bind a future progressive session. It returns status metadata such as accepted/unsupported/failed and, if useful, a handle or session id. It must not return a final attribution result or create a cached MCP result resource.
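A possible shape for the placeholder version of this operation is sketched below. Only the name log_analyzer_progressive_start and the is_per_cycle flag come from this document; the remaining parameters, the dictionary keys, and the error text are assumptions. The point is that the return value carries status metadata only and never a recommendation.

```python
from typing import Any, Dict, Optional


def log_analyzer_progressive_start(
    log_path: str,
    is_per_cycle: bool = True,
    user: Optional[str] = None,
    job_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Placeholder progressive start: reports status, never a final result."""
    # No LogSage progressive API is assumed to exist yet, so nothing is
    # started and no session is created. user and job_id are accepted only
    # as optional observability fields and are unused here.
    return {
        "status": "unsupported",
        "log_path": log_path,
        "is_per_cycle": is_per_cycle,
        "session_id": None,
        "error": "progressive analysis is not supported by this backend",
    }


status = log_analyzer_progressive_start("/logs/train_cycle3.log")
```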

Terminal run with progressive reuse

The existing GET /logs path should still invoke terminal log analysis and receive the current LogSage-shaped result. Once the LogSage API is settled, the loganalysis boundary needs a way to ask the backend to reuse progressive state for the same path, for example an optional use_progressive=True argument on log_analyzer. If the backend cannot reuse state, the call should fall back to normal terminal analysis. The current plumbing leaves terminal GET unchanged.
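The fallback rule can be sketched as a wrapper around the terminal call. Everything here is hypothetical: terminal_log_analysis, the two callables, and the "source" key (added only so the example can show which path ran) are not part of the current plumbing, and the recommendation values are illustrative.

```python
from typing import Any, Callable, Dict, Optional

Result = Dict[str, Any]


def terminal_log_analysis(
    log_path: str,
    run_terminal: Callable[[str], Result],
    reuse_progressive: Optional[Callable[[str], Optional[Result]]] = None,
    use_progressive: bool = False,
) -> Result:
    """Run terminal analysis, reusing progressive state only when usable."""
    if use_progressive and reuse_progressive is not None:
        try:
            reused = reuse_progressive(log_path)
        except Exception:
            # A failed reuse attempt must not fail the stop/restart decision.
            reused = None
        if reused is not None:
            return reused
    # Missing, stale, failed, or disabled state: existing full analysis.
    return run_terminal(log_path)


def full_analysis(path: str) -> Result:
    return {"recommendation": "restart", "source": "terminal"}


def ready_state(path: str) -> Optional[Result]:
    return {"recommendation": "restart", "source": "progressive"}


def missing_state(path: str) -> Optional[Result]:
    return None


def broken_state(path: str) -> Optional[Result]:
    raise RuntimeError("progressive state is stale")


path = "/logs/train_cycle3.log"
reused = terminal_log_analysis(path, full_analysis, ready_state, True)
missing = terminal_log_analysis(path, full_analysis, missing_state, True)
failed = terminal_log_analysis(path, full_analysis, broken_state, True)
disabled = terminal_log_analysis(path, full_analysis, ready_state, False)
```

In all four cases the caller receives the same result shape, which is what lets GET /logs stay unchanged for ft_launcher.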

Flight-recorder analysis is unchanged by this feature. POST /logs does not notify FR. On GET /logs, attrsvc can continue to run log analysis and FR analysis with the existing terminal orchestration; only the log-analysis call needs a way to reuse progressive LogSage state when it exists.

Configuration

Add a service-side policy switch for progressive analysis. The default honors explicit progressive requests because ft_launcher is the expected caller and the POST path remains non-blocking even while the LogSage progressive API is being implemented.

NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS

Defaults to all_explicit, which honors explicit progressive POST /logs requests from any caller. Set it to off to disable progressive start. A stricter ft_launcher-only policy would require a caller identity or another server-side way to identify the submitter.
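A minimal sketch of how the gate might be evaluated follows. The variable name and the all_explicit and off values come from this section; the helper progressive_allowed, the case-insensitive matching, and the choice to treat unrecognized values as disabled are assumptions.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS"


def progressive_allowed(
    analysis_intent: str, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True when policy permits starting progressive analysis."""
    env = os.environ if env is None else env
    policy = env.get(ENV_VAR, "all_explicit").strip().lower()
    if policy == "off":
        return False
    if policy == "all_explicit":
        # Only an explicit request qualifies; track_only never starts work.
        return analysis_intent == "progressive"
    # Assumption: unrecognized policy values fail closed.
    return False


default_progressive = progressive_allowed("progressive", {})
default_track_only = progressive_allowed("track_only", {})
switched_off = progressive_allowed("progressive", {ENV_VAR: "off"})
```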

No attrsvc polling, cache, or concurrency settings are required in the current plumbing. If the loganalysis tool later owns repeated LogSage calls over a growing file, those operational controls should live with that tool implementation rather than in the HTTP service wrapper.

Observability

Expose enough state to tell whether the latency optimization is working:

  • Count progressive POST requests and whether they were accepted, ignored, or rejected by policy.

  • Count started, unsupported, failed, canceled, completed, and fallback progressive analyses.

  • Track final GET latency and, where possible, time saved by progressive work.

  • Include active progressive paths in status/debug output without dumping log contents.

  • Log when GET falls back to terminal analysis, including the fallback reason.
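The counters above could be kept in a small in-process object such as the one sketched here. The class name ProgressiveMetrics, its methods, and the outcome and reason strings are hypothetical; a real service would more likely export these through its existing metrics or status mechanism.

```python
from collections import Counter
from typing import Dict, Set


class ProgressiveMetrics:
    """In-memory counters for progressive POST outcomes and GET fallbacks."""

    def __init__(self) -> None:
        self.post_outcomes: Counter = Counter()
        self.fallback_reasons: Counter = Counter()
        self.active_paths: Set[str] = set()

    def record_post(self, outcome: str, log_path: str) -> None:
        # outcome is expected to be "accepted", "ignored", or "rejected".
        self.post_outcomes[outcome] += 1
        if outcome == "accepted":
            self.active_paths.add(log_path)

    def record_fallback(self, log_path: str, reason: str) -> None:
        self.fallback_reasons[reason] += 1
        self.active_paths.discard(log_path)

    def snapshot(self) -> Dict[str, object]:
        # Paths only; log contents are never included.
        return {
            "post_outcomes": dict(self.post_outcomes),
            "fallback_reasons": dict(self.fallback_reasons),
            "active_paths": sorted(self.active_paths),
        }


metrics = ProgressiveMetrics()
metrics.record_post("accepted", "/logs/train_cycle3.log")
metrics.record_post("ignored", "/logs/other.log")
metrics.record_fallback("/logs/train_cycle3.log", "unsupported")
snapshot = metrics.snapshot()
```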

Compatibility

Existing attrsvc callers must continue to work without sending new fields. Service-mode POST calls with job_id continue to support splitlog detection and tracking. Single-file POST calls from older clients remain track-only.

The final GET result must remain semantically compatible with the current terminal analysis result. A progressive implementation may improve latency, but must not weaken final attribution correctness.

NVRx-Side Implementation Plan

  1. Add an explicit progressive intent field to the shared HTTP helper and attrsvc SubmitRequest.

  2. Update the attribution-owned in-job HTTP client to send analysis_intent="progressive" on its normal POST /logs notification. ft_launcher should continue using the existing attribution submit hook and should not own the HTTP payload detail.

  3. Thread the intent through AttributionHttpAdapter, AttributionController, and Analyzer.submit.

  4. In attrsvc, keep default POST behavior as track-only. When progressive intent is requested and the service feature gate allows it, initiate progressive analysis through LogAnalyzer.start_progressive_analysis.

  5. In MCP mode, expose log_analyzer_progressive_start as a non-result-producing tool. Until LogSage support exists, it returns unsupported status metadata.

  6. When POST /logs succeeds, return the existing POST response shape. GET /logs remains the path that produces and records final attribution results.

  7. Once the LogSage contract exists, extend the terminal loganalysis call so it can request progressive reuse while keeping the existing GET response shape and fallback behavior.

Progressive Execution Model

Attrsvc owns request plumbing and policy. The NVRx loganalysis tool owns the progressive option exposed to attrsvc. The implementation behind that tool is still a design choice and should be settled with the LogSage implementation.

LogSage-owned progression

The loganalysis tool calls a LogSage start operation and returns. LogSage owns any background watching, tailing, checkpointing, or incremental state. The normal GET path calls terminal analysis with progressive reuse enabled. Attrsvc only sees accepted/unsupported/failure status.

Tool-owned progression

The NVRx loganalysis tool advances analysis as the file grows, possibly by calling LogSage multiple times or by using a smaller LogSage advance primitive. Polling/tailing configuration, concurrency limits, lifecycle cleanup, and state tracking belong to that tool implementation.

The requirement does not choose between these models. It requires that POST can request an early start, GET remains the stop/end decision path, and terminal correctness is preserved by fallback to the existing full analysis.

LogSage API Proposal

The current LogSage-style API is effectively terminal:

  • input: complete log_path

  • output: final LogSage-shaped result and recommendation

For this feature, the NVRx loganalysis tool needs a way for LogSage to preserve and reuse work for a growing per-cycle log. The shared behavior can be expressed with the start, run, and cancel operations described below:

start(path, *, session_id=None, is_per_cycle=True, metadata=None)

Create or return progressive state for a log. This should be idempotent for the same normalized path or supplied session id.

run(path, *, use_progressive=True, final=True)

Return the normal terminal LogSage result for the complete log, reusing any progressive state that was started for the same path. This is invoked from the existing GET /logs stop/end path.

LogSage may internally expose an advance or completion primitive if that is the cleanest implementation, or the NVRx loganalysis tool may drive repeated LogSage calls. Attrsvc does not need a separate public finalize_progressive operation. The important part is that the terminal run can consume any remaining tail bytes, validate the final log state, and produce the same result shape used today.

cancel(handle)

Release resources if the job ends without a final attribution request or the service is shutting down.

Minimum status metadata returned through the NVRx boundary:

handle or session_id

Stable identifier for the progressive analysis.

consumed_offset

Last byte or line offset included in progressive state.

status

pending, running, ready_to_complete, completed, failed, or unsupported.

error

Structured failure reason when status is failed or unsupported.
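The proposal can be written down as types to make the contract easier to review. This is a sketch of the proposed surface, not an existing LogSage API: the names ProgressiveStatus, ProgressiveState, and ProgressiveLogSage are hypothetical, session_id is used for the "handle or session_id" field, and the return types are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional, Protocol


class ProgressiveStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    READY_TO_COMPLETE = "ready_to_complete"
    COMPLETED = "completed"
    FAILED = "failed"
    UNSUPPORTED = "unsupported"


@dataclass(frozen=True)
class ProgressiveState:
    """Minimum status metadata returned through the NVRx boundary."""

    session_id: Optional[str]
    consumed_offset: int  # last byte or line offset already analyzed
    status: ProgressiveStatus
    error: Optional[str] = None  # set when status is failed or unsupported


class ProgressiveLogSage(Protocol):
    """Proposed operations; signatures mirror the text above."""

    def start(
        self,
        path: str,
        *,
        session_id: Optional[str] = None,
        is_per_cycle: bool = True,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> ProgressiveState: ...

    def run(
        self, path: str, *, use_progressive: bool = True, final: bool = True
    ) -> Dict[str, Any]: ...

    def cancel(self, handle: str) -> None: ...


unsupported = ProgressiveState(
    session_id=None,
    consumed_offset=0,
    status=ProgressiveStatus.UNSUPPORTED,
    error="backend has no progressive API",
)
```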

The important semantic requirement is that the stop/end run must produce a result equivalent to running terminal LogSage on the complete final file. Attrsvc should not need to understand LogSage’s internal summaries, LLM prompt state, or whether reuse was LogSage-owned or tool-owned.

Validation

  • Unit-test that existing POST /logs requests remain track-only by default.

  • Unit-test that ft_launcher POST sends progressive intent.

  • Unit-test that Analyzer.submit delegates progressive start to the loganalysis boundary only when the feature gate is enabled.

  • Unit-test that the MCP/loganalysis progressive-start operation returns status metadata and does not run terminal attribution.

  • Once LogSage/tool reuse exists, unit-test that the terminal MCP/loganalysis call can request progressive reuse while preserving the existing result shape.

  • Unit-test that service-mode POST with job_id does not initiate progressive analysis by default.

  • Once reuse exists, unit-test GET fallback when progressive state is missing, unsupported, failed, stale, or cannot be used to complete analysis.

  • Unit-test that POST does not return a completed attribution result.

  • Integration-test the full ft_launcher flow with a fake progressive analyzer: POST starts state, log grows, GET returns a normal recommendation through the existing stop/end path.

  • End-to-end validate latency improvement with real LogSage progressive support once available.
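The integration test with a fake progressive analyzer might look like the sketch below. FakeProgressiveAnalyzer, its advance method, the rule that a FATAL marker maps to a stop recommendation, and the reused_bytes field are all invented for the test; none of them describe real LogSage behavior. The check is that the progressive path and the full fallback path agree on the final recommendation for the same complete log.

```python
import os
import tempfile
from typing import Dict, Tuple


class FakeProgressiveAnalyzer:
    """Stand-in analyzer that tracks consumed bytes; no real LogSage calls."""

    def __init__(self) -> None:
        self.state: Dict[str, Tuple[int, bool]] = {}  # path -> (offset, saw_fatal)

    def start(self, path: str) -> Dict[str, str]:
        # Idempotent: a repeated start keeps the existing state.
        self.state.setdefault(path, (0, False))
        return {"status": "accepted"}

    def advance(self, path: str) -> None:
        # Assumes the marker never straddles a chunk boundary.
        offset, saw_fatal = self.state[path]
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read()
        self.state[path] = (offset + len(chunk), saw_fatal or b"FATAL" in chunk)

    def run(self, path: str) -> Dict[str, object]:
        # Missing state starts at offset 0, which is the full terminal pass.
        offset, saw_fatal = self.state.pop(path, (0, False))
        with open(path, "rb") as f:
            f.seek(offset)
            tail = f.read()
        saw_fatal = saw_fatal or b"FATAL" in tail
        return {
            "recommendation": "stop" if saw_fatal else "restart",
            "reused_bytes": offset,
        }


def run_flow(use_progressive: bool) -> Dict[str, object]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train_cycle0.log")
        analyzer = FakeProgressiveAnalyzer()
        with open(path, "wb") as f:
            f.write(b"step 1 ok\n")
        if use_progressive:
            analyzer.start(path)    # cycle-start POST
            analyzer.advance(path)  # pre-analysis while the workload runs
        with open(path, "ab") as f:
            f.write(b"FATAL: simulated failure\n")  # the log keeps growing
        return analyzer.run(path)   # cycle-end GET


progressive_result = run_flow(use_progressive=True)
fallback_result = run_flow(use_progressive=False)
```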

Open Questions

  • What exact LogSage API should back log_analyzer_progressive_start and terminal reuse?

  • Should progressive advancement be LogSage-owned after start, or should the NVRx loganalysis tool drive repeated LogSage calls over the growing file?

  • If the NVRx loganalysis tool owns advancement, what polling and concurrency policy should it use?