Progressive FT Launcher Attribution
===================================

Summary
-------

Progressive FT launcher attribution reduces restart-decision latency by starting
log analysis when a fault-tolerance cycle starts, while the workload is still
running. When the cycle ends and ``ft_launcher`` asks attrsvc for a stop/restart
decision, attrsvc should reuse the analysis work already completed for the same
per-cycle application log and only perform the final missing work needed to
return an authoritative decision.

The feature is aimed at the ``ft_launcher`` integration path. The cluster-wide
service path may expose progressive behavior for validation, but it is not a
user-facing requirement there because service-mode attribution is post-mortem
and is not latency sensitive.

Design conclusion: attrsvc should remain mostly plumbing. The progressive
option is offered at the NVRx-owned loganalysis tool boundary, and that tool can
later decide whether to delegate progressive state to LogSage or drive multiple
LogSage calls over the file flow. Until the LogSage contract is available,
``POST /logs`` can carry the intent and the loganalysis tool can report
``unsupported`` without changing ``GET /logs`` behavior.

Problem
-------

Today, ``ft_launcher`` submits a per-cycle log path near cycle start with
``POST /logs`` and later requests attribution with ``GET /logs`` after the
workload has stopped. The ``POST`` path records/tracks the log, but the
``GET`` path triggers the full LogSage attribution pipeline. For large
application logs this makes ``ft_launcher`` wait for parsing and LLM analysis
after the workload has already died, increasing the time before the launcher can
act on a stop/restart decision.

The per-cycle log path is already a stable correlation key. Both calls refer to
the same path produced by the cycle log naming convention, for example
``<applog>_cycle<id>.log``.

Goals
-----

* Start attribution pre-work from the ``ft_launcher`` cycle-start ``POST``.
* Preserve the existing ``GET /logs`` decision contract for ``ft_launcher``.
* Make the final ``GET`` result equivalent to terminal analysis of the complete
  per-cycle log.
* Avoid enabling expensive progressive analysis by default for cluster-wide
  service-mode submissions.
* Keep progressive analysis an optimization: if pre-analysis is unavailable,
  failed, stale, or unsupported, ``GET`` must fall back to the existing terminal
  analysis behavior.

Non-Goals
---------

* Do not require progressive analysis for service-mode cluster scans.
* Do not introduce a user-facing live-attribution workflow outside
  ``ft_launcher``.
* Do not change the stop/restart policy consumed by ``ft_launcher``.
* Do not make ``POST /logs`` block on analysis completion.
* Do not make ``POST /logs`` generate or return an attribution result.

User Flows
----------

``ft_launcher`` mode
~~~~~~~~~~~~~~~~~~~~

1. At cycle start, the launcher computes the per-cycle log path and sends
   ``POST /logs`` for that path.
2. Attrsvc recognizes the submission as an ``ft_launcher`` progressive-analysis
   request and starts non-blocking pre-analysis for the growing file.
3. The workload runs and continues writing to the same log file.
4. At cycle end, ``ft_launcher`` sends ``GET /logs`` for the same path.
5. Attrsvc uses the progressive state to complete the normal stop/end analysis
   and returns the normal attribution payload with a normalized recommendation.
6. If progressive state is missing or unusable, attrsvc computes the result with
   the existing terminal pipeline.
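
The two launcher-side calls in this flow can be sketched as request builders.
This is a minimal illustration, not the actual client: the service address, the
helper names, and the assumption that ``GET /logs`` receives the path as a
``log_path`` query parameter are all hypothetical.

.. code-block:: python

   import json
   from urllib.parse import urlencode
   from urllib.request import Request

   ATTRSVC_URL = "http://localhost:8000"  # hypothetical service address


   def build_cycle_start_request(log_path: str, user: str) -> Request:
       """Step 1: notify attrsvc and ask for progressive analysis."""
       body = json.dumps(
           {"log_path": log_path, "user": user, "analysis_intent": "progressive"}
       ).encode("utf-8")
       return Request(
           f"{ATTRSVC_URL}/logs",
           data=body,
           headers={"Content-Type": "application/json"},
           method="POST",
       )


   def build_cycle_end_request(log_path: str) -> Request:
       """Step 4: request the decision for the same per-cycle path."""
       # Assumption: the path is passed as a ``log_path`` query parameter.
       query = urlencode({"log_path": log_path})
       return Request(f"{ATTRSVC_URL}/logs?{query}", method="GET")

Both builders take the same ``log_path``, which is what lets attrsvc correlate
the early start with the final decision request.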

Service mode
~~~~~~~~~~~~

Service mode may continue using ``POST /logs`` for job/file tracking and
``GET /logs`` for post-mortem attribution. Progressive analysis should remain
disabled unless explicitly requested by a test or diagnostic client.

Functional Requirements
-----------------------

* ``POST /logs`` must accept an explicit signal that the caller wants
  progressive analysis for a single growing log.
* The ``ft_launcher`` client must send that signal when submitting a per-cycle
  log.
* Attrsvc must forward progressive intent to the loganalysis tool boundary and
  return from ``POST`` without waiting for analyzer completion.
* The loganalysis tool may delegate progressive state to LogSage or implement
  the file-flow orchestration itself. Attrsvc should not need to know which
  model was chosen.
* Attrsvc must not infer progressive analysis only from the absence of
  ``job_id`` because non-``ft_launcher`` callers can also submit single files.
* ``GET /logs`` must continue to run the normal terminal analysis path before
  returning a decision. Once LogSage/tool support exists, that terminal call can
  reuse progressive work when available.
* ``GET /logs`` must fall back to the existing full terminal analysis path when
  progressive analysis is unsupported, incomplete, stale, failed, or disabled.
* Progressive state, if any, must be correlated by normalized log path because
  this is the stable key shared by ``POST`` and ``GET``.
* ``POST /logs`` must remain a notification/early-start path. ``GET /logs``
  remains the result-producing path.
* The existing result cache behavior does not change: ``POST`` does not populate
  the final analysis cache; ``GET`` remains the path that computes and records
  final attribution results.
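
The path-correlation requirement can be illustrated with a small registry keyed
by normalized path. This is a sketch under assumptions: ``progressive_key`` and
``ProgressiveRegistry`` are hypothetical names, and lexical normalization is
assumed to be sufficient (a real service might also resolve symlinks).

.. code-block:: python

   import os


   def progressive_key(log_path: str) -> str:
       """Normalize a log path so POST and GET resolve to the same key."""
       # Assumption: lexical normalization only; symlinks are not resolved.
       return os.path.normpath(os.path.abspath(log_path))


   class ProgressiveRegistry:
       """Hypothetical in-memory map from normalized path to start status."""

       def __init__(self):
           self._state = {}

       def start(self, log_path: str, status: str = "accepted") -> None:
           self._state[progressive_key(log_path)] = status

       def lookup(self, log_path: str):
           # Returns None when no progressive state exists, which the caller
           # treats as "fall back to terminal analysis".
           return self._state.get(progressive_key(log_path))

Note that the registry never looks at ``job_id``; progressive state exists only
when a caller explicitly asked for it.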

API and Data Contract
---------------------

Handoff Points
~~~~~~~~~~~~~~

The feature crosses three handoff points. Each handoff should be documented
separately so NVRx-owned plumbing is not confused with the shared LogSage
capability contract.

Service HTTP boundary
   Owned by NVRx. ``POST /logs`` accepts the optional progressive intent and
   ``GET /logs`` remains the stop/end decision API. This boundary should stay
   backward compatible for existing attrsvc clients. The service does not own
   progressive parsing, tailing, or final-result caching.

MCP / loganalysis boundary
   Owned by NVRx. This is the product feature boundary: it exposes the
   progressive option regardless of whether the implementation is an MCP tool,
   an in-process adapter, repeated LogSage calls, or a future LogSage-native
   progressive session. This is where NVRx should hide implementation ownership
   from attrsvc.

LogSage API boundary
   Shared with LogSage. This contract defines whether LogSage can start work
   early, preserve progressive state, and let the terminal analysis call reuse
   that state while producing a result equivalent to full terminal analysis. It
   remains the main open design item.

HTTP API
~~~~~~~~

Extend ``POST /logs`` with one optional field:

``analysis_intent``
   Optional analysis behavior requested by the client. Proposed values are
   ``"track_only"`` and ``"progressive"``. Default is ``"track_only"`` for
   backward compatibility.

The existing ``log_path``, ``user``, and ``job_id`` fields stay compatible.
Existing clients that omit ``analysis_intent`` retain current behavior.

``GET /logs`` should keep the current response shape. It may optionally include
diagnostic metadata in the future, but ``ft_launcher`` must continue to consume
the normalized ``recommendation`` field without understanding progressive
internals.
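
A backward-compatible request model might look like the sketch below. The class
reuses the ``SubmitRequest`` name only as a stand-in; the real field types, and
the choice to reject unknown intent values rather than ignore them, are
assumptions made for illustration.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Optional

   VALID_INTENTS = ("track_only", "progressive")


   @dataclass(frozen=True)
   class SubmitRequest:
       """Illustrative stand-in for the attrsvc submit model."""

       log_path: str
       user: Optional[str] = None
       job_id: Optional[str] = None
       # Older clients omit this field and therefore stay track-only.
       analysis_intent: str = "track_only"

       def __post_init__(self):
           # Assumption: unknown values are rejected instead of ignored.
           if self.analysis_intent not in VALID_INTENTS:
               raise ValueError(f"unknown analysis_intent: {self.analysis_intent!r}")


   def parse_submit(body: dict) -> SubmitRequest:
       """Build a request from a decoded JSON body, ignoring unknown keys."""
       fields = ("log_path", "user", "job_id", "analysis_intent")
       return SubmitRequest(**{k: body[k] for k in fields if k in body})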

Python API
~~~~~~~~~~

Extend the internal submit boundary from the HTTP adapter down to the analyzer
with an optional progressive intent. The NVRx-side lifecycle should remain
simple: ``POST`` starts early work when requested, and the existing
``GET``/``analyze`` path remains the stop/end activity that returns the final
decision.

* ``submit_log(..., analysis_intent="track_only")``
* analyzer-level delegation to
  ``LogAnalyzer.start_progressive_analysis(path, user, job_id)``
* loganalysis runner delegation to the selected lib/MCP tool adapter

The current plumbing does not change ``Analyzer.analyze`` or the ``GET`` result
path. When the LogSage/tool contract supports reuse, terminal analysis can add a
``use_progressive``-style option at the loganalysis layer while preserving the
existing caller-facing ``GET`` shape.
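
The delegation chain can be sketched as follows. Only the
``start_progressive_analysis(path, user, job_id)`` call shape comes from this
document; ``StubLogAnalyzer``, the ``tracked`` set, and the status dictionaries
are hypothetical placeholders for the real tracking and analyzer objects.

.. code-block:: python

   class StubLogAnalyzer:
       """Hypothetical analyzer that records calls and reports ``unsupported``."""

       def __init__(self):
           self.started = []

       def start_progressive_analysis(self, path, user, job_id):
           self.started.append(path)
           return {"status": "unsupported"}


   def submit_log(analyzer, tracked, path, user=None, job_id=None,
                  analysis_intent="track_only"):
       """Track the log, then optionally request an early start."""
       tracked.add(path)  # existing tracking happens for every submission
       if analysis_intent != "progressive":
           return None
       try:
           # The start call must return promptly; it never yields a result.
           return analyzer.start_progressive_analysis(path, user, job_id)
       except Exception as exc:
           # A failed start is not fatal: GET still runs terminal analysis.
           return {"status": "failed", "error": str(exc)}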

MCP / Loganalysis Contract
~~~~~~~~~~~~~~~~~~~~~~~~~~

The MCP/loganalysis boundary is NVRx-owned and should mirror the in-process
library adapter. The boundary needs two concepts:

Progressive start
   A non-result-producing operation used by the ``POST /logs`` path when
   ``analysis_intent="progressive"``. The initial code exposes this as
   ``log_analyzer_progressive_start``. It accepts the normalized ``log_path``,
   ``is_per_cycle=True`` for ``ft_launcher`` cycle logs, optional observability
   fields, and any runtime settings needed by LogSage to bind a future
   progressive session. It returns status metadata such as
   accepted/unsupported/failed and, if useful, a handle or session id. It must
   not return a final attribution result or create a cached MCP result resource.

Terminal run with progressive reuse
   The existing ``GET /logs`` path should still invoke terminal log analysis and
   receive the current LogSage-shaped result. Once the LogSage API is settled,
   the loganalysis boundary needs a way to ask the backend to reuse progressive
   state for the same path, for example an optional ``use_progressive=True``
   argument on ``log_analyzer``. If the backend cannot reuse state, the call
   should fall back to normal terminal analysis. The current plumbing leaves
   terminal ``GET`` unchanged.

Flight-recorder analysis is unchanged by this feature. ``POST /logs`` does not
notify FR. On ``GET /logs``, attrsvc can continue to run log analysis and FR
analysis with the existing terminal orchestration; only the log-analysis call
needs a way to reuse progressive LogSage state when it exists.
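
A progressive-start tool that satisfies the "status metadata only" rule could
be shaped like this sketch. The ``backend`` parameter and its ``start`` method
are assumptions standing in for whatever LogSage contract is eventually
agreed; passing no backend models the interim ``unsupported`` behavior.

.. code-block:: python

   def log_analyzer_progressive_start(log_path, is_per_cycle=True, backend=None):
       """Return status metadata only; never a final attribution result."""
       meta = {"log_path": log_path, "is_per_cycle": is_per_cycle}
       if backend is None:
           # Interim behavior until a progressive backend exists.
           meta.update(status="unsupported",
                       error="no progressive backend configured")
           return meta
       try:
           # Assumption: the backend exposes ``start`` and returns a session id.
           session_id = backend.start(log_path, is_per_cycle=is_per_cycle)
       except Exception as exc:
           meta.update(status="failed", error=str(exc))
           return meta
       meta.update(status="accepted", session_id=session_id)
       return meta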

Configuration
-------------

Add a service-side policy switch for progressive analysis. The default honors
explicit progressive requests because ``ft_launcher`` is the expected caller and
the ``POST`` path remains non-blocking even while the LogSage progressive API is
being implemented.

``NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS``
   Defaults to ``all_explicit``, which honors explicit progressive
   ``POST /logs`` requests. Set it to ``off`` to disable progressive start. A
   stricter ``ft_launcher``-only policy would require a caller identity or
   another server-side way to identify the submitter.
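
The gate can be sketched as a single predicate. ``progressive_allowed`` is a
hypothetical helper, and treating any value other than ``all_explicit`` as
disabled is an assumption; the document only defines ``all_explicit`` and
``off``.

.. code-block:: python

   import os

   ENV_VAR = "NVRX_ATTRSVC_PROGRESSIVE_ANALYSIS"


   def progressive_allowed(analysis_intent, environ=None):
       """Return True only for explicit requests that policy permits."""
       env = os.environ if environ is None else environ
       policy = env.get(ENV_VAR, "all_explicit").strip().lower()
       # Assumption: unrecognized policy values behave like "off".
       return policy == "all_explicit" and analysis_intent == "progressive"

Passing ``environ`` explicitly keeps the predicate easy to unit-test without
mutating the process environment.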

No attrsvc polling, cache, or concurrency settings are required in the current
plumbing. If the loganalysis tool later owns repeated LogSage calls over a
growing file, those operational controls should live with that tool
implementation rather than in the HTTP service wrapper.

Observability
-------------

Expose enough state to tell whether the latency optimization is working:

* Count progressive ``POST`` requests and whether they were accepted, ignored,
  or rejected by policy.
* Count started, unsupported, failed, canceled, completed, and fallback
  progressive analyses.
* Track final ``GET`` latency and, where possible, time saved by progressive
  work.
* Include active progressive paths in status/debug output without dumping log
  contents.
* Log when ``GET`` falls back to terminal analysis, including the fallback
  reason.
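
One way to hold these counters is sketched below. The class, method names, and
counter keys are hypothetical; a real service would more likely export them
through its existing metrics or status mechanism.

.. code-block:: python

   from collections import Counter


   class ProgressiveMetrics:
       """Hypothetical in-memory counters for the progressive path."""

       def __init__(self):
           self.counts = Counter()
           self.active_paths = set()
           self.get_latencies_s = []

       def record_post(self, outcome):
           # outcome: "accepted", "ignored", or "rejected"
           self.counts["post_" + outcome] += 1

       def record_start(self, log_path, status):
           # status: "started", "unsupported", "failed", ...
           self.counts["progressive_" + status] += 1
           if status == "started":
               self.active_paths.add(log_path)

       def record_get(self, log_path, latency_s, fallback_reason=None):
           self.active_paths.discard(log_path)
           self.get_latencies_s.append(latency_s)
           if fallback_reason is not None:
               self.counts["progressive_fallback"] += 1
               self.counts["fallback_reason_" + fallback_reason] += 1

       def snapshot(self):
           # Paths only; log contents are never included.
           return {"counts": dict(self.counts),
                   "active_paths": sorted(self.active_paths)}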

Compatibility
-------------

Existing attrsvc callers must continue to work without sending new fields.
Service-mode ``POST`` calls with ``job_id`` continue to support splitlog
detection and tracking. Single-file ``POST`` calls from older clients remain
track-only.

The final ``GET`` result must remain semantically compatible with the current
terminal analysis result. A progressive implementation may improve latency, but
must not weaken final attribution correctness.

NVRx-Side Implementation Plan
-----------------------------

1. Add an explicit progressive intent field to the shared HTTP helper and
   attrsvc ``SubmitRequest``.
2. Update the attribution-owned in-job HTTP client to send
   ``analysis_intent="progressive"`` on its normal ``POST /logs`` notification.
   ``ft_launcher`` should continue using the existing attribution submit hook
   and should not own the HTTP payload detail.
3. Thread the intent through ``AttributionHttpAdapter``,
   ``AttributionController``, and ``Analyzer.submit``.
4. In attrsvc, keep default ``POST`` behavior as track-only. When progressive
   intent is requested and the service feature gate allows it, initiate
   progressive analysis through ``LogAnalyzer.start_progressive_analysis``.
5. In MCP mode, expose ``log_analyzer_progressive_start`` as a
   non-result-producing tool. Until LogSage support exists, it returns
   ``unsupported`` status metadata.
6. On success, ``POST /logs`` returns its existing response shape.
   ``GET /logs`` remains the path that produces and records final attribution
   results.
7. Once the LogSage contract exists, extend the terminal loganalysis call so it
   can request progressive reuse while keeping the existing ``GET`` response
   shape and fallback behavior.

Progressive Execution Model
---------------------------

Attrsvc owns request plumbing and policy. The NVRx loganalysis tool owns the
progressive option exposed to attrsvc. The implementation behind that tool is
still a design choice and should be settled with the LogSage implementation.

LogSage-owned progression
   The loganalysis tool calls a LogSage ``start`` operation and returns.
   LogSage owns any background watching, tailing, checkpointing, or incremental
   state. The normal ``GET`` path calls terminal analysis with progressive reuse
   enabled. Attrsvc only sees accepted/unsupported/failure status.

Tool-owned progression
   The NVRx loganalysis tool advances analysis as the file grows, possibly by
   calling LogSage multiple times or by using a smaller LogSage ``advance``
   primitive. Polling/tailing configuration, concurrency limits, lifecycle
   cleanup, and state tracking belong to that tool implementation.

The requirement does not choose between these models. It requires that
``POST`` can request an early start, ``GET`` remains the stop/end decision path,
and terminal correctness is preserved by fallback to the existing full analysis.
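
For the tool-owned model, the core primitive is reading only the bytes appended
since the last consumed offset. The sketch below assumes byte offsets and
treats a file that shrank as stale state; ``read_new_bytes`` and
``StaleProgressiveState`` are hypothetical names, and how the chunk is fed to
LogSage is left open.

.. code-block:: python

   import os


   class StaleProgressiveState(Exception):
       """Raised when the log is shorter than the offset already consumed."""


   def read_new_bytes(log_path, consumed_offset):
       """Return (new_chunk, new_offset) for a growing log file."""
       size = os.path.getsize(log_path)
       if size < consumed_offset:
           # The file was truncated or replaced; earlier work cannot be reused.
           raise StaleProgressiveState(
               f"{log_path}: size {size} < consumed offset {consumed_offset}"
           )
       with open(log_path, "rb") as handle:
           handle.seek(consumed_offset)
           chunk = handle.read(size - consumed_offset)
       return chunk, consumed_offset + len(chunk)

A caller that sees ``StaleProgressiveState`` would discard its progressive
state and let ``GET`` use the existing full terminal analysis.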

LogSage API Proposal
--------------------

The current LogSage-style API is effectively terminal:

* input: complete ``log_path``
* output: final LogSage-shaped result and recommendation

For this feature, the NVRx loganalysis tool needs a way for LogSage to preserve
and reuse work for a growing per-cycle log. The shared behavior can be expressed
with two core operations, ``start`` and ``run``, plus a ``cancel`` operation for
cleanup:

``start(path, *, session_id=None, is_per_cycle=True, metadata=None)``
   Create or return progressive state for a log. This should be idempotent for
   the same normalized path or supplied session id.

``run(path, *, use_progressive=True, final=True)``
   Return the normal terminal LogSage result for the complete log, reusing any
   progressive state that was started for the same path. This is invoked from
   the existing ``GET /logs`` stop/end path.

``cancel(handle)``
   Release resources if the job ends without a final attribution request or the
   service is shutting down.

LogSage may internally expose an ``advance`` or completion primitive if that is
the cleanest implementation, or the NVRx loganalysis tool may drive repeated
LogSage calls. Attrsvc does not need a separate public
``finalize_progressive`` operation. The important part is that the terminal
``run`` can consume any remaining tail bytes, validate the final log state, and
produce the same result shape used today.

Minimum status metadata returned through the NVRx boundary:

``handle`` or ``session_id``
   Stable identifier for the progressive analysis.

``consumed_offset``
   Last byte or line offset included in progressive state.

``status``
   ``pending``, ``running``, ``ready_to_complete``, ``completed``, ``failed``,
   or ``unsupported``.

``error``
   Structured failure reason when status is ``failed`` or ``unsupported``.

The important semantic requirement is that the stop/end ``run`` must produce a
result equivalent to running terminal LogSage on the complete final file.
Attrsvc should not need to understand LogSage's internal summaries, LLM prompt
state, or whether reuse was LogSage-owned or tool-owned.
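
The proposed operations and the fallback rule can be written down as a typed
sketch. The method signatures mirror the proposal above; the
``LogSageBackend`` protocol name, the ``ProgressiveUnavailable`` exception, and
the dictionary return types are assumptions, since the real LogSage contract is
still open.

.. code-block:: python

   from typing import Any, Dict, Optional, Protocol


   class ProgressiveUnavailable(Exception):
       """Hypothetical signal that progressive state cannot be reused."""


   class LogSageBackend(Protocol):
       """Proposed operations; not an existing LogSage interface."""

       def start(self, path: str, *, session_id: Optional[str] = None,
                 is_per_cycle: bool = True,
                 metadata: Optional[dict] = None) -> Dict[str, Any]: ...

       def run(self, path: str, *, use_progressive: bool = True,
               final: bool = True) -> Dict[str, Any]: ...

       def cancel(self, handle: str) -> None: ...


   def terminal_run(backend: LogSageBackend, path: str) -> Dict[str, Any]:
       """Prefer progressive reuse, but always return a terminal result."""
       try:
           return backend.run(path, use_progressive=True, final=True)
       except ProgressiveUnavailable:
           # Missing, stale, or failed state: run the existing full analysis.
           return backend.run(path, use_progressive=False, final=True)

Because the fallback lives inside ``terminal_run``, the ``GET /logs`` caller
sees the same result shape whichever branch produced it.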

Validation
----------

* Unit-test that existing ``POST /logs`` requests remain track-only by default.
* Unit-test that the ``ft_launcher`` ``POST`` sends progressive intent.
* Unit-test that ``Analyzer.submit`` delegates progressive start to the
  loganalysis boundary only when the feature gate is enabled.
* Unit-test that the MCP/loganalysis progressive-start operation returns status
  metadata and does not run terminal attribution.
* Once LogSage/tool reuse exists, unit-test that the terminal MCP/loganalysis
  call can request progressive reuse while preserving the existing result shape.
* Unit-test that service-mode ``POST`` with ``job_id`` does not initiate
  progressive analysis by default.
* Once reuse exists, unit-test ``GET`` fallback when progressive state is
  missing, unsupported, failed, stale, or cannot be used to complete analysis.
* Unit-test that ``POST`` does not return a completed attribution result.
* Integration-test the full ``ft_launcher`` flow with a fake progressive
  analyzer: ``POST`` starts state, the log grows, and ``GET`` returns a normal
  recommendation through the existing stop/end path.
* End-to-end validate latency improvement with real LogSage progressive support
  once available.

Open Questions
--------------

* What exact LogSage API should back ``log_analyzer_progressive_start`` and
  terminal reuse?
* Should progressive advancement be LogSage-owned after ``start``, or should the
  NVRx loganalysis tool drive repeated LogSage calls over the growing file?
* If the NVRx loganalysis tool owns advancement, what polling and concurrency
  policy should it use?