Failure Attribution

Failure attribution collects job logs and optional trace data to explain training failures and produce caller-stable attribution results.
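
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the names `AttributionResult`, `attribute_failure`, and `RULES`, the log patterns, and the trace event shape (dicts with `name` and `status` keys) are all hypothetical. It also assumes that "caller-stable" means the result has a fixed schema and that identical inputs always yield an identical result.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional, Sequence, Tuple

# Hypothetical rules, ordered by priority: the first category with a match wins.
RULES = (
    ("out_of_memory", re.compile(r"out of memory", re.IGNORECASE)),
    ("communication", re.compile(r"NCCL|timed out", re.IGNORECASE)),
    ("numerical", re.compile(r"\bnan\b|\binf\b", re.IGNORECASE)),
)


@dataclass(frozen=True)
class AttributionResult:
    """Fixed-schema, immutable result (an assumed reading of "caller-stable")."""
    category: str
    evidence: Tuple[str, ...]      # log lines that matched the winning rule
    failed_spans: Tuple[str, ...]  # names of trace events with status "error"


def attribute_failure(
    log_lines: Iterable[str],
    trace_events: Optional[Sequence[Mapping[str, str]]] = None,
) -> AttributionResult:
    """Classify a training failure from job logs and optional trace data."""
    lines = list(log_lines)

    # Trace data is optional; when absent, failed_spans is simply empty.
    failed_spans = tuple(
        event["name"]
        for event in (trace_events or ())
        if event.get("status") == "error"
    )

    # Rules are checked in a fixed order so the outcome is deterministic.
    for category, pattern in RULES:
        evidence = tuple(line for line in lines if pattern.search(line))
        if evidence:
            return AttributionResult(category, evidence, failed_spans)

    return AttributionResult("unknown", (), failed_spans)
```

Under these assumptions, calling `attribute_failure(["step 10 loss=2.3", "RuntimeError: CUDA out of memory"])` returns a result whose `category` is `"out_of_memory"`, and repeating the call with the same inputs returns an equal result. Using a frozen dataclass and a fixed rule order is one way to keep results predictable for callers; a real system would likely use richer signals than regular expressions.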