new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jun 5

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics

Battery health diagnostics today rely overwhelmingly on electrochemical signals measured at the cell terminals. A parallel literature has shown that magnetic sensing can resolve information that terminal-only measurements miss, but method development is limited by the absence, to the best of our knowledge, of public battery magnetic-measurement datasets paired with degradation labels. We release MagBridge-Battery v1.0, a synthetic dataset of 6,760 magnetic-field signatures that bridges real magnetic morphology from the Mohammadi-Jerschow Open Science Framework (OSF) archive with state-of-health (SOH) labels from the PulseBat dataset. The release contains 5,600 PulseBat-conditioned grounded samples, 600 synthetic sensor-anomaly samples derived from clean parents, and 560 low-voltage Regime-B extrapolation samples. A cell-disjoint, parent-child-leakage-free primary benchmark split is verified to contain zero overlapping cells, zero cross-split parent-child pairs, and zero sample-ID overlap. We define three primary benchmark tasks: SOH regression, second-life classification, and anomaly detection, plus an auxiliary anomaly-subtype classification task. A controlled label-shuffle ablation collapses SOH regression from R^2 approximately 0.77 to approximately 0, confirming that the bridge encodes input SOH non-trivially rather than producing label-aligned artifacts. The dataset is released on Zenodo under CC-BY-4.0, and the bridge code and benchmark suite are released under Apache-2.0. This work provides a public benchmark for magnetic-sensing battery diagnostics while paired magnetic-electrochemical measurements remain scarce.

  • 2 authors
·
May 16

Terminal Lucidity: Envisioning the Future of the Terminal

The Unix terminal, or just simply, the terminal, can be found being applied in almost every facet of computing. It is available across all major platforms and often integrated into other applications. Due to its ubiquity, even marginal improvements to the terminal have the potential to make massive improvements to productivity on a global scale. We believe that evolutionary improvements to the terminal, in its current incarnation as windowed terminal emulator, are possible and that developing a thorough understanding of issues that current terminal users face is fundamental to knowing how the terminal should evolve. In order to develop that understanding we have mined Unix and Linux Stack Exchange using a fully-reproducible method which was able to extract and categorize 91.0% of 1,489 terminal-related questions (from the full set of nearly 240,000 questions) without manual intervention. We present an analysis, to our knowledge the first of its kind, of windowed terminal-related questions posted over a 15-year period and viewed, in aggregate, approximately 40 million times. As expected, given its longevity, we find the terminal's many features being applied across a wide variety of use cases. We find evidence that the terminal, as windowed terminal emulator, has neither fully adapted to its now current graphical environment nor completely untangled itself from features more suited to incarnations in previous environments. We also find evidence of areas where we believe the terminal could be extended along with other areas where it could be simplified. Surprisingly, while many current efforts to improve the terminal include improving the terminal's social and collaborative aspects, we find little evidence of this as a prominent pain point.

  • 3 authors
·
Apr 18, 2025