Release Notes: v1.16.0¶
Release Date: 2026-02-26
Branch: feature/v1.16.0
Focus: Profile-driven tool loop, multi-tool support, agent UI, benchmark v2
Overview¶
v1.16.0 rewrites the core tool calling loop in chat.py with profile-driven routing, proper tool role messages, multi-tool support, and grouped tool call UI across all 4 clients. This is the largest single release in ppxai history.
Key Numbers:
- 154 files changed, 30,400+ lines added
- 1,536 tests passing (up from 1,349 in v1.15.6)
- 36 benchmark tests across 9 categories (up from 28 in 7 categories)
- 100+ benchmark runs across 29 model variants
- 7 implementation steps over 4 days
Breaking Changes: Tool message format changed from synthetic assistant/user pairs to proper tool role messages for native mode. Session migration is automatic (v1.15.x sessions load via None-safe .get()). Prompt-based mode is unchanged.
Major Changes¶
1. Provider Hierarchy (Step 1)¶
What: All providers now inherit from BaseProvider ABC with a shared interface.
Why: chat.py relied on hasattr guards and duck-typing to handle provider differences. This made adding new providers fragile and the tool loop logic hard to follow.
Changes:
- BaseProvider ABC defines the full provider interface: stream(), get_capabilities_for_model(), get_model_profile(), list_models(), validate_config(), _convert_messages(), _get_generation_params(), _get_max_tokens(), _parse_usage()
- OpenAINativeProvider, GeminiProvider, and OpenAICompatibleProvider all inherit from BaseProvider
- Removed all hasattr guards from chat.py — providers are called through guaranteed interface methods
- 61 new parametrized tests in test_provider_hierarchy.py
Impact: Internal refactoring only. No user-facing changes.
2. Profile-Driven Tool Loop (Step 2)¶
What: ToolCallingProfile.mode ("native", "prompt_based", "auto") replaces the binary native_tool_calling: bool decision.
Why: The v1.15.6 benchmark analysis (27 models, 54+ runs) showed that models need different tool calling strategies:
- o4-mini: 80.8% prompt-based vs 11.5% native
- gpt-4.1-mini: 71.9% prompt-based vs 60.9% native
- GPT-5.2: needs JSON stripping from response text
- Codex models: need Responses API routing
How it works:
1. Look up ModelProfile for current model
2. Merge with AGENTS.md overrides → ppxai-config.json overrides
3. Check tc_profile.mode:
- "native" → send tools in API request, parse tool_calls from response
- "prompt_based" → inject tool schema into system prompt, parse JSON from text
- "auto" → try native first, fall back to prompt-based on empty/failure
4. Fallback flags:
- fallback_on_empty → native returns empty → retry with prompt-based
- fallback_on_failure → native tool parse fails → try prompt-based parser
5. Belt-and-suspenders → models with fallback flags get tool hints in the system prompt even in native mode (safety net)
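The routing steps above can be sketched as follows. This is a minimal illustration of the decision logic, not the actual ppxai implementation: `ToolCallingProfile` fields match the release notes, but `resolve_tool_strategy` is a hypothetical helper name.

```python
from dataclasses import dataclass

@dataclass
class ToolCallingProfile:
    mode: str = "native"              # "native" | "prompt_based" | "auto"
    fallback_on_empty: bool = False
    fallback_on_failure: bool = False

def resolve_tool_strategy(profile: ToolCallingProfile,
                          native_empty: bool = False,
                          native_failed: bool = False) -> str:
    """Decide which strategy to use for the next attempt."""
    if profile.mode == "prompt_based":
        return "prompt_based"
    # "auto": try native first, fall back on empty response or parse failure
    if profile.mode == "auto" and (native_empty or native_failed):
        return "prompt_based"
    # "native" with explicit fallback flags set in the profile
    if profile.mode == "native":
        if native_empty and profile.fallback_on_empty:
            return "prompt_based"
        if native_failed and profile.fallback_on_failure:
            return "prompt_based"
    return "native"
```

The key property is that a plain `"native"` profile never silently degrades; only profiles with fallback flags (or `"auto"` mode) retry with the prompt-based parser.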
Truncation recovery: Detects raw JSON truncation (unbalanced braces), sends escalating recovery messages, caps retries at 3 (MAX_TRUNCATION_RETRIES) with stuck_tool_loop WARNING event.
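A brace-balancing check in the spirit of the truncation detection described above might look like this (an illustrative sketch, not the ppxai code; `looks_truncated` is a hypothetical name):

```python
def looks_truncated(text: str) -> bool:
    """True if JSON-ish text has unbalanced braces/brackets or an
    unterminated string literal, suggesting a truncated tool call."""
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if escaped:
            escaped = False
            continue
        if ch == "\\":
            escaped = in_string   # escape sequences only matter inside strings
            continue
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
    return depth > 0 or in_string
```

When this fires, the loop sends a recovery message asking the model to re-emit the call, giving up after `MAX_TRUNCATION_RETRIES`.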
Key files:
- ppxai/engine/chat.py — Mode routing (~line 470), fallback logic, truncation recovery
- tests/test_chat_profile_routing.py — 16 routing tests
- tests/test_engine_tool_parsing.py — 7 truncation + 4 stuck-loop tests
3. Proper Tool Messages (Step 3)¶
What: Native mode now uses proper tool role messages instead of synthetic assistant/user pairs.
Before (v1.15.x):

```python
# Synthetic pair — all providers
messages.append(Message(role="assistant", content="[tool call: read_file]"))
messages.append(Message(role="user", content="[tool result: file contents]"))
```

After (v1.16.0):

```python
# Native mode — proper tool messages
messages.append(Message(
    role="assistant",
    content="",
    tool_calls=[{"id": "tc_1", "function": {"name": "read_file", "arguments": "..."}}],
))
messages.append(Message(
    role="tool",
    content="file contents",
    tool_call_id="tc_1",
))

# Prompt-based mode — unchanged synthetic pairs
messages.append(Message(role="assistant", content="[tool call: read_file]"))
messages.append(Message(role="user", content="[tool result: file contents]"))
```
Why: OpenAI, Gemini, and other providers expect tool role messages when using native function calling. Synthetic pairs worked but caused some models to get confused about conversation structure.
Changes:
- Message dataclass extended with tool_calls: Optional[List[Dict]] and tool_call_id: Optional[str]
- All 4 providers handle tool role in _convert_messages():
- base.py — default conversion
- openai_native.py — OpenAI-specific format with function wrapper
- openai_compat.py — OpenAI-compatible format
- gemini.py — Gemini's functionCall/functionResponse format
- Session serialization updated to save/load new fields
- v1.15.x session migration: m.get("tool_calls") / m.get("tool_call_id") returns None
- Message order validation allows tool messages after assistant(tool_calls)
- 28 new tests in test_tool_messages.py
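The None-safe migration mentioned above can be sketched as plain `dict.get()` defaults when deserializing: v1.15.x message dicts simply lack the new keys. The `Message` dataclass and `load_message` helper here are stand-ins, not the actual ppxai types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Message:
    role: str
    content: str
    tool_calls: Optional[List[Dict]] = None
    tool_call_id: Optional[str] = None

def load_message(m: dict) -> Message:
    return Message(
        role=m["role"],
        content=m["content"],
        tool_calls=m.get("tool_calls"),      # None for v1.15.x sessions
        tool_call_id=m.get("tool_call_id"),  # None for v1.15.x sessions
    )
```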
4. Multi-Tool Support (Step 4)¶
What: All native tool calls in a response are processed, not just the first one.
Before: native_tool_calls[0] — only the first tool call was executed.
After: for tc in tool_calls_list — all tool calls are executed sequentially.
Gating: The parallel_tool_calls profile flag controls this behavior:
- True (qwen3-coder, gpt-5.2, gemini-3.1-pro-customtools): Process all tool calls
- False (default): Process only the first tool call (preserves v1.15.x behavior)
Sequential execution: Even when processing multiple calls, they're executed one at a time with individual consent prompts and loop detection per tool.
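The gating described above amounts to selecting either the full list or just the first call, then executing sequentially either way. A minimal sketch (hypothetical names; `execute_tool` stands in for the consent-gated executor):

```python
from typing import Callable, Dict, List

def run_tool_calls(tool_calls: List[Dict],
                   parallel_tool_calls: bool,
                   execute_tool: Callable[[Dict], str]) -> List[str]:
    # Flag off: preserve v1.15.x behavior and execute only the first call.
    selected = tool_calls if parallel_tool_calls else tool_calls[:1]
    results = []
    for tc in selected:  # one at a time, even when the flag is on
        results.append(execute_tool(tc))
    return results
```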
Key files:
- ppxai/engine/chat.py — Multi-tool loop (~line 639), profile gating (~line 607)
5. Agent UI Noise Reduction (Step 5)¶
What: Tool calls are grouped per iteration with collapsible UI across all 4 clients.
New engine events:
- TOOL_GROUP_START — emitted before each iteration's tool calls (contains iteration number)
- TOOL_GROUP_END — emitted after (contains tool names, success/failure counts)
- AGENT_COMPLETE — emitted when tool loop finishes (iteration count, commit hash)
Client rendering:
| Client | Rendering |
|---|---|
| Web app | Collapsible .tool-group containers, checkpoint bubble suppression, undo badge on commits only |
| VSCode | Tool group forwarding via stream.ts → chatPanel.ts, CSS styling |
| ppxaide TUI | Non-verbose: one summary line per group. Verbose: unchanged individual bubbles |
| ppxai Rich CLI | Dim separator lines with iteration number and status |
SSE fixes:
- Event type dispatch: side-channel events now emit their actual EventType instead of all being sent as consent_request
- Consent deadlock: SSE generator uses racing poll pattern (asyncio.ensure_future + 100ms polling) instead of async for
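The racing-poll idea above can be sketched with `asyncio.wait` on a pending future plus a short timeout, so the generator can check for stop conditions (e.g. a consent prompt or disconnect) between events instead of blocking inside `async for`. Names here are illustrative, not the ppxai SSE code.

```python
import asyncio

async def drain_events(queue: asyncio.Queue, should_stop) -> list:
    """Poll the next-event future every 100 ms instead of blocking on it."""
    sent = []
    pending = None
    while True:
        if pending is None:
            pending = asyncio.ensure_future(queue.get())
        done, _ = await asyncio.wait({pending}, timeout=0.1)
        if pending in done:
            event = pending.result()
            pending = None
            if event is None:       # sentinel: stream finished
                break
            sent.append(event)
        elif should_stop():         # checked between polls, so no deadlock
            pending.cancel()
            break
    return sent
```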
6. Config Integration (Step 6)¶
What: Per-model tool calling overrides with 3-layer precedence.
Precedence (highest wins):
1. ppxai-config.json — user's explicit config
2. AGENTS.md — project-specific hints with tool_calling: YAML front matter
3. Built-in profile — model_profiles.py default
ppxai-config.json example:
```json
{
  "providers": {
    "local-vllm": {
      "models": {
        "*/qwen3-coder-30b*": {
          "tool_calling": {
            "mode": "native",
            "parallel_tool_calls": true
          }
        }
      }
    }
  }
}
```
AGENTS.md example:
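The original snippet is not reproduced here; a plausible front-matter sketch, reusing the `tool_calling` fields from the JSON example above (the exact keys AGENTS.md accepts are an assumption):

```yaml
---
tool_calling:
  mode: prompt_based
  parallel_tool_calls: false
---
```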
/model info command: Shows effective profile with source attribution per field (e.g., "mode: native (built-in profile)" vs "mode: prompt_based (AGENTS.md override)").
Key files:
- ppxai/config/__init__.py — get_tool_calling_config()
- ppxai/engine/bootstrap.py — _parse_tool_calling_section(), get_tool_calling_overrides()
- ppxai/engine/chat.py — _get_effective_profile() (3-layer merge)
- 16 new tests across config, bootstrap, and profile merging
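The 3-layer precedence above is a straightforward layered merge where later layers win on any explicitly set field. A minimal sketch (the merge helper name is hypothetical):

```python
def effective_profile(builtin: dict, agents_md: dict, user_config: dict) -> dict:
    """Merge profile layers; ppxai-config.json > AGENTS.md > built-in."""
    merged = dict(builtin)
    for layer in (agents_md, user_config):  # applied lowest to highest
        for key, value in layer.items():
            if value is not None:           # unset fields don't override
                merged[key] = value
    return merged
```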
7. Benchmark v2 (Step 7)¶
What: Expanded from 28 tests/7 categories to 36 tests/9 categories with agentic multi-turn tests, efficiency metrics, and AGENTS.md delta testing.
New categories:
- agentic_tool_loops — multi-turn tool call chains requiring search → read → edit patterns
- efficiency — measures token usage and tool call redundancy
New tests:
| Test | Category | Description | Scoring |
|---|---|---|---|
| patch_apply_verify | code_editing | Generate patch, apply with _replace_hunk(), verify fix | 0.0/0.5/0.7/1.0 |
| search_then_edit | agentic_tool_loops | search_code → read_file → apply_patch (3 turns) | steps/3 |
| fix_verify | agentic_tool_loops | write → test → fail → fix → retest (4 turns) | steps/4 |
| information_gathering | agentic_tool_loops | Find and read 3 auth-related files | files_found/3 |
| error_recovery_chain | agentic_tool_loops | Handle not-found → search → read → permission denied (4 turns) | steps/4 |
| multi_file_review | agentic_tool_loops | Read all files before making claims | files_read/total |
| claim_without_action | hallucination_resistance | Refuse to fabricate without tool calls | 0.0 or 1.0 |
| consecutive_tool_loop | agentic_tool_loops | 5-step dependent chain | steps/5 |
| time_to_first_tool_call | efficiency | Penalize preamble >100 chars | 0.0/0.5/1.0 |
| tool_call_efficiency | efficiency | Score by redundant calls vs optimal | 0.3-1.0 |
AGENTS.md delta testing:
- --agents-md both runs suite twice per model (with/without AGENTS.md hints)
- Reports per-category score delta and overall percentage lift
- Biggest delta: gemini-3.1-pro-customtools +20.1% (61.4% → 81.5% with AGENTS.md)
Token/tool tracking:
- total_tokens and total_tool_calls in benchmark metadata
- Per-test tokens_used and tool_calls_made in test details
Duplicate tool call detection:
- _dedup_tool_call() helper returns [DUPLICATE CALL] feedback for repeated tool+args
- exempt_tools set for tools with intentionally varying results (run_command, search_code)
Benchmark Rankings (v1.16.0, 36 tests)¶
| Rank | Model | Score | Mode | Tier |
|---|---|---|---|---|
| 1 | qwen3-coder (cloud) | 95.8% | native | S |
| 2 | gpt-5.2 | 91.4% | native | A |
| 3 | gemini-2.5-flash | 90.6% | native | S |
| 4 | gpt-5 | 89.1% | native | A |
| 5 | gemini-2.5-pro | 87.5% | native | S |
| 6 | gpt-5-mini | 86.5% | native | A |
| 7 | gemini-3-flash | 84.4% | native | S |
| 8 | sonar-pro | 84.4% | prompt-based | A |
| 9 | gpt-4.1 | 82.8% | native | A |
| 10 | gemini-3.1-pro-customtools | 81.5% | native | A |
| 11 | gemini-3.1-pro | 81.5% | native | A |
| 12 | Qwen3-Coder-30B (DGX) | 81.2% | native | S |
| 13 | o4-mini | 80.8% | prompt-based | B |
| 14 | sonar | 76.6% | prompt-based | B |
| 15 | gpt-4.1-mini | 71.9% | prompt-based | B |
AGENTS.md Delta Testing Results¶
| Model | WITH | WITHOUT | Delta |
|---|---|---|---|
| gemini-3.1-pro-customtools | 81.5% | 61.4% | +20.1% |
| sonar-pro | 84.4% | 68.7% | +15.7% |
| gpt-5.2 | 91.4% | 82.8% | +8.6% |
| gemini-2.5-flash | 90.6% | 84.4% | +6.2% |
| qwen3-coder | 95.8% | 93.7% | +2.1% |
New Commands¶
/ls — Directory Listing¶
Lists files and directories with size, modification time, and type indicators.
Available in ppxaide TUI, Web app, and ppxai Rich CLI. HTTP endpoint: GET /files/list?path=...
/tree — Directory Tree¶
Shows directory structure as an indented tree with file counts.
```text
/tree                # Tree of current directory
/tree /path/to/dir   # Tree of specific directory
/tree --depth 2      # Limit depth
```
Available in all 3 clients. HTTP endpoint: GET /files/tree?path=...&depth=...
/model info — Model Profile Info¶
Shows the effective tool calling profile for the current model with source attribution.
```text
/model info
# Output:
# Model: gpt-5.2
# Tool Calling Profile:
#   mode: native (built-in profile)
#   strip_json_from_text: true (built-in profile)
#   parallel_tool_calls: true (AGENTS.md override)
#   fallback_on_empty: false (default)
```
Session Management¶
Model Switch Context Reset¶
Switching models now resets session context to prevent cross-model confusion:
- session.reset_for_model_switch() clears conversation history
- Commands show count of cleared messages
- Session restore paths pass reset_context=False to preserve history on load
Per-Model Iteration Limits¶
ModelProfile.max_tool_iterations sets the maximum tool loop iterations per model:
| Model | Max Iterations |
|---|---|
| gemini-2.5-pro/flash | 25 |
| gemini-3.1-pro | 20 |
| sonar-pro, sonar | 20 |
| qwen3-coder | 20 |
| codex-mini | 20 |
| Default | 15 |
Session Pollution Detection¶
After the first tool loop iteration, check_session_pollution() computes bigram similarity between the model's latest response and the previous one. Similarity >90% triggers a WARNING event, indicating the model is repeating itself.
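One common way to compute such a similarity is Jaccard overlap over character bigrams; the notes don't specify the exact formula, so treat this sketch as an assumption rather than the ppxai implementation:

```python
def bigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of character-bigram sets, in [0.0, 1.0]."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:   # both too short to form bigrams
        return 1.0
    return len(ba & bb) / len(ba | bb)
```

A threshold check like `bigram_similarity(latest, previous) > 0.9` would then trigger the WARNING event.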
SSE Disconnect Detection¶
request.is_disconnected() is checked in the SSE event generator. When a client disconnects mid-stream, the server stops processing instead of continuing to generate events into the void.
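Starlette's `Request.is_disconnected()` is a coroutine, so the check sits inside the generator loop. A minimal sketch of the shape (the generator itself is illustrative, not the ppxai handler):

```python
async def event_stream(request, events):
    """Yield SSE frames, stopping as soon as the client disconnects."""
    async for event in events:
        if await request.is_disconnected():
            break                       # client gone: stop generating
        yield f"data: {event}\n\n"
```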
Files Changed¶
New Files¶
| File | Description |
|---|---|
| tests/test_provider_hierarchy.py | 61 provider hierarchy tests |
| tests/test_chat_profile_routing.py | 16 profile routing + 11 truncation tests |
| tests/test_tool_messages.py | 28 tool message format tests |
| benchmarks/llm-eval/test_cases.py (8 new tests) | Agentic + efficiency benchmark tests |
Modified Files (Key)¶
| File | Change |
|---|---|
| ppxai/engine/chat.py | Profile-driven tool loop, proper tool messages, multi-tool, grouped events |
| ppxai/engine/types.py | Message.tool_calls, Message.tool_call_id, new EventTypes |
| ppxai/engine/session.py | reset_for_model_switch(), tool message serialization, validation |
| ppxai/engine/client.py | Model switch reset, config reload |
| ppxai/engine/providers/base.py | BaseProvider ABC with shared interface |
| ppxai/engine/providers/openai_native.py | tool role message conversion |
| ppxai/engine/providers/openai_compat.py | tool role message conversion |
| ppxai/engine/providers/gemini.py | functionCall/functionResponse conversion |
| ppxai/engine/model_profiles.py | Updated tiers, Gemini 3.1 profiles |
| ppxai/engine/bootstrap.py | tool_calling YAML parsing, overrides |
| ppxai/engine/context.py | tool_calling_overrides scope merging |
| ppxai/config/__init__.py | get_tool_calling_config() |
| ppxai/server/http.py | /files/list, /files/tree, SSE fixes, consent deadlock fix |
| ppxai/commands/utility.py | /ls, /tree commands |
| ppxai/commands/provider.py | /model info command |
| ppxai/commands/session.py | Model switch reset UI |
| ppxai/tui/app.py | Tool group rendering, non-verbose mode |
| AGENTS.md | Gemini 3.1 hints, Perplexity rewrite, tool calling YAML |
| ppxai-config.example.json | tool_calling override examples |
| benchmarks/llm-eval/engine_runner.py | AGENTS.md delta, token tracking, partial credit |
Migration Notes¶
Session Format¶
v1.15.x sessions load automatically. The new tool_calls and tool_call_id fields on Message default to None when loading old sessions. New sessions saved in v1.16.0 format are not backwards-compatible with v1.15.x.
Configuration¶
No configuration changes required. Existing ppxai-config.json files work without modification.
New optional config: Per-model tool_calling overrides in provider config:
```json
{
  "providers": {
    "my-provider": {
      "models": {
        "model-name": {
          "tool_calling": {
            "mode": "prompt_based"
          }
        }
      }
    }
  }
}
```
AGENTS.md¶
New optional tool_calling: YAML front matter section for project-specific tool calling overrides. Existing AGENTS.md files work unchanged.
Provider API¶
BaseProvider is now the required base class. Custom providers extending OpenAICompatibleProvider are unaffected (it inherits from BaseProvider). Direct provider implementations must implement the BaseProvider interface.
Test Summary¶
| Category | Count |
|---|---|
| Provider hierarchy tests | 61 |
| Profile routing + truncation tests | 27 |
| Tool message tests | 28 |
| Config + bootstrap + profile merging | 16 |
| Benchmark test definitions | 36 |
| All other tests | ~1,368 |
| Total | 1,536 |
All tests passing on macOS (Python 3.12).