Release Notes: v1.16.0

Release Date: 2026-02-26
Branch: feature/v1.16.0
Focus: Profile-driven tool loop, multi-tool support, agent UI, benchmark v2


Overview

v1.16.0 rewrites the core tool calling loop in chat.py with profile-driven routing, proper tool role messages, multi-tool support, and grouped tool call UI across all 4 clients. This is the largest single release in ppxai history.

Key Numbers:

- 154 files changed, 30,400+ lines added
- 1,536 tests passing (up from 1,349 in v1.15.6)
- 36 benchmark tests across 9 categories (up from 28 tests in 7 categories)
- 100+ benchmark runs across 29 model variants
- 7 implementation steps over 4 days

Breaking Changes: Tool message format changed from synthetic assistant/user pairs to proper tool role messages for native mode. Session migration is automatic (v1.15.x sessions load via None-safe .get()). Prompt-based mode is unchanged.


Major Changes

1. Provider Hierarchy (Step 1)

What: All providers now inherit from BaseProvider ABC with a shared interface.

Why: chat.py relied on hasattr guards and duck-typing to handle provider differences. This made adding new providers fragile and the tool loop logic hard to follow.

Changes:

- BaseProvider ABC defines the full provider interface: stream(), get_capabilities_for_model(), get_model_profile(), list_models(), validate_config(), _convert_messages(), _get_generation_params(), _get_max_tokens(), _parse_usage()
- OpenAINativeProvider, GeminiProvider, and OpenAICompatibleProvider all inherit from BaseProvider
- Removed all hasattr guards from chat.py — providers are now called through guaranteed interface methods
- 61 new parametrized tests in test_provider_hierarchy.py

Impact: Internal refactoring only. No user-facing changes.

2. Profile-Driven Tool Loop (Step 2)

What: ToolCallingProfile.mode ("native", "prompt_based", "auto") replaces the binary native_tool_calling: bool decision.

Why: The v1.15.6 benchmark analysis (27 models, 54+ runs) showed that models need different tool calling strategies:

- o4-mini: 80.8% prompt-based vs 11.5% native
- gpt-4.1-mini: 71.9% prompt-based vs 60.9% native
- GPT-5.2: needs JSON stripping from response text
- Codex models: need Responses API routing

How it works:

1. Look up ModelProfile for current model
2. Merge with AGENTS.md overrides → ppxai-config.json overrides
3. Check tc_profile.mode:
   - "native" → send tools in API request, parse tool_calls from response
   - "prompt_based" → inject tool schema into system prompt, parse JSON from text
   - "auto" → try native first, fall back to prompt-based on empty/failure
4. Fallback flags:
   - fallback_on_empty → native returns empty → retry with prompt-based
   - fallback_on_failure → native tool parse fails → try prompt-based parser
5. Belt-and-suspenders → models with fallback flags get tool hints in system prompt
   even in native mode (safety net)
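The routing decision above can be sketched as follows. This is an illustrative reduction with hypothetical names — the real ToolCallingProfile and the routing code in chat.py carry more fields and handle streaming details:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for the real ToolCallingProfile.
@dataclass
class ToolCallingProfile:
    mode: str = "native"              # "native" | "prompt_based" | "auto"
    fallback_on_empty: bool = False
    fallback_on_failure: bool = False

def route_tool_call(profile: ToolCallingProfile,
                    native_result: Optional[list]) -> str:
    """Decide which parsing path handles this response."""
    if profile.mode == "prompt_based":
        return "prompt_based"
    # "native" and "auto" both try the native parse first
    if native_result:
        return "native"
    # Native came back empty (or failed to parse)
    if profile.mode == "auto" or profile.fallback_on_empty:
        return "prompt_based"
    return "native"  # no fallback configured: stay on the native path
```

The "belt-and-suspenders" hint injection is orthogonal to this choice: it changes what goes into the system prompt, not which parser runs.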

Truncation recovery: Detects raw JSON truncation (unbalanced braces), sends escalating recovery messages, caps retries at 3 (MAX_TRUNCATION_RETRIES) with stuck_tool_loop WARNING event.
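The unbalanced-brace heuristic can be sketched like this; it is an illustrative check, not the exact detector in chat.py:

```python
def looks_truncated(text: str) -> bool:
    """Heuristic truncation check: unbalanced JSON braces in raw text.

    Braces inside string literals are ignored by tracking quote state
    and backslash escapes.
    """
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if escaped:
            escaped = False
            continue
        if ch == "\\":
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
    return depth != 0
```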

Key files:

- ppxai/engine/chat.py — mode routing (~line 470), fallback logic, truncation recovery
- tests/test_chat_profile_routing.py — 16 routing tests
- tests/test_engine_tool_parsing.py — 7 truncation + 4 stuck-loop tests

3. Proper Tool Messages (Step 3)

What: Native mode now uses proper tool role messages instead of synthetic assistant/user pairs.

Before (v1.15.x):

# Synthetic pair — all providers
messages.append(Message(role="assistant", content="[tool call: read_file]"))
messages.append(Message(role="user", content="[tool result: file contents]"))

After (v1.16.0):

# Native mode — proper tool messages
messages.append(Message(
    role="assistant",
    content="",
    tool_calls=[{"id": "tc_1", "function": {"name": "read_file", "arguments": "..."}}]
))
messages.append(Message(
    role="tool",
    content="file contents",
    tool_call_id="tc_1"
))

# Prompt-based mode — unchanged synthetic pairs
messages.append(Message(role="assistant", content="[tool call: read_file]"))
messages.append(Message(role="user", content="[tool result: file contents]"))

Why: OpenAI, Gemini, and other providers expect tool role messages when using native function calling. Synthetic pairs worked but caused some models to get confused about conversation structure.

Changes:

- Message dataclass extended with tool_calls: Optional[List[Dict]] and tool_call_id: Optional[str]
- All 4 providers handle the tool role in _convert_messages():
  - base.py — default conversion
  - openai_native.py — OpenAI-specific format with function wrapper
  - openai_compat.py — OpenAI-compatible format
  - gemini.py — Gemini's functionCall/functionResponse format
- Session serialization updated to save/load the new fields
- v1.15.x session migration: m.get("tool_calls") / m.get("tool_call_id") returns None
- Message order validation allows tool messages after assistant(tool_calls)
- 28 new tests in test_tool_messages.py
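A simplified sketch of the extended dataclass and the None-safe migration path. The field names come from these notes; the loader function and everything else is illustrative:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Illustrative sketch of the extended Message dataclass.
@dataclass
class Message:
    role: str
    content: str
    tool_calls: Optional[List[Dict]] = None
    tool_call_id: Optional[str] = None

def message_from_session(raw: dict) -> Message:
    """Load a serialized message. v1.15.x dicts lack the new keys, so
    .get() returns None and old sessions migrate transparently."""
    return Message(
        role=raw["role"],
        content=raw.get("content", ""),
        tool_calls=raw.get("tool_calls"),
        tool_call_id=raw.get("tool_call_id"),
    )
```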

4. Multi-Tool Support (Step 4)

What: All native tool calls in a response are processed, not just the first one.

Before: native_tool_calls[0] — only the first tool call was executed.

After: for tc in tool_calls_list — all tool calls are executed sequentially.

Gating: The parallel_tool_calls profile flag controls this behavior:

- True (qwen3-coder, gpt-5.2, gemini-3.1-pro-customtools): process all tool calls
- False (default): process only the first tool call (preserves v1.15.x behavior)

Sequential execution: Even when processing multiple calls, they're executed one at a time with individual consent prompts and loop detection per tool.
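The gating and sequential execution can be sketched as follows, with hypothetical helper names (the real loop in chat.py also handles consent prompts and loop detection per call):

```python
def select_tool_calls(tool_calls: list, parallel_tool_calls: bool) -> list:
    """Gate how many native tool calls this iteration will execute.

    When the profile flag is off, only the first call runs (the
    v1.15.x behavior); when on, every call in the response runs.
    """
    if not tool_calls:
        return []
    return list(tool_calls) if parallel_tool_calls else tool_calls[:1]

def run_iteration(tool_calls, parallel_tool_calls, execute):
    """Execute the gated calls one at a time, in order."""
    results = []
    for tc in select_tool_calls(tool_calls, parallel_tool_calls):
        # Sequential: each call completes before the next starts
        results.append(execute(tc))
    return results
```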

Key files: - ppxai/engine/chat.py — Multi-tool loop (~line 639), profile gating (~line 607)

5. Agent UI Noise Reduction (Step 5)

What: Tool calls are grouped per iteration with collapsible UI across all 4 clients.

New engine events:

- TOOL_GROUP_START — emitted before each iteration's tool calls (contains the iteration number)
- TOOL_GROUP_END — emitted after the iteration's tool calls (contains tool names, success/failure counts)
- AGENT_COMPLETE — emitted when the tool loop finishes (iteration count, commit hash)

Client rendering:

| Client | Rendering |
|---|---|
| Web app | Collapsible .tool-group containers, checkpoint bubble suppression, undo badge on commits only |
| VSCode | Tool group forwarding via stream.ts → chatPanel.ts, CSS styling |
| ppxaide TUI | Non-verbose: one summary line per group. Verbose: unchanged individual bubbles |
| ppxai Rich CLI | Dim separator lines with iteration number and status |

SSE fixes:

- Event type dispatch: side-channel events now emit their actual EventType instead of all being sent as consent_request
- Consent deadlock: the SSE generator uses a racing poll pattern (asyncio.ensure_future + 100 ms polling) instead of async for
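A minimal sketch of the racing poll pattern, assuming events arrive on an asyncio.Queue terminated by a None sentinel. The real SSE generator also yields each event to the client and uses the poll window to check consent replies and disconnects:

```python
import asyncio

async def drain_with_poll(queue: asyncio.Queue, is_disconnected):
    """Race queue.get() against a short timeout so the generator can
    poll between events instead of blocking forever in `async for`."""
    events = []
    get_task = asyncio.ensure_future(queue.get())
    while True:
        done, _ = await asyncio.wait({get_task}, timeout=0.1)
        if await is_disconnected():
            get_task.cancel()  # client went away: stop producing
            break
        if get_task in done:
            event = get_task.result()
            if event is None:  # sentinel: stream finished
                break
            events.append(event)
            get_task = asyncio.ensure_future(queue.get())
    return events
```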

6. Config Integration (Step 6)

What: Per-model tool calling overrides with 3-layer precedence.

Precedence (highest wins):

1. ppxai-config.json — the user's explicit config
2. AGENTS.md — project-specific hints in tool_calling: YAML front matter
3. Built-in profile — model_profiles.py default
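Assuming each layer is a flat dict of profile fields, the precedence merge can be sketched as below (illustrative function name, not the real _get_effective_profile()):

```python
def merge_profiles(builtin: dict, agents_md: dict, user_config: dict) -> dict:
    """3-layer merge, lowest precedence first so later layers win:
    built-in profile < AGENTS.md < ppxai-config.json."""
    effective = dict(builtin)
    # Only explicitly set fields override; None means "not specified"
    effective.update({k: v for k, v in agents_md.items() if v is not None})
    effective.update({k: v for k, v in user_config.items() if v is not None})
    return effective
```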

ppxai-config.json example:

{
  "providers": {
    "local-vllm": {
      "models": {
        "*/qwen3-coder-30b*": {
          "tool_calling": {
            "mode": "native",
            "parallel_tool_calls": true
          }
        }
      }
    }
  }
}

AGENTS.md example:

tool_calling:
  "gpt-5.2*":
    mode: native
    strip_json_from_text: true
  "o4-mini*":
    mode: prompt_based

/model info command: Shows effective profile with source attribution per field (e.g., "mode: native (built-in profile)" vs "mode: prompt_based (AGENTS.md override)").

Key files:

- ppxai/config/__init__.py — get_tool_calling_config()
- ppxai/engine/bootstrap.py — _parse_tool_calling_section(), get_tool_calling_overrides()
- ppxai/engine/chat.py — _get_effective_profile() (3-layer merge)
- 16 new tests across config, bootstrap, and profile merging

7. Benchmark v2 (Step 7)

What: Expanded from 28 tests/7 categories to 36 tests/9 categories with agentic multi-turn tests, efficiency metrics, and AGENTS.md delta testing.

New categories:

- agentic_tool_loops — multi-turn tool call chains requiring search → read → edit patterns
- efficiency — measures token usage and tool call redundancy

New tests:

| Test | Category | Description | Scoring |
|---|---|---|---|
| patch_apply_verify | code_editing | Generate patch, apply with _replace_hunk(), verify fix | 0.0/0.5/0.7/1.0 |
| search_then_edit | agentic_tool_loops | search_code → read_file → apply_patch (3 turns) | steps/3 |
| fix_verify | agentic_tool_loops | write → test → fail → fix → retest (4 turns) | steps/4 |
| information_gathering | agentic_tool_loops | Find and read 3 auth-related files | files_found/3 |
| error_recovery_chain | agentic_tool_loops | Handle not-found → search → read → permission denied (4 turns) | steps/4 |
| multi_file_review | agentic_tool_loops | Read all files before making claims | files_read/total |
| claim_without_action | hallucination_resistance | Refuse to fabricate without tool calls | 0.0 or 1.0 |
| consecutive_tool_loop | agentic_tool_loops | 5-step dependent chain | steps/5 |
| time_to_first_tool_call | efficiency | Penalize preamble >100 chars | 0.0/0.5/1.0 |
| tool_call_efficiency | efficiency | Score by redundant calls vs optimal | 0.3-1.0 |

AGENTS.md delta testing:

- --agents-md both runs the suite twice per model (with and without AGENTS.md hints)
- Reports per-category score delta and overall percentage lift
- Biggest delta: gemini-3.1-pro-customtools +20.1% (61.4% → 81.5%)

Token/tool tracking:

- total_tokens and total_tool_calls in benchmark metadata
- Per-test tokens_used and tool_calls_made in test details

Duplicate tool call detection:

- _dedup_tool_call() helper returns [DUPLICATE CALL] feedback for repeated tool+args pairs
- exempt_tools set for tools with intentionally varying results (run_command, search_code)
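A sketch of the dedup check under these rules; the real _dedup_tool_call() may key and phrase things differently:

```python
import json

def make_dedup_checker(exempt_tools=frozenset({"run_command", "search_code"})):
    """Return a checker that flags repeated (tool, args) pairs.

    Exempt tools are never flagged because their results legitimately
    vary between identical calls.
    """
    seen = set()
    def check(tool: str, args: dict):
        if tool in exempt_tools:
            return None
        # Canonical JSON so key order in args doesn't matter
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            return "[DUPLICATE CALL] Identical call already made; reuse the earlier result."
        seen.add(key)
        return None
    return check
```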


Benchmark Rankings (v1.16.0, 36 tests)

| Rank | Model | Score | Mode | Tier |
|---|---|---|---|---|
| 1 | qwen3-coder (cloud) | 95.8% | native | S |
| 2 | gpt-5.2 | 91.4% | native | A |
| 3 | gemini-2.5-flash | 90.6% | native | S |
| 4 | gpt-5 | 89.1% | native | A |
| 5 | gemini-2.5-pro | 87.5% | native | S |
| 6 | gpt-5-mini | 86.5% | native | A |
| 7 | gemini-3-flash | 84.4% | native | S |
| 8 | sonar-pro | 84.4% | prompt-based | A |
| 9 | gpt-4.1 | 82.8% | native | A |
| 10 | gemini-3.1-pro-customtools | 81.5% | native | A |
| 11 | gemini-3.1-pro | 81.5% | native | A |
| 12 | Qwen3-Coder-30B (DGX) | 81.2% | native | S |
| 13 | o4-mini | 80.8% | prompt-based | B |
| 14 | sonar | 76.6% | prompt-based | B |
| 15 | gpt-4.1-mini | 71.9% | prompt-based | B |

AGENTS.md Delta Testing Results

| Model | WITH | WITHOUT | Delta |
|---|---|---|---|
| gemini-3.1-pro-customtools | 81.5% | 61.4% | +20.1% |
| sonar-pro | 84.4% | 68.7% | +15.7% |
| gpt-5.2 | 91.4% | 82.8% | +8.6% |
| gemini-2.5-flash | 90.6% | 84.4% | +6.2% |
| qwen3-coder | 95.8% | 93.7% | +2.1% |

New Commands

/ls — Directory Listing

Lists files and directories with size, modification time, and type indicators.

/ls                    # List current directory
/ls /path/to/dir       # List specific directory

Available in ppxaide TUI, Web app, and ppxai Rich CLI. HTTP endpoint: GET /files/list?path=...

/tree — Directory Tree

Shows directory structure as an indented tree with file counts.

/tree                  # Tree of current directory
/tree /path/to/dir     # Tree of specific directory
/tree --depth 2        # Limit depth

Available in all 3 clients. HTTP endpoint: GET /files/tree?path=...&depth=...

/model info — Model Profile Info

Shows the effective tool calling profile for the current model with source attribution.

/model info
# Output:
# Model: gpt-5.2
# Tool Calling Profile:
#   mode: native (built-in profile)
#   strip_json_from_text: true (built-in profile)
#   parallel_tool_calls: true (AGENTS.md override)
#   fallback_on_empty: false (default)

Session Management

Model Switch Context Reset

Switching models now resets session context to prevent cross-model confusion:

- session.reset_for_model_switch() clears conversation history
- Commands show the count of cleared messages
- Session restore paths pass reset_context=False to preserve history on load

Per-Model Iteration Limits

ModelProfile.max_tool_iterations sets the maximum tool loop iterations per model:

| Model | Max Iterations |
|---|---|
| gemini-2.5-pro/flash | 25 |
| gemini-3.1-pro | 20 |
| sonar-pro, sonar | 20 |
| qwen3-coder | 20 |
| codex-mini | 20 |
| Default | 15 |

Session Pollution Detection

After the first tool loop iteration, check_session_pollution() computes bigram similarity between the model's latest response and the previous one. Similarity >90% triggers a WARNING event, indicating the model is repeating itself.
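One plausible reading of this check is Jaccard overlap of character bigrams; the exact metric inside check_session_pollution() may differ:

```python
def bigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams, in [0.0, 1.0]."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0  # two empty/one-char strings: treat as identical
    if not ba or not bb:
        return 0.0
    return len(ba & bb) / len(ba | bb)

# A similarity above the threshold would trigger the WARNING event:
POLLUTION_THRESHOLD = 0.90  # illustrative constant, not from the source
```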

SSE Disconnect Detection

request.is_disconnected() is checked in the SSE event generator. When a client disconnects mid-stream, the server stops processing instead of continuing to generate events into the void.


Files Changed

New Files

| File | Description |
|---|---|
| tests/test_provider_hierarchy.py | 61 provider hierarchy tests |
| tests/test_chat_profile_routing.py | 16 profile routing + 11 truncation tests |
| tests/test_tool_messages.py | 28 tool message format tests |
| benchmarks/llm-eval/test_cases.py | Agentic + efficiency benchmark tests (8 new tests) |

Modified Files (Key)

| File | Change |
|---|---|
| ppxai/engine/chat.py | Profile-driven tool loop, proper tool messages, multi-tool, grouped events |
| ppxai/engine/types.py | Message.tool_calls, Message.tool_call_id, new EventTypes |
| ppxai/engine/session.py | reset_for_model_switch(), tool message serialization, validation |
| ppxai/engine/client.py | Model switch reset, config reload |
| ppxai/engine/providers/base.py | BaseProvider ABC with shared interface |
| ppxai/engine/providers/openai_native.py | tool role message conversion |
| ppxai/engine/providers/openai_compat.py | tool role message conversion |
| ppxai/engine/providers/gemini.py | functionCall/functionResponse conversion |
| ppxai/engine/model_profiles.py | Updated tiers, Gemini 3.1 profiles |
| ppxai/engine/bootstrap.py | tool_calling YAML parsing, overrides |
| ppxai/engine/context.py | tool_calling_overrides scope merging |
| ppxai/config/__init__.py | get_tool_calling_config() |
| ppxai/server/http.py | /files/list, /files/tree, SSE fixes, consent deadlock fix |
| ppxai/commands/utility.py | /ls, /tree commands |
| ppxai/commands/provider.py | /model info command |
| ppxai/commands/session.py | Model switch reset UI |
| ppxai/tui/app.py | Tool group rendering, non-verbose mode |
| AGENTS.md | Gemini 3.1 hints, Perplexity rewrite, tool calling YAML |
| ppxai-config.example.json | tool_calling override examples |
| benchmarks/llm-eval/engine_runner.py | AGENTS.md delta, token tracking, partial credit |

Migration Notes

Session Format

v1.15.x sessions load automatically. The new tool_calls and tool_call_id fields on Message default to None when loading old sessions. New sessions saved in v1.16.0 format are not backwards-compatible with v1.15.x.

Configuration

No configuration changes required. Existing ppxai-config.json files work without modification.

New optional config: Per-model tool_calling overrides in provider config:

{
  "providers": {
    "my-provider": {
      "models": {
        "model-name": {
          "tool_calling": {
            "mode": "prompt_based"
          }
        }
      }
    }
  }
}

AGENTS.md

New optional tool_calling: YAML front matter section for project-specific tool calling overrides. Existing AGENTS.md files work unchanged.

Provider API

BaseProvider is now the required base class. Custom providers extending OpenAICompatibleProvider are unaffected (it inherits from BaseProvider). Direct provider implementations must implement the BaseProvider interface.


Test Summary

| Category | Count |
|---|---|
| Provider hierarchy tests | 61 |
| Profile routing + truncation tests | 27 |
| Tool message tests | 28 |
| Config + bootstrap + profile merging | 16 |
| Benchmark test definitions | 36 |
| All other tests | ~1,368 |
| Total | 1,536 |

All tests passing on macOS (Python 3.12).