Release Notes: v1.15.6¶
Release Date: 2026-02-19
Branch: feature/benchmark-openai-models
Focus: Native OpenAI Provider, Model Profile System, Benchmark Infrastructure
Overview¶
v1.15.6 adds a dedicated native OpenAI provider, a model profile system covering 37 models, and significant benchmark infrastructure improvements. This is the foundation release for v1.16.0's profile-driven tool loop.
Key Changes:
- Native OpenAI provider with Chat Completions + Responses API routing
- Model profile system with 37 built-in profiles (data structure only, not yet wired into chat.py)
- Brace-counting JSON parser replacing regex (handles nested braces in apply_patch diffs)
- JSON stripping from response text when native tool_calls are present
- 54+ benchmark runs across 27 model variants with detailed analysis
- 1,349 total tests passing
Major Changes¶
1. Native OpenAI Provider¶
What: OpenAINativeProvider — a standalone provider class for the OpenAI API that correctly handles modern OpenAI models.
Why: The previous OpenAICompatibleProvider treated all OpenAI-compatible APIs the same, but newer OpenAI models (GPT-5.x, o-series, Codex) require specific API handling:
- GPT-5.x and o-series require max_completion_tokens instead of max_tokens
- o-series models reject temperature and top_p parameters
- Codex models use the Responses API (/responses) instead of Chat Completions (/chat/completions)
- GPT-5.2 outputs tool call JSON in both tool_calls AND response text (needs stripping)
Key features:
| Feature | Description |
|---------|-------------|
| Chat Completions API | Standard /chat/completions for GPT-4.1, GPT-5.x, o-series |
| Responses API | /responses endpoint for Codex and Pro models |
| 404 auto-fallback | Tries Chat Completions first, falls back to Responses on 404 |
| max_completion_tokens | Automatically used for GPT-5.x and o-series (replaces max_tokens) |
| Param stripping | Removes temperature/top_p for models that reject them |
| Reasoning tokens | Extracts reasoning token counts from o-series responses |
| Native tool calling | Streaming tool call assembly from chunked responses |
| Web search | web_search_preview tool via Responses API (opt-in) |
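The parameter handling in the table above can be sketched as a small request-adaptation step. This is an illustrative sketch, not the actual OpenAINativeProvider code; the `adapt_params` helper and the prefix-based model checks are assumptions for the example.

```python
# Illustrative sketch of per-model request adaptation (hypothetical helper;
# the real provider's routing logic may differ).

def adapt_params(model: str, params: dict) -> dict:
    """Return a copy of `params` adjusted for the target model family."""
    out = dict(params)
    # GPT-5.x and o-series expect max_completion_tokens, not max_tokens
    if model.startswith(("gpt-5", "o1", "o3", "o4")):
        if "max_tokens" in out:
            out["max_completion_tokens"] = out.pop("max_tokens")
    # o-series models reject sampling parameters entirely
    if model.startswith(("o1", "o3", "o4")):
        out.pop("temperature", None)
        out.pop("top_p", None)
    return out
```

The same shape extends naturally to the 404 auto-fallback: build the request once, then retry against `/responses` if `/chat/completions` rejects the model.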
Impact: Only affects users with openai provider configured. OpenRouter, local, and custom providers are unchanged (still use OpenAICompatibleProvider).
Key files:
- ppxai/engine/providers/openai_native.py — Provider implementation (812 lines)
- tests/test_openai_native.py — 46 unit tests
2. Model Profile System¶
What: model_profiles.py — a registry of per-model behavioral profiles encoding tool calling strategy, API routing, max_tokens, and benchmark performance tier.
Why: The benchmark analysis (27 models, 7 categories) showed that models have fundamentally different tool calling behaviors. A one-size-fits-all approach doesn't work:
- o4-mini scores up to 80.8% with prompt-based vs 11.5% with native tool calling
- gpt-4.1-mini scores 71.9% prompt-based vs 60.9% native
- GPT-5.2 needs JSON stripping from response text
- Codex models need Responses API routing
Architecture:
```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class ToolCallingProfile:
    mode: Literal["native", "prompt_based", "auto"]
    fallback_on_empty: bool     # Retry with prompt-based if native returns empty
    fallback_on_failure: bool   # Retry on native parse failure
    strip_json_from_text: bool  # Remove duplicate tool JSON from content
    parallel_tool_calls: bool   # Process all tool calls, not just first
    api_path: Literal["chat", "responses", "auto"]

@dataclass
class ModelProfile:
    tool_calling: ToolCallingProfile
    max_tokens: int               # 0 = use provider default
    supports_reasoning: bool      # o-series models
    restricted_params: List[str]  # Params to strip (temperature, top_p)
    tier: str                     # S/A/B/C/D benchmark tier
```
37 built-in profiles covering:
- OpenAI: gpt-5.2, gpt-5, gpt-5-mini, gpt-5-nano, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o4-mini, o3, o3-mini, o3-pro, o1, o1-mini, gpt-5.1-codex, gpt-5.1-codex-mini
- Perplexity: sonar, sonar-pro, sonar-reasoning-pro, sonar-deep-research, llama-3.1-sonar
- Gemini: 2.5-pro, 2.5-flash, 3-flash-preview, 3-pro-preview, 2.0-flash-exp
- DGX/vLLM: Qwen3-Coder-30B, Qwen3-Coder-Next, Qwen3-Next-80B (Instruct + Thinking), RedHatAI Qwen3-30B
- Ollama: qwen2.5-coder:32b, qwen2.5-coder (small), qwen3:30b-a3b
- Other: openai/gpt-oss
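A registry of this shape needs a lookup rule for versioned model names (e.g. a dated gpt-4.1-mini snapshot should resolve to the gpt-4.1-mini profile, not the broader gpt-4.1 one). The following is a minimal sketch of one plausible rule, longest-prefix matching; the real matching logic lives in ppxai/engine/model_profiles.py and may differ, and the profile values shown are illustrative.

```python
# Minimal registry-lookup sketch (hypothetical; real registry is in
# ppxai/engine/model_profiles.py). Modes/tiers below are illustrative.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    mode: str        # "native" | "prompt_based" | "auto"
    max_tokens: int  # 0 = use provider default
    tier: str        # S/A/B/C/D benchmark tier

PROFILES = {
    "o4-mini": ModelProfile(mode="prompt_based", max_tokens=0, tier="S"),
    "gpt-4.1": ModelProfile(mode="native", max_tokens=0, tier="A"),
    "gpt-4.1-mini": ModelProfile(mode="prompt_based", max_tokens=0, tier="S"),
}

DEFAULT = ModelProfile(mode="auto", max_tokens=0, tier="C")

def lookup_profile(model: str) -> ModelProfile:
    """Longest matching key prefix wins, so a versioned name like
    'gpt-4.1-mini-2025' resolves to the most specific profile."""
    best = ""
    for key in PROFILES:
        if model.startswith(key) and len(key) > len(best):
            best = key
    return PROFILES[best] if best else DEFAULT
```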
v1.15.6 scope: Data structure + registry only. Profiles are NOT yet consulted by chat.py. The profile-driven tool loop integration happens in v1.16.0.
Key files:
- ppxai/engine/model_profiles.py — Profiles and registry (488 lines)
- tests/test_model_profiles.py — 41 tests
3. Brace-Counting JSON Parser (P2)¶
What: Replaced regex-based JSON extraction in engine/tools/parser.py with a brace-counting parser that correctly handles nested braces.
Why: The previous regex \{[^}]+\} broke when parsing apply_patch tool calls containing code diffs with { and } characters. The regex would match partial JSON, causing parse failures.
How: _find_json_objects() scans text character-by-character, tracking brace depth and string literal boundaries. It correctly handles:
- Nested braces in code diffs
- Escaped characters inside strings
- Multiple JSON objects in a single response
- Markdown code fences around JSON
Both parse_tool_call() and strip_tool_json_from_text() now use this parser.
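The scanning approach described above can be sketched as follows. This is a simplified illustration of the brace-counting technique, not the actual `_find_json_objects()` implementation, which additionally handles markdown code fences and truncated tool calls.

```python
# Simplified brace-counting scanner: tracks brace depth and string-literal
# boundaries so braces inside JSON strings (e.g. code diffs) don't confuse it.
import json

def find_json_objects(text: str) -> list[dict]:
    """Return every top-level JSON object found in `text`."""
    objects, depth, start = [], 0, -1
    in_string, escaped = False, False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # previous char was a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue                 # braces inside strings are ignored
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    objects.append(json.loads(text[start:i + 1]))
                except json.JSONDecodeError:
                    pass             # brace-balanced span that isn't valid JSON
    return objects
```

Against the failure case above, a tool call whose arguments contain a code diff with `{` and `}` now parses whole instead of being cut off at the first closing brace.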
4. Benchmark Infrastructure¶
54+ runs across 27 model variants with the improved benchmark runner:
| Rank | Model | Score | Tool Calling |
|---|---|---|---|
| 1 | o4-mini | 100.0% | prompt-based |
| 2 | gpt-4.1-mini | 100.0% | prompt-based |
| 3 | gemini-3-flash-preview | 100.0% | native |
| 4 | sonar-pro | 100.0% | native |
| 5 | gpt-oss-120b | 89.1% | prompt-based |
| 6 | Qwen3-Coder-30B | 81.2% | native |
| 7 | gemini-2.5-pro | 81.2% | native |
| 8 | gemini-2.5-flash | 81.2% | native |
| 9 | sonar | 75.0% | native |
| 10 | gpt-5.2 | 70.3% | native |
Benchmark improvements:
- --tool-calling-method flag to force native/prompt-based per run
- --debug flag saves per-request JSON with full AI responses
- Engine bypass: benchmark calls provider directly (no engine tool conflicts)
- Profile-aware routing: benchmark consults ModelProfile for tool mode
- Prompt-based scoring fix: tool_json_in_content penalty removed for prompt-based mode
Benchmark Findings¶
The 27-model analysis identified 5 architectural gaps in chat.py:
| Gap | Issue | v1.15.6 Fix | v1.16.0 Fix |
|---|---|---|---|
| 1 | Binary native/prompt decision too coarse | Model profiles encode per-model strategy | Profile-driven tool loop |
| 2 | Synthetic tool result messages | — | Proper tool role messages |
| 3 | Only first tool call processed | — | Multi-tool support |
| 4 | Tool JSON leaks in response text | strip_tool_json_from_text() | — |
| 5 | No fallback between modes | Profile fallback_on_empty flag | Adaptive fallback |
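The fallback behavior that `fallback_on_empty` enables (gap 5) amounts to a simple retry policy. The sketch below is hypothetical; the function names (`call_native`, `call_prompt_based`) are illustrative stand-ins, not the actual chat.py API.

```python
# Hypothetical sketch of profile-driven mode fallback (gap 5).

def run_tool_call(model, prompt, profile, call_native, call_prompt_based):
    """Try native tool calling first; if the profile allows it, retry
    with prompt-based tool calling when the native path returns nothing."""
    if profile["mode"] == "prompt_based":
        return call_prompt_based(model, prompt)
    result = call_native(model, prompt)
    if not result and profile.get("fallback_on_empty"):
        return call_prompt_based(model, prompt)
    return result
```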
Files Changed¶
New Files¶
- ppxai/engine/providers/openai_native.py — Native OpenAI provider
- ppxai/engine/model_profiles.py — Model profile system
- tests/test_openai_native.py — OpenAI provider tests
- tests/test_model_profiles.py — Model profile tests
- docs/MODEL-BEHAVIOR-ANALYSIS.md — 27-model benchmark analysis
- docs/archive/RELEASE-PLAN-v1.15.6-v1.16.0.md — Phased release plan (archived)
- scripts/package-windows-zip.ps1 — Windows offline deployment packager
- benchmarks/llm-eval/results/*.json — 54+ benchmark result files
Modified Files¶
- ppxai/engine/tools/parser.py — Brace-counting parser, JSON stripping, truncated tool detection
- ppxai/engine/providers/base.py — Added get_model_profile()
- ppxai/engine/providers/gemini.py — Added get_model_profile()
- ppxai/engine/chat.py — JSON stripping integration, AGENTS.md hints for native providers
- benchmarks/llm-eval/engine_runner.py — Profile-aware routing, debug logging, engine bypass
- benchmarks/llm-eval/response_quality.py — Prompt-based scoring awareness
- benchmarks/llm-eval/test_cases.py — Profile-aware payload calibration
Migration Notes¶
No breaking changes. v1.15.6 is fully backwards-compatible:
- Model profiles exist but are NOT consulted by chat.py (integration lands in v1.16.0)
- OpenAINativeProvider only affects the openai provider; other providers are unchanged
- Existing ppxai-config.json configurations work without modification
- Existing sessions load without changes
For OpenAI users: The openai provider now uses the native provider automatically. If you experience issues, you can switch back to the compatible provider by configuring your OpenAI endpoint as a custom provider instead.
Test Summary¶
| Category | Count |
|---|---|
| Model profile tests | 41 |
| OpenAI provider tests | 46 |
| Parser tests (existing + enhanced) | ~30 |
| All other tests | ~1,232 |
| Total | 1,349 |
All tests passing on Windows 11 (Python 3.12).