
01. Benchmark: 25 Example Functions

  • Date: 2026-03-15
  • Status: draft

Why This Note Exists

This note supports a narrow release-facing question: can workflow_auto_assembler assemble simple linear workflows from typed tool schemas and refuse tasks that require missing capabilities?

It should be read as a release-supporting research note, not as a formal research paper.

Intro

Claim

When tools are exposed through explicit input and output schemas, an LLM can often synthesize a correct simple linear workflow by stacking those tools into a larger typed procedure, while also rejecting tasks that require missing capabilities.

Methodology

Setup

The benchmark is defined in artifacts/workflow_auto_assembler/examples/test_examples.py and executed by artifacts/workflow_auto_assembler/examples/run_benchmark.py.

The system under test is workflow_auto_assembler operating over a fixed synthetic tool catalog. The benchmark contains 25 task specifications and 5 repeated runs per task, for 125 total runs.

The stored runs were executed with the gpt-oss:20b model through the benchmark harness configuration.
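
For orientation, the stored runs correspond roughly to the configuration sketched below. The field names are illustrative rather than the actual run_benchmark.py interface; only the model name, the task count, and the repeat count come from the stored benchmark.

```python
# Hypothetical shape of the harness configuration; run_benchmark.py may
# structure this differently. Model name and counts match the stored runs.
BENCHMARK_CONFIG = {
    "model": "gpt-oss:20b",
    "tasks": 25,            # task specifications in test_examples.py
    "repeats_per_task": 5,  # 25 tasks x 5 repeats = 125 total runs
    "tool_catalog": "artifacts/workflow_auto_assembler/examples/functions.py",
}
```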

Data / Cases

Tasks 1 to 20 are intended to be solvable with the available tools. They cover single-step actions, short multi-step chains, and larger linear compositions that require non-trivial field matching between tool outputs and downstream inputs.

Tasks 21 to 25 are intentionally unsupported and require capabilities absent from the tool catalog: database write-back, OCR, embeddings/vector indexing, contract compliance analysis, and crypto transfer execution.
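
To make the split concrete, here is a sketch of the two task shapes. The goals and field names are invented for illustration and do not quote test_examples.py.

```python
# Illustrative only; the real task specifications live in test_examples.py.
solvable_task = {
    "task_id": "task_3",
    "goal": "fetch a record and produce a typed summary",  # hypothetical goal
    "requested_output": "Summary",                         # coverable by the catalog
}

unsupported_task = {
    "task_id": "task_23",
    "goal": "embed documents and build a vector index",    # per the list above
    "requested_output": "VectorIndex",                     # no embedding tool exists
}
```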

Method

The benchmark uses a synthetic catalog of 24 typed tools defined in artifacts/workflow_auto_assembler/examples/functions.py.
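
As a sketch of what "typed tool" means here, a catalog entry might pair pydantic-style input and output models with a deterministic function. The names below (FetchInvoiceInput, Invoice, fetch_invoice) are hypothetical, not taken from functions.py.

```python
from pydantic import BaseModel

class FetchInvoiceInput(BaseModel):
    invoice_id: str

class Invoice(BaseModel):
    invoice_id: str
    customer_id: str
    amount: float

def fetch_invoice(args: FetchInvoiceInput) -> Invoice:
    # Synthetic, deterministic behavior, in the spirit of the mocked catalog.
    return Invoice(invoice_id=args.invoice_id, customer_id="c-1", amount=42.0)
```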

Planning is treated as a schema-matching problem: identify a linear sequence of tool calls whose outputs satisfy downstream inputs and eventually satisfy the requested output model.
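
A minimal, non-LLM sketch of that framing: treat each tool as (name, required fields, produced fields) and greedily chain tools until the goal fields are covered, refusing when no tool makes progress. This only illustrates the schema-matching idea; the actual planner is model-driven, and the tool names here are invented.

```python
def plan_linear(tools, start_fields, goal_fields):
    """Greedy forward chaining over field schemas; a sketch, not the planner."""
    available, plan = set(start_fields), []
    while not goal_fields <= available:
        # Pick any tool whose inputs are satisfied and that adds new fields.
        step = next(
            (name for name, needs, yields in tools
             if needs <= available and not yields <= available),
            None,
        )
        if step is None:
            return None  # no tool makes progress: refuse instead of guessing
        plan.append(step)
        available |= next(t[2] for t in tools if t[0] == step)
    return plan

tools = [
    ("fetch_invoice", {"invoice_id"}, {"customer_id", "amount"}),
    ("fetch_customer", {"customer_id"}, {"customer_name"}),
    ("format_summary", {"customer_name", "amount"}, {"summary"}),
]
print(plan_linear(tools, {"invoice_id"}, {"summary"}))
# -> ['fetch_invoice', 'fetch_customer', 'format_summary']
print(plan_linear(tools, {"invoice_id"}, {"vector_index"}))
# -> None (no tool in the catalog yields the goal field)
```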

Success is evaluated at two levels: whether a workflow can be assembled and completed, and whether the completed workflow passes the runner-based benchmark cases attached to the task.
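
In pseudocode terms, the two levels might look like the sketch below; the function and field names are placeholders, not the harness API, and runner cases are assumed to be callables over the completed workflow.

```python
def evaluate(workflow, runner_cases):
    # Level 1: was a workflow assembled and completed at all?
    if workflow is None:
        return {"completed": False, "cases_passed": 0, "cases_total": len(runner_cases)}
    # Level 2: does the completed workflow pass the task's runner-based cases?
    passed = sum(1 for case in runner_cases if case(workflow))
    return {"completed": True, "cases_passed": passed, "cases_total": len(runner_cases)}
```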

Limitations

The tools are synthetic and mostly deterministic. External side effects are mocked rather than executed against live services. The benchmark does not provide a controlled analysis of token usage, cost, or latency. The workflows are mostly linear and do not yet stress branching, loops, or persistent state.

Analysis

Results

The headline metrics below summarize aggregate run-level performance before breaking the benchmark into intended positive tasks and intentionally unsupported tasks.


| Metric | Value |
| --- | --- |
| Total runs | 125 |
| Completed runs | 99 |
| Run-level success rate | 0.792 |
| Solved task types | 20 |
| Completed test pass rate | 1.0 |
| Average test retries | 0.048 |
| Average workflow steps | 5.29 |
| Distinct tools used | 24 |

At the run level, the stored benchmark produced 99 / 125 completed workflows. At the task level, all 20 intended positive tasks were solved at least once. Among completed workflows, the benchmark records 990 / 990 passing benchmark cases.
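
The headline ratios reduce to simple arithmetic over the stored counts; the snippet below just restates them.

```python
total_runs, completed_runs = 125, 99
print(f"run-level success rate: {completed_runs / total_runs:.4f}")   # 0.7920

cases_passed, cases_total = 990, 990
print(f"completed test pass rate: {cases_passed / cases_total:.4f}")  # 1.0000
```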

Those aggregate numbers are useful, but the more informative split is between intended positive tasks and intentionally unsupported tasks.


| Task | Runs | Completed | Possible | Success rate | Test pass rate | Avg. test retries | Avg. workflow steps |
| --- | --- | --- | --- | --- | --- | --- | --- |
| task_1 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 2.0 |
| task_2 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 2.0 |
| task_3 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 3.0 |
| task_4 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 3.0 |
| task_5 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 4.0 |
| task_6 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 4.0 |
| task_7 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 4.0 |
| task_8 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 4.0 |
| task_9 | 5 | 4 | 4 | 0.8 | 1.0 | 0.8 | 4.0 |
| task_10 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 3.6 |
| task_11 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 4.0 |
| task_12 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 5.0 |
| task_13 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 5.0 |
| task_14 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 5.4 |
| task_15 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 6.0 |
| task_16 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 6.0 |
| task_17 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 6.0 |
| task_18 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 7.6 |
| task_19 | 5 | 5 | 5 | 1.0 | 1.0 | 0.0 | 7.0 |
| task_20 | 5 | 5 | 5 | 1.0 | 1.0 | 0.4 | 20.0 |

The positive set shows strong stability: 19 of the 20 intended tasks completed in all 5 / 5 repeats. task_9 completed in 4 / 5 repeats and is the only partially unstable positive case. task_20, the longest positive workflow, completed in all repeats with an average of 20.0 workflow steps.


| Task | Runs | Completed | Possible | Success rate |
| --- | --- | --- | --- | --- |
| task_21 | 5 | 0 | 0 | 0.0 |
| task_22 | 5 | 0 | 0 | 0.0 |
| task_23 | 5 | 0 | 0 | 0.0 |
| task_24 | 5 | 0 | 0 | 0.0 |
| task_25 | 5 | 0 | 0 | 0.0 |

The unsupported tasks show an equally important property: none of them produced a workflow. The planner did not just succeed when capabilities existed; it also refrained from asserting workflows when the required capabilities were missing.



Interpretation

These results support a narrow claim about schema-first workflow synthesis, not a broad claim about general autonomous planning.

Explicit schemas appear to help in two ways: they guide tool selection and they make the assembled workflow testable as a composed typed object.

Capability boundaries also appear legible to the planner: unsupported tasks were rejected consistently instead of being forced into hallucinated workflows.


| Task | Avg. duration (s) | Min duration (s) | Max duration (s) |
| --- | --- | --- | --- |
| task_9 | 2517.0 | 1185.7 | 2972.0 |
| task_20 | 2848.0 | 2688.9 | 2965.0 |
| task_21 | 1366.9 | 1098.1 | 1518.4 |
| task_22 | 1247.0 | 917.1 | 1407.2 |
| task_23 | 1365.8 | 1244.9 | 1481.4 |
| task_24 | 1222.3 | 1082.8 | 1502.5 |
| task_25 | 1311.5 | 973.2 | 1500.0 |

The duration slice adds one practical nuance. Unsupported tasks were not immediate cheap exits in the stored runs; the system still spent planning effort before concluding that no valid workflow existed. That suggests a next optimization target for this style of system: faster impossibility detection based on missing schema-level capabilities.
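
A plausible shape for that optimization, reusing the forward-chaining idea from the planner sketch above: compute the set of schema-level fields reachable from the task inputs, and refuse before any model call if the goal fields fall outside it. This is an assumption about a future design, not existing workflow_auto_assembler behavior.

```python
def reachable_fields(tools, start_fields):
    # tools: iterable of (name, required_fields, produced_fields) tuples.
    available, changed = set(start_fields), True
    while changed:
        changed = False
        for _, needs, yields in tools:
            if needs <= available and not yields <= available:
                available |= yields
                changed = True
    return available

def obviously_impossible(tools, start_fields, goal_fields):
    # True means no chain over the catalog can ever produce the goal fields,
    # so the planner can refuse without spending any LLM calls.
    return not set(goal_fields) <= reachable_fields(tools, start_fields)
```

Because the check is a fixed-point computation over the tool catalog, it runs in milliseconds and can sit in front of the planner as a cheap guard for tasks like 21 to 25.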

Conclusions

Takeaways

The benchmark supports a limited but meaningful claim: when tools are exposed through explicit input and output schemas, an LLM can often synthesize a correct simple linear workflow by stacking those tools into a larger typed procedure.

In the stored benchmark, that claim is supported by strong performance on the positive task set, zero benchmark-case failures among completed workflows, and consistent rejection of unsupported tasks.

Next Steps

Test the same idea under harder conditions: noisier tools, more ambiguous schemas, branching workflows, and live side effects.

Compare workflow_auto_assembler (WAA) against a plain tool-calling baseline so the value of explicit workflow assembly is measured rather than assumed.