02. Benchmark: Math Reasoning
- Date: 2026-05-05
- Status: draft
{'runs': 75, 'completed_runs': 74, 'task_count': 15, 'tasks_solved_once': 15}
Intro
This benchmark tests WAA on deterministic math tasks where the difficulty comes from workflow synthesis rather than ambiguous answers. The useful question is whether WAA can consistently assemble and reuse the right workflow for a small but varied math task set.
Benchmark Cases
The benchmark cases fall into four small groups:
- Direct arithmetic: tasks where one tool or one obvious two-step workflow should be enough.
- Compositional arithmetic: tasks that require combining several primitive tools in a fixed order.
- Symbolic calculus and trigonometry: tasks that depend on derivative or trig identities.
- Stress cases: tasks where the planner has to reconstruct a familiar formula from the available tool set.
One small caveat is worth calling out. task_8 (sin(x)^2 + cos(x)^2) has the same expected output for every input because the identity always evaluates to 1. That makes it useful as an identity-recognition case, but weak as a pure composition test: a degenerate workflow that returns a constant can still pass numerically.
Direct Arithmetic
task_1
Add left and right and return the result as a single scalar value.
|
case |
left |
right |
output_value |
| 0 |
1 |
2.00 |
5.00 |
7.0 |
| 1 |
2 |
-3.00 |
10.00 |
7.0 |
| 2 |
3 |
1.50 |
2.50 |
4.0 |
| 3 |
4 |
0.00 |
7.00 |
7.0 |
| 4 |
5 |
-8.00 |
-2.00 |
-10.0 |
| 5 |
6 |
9.00 |
1.00 |
10.0 |
| 6 |
7 |
100.00 |
0.50 |
100.5 |
| 7 |
8 |
12.00 |
-2.00 |
10.0 |
| 8 |
9 |
3.25 |
4.75 |
8.0 |
| 9 |
10 |
-1.25 |
1.25 |
0.0 |
task_2
Subtract right from left and return the difference.
|
case |
left |
right |
output_value |
| 0 |
1 |
7.00 |
2.00 |
5.0 |
| 1 |
2 |
5.00 |
5.00 |
0.0 |
| 2 |
3 |
-4.00 |
3.00 |
-7.0 |
| 3 |
4 |
9.50 |
1.50 |
8.0 |
| 4 |
5 |
0.00 |
2.00 |
-2.0 |
| 5 |
6 |
20.00 |
-2.00 |
22.0 |
| 6 |
7 |
-8.00 |
-3.00 |
-5.0 |
| 7 |
8 |
4.25 |
0.25 |
4.0 |
| 8 |
9 |
3.00 |
9.00 |
-6.0 |
| 9 |
10 |
100.00 |
33.00 |
67.0 |
task_3
Multiply left and right and return the product.
|
case |
left |
right |
output_value |
| 0 |
1 |
3.0 |
4.0 |
12.00 |
| 1 |
2 |
-2.0 |
5.0 |
-10.00 |
| 2 |
3 |
1.5 |
2.0 |
3.00 |
| 3 |
4 |
0.0 |
10.0 |
0.00 |
| 4 |
5 |
-3.0 |
-7.0 |
21.00 |
| 5 |
6 |
8.0 |
0.5 |
4.00 |
| 6 |
7 |
12.0 |
3.0 |
36.00 |
| 7 |
8 |
9.0 |
-1.0 |
-9.00 |
| 8 |
9 |
2.5 |
2.5 |
6.25 |
| 9 |
10 |
11.0 |
11.0 |
121.00 |
task_4
Divide left by right and return the quotient.
|
case |
left |
right |
output_value |
| 0 |
1 |
8.0 |
2.0 |
4.00 |
| 1 |
2 |
9.0 |
3.0 |
3.00 |
| 2 |
3 |
7.5 |
2.5 |
3.00 |
| 3 |
4 |
-12.0 |
4.0 |
-3.00 |
| 4 |
5 |
1.0 |
4.0 |
0.25 |
| 5 |
6 |
100.0 |
5.0 |
20.00 |
| 6 |
7 |
-9.0 |
-3.0 |
3.00 |
| 7 |
8 |
3.6 |
1.2 |
3.00 |
| 8 |
9 |
22.0 |
11.0 |
2.00 |
| 9 |
10 |
81.0 |
9.0 |
9.00 |
Compositional Arithmetic
task_5
First add left and right, then square the result.
|
case |
left |
right |
output_value |
| 0 |
1 |
2.00 |
3.00 |
25.0 |
| 1 |
2 |
-1.00 |
4.00 |
9.0 |
| 2 |
3 |
1.50 |
0.50 |
4.0 |
| 3 |
4 |
0.00 |
7.00 |
49.0 |
| 4 |
5 |
-4.00 |
-2.00 |
36.0 |
| 5 |
6 |
10.00 |
-1.00 |
81.0 |
| 6 |
7 |
6.00 |
6.00 |
144.0 |
| 7 |
8 |
3.25 |
1.75 |
25.0 |
| 8 |
9 |
-8.00 |
9.00 |
1.0 |
| 9 |
10 |
0.50 |
0.50 |
1.0 |
task_6
Compute (left + right) multiplied by (left - right).
|
case |
left |
right |
output_value |
| 0 |
1 |
7.0 |
2.00 |
45.0000 |
| 1 |
2 |
5.0 |
5.00 |
0.0000 |
| 2 |
3 |
8.0 |
3.00 |
55.0000 |
| 3 |
4 |
1.5 |
0.50 |
2.0000 |
| 4 |
5 |
-4.0 |
2.00 |
12.0000 |
| 5 |
6 |
10.0 |
-1.00 |
99.0000 |
| 6 |
7 |
3.0 |
1.00 |
8.0000 |
| 7 |
8 |
12.0 |
4.00 |
128.0000 |
| 8 |
9 |
0.5 |
0.25 |
0.1875 |
| 9 |
10 |
-6.0 |
-2.00 |
32.0000 |
task_7
Return the distance between the two numbers on the real line.
|
case |
left |
right |
output_value |
| 0 |
1 |
10.00 |
4.00 |
6.0 |
| 1 |
2 |
-2.00 |
5.00 |
7.0 |
| 2 |
3 |
3.00 |
3.00 |
0.0 |
| 3 |
4 |
1.50 |
2.50 |
1.0 |
| 4 |
5 |
-10.00 |
-4.00 |
6.0 |
| 5 |
6 |
7.00 |
-1.00 |
8.0 |
| 6 |
7 |
0.00 |
9.00 |
9.0 |
| 7 |
8 |
2.25 |
0.25 |
2.0 |
| 8 |
9 |
-8.00 |
1.00 |
9.0 |
| 9 |
10 |
4.00 |
11.00 |
7.0 |
task_9
Compute the geometric mean of two positive numbers by multiplying them first and then taking the square root.
|
case |
left |
right |
output_value |
| 0 |
1 |
4.00 |
9.0 |
6.0 |
| 1 |
2 |
1.00 |
16.0 |
4.0 |
| 2 |
3 |
2.25 |
4.0 |
3.0 |
| 3 |
4 |
3.00 |
12.0 |
6.0 |
| 4 |
5 |
0.25 |
4.0 |
1.0 |
| 5 |
6 |
6.00 |
24.0 |
12.0 |
| 6 |
7 |
5.00 |
20.0 |
10.0 |
| 7 |
8 |
1.50 |
6.0 |
3.0 |
| 8 |
9 |
10.00 |
40.0 |
20.0 |
| 9 |
10 |
2.00 |
8.0 |
4.0 |
task_13
Compute the normalized radius sqrt(a squared plus b squared) divided by the magnitude of c.
|
case |
a |
b |
c |
output_value |
| 0 |
1 |
3.0 |
4.0 |
-2.0 |
2.5 |
| 1 |
2 |
5.0 |
12.0 |
13.0 |
1.0 |
| 2 |
3 |
8.0 |
15.0 |
-5.0 |
3.4 |
| 3 |
4 |
6.0 |
8.0 |
10.0 |
1.0 |
| 4 |
5 |
1.5 |
2.0 |
-0.5 |
5.0 |
| 5 |
6 |
7.0 |
24.0 |
5.0 |
5.0 |
| 6 |
7 |
9.0 |
12.0 |
-3.0 |
5.0 |
| 7 |
8 |
4.0 |
3.0 |
2.5 |
2.0 |
| 8 |
9 |
10.0 |
24.0 |
-2.0 |
13.0 |
| 9 |
10 |
0.6 |
0.8 |
0.5 |
2.0 |
task_14
Build a cubic interaction score by adding a and b, cubing the result, and dividing by the magnitude of c.
|
case |
a |
b |
c |
output_value |
| 0 |
1 |
1.0 |
2.0 |
-3.0 |
9.0 |
| 1 |
2 |
2.0 |
1.0 |
9.0 |
3.0 |
| 2 |
3 |
3.0 |
3.0 |
-6.0 |
36.0 |
| 3 |
4 |
0.5 |
1.5 |
2.0 |
4.0 |
| 4 |
5 |
-1.0 |
4.0 |
-3.0 |
9.0 |
| 5 |
6 |
5.0 |
-2.0 |
3.0 |
9.0 |
| 6 |
7 |
2.5 |
2.5 |
-5.0 |
25.0 |
| 7 |
8 |
4.0 |
1.0 |
2.5 |
50.0 |
| 8 |
9 |
6.0 |
-3.0 |
-9.0 |
3.0 |
| 9 |
10 |
1.2 |
0.8 |
0.5 |
16.0 |
Symbolic Calculus and Trigonometry
task_8
Compute sin(x)^2 plus cos(x)^2 for the provided x.
|
case |
x |
output_value |
| 0 |
1 |
0.000000 |
1.0 |
| 1 |
2 |
0.500000 |
1.0 |
| 2 |
3 |
1.200000 |
1.0 |
| 3 |
4 |
-0.700000 |
1.0 |
| 4 |
5 |
2.400000 |
1.0 |
| 5 |
6 |
1.047198 |
1.0 |
| 6 |
7 |
1.570796 |
1.0 |
| 7 |
8 |
3.000000 |
1.0 |
| 8 |
9 |
-2.500000 |
1.0 |
| 9 |
10 |
4.100000 |
1.0 |
task_10
Return the derivative of x squared evaluated at x.
|
case |
x |
output_value |
| 0 |
1 |
4.00 |
8.0 |
| 1 |
2 |
-1.50 |
-3.0 |
| 2 |
3 |
0.00 |
0.0 |
| 3 |
4 |
2.25 |
4.5 |
| 4 |
5 |
-3.00 |
-6.0 |
| 5 |
6 |
7.00 |
14.0 |
| 6 |
7 |
0.50 |
1.0 |
| 7 |
8 |
-8.00 |
-16.0 |
| 8 |
9 |
10.00 |
20.0 |
| 9 |
10 |
1.20 |
2.4 |
task_11
Return the derivative of x cubed plus sine of x evaluated at x.
|
case |
x |
output_value |
| 0 |
1 |
2.000000 |
11.583853 |
| 1 |
2 |
0.000000 |
1.000000 |
| 2 |
3 |
-1.000000 |
3.540302 |
| 3 |
4 |
1.500000 |
6.820737 |
| 4 |
5 |
1.570796 |
7.402203 |
| 5 |
6 |
-2.500000 |
17.948856 |
| 6 |
7 |
3.000000 |
26.010008 |
| 7 |
8 |
0.250000 |
1.156412 |
| 8 |
9 |
-4.000000 |
47.346356 |
| 9 |
10 |
5.000000 |
75.283662 |
task_15
Compute tangent of x by dividing sine of x by cosine of x.
|
case |
x |
output_value |
| 0 |
1 |
0.25 |
0.255342 |
| 1 |
2 |
0.50 |
0.546302 |
| 2 |
3 |
1.00 |
1.557408 |
| 3 |
4 |
-0.75 |
-0.931596 |
| 4 |
5 |
1.20 |
2.572152 |
| 5 |
6 |
-1.10 |
-1.964760 |
| 6 |
7 |
0.90 |
1.260158 |
| 7 |
8 |
-0.30 |
-0.309336 |
| 8 |
9 |
0.70 |
0.842288 |
| 9 |
10 |
-0.60 |
-0.684137 |
Stress Cases
task_12
Multiply two positive numbers, but do it through logarithms and exponentiation rather than a direct multiply tool.
|
case |
left |
right |
output_value |
| 0 |
1 |
2.00 |
8.0 |
16.0 |
| 1 |
2 |
1.50 |
4.0 |
6.0 |
| 2 |
3 |
3.00 |
9.0 |
27.0 |
| 3 |
4 |
0.50 |
6.0 |
3.0 |
| 4 |
5 |
10.00 |
2.0 |
20.0 |
| 5 |
6 |
4.00 |
4.0 |
16.0 |
| 6 |
7 |
1.25 |
8.0 |
10.0 |
| 7 |
8 |
7.00 |
3.0 |
21.0 |
| 8 |
9 |
2.50 |
2.0 |
5.0 |
| 9 |
10 |
12.00 |
0.5 |
6.0 |
Results
The results below come from the benchmark run documented in this note. Each task was run five times. The most useful metrics here are how many reruns completed, how much retry pressure each task needed, and how large the resulting workflows were.
The workflow artifacts used in this note do not currently persist wall-clock runtime, so this section uses saved retry counts instead of average duration.
|
task_id |
runs |
completed_runs |
success_rate |
avg_test_retries |
avg_workflow_steps |
| 0 |
task_1 |
5 |
5 |
1.0 |
0.0 |
2.0 |
| 1 |
task_10 |
5 |
5 |
1.0 |
0.0 |
2.0 |
| 2 |
task_11 |
5 |
5 |
1.0 |
0.0 |
4.4 |
| 3 |
task_12 |
5 |
5 |
1.0 |
1.2 |
5.0 |
| 4 |
task_13 |
5 |
5 |
1.0 |
1.6 |
7.0 |
| 5 |
task_14 |
5 |
5 |
1.0 |
0.0 |
5.0 |
| 6 |
task_15 |
5 |
5 |
1.0 |
0.0 |
4.0 |
| 7 |
task_2 |
5 |
5 |
1.0 |
0.0 |
2.0 |
| 8 |
task_3 |
5 |
5 |
1.0 |
0.0 |
2.0 |
| 9 |
task_4 |
5 |
5 |
1.0 |
0.0 |
2.0 |
| 10 |
task_5 |
5 |
5 |
1.0 |
0.0 |
3.0 |
| 11 |
task_6 |
5 |
5 |
1.0 |
0.0 |
4.0 |
| 12 |
task_7 |
5 |
5 |
1.0 |
0.0 |
3.0 |
| 13 |
task_8 |
5 |
4 |
0.8 |
3.0 |
1.2 |
| 14 |
task_9 |
5 |
5 |
1.0 |
0.0 |
3.0 |

Example Workflows
A few saved workflows are worth showing directly. One from each group and one exception worth noting.
|
task_id |
group |
workflow_id |
workflow_step_count |
test_retries |
| 0 |
task_1 |
Direct arithmetic |
9d418c7863d543959ae985642cbad0bb |
2 |
0 |
| 1 |
task_7 |
Compositional arithmetic |
de5ac14d307940c99e97321f9b9e705a |
3 |
0 |
| 2 |
task_8 |
Exception |
ff62880d4d93476fab5bb81802081a91 |
2 |
3 |
| 3 |
task_11 |
Symbolic calculus and trigonometry |
b4ef5b22b358428f9582b51622c88a98 |
4 |
0 |
| 4 |
task_12 |
Stress cases |
a9eb4705d55b4cceaa691a927e0a935f |
5 |
0 |
task_1: Add left and right and return the result as a single scalar value.
1. add
{
"left": "0.output.left",
"right": "0.output.right"
}
2. output_model
{
"value": "1.output.value"
}
task_7: Return the distance between the two numbers on the real line.
1. subtract
{
"left": "0.output.left",
"right": "0.output.right"
}
2. absolute_value
{
"x": "1.output.value"
}
3. output_model
{
"value": "2.output.value"
}
task_8: Compute sin(x)^2 plus cos(x)^2 for the provided x.
1. divide
{
"left": "1",
"right": "1"
}
2. output_model
{
"value": "1.output.value"
}
task_11: Return the derivative of x cubed plus sine of x evaluated at x.
1. derivative_cube
{
"x": "0.output.x"
}
2. derivative_sine
{
"x": "0.output.x"
}
3. add
{
"left": "1.output.value",
"right": "2.output.value"
}
4. output_model
{
"value": "3.output.value"
}
task_12: Multiply two positive numbers, but do it through logarithms and exponentiation rather than a direct multiply tool.
1. natural_log
{
"x": "0.output.left"
}
2. natural_log
{
"x": "0.output.right"
}
3. add
{
"left": "1.output.value",
"right": "2.output.value"
}
4. exponential
{
"x": "3.output.value"
}
5. output_model
{
"value": "4.output.value"
}
Conclusion
In this benchmark, all 15 tasks were run 5 times. 14 tasks completed in all 5 runs. task_8 completed in 4 of 5 runs and remains the main exception in this set.
task_8 is also defined in a way that lets the system in its current form exploit the test. Because all expected outputs are identical, the system can fill in the final answer directly instead of building the intended trig workflow. The numeric output is not necessarily wrong, but it is not the behavior this task was meant to reward.
This benchmark shows that, for this system, workflow composition is easier to trust when more than one test case is provided and the expected outputs vary across those cases. When every expected output is identical, a shortcut or overfitted answer may pass.