Welcome to Agential!
Benchmark Few-shot Examples
| Benchmarks | Number of few-shot examples |
|---|---|
| HotpotQA | 6 |
| FEVER | 3 |
| TriviaQA | 4 |
| AmbigNQ | 5 |
| GSM8k | 8 |
| SVAMP | 7 |
| TabMWP | 4 |
| MBPP | 3 |
| HumanEval | 0 |
| ALFWorld | |
| WebShop | |
| AgentBench | |
Implementing...
- : not tested in the original paper
- : tested in the original paper
| Methods / Benchmarks | HotpotQA | FEVER | TriviaQA | AmbigNQ |
|---|---|---|---|---|
| ReAct | ||||
| Reflexion | ||||
| CRITIC | ||||
| Self-Refine | ||||
| ExpeL | ||||
| LATS | ||||
| Methods / Benchmarks | GSM8k | SVAMP | TabMWP |
|---|---|---|---|
| ReAct | |||
| Reflexion | |||
| CRITIC | |||
| Self-Refine | |||
| ExpeL | |||
| LATS | |||
| Methods / Benchmarks | MBPP | HumanEval |
|---|---|---|
| ReAct | ||
| Reflexion | ||
| CRITIC | ||
| Self-Refine | ||
| ExpeL | ||
| LATS | ||
| Methods / Benchmarks | ALFWorld | WebShop | AgentBench |
|---|---|---|---|
| ReAct | |||
| Reflexion | |||
| CRITIC | |||
| Self-Refine | |||
| ExpeL | |||
| LATS | |||
Experimenting...
| Methods / Benchmarks | HotpotQA | FEVER | TriviaQA | AmbigNQ |
|---|---|---|---|---|
| ReAct | ||||
| Reflexion | ||||
| CRITIC | ||||
| Self-Refine | ||||
| ExpeL | ||||
| LATS | ||||
| Methods / Benchmarks | GSM8k | SVAMP | TabMWP |
|---|---|---|---|
| ReAct | |||
| Reflexion | |||
| CRITIC | |||
| Self-Refine | |||
| ExpeL | |||
| LATS | |||
| Methods / Benchmarks | MBPP | HumanEval |
|---|---|---|
| ReAct | ||
| Reflexion | ||
| CRITIC | ||
| Self-Refine | ||
| ExpeL | ||
| LATS | ||
| Methods / Benchmarks | ALFWorld | WebShop | AgentBench |
|---|---|---|---|
| ReAct | |||
| Reflexion | |||
| CRITIC | |||
| Self-Refine | |||
| ExpeL | |||
| LATS | |||
Types of errors
CRITIC, Self-Refine
| Dataset | Error Types |
|---|---|
| HumanEval | 1. Logical error 2. Logical error 3. Logical error 4. No error 5. No error |
| MBPP | 1. Logical error 2. Logical error 3. Logical error 4. No error 5. No error |
| GSM8K | 1. Code efficiency 2. NameError (var not defined) 3. Logical error 4. Logical error 5. Logical error |
| SVAMP | 1. Logical error 2. No error 3. Logical error 4. No error 5. No error |
| TabMWP | 1. Incorrect answer format 2. NameError (var not defined) 3. No error 4. Incorrect answer format 5. No error |
| AmbigNQ | 1. No error 2. No error 3. Incorrect answer/answer format 4. No error 5. Incorrect answer/answer format |
| HotpotQA | 1. No error 2. No error 3. No error 4. Incorrect answer/answer format 5. Incorrect answer/answer format |
| TriviaQA | 1. Incorrect answer 2. No error 3. Incorrect answer 4. No error 5. No error |
| FEVER | 1. Incorrect answer 2. Incorrect answer format 3. No error 4. No error 5. No error |
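
Each cell above packs five numbered error labels into one string. As a minimal sketch of how such a cell could be split and tallied, the snippet below parses the GSM8K row. The helper name `parse_numbered_cell` is hypothetical and not part of Agential; it assumes labels never contain a standalone `N.` token.

```python
import re
from collections import Counter
from typing import List


def parse_numbered_cell(cell: str) -> List[str]:
    """Split '1. A 2. B ...' into ['A', 'B', ...] (hypothetical helper)."""
    # Split on the numbered markers ("1. ", " 2. ", ...) and drop empty pieces.
    parts = re.split(r"(?:^|\s+)\d+\.\s+", cell.strip())
    return [p.strip() for p in parts if p.strip()]


# The GSM8K row from the CRITIC, Self-Refine table.
gsm8k_cell = (
    "1. Code efficiency 2. NameError (var not defined) "
    "3. Logical error 4. Logical error 5. Logical error"
)

labels = parse_numbered_cell(gsm8k_cell)
counts = Counter(labels)

for label, n in counts.most_common():
    print(f"{label}: {n}")
```

The same helper applies to any row that follows the `1. ... 2. ...` layout, which makes it easy to compare error distributions across benchmarks.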
ReflexionCoT, ReflexionReAct
| Benchmark | ReflexionCoT | ReflexionReAct |
|---|---|---|
| HotpotQA | 1. Misinterpretation 2. Incorrect assumption 3. Misinterpretation 4. Misinterpretation 5. Misinterpretation | 1. Misled action 2. Misled action 3. Misread context 4. Wrong answer 5. Logical error |
| FEVER | 1. Insufficient info 2. Misinterpretation 3. Insufficient info 4. Insufficient info 5. Misinterpretation | 1. Ignored context 2. Insufficient info 3. Insufficient info 4. Ignore context 5. Ignore context |
| AmbigNQ | 1. Knowledge error 2. Knowledge error 3. Knowledge error 4. Misinterpret question 5. Knowledge error | 1. Incorrect assumption/Insufficient info 2. Insufficient info 3. Knowledge error 4. Incorrect answer format 5. Misread context |
| TriviaQA | 1. Incorrect assumption 2. Incorrect assumption 3. Incorrect assumption 4. Misinterpretation 5. Incorrect assumption | 1. Ignore context 2. Ignore context 3. Ignore context 4. Ignore context 5. Ignore context |
| GSM8K | 1. Logical error 2. Logical error 3. Misinterpret question 4. Logical error 5. Misinterpret question | 1. Logical error/Misinterpret question 2. Logical error/Misinterpret question 3. Logical error/Re-calculation error 4. Logical error/Re-calculation error 5. Logical error/Misinterpret question |
| SVAMP | 1. Logical error 2. Logical error 3. Logical error 4. Logical error 5. Logical error | 1. Misinterpret question 2. Logical error 3. Logical error 4. Logical error 5. Logical error |
| TabMWP | 1. Incorrect operator 2. Incorrect operator 3. Misinterpret question 4. Incorrect operator 5. Logical error | 1. Misinterpret question 2. Logical error 3. Logical error 4. Re-calculation error 5. Logical error |
| HumanEval | 1. Conceptual error 2. Logical error 3. Logical error 4. Logical error 5. Logical error | 1. Logical error 2. Logical error 3. Logical error 4. Logical error 5. Logical error |
| MBPP | 1. Logical error 2. Logical error 3. Incorrect function usage 4. Logical error 5. Logical error | 1. Incorrect function implementation 2. Logical error 3. Incorrect function usage 4. Logical error 5. Logical error |
LATS
| Benchmark | LATS Reflect | LATS Value |
|---|---|---|
| HotpotQA | 1. Misled action 2. Misled action 3. Misread context 4. Wrong answer 5. Logical error | 1. wrong (2) 2. right (10) 3. wrong (2) 4. wrong (5) 5. wip (10) |
| FEVER | 1. Ignored context 2. Insufficient info 3. Insufficient info 4. Ignore context 5. Ignore context | 1. wrong (2) 2. wrong (1) 3. wrong (2) 4. wip (10) 5. right (10) |
| AmbigNQ | 1. Incorrect assumption/Insufficient info 2. Insufficient info 3. Knowledge error 4. Incorrect answer format 5. Misread context | 1. wrong (2) 2. wrong (3) 3. wrong (4) 4. wip (10) 5. right (10) |
| TriviaQA | 1. Ignore context 2. Ignore context 3. Ignore context 4. Ignore context 5. Ignore context | 1. wrong (2) 2. wrong (2) 3. wrong (3) 4. wip (10) 5. right (10) |
| GSM8K | 1. Logical error/Misinterpret question 2. Logical error/Misinterpret question 3. Logical error/Re-calculation error 4. Logical error/Re-calculation error 5. Logical error/Misinterpret question | 1. wrong (1) 2. wrong (2) 3. wrong (6) 4. wip (10) 5. right (10) |
| SVAMP | 1. Misinterpret question 2. Logical error 3. Logical error 4. Logical error 5. Logical error | 1. wrong (4) 2. wrong (3) 3. wrong (4) 4. wip (10) 5. right (10) |
| TabMWP | 1. Misinterpret question 2. Logical error 3. Logical error 4. Re-calculation error 5. Logical error | 1. wrong (3) 2. wrong (4) 3. wrong (3) 4. wip (7) 5. right (10) |
| HumanEval | 1. Logical error 2. Logical error 3. Logical error 4. Logical error 5. Logical error | 1. wrong (2) 2. wrong (4) 3. wrong (3) 4. wip (5) 5. right (10) |
| MBPP | 1. Incorrect function implementation 2. Logical error 3. Incorrect function usage 4. Logical error 5. Logical error | 1. wrong (3) 2. wrong (2) 3. wrong (3) 4. wip (9) 5. right (10) |
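
The LATS Value entries pair a label with a parenthesized number, for example `wrong (2)`. The page does not define either part, so the sketch below only assumes that the word is a status label and the number is an integer score. The names `ValueEntry` and `parse_value_cell` are hypothetical and not part of Agential.

```python
import re
from typing import List, NamedTuple


class ValueEntry(NamedTuple):
    label: str  # assumed status label, e.g. "wrong", "right", "wip"
    score: int  # assumed integer score taken from the parentheses


# Matches one numbered entry such as "3. wrong (2)".
ENTRY_RE = re.compile(r"\d+\.\s+(\w+)\s+\((\d+)\)")


def parse_value_cell(cell: str) -> List[ValueEntry]:
    """Parse '1. wrong (2) 2. right (10) ...' into ValueEntry tuples."""
    return [ValueEntry(label, int(score)) for label, score in ENTRY_RE.findall(cell)]


# The HotpotQA row from the LATS Value column.
hotpotqa_cell = "1. wrong (2) 2. right (10) 3. wrong (2) 4. wrong (5) 5. wip (10)"
entries = parse_value_cell(hotpotqa_cell)

for entry in entries:
    print(entry.label, entry.score)
```

Keeping the label and the number as separate fields lets either one be filtered or aggregated without re-parsing the table text.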
For full documentation visit mkdocs.org.
Commands
- `mkdocs new [dir-name]` - Create a new project.
- `mkdocs serve` - Start the live-reloading docs server.
- `mkdocs build` - Build the documentation site.
- `mkdocs -h` - Print help message and exit.
Project layout
    mkdocs.yml    # The configuration file.
    docs/
        index.md  # The documentation homepage.
        ...       # Other markdown pages, images and other files.