Welcome to Agential!

Benchmark Few-shot Examples

Benchmarks Number of few-shot examples
HotpotQA 6
FEVER 3
TriviaQA 4
AmbigNQ 5
GSM8k 8
SVAMP 7
TabMWP 4
MBPP 3
HumanEval 0
ALFWorld
WebShop
AgentBench
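
The counts above could be kept in a small lookup table when wiring up experiments. The following is a minimal sketch, not Agential's API: `FEWSHOT_COUNTS` and `num_fewshot` are hypothetical names introduced here, and `None` marks the benchmarks for which the table lists no count.

```python
from typing import Optional

# Few-shot example counts copied from the table above.
# None marks benchmarks whose count is not listed there.
# FEWSHOT_COUNTS is a hypothetical name, not part of Agential.
FEWSHOT_COUNTS: dict[str, Optional[int]] = {
    "HotpotQA": 6,
    "FEVER": 3,
    "TriviaQA": 4,
    "AmbigNQ": 5,
    "GSM8k": 8,
    "SVAMP": 7,
    "TabMWP": 4,
    "MBPP": 3,
    "HumanEval": 0,
    "ALFWorld": None,
    "WebShop": None,
    "AgentBench": None,
}


def num_fewshot(benchmark: str) -> int:
    """Return the listed few-shot count, raising if none is listed."""
    count = FEWSHOT_COUNTS[benchmark]  # KeyError for unknown benchmarks
    if count is None:
        raise ValueError(f"No few-shot count listed for {benchmark}")
    return count
```

Raising on a missing count, rather than silently defaulting to zero, keeps a benchmark with no listed value distinct from HumanEval, which is listed with 0 examples.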

Implementing...

  • : not tested in the original paper
  • : tested in the original paper
Methods / Benchmarks HotpotQA FEVER TriviaQA AmbigNQ
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks GSM8k SVAMP TabMWP
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks MBPP HumanEval
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks ALFWorld WebShop AgentBench
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS

Experimenting...

Methods / Benchmarks HotpotQA FEVER TriviaQA AmbigNQ
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks GSM8k SVAMP TabMWP
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks MBPP HumanEval
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS
Methods / Benchmarks ALFWorld WebShop AgentBench
ReAct
Reflexion
CRITIC
Self-Refine
ExpeL
LATS

Types of errors

CRITIC, Self-Refine

Dataset | Error Types
HumanEval | 1. Logical error; 2. Logical error; 3. Logical error; 4. No error; 5. No error
MBPP | 1. Logical error; 2. Logical error; 3. Logical error; 4. No error; 5. No error
GSM8K | 1. Code efficiency; 2. NameError (var not defined); 3. Logical error; 4. Logical error; 5. Logical error
SVAMP | 1. Logical error; 2. No error; 3. Logical error; 4. No error; 5. No error
TabMWP | 1. Incorrect answer format; 2. NameError (var not defined); 3. No error; 4. Incorrect answer format; 5. No error
AmbigNQ | 1. No error; 2. No error; 3. Incorrect answer/answer format; 4. No error; 5. Incorrect answer/answer format
HotpotQA | 1. No error; 2. No error; 3. No error; 4. Incorrect answer/answer format; 5. Incorrect answer/answer format
TriviaQA | 1. Incorrect answer; 2. No error; 3. Incorrect answer; 4. No error; 5. No error
FEVER | 1. Incorrect answer; 2. Incorrect answer format; 3. No error; 4. No error; 5. No error
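
One way to summarize the CRITIC / Self-Refine table above is to tally how often each label occurs. The sketch below does this with Python's `collections.Counter`. The labels are transcribed from the table; `CRITIC_ERRORS` and `tally` are hypothetical names used only for illustration and are not part of Agential.

```python
from collections import Counter

# Labels transcribed from the CRITIC / Self-Refine table above,
# five numbered entries per dataset. CRITIC_ERRORS is a hypothetical
# name used only for this illustration.
LOGICAL = "Logical error"
NONE = "No error"
NAME_ERR = "NameError (var not defined)"
FORMAT = "Incorrect answer format"
ANSWER = "Incorrect answer"
ANSWER_OR_FORMAT = "Incorrect answer/answer format"

CRITIC_ERRORS = {
    "HumanEval": [LOGICAL, LOGICAL, LOGICAL, NONE, NONE],
    "MBPP": [LOGICAL, LOGICAL, LOGICAL, NONE, NONE],
    "GSM8K": ["Code efficiency", NAME_ERR, LOGICAL, LOGICAL, LOGICAL],
    "SVAMP": [LOGICAL, NONE, LOGICAL, NONE, NONE],
    "TabMWP": [FORMAT, NAME_ERR, NONE, FORMAT, NONE],
    "AmbigNQ": [NONE, NONE, ANSWER_OR_FORMAT, NONE, ANSWER_OR_FORMAT],
    "HotpotQA": [NONE, NONE, NONE, ANSWER_OR_FORMAT, ANSWER_OR_FORMAT],
    "TriviaQA": [ANSWER, NONE, ANSWER, NONE, NONE],
    "FEVER": [ANSWER, FORMAT, NONE, NONE, NONE],
}


def tally(errors: dict[str, list[str]]) -> Counter:
    """Count how often each label appears across all datasets."""
    return Counter(label for labels in errors.values() for label in labels)


totals = tally(CRITIC_ERRORS)
for label, count in totals.most_common():
    print(f"{label}: {count}")
```

The same helper can be applied to a single dataset by passing a one-entry dictionary, for example `tally({"GSM8K": CRITIC_ERRORS["GSM8K"]})`.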

ReflexionCoT, ReflexionReAct

Benchmark | ReflexionCoT | ReflexionReAct
HotpotQA | 1. Misinterpretation; 2. Incorrect assumption; 3. Misinterpretation; 4. Misinterpretation; 5. Misinterpretation | 1. Misled action; 2. Misled action; 3. Misread context; 4. Wrong answer; 5. Logical error
FEVER | 1. Insufficient info; 2. Misinterpretation; 3. Insufficient info; 4. Insufficient info; 5. Misinterpretation | 1. Ignored context; 2. Insufficient info; 3. Insufficient info; 4. Ignore context; 5. Ignore context
AmbigNQ | 1. Knowledge error; 2. Knowledge error; 3. Knowledge error; 4. Misinterpret question; 5. Knowledge error | 1. Incorrect assumption/Insufficient info; 2. Insufficient info; 3. Knowledge error; 4. Incorrect answer format; 5. Misread context
TriviaQA | 1. Incorrect assumption; 2. Incorrect assumption; 3. Incorrect assumption; 4. Misinterpretation; 5. Incorrect assumption | 1. Ignore context; 2. Ignore context; 3. Ignore context; 4. Ignore context; 5. Ignore context
GSM8K | 1. Logical error; 2. Logical error; 3. Misinterpret question; 4. Logical error; 5. Misinterpret question | 1. Logical error/Misinterpret question; 2. Logical error/Misinterpret question; 3. Logical error/Re-calculation error; 4. Logical error/Re-calculation error; 5. Logical error/Misinterpret question
SVAMP | 1. Logical error; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error | 1. Misinterpret question; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error
TabMWP | 1. Incorrect operator; 2. Incorrect operator; 3. Misinterpret question; 4. Incorrect operator; 5. Logical error | 1. Misinterpret question; 2. Logical error; 3. Logical error; 4. Re-calculation error; 5. Logical error
HumanEval | 1. Conceptual error; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error | 1. Logical error; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error
MBPP | 1. Logical error; 2. Logical error; 3. Incorrect function usage; 4. Logical error; 5. Logical error | 1. Incorrect function implementation; 2. Logical error; 3. Incorrect function usage; 4. Logical error; 5. Logical error

LATS

Benchmark | LATS Reflect | LATS Value
HotpotQA | 1. Misled action; 2. Misled action; 3. Misread context; 4. Wrong answer; 5. Logical error | 1. wrong (2); 2. right (10); 3. wrong (2); 4. wrong (5); 5. wip (10)
FEVER | 1. Ignored context; 2. Insufficient info; 3. Insufficient info; 4. Ignore context; 5. Ignore context | 1. wrong (2); 2. wrong (1); 3. wrong (2); 4. wip (10); 5. right (10)
AmbigNQ | 1. Incorrect assumption/Insufficient info; 2. Insufficient info; 3. Knowledge error; 4. Incorrect answer format; 5. Misread context | 1. wrong (2); 2. wrong (3); 3. wrong (4); 4. wip (10); 5. right (10)
TriviaQA | 1. Ignore context; 2. Ignore context; 3. Ignore context; 4. Ignore context; 5. Ignore context | 1. wrong (2); 2. wrong (2); 3. wrong (3); 4. wip (10); 5. right (10)
GSM8K | 1. Logical error/Misinterpret question; 2. Logical error/Misinterpret question; 3. Logical error/Re-calculation error; 4. Logical error/Re-calculation error; 5. Logical error/Misinterpret question | 1. wrong (1); 2. wrong (2); 3. wrong (6); 4. wip (10); 5. right (10)
SVAMP | 1. Misinterpret question; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error | 1. wrong (4); 2. wrong (3); 3. wrong (4); 4. wip (10); 5. right (10)
TabMWP | 1. Misinterpret question; 2. Logical error; 3. Logical error; 4. Re-calculation error; 5. Logical error | 1. wrong (3); 2. wrong (4); 3. wrong (3); 4. wip (7); 5. right (10)
HumanEval | 1. Logical error; 2. Logical error; 3. Logical error; 4. Logical error; 5. Logical error | 1. wrong (2); 2. wrong (4); 3. wrong (3); 4. wip (5); 5. right (10)
MBPP | 1. Incorrect function implementation; 2. Logical error; 3. Incorrect function usage; 4. Logical error; 5. Logical error | 1. wrong (3); 2. wrong (2); 3. wrong (3); 4. wip (9); 5. right (10)
