index : Auto-GPT.git
An experimental open-source attempt to make GPT-4 fully autonomous.
Torantulino
path: root/benchmark
| Age | Commit message | Author | Files | Lines |
|---|---|---|---|---|
| 2024-02-29 | fix(benchmark/reports): Resolve error in format.py on `attempt.cost` is `None` | Reinier van der Leer | 1 | -1/+2 |
| 2024-02-20 | fix(benchmark/reports): Make format.py executable | Reinier van der Leer | 1 | -0/+2 |
| 2024-02-20 | fix(benchmark/challenges): Improve spec and eval of TicTacToe challenge | Albert Örwall | 2 | -2/+2 |
| 2024-02-19 | feat(benchmark): Add reports/format.py script to convert report.json to markdown | Reinier van der Leer | 1 | -0/+136 |
| 2024-02-19 | feat(benchmark): Include Steps in Report | Reinier van der Leer | 4 | -1/+16 |
| 2024-02-18 | debug(benchmark): Improve `TestResult` validation error output format | Reinier van der Leer | 1 | -5/+8 |
| 2024-02-17 | debug(benchmark): Add more debug code to pinpoint cause of rare crash | Reinier van der Leer | 2 | -15/+23 |
| 2024-02-17 | debug(benchmark): Make sure `TestResult` validator error output is sufficient... | Reinier van der Leer | 1 | -1/+1 |
| 2024-02-17 | debug(benchmark): Add log statement to validator on `TestResult` | Reinier van der Leer | 1 | -0/+8 |
| 2024-02-16 | fix(benchmark): Fix `TestResult.fail_reason` assignment condition | Reinier van der Leer | 1 | -1/+1 |
| 2024-02-16 | fix(benchmark): Unbreak `-N`/`--attempts` option | Reinier van der Leer | 3 | -4/+4 |
| 2024-02-16 | feat(benchmark): Get agent task cost from `Step.additional_output` | Reinier van der Leer | 3 | -0/+18 |
| 2024-02-16 | feat(benchmark/report): Add and record `TestResult.n_steps` | Reinier van der Leer | 4 | -0/+9 |
| 2024-02-16 | lint(benchmark): Remove unnecessary `pass` statement in __main__.py | Reinier van der Leer | 1 | -1/+0 |
| 2024-02-16 | fix(benchmark): Include `WebArenaSiteInfo.additional_info` (e.g. credentials)... | Reinier van der Leer | 1 | -7/+19 |
| 2024-02-16 | feat(benchmark/cli): Add `challenge list`, `challenge info` subcommands | Reinier van der Leer | 5 | -6/+219 |
| 2024-02-16 | refactor(benchmark): `load_webarena_challenges` | Reinier van der Leer | 2 | -22/+43 |
| 2024-02-15 | feat(benchmark): Make report output folder configurable | Reinier van der Leer | 5 | -8/+15 |
| 2024-02-14 | lint(benchmark): Remove unused imports | Reinier van der Leer | 2 | -2/+1 |
| 2024-02-14 | fix(benchmark): Mock mode, python evals, `--attempts` flag, challenge definit... | Reinier van der Leer | 6 | -44/+63 |
| 2024-02-13 | chore(benchmark): Update `python-multipart` dependency to mitigate vulnerability | Reinier van der Leer | 2 | -6/+6 |
| 2024-02-13 | chore(benchmark): Update `aiohttp` and `fastapi` dependencies to mitigate vul... | Reinier van der Leer | 2 | -92/+92 |
| 2024-01-22 | feat(benchmark): Add `-N`, `--attempts` option for multiple attempts per chal... | Reinier van der Leer | 12 | -137/+177 |
| 2024-01-19 | feat(benchmark): JungleGym WebArena (#6691) | Reinier van der Leer | 4 | -1/+1005 |
| 2024-01-19 | fix(benchmark/report): Fix and clean up logic in `update_challenges_already_b... | Reinier van der Leer | 1 | -8/+5 |
| 2024-01-19 | fix(benchmark): Fix challenge input artifact upload | Reinier van der Leer | 1 | -1/+3 |
| 2024-01-18 | refactor(benchmark): Interface & type consoledation, and arch change, to allo... | Reinier van der Leer | 16 | -814/+923 |
| 2024-01-16 | chore(benchmark): Upgrade OpenAI client lib from v0 to v1 | Reinier van der Leer | 4 | -23/+33 |
| 2024-01-16 | refactor(benchmark): Disable Helicone integrations | Reinier van der Leer | 5 | -160/+123 |
| 2024-01-02 | AGBenchmark codebase clean-up (#6650) | Reinier van der Leer | 46 | -7749/+2119 |
| 2023-11-21 | Clean up & fix GitHub workflows (#6313) | Reinier van der Leer | 5 | -8/+10 |
| 2023-11-09 | fix: Fixing Benchmarking | SwiftyOS | 2 | -0/+6 |
| 2023-10-20 | reverting new challenges | Silen Naihin | 6 | -107/+0 |
| 2023-10-20 | case sensitivity, updating challenges | Silen Naihin | 7 | -1/+13 |
| 2023-10-20 | fix capitalization, rename | Silen Naihin | 2 | -2/+1 |
| 2023-10-19 | fix data challenges | Silen Naihin | 2 | -2/+2 |
| 2023-10-19 | scrape synthesize challenge additions | Silen Naihin | 10 | -5/+133 |
| 2023-10-17 | fixing password gen and revenue retrieval 2 challenges | Silen Naihin | 2 | -16/+17 |
| 2023-10-17 | Fix subproject dependency compatibility | Reinier van der Leer | 2 | -11/+31 |
| 2023-10-14 | Update data.json | Silen Naihin | 1 | -1/+1 |
| 2023-10-13 | Update test.py (#5721) | merwanehamadi | 1 | -6/+2 |
| 2023-10-09 | fix label csv (#5656) | merwanehamadi | 1 | -12/+12 |
| 2023-10-06 | Fix password generator (#5581) | merwanehamadi | 1 | -6/+2 |
| 2023-10-06 | Fix agbenchmark client (#5578) | merwanehamadi | 1 | -12/+1 |
| 2023-10-05 | Fix challenges (#5561) | merwanehamadi | 2 | -5/+5 |
| 2023-10-03 | Fix custom_python not being copied (#5512) | merwanehamadi | 2 | -6/+12 |
| 2023-10-02 | Correct create_game method definition in the challenge input (#5460) | Albert Örwall | 1 | -1/+1 |
| 2023-10-02 | Fix benchmark ci (#5478) | merwanehamadi | 2 | -3/+4 |
| 2023-10-02 | add load_dotenv (#5474) | merwanehamadi | 1 | -0/+2 |
| 2023-10-02 | Correct revenue retrieval challenge (#5471) | merwanehamadi | 2 | -2/+2 |