path: root/benchmark
Age | Commit message | Author | Files | Lines
2024-02-29 | fix(benchmark/reports): Resolve error in format.py on `attempt.cost` is `None` | Reinier van der Leer | 1 | -1/+2
2024-02-20 | fix(benchmark/reports): Make format.py executable | Reinier van der Leer | 1 | -0/+2
2024-02-20 | fix(benchmark/challenges): Improve spec and eval of TicTacToe challenge | Albert Örwall | 2 | -2/+2
2024-02-19 | feat(benchmark): Add reports/format.py script to convert report.json to markdown | Reinier van der Leer | 1 | -0/+136
2024-02-19 | feat(benchmark): Include Steps in Report | Reinier van der Leer | 4 | -1/+16
2024-02-18 | debug(benchmark): Improve `TestResult` validation error output format | Reinier van der Leer | 1 | -5/+8
2024-02-17 | debug(benchmark): Add more debug code to pinpoint cause of rare crash | Reinier van der Leer | 2 | -15/+23
2024-02-17 | debug(benchmark): Make sure `TestResult` validator error output is sufficient... | Reinier van der Leer | 1 | -1/+1
2024-02-17 | debug(benchmark): Add log statement to validator on `TestResult` | Reinier van der Leer | 1 | -0/+8
2024-02-16 | fix(benchmark): Fix `TestResult.fail_reason` assignment condition | Reinier van der Leer | 1 | -1/+1
2024-02-16 | fix(benchmark): Unbreak `-N`/`--attempts` option | Reinier van der Leer | 3 | -4/+4
2024-02-16 | feat(benchmark): Get agent task cost from `Step.additional_output` | Reinier van der Leer | 3 | -0/+18
2024-02-16 | feat(benchmark/report): Add and record `TestResult.n_steps` | Reinier van der Leer | 4 | -0/+9
2024-02-16 | lint(benchmark): Remove unnecessary `pass` statement in __main__.py | Reinier van der Leer | 1 | -1/+0
2024-02-16 | fix(benchmark): Include `WebArenaSiteInfo.additional_info` (e.g. credentials)... | Reinier van der Leer | 1 | -7/+19
2024-02-16 | feat(benchmark/cli): Add `challenge list`, `challenge info` subcommands | Reinier van der Leer | 5 | -6/+219
2024-02-16 | refactor(benchmark): `load_webarena_challenges` | Reinier van der Leer | 2 | -22/+43
2024-02-15 | feat(benchmark): Make report output folder configurable | Reinier van der Leer | 5 | -8/+15
2024-02-14 | lint(benchmark): Remove unused imports | Reinier van der Leer | 2 | -2/+1
2024-02-14 | fix(benchmark): Mock mode, python evals, `--attempts` flag, challenge definit... | Reinier van der Leer | 6 | -44/+63
2024-02-13 | chore(benchmark): Update `python-multipart` dependency to mitigate vulnerability | Reinier van der Leer | 2 | -6/+6
2024-02-13 | chore(benchmark): Update `aiohttp` and `fastapi` dependencies to mitigate vul... | Reinier van der Leer | 2 | -92/+92
2024-01-22 | feat(benchmark): Add `-N`, `--attempts` option for multiple attempts per chal... | Reinier van der Leer | 12 | -137/+177
2024-01-19 | feat(benchmark): JungleGym WebArena (#6691) | Reinier van der Leer | 4 | -1/+1005
2024-01-19 | fix(benchmark/report): Fix and clean up logic in `update_challenges_already_b... | Reinier van der Leer | 1 | -8/+5
2024-01-19 | fix(benchmark): Fix challenge input artifact upload | Reinier van der Leer | 1 | -1/+3
2024-01-18 | refactor(benchmark): Interface & type consoledation, and arch change, to allo... | Reinier van der Leer | 16 | -814/+923
2024-01-16 | chore(benchmark): Upgrade OpenAI client lib from v0 to v1 | Reinier van der Leer | 4 | -23/+33
2024-01-16 | refactor(benchmark): Disable Helicone integrations | Reinier van der Leer | 5 | -160/+123
2024-01-02 | AGBenchmark codebase clean-up (#6650) | Reinier van der Leer | 46 | -7749/+2119
2023-11-21 | Clean up & fix GitHub workflows (#6313) | Reinier van der Leer | 5 | -8/+10
2023-11-09 | fix: Fixing Benchmarking | SwiftyOS | 2 | -0/+6
2023-10-20 | reverting new challenges | Silen Naihin | 6 | -107/+0
2023-10-20 | case sensitivity, updating challenges | Silen Naihin | 7 | -1/+13
2023-10-20 | fix capitalization, rename | Silen Naihin | 2 | -2/+1
2023-10-19 | fix data challenges | Silen Naihin | 2 | -2/+2
2023-10-19 | scrape synthesize challenge additions | Silen Naihin | 10 | -5/+133
2023-10-17 | fixing password gen and revenue retrieval 2 challenges | Silen Naihin | 2 | -16/+17
2023-10-17 | Fix subproject dependency compatibility | Reinier van der Leer | 2 | -11/+31
2023-10-14 | Update data.json | Silen Naihin | 1 | -1/+1
2023-10-13 | Update test.py (#5721) | merwanehamadi | 1 | -6/+2
2023-10-09 | fix label csv (#5656) | merwanehamadi | 1 | -12/+12
2023-10-06 | Fix password generator (#5581) | merwanehamadi | 1 | -6/+2
2023-10-06 | Fix agbenchmark client (#5578) | merwanehamadi | 1 | -12/+1
2023-10-05 | Fix challenges (#5561) | merwanehamadi | 2 | -5/+5
2023-10-03 | Fix custom_python not being copied (#5512) | merwanehamadi | 2 | -6/+12
2023-10-02 | Correct create_game method definition in the challenge input (#5460) | Albert Örwall | 1 | -1/+1
2023-10-02 | Fix benchmark ci (#5478) | merwanehamadi | 2 | -3/+4
2023-10-02 | add load_dotenv (#5474) | merwanehamadi | 1 | -0/+2
2023-10-02 | Correct revenue retrieval challenge (#5471) | merwanehamadi | 2 | -2/+2