E2E Test Results

4 run(s) recorded

Total Runs: 4
Model Tiers: 3
Run | Date | Test | weakest | mid | strongest
23944013461 | 2026-04-03 11:20 | pytest
dev-record/full_workflow
generator-coding/library_generator-baseline
generator-coding/library_generator-with_skill
review-skill/review_finds_seeded_issues
review-steps/review_preserves_vocabulary
review-skill/review_skill
review-steps/review
total: weakest $0.6826 · 346s · 69t | mid $2.5124 · 900s · 85t | strongest $1.4135 · 403s · 77t
23685733627 | 2026-03-28 13:06 | pytest
dev-record/full_workflow
generator-coding/library_generator-baseline
generator-coding/library_generator-with_skill
review-skill/review_finds_seeded_issues
review-steps/review_preserves_vocabulary
review-skill/review_skill
review-steps/review
total: weakest $0.2755 · 186s · 38t | mid $0.3432 · 205s · 28t | strongest $0.6346 · 245s · 40t
23567439585 | 2026-03-25 22:37 | pytest
dev-record/full_workflow
generator-coding/library_generator-baseline
generator-coding/library_generator-with_skill
review-skill/review_finds_seeded_issues
review-steps/review_preserves_vocabulary
review-skill/review_skill
review-steps/review
total: weakest $0.2532 · 188s · 33t | mid $0.5236 · 274s · 38t | strongest $0.5632 · 207s · 39t
23401521620 | 2026-03-22 11:02 | pytest
dev-record/full_workflow
generator-coding/library_generator-baseline
generator-coding/library_generator-with_skill
review-skill/review_finds_seeded_issues
review-steps/review_preserves_vocabulary
review-skill/review_skill
review-steps/review
total: weakest $0.2437 · 188s · 36t | mid $0.5255 · 286s · 38t | strongest $0.6055 · 223s · 38t