Evaluating LLM outputs accurately is critical to iterating quickly on an LLM system. Human annotation can be slow and expensive, and using LLMs as judges instead promises to solve this. However, aligning an LLM Judge with human judgements is often hard, with many implementation details to consider.

During the hackathon, let's try to build LLM Judges together and move the field forward a little by:
- Productionizing the latest LLM-as-a-judge research
- Improving on your existing judge
- Building annotation UIs
- Designing wireframes for collaborative annotation between humans and AI

This hackathon is for you if you are an AI Engineer who:
- Runs LLMs in production or is planning to soon
- Has LLM Judges and has found them to be unreliable
- Wants to learn more about using LLMs as a judge
- Is an LLM Judge skeptic
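
If you want a concrete picture of the alignment problem, here is a rough, non-authoritative Python sketch of the core loop: run a judge over a handful of human-labelled examples and measure how well it agrees with the humans. The `call_judge_llm` stub and the example data are hypothetical; you would swap in a real LLM call (and probably trace it with W&B Weave).

```python
def call_judge_llm(question: str, answer: str) -> str:
    """Placeholder judge returning a 'pass'/'fail' verdict. Replace with a real LLM call."""
    return "pass" if len(answer) > 2 else "fail"  # toy heuristic, not a real judge

# A handful of human-labelled examples: the ground truth the judge must align with.
examples = [
    {"question": "What is 2 + 2?",     "answer": "4",     "label": "pass"},
    {"question": "What is 2 + 2?",     "answer": "5",     "label": "fail"},
    {"question": "Capital of France?", "answer": "Paris", "label": "pass"},
    {"question": "Capital of France?", "answer": "Lyon",  "label": "fail"},
]

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two lists of labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

judge_labels = [call_judge_llm(ex["question"], ex["answer"]) for ex in examples]
human_labels = [ex["label"] for ex in examples]

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(examples)
print(f"raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {cohens_kappa(judge_labels, human_labels):.2f}")
```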

Judges:
Greg Kamradt, Founder, Data Independent
Eugene Yan, Senior Applied Scientist, Amazon
Charles Frye, AI Engineer, Modal Labs
Shreya Shankar, ML Engineer, PhD at UC Berkeley
Shawn Lewis, CTO and Co-founder, W&B
Anish Shah, Growth ML Engineer, W&B
Tim Sweeney, Staff Software Engineer, W&B

Rules:
- New projects only
- Maximum team size: 4
- Make friends
- Prize eligibility:
  - Project is open sourced on GitHub
  - Use W&B Weave where applicable
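
Not sure what using Weave looks like in practice? Here is a minimal sketch based on Weave's quickstart pattern of `weave.init` plus the `@weave.op` decorator; the project name and `toy_judge` function are placeholders, and you need to be logged in to W&B for traces to show up.

```python
import weave

# Placeholder judge so there is something to trace; swap in your real LLM call.
@weave.op()
def toy_judge(question: str, answer: str) -> dict:
    verdict = "pass" if answer.strip() else "fail"
    return {"verdict": verdict, "question": question}

# Hypothetical project name; requires being logged in to W&B.
weave.init("llm-judge-hackathon")

# Every call to toy_judge is now traced and browsable in the Weave UI.
print(toy_judge("What is the capital of France?", "Paris"))
```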

LLM API credits will be provided to those who need them.

$5,000 in cash-equivalent prizes will be awarded to the top 3 overall projects, with a bonus category for the most on-theme projects.

Timing:
Saturday, Sept 21: 10am-10pm
Sunday, Sept 22: 9:30am-5pm
Please register here.