Name the project where we did some annotations on a small budget, aiming to do a really good job for them and get our foot in the door.
Bouquet-Pilot
Anything using an adversarial mindset.
What is an uplift study, and how does it differ from a red team? (Kind of a trick question.)
A comparison between two groups. A red team CAN be an uplift study.
Which benchmark created scientific protocols and then turned them into questions?
Proxy Evals Part III
Which country did we study in Horizon Scan? (Bonus: why?)
PRC
The cheapest languages project we've done.
Documentary
After a red team, there is another step where reviewers grade how well the red teamers did. What is this step called?
White team
What is the largest uplift study we have ever done?
Amazon
Find the budget for a benchmark where we developed 5 prompts and put it in chat.
FMF
Which project was a scale-up of project Biblio?
Dangerous Documents
The most expensive languages project we've done.
Languages.
What are typical ways that we support Nemesys during their red teams? (3 right answers)
Tech, onboarding, communications
We did 4 different uplift studies during Zookeeper. Why did we do 4?
AISF
What was the point of Dangerous Documents?
Pre-training data filtration
Which project involved 100+ languages that we had to vet for accuracy?
Validation
What was the red team where the model broke, so we had to resort to using ChatGPT?
Amazon (Fall 2024)
We did 1 uplift study in a wet lab. What was it called AND who was our partner?
TS2
What is the difference between a safeguard evaluation (benchmark) and a capability evaluation (benchmark)?
Safeguard evaluations are usually just binary: does your model respond? Capability evaluations are more nuanced.
Freebie: who absolutely crushed project management for a project where we validated 100+ languages? (Two answers)
Helen & Robin!
Find the budget for a project where we did translations in 120+ languages and put it in chat!
Olivia will check.
Find the materials used by red-teamers during an election workshop and drop the link in chat.
Olivia will check.
Which uplift study did we have the biggest problem on, and (bonus) why?
TS1 - repeatability of results / SMEs being bad at red teams.
Why would you use a benchmark vs. a red team?
Benchmarks are faster and more scalable.
What was the biggest pain point on the Meta prompt dataset?
1,000 prompts required a lot of QC.