The Dataset Strikes Back

This dataset — whose name combines the acronym of a German nonprofit research organization with a number reflecting its enormous scale — became a case study in how privacy problems, consent failures, and harmful content accumulate when AI training data is scraped from the open web without meaningful oversight. It was also the dataset at the center of Rachel Hong's research on CLIP-filtering and data consent mechanisms.

What is LAION-5B?
