Your team must create a ground-truth sentiment dataset for 10,000 social-media posts. Because of budget limits, you can hire no more than three crowd workers per post, but the chief data scientist insists on at least 95% label accuracy before the data are used for model training. Which strategy should you implement first to ensure label quality without exceeding the budget?
Pre-train a weak language model on synthetic data and automatically overwrite any crowd label whose predicted probability is below 0.95.
Have every post labeled twice and keep only labels from pairs whose Cohen's kappa exceeds 0.8.
Increase the pay rate to attract experienced annotators but assign each post to a single worker to stay within budget.
Mix a hidden set of expert-labeled "gold" posts into each task and block annotators whose accuracy on these posts falls below a defined threshold.
Seeding each annotation batch with expert-labeled gold (honeypot) items lets you measure every worker's accuracy in real time against an objective standard and quickly disqualify low-quality annotators. This prevents large volumes of noisy labels from entering the dataset while keeping the per-post worker count within budget. The other options fall short: computing inter-annotator agreement only detects problems after the labels have been collected, auto-correcting labels with a weak model can amplify its own errors, and paying more for a single annotator removes the redundancy needed to verify quality.
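As a rough illustration of how gold seeding works in practice, the Python sketch below mixes hidden expert-labeled posts into each batch and blocks any worker whose running accuracy on those posts drops below a threshold. The field names (`is_gold`, `gold_label`), the gold ratio, and the minimum-judgment guard are illustrative assumptions, not part of any specific platform's API.

```python
import random
from collections import defaultdict

# Hypothetical quality-control parameters (illustrative values).
GOLD_RATIO = 0.1           # fraction of each batch drawn from expert-labeled gold posts
ACCURACY_THRESHOLD = 0.95  # workers below this accuracy on gold items are blocked


def build_batch(unlabeled_posts, gold_posts, batch_size=20):
    """Mix hidden gold items into a batch of regular posts and shuffle them."""
    n_gold = max(1, int(batch_size * GOLD_RATIO))
    batch = random.sample(unlabeled_posts, batch_size - n_gold)
    batch += random.sample(gold_posts, n_gold)
    random.shuffle(batch)  # workers cannot tell gold items from regular ones
    return batch


class WorkerMonitor:
    """Track each worker's accuracy on gold items and block low performers."""

    def __init__(self):
        self.correct = defaultdict(int)
        self.seen = defaultdict(int)
        self.blocked = set()

    def record(self, worker_id, post, submitted_label):
        if not post.get("is_gold"):
            return  # only gold items contribute to the accuracy estimate
        self.seen[worker_id] += 1
        if submitted_label == post["gold_label"]:
            self.correct[worker_id] += 1
        accuracy = self.correct[worker_id] / self.seen[worker_id]
        # Require a few gold judgments before acting, to avoid blocking on noise.
        if self.seen[worker_id] >= 5 and accuracy < ACCURACY_THRESHOLD:
            self.blocked.add(worker_id)

    def is_blocked(self, worker_id):
        return worker_id in self.blocked
```

The minimum-judgment guard (here, five gold items) is a common design choice: it keeps one unlucky answer from disqualifying an otherwise reliable annotator while still catching consistently poor work early.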