A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.