Large language model (LLM) training and fine-tuning depend critically on the quality of the underlying datasets. Imperfect datasets can introduce noise, bias, and safety risks that propagate through model behavior. This paper presents Data Classifier, a pragmatic rule-based system designed to label and triage candidate training examples for LLM datasets into categories such as unsafe, spam, sensitive, containing personally identifiable information (PII), and syntactically malformed. The system combines deterministic heuristics, lightweight linguistic analysis, and optional model-backed checks to produce interpretable and reproducible labels suitable for dataset curation, filtering, and downstream auditing. We describe the conceptual design, implementation considerations, detailed heuristics, and evaluation methodology for assessing classifier utility in curation workflows. Rather than reporting definitive empirical performance numbers—which require large-scale annotated corpora and controlled experiments beyond the scope of this manuscript—we provide a rigorous experimental protocol, qualitative analyses of representative examples, and a discussion of limitations, trade-offs, and integration strategies. We conclude that a transparent rule-based classifier offers a useful first-pass filter that complements model-based detectors and human review, enabling more efficient and accountable dataset preparation for LLM training.
The source code is available at github.com/Pro-GenAI/DataClassifier.
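To illustrate the kind of first-pass, rule-based labeling the abstract describes, the sketch below applies deterministic heuristics to assign interpretable category labels to a candidate example. The specific patterns, category names, and the `label_example` function are illustrative assumptions for this sketch, not the rules used in the DataClassifier repository.

```python
import re

# Hypothetical heuristic patterns; the actual rules in the DataClassifier
# repository may differ. This sketches the deterministic, interpretable
# first-pass labeling described in the abstract.
PII_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PII_PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
SPAM_TERMS = ("buy now", "click here", "limited offer")


def label_example(text: str) -> list[str]:
    """Return deterministic, reproducible labels for one candidate example."""
    labels = []
    # PII heuristic: email addresses or US-style phone numbers.
    if PII_EMAIL.search(text) or PII_PHONE.search(text):
        labels.append("pii")
    # Spam heuristic: presence of common promotional phrases.
    lowered = text.lower()
    if any(term in lowered for term in SPAM_TERMS):
        labels.append("spam")
    # Syntactic-malformation heuristic: empty text or unbalanced parentheses.
    if not text.strip() or text.count("(") != text.count(")"):
        labels.append("malformed")
    return labels or ["clean"]


print(label_example("Contact me at alice@example.com, click here!"))
# → ['pii', 'spam']
```

Because each label traces back to a named rule, a curator can audit why an example was filtered, which is the interpretability advantage the paper argues for over opaque model-based detectors.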
Keywords: Large Language Models, LLMs, data curation, dataset labeling, rule-based classifier, language model safety, data hygiene, training data filtering, interpretability, Artificial Intelligence, Generative AI
Full paper PDF: "Data Classifier: An AI-driven approach to Label LLM Training Data"