Data Annotation

Overview


Annotated data pairs each raw record with a label, the correct answer that a machine learning algorithm is trying to learn to predict. That is, the machine learning algorithm is designed to learn a function
{% f:\vec{x} \rightarrow y %}
Many raw datasets consist of records that contain the value of {% \vec{x} %} but lack the correct answer, {% y %}. Annotation is the process of having humans label each raw record with the correct value of {% y %}.
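
As a minimal sketch of this idea, the Python below pairs each raw record with a label. The feature values, the `annotate` helper, and the `example_annotator` rule are all hypothetical stand-ins chosen for illustration; in practice the label would come from a person, not a rule.

```python
# Raw records: each is a feature vector x with no answer y attached.
# The numbers are invented for illustration.
raw_records = [
    [5.1, 3.5],
    [6.7, 3.0],
]


def annotate(records, ask_annotator):
    """Pair each raw record x with the label y returned by ask_annotator."""
    return [(x, ask_annotator(x)) for x in records]


def example_annotator(x):
    """Hypothetical stand-in for a human annotator's judgment."""
    return "short" if x[0] < 6.0 else "long"


# Each element is now an (x, y) pair that a learning algorithm can train on.
annotated = annotate(raw_records, example_annotator)
```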

Challenges


  • Menial Nature of the Work - annotating data can be tiresome and repetitive, yet some datasets require experts to create the annotations. This creates the problem of motivating quality annotators and of shouldering the cost of expert time for a task that is essentially menial.
  • Differing or Incorrect Annotations - different people may give different answers on a given record. Sometimes, an annotator may simply make a mistake. Possible solutions include:
    • Ignore errors, especially when they represent a small portion of the dataset.
    • Have multiple annotators for each record. Records that receive conflicting answers can then be examined in order to arrive at the correct answer. For true edge cases, all of the annotations for a record can be included in the training dataset.
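
The multiple-annotator approach could be sketched as follows. This is one possible scheme, not a prescribed one: it assumes a simple majority vote, and the function name `resolve_annotations`, the `min_agreement` threshold, and the sample labels are all hypothetical. Records without a clear majority are flagged for manual examination.

```python
from collections import Counter


def resolve_annotations(labels, min_agreement=0.5):
    """Resolve one record's labels from several annotators.

    Returns (label, needs_review). A label is accepted only when strictly
    more than `min_agreement` of the annotators chose it; otherwise the
    record is flagged for manual examination.
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count / len(labels) > min_agreement:
        return top_label, False
    return None, True


# Hypothetical annotations: record id -> labels from different annotators.
annotations = {
    "rec-1": ["cat", "cat", "dog"],  # clear majority
    "rec-2": ["cat", "dog"],         # tie, needs a closer look
}

resolved = {rid: resolve_annotations(labels) for rid, labels in annotations.items()}
```

A record flagged for review can either be given a final label after examination or, if it is a true edge case, be added to the training set once per distinct label.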