To annotate or not to annotate?

During the camp, Tony (Veale) brought up the topic of manual annotation of data, arguing that the information a human can write down is much more reliable and reusable than what could be extracted from a corpus using machine learning (or other statistics-based techniques).

The argument mainly rests on the fact that these kinds of techniques carry a margin of error that, even when very small, can lead to completely wrong assumptions.
This is certainly true, but personally I would argue that, if we imagine the ultimate goal of NLP to be fully understanding and producing language, human annotation alone could not sustainably cover the enormity that is language.
Humans are also prone to errors, and crowdsourcing information would, without fail, lead to inconsistent and possibly completely erroneous data.

What if we have experts compile this data?
Even taking the NOC list and Scealextric developed by Tony, whom I would consider an expert, we can find some intriguing information that probably reflects a part of Tony's conception and vision of the world.

Can we consider this kind of data to be high quality and reusable?
Of course, but it's also subjective and temporally located: some people might not agree with some annotations, and if we imagine moving forward a hundred years, some of the data might no longer make sense to future readers.

This might sound very much like a rant to endorse machine learning, but it's not.
I actually think that expert knowledge plays a key role in computational creativity (see my research on music), yet I think it's not sustainable to rely solely on it, and we should explore more mixed methods.
For example, we could start from a relatively small corpus of human-annotated data and use it to help our algorithms make more informed choices, without having to infer everything from scratch.
We applied a similar methodology in our annotation of some basic Jungian archetypes with pretty interesting results: take a look at the previous post.
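
To make the mixed approach a bit more concrete, here is a minimal sketch of one way it could work: a simple self-training loop that grows a small hand-annotated seed set. This is not the method we used for the archetypes, just an illustration. The function name `self_train`, the `min_margin` parameter, and the toy characters and features are all invented for the example.

```python
from collections import Counter


def self_train(seed, unlabeled, min_margin=1, max_rounds=10):
    """Grow a small hand-annotated seed set by labelling unlabeled items.

    seed: dict mapping item -> (label, set of features), written by a human.
    unlabeled: dict mapping item -> set of features.
    Returns (labels, pending): every item that has a label, and the sorted
    list of items the loop did not dare to label.
    """
    labeled = {item: (label, set(feats)) for item, (label, feats) in seed.items()}
    pending = {item: set(feats) for item, feats in unlabeled.items()}

    for _ in range(max_rounds):
        # Feature counts per label, built from everything labeled so far.
        profiles = {}
        for label, feats in labeled.values():
            profiles.setdefault(label, Counter()).update(feats)

        accepted = {}
        for item, feats in pending.items():
            scores = sorted(
                ((sum(profile[f] for f in feats), label)
                 for label, profile in profiles.items()),
                reverse=True,
            )
            best_score, best_label = scores[0]
            runner_up = scores[1][0] if len(scores) > 1 else 0
            # Accept only when the best label clearly beats the alternative.
            if best_score - runner_up >= min_margin:
                accepted[item] = (best_label, feats)

        if not accepted:
            break  # nothing confident left to add
        labeled.update(accepted)
        for item in accepted:
            del pending[item]

    labels = {item: label for item, (label, _) in labeled.items()}
    return labels, sorted(pending)


# Toy data, invented purely for illustration.
seed = {
    "Odysseus": ("hero", {"brave", "journey", "cunning"}),
    "Loki": ("trickster", {"cunning", "deceit", "shapeshift"}),
}
unlabeled = {
    "Hercules": {"brave", "strength", "journey"},
    "Anansi": {"deceit", "spider", "stories"},
    "Samson": {"strength", "hair"},
    "Sphinx": {"riddle"},
}

labels, pending = self_train(seed, unlabeled)
print(labels)
print(pending)
```

In this toy run, "Samson" shares nothing with the human-written seed and only gets a label in the second round, after "Hercules" has been added, while "Sphinx" stays unlabeled. That is the trade-off in a nutshell: the human annotations give the algorithm a starting point, but any mistake accepted early is also propagated, so the ambiguous cases are better left to an expert.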

Marco
