We test a methodology for automatically filtering, coding and reducing the large amount of data retrieved from Twitter, as a preliminary step before the analysis of Big Data, and we assess the reliability of this methodology when applied to a dataset of 500,000 tweets on the ‘desahucios’ (evictions) topic. We explain the process followed to carry out these tasks. In short, we extracted a random sample of 1,000 clusters from a dataset of 500,000 tweets on the ‘desahucios’ topic, retrieved over the period from 10 April 2013 to 28 May 2013. The hashtags in this sample were automatically filtered, coded and reduced according to the Levenshtein distance metric.[1]
Different automatic algorithms were applied to the sample of 100,000 tweets to filter, code and reduce the number of hashtags. After this operation, a new statistically representative sample of hashtags was selected in order to determine the reliability of the automatic algorithm created. In this last step, two researchers manually checked, case by case, whether the hashtags had been correctly clustered. The results present the whole process and the evaluation of the best algorithm for reducing Twitter data on the eviction topic.
[1] Informally, the
Levenshtein distance between two words is the minimum number of
single-character edits (i.e. insertions, deletions or substitutions) required
to change one word into the other. http://en.wikipedia.org/wiki/Levenshtein_distance
(retrieved 13.03.2014).
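To illustrate the edit-distance criterion described above, here is a minimal sketch in Python. The `levenshtein` function implements the standard dynamic-programming definition from the footnote; `cluster_hashtags`, the `max_dist` threshold of 2, and the example hashtags are our own illustrative assumptions, not the paper's actual algorithm, which the study evaluates in several variants.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def cluster_hashtags(tags, max_dist=2):
    """Greedy one-pass grouping (illustrative only): a tag joins the first
    cluster whose representative (first member) is within max_dist edits;
    otherwise it starts a new cluster."""
    clusters = []
    for tag in tags:
        for cluster in clusters:
            if levenshtein(tag, cluster[0]) <= max_dist:
                cluster.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters

# A misspelled variant ("#stopdeshaucios") lands in the same cluster as the
# canonical form, while distinct hashtags stay separate.
print(cluster_hashtags(["#stopdesahucios", "#stopdeshaucios", "#desahucios"]))
```

Under this sketch, near-duplicate hashtags produced by typos collapse into one cluster, which is the kind of reduction the reliability check then verifies case by case.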