Sunday, 27 April 2014

What Happens After Crawling Big Data?

We test a methodology for automatically filtering, coding and reducing the huge amount of data retrieved from Twitter, as a preliminary task before the analysis of Big Data, and we determine the reliability of this methodology when applied to a dataset of 500,000 tweets on the ‘desahucios’ (evictions) topic. We explain the process followed to achieve these tasks. Basically, we extracted a random sample of 1,000 clusters from a dataset of 500,000 tweets on the ‘desahucios’ topic retrieved between 10 April 2013 and 28 May 2013. Hashtags in this sample were automatically filtered, coded and reduced according to the Levenshtein distance metric.[1] Different automatic algorithms were applied to the 100,000-tweet sample to filter, code and reduce the number of hashtags. After this operation, a new statistically representative sample of hashtags was selected in order to determine the reliability of the automatic algorithm created. In this last step, two researchers manually checked, case by case, whether the hashtags had been correctly clustered. The results document the whole process and the evaluation of the best-performing algorithm for reducing Twitter data on the eviction topic.
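The post does not include the actual filtering and reduction algorithms, so the following Python fragment is only a minimal sketch of the general idea: grouping near-duplicate hashtags by Levenshtein distance. The threshold, the greedy grouping strategy and the example hashtags are assumptions made here for illustration, not the authors' implementation.

```python
# Minimal sketch of hashtag reduction via Levenshtein distance.
# NOT the authors' algorithm: the threshold and the greedy grouping
# strategy are illustrative assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cluster_hashtags(hashtags, max_distance=2):
    """Greedily group hashtags whose distance to a cluster seed is small."""
    clusters = []  # list of (seed, members)
    for tag in hashtags:
        tag = tag.lower().lstrip("#")
        for seed, members in clusters:
            if levenshtein(tag, seed) <= max_distance:
                members.append(tag)
                break
        else:
            clusters.append((tag, [tag]))
    return clusters

# Hypothetical hashtags of the kind found in the eviction dataset:
tags = ["#desahucios", "#desahucio", "#Desahucios", "#StopDesahucios", "#PAH"]
for seed, members in cluster_hashtags(tags):
    print(seed, members)
```

A real pipeline would presumably also normalise accents, order the candidate seeds by frequency and tune the distance threshold; none of those parameters are reported in the post.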




[1] Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. http://en.wikipedia.org/wiki/Levenshtein_distance (retrieved 13.03.2014).
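For readers who want the formal counterpart of this informal definition (not given in the post), the standard recurrence for the distance between the length-$i$ prefix of $a$ and the length-$j$ prefix of $b$ is:

$$
\mathrm{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min\!\begin{cases}
\mathrm{lev}_{a,b}(i-1,j)+1\\
\mathrm{lev}_{a,b}(i,j-1)+1\\
\mathrm{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{[a_i\neq b_j]}
\end{cases} & \text{otherwise.}
\end{cases}
$$

For example, the distance between "kitten" and "sitting" is 3 (two substitutions and one insertion).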
