Twitter Author Classification [WINTER 2013]
Data Science Project
The goal of this project was to label a tweet as written by either a news source or an individual. I also explored the classification of news and celebrity tweets. Data was captured from some of Twitter’s most prolific tweeters.
Structure in news tweets makes their sources individually identifiable. We identified a feature vector and algorithm pair that classifies these news sources with 95% accuracy.
Tweets written by individuals do not always have a known, exploitable structure. While we found cases where two individuals can be distinguished accurately, increasing the number of individuals/labels decreased accuracy with our feature vector. Our best implementation, using Naïve Bayes, reached an accuracy of 63% with 10 sources after many iterations of feature vector development. This result is much better than guessing (about 10% for ten equally represented sources), but leaves room for future improvement.
Finally, the common structure of a news tweet enables tweets from either a person or from a news source to be properly labeled with 97% accuracy.
The news source test data consisted of 10 highly prolific Twitter news sources: @guardiannews, @CNET, @CNN, @engadget, @TechCrunch, @WIRED, @Telegraph, @BBCWorld, @nytimes and @thetimes.
The people test data consisted of 10 highly prolific tweeters: @edgarwright, @jimcarrey, @jimmyfallon, @joelmchale, @justinbieber, @mileycyrus, @shauntfitness, @simonpegg, @stephenathome and @theellenshow.
From each of the above handles, 500 tweets were taken, starting at the most recent post (on November 25, 2013) and going back in time. In total, 5000 news tweets and 5000 celebrity tweets were collected. Note that both of the Twitter APIs (Search and REST) wrap any URL in a tweet with the Twitter URL, whereas the URLs in tweets viewed using a web browser are unmodified. The data used in this project came from the latter case, as it was scraped without the aid of either Twitter API.
The chosen algorithm needed to work as a multiclassifier, converge reliably, and easily handle large, sparse feature vectors. For these reasons, Naïve Bayes (with Laplace smoothing) was chosen. As a baseline comparison, the Perceptron algorithm was run as a one-vs-all multiclassifier on the same test cases.
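To make the choice concrete, here is a minimal sketch of multinomial Naïve Bayes with Laplace smoothing. It is an illustration under stated assumptions, not the project's actual implementation: the class name `NaiveBayes`, the `alpha` parameter, and the decision to skip features never seen in training are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing.

    Hypothetical sketch; not the project's original code.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha=1.0 is classic Laplace smoothing

    def fit(self, docs, labels):
        # docs: list of feature lists; labels: parallel list of class labels
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for features, label in zip(docs, labels):
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        # Total number of feature occurrences observed per class
        self.totals = {c: sum(self.feature_counts[c].values())
                       for c in self.class_counts}
        return self

    def predict(self, features):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            # Work in log space to avoid underflow with many features
            score = math.log(n_c / n_docs)
            denom = self.totals[c] + self.alpha * vocab_size
            for f in features:
                if f not in self.vocab:
                    continue  # assumption: ignore features unseen in training
                score += math.log((self.feature_counts[c][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

Because only the features present in a tweet are visited, the cost of a prediction grows with the tweet's length rather than the vocabulary size, which is what makes this approach comfortable with large, sparse feature vectors.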
Using a multiclassifier with a label for each news source, we were able to distinguish news sources with 95% accuracy using Naïve Bayes with unigram, bigram and the additional features discussed in the document linked below. The confusion matrix shows that most news sources can be identified distinctly with few errors. The feature vector, combined with Naïve Bayes training, captured distinguishable structure in news source tweets.
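As a rough illustration of the unigram and bigram part of the feature vector, the sketch below turns a tweet into a sparse list of features. The tokenization rule, the lowercasing, and the function names `tokenize` and `ngram_features` are assumptions made for this example; the project's additional features are not reproduced here.

```python
import re

# Assumed tokenization: keep URLs whole, keep @mentions and #hashtags intact.
TOKEN_RE = re.compile(r"https?://\S+|[@#]?\w+")


def tokenize(tweet):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_RE.findall(tweet.lower())


def ngram_features(tweet):
    """Return unigram and bigram features as tagged tuples."""
    tokens = tokenize(tweet)
    unigrams = [("uni", t) for t in tokens]
    # Pair each token with its successor to form bigrams
    bigrams = [("bi", a, b) for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams
```

Tagging each feature with `"uni"` or `"bi"` keeps the two feature families from colliding when they are counted in the same table.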
Using a multiclassifier with a label for each person, we were able to distinguish people with 63% accuracy using Naïve Bayes with unigram, bigram and the additional features discussed in the document linked below. Unlike with the news sources, accuracy decreased as more people/labels were added to the multiclassifier. The results show that some individuals are easily distinguished from others. For example, tweets from Edgar Wright are never mistaken for tweets from Justin Bieber. However, other individuals, such as Simon Pegg, are often misidentified. The lack of distinguishable structure between tweets causes the overall accuracy of the Naïve Bayes classifier to drop with each new source added.
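Observations such as "never mistaken for" or "often misidentified" come from reading a confusion matrix. The sketch below shows one way such a matrix and a per-label accuracy (recall) could be computed. The function names are hypothetical, and any labels passed in would be toy data, not the project's results.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def per_label_recall(matrix):
    """Fraction of each true label's tweets that were labeled correctly."""
    totals, correct = Counter(), Counter()
    for (true, predicted), n in matrix.items():
        totals[true] += n
        if true == predicted:
            correct[true] += n
    return {label: correct[label] / totals[label] for label in totals}
```

An off-diagonal count of zero for a pair of people means the classifier never confused the first for the second on the test set, while a low recall for a label marks a person whose tweets are frequently assigned to someone else.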
Finally, using a binary classifier with only the labels “news” and “person,” we managed to properly classify a tweet 97% of the time using Naïve Bayes with unigram, bigram and the additional features discussed in the document linked below. The inherent structure in news source tweets makes it possible to accurately label a tweet as coming from either a news source or a person.
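For the binary experiment, the twenty per-handle labels collapse into two. A minimal sketch of that relabeling, using the handles listed above, might look like the following; the function name `binary_label` and the case-insensitive matching are assumptions for illustration.

```python
NEWS_HANDLES = {
    "@guardiannews", "@cnet", "@cnn", "@engadget", "@techcrunch",
    "@wired", "@telegraph", "@bbcworld", "@nytimes", "@thetimes",
}
PERSON_HANDLES = {
    "@edgarwright", "@jimcarrey", "@jimmyfallon", "@joelmchale",
    "@justinbieber", "@mileycyrus", "@shauntfitness", "@simonpegg",
    "@stephenathome", "@theellenshow",
}


def binary_label(handle):
    """Map a Twitter handle to "news" or "person" (case-insensitive)."""
    key = handle.lower()
    if key in NEWS_HANDLES:
        return "news"
    if key in PERSON_HANDLES:
        return "person"
    # Handles outside the two collected sets have no known label
    raise ValueError(f"unknown handle: {handle}")
```

With 5000 tweets on each side, the two classes are balanced, so a classifier that guesses would be right about half the time.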