This thesis is motivated mainly by the analysis, understanding, and prediction of human behavior through the study of the digital fingerprints people leave behind. Unlike a classical PhD thesis, which selects a single topic and analyzes it in depth, this work takes a breadth-oriented approach to complex networks, in particular those that humans create through their relationships and interactions. The digital communities where people interact and form relationships are commonly called Online Social Networks.
First, in 2013, we studied the media content people share on online social networks such as Twitter. Our collection of tweets (the corpus) was provided by the Spanish Society for Natural Language Processing (SEPLN, by its Spanish acronym) for its workshop on NLP. We applied state-of-the-art Natural Language Processing techniques, widely developed and tested on English texts, to a collection of Spanish tweets and compared the results on both the Topic Detection and Sentiment Analysis tasks. The first conclusion of our study is that none of the techniques explored is a silver bullet for Spanish tweet topic classification, i.e., none made a clear difference when introduced in the algorithm. The second conclusion is that tweets are very hard to deal with, mostly due to their brevity and lack of context. The results of our experiments are nevertheless encouraging, since they show that classical methods can be used to analyze Spanish texts. Moreover, the highest accuracy we reported (58% for topics and 42% for sentiment) is not far from the highest scores in the workshop we participated in. These conclusions show that there is still room for improvement, justifying further efforts.
Next, in 2014, we focused on Topic Detection, building our own classifier and applying it to the same tweet dataset. The classifier builds graphs from the input texts and relies on graph similarity techniques to classify short texts by topic. After preprocessing and filtering, each word of a text becomes a node, and two words are linked by an edge if they appear in the same tweet, not necessarily next to each other. The graphs are weighted: the weight of the edge connecting two nodes is the product of the number of times each of the two words appears in the text. The contributions are twofold: the classifier relies on text graphs built from the input text, and it achieves 70% accuracy, outperforming previous results.
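The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the function name `build_text_graph` is hypothetical, the input is assumed to be an already preprocessed and filtered list of tokens from one tweet, and word counts are assumed to be taken over that single text. The graph similarity step used for classification is not shown.

```python
from collections import Counter
from itertools import combinations


def build_text_graph(tokens):
    """Build a weighted co-occurrence graph from one tokenized tweet.

    Returns (counts, edges): counts maps each word (node) to its number
    of appearances, and edges maps an alphabetically ordered word pair
    to the product of the two words' counts.
    """
    counts = Counter(tokens)
    edges = {}
    # Every pair of distinct words in the same tweet is linked,
    # whether or not the words are adjacent.
    for u, v in combinations(sorted(counts), 2):
        edges[(u, v)] = counts[u] * counts[v]
    return counts, edges


# Hypothetical example tweet after preprocessing.
counts, edges = build_text_graph(["gol", "madrid", "gol", "liga"])
```

Here "gol" appears twice, so each edge that touches it has weight 2, while the edge between "liga" and "madrid" has weight 1.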
After that, we moved on to analyzing the network structure (or topology) and its data values to detect outliers. We hypothesize that in social networks a large mass of users behaves similarly, while a small set of users behaves differently. Within this last group, we try to separate users by high activity, low activity, or any other parameter or feature that makes them belong to different kinds of outliers. We aim to detect influential users in one of these outlier sets, and to do so in an unsupervised way, since our method does not define influencers in advance. We propose a new unsupervised method, Massive Unsupervised Outlier Detection (MUOD), which labels the detected outliers as shape, magnitude, or amplitude outliers, or a combination of those. The method relies on Functional Data Analysis (FDA) theory, and we applied it to a subset of roughly 400 million Google+ users, automatically identifying and discriminating sets of outlier users. The results are promising. The method is highly scalable and parallelizable on multicore machines. In preliminary tests on synthetic datasets with controlled outliers, its performance is similar to, and usually better than, that of methods in the literature, and only one of those state-of-the-art methods, besides ours, scales to millions of users. In addition, our method yields groups of outliers of different natures, which matters because not all outliers are necessarily influencers. In fact, the results reveal that the different outlier classes identified by MUOD include users that match different definitions of influence previously used in the literature. Hence, the results provide strong evidence of the utility of MUOD as an algorithm to support the unsupervised identification of influencers when a predefined type of influential user does not exist. The MUOD algorithm can be applied to a myriad of problems in which the nodes, users, or entities can be described by a set of properties mapped into a signal.
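The text above does not define the three outlier types. As a rough illustration only, the sketch below assumes that each user is represented as a signal (a list of numbers) and scores it against every other signal: Pearson correlation stands in for shape, the least-squares slope for amplitude, and the intercept for magnitude. The function names and these index definitions are assumptions made for illustration, not the thesis's implementation, and the quadratic loop makes no attempt at the scalability reported for MUOD.

```python
from math import sqrt
from statistics import mean


def _cov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def outlier_indices(signals):
    """Return one (shape, amplitude, magnitude) tuple per signal.

    Assumes at least two signals, all of equal length and none constant
    (a constant signal has zero variance and would divide by zero).
    """
    scores = []
    for i, yi in enumerate(signals):
        corrs, slopes, intercepts = [], [], []
        for j, yj in enumerate(signals):
            if i == j:
                continue
            cov_ij = _cov(yi, yj)
            var_i, var_j = _cov(yi, yi), _cov(yj, yj)
            slope = cov_ij / var_j  # least-squares fit of yi on yj
            corrs.append(cov_ij / sqrt(var_i * var_j))
            slopes.append(slope)
            intercepts.append(mean(yi) - slope * mean(yj))
        scores.append((
            abs(mean(corrs) - 1),    # shape: far from perfect correlation
            abs(mean(slopes) - 1),   # amplitude: slope far from 1
            abs(mean(intercepts)),   # magnitude: offset far from 0
        ))
    return scores
```

Under these assumptions, a signal with a large third score and small first two scores would be flagged as a magnitude outlier. How the cut-off between outliers and regular users is chosen is not specified here and is left out of the sketch.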
Finally, we address the monitoring of real complex networks. We leverage the characterization of complex networks by two cost-effective metrics to monitor abrupt changes in a network stream simply by detecting abrupt changes in those metrics. We aim to save resources (computation and time) while incurring only a small loss of accuracy on the target monitoring metric. We created a framework that dynamically adapts the temporality of large-scale dynamic networks, reducing compute overhead by at least 76%, data volume by 60%, and overall cloud costs by at least 54%, while always maintaining accuracy above 88%.
Luis F. Chiroque received his BSc in Telematics Engineering from the Polytechnic University of Madrid. He completed an MSc in Mathematical Engineering at University Carlos III of Madrid (UC3M), Spain. Chiroque started his PhD at IMDEA Networks and UC3M in October 2013. His research interests are graph theory, network science, machine learning, big data, and data mining.
PhD Thesis Advisor: Dr. Antonio Fernández Anta, IMDEA Networks Institute, Spain
University: University Carlos III of Madrid, Spain
Doctoral Program: Mathematical Engineering
PhD Committee members: