Dissertation

Understanding the behaviour and influence of automated social agents

Syed Zafar ul Hussan Gilani
Selwyn College, University of Cambridge
Computer Laboratory
Email: Zafar.Gilani@cl.cam.ac.uk
Principal investigator: Prof. Jon Crowcroft

This dissertation is submitted on 24/8/2018 as a requirement for the degree of Doctor of Philosophy.

Abstract

Online social networks (OSNs) have seen a remarkable rise in the presence of automated social agents, or social bots. Social bots are the new computing virus: surreptitious and clever. What facilitates the creation of social agents is the massive human user-base and business-supportive operating model of social networks. These automated agents are injected by agencies, brands, individuals, and corporations to serve their work and purposes, utilising them for news and emergency communication, marketing, social activism, political campaigning, and even spamming and spreading malicious content. Their influence was recently substantiated by coordinated social hacking and computational political propaganda. The thesis of my dissertation argues that automated agents exercise a profound impact on OSNs that transforms into an array of influence on our society and systems. However latent or veiled, these agents can be successfully detected through measurement, feature extraction and finely tuned supervised learning models. The various types of automated agents can be further unravelled through unsupervised machine learning and natural language processing, to formally inform the populace of their existence and impact.

Declaration

I declare that this dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except where specified in the text. All of the technical work in this dissertation, such as code, tests, deployment, experiments, and results, has been completed by myself. All of the writing done in collaboration has been completed by myself as the lead, with collaborating authors only guiding, rewriting or reviewing my text along the way. This report is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other university. This report does not exceed the prescribed limit of 60,000 words1.

The copyright of this dissertation rests with the author and may not be reproduced without prior written consent of the author. Use of parts of the work herein is permitted provided it is properly cited.

Ethical considerations of my research. All of the datasets that we collected, stored and consumed strictly adhere to the practices and guidelines of the Computer Laboratory at the University of Cambridge. We followed the ethical considerations and procedures outlined by the University of Cambridge Institutional Review Board (IRB) for all mentioned annotation tasks. Datasets collected through the Twitter Streaming API also comply with the Twitter policy2. In order to protect individual privacy we follow Twitter's data usage policy: all data is kept in encrypted storage, none of it is redistributed, no personal or sensitive information is used, and we only analyse aggregated statistics. Moreover, the results (classification, categorisation, tagging) are only indicative and informative, not disruptive or decisive. The Twitter bot deployed for studying Web bots was approved by the IRB. The bot was well within ethical boundaries, since it was non-invasive and non-engaging.
1 ps2ascii thesis.pdf | wc -w calculates that this document is approx. 47,069 words.
2 https://dev.twitter.com/overview/terms/agreement-and-policy

Informing the user of our research. The homepage of our shortener service tnyURL.uk (later renewed as tnyURL.co.uk) notifies users about the purpose of our research. We inform them that we are collecting the data triggered by their activity on our tweets or URLs, while maintaining strict anonymity and privacy so that neither we nor anyone else can discern the identity of the Twitter user who clicked on our tweet or URL. Furthermore, we only collect information such as the timestamp and the User Agent string of the web browser a Twitter user is utilising to access our tweet or URL. Informed user consent was not required, as the Twitter bot we deployed was non-invasive: it neither engaged in direct communication with other Twitter users nor tried to identify individual users by any means.

Funding and fellowship. This work was partially funded by the Marie Curie ITN METRICS research grant EC607728 and the EPSRC Global Challenges Research Fund research grant EP/R512783/1.

Syed Zafar ul Hussan Gilani
21/8/2018

Acknowledgements

There is a long list of people I would like to thank. I will begin with Jon Crowcroft (my principal investigator, for his support and encouragement), Arjuna Sathiaseelan (for his support during the first year of my PhD), Fahad Satti (for his assistance in making possible the elusive human annotation task for binary classification), and Jatinder Singh (for his priceless and kind support in helping me through the painstaking process of securing six months of additional funding, which enabled me to finish off my doctoral work and dissertation).

As can be appreciated, computing research is hardly ever a solitary exercise. I have thus been fortunate to have the helping hands of Reza Farahbakhsh3 for Chapter 4 and § 6.5; Gareth Tyson4 for Chapters 4 and 6 (under submission); Ekaterina Kochmar5 for Chapter 5 and the initial bot detection algorithm; and Mario Almeida6 for § 3.3 and the initial bot development. Perhaps the most common contribution of my colleagues has been text and reviews. Despite their priceless contributions, I have led the work and have been the first author in all of the publications related to the social bot work. The list of my personal contributions includes, but is not limited to: (i) research questions and drafts, (ii) framework design and conceptualisation, (iii) writing and implementation of software, code and tools, (iv) system deployment, maintenance and troubleshooting, (v) analysis, results, figures and descriptive text, and (vi) paper/chapter drafting, compiling and submitting.

I would like to thank Andrew Moore and Cecilia Mascolo for their helpful comments during the first and second year reports. I would also like to thank Andrew Rice and Nishanth Sastry for their all-important feedback during the viva – it included some of the best suggestions I received during my writing and corrections. The feedback helped me improve the arguments and increase the rigour of the findings through additional considerations.

And finally, my family: father, mother, my two sisters, my niece, and my wife.

3 Dr. R. Farahbakhsh, Institut Mines Telecom Paris – reza.farahbakhsh@it-sudparis.eu
4 Dr. G. Tyson, Queen Mary University of London – g.tyson@qmul.ac.uk
5 Dr. E. Kochmar, University of Cambridge – ekaterina.kochmar@cl.cam.ac.uk
6 M. Almeida, UPC Barcelona – mario.almeida@est.fib.upc.edu
Without them nothing would have been possible. My father, who taught me that true pride is in education, critical thinking and reasoning, sociopolitical narrative and discourse, arts and culture, and not in money or materialism. My mother, who showed me what sacrifice, encouragement and unconditional love are, and who taught me to soak up all the sorrows and emanate calmness amid commotion. My two sisters, who took pride in calling me an elder brother, and who were my closest friends when I needed them. My niece, who is the greatest pleasure of our lives: a simple look at her makes me happy, for she is the loveliest kid I know and no less than my own daughter. And finally, my wife: the true meaning of a big heart, a wonderful partner, gentle and kind, immensely loving, with a wonderful personality and sense of humour, my closest friend with whom to discuss my biggest fears and my greatest failures, with whom I have shared whatever I have been through, and with whom I share my life.

Contents

List of Figures 13
List of Tables 15
1 Introduction and Motivation 19
2 Background 25
  2.1 Literature Survey 27
    2.1.1 Web bots 27
    2.1.2 Chatbots 27
    2.1.3 Game bots 28
    2.1.4 Sybil and fake accounts 29
    2.1.5 Useful social bots 30
    2.1.6 User behaviour 30
    2.1.7 Social botnets 31
    2.1.8 Social media infiltration experiments 32
    2.1.9 Bots in politics 35
    2.1.10 Social influence of bots 38
    2.1.11 Bot detection 39
    2.1.12 Bot detection avoidance techniques 40
    2.1.13 Typification of bots 42
3 Stweeler: Twitter Computation System 43
  3.1 Research Questions 43
  3.2 What is Twitter? Why and how do bots exist on Twitter? 44
  3.3 Stweeler Framework 45
  3.4 Datasets, Feature Extraction and Annotation Methodology 46
    3.4.1 Characterisation and Detection dataset 47
    3.4.2 Feature Extraction 48
    3.4.3 Human Annotated dataset 50
    3.4.4 Typification dataset 52
    3.4.5 Honeypot dataset 53
  3.5 Stweeler Dashboard 54
  3.6 Takeaways 55
4 Measuring and Characterising Social bots 57
  4.1 Introduction 57
  4.2 Methodology 59
    4.2.1 Data Collection and Feature Extraction 59
    4.2.2 Bot Classification via Human Annotation Task 60
    4.2.3 Media Extraction and Processing 61
  4.3 Which manners maketh the Bot? 61
    4.3.1 Content Generation 62
    4.3.2 Content Popularity 64
    4.3.3 Content Consumption 66
    4.3.4 Account Reciprocity 66
    4.3.5 Tweet Generation Sources 67
    4.3.6 Media Upload 69
  4.4 A World without Bots? 71
    4.4.1 How Influential are Bots? 72
    4.4.2 What happens if Bots disappear? 73
  4.5 Takeaways 76
5 Detecting Social bots 79
  5.1 Introduction 79
  5.2 Methodology 81
  5.3 Human Annotation Task 82
  5.4 Classifying Bots and Humans 85
    5.4.1 Classifying bots by training and testing on all groups with 5-fold cross-validation 88
    5.4.2 Classifying bots by training on all and testing on specific groups with 5-fold cross-validation 89
    5.4.3 Cross-group experiments 90
    5.4.4 Hypotheses testing 92
  5.5 Takeaways 94
6 Typification of Social bots 95
  6.1 Introduction 96
  6.2 Preliminaries 98
    6.2.1 Data Collection and Pre-Processing 98
  6.3 Typifying Bots: A Methodological Approach 99
    6.3.1 Typification Methodology 99
    6.3.2 Spectral Clustering 100
    6.3.3 Clustering Results 103
  6.4 Deep Diving into Bot Behaviours 106
    6.4.1 What bot software is used? 106
    6.4.2 What topics do bots discuss? 109
    6.4.3 Do bots exhibit sentiment? 113
    6.4.4 What content do bots share? 117
  6.5 The Social Cost of Web Bots 121
    6.5.1 Setting up a bot account 122
    6.5.2 Bot detection 123
    6.5.3 Characterisation 123
  6.6 Takeaways 125
7 Final Remarks 129
  7.1 Summary and Conclusions 129
  7.2 Future Directions 131
  7.3 Last Thoughts 132
Bibliography 140
A Tasks, Experiments and Ethics Approval 141
  A.1 Human Annotation Task for Binary Classification 141
    A.1.1 Task Description 141
    A.1.2 Ethics Approval #379 144
  A.2 Honeypot Experiment 145
    A.2.1 Task Description 145
    A.2.2 Ethics Approval #556 146
B Publications 149
C Press, News and Print Media 151
D Environment - Platforms, Systems, Resources, Dashboard 153

List of Figures

3.1 Stweeler analyses framework. 46
3.2 Accuracy of language detection (langdetect) and translation (textblob) libraries: Original text. 53
3.3 Stweeler dashboard. 54
4.1 Spearman's rank correlation coefficient (ρ) between bots and humans per measured feature. The figure shows none (0.0) to weak correlation (0.35) across all features, indicating a clear distinction between the two entities. 62
4.2 Content Creation: Tweets issued, Retweets issued, Replies and Mentions, Follower-friend ratio. 63
4.3 Content Popularity: Likes per tweet, Retweets per tweet. 65
4.4 Content Consumption: Likes performed, Favouriting behaviour. 67
4.5 Tweet Sources: Count of Activity Sources, Type of Activity Sources. 68
4.6 Content Creation: URLs in tweets, Content uploaded on Twitter. 69
4.7 Media (photos, animated images, videos) uploaded by bots and humans on Twitter. 70
4.8 Visiting trends to popular URLs by bots and humans. 70
4.9 Bots vs. Humans - graphs for retweets and quotes of the 10M popularity group. Black dots are vertices, edges represent an interaction. Red edges represent bots and Blue edges represent humans. 73
4.10 Bots vs. Humans - graphs for retweets and quotes of the 100k popularity group. Black dots are vertices, edges represent an interaction. Red edges represent bots and Blue edges represent humans. 74
4.11 Bots vs. Humans - graphs for replies and mentions of the 10M and 100k popularity groups. Black dots are vertices, edges represent an interaction. Red edges represent bots and Blue edges represent humans. 75
5.1 Classifying bots by training and testing on all groups with 5-fold cross-validation. 88
5.2 Classifying bots by training on all and testing on specific groups with 5-fold cross-validation. 90
5.3 Cross-group experiments. 91
6.1 Empirical distributions for behavioural activities of bot clusters: 0 (Young producers), 1 (Young assistants), 2 (Assistants), 3 (Popular content producers), 4 (Popular content redirectors), 5 (Stellar active engagers), 6 (Stellar passive engagers), 7 (Social chameleons). 104
6.2 Empirical distributions for behavioural activities of bot clusters: 0 (Young producers), 1 (Young assistants), 2 (Assistants), 3 (Popular content producers), 4 (Popular content redirectors), 5 (Stellar active engagers), 6 (Stellar passive engagers), 7 (Social chameleons). 105
6.3 Types of most prevalent Twitter activity sources for bot clusters. 108
6.4 Distribution of top 20 activity sources per cluster: percentages are calculated per source per cluster (i.e. normalised for different sources in each cluster). 110
6.5 Word Clouds of extracted bot clusters with their statistical labels (Table 6.2) and topic labels: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T). 112
6.6 Word Clouds of extracted bot clusters with their statistical labels (Table 6.2) and topic labels: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T). 113
6.7 Word Cloud of 11,379 human accounts. 114
6.8 Distributions of polarity and subjectivity per bot cluster vs. humans. 116
6.9 Clinton vs. Trump: Normal distributions of polarity and subjectivity. 117
6.10 How the Stweeler bot works. 122
6.11 Click logs dataset - Clicks, Revisits. 124
6.12 Click logs dataset - IPs and requests, IPs and ASs used by bots. 124
D.1 A typical CPU workload graph during data processing. 153
D.2 Stweeler dashboard. 154
D.3 A typical time graph during data collection. 154

List of Tables

3.1 Features 49
3.2 Summary of Twitter dataset post-annotation. 52
3.3 Accuracy of language detection (langdetect) and translation (textblob) libraries: Translated text. 53
3.4 Summary of Twitter bot dataset (Dec 2016) for typification. 53
3.5 Click logs dataset – statistics. 54
4.1 Types of bot traffic uploaded by Twitter users. 61
4.2 Feature inclination: B is more indicative of bots, whereas H is more indicative of human behaviour; entries marked neither B nor H are neutral (i.e. both exhibit similar behaviour). * represents magnitude of inclination: * is a considerable difference, ** a large difference. signif. shows the statistical significance of each feature as measured by a t-test. 76
5.1 Average inter-annotator agreement (%-age). 83
5.2 Average Cohen's κ. 84
5.3 Dataset benchmarks. 87
5.4 Validation results. 87
5.5 Machine learning experiments results. 90
5.6 Cross-group experiments results. 91
5.7 Feature significance. 92
6.1 Features 100
6.2 Clusters produced by Spectral clustering, their comparative tendency vs. other clusters for distinctive behavioural properties (bold and italic signify different tendencies), and descriptive labels. 102
6.3 Types of most prevalent Twitter activity sources for bot clusters. 107
6.4 Inter-cluster affinity scores and review labels vs. humans. Cluster labels could be any combination of categories: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T). 114
6.5 Average polarity and subjectivity for bot categories and their formulating clusters vs. humans. 115
6.6 Tweet polarity scores for Clinton vs. Trump. 117
6.7 Polarity scores for Clinton vs. Trump by renowned news outlets. 118
6.8 Shortened URI hosts used for redirection, per bot cluster. 118
6.9 Topmost URI hosts post-resolution, per bot cluster (similar URL types are colour-coded), and accounts most typically tweeting a URL (e.g. 01 is Cluster 0 account 1, and 02 is Cluster 0 account 2). 120
6.10 Data collected through click logging. 122
A.1 HAT example 143
D.1 System specification. 155

Chapter 1 Introduction and Motivation

"To err is human, but to really foul things up you need a computer" are the famous words of Paul R. Ehrlich. A biologist by training, he is best known for his warnings about consequential changes to population, food, computers, etc. And some of these warnings are not entirely ill-founded. One could argue that existential threats often have humble beginnings, nurtured by the goodwill of scientific discovery and invention to achieve a better and more sustainable human condition.

Humankind, social and political in nature, has adapted to the environment and created technology to vanquish problems that arise from the limited physical capabilities of humans, such as speed, efficiency (we need to eat and sleep to rejuvenate), availability and consistency. The age of automation brought mechanical robots and later software robots, designed to augment the physical capabilities of humans, as well as to process vast volumes of transactions to deliver products and services to customers, and to process large arrays of datasets for informative analytics and internal audits.

Software robots, or bots, are the software adaptation of mechanical robots. There could be many (probably uncountable) types of software robots, such as system daemons, computer viruses, Web crawlers, indexers, content curators, malicious spiders, virtual assistants and even chat bots. Automated social agents, or social bots (as we better know them), are one such extension of this technology. A social bot is a type of automated software robot that controls and operates a social media account. Unlike a regular automated software robot, a social bot will likely exist surreptitiously on a social network while maintaining a profile and activities that are akin to a real person's. While it is a common belief that most social bots (and even software robots) are malicious, not all bots are created equal. Domain experts would even argue that social bots are unethical – especially if they have a latent existence.

The existence of social bots depends on the social network platform and on whether the platform allows automated actions or not. Social bots may have started as friendly and benign hobbies, but they were quickly adapted as digital workers to serve their human masters in a number of different settings on social network platforms. These include, but are not limited to, news, emergency communication, political campaigning, social activism, targeted social marketing, spamming, etc. Bots, one may argue, have quickly become a phenomenon of their own atop the social network phenomenon.

This brings us to the potential usage of social bots for sociopolitical campaigning and spreading fake news. While online social networks (OSNs) were first effectively used by Barack Obama during the 2008 U.S.
presidential election to reach out to the masses and propagate his campaign, it is speculated that bots first truly made an impact through proliferation during the UK's EU Referendum – since then better known as Brexit (see § 2.1.9). The trend has not reversed since. It is alleged, and is now the subject of an FBI inquiry1 pending thorough investigation and subsequent decision, that the 2016 U.S. presidential election was marred by Trump–Russia collusion2 throughout the campaign. The resources used during the campaign involved online social media, targeted marketing services and bots. Bots have also been found to infiltrate the 2017 French presidential election and Venezuelan politics (see § 2.1.9). The effect, as the reader can well imagine, is both profound and unprecedented.

Realising the importance of investigating social bots, part of this work develops a generically designed modular platform that is built through measurement and research. The platform delivers the basis for measuring and characterising bots through exploratory data science, detecting bots through supervised machine learning, and categorising bots to discern types using unsupervised machine learning, as well as collecting and experimenting with data from the Web that is otherwise not available from Twitter.

1 FBI inquiry into 2016 U.S. presidential election (last accessed 16 June 2018) – https://www.nytimes.com/2017/12/30/us/politics/how-fbi-russia-investigation-began-george-papadopoulos.html
2 Trump-Russia inquiry indictment (last accessed 16 June 2018) – http://www.bbc.co.uk/news/world-us-canada-43095881

Terms and definitions: For the purposes of the research carried out in this dissertation, I set forth a few definitions of terms used throughout. Conceptually, a 'bot' is an entity that simulates human activity through imitation of actions and behaviour. Operationally, this translates to a 'bot' being in control of any social media account that consistently involves automation during the observed period, e.g. use of the Twitter API or other third-party tools to perform actions such as automated likes, tweets, retweets, etc. For the purposes of this dissertation the following four terms mean the same thing: bots, social bots, automated agents, and automated social agents. A tweet is defined as an original status and not a retweet. A retweet is a tweet in which the text is prefixed with 'RT'. A status is either a tweet or a retweet, and therefore total statuses are the sum of tweets and retweets. Content on Twitter is limited to whatever is contained within a tweet: text, URL, image, and video. A favourite or like is the activity of liking a status. A mention is the act of quoting the Twitter handle of a Twitter user in a status. We define a bot type or category as a grouping of accounts that exhibit similar behavioural characteristics (features), tweet about similar topics, and express similar sentiments.

Thesis statement: Automated social agents exercise an influence (social and otherwise) upon the human online social populace. Surreptitious or otherwise, these agents can be successfully detected through carefully executed measurement, feature extraction and finely tuned supervised machine learning models. We can further decompose the social bot population into various types or categories using unsupervised machine learning methods to formally inform the populace of their existence and impact.
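To make the operational definitions above concrete, the short sketch below buckets statuses into tweets and retweets using the 'RT' prefix rule. It is purely illustrative and not part of the dissertation's codebase; the dictionary shape loosely follows Twitter API status objects, and the helper names are my own.

    # Illustrative sketch of the tweet/retweet/status definitions above.
    # Statuses are dicts loosely shaped like Twitter API objects; the
    # helper names are hypothetical, not from the dissertation's code.
    def is_retweet(status):
        # A retweet is a status whose text is prefixed with 'RT'.
        return status.get("text", "").startswith("RT")

    def tally(statuses):
        counts = {"tweets": 0, "retweets": 0}
        for s in statuses:
            counts["retweets" if is_retweet(s) else "tweets"] += 1
        # Total statuses are the sum of tweets and retweets.
        counts["statuses"] = counts["tweets"] + counts["retweets"]
        return counts

    print(tally([{"text": "RT @user: hello"}, {"text": "an original status"}]))
    # -> {'tweets': 1, 'retweets': 1, 'statuses': 2}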
Goals and objectives: The goals and objectives of the research encompassed in this dissertation are manifold and require concrete steps that are measurable and time-bounded. To investigate automated entities in online social networks, a flexible and modular framework is required that utilises methods and techniques from data science and machine learning. This requires understanding the functionality of the framework such that it is able to continuously collect large datasets and process these for analyses. The framework should also be generic within the bounds of the domain, enabling researchers to explore a wide range of domain-specific problems. In addition to the design of the framework, a methodology for creating a ground-truth dataset will also be required (for training machine learning algorithms). A thorough study of behavioural and network properties will be required to differentiate bots from humans. This will be done by extracting the principal features that are most representative of bots.

The second goal will be to use the outcomes of the first goal to extend the framework by implementing an automated supervised learning method that reliably classifies bots and humans (sketched below). This will also require evaluating the bot classifier against the current state of the art using the collected and manually annotated datasets.

The third goal is to use the outcomes of the first and second goals to extend the framework further by implementing an automated bot typification tool using an unsupervised learning method which categorises bots into algorithmically learned categories. A classified bot dataset will be created using the work fulfilled in the second goal. In addition to this, tools will be needed for topic modelling and sentiment analysis to analyse the content and sentiment shared by various bot categories.

The final goal of this dissertation will be to demonstrate the generalisability of the framework. Firstly, the framework will be extended to study the influence of 'Web' bots on social content, to explore bot influence beyond social networks and onto the Web. Secondly, the framework will be applied to a problem statement analysing human influence on OSNs.
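In sketch form, the second of these goals amounts to a supervised classification pipeline of the following shape. The snippet is a minimal illustration, not the dissertation's implementation: it uses scikit-learn, a random forest, and 5-fold cross-validation, but the feature matrix is random stand-in data rather than the features extracted by the framework.

    # Minimal sketch of the second goal: a supervised bot/human classifier
    # evaluated with 5-fold cross-validation. The features here are random
    # stand-in data; in the dissertation they come from the measurement
    # framework, not from this snippet.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 20))        # 500 accounts x 20 extracted features
    y = rng.integers(0, 2, 500)      # ground-truth labels: 0 = human, 1 = bot

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")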
Contributions: Bots widely exist in OSNs. They contribute a significant amount of activity, both consume and produce content, and even interact with human users. As analysis of human behaviour is crucial to understanding OSNs, thorough research on bot demography is equally important. This dissertation contributes the following: (i) a definition of what a 'bot' is (this chapter), (ii) a thorough comparative literature survey of the state of the art in this domain (Chapter 2), (iii) the creation of a ground-truth dataset using a manual or human annotation task (Chapter 3 and Appendix A), (iv) a detailed characterisation of bots and humans to extract the most representative features and behavioural properties that clearly differentiate automated social agents from humans (Chapter 4), (v) using these characterisations, the implementation of a detection algorithm to automatically discern automated social agents from humans (Chapter 5), (vi) the building of bot taxonomies (Chapter 6), (vii) bot typification to explore and distinguish various bot categories (Chapter 6), (viii) an exploration of bots on the Web (Chapter 6), and (ix) the contribution of characterisation, detection and categorisation datasets3 to the research community.

3 Stweeler processed datasets – http://www.cl.cam.ac.uk/%7Eszuhg2/data.html

Chapter 2 Background

The World Wide Web (WWW) has seen massive growth in the variety and usage of OSNs. Twitter, with its 313 million active monthly users, is one of the biggest OSNs in the world. The rising population of users on Twitter and its open nature have made it an ideal platform for various kinds of opportunistic pursuits, ranging from distributing content (news or spam) to promoting businesses and enterprises (ads, marketing). These opportunistic pursuits are exploited through automated social agents, or social bots. Bots are automated programs that operate social media accounts via automated control commands, and they exist in vast quantities in online social networks. An estimated 51.8% of all Web traffic is thought to be generated by bots1. A media analytics company found that 54% of the online ads shown in 2012 and 2013 were viewed by bots rather than humans2. In 2014 Twitter itself reported that 13.5 million (5% of the total at the time) of its accounts were either fake, fraudulent or spam3. My own work in this dissertation finds that slightly less than half (43.13%) of the Twitter population in the datasets collected is operated by bots or some sort of automation.

Bots are created for a number of purposes, e.g. news, marketing, link farming4, political infiltration (§ 2.1.9), spamming and spreading malicious content.

1 Bot traffic report 2016 (last accessed 16 June 2018) – https://www.incapsula.com/blog/bot-traffic-report-2016.html
2 Fake ads traffic (last accessed 16 June 2018) – http://observer.com/2014/01/fake-traffic-means-real-paydays/
3 Twitter's 2014 Q2 SEC filing (last accessed 16 June 2018) – https://www.adweek.com/digital/twitter-says-over-13-million-accounts-may-be-bots-and-fakes-159458/
4 Link farming (last accessed 16 June 2018) – http://observer.com/2014/01/fake-traffic-means-real-paydays/

The rise of bots (particularly spambots) on Twitter is substantiated by a number of studies (see § 2.1.4, § 2.1.7–2.1.8) and articles5. Despite this phenomenal rise, not all bots are created exclusively for malevolent purposes (i.e. spam). There are bots which are benign and benevolent, serving purposes such as news and emergency communication, art and discovery6, content aggregation, fun and humour7, marketing and business promotion, and social activism [71].

This massive rise in the bot population on Twitter is not new – bots have existed on Twitter since its inception. The existence of bots on Twitter is owed to a number of reasons: soft inspection during registration (an email address, a CAPTCHA recognition and a phone number are the only requirements), but mostly the Twitter API, which lets programmers automate actions on Twitter. Studying the bot phenomenon is important in order to understand the dynamics of: (i) influence on social systems exercised through user (human or bot) behaviour, and (ii) human–bot interaction from a sociological perspective.

I focus on bots on Twitter primarily for three reasons: Twitter content is mostly public8, Twitter allows automation through its APIs9, and the studies below indicate a substantial presence of automated programs on Twitter. Compared to other social networks, such as Facebook or Instagram, Twitter is an information social network that exposes most of its content publicly by default.
Facebook, by contrast, can be thought of as a pure social network, since it keeps everything enclosed (or private) for a user unless the user chooses to make a certain piece of content public, or accepts a 'friend' request from another user (in which case the befriended user can view most of the content). Instagram, from a technical point of view, sits between Facebook and Twitter. While Instagram has an API that can be used by third-party apps (for business purposes), it does not allow the API to be used for automation, as directed by its terms of use10 and platform policy11. Secondly, neither Facebook nor Instagram exposes any public data via an API, leaving data scraping via Web crawlers (which require the input of specific keywords, hashtags, etc.) as the only option. Facebook and Instagram have maintained extremely strict policies towards bots and instantly suspend accounts that are found to have unusual activity. Therefore, bots on both platforms are extremely short-lived (a few hours on average).

5 Bots in press and blogs – http://www.cl.cam.ac.uk/%7Eszuhg2/docs/papers/bots-discussions.txt
6 Art and discovery bots (last accessed 16 June 2018) – https://qz.com/572763/the-best-twitter-bots-of-2015/
7 Fun and humour bots (last accessed 16 June 2018) – https://qz.com/279139/the-17-best-bots-on-twitter/
8 Twitter Public APIs (last accessed 16 June 2018) – https://developer.twitter.com/en/docs
9 Twitter Developer Agreement (last accessed 16 June 2018) – https://developer.twitter.com/en/developer-terms/agreement-and-policy
10 Instagram Terms of Use (last accessed 16 June 2018) – https://help.instagram.com/478745558852511
11 Instagram Platform Policy (last accessed 16 June 2018) – https://www.instagram.com/about/legal/terms/api/

In this chapter I provide a background literature survey of the current state of the art, and of the shortcomings that I contribute to.

2.1 Literature Survey

Research on social bots has generally focussed on a number of aspects, ranging from user behaviour and social media infiltration to social influence and bot detection schemes. Relevant work can be categorised into a total of thirteen domains, discussed below.

2.1.1 Web bots

Though different in nature and purpose to social bots, Web bots mostly serve the needs of search engines and archives by visiting and recording a massive number of webpages. Though most commonly referred to as 'bots' since the beginning [66], these were also known as 'indexers', 'crawlers', 'worms' or even 'spiders'. These bots do not interact directly with humans via a social platform. Despite this, Web bots can create an indirect impact on the information being displayed on social platforms to human users. For instance, given the open nature of Twitter, Web bots can contribute to the traffic and activity generated, which could consequently impact the popularity of content. Hardly any research explores the impact of Web bots on social platforms. I explore and measure this impact in Chapter 6.

2.1.2 Chatbots

Chatbots are as old as computers, e.g. ELIZA [89], and interact with humans through an interface medium which is usually text. The idea behind a chatbot, sometimes referred to as a 'chatterbot', is to employ natural language processing on a human user's input text to produce a dialogue response [23]. Recent examples include conversational virtual assistants such as Apple Siri, Amazon Alexa, Microsoft Cortana or Google Now.
Chatbots have become widely common as conversational virtual assistants and for service provisioning and task management in many instant communication apps (e.g. Skype, Slack, Telegram) [29]. The business and corporate sectors have employed chatbots to improve the experience of their clients and customers. These types of bots are extremely relevant to social bot research (especially those that are disguised), because they not only interact directly with humans but can also be used in political scenarios with significant impact. More on this in § 2.1.9.

2.1.3 Game bots

Game bots are usually a type of intelligent software designed to play a computer game. They can be either static – designed to follow waypoints for each level or terrain map in the game – or dynamic – designed to learn the levels or terrain maps by leveraging machine learning. Game bots can be designed for a variety of games, including massively multiplayer online role-playing games (MMORPGs). These bots automate gameplay mainly by imitating the perceptions and reactions found in human gameplay traces [82].

Game bots can cause problems for publishers as well as human players. Concerns have been identified linking them to the collapse of game balance and to player dissatisfaction, which often leads to discontinuation. To mitigate this, researchers have proposed keystroke detection, game traffic analysis and CAPTCHA tests. Following player dissatisfaction with disrupted gaming experiences, many researchers have independently devised similar non-interactive game bot detection techniques.

Chen et al. in [17] proposed a manifold learning approach to identify game bots. The method uses actual gameplay data logs to learn the differences between the trajectories of humans and bots. The researchers found that despite bots simulating humans, there are certain types of human behaviour that are not easy to imitate. They used more than 200 dimensions, with 3D-to-2D dimensional mapping, from actual gameplay data logs of Quake 2 to detect bots and evaluate bot detection. They found that the scheme achieves an accuracy of up to 98% on a trace of 700 seconds.

In [79], Thawonmas et al. propose a similar technique using discrepancies between humans and bots in action frequencies and action types in gameplay logs. The researchers propose classifying characters as bots if the frequencies of particular actions are higher than for human players, and adjusting the classification based on action types. They evaluated their technique on Cabal Online gameplay logs and found an accuracy of 38–60% for 15–60 minutes of detection time.

Similarly, in [35] Gianvecchio et al. use human observation proofs to passively monitor input actions that are harder for bots to imitate. The researchers use World of Warcraft gameplay logs to first characterise human and bot behaviour during gameplay. They then develop a neural network that uses a human observation proof system for analysing input actions. Using a gameplay log of more than 95 hours, their system is able to identify game bots within 40 seconds.

Although game bots have a lower relevance to the social bot phenomena explored in this dissertation, I do see a possibility of the two aligning in the future, when social bots become intelligent enough to pass through game-oriented bot prevention techniques on many Web platforms.

2.1.4 Sybil and fake accounts

Social bots are often considered an adversary in the information security domain.
Security experts sometimes use the term 'sybil' for bots that use fabricated identities for a number of purposes. These include spamming, manipulating discussions, spreading malicious links and advertisements, and exploiting personal information extracted from the network. Cao et al. in [15] introduce a tool called SybilRank that uses the social graph to detect 'sybil' or fake accounts. The tool was evaluated by the researchers on a test dataset from Tuenti – the largest OSN in Spain. SybilRank found that 90% of the 200,000 accounts designated suspicious by Tuenti were actually fake. In comparison, the manual user-reporting system only achieved 5% accuracy.

While 'sybil' or fake accounts are mostly interested in causing users to click a link, astroturf accounts want to create a false view of consensus about a topic or message. Legitimate users may inadvertently help spread the message by being deceived. Therefore, one of the biggest unsolved problems for social networks is distinguishing inorganic campaigns from organic ones. Ratkiewicz et al. in [69, 70] create a tool called Truthy to bridge the gap by detecting astroturfing, smear and misinformation campaigns. Truthy helps analyse meme diffusion through mining and classification of the tweet streams of events being posted on Twitter.

2.1.5 Useful social bots

Twitter is full of useful bots that exist in many forms. One of these is @dscovr_epic, an unofficial bot created by Russ Garrett that posts pictures from the Earth Polychromatic Imaging Camera (EPIC) on the NASA DSCOVR spacecraft12. The bot brings (to its 15,000 Twitter followers) rare Earth and Moon images captured during different time periods. Two other useful unofficial bots post pictures of exhibits from the Museum of Design Zurich (@GestaltungBot) and the Metropolitan Museum of Art New York (@MuseumBot). These bots help bring history and knowledge to their 9,000 Twitter followers by sharing museum exhibits.

Twitter bots are not only limited to posting pictures from other sources. @DearAssistant is a Twitter bot that answers questions just like Apple's Siri or Google's Now would. Created by Amit Agarwal, a Google Apps script developer, the bot uses WolframAlpha (a computational knowledge engine) to post replies to questions that are asked of it.

12 NASA DSCOVR – https://epic.gsfc.nasa.gov/

2.1.6 User behaviour

Investigation of user behaviour can reveal traces of activity that prove immensely valuable in characterising the differences between bots and humans. Features that represent the frequency of activity, the nature of activity, typical behaviour, and how content is posted online are all important in knowing the true nature of an entity. In [55] the authors used the follower-to-following ratio on Twitter to classify users into broadcasters (having significantly more followers than following), acquaintances (a congruent follower-to-following ratio), and miscreants and evangelists. In related work [85], the authors use principal component analysis to identify deviations of anomalous user behaviour from normal user behaviour. The authors then apply an unsupervised anomaly detection technique to address the problem of detecting subversive promotion techniques via fake and compromised accounts, and collusion networks or bot farms, on Facebook. Both of these works perform user classification to detect subversive and attacker strategies in online social settings, but do not focus on automation. In [12], Boyd et al.
inspected user behaviour by examining retweets, focussing on how people tweet, as well as why and what people retweet. The authors found that participants retweet using different styles and for diverse reasons (e.g. for others, or for social action). Closely related, in [90] Wu et al. study marionette users on Weibo13, created or employed by puppeteers or human masters either manually or through programs. These marionette users are used to perform specific tasks to earn financial rewards, such as following certain users or re-sharing certain posts to increase popularity and visibility. Similar to Twitter, artificially increased follower and retweet counts mislead users in their perception of the popularity of posts and in their search for topical experts. The authors profile such users through analysis of their posting behaviours and social interactions. The authors apply a probabilistic classification model that uses the influence received by a user from its neighbours (such as through likes and re-sharing) to classify a user as either normal or marionette. Their experiments reveal an accuracy of 0.892 on a labelled dataset of 2,000 users.

These works are relevant to my own, as I also study retweets and tweeting patterns through tweet frequency and tweet–retweet distribution. In contrast, my work provides further insights on important differences and striking similarities between bots and humans in terms of retweet patterns, account lifetime, account reciprocity, content creation, content popularity, content consumption, content propagation and entity interaction. In addition to studying the features above, I also study the sources or endpoint apps that are used by humans and bots to produce activity on Twitter. These sources reveal important information that can be used to differentiate between human activity and bot activity. This forms the basis for a reliable bot detection algorithm.

13 Weibo is a Chinese microblogging service, similar to Twitter.
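As an aside, the simplest of the behavioural signals above – the follower-to-following ratio from [55] – is easy to express in code. The sketch below is illustrative only: the thresholds are assumptions of mine, not the values used in [55].

    # Illustrative sketch of the follower-to-following heuristic from [55].
    # The thresholds are assumed for illustration, not taken from [55].
    def classify_by_ratio(followers, following):
        ratio = followers / max(following, 1)  # guard against division by zero
        if ratio >= 10:
            return "broadcaster"       # significantly more followers than following
        if 0.5 <= ratio <= 2:
            return "acquaintance"      # congruent follower-to-following ratio
        return "miscreant/evangelist"  # remaining, highly skewed accounts

    print(classify_by_ratio(followers=52000, following=310))  # broadcaster
    print(classify_by_ratio(followers=180, following=210))    # acquaintance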
2.1.7 Social botnets

Botnets are generally considered a threat to cyberspace. Due to the small-world properties and reachability advantages of many social networks, bot masters operate their botnets by using social networks as their command-and-control (C&C) centres (also known as soft infrastructures). Social networks, such as Twitter, give bot masters the ability to control individual bots through API calls. Typically, C&C enables the stimulation of botnets [95] and allows quick evolution of strategies to target people and adjust to countermeasures.

Botnets and campaigns on social networks, Twitter in particular, are a common phenomenon, explaining why Twitter has a dedicated anti-spam team14 to watch over and mitigate the problem. A number of botnets and campaigns have been discovered, among them the Naz botnet [65], the KOOBFACE botnet [5], a Facebook spam botnet [32], Twitter spam botnets and spam campaigns [42, 81, 19], Twitter cyber-criminal botnets [92] and Twitter link farming campaigns [33].

Measuring C&C strategies is important to understand the strengths and weaknesses of social botnets. In [54] Kartaltepe et al. characterise social network-based botnet C&Cs. The authors explore an application-centric approach to detection and subsequent countermeasures, as compared to network-centric and host-centric approaches. The network-centric approach requires network traffic information including IP addresses, server names and packet content, while the host-centric approach performs an in-depth inspection of the software stack to find malicious processes that use data from the network as parameters in system calls. The authors find that the application-centric approach is more effective than the above two approaches, while requiring less data and not compromising system performance. The application-centric approach requires a simple detection mechanism that uses a Web service to classify text (from the social network content updater) to determine whether it is suspicious.

14 Twitter anti-spam team – https://help.twitter.com/en/safety-and-security/report-spam

2.1.8 Social media infiltration experiments

Some of the research mentioned in this subsection sits at the boundary of what is considered ethical. However, I include these works because of the knowledge they provide.

Researchers have detected, studied and even created their own social infiltration experiments (or 'honeypots') that interact with other social network users, in the hope of understanding how these honeypots operate. Similarly, spam, now widespread over email, has started spoiling the social network experience. In [76] Stringhini et al. create 'honey-profiles' on three social networks, including Facebook and Twitter, to log traffic and activity. This activity is in the form of friend requests, messages and invitations received from other users of the network. They find that 173 of 3,831 (4.52%) friend requests on Facebook and 361 of 397 (90.93%) follows on Twitter are from spammers. The main reason for such a big difference between Facebook and Twitter is Twitter's API, which allows people to interact with the platform via computer programs. Such instances of automated spam accounts on Facebook, rather, exist mainly because of the major technical challenge [88] of accurately automating the classification of inauthentic or spam accounts. With over 2 billion active monthly users, taking the manual route of identifying such accounts is out of the question. Some believe that alleviating the 'bot problem' is as simple as enforcing strict real names15, thus also triggering the debate around anonymity and privacy on the Internet [13, 45]. Sadly, though, this is not true, given the existence of legitimate as well as stealthier intelligent bots imitating humans. In fact, the only effective way might be to disallow API interaction with these platforms, i.e. making private all public APIs that allow a computer program to execute callable actions.

The researchers in [76] also categorised the bots behind these spam requests by behaviour: (i) spam bots that display content on their profile (the least effective strategy for spreading spam), (ii) bots that post messages on their own profile (thus only reaching people who befriend or follow them), (iii) bots that post messages directly on the profiles of people in their friend or follow list (the most effective way of spamming, as it is visible to the friends of that user's profile), and (iv) bots that send direct private messages to people in their friend or follow list (only visible to the recipient). The authors were also able to distinguish greedy bots – those that include spam content in every message – and stealthy bots – those that include spam content only once in a while. In Chapter 6 I typify bots based on activity type and frequency in order to annotate the latent categories of bots that exist on the Twitter platform.

15 How to fix Facebook (last accessed 30 June 2018) – https://www.nytimes.com/2017/10/31/technology/how-to-fix-facebook-we-asked-9-experts.html
In [10, 11], Boshmaf et al. evaluate the vulnerability of Facebook to large-scale infiltration by deploying a social bot network of 102 profiles. They found that 86% of the bots infiltrated up to 50 user profiles and 10% of the bots were able to infiltrate up to 80 user profiles. They also found that a successful infiltration reveals users' private information, and that security defences are not sufficient to guard against a stealthy infiltrator. Similarly, in [31] Freitas et al. manually evaluate infiltration strategies on Twitter using 120 social bot profiles. They use three metrics to quantify the infiltration of social bots: followers, popularity score, and message-based interaction (other users favouriting, retweeting, replying to or mentioning the bot). They found that bots can successfully evade Twitter defences (only 38 out of their 120 bots got suspended over the course of 30 days), and can successfully infiltrate Twitter (20% of the bots had more than 100 followers). They conclude that infiltration is indeed successful, can affect influence/popularity scores, and can possibly impact the social network, as bots can manipulate trending topics during political and social campaigns.

In [94] Zhang et al. create a social botnet for spam distribution by buying 1,000 accounts. The researchers carry out different experiments by designing botnets that post tweets simultaneously, or by creating a 10-ary tree of depth 2 where the root bot tweets a post and its descendants retweet it at random intervals. The results of these experiments reveal that botnets tweeting simultaneously get suspended in their entirety within 6 hours, whereas in the tree setup only the root bot gets suspended within 6 hours while the descendant bots remain unsuspended. Repeating the second experiment after reallocating the root bot and shuffling the descendants produces the same result, i.e. only the root bot gets suspended. The researchers also investigate the digital influence of accounts by using third-party Web services such as Klout, Kred, and Retweet Rank, with interesting results. They find that the number of followers impacts the Klout influence score the least, whereas Kred and Retweet Rank are most affected. This means that while botnets can increase their Kred and Retweet Rank scores, they are unable to increase Klout influence scores by acquiring fake followers or by purposefully following each other. All three scores are impacted by retweeting, since retweets depict the influence of an account in its local neighbourhood. However, this makes the influence scores vulnerable to botnets collaborating to retweet each other or any other user. Fake following and purposeful retweeting have been widely studied in political scenarios (see § 2.1.9). Similarly, all three scores are impacted by mentioning, which could prove exploitable by botnets through collaborative mentioning of each other or any other user.

I do not perform any infiltration experiments, as this is beyond the scope of my research as well as borderline on ethical grounds. In fact, Facebook has previously faced public backlash for systematically manipulating user environments to test user reactions [43]. Any such experiment requires obtaining informed user consent, without which it is deemed unethical. However, I use some of the understandings derived from the aforementioned research to study Web bot traffic on Twitter. Studying this Web bot traffic is important, as these bots could be infiltrating the Twitter platform.
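The tree-structured botnet from [94] is worth picturing concretely. The sketch below builds the 10-ary, depth-2 retweet tree described above; the handles and delay range are illustrative assumptions of mine, and nothing here touches any Twitter API.

    # Illustrative sketch of the 10-ary, depth-2 retweet tree from [94]:
    # the root bot tweets a post, and each descendant retweets it after a
    # random interval. Handles and delay ranges are assumed; no network
    # or Twitter API call is made.
    import random

    def build_tree(fanout=10, depth=2):
        # Returns {parent: [children]} using synthetic bot handles.
        tree, frontier, counter = {}, ["bot0"], 1
        for _ in range(depth):
            next_frontier = []
            for parent in frontier:
                children = [f"bot{counter + i}" for i in range(fanout)]
                counter += fanout
                tree[parent] = children
                next_frontier.extend(children)
            frontier = next_frontier
        return tree

    tree = build_tree()
    descendants = sum(len(c) for c in tree.values())
    print(f"{descendants} descendants under root bot0")  # 110 for fanout=10, depth=2
    for child in tree["bot0"][:3]:
        delay = random.uniform(60, 3600)  # retweet after a random interval (seconds)
        print(f"{child} retweets bot0 after {delay:.0f}s")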
2.1.9 Bots in politics

Bots have been used in political scenarios going as far back as 2012. Top presidential candidates in Mexico started using armies of bots16 during the 2012 Mexican presidential election either to target opponents via defamation campaigns or to benefit themselves. These campaigns are labelled 'hashtag mischief' by researchers: they are perpetrated by bots with the intention of making hashtags trend and eventually become part of Twitter's trending topics.

16 Mexico presidential election campaign 2012 (last accessed 15 June 2018) – https://www.technologyreview.com/s/428286/twitter-mischief-plagues-mexicos-election/

Another such campaign that year was observed in Russia. Social activists took to Twitter to discuss the 2012 Russian presidential election. Thomas et al. in [80] found that a coordinated bot campaign was used to post spam hashtags in order to inundate the political discussion. The bot campaign used 25,860 fraudulent Twitter accounts to inject 440,793 tweets into legitimate conversations. Staggeringly, the researchers also found that these fraudulent accounts belonged to a larger pool of a million fraudulent accounts, kept dormant during the campaign, possibly for future use. Furthermore, the campaign used machines across the globe, 39% of which appeared in IP blacklists, suggesting the usage of compromised hosts. Even more staggeringly, 56% of the legitimate users were found to be located in Russia, whereas only 1% of the spam accounts were located within Russia.

Later in 2012, during the U.S. presidential election, Mitt Romney mysteriously acquired 116,922 more followers17 (a 17% increase) on 21 July 2012. Researchers uncovered that about 23% of these followers had never tweeted, while a tenth had been suspended by the time the news was published on 6 August 2012. Most astonishingly, a quarter of these followers were less than 3 weeks old, while 80% were less than 3 months old. It is believed that the followers came from Twitter follower services that sell follower accounts, likes, tweets and retweets.

17 Mitt Romney acquires 116K followers in 1 day (last accessed 15 June 2018) – https://www.cnet.com/news/mitt-romney-suspiciously-gets-116k-twitter-followers-in-one-day/

In [30] Forelle et al. uncovered the strategic role of sociopolitical bots. The researchers analysed activity patterns (follows, tweets, retweets) by examining the Twitter feeds of prominent Venezuelan politicians from 2015. They find that 10% of all retweets come from bots, with most of the bots used by the Venezuelan opposition. They also find that the bots mostly pretend to be politicians, leaders, political entities and government bodies, rather than citizens.

By 2016 the political bot phenomenon had reached its height, taking shape as masqueraded campaigns during the U.K.–E.U. referendum and the 2016 U.S. presidential election. Researchers in [48] analyse Twitter data collected for relevant positive, negative and neutral hashtags between 5 June 2016 and 12 June 2016. By collecting more than 1.5 million tweets from more than 313K unique accounts, the researchers were able to quantify the strategic role of bots in both campaigns: Remain (popularly called 'Bremain') and Leave (popularly called 'Brexit'). Firstly, the hashtags associated with the 'Leave' campaign dominated hashtags from the 'Remain' campaign by as much as 3–6× (341,839 tweets for #voteleave
110,653 for #strongerin). Secondly, different perspectives utilised different levels of automation. For example, the most popular 'Remain' hashtag, #strongerin, accounted for only 14.6% (186,279) of the tweets, out of which only 15.1% (28,075) were generated by automated sources. The most popular 'Leave' hashtag, #brexit, accounted for 51.8% (662,745) of the tweets, out of which 14.7% (97,431) were generated by automated sources. In fact, 5.7% (13,436) of neutral tweets (18.3% or 234,170 in total) were also generated by automated sources. Thirdly, less than 1% of the 313,832 unique accounts generated one-third of the tweets.

Similarly, in [6] Bastos et al. uncovered a network of 13,493 Twitter bots that tweeted during the U.K.-E.U. referendum, but disappeared shortly after the U.K. voted to leave the E.U. The researchers compare normal users with political bots in terms of tweeting behaviour, and retweet proportion and frequency, to find strategies of bot usage. The authors made two important discoveries: (i) the ability of bots to rapidly generate short-lived retweet cascades containing user-generated partisan news, and (ii) criteria-driven botnets organised to either replicate active users or replicate content posted by other bots.

This was quickly followed by the 2016 U.S. presidential election, where researchers discovered bots behind distortion campaigns in online discussions [8]. They found that 1 out of 5 tweets regarding the election was posted by a bot, i.e. 4 million tweets by 400K bots in the month leading up to the election. Twitter's interface does not specify the software platform behind a tweet, making it difficult for humans to determine whether a tweet is posted by a bot or a human. This meant bots were being retweeted at the same rate as humans. The authors found the bots to be biased; e.g. pro-Trump bots were producing supportive and positive content, thus ensuring a false public perception of grassroots support for Trump. In fact, the negative campaign by Clinton supporters against the opposing candidate Trump was so unsuccessful that it accrued a 50-50 split of positive and negative responses, whereas the negative campaign by Trump supporters against Clinton accrued 15.92% more negative responses than positive ones. Using geo-analysis, bots were found to exercise strong support in the Midwest and Southern states, especially Georgia, whereas humans were found to exercise influence in the most populated states such as California, Texas, Florida, Illinois, New York and Massachusetts. The study also classified the top five hashtags for both presidential candidates. It found that among the 7,112 Clinton supporters, 590 (8.3%) were bots, whereas among the 17,202 Trump supporters, 1,867 (10.85%) were bots.

Unfortunately, the covert and unwarranted use of bots in political campaigns had by now become an unimpeded norm. During the 2017 French presidential election, Ferrara [27] investigated the #MacronLeaks disinformation campaign against the candidate Emmanuel Macron. By collecting 17 million tweets between 27 April 2017 and 7 May 2017, the author discovered 18,324 bots (18%) and 81,054 humans participating in the #MacronLeaks disinformation campaign. The author discovered that some bot accounts were originally created prior to the 2016 U.S. presidential election. These bot accounts first went dormant after November 2016 (upon completion of the U.S.
presidential election) and were later recycled for use in the #MacronLeaks disinformation campaign at the beginning of May 2017. This further raises suspicion of the existence of social botnet black markets.

2.1.10 Social influence of bots

In [3], the authors use a bot on aNobii, a social networking site aimed at readers, to explore the trust, popularity and influence of bots. They show that gaining popularity does not require individualistic user features or actions, but rather simple social probing (i.e. bots following and sending messages to users randomly). The authors also found that an account can circumvent trust if it is popular (since popularity translates into influence). Similarly, in [26], Edwards et al. highlight a positive view of the existence of bots on social media by studying the differences in the perceived quality of communication between a human agent and a bot agent on Twitter. They find that Twitter bots can be viewed as credible, attractive, competent in communication, and interactive. Taking inspiration from this work, I extend the exploration to the Twitter platform. However, instead of infiltrating a social network with honeypot bot(s), I study the characteristics of existing bots.

Closely related is [86], which develops models to identify users who are susceptible to social bots, i.e. likely to follow and interact with bots. The authors use a dataset from the Social Bot Challenge 201118, and make a number of interesting findings, e.g. that users who employ more negation words have a higher susceptibility level. Similarly, users with a higher temporal balance, i.e. those who tweet more often, and those who discuss morbid topics more often, tend to have a higher percentage of interaction with bots. In my work, I study the characteristics of existing bots in detail and argue that this provides a far broader vantage into real bot activities. Hence, unlike studies that focus on the influence of individual bots (e.g. during the Syrian Civil War [1]), I gain perspective on the wider spectrum of how bots and humans operate and interact. Additionally, in Chapter 6 I devise a non-infiltrating honeypot experiment to study the impact of bots on content popularity.

18 I did not use this dataset as it was outdated: Twitter suspends unusual accounts, bots evolve, and so does the technology that drives these entities.

Mitter et al. in [61] explore whether social bots can be used to influence link creation between targeted human users. The authors use a dataset from the Pacific Social Architecting Corporation 2011 [63] to launch bots for investigating the creation of new social links. The authors find that approximately 12% of links are caused by bots mediating suggestions for connecting to target humans. In Chapter 4 I explore the degree of bot and human inter-connectedness and intra-connectedness by examining retweets and quotes, and replies and mentions.

2.1.11 Bot detection

The importance of bot detection on social media has recently gained momentum due to the rapid rise of bots. In [91], Yan et al. studied whether an automated Turing test such as the CAPTCHA is sufficient to verify that the entity behind a computer is a human rather than an algorithm. The study concludes that CAPTCHA, apart from raising usability concerns, is insufficient to discern humans from bots. In a comprehensive work [18], Chu et al. distinguish and identify Twitter accounts operated by three entities: humans, cyborgs and bots.
The authors make this classification by observing the differences among the three entities in terms of tweeting behaviour, tweet content and account properties. Using 1,000 training samples, the authors devised a system that classified their subset of the Twitter population into 5:4:1 proportions for human:cyborg:bot, respectively. However, they neither provide an API for evaluation nor share their datasets. Comparably, I find that approximately half (43.13% to be exact) of the user accounts in my Twitter datasets are operated by bots.

In another effort, DARPA organised a Twitter bot challenge in 2016 [77] to detect influence bots – bots that illicitly shape topical discussions on Twitter to serve the purposes of their masters. DARPA provided the six participating teams with 7,038 accounts as known ground-truth labels. The report concludes that the detection of evolving influence bots requires a carefully designed workflow, and that machine learning alone does not always work.

Coincidentally, there has been a recent surge in research focused on automating the generation of content [75] that appears to have been produced by humans. Also, some techniques are focused on discerning anomalous from normal, spam from non-spam, and fake from original, but they fail to distinguish (or compare) the types of users. I clearly distinguish my task of agent classification from spam detection. Spam is usually subversive and malicious in nature [76], is often found to be high in volume and frequency, and contains URLs (that point to malicious websites) and spam words [7, 57]. However, as briefed earlier, automation is not exclusively employed for malevolent purposes. There can be many variants of automation due to the usage of APIs and third-party services, and it can often involve direct human intervention (Chapters 3–5). Also, there are no guarantees that a successfully detected spam account is operated by an agent and not a human – it could be either. This forms a strong basis for detecting automation without any prior judgement. However, as mentioned, most of the existing techniques expose neither their datasets nor their tools, which makes evaluation difficult.

To the best of my knowledge there is only one freely available and usable research tool, BotOrNot19 [22, 83], that detects bots on Twitter. The tool applies a Random Forests classifier and uses 1,000 features divided into six groups to classify accounts as 'bots' or 'humans'. The model is trained using a list of social bots identified in [58] and a dataset from the Twitter Search API comprising the 200 most recent tweets by each of these bots and the 100 most recent tweets mentioning them. This yields a dataset of 15K manually verified social bots and 16K human accounts. The authors report a ten-fold cross-validation score of 0.95 AUC. Apart from using a Random Forests classifier with a more specific feature-set, I use raw historical data to cater for the evolution of agents and for stealthier agents. I use a dataset partitioned into four popularity bands, representing the Twitter population at a more granular level, as agents differ according to the popularity and purpose of their creation and presence. I also use 14 novel features from a set of 22 attributes in total. Furthermore, I employ account categorisation in the preprocessed and partitioned datasets, and perform ablation tests to identify the distinct group of features that is most effective for each popularity band (Chapter 4).

19 BotOrNot has since been rebranded as Botometer, but I continue to use BotOrNot to refer to the tool.
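To make this kind of pipeline concrete, the following is a minimal sketch of supervised bot detection as described above: a Random Forests classifier evaluated by ten-fold cross-validated AUC. It is an illustration only, not the Stweeler or BotOrNot implementation; the file features.csv and its layout are hypothetical.

    # Minimal sketch: supervised bot detection with a Random Forests
    # classifier, evaluated by ten-fold cross-validated AUC.
    # Assumes a hypothetical features.csv with one row per account:
    # numeric feature columns plus a binary is_bot label.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("features.csv")
    X = df.drop(columns=["is_bot"]).values  # per-account features
    y = df["is_bot"].values                 # 1 = bot, 0 = human

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print("10-fold AUC: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))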
2.1.12 Bot detection avoidance techniques

Social bots created and intended for unapparent purposes, such as human mimicking, sociopolitical campaigning, distortion of online discussion, advertisement and spam, use a number of techniques. These techniques can include any combination of intelligent content retweeting, variable tweeting frequency, manipulation of the tweet source endpoint, automated text summarisation, automated text generation, and automated discourse response. Though social bot detection avoidance has not been studied in particular, one can identify specific applications of relevant technologies for bootstrapping stealthy social bots.

For instance, many social media integration management apps (e.g. TweetDeck, Buffer, and HootSuite) provide paid-for value-added services. These services allow users to manage and set up tweet scheduling, intelligent retweeting and adjustment of tweeting frequency to maximise reach to their audiences (i.e. Twitter followers), through daytime posting of tweets or tweeting during spikes of social activity. Similarly, social media optimisation apps (e.g. SocialFlow) that run their own proprietary URL link proxies help users amplify the delivery of messages through timing and the utilisation of key engagement metrics (such as clicks per tweet, retweets per tweet, followers, etc.).

Quite a lot of work has been done in creating automated techniques for the summarisation, categorisation and generation of text. One of the more popular works, by Hovy [47], focused on a series of studies over a number of years on the automated generation of multi-sentence texts. The paper argued that the central structural role of textual discourse is determined by communicative intentions. The work mainly describes discourse structure relations, focusing on aspects such as sentence planning and text formatting. Another popular work, by Hovy et al. [46], focuses on the internal workings of a system called SUMMARIST. This system identified topics, the structural position of a piece of text, bonus phrases (likely candidates for a summary) vs. stigma phrases, topic signatures and discourse structure (text being a hierarchical structure of sentences).

In another work, Huang et al. [49] propose an integrated solution to construct an abstraction of content that allows users to consume meaningful units of extracted content. The proposed technique integrates different media sources, such as broadcast news, to generate a semantic hierarchical representation of content. The authors perform a two-stage process that (i) recognises the anchorperson in broadcast news using Gaussian Mixture Models, and (ii) extracts news stories through text-based discourse tokenisation. Their evaluation found a news classification error rate of less than 10% and an anchorperson identification accuracy of 92%.

Researchers have uncovered several ways that enable social botnets to evade detection approaches. For instance, Zhang et al. in [94] found that if Twitter bots are placed in a simple 10-ary tree of depth 2, with the root posting spam messages and the descendants retweeting, only the root bot gets suspended. A simple reallocation of the root bot among the descendants can carry the process forward, with the descendants remaining unsuspended every time. Ji et al. [51] perform a comprehensive study of social bot detection avoidance mechanisms.
First, bots exploit the implicit trust placed in content coming from a user's friends, which enables them to propagate content rapidly through retweets [93]. In fact, malicious URLs also spread faster and cover a wider range [14], while URL shorteners hide the true URL domain to avoid blacklisting [87]. Second, botnets keep track of their activities through cookies, use command-and-control (C&C) centres for coordination [95], and adopt a hierarchical root-descendant setup to avoid large-scale account suspension [94]. Third, bots can imitate the activities of humans on OSNs [74] to avoid or lower suspicion. Fourth, bots purposefully and randomly delay actions such as tweeting, retweeting and responding. The authors also suggest improvements to current detection mechanisms that use information derived from the above-mentioned behaviours.

2.1.13 Typification of bots

There is hardly any research that explores a general methodology to categorise bots. However, research has focused on topical analysis, such as bots running political campaigns (see § 2.1.9) during the 2016 U.S. presidential election for and against the two main candidates, Donald Trump and Hillary Clinton. While it was found that bots vastly followed Donald Trump and positively campaigned for him, the usage of pro-Clinton bots was not far behind. I list a few astonishing insights on Trump-Clinton bot campaigning in Chapter 6. Bots were also found to be involved in a disinformation campaign during the 2012 Mexico presidential election, and against Emmanuel Macron during the 2017 French presidential election in support of the far-right candidate Marine Le Pen. See § 2.1.9 for more on bots used in politics. In Chapters 4–6 I perform completely new work to typify bots into various categories, learned automatically by the bot classification algorithm from the characterised dataset.

Chapter 3

Stweeler: Twitter Computation System

In this dissertation I design and develop a framework, the Streaming Twitter Computation System (STCS), dubbed Stweeler1, as one of the major contributions of this research. Throughout the course of this dissertation, over Chapters 4–6, Stweeler evolves from a data collection and characterisation tool into a fully-functional, machine-learning-driven data science framework that enables bot classification and typification. Stweeler enables (among other things): (i) extracting the most representative features and behavioural properties to differentiate automated agents from humans, (ii) automated supervised learning to discern agents from humans, (iii) automated typification of agents to distinguish various categories, (iv) topic modelling and sentiment analysis, and (v) analysing the influence of Web bots.

1 Stweeler – https://github.com/zafargilani/stcs/

3.1 Research Questions

I ask a set of the most pressing research questions for bot analysis, in order to understand what the Stweeler framework should be (§ 3.3) for exploring answers to these questions.

Bots vs. humans: The first key aspect is the differences and similarities between bots and humans. What are the key activities of bots compared to humans, when measured through content generation, content popularity and content consumption? What is the quantity of activity and content generated by bots and humans? What is the degree of similarity between content produced by humans and content produced by bots? Which attracts more attention, which drives popularity, and why? Do bots form critical nodes and the largest connected components of the social graph?
Can we accurately detect bots using the above knowledge?

Bot engagement, impact and types: The second key aspect is how bots engage with other users of the social media platform. In what different capacities are bots used to disseminate content? Do bots manipulate the popularity of content, i.e. make topics 'trend'? Would bots impact network systems in the future by generating more content than humans do? Can bots be generalised into categories? Do bots have preferred topics? Do bots express certain sentiments like humans do?

3.2 What is Twitter? Why and how do bots exist on Twitter?

The word twitter means 'a call consisting of repeated light tremulous sounds' – similar to chirps from birds. According to Jack Dorsey2, founder of Twitter, the product name exactly denoted the philosophy of the company, i.e. a platform for short bursts of inconsequential information, where meaning is entirely dependent on the recipient.

2 Jack Dorsey talks about Twitter's founding document (last accessed 14 Jul 2018) – http://latimesblogs.latimes.com/technology/2009/02/twitter-creator.html

The existence of bots on Twitter is owed to three main reasons. First, Twitter identifies itself as an information social network, thus clearly focusing on global reach and wide social penetration. This focus meant that the business would generate wide-scale usage, adoption and economic activity by allowing developers to create thin clients, apps and tools atop the platform. Thus, Twitter provides publicly accessible APIs that enable both organisations and individuals to algorithmically program, control and automate actions on the platform.

Second, organisations and businesses, governments and individuals use Twitter for a multitude of purposes, either organically (via human operators) or inorganically (via automation or bot operators). Using automation legitimately provides organisations and individuals an accelerated path to global reach in a short time while incurring fractional costs.

Third, registering a Twitter account is usually a simple process. Individuals are usually expected to provide an email address, pass a soft inspection through CAPTCHA recognition, and more recently supply a mobile phone number to verify individuals and promote fair usage. However, bypassing or circumventing the mobile phone requirement has been found to be easy due to a number of options, such as virtual mobile networks3. Given the real-time global reach, a massive 336 million monthly active user-base4 and an easy registration process, Twitter inadvertently becomes a great enabler of illegitimate and dark activity such as spam, astroturfing, trolling, and covert social and political campaigning.

3 Using Google to bypass Twitter phone verification (last accessed 14 Jul 2018) – https://woorkup.com/how-to-bypass-the-twitter-phone-verification-for-new-account/
4 Twitter active monthly users Q1 2018 (last accessed 14 Jul 2018) – https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/

3.3 Stweeler Framework

The Stweeler analysis framework is laid out in Figure 3.1 as a toolkit that comprises a number of modular components. The components include: stream collectors (for data collection); stats, decomposition and graphs (for exploratory bot vs. human comparison); classifiers (for bot detection); clustering and language processing (for bot typification, topic and sentiment analysis). It accepts raw tweets (usually in JSON format) as inputs (left), processes the inputs using the toolkit (centre) and outputs the analyses (right). The framework contains a tool to run a third-party bot detection tool via a callable API. The framework also provides a bot and a Web server for an alternate study on the effects of bots on content popularity.

Bot behaviour has often been found to vary from human behaviour [34, 50].
Using the insights derived from Twitter account properties, I can perform classification to label users as 'bot' or 'human'. Properties indicated and measured in [18] include tweets, retweets, follower-friend ratio, URLs posted, and the sources or devices used to tweet. I augment these properties through an in-depth study to differentiate between bots and humans, adding properties such as account age, media types uploaded, size of media uploaded, account favouriting frequency, favourites received, retweets received, etc. Such a study enables building a bot classification model that can reliably differentiate bots from human users.

[Figure 3.1: Stweeler analyses framework. Raw tweets (JSON) are input to the toolkit (./collectors, ./stats, ./decomposition, ./graphs, ./classifiers, ./clustering, ./language-processing, ./bot, ./shortener-ws, ./botornot), which outputs bot vs. human comparisons, bot vs. human labelling, and bot categories, topics and sentiments. The Stweeler bot and shortener Web service operate in five steps: (1) pick a trending topic, (2) shorten a URL, (3) assemble a tweet, (4) post the tweet, (5) log clicks.]

The language processing module dissects content-based data such as trends, topics, sentiments, keywords, popular hashtags and (if available) geo-coordinates, to quantify bot impact on Twitter in terms of activity and data volume generated. It also analyses bot influence on Twitter in terms of followers, and how much bots morph OSNs and relationship trees. The nature of the bot (content producer or consumer) determines the nature of the impact. Using text classification and sentiment analysis, I can categorise bot types into news, marketing or advertisement, social or political campaigning, and spam or suspicious accounts.

3.4 Datasets, Feature Extraction and Annotation Methodology

In this section I describe the various datasets: the characterisation and detection dataset and the human annotated dataset created from it (used in Chapters 4–5), the typification dataset (used in Chapter 6), and the honeypot dataset (used in Chapter 6).

As part of the aforementioned framework I devise a smart yet simple way5 of collecting vast amounts of data from Twitter's publicly accessible Streaming API. A generic data collection tool is invaluable in exploratory data science, since it enables exploring related aspects of a problem space. It also mitigates the problems associated with collecting new datasets and ensuring quality and conformance with previous ones, and, most importantly, solves the issue of archiving historical datasets6. I do not filter by any keywords, location or language, and collect everything offered by the Streaming API.

5 Stweeler collector – https://github.com/zafargilani/stcs/blob/master/lib/collector.rb
6 The collection service, if kept running, automatically segregates tweets into daily files.
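The actual Stweeler collector is written in Ruby (footnote 5). Purely as an illustration of this unfiltered collect-and-archive loop, a minimal Python sketch using the tweepy 3.x-style streaming API (credentials below are placeholders) might look as follows:

    # Illustrative sketch (not the Ruby Stweeler collector): an unfiltered
    # Streaming API client that appends each raw status to a daily file.
    # Uses the tweepy 3.x-style API; credentials are placeholders.
    import time
    import tweepy

    CONSUMER_KEY, CONSUMER_SECRET = "key", "secret"        # placeholders
    ACCESS_TOKEN, ACCESS_SECRET = "token", "token-secret"  # placeholders

    class ArchivingListener(tweepy.StreamListener):
        def on_data(self, raw):
            # Segregate tweets into daily files, one JSON status per line
            fname = time.strftime("tweets-%Y-%m-%d.json")
            with open(fname, "a") as f:
                f.write(raw.strip() + "\n")
            return True

        def on_error(self, status_code):
            return status_code != 420  # stop and back off on rate limiting

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    tweepy.Stream(auth, ArchivingListener()).sample()  # no keyword/location/language filter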
3.4.1 Characterisation and Detection dataset

For the purposes of characterisation in Chapter 4, I needed a dataset that could form the basis for establishing a detailed understanding of the differences between bots and humans. This dataset is later labelled (§ 3.4.3) to provide ground-truth labels for the purposes of detection, by training a classifier in Chapter 5. Using the Stweeler collector I curated a raw tweet dataset over 30 days in April 2016. This raw dataset comprises approximately 65 million tweets recorded for approximately 2.9 million unique accounts.

This data contains a range of accounts across the spectrum of popularity (i.e. number of followers), from the most popular (celebrity status) to the least popular (virtually unknown). The purpose, activity and influence of an account differs based on popularity, exercised passively (follower count) or actively (through tweets and mentions), as noted in [16]. Hence, I partition profiles into popularity groups to enable a detailed understanding of the dataset. The hypothesis behind dataset partitioning is that popularity intrinsically reveals profile and network attributes. For instance, the most credible accounts will have a high following, whereas it is much more likely that spam/malicious or dark accounts will have lower popularity. In other words, the most popular accounts are mostly legitimate, irrespective of being automated or human operated. The partitions are described as follows:

G10M+ – celebrity status: This is the subset of Twitter users with the highest number of followers, i.e. >9M followers. These are the most popular users, who hold celebrity status and are globally renowned. Popular and credible organisations (e.g. CNN, NatGeo) use these accounts for various purposes, which makes them free of spam, thus having high credibility and trustworthiness.

G1M – very popular: This subset of Twitter users is amongst the most popular on the platform, i.e. 900K to 1.1M followers. These users are close to celebrity status and global recognition (e.g. nytfood, pcgamer).

G100k – mid-level recognition: This subset represents popular accounts with mid-level recognition (e.g. Amtrak, CBSNewYork), i.e. 90k to 110k followers.

G1k – lower popularity: This subset represents more ordinary users, i.e. 0.9k to 1.1k followers. These users (e.g. hope_bot, Taiwan_Agent) form a large base and, though they show lower individual or accumulated activity, they form the all-important tail of the distribution.

I create four partitions as they succinctly cover the entire popularity spectrum, from most to least popular, while clearly differentiating bots and humans. G10M+ and G1M are similar in their characteristics (see § 4.3) and constitute 0.65% of the total 105k accounts I partitioned in the dataset. G1k represents the bulk of Twitter, constituting 94.40% of the total partitioned accounts. G100k bridges the gap between the most popular and least popular groups, constituting 4.93% of the total partitioned accounts. A possible G10k would be similar to G1k, and a possible G50k would be similar to G100k. The dataset7 is a representative sample, as shown in § 5.4.

7 Datasets can be found here – http://www.cl.cam.ac.uk/%7Eszuhg2/data.html

3.4.2 Feature Extraction

Using tweets from these user profiles I extract all associated metadata and compute values for features (e.g. number of tweets). I then use Principal Component Analysis (PCA) from the scikit-learn [67] machine learning library8 to test the relevance and importance of the selected features. A set of 22 features across account profile, network and activity reveals an explained variance (σ²) of almost 100%. This means that this feature-set is representative of most of the variance found in the dataset. The feature-set and associated hypotheses are listed in Table 3.1.

8 Stweeler PCA – https://github.com/zafargilani/stcs/blob/master/lib/decomposition/pca.py
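As a minimal sketch of this relevance check (illustrative only; it assumes the per-account feature values have already been dumped to a hypothetical headerless features.csv), the cumulative explained variance can be inspected as follows:

    # Illustrative sketch: how much dataset variance the feature-set captures,
    # using PCA from scikit-learn. Assumes a hypothetical features.csv holding
    # an (accounts x 22 features) numeric matrix.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.loadtxt("features.csv", delimiter=",")
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

    pca = PCA().fit(X_scaled)
    cum = np.cumsum(pca.explained_variance_ratio_)
    for k in (2, 5, 10):
        print("variance explained by first %d components: %.1f%%" % (k, 100 * cum[k - 1]))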
In addition to known features studied in [18, 22] (age, tweets, retweets, favourites, replies and mentions, URL count, follower-friend ratio, etc.), I also analyse a set of eight novel features not explored in past bot research. These are: (i) favourites-to-tweets ratio, (ii) lists per user, (iii) likes/favourites per tweet, (iv) retweets per tweet, (v) user replies, (vi) 7 activity source identity (or source type) categories, (vii) source count, and (viii) CDN content size. The selection of features is driven by [25].

In addition to the list of features, Table 3.1 also lists the hypotheses or assumptions attached to these features. The hypothesis per feature per entity (bot or human) indicates the expectation of how it might behave. Deviation from expected behaviour per feature per entity would define an inclination either towards bot or human behaviour.

Table 3.1: Features

Age of account – The age of the Twitter account in days. The assumption is that humans have older accounts.
Favourites-to-tweets ratio – 'Favourites' or 'likes' received for all user tweets. I expect humans to receive more 'likes'.
Lists per user – Lists subscribed to. I expect bots to follow more lists, for obtaining lists of users to follow.
Followers-to-friends ratio – Previous research [18] shows that humans typically have this ratio close to 1.
User favourites – Tweets 'favourited' by a user. 'Liking' a post suggests an agreement, thus it should point to human-like behaviour.
Likes/favourites per tweet – 'Favourites' received by a user. I expect humans to receive more 'likes', owing to content originality and topic diversity.
Retweets per tweet – 'Retweets' received by a user. I expect humans to receive more 'retweets', owing to content originality and topic diversity.
User replies – Tweets replied to by a user. I assume humans will engage in conversations with other users, but bots will not.
User tweets – User-generated tweets. Bots should tweet more aggressively, given that they do not experience 'human'-like limitations.
User retweets – User-generated retweets. Aggressive retweeting is a sign of automation [18].
Tweet frequency – Daily tweet frequency of a user. Bots are expected to tweet much more often than humans per day.
URLs count – URLs are used to redirect traffic elsewhere from the Twitter platform. The presence of URLs within tweets suggests automation [18].
Activity source type – A 'source' is the endpoint from which a user posts tweets, denoted Sn. This is categorised as: browser or web client (S1), mobile device apps (S2), social media management apps (S3), social media scheduling and automation (S4), marketing and brand promotion (S5), news content web services (S6), any other not part of the defined list (S0). Humans are expected to use S1, S2 and S3, whereas bots are expected to use S4, S5 and S6.
Source count – The number of endpoints used. I assume humans will use more sources.
CDN content size – Content (pictures and videos, respectively) uploaded on Twitter. Bots should be able to upload more content on Twitter.
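For illustration, here is a sketch of how a few of the Table 3.1 features can be derived from the raw statuses collected by the stream. Field names follow the Twitter API v1.1 status object; the file name and aggregation logic are hypothetical, not the Stweeler code.

    # Illustrative sketch: deriving a few Table 3.1 features from raw statuses
    # (Twitter API v1.1 JSON, one status object per line in a daily dump).
    import json
    from collections import defaultdict

    accounts = defaultdict(lambda: {"tweets": 0, "retweets": 0, "urls": 0,
                                    "sources": set(), "followers": 0, "friends": 0})

    with open("tweets-2016-04-01.json") as f:  # hypothetical daily file
        for line in f:
            status = json.loads(line)
            acct = accounts[status["user"]["screen_name"]]
            if "retweeted_status" in status:
                acct["retweets"] += 1          # user retweets
            else:
                acct["tweets"] += 1            # user tweets
            acct["urls"] += len(status.get("entities", {}).get("urls", []))
            acct["sources"].add(status.get("source", ""))  # HTML-wrapped source string
            acct["followers"] = status["user"]["followers_count"]
            acct["friends"] = status["user"]["friends_count"]

    for name, a in accounts.items():
        ratio = a["followers"] / max(a["friends"], 1)      # followers-to-friends ratio
        print(name, a["tweets"], a["retweets"], a["urls"], len(a["sources"]), round(ratio, 2))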
3.4.3 Human Annotated dataset

I recruit four human participants to perform a manual data labelling or human annotation task9 to identify bots and humans. The chosen annotators are trained computer scientists and active Twitter users. Each account was reviewed by all participants independently, before being aggregated into a final judgement using a majority count and a final collective review (via discussion if needed). Each review was completed using a tool that automatically presents Twitter profiles for reviewing the content and URLs posted. This allows the participants to annotate the profile with a classification (bot or human) and add any extra comments.

9 Details of the human annotation task can be found in Appendix A.1 or at http://www.cl.cam.ac.uk/%7Eszuhg2/docs/papers/human-annotation-task.txt

The participants were asked to check each profile generally, while paying special attention to activity during the month of April 2016. Particular attention to activity in April 2016 was necessary to ensure that account activity stayed consistent, thus justifying the annotation. For performing the reviews the participants were given the Twitter profiles as well as summary data. This included information already available on each Twitter profile, such as: account creation date, average tweet frequency, content posted on the user's Twitter page, account description, whether the user replies to tweets, likes or favourites received, and the follower-friend ratio. The availability of this information enabled the participants to find any changes in profiles from April 2016 to other observed time periods. I also provided participants with a list of the 'sources' used by the account over the month, e.g. Twitter app, browser, etc. The human workers considered both the number of sources used and the types of sources used. This is because sources can reveal traces of automation, e.g. use of the Twitter API. However, the participants were asked to weigh their own best judgement over what the task document described. This mitigated the possibility of biasing the results, alongside considerations such as detailed observation, individual judgement and a final discussion for unclear or difficult profiles.

Overall, I presented participants with randomised lists that fell into the four popularity groups described in § 3.4.1. The human annotators were instructed to filter out any account that matched the following criteria: an account that does not exhibit activity (i.e. no tweets, retweets, replies or mentions), and an account that is suspended. Each account is marked as either human or bot, and final ground-truth labels are used iff a majority vote holds between all annotators. This majority vote is the final annotation derived from the four annotations. If there is a tie (i.e. a 2-2 vote split among annotators), the account is discussed among the annotators and re-annotated for a majority vote (i.e. for a final annotation). In total, the volunteers successfully annotated 3,536 active accounts: 1,525 were classified as bots (43.12%), 2,010 as humans, and 1 was a tie.

Though ties are an exception in my dataset (1 out of 3,536), it is important to handle noisy labels properly. There are three approaches in current research to tackle this problem: (i) detecting and correcting incorrect labels [2, 78], (ii) weighting the data labels using a loss function according to peripheral information such as noise rates [64, 53, 59], and (iii) ignoring or discarding the noisy labels [52, 62, 60, 24]. Incorrect or noisy labels generally tend to mislead learning models [24], especially if they occur in high proportions. To handle this I devised a simple solution: have the annotators revisit a tie through an open discussion. The purpose of the discussion is to quickly highlight individual findings, view the account collectively, and re-annotate to whatever the majority decides. This approach provides quality results and does not deviate from the majority-vote requirement for an annotation decision (i.e. the final annotation).
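A minimal sketch of this aggregation logic follows (the labels below are hypothetical; the real process was manual and discussion-based):

    # Illustrative sketch of label aggregation: majority vote over four
    # annotators, with a 2-2 tie flagged for discussion and re-annotation.
    from collections import Counter

    def aggregate(labels):
        """labels: four 'bot'/'human' judgements for one account."""
        label, votes = Counter(labels).most_common(1)[0]
        if votes > len(labels) // 2:
            return label   # clear majority -> final annotation
        return "tie"       # 2-2 split -> revisit via open discussion

    print(aggregate(["bot", "bot", "bot", "human"]))      # -> bot
    print(aggregate(["human", "human", "bot", "human"]))  # -> human
    print(aggregate(["bot", "human", "bot", "human"]))    # -> tie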
The annotated partitioned groups are described as follows:

G10M+ – celebrity status: Out of a total of 102 accounts, 50 were successfully annotated within the given timeframe. Out of these 50 user profiles, 24 are identified as bots and 26 as human accounts.

G1M – very popular: Out of a total of 893 such accounts in my dataset, 746 were successfully annotated within the given timeframe. Out of these 746 user profiles, 295 are identified as bots, 450 as human accounts, and 1 was a tie. This tie is annotated in the dataset as 'human', as the majority of the annotators, after a discussion, were convinced that this account is operated by a human.

G100k – mid-level recognition: Out of 9,691 such accounts, a total of 1,447 were successfully annotated within the given timeframe. Out of these 1,447 user profiles, 707 are identified as bots and 740 as human accounts.

G1k – lower popularity: Out of 795,861 such accounts, only 1,293 were annotated successfully within the given time. Out of these 1,293 user profiles, 499 are identified as bots and 794 as human accounts.

A summary of the annotated data is provided in Table 3.2.

Table 3.2: Summary of Twitter dataset post-annotation.

Group   #Bot accts   #Human accts   #Bot statuses   #Human statuses
G10M+   24           26             71,303          79,033
G1M     295          450            145,568         157,949
G100k   707          740            148,015         82,562
G1k     499          794            24,328          13,351
Total   1,525        2,010          389,214         332,895

3.4.4 Typification dataset

For the purposes of exploring bot categories in Chapter 6, I collect a completely new dataset using the Streaming API over 30 days in December 2016. The total data collected was approximately 65 million tweets, with information recorded on approximately 3 million unique accounts. This dataset is different from the one described in § 3.4.1–3.4.3 and used in Chapters 4–5. The reason for this change is to obtain a larger dataset for an in-depth exploration of the types of bots. Moreover, this mitigates the problem of using past datasets that might contain suspended, deactivated and deleted Twitter accounts.

I initially collect data on 3 million accounts, out of which the Stweeler bot classifier identifies 11,379 as humans and 11,102 as bots. This reduction occurs for two main reasons: (i) input filtering, such as removing suspended accounts and accounts with no tweets, and (ii) time constraints – the Stweeler bot classifier was kept running for a week to obtain a sizeable dataset, though theoretically it could exhaustively process the raw dataset for an input of any number of days. Next, the dataset is cleaned by removing all empty lines from the tweet files, and by removing all bot users that had produced fewer than two tweets. I remove low-activity bots to achieve higher accuracy during the clustering task. Some of these low-activity accounts are classified as bots because of the activity source endpoint they use (such as automated services), their very low account reciprocity (0 followers or 0 friends), and the lack of an original tweet.

Next I augment the dataset to detect languages and translate non-English text to English, in order to label categories more conveniently and accurately10. I only use the most popular languages on Twitter to capture maximum data without compromising performance, i.e. English (34%), Japanese (16%), Spanish (12%), Portuguese (6%), Arabic (6%), French (2%), and Turkish (2%). To detect the language I employ Python's langdetect library, and to translate accurately I use Python's textblob library. Though textblob can also be used to detect the language of a text, it is much slower than langdetect since it uses the massive nltk corpora database; langdetect, on the other hand, utilises Google's language detection approach. Execution performance aside, both toolkits provide high accuracy for language detection and manipulation.

10 There is a scarcity of reliable and accurate non-English topic modelling tools, hence the limit and the translation of the non-English corpus to English.
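A minimal sketch of this detect-then-translate step follows. It is illustrative only: textblob's translate() wrapped the Google Translate web API in releases contemporary with this work (and therefore needs network access), and the error handling and language whitelist here are assumptions.

    # Illustrative sketch: detect tweet language with langdetect, keep only
    # the most popular languages, and translate non-English text with textblob.
    from langdetect import detect
    from textblob import TextBlob

    KEEP = {"en", "ja", "es", "pt", "ar", "fr", "tr"}  # most popular on Twitter

    def to_english(text):
        try:
            lang = detect(text)            # fast detection via langdetect
        except Exception:
            return None                    # e.g. empty or emoji-only text
        if lang not in KEEP:
            return None                    # drop rarer languages
        if lang == "en":
            return text
        try:
            # translate() wraps the Google Translate web API (network required)
            return str(TextBlob(text).translate(from_lang=lang, to="en"))
        except Exception:
            return None                    # translation unavailable or failed

    print(to_english("hola mundo"))        # hypothetical Spanish tweet text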
Figure 3.2 shows the original text and Table 3.3 shows the translated text from one such example.

[Figure 3.2: Accuracy of language detection (langdetect) and translation (textblob) libraries: Original text.]

Table 3.3: Accuracy of language detection (langdetect) and translation (textblob) libraries: Translated text.

Conversion type: ar → en
Translated text: RT @AJArabic: UN accuses Hezbollah of obstructing implementation of evacuation agreement

The accuracy of these libraries on complex phrases might be a topic for further discussion. However, for the purposes of generating topic models from the text corpora of bot categories, the accuracy is acceptable. Table 3.4 shows a summary of the dataset used for typification in Chapter 6.

Table 3.4: Summary of Twitter bot dataset (Dec 2016) for typification.

Stat                    Count (%age of total dataset)
Extracted bot accts     11,102 (100%)
# Extracted statuses    951,481 (100%)
Processed bot accts     5,088 (45.83%)
# Processed statuses    715,081 (75.15%)
# Translated statuses   446,378 (46.92%)

3.4.5 Honeypot dataset

To explore the impact of Web bots on the popularity of content posted on Twitter in Chapter 6, I perform a honeypot experiment by deploying a Twitter bot. Table 3.5 shows statistics for the data collected by my Web server from 21-11-2015 to 08-01-2017. My Twitter bot account received more than 223,000 clicks, out of which 44.91% were performed by bots. Surprisingly, the volume of activity of Twitter bots (53.90% of total statuses) and Web bots (44.91% of total clicks) on Twitter is very similar. Details of the experiment and the results of the analyses are presented in § 6.5.

Table 3.5: Click logs dataset – statistics.

Fact                    Figures
Timeframe               From 21-11-2015 to 08-01-2017
Total clicks            223,062
Clicks by bots          100,194 (44.91%)
Unique visitors         2,563
Unique recurring bots   113 (4.08%)

3.5 Stweeler Dashboard

This work also contributes a live, non-invasive, non-engaging Twitter bot, and a dashboard built from the live clicks dataset collected using the bot (Chapter 6). Using a live Web server I deployed a Twitter bot for a honeypot experiment that captures live clicks by other bots that interact with open Twitter content. These bots could be Twitter bots or wider Web bots (such as content curators, crawlers and spiders). The Web server has a dashboard11 to display analytics around the clicks dataset (Figure 3.3). More about the bot and how the live clicks dataset is used can be found in Chapter 6.

[Figure 3.3: Stweeler dashboard.]

11 Stweeler dashboard – http://svr-szuhg2-web.cl.cam.ac.uk/graph/graphs

3.6 Takeaways

In this chapter I presented a list of important questions, explained how and why bots exist on Twitter, and presented the Stweeler framework as an effective tool to study the bot presence.
I aim to answer most of the raised questions by using the Stweeler analyses framework to build a comprehensive understanding of the bot population on Twitter. In the chapters that follow, I perform a detailed characterisation of bots and humans (Chapter 4); using these characterisations, I implement a detection algorithm (Chapter 5); I perform bot typification to explore and understand the types of bots (Chapter 6); and I finally conclude in Chapter 7.

Chapter 4

Measuring and Characterising Social bots

In the previous chapter I listed the contributions of this dissertation. In this chapter I utilise Stweeler to extract a wide spectrum of features. I study these features in detail for the purposes of an in-depth comparative analysis of the usage and impact of bots and humans on Twitter. In order to accomplish this, I collect a large-scale Twitter dataset and define various features based on tweet metadata using Stweeler. The human annotation task (§ 3.4.3) is used to assign 'bot' and 'human' ground-truth labels to the dataset. The annotations are compared against a state-of-the-art bot detection tool for evaluation (I build my own bot detection tool in Chapter 5). I then ask a series of questions to discern important behavioural characteristics of bots and humans, using features within and among the popularity groups. From the comparative analysis I draw differences and interesting similarities between the two entities, thus paving the way for reliable detection of bots in Chapter 5. Moreover, this enables exploring influence and categories, and extends the Stweeler platform so it can be used for studying automated political infiltration and advertisement campaigns.

4.1 Introduction

The rise of bots constitutes a radical shift in the nature of content production, which has traditionally been the realm of human creativity (or at least intervention). Although there have been past studies on bots (see § 2.1), this chapter is particularly focused on exploring their role in the wider social ecosystem, and on how their behavioural characteristics differ from those of humans. This is driven by many factors. The limited cognitive ability of bots clearly plays a major role; however, it is also driven by their diverse range of purposes, from curating news to answering customer queries. This raises a number of interesting questions regarding how these bots operate, interact and affect online content production: What are the typical behaviours of humans and bots, in terms of their own activities as well as the reactions of others to them? What interactions between humans and bots occur? How do bots affect the overall social activities, and what would the impact of their removal be? The understanding of these questions can have deep implications in many fields such as social media analysis and systems engineering.

Beyond the social implications, the combined popularity of social media and online bots may mean that a significant portion of network traffic can be attributed to bots. This conjecture is not without support: according to one estimate, 51.8% of all Web traffic is generated by bots1. This, however, constitutes a radical shift from traditional views on Web traffic, bringing about both new research questions and engineering opportunities. For example, can we measure the amount of traffic produced by bots?
This is of importance for future network engineering, as preliminary evidence suggests that a substantial amount of network congestion is caused by (low-priority) bots.

1 Bot traffic report 2016 (last accessed 16 June 2018) – https://www.incapsula.com/blog/bot-traffic-report-2016.html

Contributions of this chapter: To answer the above questions, I have performed a large-scale measurement and analysis campaign on Twitter (§ 4.2). I analyse the most descriptive features from the dataset, as outlined in a social capitalist study [25], including six which have not been used in the past to study bots. This chapter offers a new and fundamental understanding of the characteristics of bots vs. humans, observing a number of clear differences (§ 4.3). For example, humans generate far more novel content, while bots rely more on retweeting. I also observe less intuitive trends, such as the propensity of bots to tweet more URLs and to upload bulkier media (e.g. images). There are divergent trends between different popularity groups (based on follower counts), with, for example, popular celebrities utilising bot-like tools to manage their fanbase. I further analyse the social interconnectedness of bots and humans to characterise how they influence the wider Twittersphere. Observation reveals that although human contributions are generally considered more important by typical measures (e.g. number of likes, retweets), bots manage to sustain significant influence over content production and propagation. My experiments confirm that the removal of bots from Twitter could have serious ramifications for information dissemination and content production on the social network. Specifically, by simulating content dissemination I find that bots are involved in 54.59% of all information flows (defined as the transfer of information from one user to another user). I also seek to discover: (i) the amount of data traffic bots generate on Twitter, and (ii) the nature of this traffic in terms of media type, i.e. URL, photo (JPG/JPEG), animated image (GIF), and video (MP4). This chapter also sheds light on how this ever-increasing bot traffic might affect networked systems and their properties. As well as providing a powerful underpinning for social bot detection (Chapter 5), this chapter makes contributions to the wider field of social content automation. Such understanding is critical for future studies of social media, which are often skewed by the presence of bots.

4.2 Methodology

I build upon my work Stweeler2 for data collection, pre-processing, feature extraction, bot classification through human annotation, and analysis.

2 Stweeler – https://github.com/zafargilani/stcs

4.2.1 Data Collection and Feature Extraction

Every single action performed on Twitter by a user is recorded as a tweet (status), whether a tweet, retweet, reply or mention. I collect data on bot and human behaviour for 30 days in April 2016 from the Twitter Streaming API. This data contains a range of accounts in terms of their popularity (i.e. number of followers). Hence, I extract and partition user accounts into four popularity groups to enable a deeper understanding. Please see § 3.4.1 for full details about the dataset used in this chapter. The features I consider in this study are defined in Table 3.1, with details explained in § 3.4.2. The feature-set, along with the correlation between bots and humans across the different popularity groups, is shown in Figure 4.1.
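Since such a per-feature correlation between two unequal-sized populations can be computed in several ways, the following is one plausible recreation only (an assumption, not necessarily the computation behind Figure 4.1): bin the bot and human empirical distributions of a feature on a shared grid and rank-correlate the per-bin densities with scipy.

    # Hedged sketch: one plausible way to compute Spearman's rho between the
    # bot and human populations for a single feature. The binning approach is
    # an assumption, not necessarily the exact method behind Figure 4.1.
    import numpy as np
    from scipy.stats import spearmanr

    def feature_rho(bot_values, human_values, bins=50):
        edges = np.histogram_bin_edges(np.concatenate([bot_values, human_values]), bins=bins)
        bot_hist, _ = np.histogram(bot_values, bins=edges, density=True)
        human_hist, _ = np.histogram(human_values, bins=edges, density=True)
        rho, _ = spearmanr(bot_hist, human_hist)
        return rho

    # Hypothetical monthly status counts for the two groups
    rng = np.random.default_rng(0)
    print(round(feature_rho(rng.poisson(303, 1525), rng.poisson(192, 2010)), 3))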
4.2.2 Bot Classification via Human Annotation Task

To compare bots with humans, it is necessary to identify which accounts are operated by bots. I experimented with the updated release of BotOrNot [22, 83], a state-of-the-art bot detection tool (to the best of my knowledge, the only available online bot detection tool). However, inspection of the results indicated high inaccuracy across different thresholds (40% to 60%) for labelling an account as a 'bot'. Cresci et al. in [21] reported similarly poor accuracy. I cannot say for certain why BotOrNot was so inaccurate, since its internal workings (code) are kept inaccessible by its authors. However, in my understanding there are three reasons that explain why BotOrNot performed below average: (i) it works live and therefore can only access a subset of tweets (thus missing the complete picture), (ii) it is trained on old data, and (iii) it uses far too many features (the authors claim to use 1,000 or more). Hence, I chose to take a manual approach to establish a highly reliable set of classifications, which would serve the exploratory purpose of this chapter as well as providing ground-truth labels for bot detection (Chapter 5). The dataset created via this manual approach is described in § 3.4.3. Details of the human annotation task can be found in Appendix A.1. In total, I found 43.13% bots in my Twitter dataset, responsible for 53.90% of statuses.

For context, I cross-validated by comparing the agreement of the final annotations by the human workers with the BotOrNot annotation. The average inter-annotator agreement compares the pairs of labels by each human annotator, to capture the percentage of accounts for which all four annotators unanimously agree. The average agreement is measured as a percentage: 0% shows a lack of agreement and 100% shows perfect agreement. The human annotation task shows very high unanimous agreement between the human annotators for each popularity group: G10M+ (96.00%), G1M (86.32%), G100k (80.66%), and G1k (93.35%). In contrast, BotOrNot shows lower-than-average agreement with the final labels assigned by the human annotators: G10M+ (46.00%), G1M (58.58%), G100k (42.98%), and G1k (44.00%). Since BotOrNot yields lower accuracy, I chose to use the dataset of accounts that were manually annotated. I perform a more thorough comparison with BotOrNot in Chapter 5 while designing my own bot detection tool.

4.2.3 Media Extraction and Processing

As well as text, users are allowed to tweet content such as videos and images. These are identified by metadata within the Twitter data.
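As an illustration of this metadata, the following sketch pulls user-uploaded media and URLs out of a raw status. Field names follow the Twitter API v1.1 status object; the file name is hypothetical.

    # Illustrative sketch: extracting user-uploaded media and URLs from a raw
    # status (Twitter API v1.1 JSON schema).
    import json

    def extract_media(status):
        urls = [u["expanded_url"] for u in status.get("entities", {}).get("urls", [])]
        photos, videos = [], []
        for m in status.get("extended_entities", {}).get("media", []):
            if m.get("type") == "photo":
                photos.append(m["media_url_https"])
            elif m.get("type") in ("video", "animated_gif"):
                # pick the highest-bitrate MP4 variant
                mp4s = [v for v in m["video_info"]["variants"]
                        if v.get("content_type") == "video/mp4"]
                if mp4s:
                    videos.append(max(mp4s, key=lambda v: v.get("bitrate", 0))["url"])
        return urls, photos, videos

    with open("tweets-2016-04-01.json") as f:  # hypothetical daily dump
        for line in f:
            print(extract_media(json.loads(line)))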
Table 4.1 summarises the types of media content I observed. The dataset is the same as defined in Table 3.2. For each tweet created, I extract the media and URLs. Importantly, Twitter automatically creates different resolutions of photos and videos, as well as generating images from animated sequences or videos so that each dynamic medium is accompanied by a static display. Note that I only consider the media originally uploaded by users; this is pointed to by [sizes][large]. I do not consider media created or uploaded by Twitter itself as part of my dataset.

Table 4.1: Types of bot traffic uploaded by Twitter users.

URL & schemes – URL hosts and URI schemes (4,849 http and 289,074 https instances). These are extracted from the [text] tweet attribute. 162,492 URLs were posted by bots and 131,431 by humans.
Photos (JPG/JPEG) – A photo is extracted from the URL in the [media_url_https] attribute. In total 23.31 GB of photo data was uploaded by the 3,536 bots and humans in one month.
Animated images (GIF) – Though these are animated photos, Twitter saves the first image in the sequence as a photo, and the animated sequence as a video under the [video_info] attribute. In total 2.92 GB of animated image data was uploaded.
Videos (MP4) – Video files accompany a photo, which is extracted by Twitter from one of the frames of the video. A video is pointed to by the URL in the [video_info][url] attribute. In total 16.08 GB of video data was uploaded.

4.3 Which manners maketh the Bot?

The purpose of this study is to discover the key account characteristics that are typical (or atypical) of bots and humans. Recall that I take a broad perspective on what a 'bot' is, i.e. any account that consistently involves automation over the observed period, but which may involve human intervention. This definition is justified by the purpose of automation, i.e. humans act as bot managers, whereas bots act as workers. To explore this, I use the data (§ 4.2) to empirically characterise bots (dashed lines in figures) and humans (solid lines in figures). To begin, I simply compute the correlation between bots and humans for each feature, in order to highlight similarities and differences. Figure 4.1 presents the results as a heatmap (where perfect correlation is 1.0). Notice that most features exhibit very poor correlations (0.0 to 0.35), indicating significant discrepancies between bot and human behaviour. I spend the remainder of this chapter exploring these differences in depth.

[Figure 4.1: Spearman's rank correlation coefficient (ρ) between bots and humans per measured feature. The figure shows none (0.0) to weak (0.35) correlation across all features, indicating a clear distinction between the two entities.]

4.3.1 Content Generation

I begin by asking: do bots generate more content on Twitter than humans? Intuitively, one might imagine bots to be capable of generating more content; however, creativity could be a major bottleneck. I initially consider two forms of content creation: a tweet, which is an original status written by the account, and a retweet, which is a repetition of an existing status. As briefed earlier, the term status refers to the sum of both tweets and retweets. First, I inspect the amount of content shared by computing the number of statuses (i.e. tweets + retweets) generated by each account across the 30 days. As anticipated, humans post statuses less frequently than bots (a monthly average of 192 for humans vs. 303 for bots) in all popularity groups except G10M+, where surprisingly humans post slightly more than bots. The sheer bulk of statuses generated by G10M+ (on average 2,852 for bots, 3,161 for humans in a month) is likely to acquire popularity and new followers.
Overall, bots constitute 51.85% of all statuses in this dataset, even though they account for only 43.13% of the accounts.

An obvious follow-up question is: what do accounts tweet? This is particularly pertinent as bots are often reputed to lack original content. To explore this, I inspect the number of tweets vs. retweets performed by each account. Figures 4.2a and 4.2b present the empirical distributions of tweets and retweets, respectively, over the 30 days. The retweet distribution is rather different from the tweet distribution. Bots in G1M, G100k and G1k are far more aggressive in their retweeting; on average, bots generate 2.20× more retweets than humans. The only exception to this trend is G10M+, where humans retweet 1.54× more often than bots. This is likely driven by the large number of tweets generated by celebrity users. Typically, humans generate new tweets more often, while bots rely more heavily on retweeting existing content. Generally, humans post 18 tweets for every retweet, whereas bots post 13 tweets for every retweet, in all popularity groups except G10M+ (where both entities show similar trends).

Whereas tweets and retweets do not require one-to-one interaction, a further type of messaging on Twitter, the reply, does. Replies are tweets created in response to a prior tweet (using the @ notation). Figure 4.2c presents the distribution of the number of replies issued by each account. I anticipated that bots would post more replies and mentions given their automated capacity to do so. However, in G10M+ both bots and humans post a high number of replies, and bots post only marginally more than celebrities. While bot-masters in G10M+ deploy chatbots to address simple user queries, celebrities reply in order to engage with their fanbase. It is also possible that celebrities employ managers as well as automation and scheduling tools (§ 4.3.5) for such a purpose. Bots in the remaining popularity groups respond twice as frequently as their human counterparts. Again, this is driven by the ease with which bots can automatically generate replies: only the most dedicated human users can compete.

[Figure 4.2: Content Creation: (a) Number of tweets issued by a user; (b) Number of retweets issued by a user; (c) Replies and mentions posted by a user; (d) Follower-friend ratio of a user.]

4.3.2 Content Popularity

The previous section explored the amount of content generated by accounts; however, this does not preclude such content from being of low quality. To investigate this, I compute standard popularity features for each user group. First, I inspect the number of favourites or likes received for tweets generated by the accounts. This is a reasonable proxy for tweet quality, where the assumption is that bots will lag considerably behind humans. Figure 4.3a presents the empirical distribution of the number of favourites or likes received for all the tweets generated by the profiles in each group. As expected, a significant discrepancy can be observed. Humans receive far more favourites per tweet than bots across all popularity groups except G1k.
Close inspection revealed that bots in G1k are typically part of larger social botnets that systematically promote each other for purposes outlined in § 4.1. In contrast, human accounts are limited to their social peers and do not usually indulge in the 'influence' race. For the G10M+, G1M and G100k popularity groups, humans receive on average 27×, 3× and 2× more favourites per tweet than bots, respectively. G1k bots are the exception, receiving 1.5× more favourites per tweet than humans. These findings suggest that: (i) popularity may not be ideally defined by the number of followers alone, and (ii) human content gathers greater engagement due to its personalised attributes.

A further, stronger sign of content quality is another user retweeting the content. This is potentially an even stronger signal of endorsement, as a retweet is explicitly listed on a user's wall. Humans would be expected to receive many times more retweets than bots. Indeed, humans consistently receive more retweets across the popularity groups (G10M+: 24-to-1; G1M and G100k: 2-to-1), except G1k (1-to-1). This difference, shown in Figure 4.3b, is indicative of fanbase loyalty, which is vastly higher for individual celebrities than for reputable organisations. In other words, the quality of human content appears to be much higher.

I then inspect who performs the retweets, i.e. do bots tend to retweet other bots or humans? Bots retweeting bots is over 3× more common than bots retweeting humans. Similarly, humans retweeting humans is over 2× more common than humans retweeting bots. Overall, bots are retweeted 1.5× more often than humans. This indicates a form of homophily and assortativity.

Figure 4.3: Content Popularity: Likes per tweet, Retweets per tweet. (Panels: (a) Likes per tweet received by a user; (b) Retweets per tweet received by a user.)

4.3.3 Content Consumption

Whereas the previous features have been based on content produced by the accounts under study, my dataset also includes the consumption preferences of the accounts themselves. Hence, I ask: how often do bots 'favourite' content from other users, and how do they compare to humans? Intuitively, bots should be able to perform far more likes than humans (who are physically constrained). Figure 4.4a shows the empirical distribution of the number of likes performed by each account. It can be seen that, actually, for most popularity groups (G1M, G100k, G1k), humans favourite tweets more often than bots (on average 8,251 for humans vs. 5,445 for bots across the entire account lifetimes). Linking into the previous discussion, it therefore seems that bots rely more heavily on retweeting to interact with content. In some cases, the difference is significant; e.g. humans in G1M and G100k place twice as many likes as bots do. G10M+, however, averages 1,816 likes by humans compared to 2,921 by bots. There could be several reasons for this trend: (i) humans appreciate what they like, (ii) bots are workers for their human managers and serve a purpose (e.g. promotion via tweets), (iii) humans have an incentive to like other tweets, potentially as a social practice (with friends) or in the hope of receiving likes in return [72].
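The who-retweets-whom comparison reported in § 4.3.2 amounts to a cross-tabulation of retweet interactions by account class. A sketch, assuming the raw statuses plus a mapping from user ids to the annotated 'bot'/'human' labels (both names are mine):

    from collections import Counter

    def retweet_homophily(statuses, labels):
        """Counts of retweet interactions for each (retweeter class,
        original author class) pair, e.g. ('bot', 'bot')."""
        pairs = Counter()
        for status in statuses:
            if 'retweeted_status' not in status:
                continue
            src = labels.get(status['user']['id_str'])
            dst = labels.get(status['retweeted_status']['user']['id_str'])
            if src and dst:
                pairs[(src, dst)] += 1
        return pairs

Comparing pairs[('bot', 'bot')] against pairs[('bot', 'human')], and likewise for humans, gives the homophily ratios quoted above.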
To explore these favouriting strategies further, Figure 4.4b plots the number of favourites performed by an account vs. the age of the account. Firstly, bots are as old as humans: the oldest bot account is 3,437 days old vs. 3,429 days for the oldest human account. Secondly, and more importantly, it can be seen that more recent (i.e. more modern) bots are significantly more aggressive in liking other tweets. Older bots, instead, use this feature less frequently; deeper inspection suggests this is driven by the trustworthy nature of older bots, which are largely run by major organisations.

Figure 4.4: Content Consumption: Likes performed, Favouriting behaviour. (Panels: (a) Tweets favourited (liked) by a user; (b) Number of favourites performed vs. age of the account.)

4.3.4 Account Reciprocity

As well as content popularity, I also measure reciprocity (i.e. friendship). Twitter classifies two kinds of relationships: a reciprocal follower relationship, i.e. when two accounts follow each other, and a non-reciprocal relationship, i.e. when an account has many followers whom it does not follow in return (often the case for celebrities). This is measured via the follower-friend ratio. Figure 4.2d shows the empirical distribution of the follower-friend ratio for each group of accounts. Humans display higher levels of friendship (G10M+: 4.4×, G1M and G100k: 1.33×, G1k: 15×) and thus a lower follower-friend ratio than bots.

Previous research [18] argues that humans typically have a ratio close to 1; however, my analysis contradicts this assumption. For celebrity, very popular and mid-level recognition accounts this ratio is of the order of thousands-to-1, irrespective of whether an account is a bot or a human (G10M+: 629,011-to-1 for bots vs. 144,612-to-1 for humans; G1M: 33,062-to-1 for bots vs. 24,623-to-1 for humans; G100k: 2,906-to-1 for bots vs. 2,328-to-1 for humans). In fact, even the ratios for low-popularity accounts are not 1, but consistently greater (G1k: 30-to-1 for bots vs. 2-to-1 for humans). This is caused by the human propensity to follow celebrity accounts (who may not follow in return), as well as the propensity of bots to indiscriminately follow large numbers of other accounts (largely in the hope of being followed in return).

4.3.5 Tweet Generation Sources

In this subsection I inspect the tools used by bots and humans to interact with Twitter. This is possible because each tweet is tagged with the source that generated it; this might be the website, a third-party app, or a tool that employs the Twitter API. Figure 4.5a presents the number of sources used by human and bot accounts of varying popularities. Bots would be expected to use a single source (i.e. an API or their own tool) for tweeting.

Figure 4.5: Tweet Sources: Count of Activity Sources, Type of Activity Sources. (Panels: (a) Activity sources used by a user, with the red dot marking the mean of the distribution; (b) Bar chart of accounts that use each type of Twitter source, covering SocialFlow, Twitter Web Client, Twitter for mobile, TweetDeck, Buffer, IFTTT, Hootsuite, ICGroupInc, Sprinklr, SnappyTV, Spredfast, dlvr.it, twittbot, SocialOomph, Instagram and UberSocial.)
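The per-tweet source behind Figure 4.5 is recorded by Twitter as an HTML anchor naming the generating application, e.g. <a href="..." rel="nofollow">Twitter Web Client</a>. A minimal sketch of how it can be extracted and counted per account (function names are mine):

    import re

    SOURCE_RE = re.compile(r'>([^<]+)</a>')

    def source_name(status):
        """Application name from the 'source' attribute of a tweet;
        falls back to the raw field for sources without an anchor."""
        match = SOURCE_RE.search(status.get('source', ''))
        return match.group(1) if match else status.get('source', 'unknown')

    def sources_per_account(statuses):
        """Number of distinct activity sources per account (Figure 4.5a)."""
        seen = {}
        for status in statuses:
            uid = status['user']['id_str']
            seen.setdefault(uid, set()).add(source_name(status))
        return {uid: len(names) for uid, names in seen.items()}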
Surprisingly, bots actually inject tweets using more sources than humans (see Table 4.2). To explore this further, Figure 4.5b presents the number of accounts that use each observed source. The expectation is that humans utilise multiple sources (such as the Web interface, apps and third-party tools), more so than bots (which may not always be programmed to switch from an API to a third-party service, or vice versa). Somewhat contrary to this expectation, bots use a multitude of third-party tools. Bot news services (especially from G10M+) are found to be the heaviest users of social media automation, management and scheduling services (SocialFlow, Hootsuite, Sprinklr, Spredfast), as well as a cloud-based service that supports live video editing and sharing (SnappyTV). Some simpler bots (from the G100k and G1k groups) use basic automation services (Dlvr.it, Twittbot), as well as services that post tweets by detecting activity on other platforms (IFTTT). A social media dashboard management tool (TweetDeck) is popular across most groups except G1k. Interestingly, it can also be seen that bot accounts regularly tweet using Web/mobile clients, pointing to the possibility of a mix of automated and human operation. In contrast, 91.77% of humans rely exclusively on the Web/mobile clients. That said, a small number (3.67%) also use a popular social media dashboard management tool (TweetDeck) and automated scheduling services (Buffer, Sprinklr). This is particularly the case for celebrities, who likely use the tools to maintain high activity and follower interaction; this helps explain the capacity of celebrities to so regularly reply to fans (§ 4.3.1).

4.3.6 Media Upload

Figure 4.6: Content Creation: URLs in tweets, Content uploaded on Twitter. (Panels: (a) URLs vs. content uploaded; (b) Size of CDN content in KByte uploaded by a user.)

In this subsection I inspect the actual content of the tweets being generated by the accounts. This is done using two features, the number of URLs posted by accounts and the size of media uploaded, where bots are expected to show their actual impact. Figure 4.6a presents a scatter plot of the number of URLs (y-axis) against content uploaded in KB (x-axis). Bots place far more external URLs in their tweets than humans (see Table 4.2): 162% more in G10M+, 206% more in G1M, 333% more in G100k, and 485% more in G1k. Bots are a clear driving force for generating traffic to third-party sites, and upload far more content on Twitter than humans. Figure 4.6b presents the distribution of the amount of content uploaded by accounts (e.g. photos). Account popularity has a major impact on this feature: bots in G10M+ have a 102× lead over bots in the other popularity groups, while humans in G10M+ have a 366× lead over humans in the other popularity groups. Overall, bots upload substantially more bytes than humans do (see Table 4.2): 141% more in G10M+, 975% more in G1M, 376% more in G100k, and 328% more in G1k. This is due to their ability to automate tasks, while humans are limited by their physical capacity. It is also worth noting that the content upload and URL inclusion trends are quite similar, suggesting that both are used with the same intention, i.e. spreading content.
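Measuring uploaded bytes requires a step beyond the tweet object, which lists media URLs and resolutions but not file sizes. One plausible approximation, sketched below under my own assumptions (the dissertation's exact accounting pipeline may differ), is to read the Content-Length header of each original-resolution media object:

    import requests

    def media_bytes(status):
        """Approximate the bytes a status placed on Twitter's CDN by
        summing the Content-Length of each user-uploaded media object;
        the ':large' suffix requests the original resolution that
        [sizes][large] marks in the tweet object."""
        total = 0
        for media in status.get('extended_entities', {}).get('media', []):
            url = media['media_url_https'] + ':large'
            resp = requests.head(url, timeout=10, allow_redirects=True)
            total += int(resp.headers.get('Content-Length', 0))
        return total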
Since bots in G10M+ mostly belong to news media, sharing news headlines is clearly a means of operating their business. This resonates with the well-known problem of catering for the demand of heavy users, which is well explored in cellular networks [28]. It potentially has a big impact on the traffic produced as well as on the required network capacity. Since the amount of traffic is correlated with cost and energy [84], identifying the content produced by a bot is a key step towards reshaping or optimising the way that service providers deal with this type of traffic and content.

Figure 4.7: Media (photos, animated images, videos) uploaded by bots and humans on Twitter. (Panels: (a) Number of photos (JPG/JPEG) uploaded by bots and humans per URI (http + https); (b) Number of animated images (GIF) per URI; (c) Number of videos (MP4) per URI.)

Figure 4.8: Visiting trends to popular URLs by bots and humans. (Panels, each ranking the most visited domains: (a) Human popular URLs, e.g. www.youtube.com, youtu.be, espn.go.com, www.oprah.com, m.facebook.com; (b) Bot popular URLs, e.g. www.sunfrogshirts.com, www.youtube.com, smarturl.it, paper.li, www.amazon.com; (c) Combined popular URLs.)

I can also inspect the specific types of media uploaded. The dataset reveals a significant presence of media content generated by bots. Figure 4.7 presents a scatter plot comparing the number of media items of each type uploaded per URI (one URI is a single object). It can be seen that both bots and humans upload significant quantities, but it is clear that bots contribute the most. In total, bots account for 55.35% (12.90 GB) of the total photo traffic uploaded on Twitter, 53.58% (1.56 GB) of the total animated image traffic, and 40.32% (6.48 GB) of the total video traffic. This is despite the fact that they constitute only 43.13% of the accounts under study and contribute 53.90% of the total tweets collected. Combined, bots account for 49.52% (20.95 GB) of the traffic uploaded on Twitter.

It is also worth noting that many bot accounts post URLs. In fact, 55.28% of all URLs are posted by bots, despite bots making up only 43.13% of the accounts. This is important because these URLs have the potential to trigger further traffic amongst the accounts that view the tweets. To explore this, Figure 4.8 presents the most popular domains posted by bots and humans. Significant differences can be observed.
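The domain rankings in Figure 4.8 can be derived from the expanded_url entries that Twitter attaches to every tweet. A sketch, reusing the hypothetical labels mapping from the earlier sketches:

    from collections import Counter
    from urllib.parse import urlparse

    def popular_domains(statuses, labels, klass, top=20):
        """Most-posted web domains for one account class ('bot' or
        'human'), as ranked in Figure 4.8."""
        domains = Counter()
        for status in statuses:
            if labels.get(status['user']['id_str']) != klass:
                continue
            for url in status.get('entities', {}).get('urls', []):
                domains[urlparse(url['expanded_url']).netloc] += 1
        return domains.most_common(top)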
For example, whereas humans tend to post mobile sites (e.g. m.youtube.com, m.facebook.com), bots rather post the desktop versions (e.g. youtube.com, facebook.com). One can observe a range of websites exclusively posted by humans, e.g. espn.com and oprah.com, and also a few URLs posted by bots but never by humans. These differences highlight the differing goals of bots and humans when posting content, with more well-known websites dominating the human dataset. For example, the most regularly posted URL in my bot dataset is sunfrogshirts.com, which is actually a website for purchasing bespoke t-shirts. This highlights a common purpose of media posting on Twitter: spam and marketing. Note that bots infiltrate the human popular URLs more often than humans infiltrate the bot popular URLs. This shows that bots can reach further due to their automated ability, and can considerably impact systems in unusual ways.

4.4 A World without Bots?

The previous section has discussed the characteristics that make bots and humans different. However, one of the most important things on Twitter is its social graph, i.e. the interconnections between users. Hence, in this section I briefly inspect the social impact or influence that bots have on Twitter, as well as the impact of removing them. In this context, influence is defined as the capacity or ability to drive an action, e.g. sharing an item (whether text, photo or video) on social media that induces or generates a response. Graphs throughout this section are created using Gephi (http://gephi.github.io).

4.4.1 How Influential are Bots?

I begin by inspecting the social influence that bots and humans exercise on Twitter. Influence (sometimes referred to as induction) is the phenomenon where the actions of an individual are affected by other individuals through social interaction. I therefore construct a graph of direct interactions, whereby vertices are users (bots or humans) and edges represent interactions, i.e. retweeted statuses, quoted statuses, replies, or mentions. As previous research shows [4], influence in OSNs is directional and position-dependent (i.e. dependent on position in the social network). Therefore, the influence of a user (vertex) in this context is the sum of the direct interactions (edges) it has been engaged in by other users (vertices). Note that in order to engage in a direct interaction, at least one user has to retweet, quote, reply to or mention the other user. Furthermore, each interaction has two perspectives from a user's viewpoint: (i) an influencer interaction, when a user belonging to one of these popularity groups exercises influence over another user; (ii) an influenced interaction, when a user is influenced by one of the users in these popularity groups.

To answer how influential bots are, I present interaction graphs that depict retweeted statuses, quoted statuses, replies, and mentions of bots and humans by their followers. I use two popularity groups, users with 10M and 100k followers, together with the users involved in the direct interaction, i.e. the influenced interaction. I do not present results for the 1M and 1k popularity groups as they show similar graphs and properties to the 10M and 100k groups, respectively. I use directed edges for the interaction graphs, where an edge is directed from the influencer to the influenced. The mean degree for the 10M popularity group is very similar for both bots (1.18) and humans (1.176).
This shows that both humans and bots are tightly intra-connected within their respective assortative neighbourhoods: the assortative intra-connectedness is stronger than the diversified inter-connectedness. I also find that bots (4.025) have almost 2× the mean degree of humans (2.164) for the 100k popularity group. This shows that bots have accumulated a large influence within both their assortative and their diversified neighbourhoods, partly driven by the more aggressive tweeting activity of the bots under study.

4.4.2 What happens if Bots disappear?

The above confirms that bots have significant influence on Twitter. Thus, an obvious question is: what would happen if all bots were blocked or removed from Twitter? This may shed light on the overall impact (positive or negative) that bots have, as has been topically studied for the UK-EU referendum [48] and the 2016 US Presidential Election [8]. If bots produce high amounts of content (tweets, URLs, content size), then their existence should be critical for intermediary connections (i.e. they should form centrality vertices that sit on critical paths). Such central nodes typically sustain the graph structure. Moreover, if bots are responsible for affecting content popularity (favouriting, retweeting, quoting), then they should be among the critical super-vertices. I look at these behaviours in the retweeting and quoting graphs as well as the replying and mentioning graphs.

Figure 4.9: Bots vs. Humans: graphs for retweets and quotes of the 10M popularity group. Black dots are vertices; edges represent an interaction. Red edges represent bots and blue edges represent humans. (Panels: (a) 10M bots and humans; (b) 10M when bots disappear.)

Figure 4.9 presents the influence graph of the 10M group for retweets and quotes. The density of edges (due to retweeting and quoting) for both bots (red) and humans (blue) emphasises the influence of these vertices within their network. Notice the two separate sub-graphs appearing for bots and humans, which confirms that most connections are between similar entities, i.e. bots following other bots, and humans following other humans. Despite the two separate sub-graphs, vertices of both entity types are also connected to each other, i.e. bots following humans, and humans following bots. This shows that intra-influence is stronger than inter-influence: bots influencing other bots is stronger than bots influencing humans, and vice versa.

Figure 4.10: Bots vs. Humans: graphs for retweets and quotes of the 100k popularity group. Black dots are vertices; edges represent an interaction. Red edges represent bots and blue edges represent humans. (Panels: (a) 100k bots and humans; (b) 100k when bots disappear.)

Figure 4.10 presents the influence graph for the 100k vertices for retweets and quotes; it exhibits profound differences to the 10M graphs. Inspection reveals that bots hold the social graph together, as they form the medium that connects vertices on the edge of the network. The effects are apparent in Figure 4.10b, which plots the same graph with all bots removed. This indicates that the human part of the 100k retweet graph is only loosely connected, i.e. bots play a significant role in influencing, and consequently propagating, content between humans. Though there are small human communities that seem to be tightly connected, the number of weakly connected components is much higher than that of strongly connected components.
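A minimal sketch of the graph construction and the bot-removal experiment, assuming networkx and a list of (influencer, influenced) pairs extracted from retweets, quotes, replies and mentions as defined in § 4.4.1:

    import networkx as nx

    def interaction_graph(interactions):
        """Directed influence graph: each edge runs from the influencer
        to the influenced user."""
        g = nx.DiGraph()
        g.add_edges_from(interactions)
        return g

    def without_bots(g, labels):
        """The same graph with every bot vertex removed, as plotted in
        Figures 4.9b and 4.10b."""
        h = g.copy()
        h.remove_nodes_from([v for v in g if labels.get(v) == 'bot'])
        return h

The connectivity comparison discussed above then reduces to calls such as nx.number_weakly_connected_components(h) versus nx.number_strongly_connected_components(h), evaluated before and after the removal.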
I also look at replies and mentions for the 10M and 100k groups in Figure 4.11, which exhibits substantially different trends to the retweet graph.

Figure 4.11: Bots vs. Humans: graphs for replies and mentions of the 10M and 100k popularity groups. Black dots are vertices; edges represent an interaction. Red edges represent bots and blue edges represent humans. (Panels: (a) 10M bots and humans; (b) 100k bots and humans.)

The density of edges (due to replies and mentions) for both bots (red) and humans (blue) shows a range of homophily and interconnectedness between bots and humans. The interconnectedness between bots and humans ranges from low in the 10M group to very low in the 100k group: the average degree of interconnectedness in the 10M group is 15.4 edges, whereas in the 100k group it is 2.7 edges. This observation highlights two important trends within this dataset: (i) since replies and mentions are direct one-to-one interactions, strong assortative behaviour is observed in both bots and humans; (ii) humans intra-connect more often than bots in the 10M group, whereas the trends for the 100k group are the exact opposite. This is partly driven by the propensity of automated bots to generate unsophisticated automated responses (e.g. spam). It is likely that suspecting humans do not respond to these direct messages from bots, especially those that seem automated or employ astroturfing. It is equally likely that naive or simplistic bots are not capable of responding to or engaging in direct messages from unwary humans.

4.5 Takeaways

Bots exercise a profound impact on Twitter. This chapter confirms a number of noteworthy trends: (i) bots generally retweet more often, while some humans can exhibit bot-like activity (G10M+); (ii) bots can post up to 5× more URLs in their tweets (§ 4.3.1); (iii) bots can upload 10× more content with their tweets; (iv) humans can receive as much as 27× more likes and 24× more retweets than bots (§ 4.3.2); (v) bots retweeting other bots is over 3× more regular than bots retweeting humans, whereas humans retweeting other humans is over 2× greater, indicating homophily (§ 4.3.2); (vi) humans favourite others' tweets much more often than bots do, though newer bots are far more aggressive in favouriting tweets to replicate human behaviour (§ 4.3.3); (vii) humans enjoy higher levels of friendship and usually form reciprocal relationships (§ 4.3.4); (viii) bots typically use many different sources for active participation on Twitter (up to 50 or more); and (ix) activity sources include basic automation and scheduling services (§ 4.3.5), used abundantly by bots and seldom by humans. These findings are summarised in Table 4.2.

Table 4.2: Feature inclination: B is more indicative of bots, H is more indicative of human behaviour, and ≈ is neutral (i.e. both exhibit similar behaviour). * represents the magnitude of inclination: * is a considerable difference, ** a large difference. signif. shows the statistical significance of each feature as measured by t-test.

Feature & value                        Fig.   10M+   1M     100K   1K     signif.
More user tweets                       4.2a   ≈      B*     B*     B*
Higher user retweets                   4.2b   H*     B*     B*     B*     99%
More user replies and mentions         4.2c   ≈      B*     B*     B      99%
More URLs in tweets                    4.6a   B**    B**    B**    B**    99%
More total content uploaded (KByte)    4.6b   B**    B**    B**    B**    95%
Higher likes received per tweet        4.3a   H**    H**    H**    B      99%
Higher retweets received per tweet     4.3b   H**    H**    H**    B      99%
More tweets favourited (liked)         4.4a   B**    H**    H**    H**    99%
More favourites by younger accounts    4.4b   B      H      B      B
Higher follower-friend ratio           4.2d   B**    B*     B*     B**
More activity sources                  4.5a   B*     B      B      B      99%

I have also shown that bots inject significant proportions of network traffic via the uploading of media (§ 4.3.6), and I found clear differences between the URLs and content posted by bots vs. humans. By regularly posting links, I posit that bots trigger further traffic generation amongst their followers. I therefore suggest that Twitter, and similar services, should begin to explicitly factor this into their infrastructural design. Such bots, for example, could be downgraded in terms of Quality of Service priorities, or even have their uploads buffered/delayed until off-peak hours.

In this chapter I performed a measurement study that encompassed feature extraction and an in-depth analysis for differentiating bots from humans, distinguishing their activities and impact on Twitter. I conclude this chapter by saying that bots have an existential impact on social media, and I believe understanding their activities has inherent scientific value. The scale of their role within Twitter is equal to that of humans and, as such, this chapter was intended to pave the way for a reliable bot detection tool (Chapter 5).

Chapter 5

Detecting Social bots

Chapter 4 utilised Stweeler to collect a large Twitter dataset, and extracted and studied features in depth to acquire a wide array of attributes that distinguish bots from humans. In this chapter I present a methodology and implement a model for the non-partisan classification of Twitter users into bots and humans, by refining the preprocessing and partitioning of datasets, creating and using a large human-annotated dataset as ground-truth labels, and extracting the most relevant feature-sets (via ablation tests) for each popularity group.

To perform accurate classification I use the partitioned human-annotated dataset (§ 3.4.1–3.4.3), which compensates for the differences present due to account popularity. To judge the accuracy of the procedure I calculate agreement among the human annotators as well as with a bot detection research tool. Treating account categorisation on Twitter as a binary classification problem, I then apply a Random Forests classifier to the dataset; by performing ablation tests I identify the most insightful feature-sets for each popularity group, with which the classifier achieves an accuracy close to human agreement. Finally, as a concluding step, I perform tests to measure the efficacy of the results.

5.1 Introduction

The existence of bots is making a real impact on our daily lives. For instance, Facebook employed automated techniques to populate, curate and tweak its trending news module, which led to disastrous results (Facebook trending news module, last accessed 16 June 2018: https://www.theguardian.com/technology/2016/aug/29/facebook-fires-trending-topics-team-algorithm). The algorithm started populating the trending news feed with false and controversial stories that pushed the questionable content even further.
Microsoft's Tay was a bot operating a Twitter account that learnt to mimic human speech patterns by interacting with other users through tweets and replies. The experiment had to be terminated when Tay was taught hate speech and racism (Microsoft's Tay, last accessed 16 June 2018: https://www.theguardian.com/technology/2016/mar/24/microsoft-scrambles-limit-pr-damage-over-abusive-ai-bot-tay). This highlights that automated conversation and content dissemination may take an unexpected turn that users may find offensive and harmful. Recently, an MIT scientist programmed a Twitter bot, DeepDrumpf, that tweets like the US president Donald Trump (last accessed 16 June 2018: http://uk.businessinsider.com/how-donald-trump-talks-2016-9). The bot uses an AI algorithm to learn Trump's style of speech by going through debate transcripts. This exemplifies the other side of the coin: the recent research trend of automating content generation and mimicking people on Twitter.

Contributions of this chapter: The goal of this chapter is to classify Twitter users as bots (accounts that tweet via a scheduling tool or an automated program that uses the Twitter API) or human users. This chapter focuses on the following: (i) use of raw historical data (60 million tweets) for attribute collection and account classification (722,109 tweets), to cater for stealthier bots that are harder to discern from humans; (ii) a Twitter dataset divided into user popularity groups, further partitioned into lists of bots and humans (for the reasons, refer to § 5.2) using a human annotation task, which serves as a large ground-truth dataset; (iii) 14 novel features from a total feature-set of 22 attributes (see § 5.2); (iv) a performance evaluation of the current state of the art in bot detection, by calculating agreement between human annotators and BotOrNot; (v) application of a supervised learning approach, a Random Forests classifier, for non-partisan account categorisation; (vi) identification of a distinct group of features (using ablation tests) that are most informative for classifying bots within each popularity group (see Table 5.7); and (vii) verification of my hypotheses (see Table 3.1) against my findings using t-tests (see § 5.4).

An implemented research tool that offers an API is BotOrNot [22, 83] (since rebranded as Botometer, last accessed 16 June 2018 at https://botometer.iuni.iu.edu/), which uses six feature-sets and a Random Forests classifier to output a bot-likelihood score for a given Twitter account. I carry out a well-defined human annotation task (see § 5.2) and compare its labels to the BotOrNot annotations. In these experiments, I found that BotOrNot produces an average agreement of 48% with the human annotators, while the average agreement among the human annotators is 89%.

5.2 Methodology

A tweet object is formed of attributes written in a JSON structure (see the Twitter Tweet Object documentation, last accessed 16 June 2018: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object). The Stweeler platform (Chapter 3; https://github.com/zafargilani/stcs) is used for collecting data, defining partitions, filtering data, calculating feature values and various other preprocessing tasks. This chapter extends Stweeler by designing a classification tool for bot detection. Full details about the partitioned dataset can be found in § 3.4.1. The features I consider in this study are defined in Table 3.1, and their details are explained in § 3.4.2. Details about the annotations of the partitioned dataset can be found in § 3.4.3. The annotated partitioned dataset is explored in detail in Chapter 4.

Hardly any past work objectively compares other detection or classification tools to its experiments. I use the BotOrNot HTTP REST API, which returns a bot-likelihood score for each Twitter account.
BotOrNot does not assign 'bot' or 'human' labels; instead, a 50% threshold (as stated on the BotOrNot website and confirmed in the authors' publications) is set as the boundary between an account being human (i.e. < 50% likelihood) and an account being a bot (>= 50% likelihood). I choose the 50% threshold in this chapter as logically indicated by the BotOrNot authors. Furthermore, the accuracy of BotOrNot across a variable threshold range (40% to 60%) proved similar to that at the 50% threshold. Whenever BotOrNot returns a bot-likelihood score of less than 50% the account is labelled 'human'; otherwise it is assigned a 'bot' label.

The assumption is that the human annotation task produces a dataset annotated with the labels that are the closest approximations of the "ground truth" labels, since the latter are, in general, unavailable (see the discussion in § 5.3). Furthermore, I use the agreement between the human annotators to benchmark the performance of the automated bot classification system.

I then calculate statistics for the various features listed in Table 3.1, and use a Random Forests classifier to perform three sets of experiments. First, I run a 5-fold cross-validation experiment in which I use 4 folds to train and 1 fold to test the classifier in each run, with each fold containing subsets of all popularity groups, and report the results averaged across all 5 runs. Second, I report the results on the data originating from each of the popularity groups in particular. Third, I test how generalisable the features are; for that, I train the classifier using sets of 3 popularity groups and test it on the 1 remaining popularity group in each run. I also perform ablation tests: starting with the full feature-set, I remove features one by one in order to detect the minimal optimal feature combination that yields the best results on the task. Features that show up most often in the best-performing feature splits in these experiments include the followers-to-friends ratio, user retweets, tweet frequency and URLs count. Finally, I obtain the classified datasets as well as the best features and their respective feature splits. Results of the annotation task and bot classification are presented in § 5.3 and § 5.4, respectively.

5.3 Human Annotation Task

The annotation task fulfils two goals. First, it is used to derive the ground-truth labels for the machine learning experiments presented in § 5.4. The information provided by Twitter users on their accounts is not a reliable way to discern an account type: depending on the goals of a Twitter account operated by a bot, it may or may not self-identify as such; e.g. if the goal is to spread false information and malicious content, the bot may pretend to be a human. Second, the human annotation task helps estimate how accurately humans can identify bots on Twitter. This provides a very useful point of comparison for the machine learning experiments presented in § 5.4. The ultimate goal of this chapter is to implement an automated tool for bot classification on Twitter that performs comparably to humans, as it might be unrealistic to expect it to outperform humans.
I will therefore compare the performance of the classifier presented in § 5.4 to the inter-annotator agreement.

For details on the human annotations see § 3.4.3. The Twitter data within each popularity group has been independently annotated by 4 annotators. Each account is marked as either human or bot, and the final ground-truth labels are used (in the machine learning experiments that follow) iff a majority vote holds among all annotators. This majority vote is the final annotation derived from the four annotations. If there is a tie (i.e. a 2-2 vote split among the annotators), the account is discussed among the annotators and re-annotated for a majority vote (i.e. for a final annotation). Table 5.1 reports the average pairwise inter-annotator agreement across all popularity groups. In addition, I report the average annotators' agreement with the final annotation, and the average agreement of the annotators with the labels assigned by BotOrNot (BON) [22]. The inter-annotator agreement in Table 5.1 is reported on a scale from 0% to 100%, with 0% showing a lack of agreement and 100% being perfect agreement.

Table 5.1: Average inter-annotator agreement (%-age).

Ann     G10M+   G1M     G100k   G1k
An1     94.50   82.14   73.15   91.32
An2     95.50   79.46   72.02   89.75
An3     95.50   75.63   68.32   86.87
An4     90.50   79.69   70.88   90.72
Avg     95.58   80.65   73.00   90.40
Final   96.00   86.32   80.66   93.35
BON     46.00   58.58   42.98   44.00

Table 5.2 reports Cohen's kappa (κ) coefficient, widely used in annotation experiments for assessing how reliable the annotators' judgements are, or for determining "the degree, significance, and sampling stability of their agreement" [20]. This coefficient takes into account the observed agreement between the annotators, p_o, as well as the agreement expected by chance, p_c, which is estimated by finding the joint probabilities of the marginals. The κ coefficient is calculated as follows:

\kappa = \frac{p_o - p_c}{1 - p_c} \qquad (5.1)

Following the interpretation of κ values provided by [56], it was concluded that the annotators in this experiment achieved moderate (κ ∈ [0.41, 0.60], for G100k) to substantial (κ ∈ [0.61, 0.80], for G1k and G1M) to almost perfect (κ ∈ [0.81, 0.99], for G10M+) agreement, which can be considered reliable in all cases. It is also worth noting that the agreement of BotOrNot with the human annotators ranges from less than chance (a negative κ, for G1k, G100k and G10M+) to merely slight (κ ∈ [0.01, 0.20], for G1M), which shows that the human annotators almost always disagree with the labels assigned by BotOrNot. These evaluation results are similar to what is reported by Cresci et al. [21].

Table 5.2: Average Cohen's κ (×100).

Ann     G10M+   G1M     G100k    G1k
An1     89.00   63.26   46.37    81.68
An2     90.93   57.90   44.21    77.99
An3     90.93   50.41   36.69    72.17
An4     80.86   58.03   41.71    80.14
Avg     85.15   60.27   46.05    79.58
Final   91.96   71.76   61.28    85.91
BON     -8.69   01.90   -14.46   -14.70

Interestingly, G100k shows the highest disagreement. Less distinctive properties within this group make these accounts similar to each other: e.g. the annotators reported that a number of accounts within this group seemed to be initially bot-operated but were personalised later as human users started actively using them, and vice versa. Exploring this further, I found that in some cases new users initially made use of third-party apps and services such as SocialFlow, Hootsuite and Sprinklr to post pre-written messages. The reasons for using such services vary, whether transitioning from human-operated to bot-operated or vice versa, e.g.
scheduling tweets while away or passively monitoring; acquiring new followers; experimenting with or 'trying out' new apps or services and then discontinuing them; or initially posting manually but then signing up to use third-party services exclusively to interface with Twitter.

Based on the results of the annotation task I conclude that: (i) the annotators mostly agree when they assign labels to the Twitter accounts, and the annotation can be considered reliable for all groups; (ii) the annotators label 43.13% of the accounts as bots; (iii) BotOrNot does not perform well on the given data and shows a considerably large disagreement with the human annotators' votes; and (iv) I set the human annotation-based benchmark for the machine learning experiments reported in § 5.4 at 87.42, i.e. the average observed agreement of the annotators with the final labels on the whole dataset spanning all four popularity groups.

5.4 Classifying Bots and Humans

I approach bot classification on Twitter as a binary classification task. Previous research [18] distinguished between bots, humans and cyborgs: accounts that are partly operated by humans and also include automation, thus having properties of both bots and humans. However, confusion surrounds when a cyborg is a bot-operated human account and when it is a human-operated bot account. This confusion arises because operational observation of an account leaves traces of activity that point towards both automated and human actions. In this work, I choose to perform binary classification distinguishing between bots and humans only, because accounts that consistently involve automation (e.g. automated tweeting) should be characterised as automated accounts. As noted in § 5.1, the primary goal is to present a thorough methodological mechanism that allows the identification of Twitter accounts as bots or humans using supervised classification.

I had a number of choices for the classification task, but two obvious ones: Naive Bayes and Random Forests. Naive Bayes is a simple classification technique based on Bayes' theorem, with the strong assumption that the predictors (or features) are independent. Naive Bayes uses Bayes' theorem to calculate the posterior probability P(c|x) (Equation 5.2), i.e. the statistical probability that a hypothesis is true (in this case, that an entity is a 'bot') in the light of relevant observations (in this case, the features), from the prior probability of the class P(c), the likelihood P(x|c) and the prior probability of the predictor P(x):

P(c|x) = \frac{P(x|c)\,P(c)}{P(x)} \qquad (5.2)

Naive Bayes assumes that every feature is independent of every other feature; the properties corresponding to all of these features, e.g. tweeting behaviour and URLs in tweets, would therefore independently contribute to the probability that an entity is a 'bot'. Though easy to build, scaling well to large datasets, and having linear processing times, the model suffers from several drawbacks: it is fragile to overfitting, it underperforms for numerical data in favour of categorical data (i.e. data representing non-numerical characteristics, such as binary classes), and its predictions are recommended to be taken as raw estimations. Given that the dataset is multivariate, both categorical and numerical, these problems need to be mitigated. Random Decision Trees [44], or Random Forests,
are an ensemble learning method that operates by constructing a multitude of decision trees and producing the prediction class that receives the majority vote (the mathematical mode of the classes). The idea behind Random Forests is to use a number of average predictors to make a strong final prediction. Random Forests are therefore influenced by Adaptive Boosting (AdaBoost), which trains a classifier of the form F_T(x) = \sum_{t=1}^{T} f_t(x), where f_t is a weak learner in a setting of T learners. Each weak learner produces a hypothesis h(x_i) for each sample i. A weak learner is selected per iteration t and assigned a coefficient \alpha_t such that the summed training error E_t of the resulting classifier is minimised:

E_t = \sum_i E\left[ F_{t-1}(x_i) + \alpha_t h(x_i) \right] \qquad (5.3)

Random Forests combine tree bagging with the construction of forests of similar trees. The bagging procedure involves repeatedly sampling a training set X with responses Y, B times, to fit trees to the training samples. For b = 1, ..., B: (i) n training examples (X_b, Y_b) are sampled, and (ii) a classification tree f_b(X_b, Y_b) is trained. After training, predictions for test samples X' are obtained by taking the majority vote (mathematical mode) of the classification trees:

\hat{f} = \operatorname{mode}\{ f_b(X') : b = 1, \ldots, B \} \qquad (5.4)

While predictions by single trees are sensitive to noise in the training samples, the majority vote mitigates this, leading to better model performance in terms of accuracy. Furthermore, the larger the training sample the better the prediction, as the bagging procedure is designed to de-correlate the decision trees. Additionally, Random Forests are robust against overfitting and give better accuracy as the sample size increases. I apply the Random Forests classifier implemented in the scikit-learn toolkit [67] (http://scikit-learn.org/) with 100 decision tree estimators.

But first, let us define the benchmarks against which the automated account classification system is evaluated. The lower bound is set as the majority class distribution in the data, which for all popularity groups is equal to the proportion of accounts that belong to humans. In other words, if the automated account classification system always "guesses" that an account belongs to a human, then it will perform at the majority-class baseline level. Next, I use the average observed inter-annotator agreement between each of the annotators and the final annotation, which indicates how well humans perform on this task, as it may be unrealistic to expect an automated system to outperform humans (see § 5.3). Finally, I also include the average agreement between the annotators and the labels assigned by BotOrNot. Table 5.3 reports these estimates for each of the popularity groups as well as the average across all data points in the whole dataset.

Table 5.3: Dataset benchmarks.

Group   Majority baseline   Human agreement   BON
G10M+   52.00               96.00             46.00
G1M     60.50               86.32             58.58
G100k   51.24               80.66             42.98
G1k     61.41               93.35             44.00
Total   56.28               89.08             47.89

In addition to the dataset benchmarks, I also verify that the sample set of annotations is representative of its population. In this validation experiment I take varying sizes of training data (to train the classifier model) and test against a validation sample of 100 annotations. The training data is taken from the human-annotated dataset (see § 5.3), and ranges from 1,000 to 3,000 randomly selected annotations.
The 100 annotations for validation purposes are also taken from the human-annotated dataset, and are not repeated in the training data. I carry out two validation experiments: (i) randomised lists that do not have repeated data points among the lists, and (ii) randomised lists that may have repeated data points among the lists.

Table 5.4: Validation results.

Training sample size   Acc validation exp (i) (%)   Acc validation exp (ii) (%)
1,000                  81                           83
1,500                  79                           79
2,000                  72                           78
2,500                  79                           82
3,000                  79                           80

Table 5.4 shows that the set of annotations obtained from the human annotators is indeed sufficient. For all of the training sample sizes tested, the prediction accuracy of the classifier model remains at acceptable levels, ranging between 72% and 83%, and usually remaining around 80%. The classifier model hits a low point at 2,000 samples, which shows that the 2,000 training annotations in validation experiment (i) differed the most from the testing annotations.

Next, I perform three types of machine learning experiments (see § 5.4.1, 5.4.2 and 5.4.3), aimed at detecting how informative and generalisable the features overviewed in § 5.2 are for this task. For each of the experiments, I report the accuracy of classification (Acc), which shows the proportion of bot and human accounts that the classifier identifies correctly, and the precision (P), recall (R) and F1 measures on the class of bots, which show the classifier's performance in identifying bots specifically.

5.4.1 Classifying bots by training and testing on all groups with 5-fold cross-validation

In the first experiment, I apply 5-fold cross-validation: I split the data into 5 non-overlapping folds, each containing an approximately equal proportion of data points from each of the popularity groups, as well as a similar distribution of human and bot accounts. The classifier is then run over the folds, using each of the 5 folds as a test set once and training the classifier on the other 4 folds for each of the runs. Figure 5.1 illustrates this experiment.

Figure 5.1: Classifying bots by training and testing on all groups with 5-fold cross-validation.

The first row (Total) of Table 5.5 reports the results obtained with the best-performing feature-sets. This type of test determines the general accuracy of the classifier. The next step is to run ablation tests to detect the most optimal feature-set, i.e. the minimal feature-set that yields the best accuracy. The ablation tests show that, among the total of 22 features used in this work, 12 features score among the most informative across all 5 folds in the cross-validation experiment. These include user replies, retweets per tweet, tweet frequency, age of account, followers-to-friends ratio, favourites-to-tweet ratio, URLs count, and S1, S2, S3, S5, S0. Note that the human annotators also mentioned similar characteristics as strong indicators. A group of 6 other features score well for 4 out of 5 folds: user tweets, user retweets, user favourites, likes/favourites per tweet, lists per user, and S4. Based on these results, and in conjunction with Chapter 4, I conclude that features that represent content propagation (frequent tweeting, retweeting, posting URLs with tweets) and user engagement (following, receiving likes, receiving retweets, subscribing to lists) are overall the strongest predictors of automation.
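The ablation procedure can be sketched as a greedy backward elimination over the feature columns. This is an illustration of the idea rather than the exact Stweeler implementation; the feature matrix X (numpy), the labels y and the feature names are placeholders of mine:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def ablate(X, y, feature_names):
        """Drop one feature at a time, keeping the drop that scores best
        under 5-fold cross-validation, and remember the best-scoring
        feature subset seen along the way."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        remaining = list(range(X.shape[1]))
        best_subset = list(remaining)
        best_score = cross_val_score(clf, X, y, cv=5).mean()
        while len(remaining) > 1:
            scores = {}
            for f in remaining:
                cols = [c for c in remaining if c != f]
                scores[f] = cross_val_score(clf, X[:, cols], y, cv=5).mean()
            drop = max(scores, key=scores.get)
            remaining.remove(drop)
            if scores[drop] >= best_score:
                best_score, best_subset = scores[drop], list(remaining)
        return [feature_names[c] for c in best_subset], best_score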
Interestingly, the activity source count and CDN content size considered in this experiment do not score as frequently among the most discriminative features on the data that combines all popularity groups. The annotators noted that the use of the Twitter API or an automated activity source was a strong indicator of automated behaviour on Twitter. This is confirmed by the nature or type of the activity sources (S1 = browser, S2 = mobile apps, S3 = management, S5 = marketing, and S0 = all other services), all of which are strong indicators of automation.

5.4.2 Classifying bots by training on all and testing on specific groups with 5-fold cross-validation

In the second experiment, I train the classifier using the same 5 training folds containing data from all popularity groups, but report the results and run the ablation tests on the subsets of the test data that belong to each of the 4 popularity groups separately. Figure 5.2 describes the design of this experiment. In essence, the classifier is trained on the features that describe accounts from all 4 groups, but is then applied to the test data from one particular popularity group. (Note that the data in the training and test sets is non-overlapping as before: i.e. each of the 5 test folds contains a different 20% of the data, with the rest being used for training.) This experiment helps discriminate between the results obtained on the data points originating within the different popularity groups.

Figure 5.2: Classifying bots by training on all and testing on specific groups with 5-fold cross-validation.

Table 5.5 reports the results. Note that the performance follows similar trends to those reported for the human annotation experiments (see Table 5.1 and Table 5.2): the classifier performs best on G10M+ and worst on G100k, just as the human annotators reach their highest agreement on G10M+ and lowest on G100k.

Table 5.5: Machine learning experiments results.

Group   Acc      Pbots    Rbots    F1bots
Total   86.44    85.40    82.20    83.60
G10M+   100.00   100.00   100.00   100.00
G1M     91.76    90.60    88.00    89.40
G100k   85.70    85.60    85.40    85.60
G1k     88.25    87.80    80.80    84.00

Interestingly, when I train the classifier on the data from all popularity groups and measure its performance on specific groups, the classifier's accuracy on G10M+, G1M and G100k is above human agreement, and closely approaches human agreement on G1k (see Table 5.5 and Table 5.3). The most informative features include retweets per tweet, lists per user, tweet frequency, CDN content size, and S2, S4. Note that features such as age of account, follower-to-friend ratio, favourites-to-tweet ratio and URLs count were informative when the data was combined from all popularity groups, but are not discriminative when the popularity groups are examined separately. On the contrary, features such as lists per user, CDN content size and S4 = automation services were not informative for the combined data but are discriminative when the popularity groups are observed separately.

5.4.3 Cross-group experiments

Next, I test how well the system generalises across the popularity groups with respect to the features used. For that, for each popularity group I train the classifier on the data from the other 3 popularity groups and apply it to that particular group. The experimental design is described in Figure 5.3, and the results are reported in Table 5.6.

Figure 5.3: Cross-group experiments.

Precision, i.e. how many selected samples are relevant (usefulness), and recall, i.e.
how many relevant samples are selected (completeness), are computed in the standard way (see https://en.wikipedia.org/wiki/Precision_and_recall). Similarly, F1 scores (https://en.wikipedia.org/wiki/F1_score), i.e. the harmonic mean of the precision and recall scores, are computed to assess the accuracy of the test (in this case, the prediction).

Table 5.6: Cross-group experiments results.

Group   Acc     Pbots   Rbots    F1bots
G10M+   90.00   83.00   100.00   91.00
G1M     86.73   83.00   82.00    83.00
G100k   81.65   82.00   80.00    81.00
G1k     84.17   87.00   70.00    77.00

Note that the classifier performance is consistently high for all groups, reaching its highest for G10M+. This effect might also be due to the sizes of the training and test sets: the ratio is highest for G10M+, with 3,486 training and 50 test cases, and lowest for G100k, with 2,089 training and 1,447 test cases. Nevertheless, note that the performance on all groups is stable, with the accuracy significantly above the majority-class baseline as well as the BotOrNot performance (see Table 5.3).

Also note the effect of the training data size on the generalisability of the feature-set itself: the largest training set, for G10M+, allows the classifier to achieve an accuracy of 90.00% using only 7 features (user replies, follower-to-friend ratio, tweet frequency, favourites-to-tweet ratio, and S4 = automation services, S5 = marketing, S6 = news content web services), while the smallest training set, for G100k, allows the classifier to achieve an accuracy of 81.65% relying on 16 out of the total of 22 features. The features most informative across all the groups include age of account, user replies, retweets per tweet, tweet frequency, favourites-to-tweets ratio, and S4 = automation services, S5 = marketing, S6 = news content web services. I conclude that this set represents the most generalisable features, which are largely independent of the type of account (i.e. popularity level). Also note that these features are in general consistent with the features that score well in the other experiments, as well as with the account properties that the human annotators considered important when making their decisions (see § 5.3).

5.4.4 Hypotheses testing

Finally, I check and report whether the features used in this work comply with my original hypotheses. For instance, I had expected that bots tweet more aggressively than humans do and, thus, that the average tweet frequency should be significantly higher for bot accounts than for human ones. In this last set of experiments, I apply a t-test to the features for the humans and bots within each group and report: (i) whether the difference is statistically significant, and (ii) whether it supports my original hypotheses in terms of the sign of the difference between the means. Table 5.7 reports the results: I use + where the values for bot accounts are higher than those for human accounts, and - where human accounts have higher values; ** denotes statistical significance at the 99% confidence level and * at the 95% confidence level.

Table 5.7: Feature significance.
Feature                        10M    1M     100K   1K     All
Age of account                 +**    +      -      -**    -
Favourites-to-tweets ratio     -*     +      -      -**    -
Lists per user                 -*     +**    +*     +**    -
Followers-to-friends ratio     +      +      -      +**    +
User favourites                +      -      -**    -      -**
Likes/favourites per tweet     -**    N/A    N/A    N/A    -**
Retweets per tweet             -**    N/A    N/A    N/A    -**
User replies                   -      +      +      +      +**
User tweets                    -      +*     +**    +**    +
User retweets                  -      +**    +**    +**    +**
Tweet frequency                +      +**    +**    +**    +**
URLs count                     +      +      +**    +**    +**
S1 = browser                   +      +      -      -      -
S2 = mobile apps               -**    -**    -**    -**    -**
S3 = OSN management            +*     +**    -      -      +**
S4 = automation                +**    +**    +**    +**    +**
S5 = marketing                 +*     +*     +**    +**    +**
S6 = news content              +*     +      +      N/A    +*
S0 = all other                 +*     +**    +**    +**    +**
Source count                   +**    +**    +**    +**    +**
CDN content size               +      +      +**    +**    +*

Note that these results are generally in accordance with the assumptions, and also corroborate the annotators' feedback as well as the classification results: e.g. tweet frequency, S2 = mobile apps, S4 = automation services, S5 = marketing, S0 = all other services, and source count show the highest statistical significance overall. To summarise, there are several trends worth noting:

• Age of account is a good predictor at the extreme ends of the popularity groups. Within the high popularity groups the bot accounts (e.g. those of news agencies) are significantly older than the human accounts (e.g. those of celebrities). At the lower popularity levels, the difference is exactly the opposite, with the human accounts being significantly older than the bot accounts.

• Humans in the high-popularity G10M+ follow significantly more lists than bots, while within the other groups bots join significantly more lists.

• Humans in the high-popularity G10M+ post more replies, and also tweet and retweet more than bots. Within the other popularity groups the trends are exactly the opposite.

• The number of URLs posted, as well as the CDN content size, is higher for bots across all popularity groups, but the difference becomes statistically significant only for G100k and G1k; within G10M+ and G1M humans post many URLs as well.

• S2 = mobile app usage is significantly higher for humans than for bots in all popularity groups.

• Usage of S4 = automation services, S5 = marketing and S0 = all other services is significantly higher for bots than for humans in all popularity groups.

• S3 = OSN management seems to be employed by bots in G10M+ and G1M, while the opposite is true for G100k and G1k.

• The source count is significantly higher for bots in all popularity groups.

5.5 Takeaways

In this chapter I developed and evaluated a thorough mechanism to reliably classify automated bots and human users on Twitter, using a dataset divided into four popularity groups. I used a human annotation task to augment and refine the original ground-truth labels (Chapter 4), and verified the annotations using inter-annotator agreement among the human annotators and BotOrNot (a bot detection research tool). Using a Random Forests classifier I performed three different machine learning experiments. The classifier yields an accuracy that is on a par with human agreement for all four popularity groups. I reported on how different feature splits perform for the different experiments, and noted that 6 features show the highest statistical significance overall.

The human annotation experiment (§ 5.3) shows that people pay attention to the content of the tweets: e.g.
For example, human annotators cited the style and pattern of the tweets as strong indicators of bot-operated accounts, and also noted that an abundance of promotional and depersonalised content strongly suggested that the account was operated by an automated bot. In this chapter, URLs count was used as one of the features to analyse the tweet content, with a higher number of URLs suggesting promotional and depersonalised content. To supplement this, it is possible to explore whether bots fall into particular topical divisions and exhibit sentiments that are similar to humans (as also suggested in Chapter 4). In Chapter 6 I address the above and explore bot categories by defining a methodology that employs unsupervised learning to define unlabelled bot clusters. Next, I label these clusters using distinctive features in order to be able to make sense of the analyses that follow. I then focus on content analysis using topic modelling and sentiment analysis to distinguish between various bot categories.

Chapter 6

Typification of Social bots

In Chapter 5 I explored bot detection that employed supervised learning (classification) to discern bots from humans. However, social bots are not unitary. Instead, bots exist in various shapes and forms, and range from semi-automated to fully automated entities. This chapter utilises the work done in Chapters 4–5 to extend Stweeler (Chapter 3) for a deeper understanding of the bot phenomenon. In order to explore bot categories I extend Stweeler to design a set of unsupervised machine learning methods. I evaluate models based on their purpose and output in order to pick and implement the most suitable method for defining unlabelled bot clusters. Next, I label these clusters using distinctive features in order to be able to make sense of the analysis that follows. My focus then shifts towards content analysis using topic modelling and sentiment analysis to distinguish between various bot categories. However, Twitter by default does not offer geolocation information (for privacy purposes) or IP addresses (because it is an application-layer service). Network-level information is necessary to detect bots that exist on the Web but can impact content popularity and activity on Twitter. I set up and use a bot account on Twitter to collect this supplementary dataset and conduct the aforementioned analyses. I conclude with compelling evidence that bots exist in diverse forms and shapes and have a diverse existence (on Twitter or off it), while maintaining many similarities but also a large array of differences.

6.1 Introduction

Most recent works have tended to focus on identifying bots and studying their role in particular settings, e.g. political infiltration. The limited scope of the latter is largely driven by the difficulty of understanding bot behaviour without a priori context to explain their actions. This is particularly challenging at scale, simply due to the huge diversity of bots: without knowing approximate intentions (e.g. supporting a political candidate, promoting a commercial product) it is near-impossible to explain their actions. The lack of generalisable tools for categorising "types" of bots has led to a range of ad hoc techniques being applied in the above studies. Although sometimes effective, this approach has severe implications for reproducibility and, perhaps more importantly, makes the analysis of new datasets extremely difficult (due to the need to develop new methodologies).
Hence, I posit that a generalisable and modular methodology is required to allow any researcher to easily (i) identify bots within a social media dataset, and (ii) classify them into "types" of bots for further analysis. I aim to deliver this goal while enforcing two constraints: (i) using an unsupervised learning approach that is flexible and applicable to various datasets, and (ii) simplifying and automating the learning process by removing prerequisites such as a human or manual annotation task to label datasets. Unsupervised learning further helps alleviate the issues of subjectivity, misaligned decision boundaries, and pre-annotated classifications; problems common in supervised settings.

Contributions of this chapter: With the above goals in mind, I extend Stweeler (Chapter 3) – a data collection, measurement, feature extraction, bot detection and analysis framework. To explore bot categories I begin by performing a large-scale measurement and analysis campaign on Twitter (§ 6.2) via Stweeler. Using the Stweeler bot classifier developed in Chapter 5, bots are detected through classification from the datasets. I then decompose the bots into a set of clusters exhibiting similar traits – I term this process "bot typification". To achieve this, I develop an unsupervised clustering task to create unlabelled clusters from features (§ 6.3). These clusters are derived from the quantified behavioural and social properties of the accounts, grouping users based on traits such as retweeting rates, number of followers, etc. (see Table 6.2). Through a series of topical analyses, I then strive to generate labels for these groups based on the principal components of discussion within each cluster.

Once the clusters have been defined, I then explore their properties — starting with the innate characteristics of the eight clusters identified (§ 6.3.3). A range of behaviours are observed, with three highly populated clusters made up of bot accounts that follow well-known promotional strategies. These include favouriting a large number of tweets (for self-promotion), whilst receiving little attention in return (e.g. receiving few likes). However, I also discover five outlier clusters, the largest of which contains just 35 accounts. These tend to contain older bots and more popular bot accounts, sometimes even with celebrity status. For example, one cluster (#5) contains bots with an average of 405 likes per tweet compared to just 20 in another cluster (#0). Although intuitive, this empirically confirms that bots are not one shade but, instead, highly diverse, with varied patterns both in terms of their own behaviour and the reactions of others.

Following this characterisation, I then perform an in-depth analysis of several core aspects of bot activity to understand how it varies across the clusters identified (§ 6.4). I start by evaluating the types of software tools used by bots, as identified via the endpoint metadata contained within the tweet dataset (§ 6.4.1). This reveals a complex picture, where each cluster typically utilises a range of tools. That said, a few major players are identified – software specifically dedicated to tweet generation and management. Curiously, I also observe that less popular accounts tend to use a mix of toolkits and human intervention (e.g. web client). This is also mirrored across some more popular clusters, often driven by a few constituent celebrity accounts (e.g. alexburnsNYT).
Next, Latent Dirichlet Allocation (LDA) is used to identify topics of discussion within each cluster (§ 6.4.2). As the unsupervised learning technique solely uses quantified metadata for the clustering process, the clusters are formed independently of the tweet content itself. Hence, I discover that the clusters focus on a range of overlapping topics. Through this I label each cluster with a range of tags, particularly Advertisements & Marketing, Daily Affairs & Lifestyle, International Affairs, News, and Politics. I further investigate the content of the tweets by inspecting the sentiment and polarity (positive or negative) of the language used within each tweet (§ 6.4.3). Although all clusters broadly exhibit positive sentiment (i.e. > 0) and similar variance (0.0255–0.0572), I find a far greater spread of polarity. For example, it is found that one cluster (#5) has very low average polarity (0.0454), i.e. neutral content. This is because the cluster predominantly contains mainstream news and sports outlets, which post both highly positive and negative content. Finally, I inspect the content links that accounts include in their tweets (i.e. URLs). Although I find many examples of mainstream websites (e.g. youtube.com is the most popular across most clusters), I also observe various other URLs. These are largely dominated by a few accounts that contribute a disproportionately large number of URLs within each cluster. For example, one cluster (#2) contains links to elevatedfaith.com 926 times, just from a single account.

The method (this chapter), code/tool^1, and processed datasets^2 are available to the research community for further investigation and future research.

^1 Stweeler – https://github.com/zafargilani/stcs
^2 Datasets – http://www.cl.cam.ac.uk/~szuhg2/data.html

6.2 Preliminaries

In order to define and explore bot categories I build upon Stweeler (Chapter 3) and use it for data collection, pre-processing, feature extraction and classification tasks. In this section Stweeler is extended with bot typification capabilities via clustering and topic modelling (§ 6.3).

6.2.1 Data Collection and Pre-Processing

In order to explore the characteristics of various bot types, it is necessary to distinguish bot from human profiles. Detecting bots is important because the presence of human profiles could skew the results due to similarities. The purpose of clustering is to divide a dataset into equal or unequal chunks on the basis of decided and measured criteria. Bot and human accounts from different subsets of data (e.g. the similarities between G10M+ humans and bots in Chapter 4) might only exhibit minute differences that could alter the boundaries of clusters, thus forming misrepresentative clusters. Moreover, differences between bots and humans could also cause the formation of unnecessary and irrelevant categories containing little or no bots.

Therefore, I use the Stweeler bot classifier^3 designed in Chapter 5 to distinguish bots from humans for the dataset described in § 3.4.4. I collect a dataset for 30 days in December 2016. The reasons why a new dataset is collected (as opposed to Chapters 4–5), as well as the details on this dataset, language detection and translation, can be found in § 3.4.4. I verify my findings from Chapter 4 in § 6.3.3.

^3 Stweeler bot classifier – https://github.com/zafargilani/stcs/blob/master/lib/classifiers/rfclassifier.py

6.3 Typifying Bots: A Methodological Approach

The previous section has described a dataset of tweets, annotated with bot vs. human labels for each account.
Next, I further break down these accounts into finer-grained classifications that augment the bot label with the type of bot. Note that it is not necessary to use Stweeler for identifying bots; my typification methodology works with any other tool that can extract bot accounts.

6.3.1 Typification Methodology

First, it is necessary to extract "groups" of bot accounts that exhibit similar behavioural traits. This poses two challenges: (i) identifying features that typify similar types of bots; and (ii) clustering such bots together. The former is particularly difficult to do, as it necessitates a formal definition of bot "types". Although feasible, this comes with a few problems. Firstly, doing this manually, i.e. via human annotations, restricts the process to a limited dataset and limited 'freshness'. Secondly, it is likely to suffer from high degrees of subjectivity. In order to remove such subjectivity, I employ an unsupervised learning approach, which can then be analysed a posteriori. The other advantage of an unsupervised task is its diminished reliance on training datasets, which would be required for a supervised classification task. Furthermore, this approach is modular, so one learning model can be replaced with another.

This chapter tests three different clustering approaches for the dataset. A set of features (Table 6.1) for all processed bot accounts is given as input to each of the following clustering algorithms. The feature values are normalised and projected to the clustering method, which then predicts the cluster for each data point, depending on the algorithm's criteria.

Table 6.1: Features.

Feature                     Description
Age of account              The age of the Twitter account in days.
Favourites-to-tweets ratio  'Favourites' or 'likes' received for all user tweets.
Lists per user              Lists subscribed to.
Followers-to-friends ratio  Relationship reciprocity.
User favourites             Tweets 'favourited' by a user.
Likes/favourites per tweet  'Favourites' received by a user.
Retweets per tweet          'Retweets' received by a user.
User replies                Tweets replied to by a user.
User tweets                 User-generated tweets.
User retweets               Retweeting tweets of other users.
Tweet frequency             Daily tweet frequency of a user.
Activity source type        A 'source' is the endpoint from which a user performs activity on Twitter, as identified in Chapter 4. This categorisation is refined as: browser or web client (S1), mobile device apps (S2), social media management apps (S3), social media scheduling and automation (S4), social media optimisation and intelligent tweeting (S5), marketing and brand promotion (S6), and news content web services (S7).
Source count                The number of endpoints used.
URLs count                  URLs are used to redirect traffic elsewhere from the Twitter platform.
URL & schemes               URL hosts and URI schemes, extracted from the [text] tweet attribute.
Photos (JPG/JPEG)           A photo is extracted from the URL in the [media url https] attribute.
Animated images (GIF)       Though these are animated photos, Twitter saves the first image in the sequence as a photo, and the animated sequence as a video under the [video info] attribute.
Videos (MP4)                Video files accompany a photo which is extracted by Twitter from one of the frames of the video. A video is pointed to by the URL in the [video info][url] attribute.
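Several of the features in Table 6.1 can be derived directly from a Twitter API v1.1 user object and an account's collected tweets. The sketch below is illustrative only – the field-to-feature mapping is an assumption, not the Stweeler implementation:

```python
# Hedged sketch: deriving a subset of the Table 6.1 features. `user` is a
# Twitter API v1.1 user object and `tweets` the account's collected tweets.
from datetime import datetime, timezone

def account_features(user, tweets):
    created = datetime.strptime(user["created_at"],
                                "%a %b %d %H:%M:%S %z %Y")
    age_days = max((datetime.now(timezone.utc) - created).days, 1)
    n_tweets = max(len(tweets), 1)
    return {
        "age_of_account": age_days,
        "followers_to_friends": user["followers_count"]
                                / max(user["friends_count"], 1),
        "user_favourites": user["favourites_count"],   # tweets liked by user
        "lists_per_user": user["listed_count"],
        "tweet_frequency": user["statuses_count"] / age_days,
        "likes_per_tweet": sum(t["favorite_count"] for t in tweets) / n_tweets,
        "urls_count": sum(len(t["entities"]["urls"]) for t in tweets),
        "source_count": len({t["source"] for t in tweets}),
    }
```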
I initially experimented with the k-means clustering approach but found it to be limited, given that each data point is assigned to the cluster whose mean has the least squared Euclidean distance. k-means therefore does not capture the differences that might occur between data points in a multimodal (multivariate) setting, and so this approach was not suitable.

Next, I experimented with the Gaussian Mixture Model, which is applied to multimodal (multivariate) datasets. Gaussian Mixtures instead use the Mahalanobis distance, which is a quadratic distance as opposed to the straight line of the Euclidean distance. There are, however, two issues when using this model. Firstly, the model cannot learn the number of clusters from the dataset; instead, this has to be provided arbitrarily as an input to the model (which is difficult to know a priori). Secondly, the model assumes that the dataset consists of normally distributed dense matrices – this requirement was not met within our data. It was concluded that this approach was also not suitable for this dataset.

6.3.2 Spectral Clustering

Considering the failures with k-means and Gaussian Mixtures, I next experimented with the Spectral clustering approach (with k-means assignments). Spectral clustering has been widely used in the past for segmenting data points from a noisy background and for image segmentation to identify objects. The algorithm processes normally distributed sparse matrices to group bot accounts into n clusters, where n is learned automatically from the data. This makes it more suitable for this particular purpose. Spectral clustering uses a spectrum, or eigenvalues^4, of the affinity matrix to project the data into a low-dimensional space. This low dimension is the eigenvector (spectral) domain, where the data points are easily separable through an assignment method, e.g. k-means.

Spectral clustering solves the problem on the affinity graph by cutting the graph into n clusters such that the weight of the edges connecting the clusters (inter-connection) is small compared to the weight of the edges connecting objects inside each cluster (intra-connection). The affinity graph G measures the similarity between data points (or computes the distance) with indices i and j such that G_{ij} \geq 0. Cutting the affinity graph is adapted from the normalised cuts problem [73]. This in turn means that, since an edge connecting two similar objects on the graph is a function of the gradient (i.e. distance), similar objects will be kept together. Thus, having a distance matrix as the affinity matrix, for which 0 means identical objects and high values mean dissimilar objects, the problem can be stated as a weighted kernel k-means problem (Equation 6.1):

\max \sum_{r=1}^{k} \omega_r \sum_{x_i, x_j \in C_r} k(x_i, x_j)    (6.1)

The weight \omega_r is the reciprocal of the number of elements in the cluster, and C_r represents normalised coefficients for each data point for each cluster. The problem can then be vectorised (Equation 6.2) as weighted kernel k-means with n points and k clusters:

\max_{G} \mathrm{trace}(G^{\top} G)    (6.2)

The k-means assignments match the finer details of the dataset, though they can be unstable and hard to reproduce. Despite this disadvantage, k-means produces finer clusters that better match reality than the 'discretize' assignment strategy, which is reproducible but creates clusters of even shapes.

^4 A non-zero vector that does not change direction when a transformation T is applied to it is an eigenvector; it is only scaled, and the scaling factor is the eigenvalue.
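In scikit-learn terms, the clustering step might look like the following sketch. The parameter choices are illustrative: n_clusters=8 reflects the eight clusters reported below, and in practice the cluster count can be chosen from the spectrum of the affinity matrix (e.g. via the eigengap) rather than fixed in advance.

```python
# Hedged sketch of the clustering step; `X` is assumed to be an
# (n_accounts x n_features) array of the Table 6.1 feature values.
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import SpectralClustering

X_norm = MinMaxScaler().fit_transform(X)   # normalise feature values

model = SpectralClustering(
    n_clusters=8,                  # eight clusters emerged from this data
    affinity="nearest_neighbors",  # sparse affinity graph over data points
    assign_labels="kmeans",        # k-means assignments in spectral domain
    random_state=0,
)
labels = model.fit_predict(X_norm)  # cluster id (0-7) per bot account
```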
Table 6.2: Clusters produced by Spectral clustering, their comparative tendencies vs. other clusters for distinctive behavioural properties (mean values in parentheses), and descriptive labels.

Cluster 0 (3,017 bots) – Young producers: higher favourites performed (14,910); higher daily favouriting frequency (26); lower age (1,105); fewer likes per tweet received (20); fewer source types used (3); fewer URLs posted (37).
Cluster 1 (1,151 bots) – Young assistants: higher favourites performed (11,458); higher daily favouriting frequency (20); lower age (1,334); fewer source types used (4).
Cluster 2 (809 bots) – Assistants: higher favourites performed (14,600).
Cluster 3 (20 bots) – Popular content producers: more retweets per tweet received (320).
Cluster 4 (23 bots) – Popular content redirectors: fewer retweets posted (8); higher lists-age ratio (23,043); more URLs posted (300).
Cluster 5 (25 bots) – Stellar active engagers: higher age (2,357); more tweets posted (1,711); more replies and mentions posted (404); more likes per tweet received (405); higher follower-friend ratio (44,757); more source types used (19); more URLs posted (1,151).
Cluster 6 (35 bots) – Stellar passive engagers: more retweets posted (60); more likes per tweet received (661); more retweets per tweet received (526); higher follower-friend ratio (33,120); more source types used (11); more URLs posted (351).
Cluster 7 (8 bots) – Social chameleons: more source types used (12).

I used the Spectral clustering implementation from the scikit-learn [67] machine learning library to identify the unlabelled bot clusters. I also identified nine principal components from a list of 24 features (see Table 6.1) that cluster similar accounts together. These include account age, favourites performed, retweets-per-tweet ratio, follower-friend ratio, number of activity source types used, activity source type, URLs posted as part of tweets, likes received, and retweets received. Note that activity source type is a collection of 7 sub-features (more on that in § 6.4.1). More about feature extraction and exploration can be found in Chapter 3. Findings in Chapters 4–5 helped in refining this list of features (Table 6.2) to achieve an accurately clustered dataset.

However, one persistent shortcoming of Twitter data is that I cannot obtain geolocation information, as Twitter (by default) does not geo-annotate tweets, nor include an IP address which could be used to determine regionality. This would have provided another dimension of features which could have been used to further refine the clusters, based on account location. To experiment with such information I therefore explore other avenues to collect and curate data, as discussed in § 6.5.

6.3.3 Clustering Results

Table 6.2 presents the clustering results. The process produces eight different clusters, which I initially label from 0 to 7. The table lists the number of bots that fall into each group, as well as the characteristics that each group exhibits with regard to features. The characteristics highlighted were identified as the defining factors that resulted in the accounts being placed in a separate cluster. For example, the largest group is Cluster 0, which tends to contain bots that favourite a large number of tweets, whilst being young, receiving few likes, posting only a few URLs and using just a small number of sources. With these observed characteristics, I then manually label each cluster with a relevant name (see Table 6.2). For instance, in the case of Cluster 0, I term it "Young Producers" as it contains predominantly young accounts that produce a large amount of content. I repeat this for all clusters, selecting names that (in my opinion) best capture their key characteristics.
Note that these labels are used for convenience of reference, and do not impact any of the subsequent analysis.

It can be seen that there is high diversity in the cluster sizes. Whereas the majority of accounts are classified as Young Producers, Young Assistants or Assistants, there exists a tail of other accounts that do not have particularly divergent characteristics; e.g. Cluster 7 (termed "Social Chameleons") is bound together exclusively because of the number of source types its accounts used. Clusters 3–7 each have 35 or fewer accounts; I find that these clusters tend to contain more "unusual" accounts, which (by definition) have a relatively small number of participants. Most notably, these clusters contain accounts that are both more active and more popular than those in other clusters. For example, the 25 bots in Cluster 5 post an average of 1,151 URLs compared to just 37 in Cluster 0 (which contains 3,017 bot accounts). Hence, these clusters are of significant interest as they constitute the outliers within my dataset.

To elucidate this, I proceed to explore the exact characteristics of the accounts within each cluster. Figures 6.1–6.2 present a series of cumulative distribution functions (CDFs) that show the distribution of values across all accounts in each cluster. I present all features considered within the clustering process. Note that Clusters 3–7 have relatively small sample sizes, hence the step-based distributions.

[Figure 6.1: Empirical distributions for behavioural activities of bot clusters: 0 (Young producers), 1 (Young assistants), 2 (Assistants), 3 (Popular content producers), 4 (Popular content redirectors), 5 (Stellar active engagers), 6 (Stellar passive engagers), 7 (Social chameleons). Panels: (a) Tweets posted; (b) Retweets posted; (c) Favourites performed; (d) Replies and mentions posted; (e) Likes per tweet received; (f) Retweets per tweet received.]

[Figure 6.2: Empirical distributions for behavioural activities of bot clusters (continued). Panels: (a) Lists subscribed to; (b) Follower-friend ratio; (c) Daily status frequency; (d) Source count; (e) URL count; (f) Daily favouriting frequency.]

It can be seen that there is a mix of behaviours, with some clusters closely mirroring each other, whilst the remainder diverge significantly. This, for example, can be seen in Figure 6.1a, in which Clusters 0 and 1 generate substantially fewer tweets than the other clusters (medians of 32 and 33, respectively, vs. 65–432). This observation recurs across other features, with Clusters 0 and 1 differing, e.g. they tend to favourite more but post fewer tweets. These are what one might term common bots – relatively inactive and unpopular accounts. In contrast, the other clusters exhibit far more unusual characteristics, with high levels of activity across most features. This is most noticeable in terms of tweets, likes, retweets per tweet, and follower-friend ratios. The remaining features exhibit roughly equal characteristics across all accounts, with one noticeable difference: favouriting rates. This captures the number of favourites performed by accounts (Figures 6.1c and 6.2f), at which Clusters 0 and 1 excel. The median number of favourites per day for Clusters 0 and 1 is 4,307 and 1,670, respectively; this can be compared against an overall median of 2,634. This highlights one type of promotion strategy for typical^5 bots, where favourites are used to advertise themselves. Again, I present these distributions to capture the exact characteristics of each cluster, and to allow others to contextualise my later analysis. I re-emphasise that the labels presented (e.g. "Young Producers") are a mechanism for discourse, and do not influence any of the later analysis.

^5 Note that 81.92% of all bots in this dataset fall into these two categories.
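Distributions like those in Figures 6.1–6.2 can be reproduced with a simple empirical-CDF sketch; the DataFrame layout (a cluster column plus one column per feature) is an assumption:

```python
# Illustrative sketch: per-cluster empirical CDF of one feature on a
# log-scaled x-axis, mirroring the presentation of Figures 6.1-6.2.
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(df, feature):
    for cluster_id, group in df.groupby("cluster"):
        x = np.sort(group[feature].to_numpy())
        y = np.arange(1, len(x) + 1) / len(x)   # empirical CDF
        plt.step(x, y, where="post", label=str(cluster_id))
    plt.xscale("log")
    plt.xlabel(feature)
    plt.ylabel("CDF")
    plt.legend(title="cluster")
    plt.show()
```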
Before diving deep into the congruent or typical behaviours of each cluster, I verify whether Spectral clustering (i) produces a representative number of clusters from the given bot population, and (ii) forms the same set of categories rather than new ones. I used two different datasets and found that the same number of categories was formed from both. The first dataset comprised 9,186 bots from April 2016 and formed a total of eight clusters, although the sizes of the clusters varied. The second dataset comprised 5,551 bots from December 2016, and also formed a total of eight clusters. Hence, Spectral clustering proves to be both representative and consistent in the number of clusters it produces from datasets with similar features.

6.4 Deep Diving into Bot Behaviours

The previous section has presented a methodology to cluster bots into different categories based on various prominent features. Whereas the majority have been clustered into "typical" accounts (i.e. those with relatively few followers and low scores across most popularity metrics), I observe a set of outlier clusters containing more unusual bots that exhibit behavioural traits not dissimilar to major human celebrities. This section builds upon these basic characteristics to investigate the deeper behaviour of these bots.

6.4.1 What bot software is used?

I begin by inspecting the bot software used by each account. This is trivial, as tweets are accompanied by "source endpoints" which describe the endpoint that created the tweet. Whereas nearly all (more than 339k tweets, 78.09%) human accounts rely on the official Twitter client (either web or mobile), I observe significant diversity amongst the bot-operated accounts. To study these, Table 6.3 presents a summary of the different source types I observe, and Figure 6.3 shows the distribution of source types across clusters. It is worth noting that, even though I exclusively include bot accounts, almost 320k tweets (53.83%) come from tools involving human usage and intervention (S1 and S2), whereas almost 274k tweets (46.17%) are tweeted using automated tools (S3–S7).
This confirms that many bots are not exclusively automated and, instead, involve significant human intervention. In fact, this is further reinforced by the human population in the dataset (recall that I detected 11,379 humans as part of the Stweeler bot detection campaign). For the accounts that are detected as humans, approximately 343k tweets (78.90%) of all tweets are generated by tools involving human usage and intervention vs. almost 92k tweets (21.10%) by automated tools. This goes a long way towards explaining the usual challenges with bot detection – most bots are not exclusively software-based, and most humans are not exclusively using manually operated apps, despite distinctive trends.

Table 6.3: Types of most prevalent Twitter activity sources for bot clusters.

Source type                                      Tool/App                           Usage description                                  # Tweets
S1: Browser or Web client                        Twitter Web Client                 Human intervention.                                98,991
S2: Mobile device apps                           Twitter for iPhone, Twitter for    Human intervention.                                220,176
                                                 Android, Mobile Web, Facebook,
                                                 Drudge
S3: Social media management apps                 TweetDeck                          Social media dashboard management and              60,158
                                                                                    primitive scheduling.
S4: Social media integration, scheduling         Buffer, Hootsuite, SocialOomph,    Social media integration (Twitter, Facebook,       115,663
    and automation                               Echobox Social, Postcron,          etc.) and advanced tweet scheduling and
                                                 dlvr.it, twittbot.net              automation.
S5: Social media optimisation and                SocialFlow                         Optimise the delivery of messages on Twitter       34,418
    intelligent tweeting                                                            using the commercial Twitter Firehose API and
                                                                                    a proprietary link proxy (accumulating click
                                                                                    data) for large brands and publishers.
S6: Social media marketing, brand promotion      Sprinklr, Spredfast,               Social media marketing, advertising, content       24,834
    and customer experience management and       Sprout Social                      management, community management, collaboration,
    analytics for enterprises and businesses                                        advocacy, monitoring and analytics tools for
                                                                                    large brands and agencies.
S7: Content web services                         SnappyTV.com, IFTTT, Vine          Applets, video editing (e.g. creating              38,640
                                                                                    highlights), video sharing.

Inspection of these accounts therefore reveals a mix of types. Most prominently, I notice that many celebrities (e.g. 0220nicole, hughhewitt, saffrontaylor) and organisations (e.g. airandspace, TEDTalks, Xbox) with Twitter-facing communications rely on both humans and software to handle significant tweet activity.
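For illustration, the S1–S7 bucketing of source endpoints can be sketched as a lookup over a tweet's source field; the tool lists below are a sample drawn from Table 6.3, not the full mapping, and the function name is hypothetical:

```python
# Illustrative sketch of mapping a tweet's "source" field to the S1-S7
# buckets of Table 6.3; the tool lists are a sample, not the full mapping.
import re

SOURCE_BUCKETS = {
    "S1": {"Twitter Web Client"},
    "S2": {"Twitter for iPhone", "Twitter for Android", "Mobile Web",
           "Facebook", "Drudge"},
    "S3": {"TweetDeck"},
    "S4": {"Buffer", "Hootsuite", "SocialOomph", "Echobox Social",
           "Postcron", "dlvr.it", "twittbot.net"},
    "S5": {"SocialFlow"},
    "S6": {"Sprinklr", "Spredfast", "Sprout Social"},
    "S7": {"SnappyTV.com", "IFTTT", "Vine"},
}

def categorise_source(source_html):
    # The API wraps the source name in an anchor tag, e.g.
    # <a href="..." rel="nofollow">Twitter Web Client</a>
    name = re.sub(r"<[^>]+>", "", source_html).strip()
    for bucket, tools in SOURCE_BUCKETS.items():
        if name in tools:
            return bucket
    return "other"
```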
As well as revealing human involvement in bot activity, Table 6.3 also presents a number of sources that are automated: S3–S7 are all software-based. These include social media management and primitive scheduling services (S3) as well as more advanced tweet scheduling and automation services (S4). In fact, together S3 and S4 form the second-largest set of endpoints for generating tweet activity, with almost 176k tweets (29.65%) produced. Beyond these basic tools, I also observe a range of sophisticated and targeted bot platforms. For example, I observe pattern-mining bots^6 that learn optimal ways to obtain visibility (S5), and marketing, monitoring and analytics bots for large brands and enterprises (S6). These platforms provide advertising and marketing products, and monitoring through dashboard services. It is important to note that they account for less than 60k tweets (10% of my dataset) but are highly optimised: SocialFlow (S5) and Sprinklr, Spredfast and Sprout Social (S6) are specifically designed to optimise tweet activity for large brands (Xbox), agencies (CNN, TIME) and even popular individuals (alexburnsNYT) for maximum visibility and screen time. For example, Xbox retweeted a tweet^7 (originally posted on Friday evening at 2200 hours) every few hours on Saturday to attract the maximum number of participants.

The final category of activity source endpoints includes content web services that are purposed for creating applets, and for video editing and sharing on-the-fly by content creators (S7). These services involve a combination of humans (e.g. to create highlights of sports events or bulletin news via SnappyTV.com or Vine) and rapid content sharing (e.g. through content management and replication such as IFTTT conditional applets). While only around 39k tweets (6.52%) are produced by these services in the dataset, it shows the ability of an information social network to distribute content rapidly.

^6 These are based on collecting data from Twitter's commercial Firehose API and accumulating click data through spreading URLs and monitoring clicks.
^7 The original tweet can be found here (last accessed 16 June 2018) – https://twitter.com/xbox/status/809880789437575168

[Figure 6.3: Types of most prevalent Twitter activity sources for bot clusters. Panels (a)–(g): tweets per cluster for source types S1–S7.]

I next inspect how the different clusters exploit each software platform. Figure 6.4 presents the fraction of tweets generated by each source endpoint across the eight clusters. Differences can immediately be seen in the choices made within each cluster. For example, it shows that 100% of tweets injected by Drudge^8 were from accounts in Cluster 1, such as news reporters tweeting for AFP, AJENews, AlArabiya EGY, AlArabiya, bbcbrasil, FoxSports br, etc.; representatives from ELLEfashion; staff from DunkinDonuts, HarvardHealth, etc.; individuals such as BobVila and jimcramer; and the app's own account, DRUDGE REPORT. It is also noticeable that Clusters 0 (Young producers), 1 (Young assistants) and 2 (Assistants) use most of the available activity sources, ranging from human usage and intervention (left-hand side) to completely automated services (right-hand side). While Clusters 3 (Popular content producers) and 4 (Popular content redirectors) show considerable human usage vs. automation, Clusters 5 (Stellar active engagers), 6 (Stellar passive engagers) and 7 (Social chameleons) show much higher automation and scheduling vs. negligible human usage. This is understandable, since content popularity is directly proportional to content novelty and popular trends, which in turn engage human interest. Most bots lack these properties and thus earn much lower popularity levels than human-created content, as noticed previously in Chapter 4.

6.4.2 What topics do bots discuss?

Spectral clustering (§ 6.3.2) produces groups of accounts that exhibit similar traits. Table 6.1 lists the traits that are similar among accounts within the same cluster, e.g. aggressive tweeting patterns. However, this provides little insight into what different types of bots tweet about. In particular, I am interested in understanding the context of each bot in terms of its purpose and topics of interest.
^8 Drudge (better known as Drudge Report) is a news aggregator service that allows the user to directly tweet the content being viewed/read.

[Figure 6.4: Distribution of top 20 activity sources per cluster: percentages are calculated per source per cluster (i.e. normalised for different sources in each cluster).]

Next, I attempt to explore the topics discussed within each cluster. I hypothesise that certain clusters may have a proclivity towards certain prominent topics. I emphasise, however, that the clusters are derived from the traits listed in Table 6.1, i.e. topical similarity was not taken into consideration. Hence, I now explore popular topics discussed within and across clusters. I start by filtering stop-words and frequently occurring words, such as URL protocol names (to clean the text). I then employ topic modelling by converting tweets into the most popular topics per bot account. To accomplish this I use Latent Dirichlet Allocation (LDA). LDA is an unsupervised generative probabilistic model that discovers latent structure in a set of documents by considering each document as a collection of latent topics. Tweets are first broken down into word vectors, and topics are then modelled as a distribution over word co-occurrences. Exact details regarding LDA can be found in [9]. I use the LDA implementation in scikit-learn [67] to generate topic models for the eight clusters.
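A hedged sketch of this step with scikit-learn is shown below; the vectoriser settings and topic count are illustrative, not the exact configuration used here:

```python
# Illustrative sketch: LDA topic model over one cluster's tweets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def cluster_topics(tweets, n_topics=10, n_top_words=10):
    vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
    X = vec.fit_transform(tweets)              # tweets -> word-count vectors
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    words = vec.get_feature_names_out()
    # top words per topic, ranked by weight in the topic-word distribution
    return [[words[i] for i in topic.argsort()[-n_top_words:][::-1]]
            for topic in lda.components_]
```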
Figures 6.5–6.6 present the topic word cloud for each cluster. For the purposes of comparison, Figure 6.7 shows the most popular topics and words tweeted by the 11,379 human Twitter users. To give greater context, I perform a manual review exercise to allocate topic labels to these clusters. Topic labels are only generally suggestive and indicative, not decisive. I manually label these eight clusters with any combination of Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), and Television (T).

[Figure 6.5: Word clouds of extracted bot clusters with their statistical labels (Table 6.2) and topic labels: (a) 0 - Young producers - DNP; (b) 1 - Young assistants - ANPST; (c) 2 - Assistants - ADO; (d) 3 - Popular content producers - DS; (e) 4 - Popular content redirectors - INP; (f) 5 - Stellar active engagers - INP.]

[Figure 6.6: Word clouds of extracted bot clusters (continued): (a) 6 - Stellar passive engagers - ADIT; (b) 7 - Social chameleons - INPS.]

It can be seen that different clusters have a different "skew" towards certain topics. For instance, whereas accounts in Clusters 3–7 (dominos, HPbasketball, RedeGlobo, BBCWorld, MoneyAffairs, BreakingNews, CollingwoodFC, ESPNFC, WDRBNews) have certain very dominant topics of discussion, e.g. Basketball, The Economist, Football, etc., accounts in Clusters 0–2 (AJArabic, bbcworldfeed, CNNEE, CNNsWorld, NFL, pitchpivot, photo cj, reddit top, swissifg, talkvn, teachersdesign, trafficjamnet, whats live, youkoudan, yalgaarmateen) have a far more egalitarian distribution of topics. This is predominantly driven by the size of these clusters: whereas Cluster 0 has over 3K accounts, Cluster 7 has just 8. Despite this, there are clear topics shared across each group, particularly related to politics, e.g. US politics. This suggests that each cluster is not dedicated to individual topics; rather, behavioural traits are shared across accounts tweeting on a number of issues.

To explore the similarity between the topics, I also compute topical affinity scores for each cluster against every other cluster. Affinity scores are computed by calculating close matches between pairs of clusters (e.g. 0 and 1, 0 and 2, and so on) using Python's difflib^9 library. Tiny differences can be observed between the same pair in opposing orders (e.g. 0 and 1 vs. 1 and 0), because the first item of the pair is taken as the base against which the second item is compared; reversing the order of comparison changes the comparator (base) cluster and therefore produces the difference in result.

^9 difflib – https://docs.python.org/2/library/difflib.html

Table 6.4 shows the produced clusters and their affinity scores, where boldface shows the highest topical affinity between two clusters, as well as topic labels per cluster.

Table 6.4: Inter-cluster affinity scores and review labels vs. humans. Cluster labels can be any combination of: Advertisements & Marketing (A), Daily Affairs & Lifestyle (D), International Affairs (I), News (N), Politics (P), Online Social Networks (O), Sports (S), Television (T).

Cluster                          0     1     2     3     4     5     6     7
0 - Young producers              1     .814  .782  .706  .728  .796  .682  .784
1 - Young assistants             .880  1     .762  .696  .762  .854  .700  .846
2 - Assistants                   .838  .804  1     .674  .736  .744  .656  .782
3 - Popular content producers    .770  .762  .724  1     .712  .712  .656  .746
4 - Popular content redirectors  .800  .788  .694  .696  1     .796  .662  .790
5 - Stellar active engagers      .860  .840  .742  .690  .768  1     .686  .840
6 - Stellar passive engagers     .810  .744  .710  .608  .748  .784  1     .758
7 - Social chameleons            .846  .772  .742  .668  .710  .818  .718  1
Labels                           DNP   ANPST ADO   DS    INP   INP   ADIT  INPS
Humans (pop. 11,379)             .788  .746  .746  .712  .684  .752  .662  .752
All clusters vs. Humans: .730

This shows that there is heavy overlap between the topics discussed in different clusters. For the purposes of comparison I also show the affinity scores between the entire human population (11,379 accounts in total) and the eight bot clusters. The bot clusters are strikingly similar to the human population in terms of the popular topics in tweets. The reason for this is that most of the bots are reproducing content which has been posted by humans (either on Twitter or from elsewhere, e.g. via external URLs). Additionally, this suggests that although there are two very distinct entity populations on Twitter, the topics are highly common among the entities. This strongly indicates that bots are trying to appeal to humans, because human action (in the form of a like, retweet, follow, external redirection, influence, bias, manipulation, support, publicity, etc.) is the end goal for most of these entities, as noted in Chapter 4.

6.4.3 Do bots exhibit sentiment?

The above has shown that, although clusters tend to have certain dominant topics, there is not a statistically significant trend that exclusively limits bots within a cluster to a given set of topics. Next, I expand the content analysis to investigate the sentiments contained within bot tweets. I use the textblob API to calculate polarity and subjectivity over all of the text corpora tweeted by the bots in each cluster. Polarity ranges from -1 (negative sentiment) to 1 (positive sentiment), and subjectivity ranges from 0 (very objective) to 1 (very subjective).
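Concretely, the scoring step reduces to a few lines with TextBlob; the per-corpus averaging below is a minimal sketch, not the exact pipeline:

```python
# Minimal sketch: TextBlob polarity/subjectivity averaged over a corpus.
from textblob import TextBlob

def cluster_sentiment(tweets):
    scores = [TextBlob(text).sentiment for text in tweets]
    avg_polarity = sum(s.polarity for s in scores) / len(scores)
    avg_subjectivity = sum(s.subjectivity for s in scores) / len(scores)
    return avg_polarity, avg_subjectivity
```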
[Figure 6.7: Word Cloud of 11,379 human accounts.]

Table 6.5 shows polarity and subjectivity for the eight bot clusters. I provide both cluster and topic labels.

Table 6.5: Average polarity and subjectivity for bot categories and their formulating clusters vs. humans.

Bot cluster                             Avg Polarity [-1, 1]   Avg Subjectivity [0, 1]
0 - Young producers - DNP               0.1554                 0.5191
1 - Young assistants - ANPST            0.1352                 0.4707
2 - Assistants - ADO                    0.2059                 0.5386
3 - Popular content producers - DS      0.2105                 0.5303
4 - Popular content redirectors - INP   0.1310                 0.4652
5 - Stellar active engagers - INP       0.0454                 0.4568
6 - Stellar passive engagers - ADIT     0.2777                 0.5194
7 - Social chameleons - INPS            0.1125                 0.4885
Humans (population of 11,379)           0.1266                 0.4531

I observe that subjectivity scores are fairly even across all clusters (0.4568–0.5386), indicating that all clusters are quite subjective in their generated content, despite the fact that larger clusters have a higher overall variance (e.g. Cluster 0 with a variance of 0.028 vs. Cluster 7 with 0.008, a 3.5× difference). Interestingly, Clusters 6 and 7 have mid-range subjectivity, i.e. neither completely objective nor subjective. This is owed to two reasons: (i) Cluster 6 has only two accounts at either end of the subjectivity spectrum (the very objective primiciasyacom – an Argentinian TV shows portal – and the very subjective VanguardiaSon – a Mexican daily information network); and (ii) Cluster 7 has none at the ends (all of its accounts lie between 0.3668 and 0.6333). This is understandable given the nature of the accounts, which mostly relate to Daily Affairs, International Affairs, News, Politics, Sports and Television. However, some particular accounts across all of the other clusters range from very objective, i.e. 0 (e.g. reddotjobs, operated by reddotjobs.co.uk – a specialist sales recruiter in the UK, or ELLEfashion, operated by elle.fr from France, tweeting about fashion and products), to very subjective, i.e. 1 (e.g. DinheiRonaldo, which tweeted about Cristiano Ronaldo's net worth roughly 268 times a day from Mar 2015 to Oct 2015, or TheGifLibrary, tweeting funny GIFs).

There is a greater spread of sentiment polarity, although all clusters broadly exhibit positive sentiment (i.e. > 0) and similar variance (0.0255–0.0572). Quite interestingly, Cluster 5 is the most different overall in terms of polarity, exhibiting a low average polarity (0.0454), i.e. neutral content. This can be attributed to two reasons: (i) most of the accounts in Cluster 5 are operated by (relatively) mainstream news channels (CNN, Fox News, TIME, AlArabiya, MetroTV and NBC's Louisville affiliate wave3news, Q13FOX, franceinter, detikcom), which means these accounts post content in vast quantities that is both negative and positive; and (ii) some of the accounts also belong to sports news (SpheraSports), brands (Starbucks) and Twitterati running social campaigns (segalink), which try to post content with positive undertones to keep followers engaged.
That said, throughout Clusters 0–7 some particular accounts range from very negative sentiment, i.e. -1 (e.g. CornOppa, a sarcastic account tweeting about topics that typically contain words, such as 'empty' or 'warning', that are usually marked as negative), to very positive sentiment, i.e. 1 (e.g. LakeNormanRE, operated by a realty business that tweets listings of attractive properties).

[Figure 6.8: Distributions of polarity [-1, 1] and subjectivity [0, 1] per bot cluster vs. humans.]

Clinton vs. Trump: To ground these results, I next zoom into two pertinent subjects – Hillary Clinton vs. Donald Trump – who were being debated in Dec 2016 because of their candidacy in the 2016 US Presidential election. It is now commonly believed that the 2016 US Presidential election was "hacked" through collusion^10 between Trump's campaign team and Russian individuals posing as Americans. In fact, the US Department of Justice has indicted Russian individuals for having: (i) organised and promoted pro-Trump political rallies within the US, (ii) posted political messages on social media accounts that impersonated real US citizens, and (iii) promoted information that disparaged Hillary Clinton – the Democrat candidate. I use the dataset to examine whether the three indicted activities are visible within it, i.e. whether Donald Trump received more screen time simply because he received greater promotion, whether Donald Trump received greater social media coverage, and whether Hillary Clinton received infrequent and negative coverage compared to her Republican rival.

^10 Trump–Russia inquiry indictment (last accessed 16 June 2018) – http://www.bbc.co.uk/news/world-us-canada-43095881

Figure 6.9 presents the distribution of polarity and subjectivity values for all tweets mentioning Clinton or Trump, either as a word, mention or hashtag. Polarity and subjectivity scores are calculated per account across all clusters, and normalised against the total number of tweets posted per account mentioning each topic. Therefore, an account mentioning Clinton in one tweet and Trump in ten tweets is given normalised weightage. Despite similar distributions, Clinton and Trump show some differences, such as a higher average positive polarity towards Trump, but lower content subjectivity for Clinton (and therefore more objective argumentation).

[Figure 6.9: Clinton vs. Trump: Normal distributions of polarity [-1, 1] and subjectivity [0, 1].]

However, to find out the sheer volume of traffic produced per topic I look at Table 6.6, which shows polarity scores for Clinton vs. Trump tweets. Quite surprisingly, Donald Trump (13,631) received almost 14× more positively inclined tweets than Hillary Clinton (1,005). Even more surprisingly, Hillary Clinton received more negative tweets (796) than Donald Trump (538).

Table 6.6: Tweet polarity scores for Clinton vs. Trump.

+ve Clinton tweets   -ve Clinton tweets   +ve Trump tweets   -ve Trump tweets
1,005                796                  13,631             538

To dive deeper, I review the most renowned news outlets significantly covering Clinton and Trump during Dec 2016.
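A tally like Table 6.6 can be sketched as follows, assuming simple keyword matching on the tweet text and TextBlob polarity deciding the sign (both assumptions):

```python
# Illustrative sketch of the positive/negative tweet tally behind Table 6.6.
from textblob import TextBlob

def polarity_tally(tweets, keyword):
    pos = neg = 0
    for text in tweets:
        if keyword.lower() in text.lower():
            polarity = TextBlob(text).sentiment.polarity
            if polarity > 0:
                pos += 1
            elif polarity < 0:
                neg += 1
    return pos, neg

# e.g. polarity_tally(all_tweets, "clinton") vs. polarity_tally(all_tweets, "trump")
```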
All of these news outlets are operated through one or more automated sources, with frequent human intervention. Table 6.7 shows the results. At first glance it is obvious that all of the news outlets were providing 6×–42× more coverage of Donald Trump than of Hillary Clinton. More surprisingly, most of the news outlets had comparatively more positive coverage of Donald Trump than of Hillary Clinton. In fact, nytimes, Reuters and TIME were the only news outlets that, despite giving Trump more coverage and screen time, tweeted more positively about Clinton. Even more unexpectedly, none of the news outlets exhibited negative sentiment (negative average polarity) towards Trump. These observations strongly support the three indictments.

Table 6.7: Polarity scores for Clinton vs. Trump by renowned news outlets.

                   CNN      Fox      MSNBC   nytimes  Reuters  Economist  TIME    WSJ
Clinton tweets     313      277      31      120      12       11         52      10
Clinton polarity   0.0517   -0.0405  0.0907  0.4249   0.2      -0.2857    0.2554  0.0486
Trump tweets       1,792    3,945    331     1,730    502      181        567     328
Trump polarity     0.0773   0.1233   0.0968  0.1133   0.1634   0.1034     0.1114  0.1337

6.4.4 What content do bots share?

A major characteristic of bot behaviour is the tendency to share content or redirect traffic to external Web resources via URLs. Whereas the average number of URLs shared by human accounts is 17, it is 22 for bots. In the most extreme case (Cluster 5), the average is 672. This is intuitive, as bots are regularly tasked with promoting websites and/or particular viewpoints.

I extract all URLs from the bot tweets and find that almost all of the hosts are actually URL-shortening services (e.g. t.co, tinyurl.com), thus hiding the real URL. Table 6.8 presents the most frequently used URL shorteners for each cluster. Unsurprisingly, the most frequently used URL shortener is Twitter's own shortening service, t.co. The domain t.co^11 allows Twitter to automatically shorten a URL whenever a tweet is posted, thus helping Twitter to track and monitor URLs (for spam and malicious content), generate quality signals for insights, and conserve the tweet character limit.

Table 6.8: Shortened URI hosts used for redirection, per bot cluster.

Bot cluster                            URI host        # Tweets
0 - Young producers - DNP              t.co            74,583
                                       tinyurl.com     7
1 - Young assistants - ANPST           t.co            66,507
                                       on.natgeo.com   5
2 - Assistants - ADO                   t.co            74,612
                                       tinyurl.com     1
3 - Popular content producers - DS     t.co            1,063
4 - Popular content redirectors - INP  t.co            4,248
5 - Stellar active engagers - INP      t.co            16,804
6 - Stellar passive engagers - ADIT    t.co            6,808
7 - Social chameleons - INPS           t.co            639
Humans                                 t.co            193,792
                                       yfrog.com       11

Little insight can be garnered from this, and I therefore resolve all of the shortened URLs to track where they redirect to. Table 6.9 shows the actual URI hosts post-resolution.^12

^11 Twitter t.co (last accessed 16 June 2018) – https://help.twitter.com/en/using-twitter/url-shortener
^12 Note that shortened URI hosts and redirected URI hosts are not equatable, i.e. the sum of shortened URI hosts will not equal the sum of redirected URI hosts, because of a number of issues that arise while parsing the redirected links, such as: suspended URLs, URL resolution expired or deleted, host not found (webpage deleted), etc.
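The resolution step itself can be sketched with a standard HTTP client that follows redirect chains; this is illustrative (the original tooling may differ), and failed resolutions correspond to the suspended/expired/deleted cases of footnote 12:

```python
# Illustrative sketch: follow a shortened link's redirects to its final host.
import requests
from urllib.parse import urlparse

def resolve_host(short_url, timeout=10):
    try:
        resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
        return urlparse(resp.url).netloc
    except requests.RequestException:
        return None   # suspended, expired or deleted URLs fail to resolve
```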
Table 6.9 presents a number of popular domains – some well known, others less so. Most prominently, I find YouTube regularly occurring across most clusters. This is particularly the case in Clusters 0, 1 and 2, which have large populations with many accounts posting such URLs. I also observe a number of more fringe URLs being posted, particularly in the smaller clusters. A surprising result is the sheer impact of just a small number of accounts. The nature of the bots means that it is trivial to generate significant numbers of URL tweets, allowing a small number of intense accounts to dominate the cluster. Whereas popular domains (e.g. YouTube, Huffington Post) tend to be contributed by many accounts, popular fringe domains are primarily injected by just a few prominent accounts – a clear differentiator from (manual) human behaviour. For example, links to couponchief.com were tweeted 595 times in one month (Dec 2016) by just two accounts (Twitter has since flagged it as spam). Although one might imagine more legitimate websites (e.g. news) would differ, many other domains achieve high presence through the contributions of just one or two accounts. For example, the second most popular domain in Cluster 1 is ahmnews.com, with 625 tweets by one account; similarly, in Cluster 4, reuters.com is the most popular domain with 30 tweets by one account.

I next zoom into the behaviours of each cluster. I remind the reader that the content of the URLs was not used within the initial clustering process. Noticeably different activities are identified in the large (0–2) vs. small (3–7) clusters. The large clusters tend to contain a large number of accounts, each generating a relatively small proportion of the URLs. As stated earlier, there is only one commonality shared across most clusters: links to YouTube. In larger clusters, this is driven by a high number of accounts; e.g. in Cluster 0, 844 tweets containing links to YouTube were generated by 72 accounts. In contrast, smaller clusters tend to have only a single account that generates a large number of YouTube links. Inspection of the videos reveals that most are music, news, politics, anime and promotional videos (fantasy, religion, ads).

The latter observation generalises across nearly all other domains: their popularity within a cluster is dictated by a tiny number of highly active accounts. This creates an unstable dynamic, where the top domains vary dramatically over time. This is, in part, due to the small population of some clusters, and the extremely aggressive levels of activity of a small number of accounts. For example, a single bot (JawalWatani – an Arab news bot with 1.09 million followers) posts 1,337 of 3,105 URLs as part of tweets covering YouTube, the Saudi Press Agency, Ahm News and the Saudia Today Arabic daily. Similarly, religion is also quite a popular theme in some clusters. For example, elevatedfaith.com (tweeted 926 times by LovLikeJesus from Cluster 2) is a website selling bracelets to promote Christianity.
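A per-cluster tally like Table 6.9 – host frequency together with the number of distinct contributing accounts – can be sketched as below, assuming an iterable of (cluster, account, host) records (an illustrative input format):

```python
# Illustrative sketch: tally resolved hosts per cluster and count the
# distinct accounts behind each host, as presented in Table 6.9.
from collections import Counter, defaultdict

def top_hosts(records, cluster_id, n=5):
    counts = Counter()
    accounts = defaultdict(set)
    for cid, account, host in records:
        if cid == cluster_id and host:
            counts[host] += 1
            accounts[host].add(account)
    return [(host, hits, len(accounts[host]))
            for host, hits in counts.most_common(n)]
```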
Table 6.9: Top URI hosts post-resolution, per bot cluster (similar URL types are colour-coded), and the accounts most typically tweeting a URL (e.g. 01 is Cluster 0 account 1, and 02 is Cluster 0 account 2).

Bot cluster                            URI host             URL type       # Tweets   Accts
0 - Young producers - DNP              youtube.com          multimedia     844        01–072
                                       financialsbeat.com   finance        444        073, 074
                                       adnil.site           recruitment    339        075, 074
                                       ryann1200.com        unknown        172        055
                                       twitter.com          social media   124        076, 077
                                       huffingtonpost.com   news           83         060, 078–091
1 - Young assistants - ANPST           youtube.com          multimedia     716        11–126
                                       ahmnews.com          news           625        14
                                       hwswworld.com        automation     570        127
                                       spa.gov.sa           press          518        14
                                       fenerbahce.org       sports         195        128
2 - Assistants - ADO                   youtube.com          multimedia     1,717      21–237
                                       elevatedfaith.com    religion       926        238
                                       google.co.in         search         769        239
                                       couponchief.com      coupons        595        240, 241
                                       amazon.com           e-shopping     258        213, 233, 235, 242–247
3 - Popular content producers - DS     youtube.com          multimedia     78         31
4 - Popular content redirectors - INP  reuters.com          news           30         41
                                       investors.com        stock market   6          41
                                       hbr.org              business mag   2          41
                                       fortune.com          business mag   1          41
5 - Stellar active engagers - INP      moca-news.net        news           293        51
                                       youtube.com          multimedia     38         51
                                       animatetimes.com     unknown        35         51
                                       washingtonpost.com   news           2          52
6 - Stellar passive engagers - ADIT    politico.com         news           33         61
                                       topstarnews.net      celeb news     22         62
                                       sinembargo.mx        news           12         63
                                       washingtonpost.com   news           10         64
Humans                                 youtube.com          multimedia     2,861
                                       90min.com            football       453
                                       play.google.com      app store      272
                                       prizeo.com           charity        269
                                       itunes.apple.com     music store    141
                                       facebook.com         OSN            85

While these dynamics are significant in the large clusters, they are even more pronounced in the smaller fringe clusters (4, 5, 6). This is because only a tiny fraction of accounts post large numbers of URLs. For example, all domains in Cluster 4 are injected by a single account (josephjett), which is a Popular Content Redirector. It tweets all of its 39 URLs to Reuters, Investors, HBR and Fortune. The account is owned by a corporate finance expert and solely uses dlvr.it (a social media automation and scheduling app) to post tweets mainly on a number of related themes, including corporate finance, business, and politics. Similar examples can be highlighted across Cluster 5 – Stellar active engagers; e.g. one of its 25 bot accounts (animeseiyu) tweets 410 of 425 URLs to video streaming services (YouTube), Japanese entertainment websites (kiramune.jp, lantis.jp), and Japanese anime news websites (moca-news.net).

It is also worth briefly comparing the various bot clusters against the remaining human accounts in my dataset. Again YouTube is the dominant domain, but I also see OSNs (Facebook) and app stores (Google Play and iTunes). Many of the accounts in Cluster 6 produce URLs as part of tweets to various political and news websites (politico.com, topstarnews.net, sinembargo.mx, washingtonpost.com). Cluster 7 does not tweet any URL that I was able to resolve successfully; this is probably because the URLs had been suspended, had expired, or had been deleted.

Next, I collect and use a supplementary dataset to study the impact of Web bots on Twitter content and activity.

6.5 The Social Cost of Web Bots

According to one estimate, 51.8% of all Web traffic is generated by bots^13. In this section, I quantify the impact of Web bots on content popularity and activity on Twitter. Web bots can be of many types, such as crawlers, indexers, content curators and publishers. I show that despite Web bots being smaller in number, they exercise a profound impact on content popularity and activity on Twitter. To quantify the impact of Web bots, I set up a bot account on Twitter and conduct analysis on the dataset of click logs (Table 3.5) collected on the Web server.
I then characterise the properties of bots using the click logs dataset, highlighting key properties in terms of impact on URL popularity, revisiting behaviour, and the use of IP addresses and Autonomous Systems to launch requests or clicks.

[Figure 6.10: How the Stweeler bot works. The pipeline runs from (1) fetching a trending topic, (2) shortening the URL, (3) assembling the tweet, (4) posting the tweet, and (5) logging clicks; the logs feed the bot analysis toolkit (content analyser, entropy, account properties, ML/NLP, ranking, bot analyser), which asks: bot or not? good bot or bad bot? spammer? producer or consumer? and measures bot impact, influence/followers and weight on the network.]

6.5.1 Setting up a bot account

I extend Stweeler (Chapter 3) to collect the click logs dataset (Table 3.5) from my web server, powered by the Twitter bot. The honeypot bot^14 (Figure 6.10) operates as follows: (i) The bot fetches a popular 'job'-related tweet from the Twitter Streaming API. It then disassembles the text and URL in the tweet. (ii) The URL is then fetched into the web server (WS). The WS runs a shortener module that shortens the URL into a reserved domain name. The shortener is needed to enable redirecting click traffic to the WS in order to collect click logs. (iii) The bot reassembles the tweet using the text and the shortened URL. (iv) The tweet is then posted to my bot's Twitter account. In essence, the Twitter bot and WS perform a simple 'tweet manipulation' to avoid retweeting, which would otherwise prevent the click logs dataset from being obtained. (v) Finally, whenever a user (Twitter user or from the Web) clicks on a tweet or URL, the WS records the click. Table 6.10 shows the type of information that is collected. Note that in order to respect the ethical boundaries of social media research, I only collect publicly available data about users and hash sensitive information such as IP addresses.

^14 Details of the honeypot experiment can also be found in Appendix A.2.

Table 6.10: Data collected through click logging.

Data attribute      Description
Click timestamp     Date and time of click, local to my web server.
Tweet ID            Tweet ID which received a click.
Hashed IP address   Hashed IP address of the machine that clicked the URL in the tweet identified by Tweet ID.
AS number           Obtained from the IP address using CAIDA data.
User agent string   The HTTP USER_AGENT string of the user clicking the URL in the tweet identified by Tweet ID.

6.5.2 Bot detection

For the purposes of this particular study I implemented a simple bot detection method, using the two most relevant features from the click logs dataset: (i) click frequency, and (ii) User agent strings. I use a different technique to Chapter 5 because bots on the Web are different to bots on Twitter, presenting a completely different dataset (§ 3.4.5) and activity profile. Since these bots do not exist on the Twitter platform, they do not present the vast array of attributes available from Twitter data; the information they generally expose is outlined in Table 6.10.

My Twitter bot account received more than 223,000 clicks from 21-11-2015 to 08-01-2017, of which more than 44.91% were produced by some sort of automated agent or bot. I detect these through the two-step method described above, analysing (i) the frequency of clicks, and (ii) User agent strings.
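To make the two-step filter concrete, the following is a minimal sketch. It assumes the click log is held as a list of records with fields mirroring Table 6.10; the field names and the clicks-per-hour threshold are illustrative, not the values used in the study.

```python
import re
from collections import defaultdict

# Substrings that identify self-declared bots in User agent strings;
# an illustrative list, not the full set used in the study.
BOT_UA_PATTERN = re.compile(r"(bot|crawl|spider|slurp|curl|python)", re.I)

def detect_bots(clicks, max_clicks_per_hour=30):
    """clicks: iterable of dicts with 'hashed_ip', 'user_agent' and
    'ts' (epoch seconds). Returns the set of hashed IPs flagged as bots."""
    per_visitor_hours = defaultdict(lambda: defaultdict(int))
    flagged = set()
    for c in clicks:
        # Step (ii): User agent string analysis -- many Web bots
        # announce themselves (Googlebot, Applebot, PaperLiBot, ...).
        if BOT_UA_PATTERN.search(c["user_agent"]):
            flagged.add(c["hashed_ip"])
        # Step (i): click frequency -- bin clicks into hourly buckets.
        per_visitor_hours[c["hashed_ip"]][c["ts"] // 3600] += 1
    for visitor, hours in per_visitor_hours.items():
        # High activity frequency is indicative of automation [18].
        if max(hours.values()) > max_clicks_per_hour:
            flagged.add(visitor)
    return flagged
```

The two steps are deliberately independent: a bot that forges a browser User agent can still be caught by its click frequency, and a slow bot that declares itself is caught by its User agent.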
I employ a time series analysis that takes into account the frequency of clicks by a single visitor (since IP addresses are hashed, a visitor is identified by its hashed IP). As shown in [18], a higher activity frequency is indicative of automated behaviour. I then perform User agent string analysis, which reveals properties such as a URL containing a description of the tool responsible for performing clicks on my URLs. Moreover, I find that there are a total of 2,563 unique visitors, out of which only 113 are unique bots with a recurring presence. These facts are summarised in Table 3.5.

6.5.3 Characterisation

Next I highlight important behavioural properties of bots and humans. These include click activity, revisits to a previously visited URL, and the use of IP addresses and Autonomous Systems (ASes) to launch requests to the deployed web server. Note that a tweet might have one or more URLs; however, each request translates to one click on one URL. Since one request is triggered by one click, the two terms are equivalent in this chapter.

Surprisingly, only 4.08% of the visitors to my tweets or URLs in the click logs dataset are Web bots, yet they are responsible for almost half of the clicks (44.91%). In contrast, in my Twitter dataset I found that 43.13% of accounts were operated by bots, and that these were responsible for 53.90% of statuses. Bots in my click logs dataset thus account for a large chunk of the traffic produced on, and contributed to, the Twitter CDN and the Web. This finding points to interesting implications, since bots not only access these URLs on the Web, but may also repost or retweet these tweets on their Twitter page or elsewhere using the website or platform-specific APIs. This is evident from Figure 6.11.

[Figure 6.11: Click logs dataset – (a) clicks and (b) revisits on the top ten most popular URLs, split into bot and human traffic.]

[Figure 6.12: Click logs dataset – (a) IP addresses and requests for bots vs. humans; (b) IPs and ASes used by the most active bots (Applebot, Googlebot, PaperLiBot, Showyoubot, Twitterbot, OpenHoseBot, Go HTTP client, Rogerbot, TweetedTimes, Yahoo! Slurp).]

Figure 6.11a shows the number of clicks received by the top 10 most popular URLs that my bot posted on its Twitter page. The URL code is the shortened suffix that replaces the original URL. The most popular URL for bots (n7vfn) advertises a UI/UX job in Sunnyvale, CA, and the least popular URL for bots (gq8gg) advertises a job in Nairobi. The top 10 list would change by at least 3 URLs if bots had not existed, clearly showing that bots drive the rise in URL popularity.

Revisits are more typical of humans than bots, as observed in Figure 6.11b. This is because these bots usually follow tweet streams, which always flow forwards; revisiting would require additional functionality for fetching historic profiles. Moreover, some of the bots in my click logs dataset are actually content crawlers that maintain databases to avoid performing repeated activity.
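For concreteness, the per-URL click and revisit counts plotted in Figure 6.11 reduce to a simple aggregation over the click log. The sketch below assumes the same illustrative record fields as before, plus a 'url_code' field holding the shortened suffix; treating a repeat click by the same hashed visitor on the same URL as a revisit is my reading of the text rather than a stated definition.

```python
from collections import Counter, defaultdict

def clicks_and_revisits(clicks, bot_ips):
    """Aggregate clicks and revisits per shortened URL, split into
    bot and human traffic. 'url_code' is the shortened suffix
    (e.g. n7vfn); bot_ips is the set returned by detect_bots()."""
    clicks_by = {"bot": Counter(), "human": Counter()}
    revisits_by = {"bot": Counter(), "human": Counter()}
    seen = defaultdict(set)           # url_code -> visitors seen so far
    for c in sorted(clicks, key=lambda c: c["ts"]):
        who = "bot" if c["hashed_ip"] in bot_ips else "human"
        url, visitor = c["url_code"], c["hashed_ip"]
        clicks_by[who][url] += 1
        if visitor in seen[url]:      # same visitor, same URL again
            revisits_by[who][url] += 1
        seen[url].add(visitor)
    return clicks_by, revisits_by
```

Sorting by timestamp ensures the first click by each visitor is counted as a visit and only subsequent ones as revisits.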
Figure 6.12a shows the distribution of IP addresses used by bots versus those used by humans. 113 bots use 1,667 unique IP addresses to generate a total of 100,194 requests, whereas 2,450 humans use 4,258 unique IP addresses to generate a total of 115,137 requests. Human activity per IP address is considerably lower (27 requests per IP) than that of bots (60 requests per IP).

Lastly, Figure 6.12b shows the distribution of the number of unique IP addresses and Autonomous Systems (ASes) used by the top 10 most active bots (ranked using User agent string analysis), along with their click activity. The most active bots detected in my click logs dataset tend to be Twitter bots that make use of the Twitter API to perform actions (Twitterbot = 18,828 clicks), web crawlers and indexers (Googlebot = 15,790, Yahoo! Slurp = 11,022, Applebot = 6,755), and content curators and publishers (PaperLiBot = 249, TweetedTimes = 437). There is a possibility that Twitter might also inject its own bots for account profiling, spam detection, monitoring and reporting, using its BotMaker software.

Typically, the most active bots use multiple static IP addresses from within a single AS, possibly to parallelise tasks. Interestingly, this possibility is further supported by the fact that all except one AS (25 of 26) are designated as type 'Content' (content hosting and distribution systems), while only one is designated as type 'Transit/Access' (a network that connects others through itself). Furthermore, among the top 10 most active bots there was one exception: an unusually aggressive (but benign) bot called Rogerbot, a web crawler for a marketing firm, which used 6 IPs from 2 ASes to register 3,485 clicks.
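The per-bot IP and AS diversity of Figure 6.12b can likewise be derived by grouping clicks on the self-declared bot name in the User agent string. This is a simplification of the User agent ranking described above; the field names again mirror Table 6.10.

```python
from collections import defaultdict

def ip_as_diversity(clicks, bot_names):
    """For each named bot (e.g. 'Googlebot'), count its clicks and the
    number of distinct hashed IPs and AS numbers it used."""
    stats = defaultdict(lambda: {"clicks": 0, "ips": set(), "ases": set()})
    for c in clicks:
        for name in bot_names:
            if name.lower() in c["user_agent"].lower():
                s = stats[name]
                s["clicks"] += 1
                s["ips"].add(c["hashed_ip"])
                s["ases"].add(c["as_number"])
                break   # attribute each click to one bot name only
    return {n: (s["clicks"], len(s["ips"]), len(s["ases"]))
            for n, s in stats.items()}
```

For example, ip_as_diversity(clicks, ["Googlebot", "Applebot", "Rogerbot"]) yields, per bot, the (clicks, unique IPs, unique ASes) triple that underlies the figure.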
6.6 Takeaways

Social bots are not unitary. In this chapter I explored the various shapes and forms of social bots, which exist as semi-automated and fully automated social entities. Using the Stweeler bot classifier (Chapter 5) I detected bots in the datasets, and then decomposed them into a set of clusters exhibiting similar traits. To achieve this, I developed an unsupervised clustering task to create unlabelled clusters from features. I observe a range of behaviours, with three highly populated clusters made up of bot accounts that follow well-known promotional strategies. I also found a range of software services, tools and apps specifically dedicated to tweet content generation and Twitter account management. Curiously, less popular accounts utilised a mix of apps and human intervention (e.g. Web clients). This empirically confirmed that bots are not of one type, but are highly diverse, with varied patterns both in their own behaviour and in the reactions of others.

Through a series of topical analyses, I then generated labels for these groups based on the principal components of discussion within each cluster. I found that the clusters focus on a range of overlapping topics, particularly: Advertisements & Marketing, Daily Affairs & Lifestyle, International Affairs, News, and Politics. I further investigated the content of the tweets through the polarity and subjectivity of the language used within each tweet. Although all clusters broadly exhibited positive sentiment (i.e. > 0) and similar variance (0.0255–0.0572), a greater spread of polarity was found, ranging from very low (0.0454), i.e. neutral content, to medium high (0.2777), i.e. definitely positive content.

Finally, I inspected the content links that accounts include in their tweets (i.e. URLs). Although examples of mainstream websites are found (e.g. youtube.com is the most popular across most clusters), various other URLs are also observed. These are largely dominated by a few accounts that contribute a disproportionately large number of URLs within each cluster. For example, one cluster (#2) contains links to elevatedfaith.com 926 times, all from a single account.

However, bots that exist outside the Twitter ecosystem can also impact content popularity and activity on Twitter. To study this, I extended Stweeler to implement a honeypot experiment, providing empirical evidence that the impact on Twitter is not restricted to social bots on Twitter. Rather, bots on and off Twitter form part of a larger ecosystem of automated agents of influence, whose reach and impact spread across the Web. I showed that bots, even those on the Web, play a significant role in boosting URL popularity, demonstrate differences in URL revisiting behaviour, and make heavier use of IP addresses and ASes to launch requests. This study provides supplementary evidence that bots indeed come in many types, and impact the popularity of content on Twitter while existing beyond its boundaries.

More generally, through this exhaustive analysis I find that bots exist in diverse forms: from hyper-active content producers to extremely popular passive bots, and from social bots on Twitter to Web bots interacting with Twitter content. If some are found tweeting positively about a product or a political candidate, others are found to be sarcastic and negative. Through these studies I have effectively shown the generalisability and applicability of the Stweeler platform to a wider array of domain-specific problems. I am also confident that Stweeler could be very useful in producing new research in the future.
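As an aside, the polarity and subjectivity figures quoted above can be reproduced in spirit with TextBlob, one of the Python modules listed in Appendix D. Whether the analysis used TextBlob's default sentiment model is an assumption on my part, and the `cluster_tweets` input below is a hypothetical structure.

```python
from statistics import mean, variance
from textblob import TextBlob

def cluster_sentiment(cluster_tweets):
    """cluster_tweets: dict mapping cluster id -> list of tweet texts.
    Returns (mean polarity, polarity variance, mean subjectivity)
    per cluster; polarity lies in [-1, 1], subjectivity in [0, 1]."""
    summary = {}
    for cid, tweets in cluster_tweets.items():
        sentiments = [TextBlob(t).sentiment for t in tweets]
        polarities = [s.polarity for s in sentiments]
        summary[cid] = (mean(polarities),
                        variance(polarities),
                        mean(s.subjectivity for s in sentiments))
    return summary
```

A mean polarity above 0 with low variance corresponds to the "broadly positive" pattern reported for all clusters.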
Chapter 7

Final Remarks

Social bots contribute a significant amount of activity on Twitter. They consume and produce content, and interact with human users via Twitter's many functions (retweets, replies, mentions, likes, etc.). Social bots are function-driven – functions that are defined by their human masters.

During the course of the research encompassed within this dissertation, I have contributed methods and tools that enable measuring, detecting and investigating bots in online social networks, using techniques from data science and machine learning. I embarked on this mission by first properly defining the problem and outlining the background research (Chapter 2), introducing a framework (Chapter 3), measuring and characterising bots through exploratory data science (Chapter 4), detecting bots through supervised machine learning (Chapter 5), and categorising bots to discern types using unsupervised machine learning, as well as exploring Web bots through data curated from the Web (Chapter 6).

7.1 Summary and Conclusions

At the beginning of this dissertation I set out a path, as well as a framework that would be extended along the journey of this research. I began in Chapter 1 by introducing the scale of the problem and setting specific, measurable and attainable goals for this work, as well as outlining the major contributions of this dissertation. In Chapters 2–3, I outlined the background work and formally introduced the Stweeler framework. Chapter 3 also introduced all of the datasets used for the research carried out in Chapters 4–6.

In Chapter 4, I found that bots exercise a tremendous impact on Twitter. The work gave me a set of principal features that I could use to formulate an understanding of how bots differ from humans. I found bots to be generally more active, but neither as novel nor as appreciated as humans in terms of the content they produce. I also found that humans and bots each maintain a certain characteristic homophily amongst their own kind, despite lacking any real knowledge of whether another user is a bot or a human. Unsurprisingly, humans formed far more reciprocal relationships than bots. I also argued that bot traffic can impact many aspects of network operations, including traffic engineering, routing, cloud computing, content distribution networks and quality of service.

Chapter 4 paved the way for Chapter 5, in which I used these findings to develop and evaluate a thorough mechanism to reliably classify bots and humans through a supervised machine learning task. I used a dataset divided into four major popularity groups and examined how different feature splits performed in different detection experiments. I identified the statistically most significant features that can be utilised for accurately detecting bots. My evaluation revealed that the Stweeler classifier was twice as accurate as the then state-of-the-art bot detection tool. These bot activities may lead to dramatic changes in social structures and interactions in the long term (as the bot population increases). Thus, there is a wide array of problems to explore in future, such as: exploring credibility scores, influence botnets, analysing bot content, and developing more accurate detection tools. The credibility of social media accounts and their followings could be used as one of the defining features for detecting dark bots. I therefore envisage that, in the long term, the distinction between human and bot research will wane, with greater integration of their activities (e.g. greater automation of human accounts).

Using the Stweeler classifier developed in Chapter 5, I obtained a pre-classified bot dataset in Chapter 6 that enabled a deeper understanding of the types of bots. Through unsupervised clustering I was able to divide a singular bot population into a number of types. Then, through topic modelling, I was able to perform content analysis to distinguish what different categories of bots produce as content. Through an exhaustive analysis I found bots that varied from hyper-active content producers to social chameleons. I even found individual bot-operated accounts with quasi-celebrity status.

This work opens possibilities for related research in the future. A lot can be learned from topic analysis of the types of lists an account follows: e.g. if the main goal of an agent is to expand its reach, it can be assumed that the agent account would try to follow many different lists without particular topic coherence. Another line of work could explore the provenance of social botnets, and ask whether the least popular Twitter accounts (having minimal activity) are being used to artificially inflate another account's popularity.

Finally, in Chapter 6 I used Stweeler to study bots more generally on the Web. This was accomplished by deploying a honeypot experiment consisting of a bespoke bot, a URL shortener and a Web server. It was found that bots can have a substantial effect on Twitter by impacting the popularity of the content that is displayed on the platform.

7.2 Future Directions

Though I have covered a wide spectrum of the bot phenomenon, a list of work remains outstanding. This dissertation paves the way for more research into this developing phenomenon, as outlined below.
One of the most pressing issues is obtaining and updating ground-truth datasets for supervised classification. Supervised learning, particularly classification, requires a training sample that is most often created by human annotators. This task is tedious and requires considerable boilerplate: a task description, recruiting annotators, preprocessing the data to make it human-readable and understandable, and ensuring high quality through verification of results. All of this comes at a cost in time and money, and it is all but impossible to scale or diversify to another dataset. Despite these drawbacks, human annotators typically produce high-quality annotations for two reasons: (i) their cognitive ability to relate terms, and not be restricted to the given set of terms but use a term that represents all of them, e.g. the words "chapters, contents, index" immediately bring the term 'book' to our minds; and (ii) their ability to grasp context beyond the corpus. Though nearly impossible to accomplish without human or manual participation, this could perhaps be alleviated by extending Stweeler to automatically verify and flag post-classified datasets for bot and human labels.

Despite the flexibility of unsupervised learning methods, they are prone to inaccuracy if not applied properly. There is a great opportunity to extend Stweeler with a combination of semi-supervised (such as [68]) and unsupervised approaches to continue automated labelling of bot categories. This will enable deeper understanding of the latent bot categories that we do not yet know about.
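Reference [68] is a Labeled LDA model; purely as an illustration of the semi-supervised direction proposed here (propagating a handful of verified category labels across unlabelled bot accounts), a sketch using scikit-learn's LabelSpreading follows. The feature matrix and the partially labelled vector are placeholders, not Stweeler data.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# X: (n_accounts, n_features) matrix of per-account features, e.g.
# those used for clustering in Chapter 6. y holds a category id for
# the few manually verified accounts and -1 for unlabelled ones.
X = np.random.rand(100, 8)             # placeholder feature matrix
y = np.full(100, -1)
y[:10] = np.random.randint(0, 3, 10)   # a handful of verified labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
predicted_categories = model.transduction_  # labels for all accounts
```

The appeal of such an approach is that each new batch of manual verifications immediately improves the labels of the entire unlabelled population, amortising the annotation cost discussed above.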
7.3 Last Thoughts

Automation in social systems is a genuinely new direction. Made possible by machine learning and language processing, its power is unprecedented and its effects are profound. The impact factors of social automation are hard to measure, due to the interdisciplinary knowledge required and the issues it raises across business, ethics, law, sociology and practical computing systems. In this dissertation I have taken the first few steps to address the implementation requirements, which should enable researchers of the future to build the tools needed for understanding this nascent social phenomenon. Nonetheless, the age of cognisant machines is here.

Bibliography

[1] Norah Abokhodair, Daisy Yoo, and David W. McDonald. Dissecting a social botnet: Growth, content and influence in Twitter. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '15, pages 839–851, New York, NY, USA, 2015. ACM.
[2] David W. Aha, Dennis Kibler, and Marc K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37–66, 1991.
[3] Luca Maria Aiello, Martina Deplano, Rossano Schifanella, and Giancarlo Ruffo. People are strange when you're a stranger: Impact and influence of bots on social networks. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012.
[4] Aris Anagnostopoulos, Ravi Kumar, and Mohammad Mahdian. Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 7–15, New York, NY, USA, 2008. ACM.
[5] Jonell Baltazar, Joey Costoya, and Ryan Flores. The real face of Koobface: The largest Web 2.0 botnet explained. Trend Micro Research, 5(9):10, 2009.
[6] Marco T. Bastos and Dan Mercea. The Brexit botnet and user-generated hyperpartisan news. Social Science Computer Review, 2017. Online first, article 0894439317734157.
[7] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida. Detecting spammers on Twitter. In Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), volume 6, page 12, 2010.
[8] Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 U.S. presidential election online discussion. First Monday, 21(11), 2016.
[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[10] Yazan Boshmaf, Ildar Muslukhov, Konstantin Beznosov, and Matei Ripeanu. The socialbot network: When bots socialize for fame and money. In Proceedings of the 27th Annual Computer Security Applications Conference, ACSAC '11, pages 93–102, New York, NY, USA, 2011. ACM.
[11] Yazan Boshmaf, Ildar Muslukhov, Konstantin Beznosov, and Matei Ripeanu. Design and analysis of a social botnet. Computer Networks, 57(2):556–578, 2013. Botnet activity: Analysis, detection and shutdown.
[12] D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, pages 1–10, Jan 2010.
[13] Danah Boyd. The politics of "real names". Commun. ACM, 55(8):29–31, August 2012.
[14] Jian Cao, Qiang Li, Yuede Ji, Yukun He, and Dong Guo. Detection of forwarding-based malicious URLs in online social networks. International Journal of Parallel Programming, 44(1):163–180, Feb 2016.
[15] Qiang Cao, Michael Sirivianos, Xiaowei Yang, and Tiago Pregueiro. Aiding the detection of fake accounts in large scale social online services. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, pages 15–15, Berkeley, CA, USA, 2012. USENIX Association.
[16] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and P. Krishna Gummadi. Measuring user influence in Twitter: The million follower fallacy. ICWSM, 10(10-17):30, 2010.
[17] Kuan-Ta Chen, Hsing-Kuo Kenneth Pao, and Hong-Chung Chang. Game bot identification based on manifold learning. In Proceedings of the 7th ACM SIGCOMM Workshop on Network and System Support for Games, NetGames '08, pages 21–26, New York, NY, USA, 2008. ACM.
[18] Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. Who is tweeting on Twitter: Human, bot, or cyborg? In Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC '10, pages 21–30, New York, NY, USA, 2010. ACM.
[19] Zi Chu, Indra Widjaja, and Haining Wang. Detecting social spam campaigns on Twitter. In Feng Bao, Pierangela Samarati, and Jianying Zhou, editors, Applied Cryptography and Network Security, pages 455–472, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[20] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[21] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. CoRR, abs/1701.03017, 2017.
[22] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, pages 273–274, Republic and Canton of Geneva, Switzerland, 2016. International World Wide Web Conferences Steering Committee.
[23] O. V. Deryugina. Chatterbots. Scientific and Technical Information Processing, 37(2):143–147, Apr 2010.
[24] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. arXiv preprint arXiv:1802.02679, 2018.
[25] Nicolas Dugué, Anthony Perez, Maximilien Danisch, Florian Bridoux, Amélie Daviau, Tennessy Kolubako, Simon Munier, and Hugo Durbano. A reliable and evolutive web application to detect social capitalists. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 741–744. ACM, 2015.
[26] Chad Edwards, Autumn Edwards, Patric R. Spence, and Ashleigh K. Shelton. Is that a bot running the social media feed? Testing the differences in perceptions of communication quality for a human agent and a bot agent on Twitter. Computers in Human Behavior, 33:372–376, 2014.
[27] Emilio Ferrara. Disinformation and social bot operations in the run up to the 2017 French presidential election. First Monday, 22(8), 2017.
[28] Alessandro Finamore, Marco Mellia, Zafar Gilani, Konstantina Papagiannaki, Vijay Erramilli, and Yan Grunenberger. Is there a case for mobile phone content pre-staging? In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '13, pages 321–326, New York, NY, USA, 2013. ACM.
[29] Asbjørn Følstad and Petter Bae Brandtzæg. Chatbots and the new world of HCI. Interactions, 24(4):38–42, June 2017.
[30] Michelle C. Forelle, Philip N. Howard, Andrés Monroy-Hernández, and Saiph Savage. Political bots and the manipulation of public opinion in Venezuela. 2015.
[31] Carlos Freitas, Fabricio Benevenuto, Saptarshi Ghosh, and Adriano Veloso. Reverse engineering socialbot infiltration strategies in Twitter. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, ASONAM '15, pages 25–32, New York, NY, USA, 2015. ACM.
[32] Hongyu Gao, Jun Hu, Christo Wilson, Zhichun Li, Yan Chen, and Ben Y. Zhao. Detecting and characterizing social spam campaigns. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 35–47, New York, NY, USA, 2010. ACM.
[33] Saptarshi Ghosh, Bimal Viswanath, Farshad Kooti, Naveen Kumar Sharma, Gautam Korlam, Fabricio Benevenuto, Niloy Ganguly, and Krishna Phani Gummadi. Understanding and combating link farming in the Twitter social network. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 61–70, New York, NY, USA, 2012. ACM.
[34] Steven Gianvecchio and Haining Wang. Detecting covert timing channels: An entropy-based approach. In CCS '07. ACM, 2007.
[35] Steven Gianvecchio, Zhenyu Wu, Mengjun Xie, and Haining Wang. Battle of Botcraft: Fighting bots in online games with human observational proofs. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS '09, pages 256–268, New York, NY, USA, 2009. ACM.
[36] Zafar Gilani, Jon Crowcroft, Reza Farahbakhsh, and Gareth Tyson. The implications of twitterbot generated data traffic on networked systems. In Proceedings of the SIGCOMM Posters and Demos, SIGCOMM Posters and Demos '17, pages 51–53, New York, NY, USA, 2017. ACM.
[37] Zafar Gilani, Reza Farahbakhsh, and Jon Crowcroft. Do bots impact Twitter activity? In Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, pages 781–782, Republic and Canton of Geneva, Switzerland, 2017. International World Wide Web Conferences Steering Committee.
[38] Zafar Gilani, Reza Farahbakhsh, Gareth Tyson, Liang Wang, and Jon Crowcroft. An in-depth characterisation of bots and humans on Twitter. arXiv preprint arXiv:1704.01508, 2017.
[39] Zafar Gilani, Reza Farahbakhsh, Gareth Tyson, Liang Wang, and Jon Crowcroft. Of bots and humans (on Twitter). In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, ASONAM '17, pages 349–354, New York, NY, USA, 2017. ACM.
[40] Zafar Gilani, Ekaterina Kochmar, and Jon Crowcroft. Classification of Twitter accounts into automated agents and human users. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, ASONAM '17, pages 489–496, New York, NY, USA, 2017. ACM.
[41] Zafar Gilani, Liang Wang, Jon Crowcroft, Mario Almeida, and Reza Farahbakhsh. Stweeler: A framework for Twitter bot analysis. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, pages 37–38, Republic and Canton of Geneva, Switzerland, 2016. International World Wide Web Conferences Steering Committee.
[42] Chris Grier, Kurt Thomas, Vern Paxson, and Michael Zhang. @spam: The underground on 140 characters or less. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS '10, pages 27–37, New York, NY, USA, 2010. ACM.
[43] James Grimmelmann. The law and ethics of experiments on social media users. J. on Telecomm. & High Tech. L., 13:219, 2015.
[44] Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282, Aug 1995.
[45] Bernie Hogan. Pseudonyms and the rise of the real-name web. 2012.
[46] Eduard Hovy and Chin-Yew Lin. Automated text summarization and the SUMMARIST system. In Proceedings of a Workshop Held at Baltimore, Maryland: October 13–15, 1998, TIPSTER '98, pages 197–214, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics.
[47] Eduard H. Hovy. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63(1):341–385, 1993.
[48] Philip N. Howard and Bence Kollanyi. Bots, #StrongerIn, and #Brexit: Computational propaganda during the UK-EU referendum. Browser Download This Paper, 2016.
[49] Qian Huang, Zhu Liu, Aaron Rosenberg, David Gibbon, and Behzad Shahraray. Automated generation of news content hierarchy by integrating audio, video, and text information. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 6, pages 3025–3028. IEEE, 1999.
[50] H. Husna, S. Phithakkitnukoon, and R. Dantu. Traffic shaping of spam botnets. In CCNC 2008. IEEE, 2008.
[51] Yuede Ji, Yukun He, Xinyang Jiang, Jian Cao, and Qiang Li. Combating the evasion mechanisms of social bots. Comput. Secur., 58(C):230–249, May 2016.
[52] George H. John. Robust decision trees: Removing outliers from databases. In KDD, pages 174–179, 1995.
[53] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 67–84, Cham, 2016. Springer International Publishing.
[54] Erhan J. Kartaltepe, Jose Andre Morales, Shouhuai Xu, and Ravi Sandhu. Social network-based botnet command-and-control: Emerging threats and countermeasures. In Jianying Zhou and Moti Yung, editors, Applied Cryptography and Network Security, pages 511–528, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
[55] Balachander Krishnamurthy, Phillipa Gill, and Martin Arlitt. A few chirps about Twitter. In Proceedings of the First Workshop on Online Social Networks, WOSN '08, pages 19–24, New York, NY, USA, 2008. ACM.
[56] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
[57] Kyumin Lee, James Caverlee, and Steve Webb. Uncovering social spammers: Social honeypots + machine learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 435–442, New York, NY, USA, 2010. ACM.
[58] Kyumin Lee, Brian David Eoff, and James Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In ICWSM, 2011.
[59] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, pages 1928–1936, 2017.
[60] André L. B. Miranda, Luís Paulo F. Garcia, André C. P. L. F. Carvalho, and Ana C. Lorena. Use of classification algorithms in noise detection and elimination. In Emilio Corchado, Xindong Wu, Erkki Oja, Álvaro Herrero, and Bruno Baruque, editors, Hybrid Artificial Intelligence Systems, pages 417–424, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.
[61] Silvia Mitter, Claudia Wagner, and Markus Strohmaier. Understanding the impact of socialbot attacks in online social networks. CoRR, abs/1402.6289, 2014.
[62] Fabrice Muhlenbach, Stéphane Lallich, and Djamel A. Zighed. Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1):89–109, Jan 2004.
[63] Max Nanis, Ian Pearce, and Tim Hwang. PacSocial: Field test report, 2011.
[64] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1196–1204. Curran Associates, Inc., 2013.
[65] Jose Nazario and Thorsten Holz. As the net churns: Fast-flux botnet observations. In Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on, pages 24–31. IEEE, 2008.
[66] Christopher Olston and Marc Najork. Web crawling. Found. Trends Inf. Retr., 4(3):175–246, March 2010.
[67] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[68] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 248–256, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
[69] Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. Truthy: Mapping the spread of astroturf in microblog streams. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 249–252, New York, NY, USA, 2011. ACM.
[70] Jacob Ratkiewicz, Michael Conover, Mark R. Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. Detecting and tracking the spread of astroturf memes in microblog streams. CoRR, abs/1011.3768, 2010.
[71] Saiph Savage, Andres Monroy-Hernandez, and Tobias Höllerer. Botivist: Calling volunteers to action using online bots. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW '16, pages 813–822, New York, NY, USA, 2016. ACM.
[72] Lauren Scissors, Moira Burke, and Steven Wengrovitz. What's in a Like?: Attitudes and behaviors around receiving Likes on Facebook. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW '16, pages 1501–1510, New York, NY, USA, 2016. ACM.
[73] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[74] Ashutosh Singh. Social networking for botnet command and control. 2012.
[75] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and William B. Dolan. A neural network approach to context-sensitive generation of conversational responses. In HLT-NAACL, pages 196–205. Association for Computational Linguistics, May–June 2015.
[76] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC '10, pages 1–9, New York, NY, USA, 2010. ACM.
[77] V. S. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, and F. Menczer. The DARPA Twitter bot challenge. Computer, 49(6):38–46, June 2016.
[78] Jiang-wen Sun, Feng-ying Zhao, Chong-jun Wang, and Shi-fu Chen. Identifying and correcting mislabeled training instances. In Future Generation Communication and Networking (FGCN 2007), volume 1, pages 244–250. IEEE, 2007.
[79] Ruck Thawonmas, Yoshitaka Kashifuji, and Kuan-Ta Chen. Detection of MMORPG bots based on behavior analysis. In Proceedings of the 2008 International Conference on Advances in Computer Entertainment Technology, ACE '08, pages 91–94, New York, NY, USA, 2008. ACM.
[80] Kurt Thomas, Chris Grier, and Vern Paxson. Adapting social spam infrastructure for political censorship. In LEET, 2012.
[81] Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. Suspended accounts in retrospect: An analysis of Twitter spam. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC '11, pages 243–258, New York, NY, USA, 2011. ACM.
[82] Christian Thurau, Christian Bauckhage, and Gerhard Sagerer. Learning human-like movement behavior for computer games. In Proc. Int. Conf. on the Simulation of Adaptive Behavior, pages 315–323, 2004.
[83] Onur Varol, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107, 2017.
[84] A. Vishwanath, Jiazhen Zhu, K. Hinton, R. Ayre, and R. S. Tucker. Estimating the energy consumption for packet processing, storage and switching in optical-IP routers. In OFC/NFOEC, 2013, pages 1–3, March 2013.
[85] Bimal Viswanath, Muhammad Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. In USENIX Security, volume 14, 2014.
[86] Claudia Wagner, Silvia Mitter, Christian Körner, and Markus Strohmaier. When social bots attack: Modeling susceptibility of users in online social networks. Making Sense of Microposts (#MSM2012), 2, 2012.
[87] De Wang, Shamkant B. Navathe, Ling Liu, Danesh Irani, Acar Tamersoy, and Calton Pu. Click traffic analysis of short URL spam on Twitter. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013 9th International Conference on, pages 250–259. IEEE, 2013.
[88] Jen Weedon, William Nuland, and Alex Stamos. Information operations and Facebook. Version, 1:27, 2017.
[89] Joseph Weizenbaum. Eliza – a computer program for the study of natural language communication between man and machine. Commun. ACM, 9(1):36–45, January 1966.
[90] Xian Wu, Ziming Feng, Wei Fan, Jing Gao, and Yong Yu. Detecting marionette microblog users for improved information credibility, pages 483–498. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
[91] Jeff Yan. Bot, cyborg and automated Turing test. In International Workshop on Security Protocols, pages 190–197. Springer, 2006.
[92] Chao Yang, Robert Harkreader, Jialong Zhang, Seungwon Shin, and Guofei Gu. Analyzing spammers' social networks for fun and profit: A case study of a cyber criminal ecosystem on Twitter. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 71–80, New York, NY, USA, 2012. ACM.
[93] Louis Yu, Sitaram Asur, and Bernardo Huberman. Dynamics of trends and attention in Chinese social media. 2013.
[94] Jinxue Zhang, Rui Zhang, Yanchao Zhang, and Guanhua Yan. On the impact of social botnets for spam distribution and digital-influence manipulation. In Communications and Network Security (CNS), 2013 IEEE Conference on, pages 46–54. IEEE, 2013.
[95] Ziming Zhao, Gail-Joon Ahn, and Hongxin Hu. Examining social dynamics for countering botnet attacks. In Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, pages 1–5. IEEE, 2011.

Appendix A

Tasks, Experiments and Ethics Approval

A.1 Human Annotation Task for Binary Classification

The Human (or Manual) Annotation Task adheres to the ethical considerations of the institutional ethics review board at the University of Cambridge Computer Laboratory^1. This task is only indicative and informative, not disruptive or decisive.

^1 https://www.cl.cam.ac.uk/local/policy/ethics/

A.1.1 Task Description

We recruited four undergraduate students for the purposes of annotation, who classified the accounts over the period of a month. This was done using a tool that automatically presents Twitter profiles and allows the recruits to annotate each profile with a classification (bot or human) and add any extra comments. Each account was reviewed by all recruits independently, before being aggregated into a final judgement through a collective review (via discussion among recruits where needed).

Human annotators were paid per task successfully performed. The per-item payment made to the 4 annotators was roughly USD 0.11 (PKR 11) per annotation, for 3,535 annotations. The task was completed in August 2016. A receipt of the payment that confirms the date can be requested via email (szuhg2@cam.ac.uk).

The task is to create a labelled dataset given a list of Twitter accounts (screen names) and a list of sources for these accounts. There are four lists each for Twitter accounts and their associated sources:

1. 10M followers
2. 1M followers
3. 100k followers
4. 1k followers

Note: It is recommended that at least 3–4 people perform this task independently of each other, for fairness, cross-inspection (inter-annotator agreement), and calculating confidence levels (Cohen's kappa). It is the responsibility of the human worker to make sure these lists are kept segregated.

The following attributes from the Twitter profile are provided for labelling an account as either human or bot:

1. date when the account was created (it is not entirely clear whether bot accounts tend to be older than human ones)
2. number of tweets, retweets, and tweet frequency = number of tweets / age of account in days (if an account posts more than 25 tweets a day, it has a higher chance of being automated)
3. do they reply to tweets? (replying to tweets is an indication of human behaviour)
4. content they post on their Twitter wall (tweeting about certain topics only is a sign of automation)
5. number of favourited tweets (a higher number is associated with human behaviour)
6. ratio of followers / friends (a higher ratio is associated with human behaviour)
7. account profile description and picture (a natural-looking description and a personal picture are signs of human behaviour)
8. number of URLs posted in tweets (more URLs in tweets point towards automated behaviour)
9. size of content uploaded (more content points towards automated behaviour)

The other important piece of information to consider is the sources used by a Twitter account to post content on Twitter:

1. number of sources used (a higher number is associated with human behaviour)
2. types of sources (humans tend to use the Twitter app from their devices such as smartphones and tablets, third-party applications, or the Web interface; whereas automated accounts might be using the API, scheduling tools, automating tools, etc.)

Note:
1. Known feature apps: echofon.com, snappytv.com, periscope.tv
2. Account sharing & scheduling: TweetDeck
3. Automation and scheduling: buffer.com, socialflow.com, hootsuite.com, sprinklr.com, spredfast.com, twuffer.com, sendible.com
4. Smart automation & scheduling: ifttt.com, dlvr.it

The worker might need to perform some research on the tools listed in the sources for each account; however, this is easy, as he/she mostly only needs to visit the URL of a source, given along with the source name in the source list. Using all this information, a human worker annotates a Twitter user as either human or bot, along with the reasons for the annotation, in the format of the example below (Table A.1).

Table A.1: HAT example.

Twitter screen name   Reason                                                      Annotation (bot, human)
khloekardashian       uses iPhone and iPad to post tweets                         human
nytimes               292K tweets since May 2010 = 130 tweets a day, and uses     bot
                      an automating tool, socialflow.com

Rules for payment:
- Successful annotation = payment.
- If the worker fails to provide an annotation, payment for that annotation will be discounted or withheld.
- If the worker provides an annotation but fails to give a well-defined reason in a phrase or a sentence, then the payment for that annotation will be discounted or withheld.
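To make the heuristics above concrete, they could be folded into a simple advisory score for annotators. The thresholds and most of the weights below are hypothetical readings of the attribute lists (only the 25-tweets-per-day cut-off is stated above); in the actual task the final judgement remained with the human annotators.

```python
def annotation_hints(profile):
    """profile: dict of attributes from the lists above. Returns
    (score, reasons); a positive score leans 'bot', negative 'human'."""
    score, reasons = 0, []
    if profile["tweets_per_day"] > 25:                 # attribute 2
        score += 1; reasons.append("high tweet frequency")
    if profile["replies"] > 0:                         # attribute 3
        score -= 1; reasons.append("replies to tweets")
    if profile["favourites"] > 100:                    # attribute 5, hypothetical cut-off
        score -= 1; reasons.append("many favourited tweets")
    if profile["followers"] / max(profile["friends"], 1) > 1:   # attribute 6
        score -= 1; reasons.append("high follower/friend ratio")
    if profile["urls_per_tweet"] > 0.5:                # attribute 8, hypothetical cut-off
        score += 1; reasons.append("many URLs in tweets")
    automation_sources = {"dlvr.it", "hootsuite.com", "twuffer.com", "ifttt.com"}
    if automation_sources & set(profile["sources"]):   # sources note 3-4
        score += 1; reasons.append("automation/scheduling source")
    return score, reasons
```

Such a score would only pre-sort profiles for review; attributes 1, 4, 7 and 9 above are inherently judgement calls that resist simple thresholds, which is precisely why human annotation was used.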
A.1.2 Ethics Approval #379

The ethics approval form as filled in, and subsequently approved, is reproduced below.

TITLE: Characterising usage and impact of bots on Twitter
APPLICANTS: Syed Zafar ul Hussan Gilani, Jon Crowcroft
EMAIL: szuhg2@cam.ac.uk, jac22@cam.ac.uk
DATES: 01/07/2016 to 30/09/2016
STUDY TYPE: Other
FUNDING BODY: EU MARIE CURIE METRICS ITN

DESCRIPTION
The WWW has seen massive growth in the variety and opportunistic usage of OSNs. Most of these pursuits are exploited via automated programs, aka bots. We know for a fact that more than 45% of the clicks we get on tweets are from bots. Stweeler is a framework under development to study the usage and impact of bots on Twitter from social media and systems perspectives. Our aim is to define and measure metrics to analyse how automated programs impact (1) user engagement, (2) content dissemination, (3) the geographical spread of tweets, and (4) traffic contributed on the Web due to tweets (or due to the Twitter CDN). Our goal is to model the impact of automation on information propagation in OSNs.

For this purpose we require a labelled dataset: essentially, a list of Twitter accounts categorised / annotated / labelled as either humans or bots. Since Machine Learning techniques fall short of accurately judging an account as either human or bot, we would like human workers to carry out the task. This will be done using Amazon Mechanical Turk. We are not studying any human workers or their responses/behaviour. This is purely an activity to create lists of human accounts and bot accounts divided into four buckets: (i) approx. 1M followers, (ii) approx. 100k followers, (iii) approx. 1k followers, and (iv) approx. 500 followers. The labelled dataset will be used to characterise the differences between human Twitter accounts and bot Twitter accounts, measure the impact of bot accounts on Twitter, and evaluate Machine Learning approaches to bot detection against this dataset.

We will provide four lists and their corresponding sources lists, one for each bracket. The human workers will look at the Twitter profiles of those users and compare their attributes, such as when the account was created, number of tweets, whether they reply to tweets, what kind of material they tweet about, number of favourite tweets, number of following (friends), number of followers, account profile description, account profile picture, etc. They will then look at the sources list to find the number and types of sources a Twitter user employs to post content on Twitter: smartphone, tablet, Web interface, third-party app, API, scheduling tools, etc. Using all this information a human worker will annotate a Twitter user as either human or bot.

PRECAUTIONS
All collected data from Twitter is public. Collection is done via the Twitter Streaming API. All annotations will be done using a controlled method and will reflect the outputs of that method, along with what a human worker rates as the more important attributes, e.g. number of tweets vs. number of followers. No personal information will be collected regarding human workers.

A.2 Honeypot Experiment

The Honeypot Experiment adheres to the ethical considerations of the institutional ethics review board at the University of Cambridge Computer Laboratory^2. This task is non-intrusive and non-engaging.

^2 https://www.cl.cam.ac.uk/local/policy/ethics/

A.2.1 Task Description

A honeypot bot is deployed on a web server that operates a Twitter account. The bot uses the Twitter Streaming API to tweet job opportunities including shortened URLs. These URLs are shortened by the shortener service running on the web server.
The shortener is needed to enable redirecting click traffic to the web server in order to collect click logs. The bot is non-intrusive and non-engaging. This experiment helps to find bots that exist on the Web, i.e. crawlers, indexers, spiders and curators.

The bot algorithm follows these steps: (i) The bot searches for a popular job-related tweet from the Twitter Streaming API. It then disassembles the text and URL in the tweet. (ii) The URL is then fetched into the web server. (iii) The bot reassembles the tweet using the text and the shortened URL. (iv) The tweet is then posted to my bot's Twitter account. (v) Finally, whenever a user (Twitter user or from the Web) clicks on a tweet or URL, the web server records the click.
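A compact sketch of this five-step loop follows. The Twitter call is shown with the tweepy library purely for illustration (the deployed bot was built on Ruby gems, Appendix D), and shorten() stands in for the web server's shortener module, so the names here are assumptions.

```python
import re
import tweepy   # illustrative only; the deployed bot is written in Ruby

URL_RE = re.compile(r"https?://\S+")

def shorten(url):
    """Stand-in for the web server's shortener module: it stores the
    original URL and returns e.g. http://tnyurl.uk/<code>, so that
    clicks hit the web server (which logs them) before redirecting."""
    raise NotImplementedError

def repost_job_tweet(api, status):
    # (i) disassemble the text and URL of a popular job-related tweet
    urls = URL_RE.findall(status.text)
    if not urls:
        return
    text = URL_RE.sub("", status.text).strip()
    # (ii) fetch the URL into the web server and shorten it
    short_url = shorten(urls[0])
    # (iii) + (iv) reassemble and post from the bot's own account;
    # posting afresh (rather than retweeting) keeps the shortened
    # link, and hence the click logging, in the loop
    api.update_status(f"{text} {short_url}")
    # (v) click logging happens on the web server whenever the
    # short link is visited
```

The design choice worth noting is step (iii)+(iv): a plain retweet would carry the original URL and bypass the shortener, so no click logs could be obtained.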
A.2.2 Ethics Approval #556

The ethics approval form as filled in, and subsequently approved, is reproduced below.

TITLE: The impact of Web bots on Twitter content
APPLICANTS: Syed Zafar ul Hussan Gilani, Jon Crowcroft
EMAIL: szuhg2@cam.ac.uk, jac22@cam.ac.uk
DATES: 21/11/2015 to 08/01/2017
STUDY TYPE: Other
FUNDING BODY: EU MARIE CURIE METRICS ITN

DESCRIPTION
This application is to check whether the experiment (detailed below) fulfilled ethical considerations, since this type of study does not study people, recruit outside participants, collect information on people, or even release software.

The experiment deploys a honeypot bot that operates a Twitter account. The bot only tweets job opportunities including shortened URLs. These URLs are shortened by a shortener service running on a deployed web server. The bot is non-intrusive and non-engaging, i.e. the bot does not engage in communication with other Twitter users. The purpose of this experiment was to find bots that exist on the Web, i.e. crawlers, indexers and curators, among others.

The web server collects all clicks performed on tweets posted by the bot. Click data can only be collected for those tweets which contain a shortened URL. Once a click is performed on the URL, the URL is redirected to the web server where the click is logged, before the click is redirected to the original source. The following pieces of information were collected: {timestamp, web browser or app name, IP address of web browser or app}.

As data is collected outside the Twitter platform, no user data (such as Twitter username, profile info, etc.) was collected. In fact, it was impossible to collect user data, since the web server can only collect click data and no information about who clicked. The timestamp is the date and time of the click; the web browser or app name is the software used to click (e.g. Chrome, Twitter Web App, Googlebot, Applebot, etc.); and the IP address is collected to plot a time series of repeating sources as a heuristic to identify Web bots (crawlers, indexers, curators). We did not share this data with any external entity and we did not try to identify the source of clicks.

PRECAUTIONS
Data is not shared with any external entity. Minimum information is collected, i.e. timestamp, browser or app name, and IP address of browser or app.

Appendix B

Publications

This is a comprehensive list of papers published in conferences, in reverse chronological order, during my PhD (September 2014 to August 2017). Bold face shows publications that are directly relevant to this dissertation.

[36] Zafar Gilani, Jon Crowcroft, Reza Farahbakhsh, and Gareth Tyson. "The Implications of Twitterbot Generated Data Traffic on Networked Systems." In Proceedings of the SIGCOMM Posters and Demos, pp. 51-53. ACM, 2017.

[40] Zafar Gilani, Ekaterina Kochmar, and Jon Crowcroft. "Classification of Twitter Accounts into Automated Agents and Human Users." In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 489-496. ACM, 2017.

[39, 38] Zafar Gilani, Reza Farahbakhsh, Gareth Tyson, Liang Wang, and Jon Crowcroft. "Of Bots and Humans (on Twitter)." In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pp. 349-354. ACM, 2017.

[37] Zafar Gilani, Reza Farahbakhsh, and Jon Crowcroft. "Do Bots impact Twitter activity?" In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 781-782. International World Wide Web Conferences Steering Committee, 2017.

Andrés Arcia-Moret, Zafar Gilani, Arjuna Sathiaseelan, and Jon Crowcroft. "Peer provided cell-like networks built out of thin air." In Consumer Communications & Networking Conference (CCNC), 2017 14th IEEE Annual, pp. 369-372. IEEE, 2017.

Sarim Zafar, Usman Sarwar, Zafar Gilani, and Junaid Qadir. "Sentiment analysis of controversial topics on Pakistan's Twitter user-base." In ACM DEV, pp. 35-1. 2016.

[41] Zafar Gilani, Liang Wang, Jon Crowcroft, Mario Almeida, and Reza Farahbakhsh. "Stweeler: A framework for twitter bot analysis." In Proceedings of the 25th International Conference Companion on World Wide Web, pp. 37-38. International World Wide Web Conferences Steering Committee, 2016.

Zafar Gilani, Arjuna Sathiaseelan, Jon Crowcroft, and Veljko Pejović. "Inferring network infrastructural behaviour during disasters." In Consumer Communications & Networking Conference (CCNC), 2016 13th IEEE Annual, pp. 642-645. IEEE, 2016.

Appendix C

Press, News and Print Media

This is a list of my research work covered by press, news and print media. The list only shows coverage by first-hop entities (i.e. entities who covered me directly), and does not include others who picked up the story from elsewhere.

Celebrities Tweet Like Bots. In Scientific American 60-Second Science podcast, Saturday, 5 August 2017.

Cambridge Study finds that Celebrity Twitter Accounts act like Bots. In Digital Trends, Sunday, 6 August 2017.

Celebrity Twitter accounts display 'bot-like' behaviour. In University of Cambridge Office of External Affairs and Communications, Wednesday, 2 August 2017.

Twitter 'Celebrity' Accounts Behave Like Bots, Not Humans, Study Finds. In International Business Times, Wednesday, 2 August 2017.

'Celebrity' Twitter accounts act like bots. In The Hindu, Wednesday, 2 August 2017.

Appendix D

Environment - Platforms, Systems, Resources, Dashboard

Given the range of related questions studied in this work, a number of different environments, platforms and systems had to be made available. These are summarised below.

Platforms: Ruby, Ruby gems (nokogiri, rest-client 1.1, thor, tree, mechanize, twitter 5.15, tweetstream, json, twitter_ebooks, shortener), Ruby on Rails, Embedded Ruby (ERB), Python, Python modules (numpy, scipy, sklearn, langdetect, textblob).

Systems: I used a desktop/workstation for data collection from the Twitter Streaming API, as well as for all of the processing involved. Figure D.1 shows the CPU utilisation during data processing workloads.

[Figure D.1: A typical CPU workload graph during data processing.]
I also used a VM in the University of Cambridge Information Services DMZ as a live Web server to deploy the Twitter bot^1 (for the honeypot experiment), a Web server to capture the alternate clicks dataset, and a URL shortener. The Web server presents a dashboard^2 to display analytics around the clicks dataset (Figure D.2). Table D.1 shows the specifications of the two systems.

^1 The bot was non-invasive and did not engage in direct communication with Twitter users.
^2 Stweeler dashboard – http://svr-szuhg2-web.cl.cam.ac.uk/graph/graphs

[Figure D.2: Stweeler dashboard.]

Table D.1: System specification.

System               Specification
Desktop/Workstation  Ubuntu 14.04 LTS 64-bit, 15.5 GiB RAM, Intel Core i5-4690 CPU @ 3.50GHz x 4, Intel Haswell Desktop
Web Server           Ubuntu 16.04.3 LTS 64-bit, 4.0 GiB RAM, Intel Xeon E5-2650L v3 @ 1.80GHz x 2, Intel Haswell Desktop

Resources: I used the University of Cambridge network to obtain data from the Twitter Streaming API. Figure D.3 shows a screen capture of the network utilisation during the typical data collection routine. The code for data collection is available as part of Stweeler^3.

^3 Stweeler collector – https://github.com/zafargilani/stcs/blob/master/lib/collector.rb

[Figure D.3: A typical time graph during data collection.]

Challenges: As mentioned several times during the course of this dissertation, I used the Twitter Streaming API to collect data on a daily basis. This amounted to 2.5 to 3 million tweets per day. I did not use any keywords, which let me collect everything that was available from the API. During the data collection process I encountered the following challenges: expiring OAuth tokens and keys, API errors, and local system failures.

I also deployed a Twitter bot as part of the honeypot experiment, which was operationalised using the web server. During the operational life of the bot I encountered the following challenges: tweet rate limits, limits on following people, API errors, and occasionally passing two-factor verification by Twitter.
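For illustration, a keyword-free Streaming API collector with the simple reconnect-and-back-off behaviour these challenges call for might look as follows. The deployed collector is the Ruby script referenced above; this Python sketch uses the tweepy 3.x-era streaming API, and the credentials are placeholders.

```python
import json
import time
import tweepy   # illustrative only; Stweeler's collector is a Ruby script

class SampleCollector(tweepy.StreamListener):
    """Writes every sampled tweet to a daily JSON-lines file."""
    def on_status(self, status):
        fname = time.strftime("tweets-%Y%m%d.jsonl")
        with open(fname, "a") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Returning False on HTTP 420 stops the stream so the loop
        # below can back off instead of hammering the rate limit.
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

while True:   # survive API errors and dropped connections
    try:
        # sample() is the keyword-free endpoint: everything that the
        # Streaming API makes available, as described above.
        tweepy.Stream(auth, SampleCollector()).sample()
    except Exception:
        time.sleep(60)   # simple backoff before reconnecting
```

Writing one file per day keeps the 2.5 to 3 million daily tweets in manageable chunks and limits the damage of a local system failure to at most one day's partial file.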