A tight scrape: Methodological approaches to cybercrime research data collection in adversarial environments

Accepted version
Peer-reviewed

Abstract

In this article we outline a study of ‘adversarial scraping’ for academic research, which involves the collection of data from websites that implement defences against traditional web scraping tools. Although this is primarily a research methods article, it also constitutes a valuable systematic accounting of the different defensive techniques used by the administrators of illicit online services. Some of these administrators intentionally implement functionality that attempts to prevent web scrapers from gathering data from their site, and some unintentionally design their sites in ways that make data gathering harder. This is of particular importance for criminological research, where websites such as cryptomarkets and underground forums are publicly available (and hence there is an ethical case for data collection), but the illicit activity involved means that the administrators of these services limit scraping. We classify the different anti-crawling techniques used by websites and outline the countermeasures we developed. Based on this, we evaluate which of these methods do and do not succeed at preventing data gathering from a website, as well as those which impact the scraper but do not necessarily prevent the data from being obtained. We find that there are some defences that, if used together, might thwart scraping. There is also a series of defences that are successful at slowing down scrapers, making historical scraping more difficult. On the other hand, we show that many defences are easy to work around and do not impact scraping.

Journal Title

Proceedings 5th IEEE European Symposium on Security and Privacy Workshops, EuroS&PW 2020

Conference Name

2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)

Volume Title

00

Publisher

IEEE

Rights and licensing

Except where otherwise noted, this item's license is described as All Rights Reserved

Sponsorship

EPSRC (EP/V026178/1)