Recommender systems and market approaches for industrial data management

PhD thesis

Torben Jeß
Robinson College
Distributed Information and Automation Lab
Institute for Manufacturing
Department of Engineering
University of Cambridge

6 December 2017

This dissertation is submitted for the degree of Doctor of Philosophy

Abstract

Industrial companies are dealing with an increasing data overload problem in all aspects of their business: vast amounts of data are generated in and outside each company. Determining which data is relevant and how to get it to the right users is becoming increasingly difficult. There are a large number of datasets to consider, and an even larger number of combinations of datasets that each user could be using. Current techniques to address this data overload problem require detailed analysis. Their manual effort and complexity limit their scalability, making them impractical for large numbers of datasets. Search, the alternative used by many users, is limited by the user's knowledge of the available data and does not consider the relevance or the costs of providing these datasets.

Recommender systems and so-called market approaches have previously been used to solve this type of resource allocation problem, for example in the allocation of equipment for production processes in manufacturing or in spare part supplier selection. They are therefore also promising candidates for addressing the problem of data overload. This thesis introduces the so-called RecorDa approach: an architecture that uses market approaches and recommender systems, on their own or combined into one system, to identify which data is more relevant to a user's decision and to improve the allocation of relevant data to users.

Using a combination of case studies and experiments, this thesis develops and tests the approach. It further compares RecorDa to search and other mechanisms. The results indicate that RecorDa can provide significant benefit to users, offering easier and more flexible access to relevant datasets than other techniques, such as search in these databases. It achieves a fast increase in precision and recall of relevant datasets while maintaining high novelty and coverage across a large variety of datasets.

Declaration

I hereby declare that this dissertation titled 'Recommender systems and market approaches for industrial data management' is the result of my own work and includes nothing which is the outcome of work done in collaboration, except as declared in the Preface and specified in the text.

Additionally, this dissertation is not substantially the same as any that I have submitted, or is being concurrently submitted, for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, except as declared in the Preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution, except as declared in the Preface and specified in the text.

This dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes, tables and equations, and fewer than 150 figures.
Torben Jeß
Cambridge, 6 December 2017

Table of Contents

ABSTRACT
DECLARATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
  1.1 The data overload problem
  1.2 Introducing recommender systems and market approaches
    1.2.1 Recommender systems and data overload
    1.2.2 Market approaches and the resource allocation problem
  1.3 Research questions and methodology
  1.4 Using recommender systems and market approaches for data allocation
  1.5 Definitions
  1.6 Evaluation of the RecorDa approach
  1.7 Thesis novelty, results, and contributions
  1.8 Applicability of this research
  1.9 Key assumptions
  1.10 Thesis outline
2. RESEARCH BACKGROUND
  2.1 Introduction
  2.2 Industry background
    2.2.1 Increasing amounts of data and data users
    2.2.2 Increasing task and organisational complexity
    2.2.3 Data overload
  2.3 Data management
    2.3.1 Value of Information techniques
    2.3.2 Search
    2.3.3 Data analytics and business intelligence
    2.3.4 Data development
    2.3.5 Data architecture management
    2.3.6 Metadata management
    2.3.7 User interface design
    2.3.8 Overview
  2.4 Market approaches in data management
    2.4.1 Background
    2.4.2 Applications of market approaches in industrial companies
    2.4.3 Applications of market approaches in data management
  2.5 Recommender systems in data management
    2.5.1 Background
    2.5.2 Applications of recommender systems in industrial companies
    2.5.3 Applications of recommender systems in data management
  2.6 Summary
3. RESEARCH METHODOLOGY
  3.1 Research questions
  3.2 Research approach
    3.2.1 Epistemological approach
    3.2.2 Selected research approach
  3.3 Research methodology
  3.4 Summary
4. AN APPROACH TO USING RECOMMENDER SYSTEMS AND MARKETS
  4.1 Introduction
  4.2 Selection of high-level architecture
    4.2.1 Criteria for selection of high-level architecture
    4.2.2 Potential high-level architectures
    4.2.3 Comparison
  4.3 Main functionality
    4.3.1 Recommender system functionality setup
    4.3.2 Market approach functionality setup
    4.3.3 Setting up the interface between the recommender system and market approach
  4.4 The RecorDa approach
5. RECOMMENDER SYSTEM COMPONENT
  5.1 Introduction
  5.2 Data allocation with recommender systems
  5.3 Summary
6. MARKET APPROACH COMPONENT
  6.1 Introduction
  6.2 Overall market architecture
  6.3 Utility function
  6.4 The Value Map
  6.5 The costs of data
  6.6 The data allocation problem
  6.7 Market approaches for solving the data allocation problem
  6.8 Auction mechanisms
  6.9 Influencing the recommender system
  6.10 Summary
7. EVALUATION
  7.1 Introduction
  7.2 Evaluation measures
    7.2.1 Evaluation of known data
    7.2.2 Evaluation of unknown data
    7.2.3 Evaluation of computation time
  7.3 Methods for comparison
    7.3.1 Search
    7.3.2 Requirement analysis
    7.3.3 Decision-theory-based techniques
  7.4 Experimental evaluation
    7.4.1 Experimental design method
    7.4.2 Experimental environments
    7.4.3 Evaluation variables
    7.4.4 Experimental results – RecorDa approaches
    7.4.5 Experimental results – RecorDa vs. Search
  7.5 Case study evaluation
    7.5.1 Case study plan
    7.5.2 Case Study A: Manufacturing part procurement
    7.5.3 Case Study B: Healthcare part catalogue for customers and internal users
  7.6 Setup times
  7.7 Evaluation summary
8. DISCUSSION AND CONCLUSION
  8.1 Introduction
  8.2 Summary of research
  8.3 Key results
  8.4 Conclusion
  8.5 Novelty
  8.6 Contributions
  8.7 Limitations
  8.8 Future work
REFERENCES
ATTACHMENTS
  Attachment A: Experiment description sheets
  Attachment B: Case Study B dataset description
  Attachment C: Empirical evaluation
  Attachment D: Formulas for evaluation measures
  Attachment E: Case study process steps

List of Figures

Figure 1: Types of relevant data and their relations to each other
Figure 2: Concept of information overload (referred to as data overload in this thesis) from Eppler et al. [11]
Figure 3: Research process
Figure 4: Overview of architectures using market approaches and/or recommender systems
Figure 5: High-level architecture of data relevance evaluation and data allocation in RecorDa
Figure 6: Description of the different process steps of the recommender system
Figure 7: Description of the overall market architecture
Figure 8: Illustration of the data combination evaluation process for one user
Figure 9: An example of relevance allocations for all users regarding different combinations of datasets
Figure 10: Description of the experiments along the flow of the RecorDa approach architecture
Figure 11: Blurred picture of the existing graphical user interface (GUI) for the electronic catalogue
Figure 12: Experiment 1a: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 13: Experiment 1a: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 14: Experiment 1a: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 15: Experiment 1a: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 16: Experiment 1a: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 17: Experiment 1a: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 18: Experiment 1b: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 19: Experiment 1b: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 20: Experiment 1b: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 21: Experiment 1b: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 22: Experiment 1b: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 23: Experiment 1b: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 24: Experiment 1c: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 25: Experiment 1c: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 26: Experiment 1c: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 27: Experiment 1c: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 28: Experiment 1c: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 29: Experiment 1c: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 30: Experiment 2: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 31: Experiment 2: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 32: Experiment 2: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 33: Experiment 2: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 34: Experiment 2: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 35: Experiment 2: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 36: Experiment 3: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 37: Experiment 3: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 38: Experiment 3: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 39: Experiment 3: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 40: Experiment 3: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 41: Experiment 3: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 42: Search: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 43: Search: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 44: Search: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 45: Search: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 46: Search: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 47: Search: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 48: Case Study A: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 49: Case Study A: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 50: Case Study A: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 51: Case Study A: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 52: Case Study A: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 53: Case Study A: Average computation time over different recommendation iterations of different recommender system functions for all users
Figure 54: Case Study B: Average novelty over different recommendation iterations of different recommender system functions for all users
Figure 55: Case Study B: Average coverage over different recommendation iterations of different recommender system functions for all users
Figure 56: Case Study B: Average precision over different recommendation iterations of different recommender system functions for all users
Figure 57: Case Study B: Average precision for tables over different recommendation iterations of different recommender system functions for all users
Figure 58: Case Study B: Average recall for tables over different recommendation iterations of different recommender system functions for all users
Figure 59: Case Study B: Average computation time over different recommendation iterations of different recommender system functions for all users

List of Tables

Table 1: Key definitions for this thesis
Table 2: Matching types of data with their relevant evaluation metrics
Table 3: Illustration of the user problems caused by data overload and user task diversity
Table 4: Overview of data-related applications of market approaches
Table 5: Overview of different approaches to the data allocation problem and their degree of application and implementation
Table 6: Framework for research method adaptation based on Yin [179] and Kitchenham and Pickard [180]
Table 7: Factors relating to choice of research technique identified by Pfleeger [181]
Table 8: Comparison between different recommender system and market approach archetypes
Table 9: Overview of the detailed configurations for the recommender system component
Table 10: 'File Retention Policy Determination Methods' by Wijnhoven et al. [89]
Table 11: Describing the impact that the limitations of the Value Map have on algorithms using the Value Map for individual dataset evaluations
Table 12: Experimental design steps from Jess et al. [214] and an analysis for the three experiments
Table 13: Experiment validation questionnaire
Table 14: Different variation variables for the RecorDa approach with standalone recommender system and with market approach component
Table 15: Variations in users' rating behaviour
Table 16: Experiment 1a evaluations, average of experiment results
Table 17: Experiment 1b evaluations, average of experiment results
Table 18: Experiment 1c evaluations, average of experiment results
Table 19: Experiment 2, market approach component settings on the auction mechanism being used
Table 20: Experiment 2 evaluations, average of experiment results
Table 21: Experiment 3, influence of market mechanism on the recommender systems
Table 22: Experiment 3 evaluations, average of experiment results
Table 23: Experiment 4, settings of the different RecorDa approach setups used for further evaluation
Table 24: Experiment 4, influence of user behaviour on system performance
Table 25: Experiment 4 evaluations, average of experiment results
Table 26: Search evaluations, average of experiment results
Table 27: Case studies, settings of the different RecorDa setups with standalone recommender system and with market approach component for the case study evaluation
Table 28: Case study selection criteria
Table 29: Variables impacting the case study and how they were controlled or eliminated
Table 30: Case Study A, user interaction with the dataset
Table 31: Case Study A evaluations, average of experiment results
Table 32: Case Study B, user interaction with the dataset
Table 33: Case Study B evaluations, average of experiment results
Table 34: Describing the performance of different approaches with different relevance measures for the main case studies
Table 35: Procurement experiment description
Table 36: Production experiment description
Table 37: Support experiment description
Table 38: Description of different tables used for Case Study B
Table 39: Experiment 4 evaluations, average of experiment results
Table 40: Case A evaluations (details), average of experiment results
Table 41: Case B evaluations (details), average of experiment results
Table 42: Evaluation measures based on Shani and Gunawardana [208]
Table 43: Case Study A: Process steps descriptions
Table 44: Case Study B: Process steps descriptions

1. Introduction

1.1 The data overload problem

Industrial companies, such as suppliers, manufacturers, and distributors, maintain databases with large amounts of data [1], [3], [6], and this data is constantly increasing [1], [2], [4]–[8]. The data is allocated to users in the form of datasets to help them make better decisions, for example about supplier selection, manufacturing operations scheduling, and inventory management, among many other areas. However, due to the increase in the amount of data, finding the relevant data for a user can be difficult. Companies are often overloaded with datasets [4], [9]–[15] and often cannot decide which datasets to present to users. This is due to the following two data challenges faced by many industrial companies:

- Large amounts of data: Driven by the increasing availability of new technology, such as greater storage capacity and better sensor technologies, the amount of data grows by approximately 40-100% every year [1], [2], [4], [5].
- High variability of user data requirements: Task difficulty is driven up by a) the increasing diversity of tasks that a user is required to perform, and b) the automation of simpler tasks. Some studies suggest that task complexity per user increases by 6.7% every year [16], [58].

Due to these problems, companies miss opportunities in their data [4], [11], [12], [18]–[22]. For example, if procurement has data on all orders placed with a certain supplier, it can ask for discounts [23]. However, this is often not possible, for various reasons such as old legacy systems. In other situations, users are overloaded with data and cannot decide what is relevant to the decision at hand. This problem is called data overload [11].

Current approaches to these problems are requirement analysis and decision theory, which are intended to identify and provide the right data. However, these approaches often rely on static allocation of data to users, using fixed queries or manual user searches of various databases based on fixed requirements. Search, another alternative, typically returns only the data matching the keywords the user enters. All these approaches often require heavy implementation in complex information systems, difficult analysis of users' needs, or searching through company databases. Current solutions therefore i) enable little discovery of datasets previously unknown to the user or the organisation, and ii) do not prevent irrelevant data from appearing.

Recommender systems and market approaches have shown good results for similar types of problems. However, there is a lack of methods, applications, and proven benefits from these techniques regarding industrial data allocation. This thesis analyses architectural approaches using recommender systems and market approaches.
These approaches are used individually and in combination. Based on this analysis, the thesis identifies an approach called RecorDa (Recommender systems and market approach based data allocation) with two variations. One variation uses a standalone recommender system; the other combines the recommender system with a market approach. RecorDa provides the user with additional relevant data in a flexible manner and improves the user's decision-making. The recommender system finds interesting data and recommends it to the user. In the second variation, the market approach uses the rankings and details on data usage from the recommender system to further analyse data relevance and to influence the recommendations based on this analysis.

1.2 Introducing recommender systems and market approaches

1.2.1 Recommender systems and data overload

Users often have various choices (i.e. datasets) and not enough time to review them all. To address this, recommender systems have been used successfully to solve various data overload problems, such as online shopping (where stores typically show 'additional items') and online movie selection [24]–[26]. However, recommender systems must be adjusted for industrial data for two main reasons.

First, industrial data has specific characteristics, such as a large number of recommendable items in the form of many data fields. Many of these data fields are similar in their content descriptions. For example, every row in a table has a similar structure and may have similar values either syntactically (e.g., all values in a column are yes or no, or all values in a column are dates) or semantically (e.g., all values in a column are surnames). This makes it more difficult for a recommender system to distinguish between items of data and to recommend the relevant data to a user.

Second, many techniques used by recommender systems require a content description of the data in order to make recommendations (these are so-called content-based recommender systems). A technique for describing the content of data for recommender systems is therefore required.

1.2.2 Market approaches and the resource allocation problem

In industrial companies, there are a) datasets, which incur costs to provide, and b) users, who can only be presented with a limited amount of data and receive varying benefit from different datasets. This is similar to a food market, where there are a) products, which cost money to provide, and b) customers, who have limited money. Ensuring that only the relevant data is kept for potential presentation is a resource allocation problem1 known from market approaches [27].

1 Resource allocation is the allocation of a limited resource, such as food, to users who are interested in this resource.

Market approaches work by assigning value to resources given the users' needs [28]–[31] and then letting the users bid in auctions for the use of the resources. This thesis aims to ensure that the relevant data reaches the right user; the problem of data allocation is therefore similar. Various researchers have used market approaches in similar domains [32]–[35] or suggested their application to data management problems [22], [28]. However, to date there are no specific applications of, or suggestions for implementing, market approaches to the problem of data overload in order to obtain better data allocation. The main difficulty is finding mechanisms to show different combinations of data to the user and to identify their relevance, which can then be used with a market mechanism. This thesis therefore tests market approaches to improve the overall relevance of data shown to the user.
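To make the division of labour between the two components concrete, the following is a minimal sketch in Python. The dataset names, ratings, costs, and the simple utility rule are all illustrative assumptions, not the thesis implementation, which is developed in Chapters 5 and 6.

```python
# Minimal sketch: a recommender ranks datasets per user, and a market
# component prunes datasets whose aggregate utility does not cover the
# cost of providing them. All names and numbers are illustrative.

from collections import defaultdict

# Hypothetical user ratings (user -> dataset -> rating on a 1-5 scale)
ratings = {
    "buyer_1": {"supplier_orders": 5, "inventory_levels": 3},
    "buyer_2": {"supplier_orders": 4, "supplier_risk": 5},
}

# Cost of keeping each dataset available (illustrative units)
costs = {"supplier_orders": 2.0, "inventory_levels": 4.0, "supplier_risk": 1.5}

def recommend(user, k=2):
    """Rank datasets the user has not rated by other users' average rating."""
    scores = defaultdict(list)
    for other, rated in ratings.items():
        if other == user:
            continue
        for dataset, r in rated.items():
            if dataset not in ratings.get(user, {}):
                scores[dataset].append(r)
    ranked = sorted(scores, key=lambda d: -sum(scores[d]) / len(scores[d]))
    return ranked[:k]

def market_prune():
    """Drop datasets whose total rated utility does not cover their cost."""
    utility = defaultdict(float)
    for rated in ratings.values():
        for dataset, r in rated.items():
            utility[dataset] += r
    return {d for d in costs if utility[d] < costs[d]}

print(recommend("buyer_1"))   # e.g. ['supplier_risk']
print(market_prune())         # e.g. {'inventory_levels'}
```

Here the recommender exploits other users' ratings to surface unseen datasets, while the market component eliminates datasets whose aggregate utility falls below their provision cost, mirroring the two RecorDa variations described above.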
1.3 Research questions and methodology

In the previous sections, this thesis introduced one of the main problems in data allocation, data overload, and the potential of recommender systems and market approaches to overcome it. This thesis therefore sets out to answer two main research questions:

1. What is the best way of using recommender systems and/or market approaches in industrial data allocation to improve performance in terms of precision, recall, novelty, coverage and computation time?
2. Can recommender systems and market approaches, individually or in combination, identify relevant data better than potential alternative techniques?

These questions are addressed by assessing different architectural approaches and through a series of case studies and experiments comparing recommender systems and market approaches against alternatives.

1.4 Using recommender systems and market approaches for data allocation

Given that recommender systems and market approaches have worked successfully in other applications, the main challenge of this thesis is to show that they can overcome the problem of data overload either individually or in combination. There are various ways to use recommender systems and market approaches. This thesis assesses them against different alternatives by analysing their potential for solving the data allocation problem. Based on this assessment, this thesis focuses on a series of similar specific architectural approaches called RecorDa, which use the following components:

- a recommender system component, to suggest relevant datasets to the user and identify which data the user would like to be presented with regularly, and
- a market approach component on top of the recommender system, to take the datasets most often presented to users, evaluate their relevance, and improve the overall relevance of the datasets that are presented.

The recommender system focuses on initial data presentation and is similar to the recommender systems used at Amazon and similar online retailers [36]. A series of adaptations makes it applicable to data allocation challenges. Using the recommender system on its own is one variation tested in this thesis. In the additional variation, the market approach component uses the usage and ranking information from the recommender system and evaluates each dataset's overall relevance to all users using a utility function. On this basis, it decides which datasets should be eliminated and no longer shown to any user because their relevance is too low. This thesis tests several variations of this approach by examining different variables and components.

1.5 Definitions

This thesis uses a series of terms that require definition to ensure clarity in their application.

- Data: 'Data is the representation of facts as text, numbers, graphics, images, sound or video' [37].
- Decision: The selection of an option from a series of available options based on the available information.
- Decision-making: The process that a user follows to reach a decision. It often involves looking for additional data.
- Information: Information is defined as 'data placed in context' [20], [37].
- Information system: 'Information systems use data stored in computer databases to provide needed information' [40].
- Knowledge: Knowledge is defined 'as anything that is known by somebody' [38], [39] in the organisation.

Table 1: Key definitions for this thesis

The terms information and data are closely linked. This thesis assumes that a user who is presented with relevant data can place it in context and transform it into information and knowledge. This thesis therefore primarily uses the term data for the concept of showing additional data to the user, who can then transform it into information and knowledge, provided the user understands it. If the user does not understand it, the data is considered irrelevant and hence less likely to be shown. This thesis only refers to information or knowledge where one of these is the standard term in the field (e.g., Value of Information, knowledge management).

Besides these core terms, there are various types of data that require definition before the method proposed in this thesis can be described.

- Data table: A set of data values using a model of vertical columns (identifiable by name) and horizontal rows [37], [41].
- Database: A collection of data tables.
- Dataset: A subset of rows and columns from a data table.
- Relevant data: Data that would improve a user's decision. Relevant data can be either known or unknown.
- Known data: Data that is relevant to the user and whose existence the user is aware of.
- Presented known data: Data that is presented to the user within the existing graphical user interfaces of the user's information systems.
- Non-presented known data: Data that is not regularly presented to the user, requiring the user to search for it in a system that the user would not normally use for decision-making.
- Unknown data: Data that is relevant to a user, but whose existence the user is not aware of, whether within the company or from external sources. This could, for example, be a dataset about the likelihood of supplier bankruptcy that has not been given to a user in procurement who must make a supplier selection. Unknown data can be either organisationally aware or organisationally unaware.
- Organisationally aware data: Data that the user does not know, although others in the organisation do. For example, the user could be a new employee, or may not be aware of a newly available dataset.
- Organisationally unaware data: Pieces of data (or datasets) that were only found to be relevant when they were presented to the user, with no prior knowledge from anyone in the organisation.

Figure 1: Types of relevant data and their relations to each other

1.6 Evaluation of the RecorDa approach

The evaluation of the RecorDa approach attempts to verify that the treatment – the RecorDa approach – can identify relevant data for a given user and present this data to the user better than existing techniques can. For comparison, this thesis mainly uses search, because other techniques are not able to cover a large number of datasets. Regarding search, this thesis assumes different types of search behaviour and compares them to the RecorDa approach. The aim of all techniques is to improve the relevance of the data shown to a user, in order to improve data allocation and reduce data overload.

To evaluate data relevance, this thesis classifies data into eight categories: the combinations of relevant or non-relevant data with the four data types above. For each category, different measurements can be used to evaluate the benefits of the different treatments:

- Known data (presented and non-presented) and organisationally aware unknown data: measured with precision and recall metrics.
- Organisationally unaware unknown data: measured with metrics such as novelty or coverage.

Table 2: Matching types of data with their relevant evaluation metrics

A good treatment should achieve a high relevance of data. It can do this by correctly allocating as much data as possible to the following categories:

- Relevant and known presented data
- Non-relevant and known non-presented data
- Relevant and organisationally aware data

These can be easily measured with precision and recall metrics, which evaluate how accurate a specific technique is in providing the relevant data to the user. These are the typical metrics used in information retrieval. However, there is an additional group of data, organisationally unaware data, which is found to be relevant or irrelevant only once it has been shown to the user. Ideally, presenting many datasets to the user reduces the amount of data in this group, because the user can then form an opinion about it. These types of data therefore cannot be evaluated with precision or recall metrics, but instead require metrics such as novelty (how new the presented dataset is to the user) and coverage (how many of the available datasets have ever been shown to the user) [42].
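These four measures reduce to simple set ratios. The sketch below shows one common way to compute them [42]; the definitions are simplified assumptions for illustration, and the exact formulas used in this thesis are listed in Attachment D.

```python
def precision(recommended, relevant):
    """Fraction of recommended datasets that are relevant."""
    return len(set(recommended) & set(relevant)) / len(recommended)

def recall(recommended, relevant):
    """Fraction of relevant datasets that were recommended."""
    return len(set(recommended) & set(relevant)) / len(relevant)

def novelty(recommended, previously_seen):
    """Fraction of recommended datasets the user has not seen before."""
    return len(set(recommended) - set(previously_seen)) / len(recommended)

def coverage(all_recommended_ever, catalogue):
    """Fraction of the catalogue that has ever been recommended."""
    return len(set(all_recommended_ever)) / len(catalogue)

# Illustrative values
recs = ["supplier_risk", "orders", "inventory"]
print(precision(recs, relevant=["orders", "supplier_risk"]))   # 0.67
print(recall(recs, relevant=["orders", "supplier_risk"]))      # 1.0
print(novelty(recs, previously_seen=["orders"]))               # 0.67
print(coverage(recs, catalogue=["orders", "supplier_risk",
                                "inventory", "hr"]))           # 0.75
```

Precision and recall require a known ground truth of relevant datasets, whereas novelty and coverage only require the history of what has been shown, which is why the latter two are used for organisationally unaware data.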
1.7 Thesis novelty, results, and contributions

Existing applications that address the data allocation problem are often limited, and there is potential to use recommender systems and market approaches to close this gap. The main limitations of the existing research are the following:

- Recommender systems and market approaches have been suggested as solutions to data management problems, but these suggestions are often highly unspecific and not adjusted to the data allocation problem.
- There are no applications of data allocation that use recommender systems or market approaches.
- The benefits of market and recommender system approaches for the data allocation problem have never been tested.

This thesis aims to address this research gap by finding the most promising approach (for improving precision, recall, novelty, and coverage) that uses recommender systems and market approaches for data allocation. Based on this evaluation, this thesis identifies a suitable approach (called RecorDa) and demonstrates how this approach works. The results show that some existing techniques (i.e., requirement analysis techniques) are more precise than RecorDa in providing the user with relevant datasets, but less flexible in finding additional datasets and in showing datasets of which the user is not aware. The RecorDa approach further outperforms similar techniques, such as search, in its ability to provide relevant datasets under changing conditions.

The findings of this research could help industrial companies develop and use better systems for data allocation. By incorporating RecorDa into their software, these companies can better leverage their data. They would gain a tool for situations that require additional flexibility when reacting to new user interests in data. The tool can also help them reduce the impact of data overload.

1.8 Applicability of this research

This thesis is focused on data management for industrial companies. It is most suitable for the following situations:

- Types of data: This thesis focuses on structured data. It mainly addresses data at the data-table level.
The approaches presented in this thesis show structured datasets to the user; however, they could potentially be expanded to use unstructured data.
- Types of users: This thesis is focused on user decision-making based on data presented by an information system (e.g., ERP systems).
- Types of decisions: The approaches presented in this thesis help users obtain more relevant data to improve their decision-making. They are found to be most useful for repeated decisions rather than one-off decisions (made only once in an organisation). The approaches require decisions to be made repeatedly in order to improve the data shown to users and to generate benefits. However, these decisions do not need to be made by one user; the presented approaches perform particularly well when the same type of decision is made by several people independently.
- Types of information systems: The approaches presented in this thesis require interaction with the information system, to identify the data a user is currently looking at and to show additional relevant datasets. The approaches are therefore limited to information systems that enable this type of functionality.

While elements of this research might be suitable for other applications, these are the situations this thesis analyses and tests.

1.9 Key assumptions

This thesis is based on a series of assumptions. These assumptions guided this research and helped clarify its direction. The main assumptions are the following:

- Users can rate the relevancy of data presented to them: This thesis assumes that when a user sees an additional dataset, the user is able to determine its relevance with a certain degree of accuracy and will provide ratings (on a scale from 1 to 5) to the approach presented in this thesis. This assumption requires several abilities from the user and is hence relaxed in Chapter 7, where different accuracies of user selection are compared.
- Users improve their data selection abilities: Users are capable of improving their data selection abilities, becoming better at selecting the type of data they are interested in.
- Additional data can be presented to the user: The approach presented in this research aims to present additional data to the user. It therefore must assume that the existing systems make it possible to capture the data currently presented to the user and to present additional data (e.g., on the left or right side of the existing system).

1.10 Thesis outline

This thesis is organised into the following chapters:

- Chapter 2 – Research background: This chapter presents the relevant research background by providing an overview of current industry and research practices and their limitations, and by discussing the use of recommender systems and market approaches in other areas with similar characteristics.
- Chapter 3 – Research methodology: This chapter presents the research questions, as well as the guiding hypothesis and the approach used to answer them.
- Chapter 4 – An approach to using recommender systems and markets: This chapter compares potential methods of using recommender systems with or without market approaches and selects the variation of an approach called RecorDa most likely to improve precision, recall, novelty, and coverage.
- Chapter 5 – Introducing the functionality of the RecorDa recommender system component: To answer the research questions, this chapter introduces the first part of the RecorDa approach.
It describes how the recommender system and its potential variations operate.
- Chapter 6 – Introducing the functionality of RecorDa with the market approach component: Building on the recommender system component, this chapter describes how the market approach component of the RecorDa approach and its potential variations work.
- Chapter 7 – Evaluation: This chapter analyses the benefits of the proposed approach (i.e., the RecorDa approach) and its variations by comparing it to existing solutions.
- Chapter 8 – Conclusion: Based on the evaluation, this chapter provides an overview of and an outlook on the main findings of this thesis.

2. Research background

2.1 Introduction

This chapter describes the academic background of this research as well as the existing literature on the topic. It does this by examining:

1. current data management practices; and
2. applications of market approaches and recommender systems.

For the first part, the chapter provides an overview of industrial data management challenges (specifically in data allocation, due to the data overload problem) and of current data management practices, in order to describe the environment in which a market and recommender system approach will have to operate. It further explains why solutions such as market approaches and recommender systems are used.

For the second part, the chapter provides details about applications of market approaches and recommender systems in other domains, and the challenges they help to address there. The aim is to show the similarities between these domains and current data allocation problems.

The chapter is structured as follows:

1. Section 2.2 focuses on the industry background and describes the underlying issues driving the need for a market and recommender system approach. It describes the increasing amount of data used by industrial companies and the problem of increasingly complex tasks for decision makers. It demonstrates the issues arising from these problems for industrial data management.
2. Section 2.3 describes the current data allocation challenges and approaches, which are rooted in the underlying industry complexity. It also discusses the existing techniques to overcome these problems.
3. Sections 2.4 and 2.5 examine other applications of market approaches and recommender systems in domains with similar characteristics.
4. Section 2.6 then describes the research gap in using market approaches and recommender systems to overcome the problems in data allocation.

2.2 Industry background

Industrial companies' decision-making is based on data [39]. These decisions may, for example, relate to investments or supplier selection. To facilitate this decision-making process, data is directly allocated to the user by the information system. An information system can be defined as 'the entire collection of data sources and related service capabilities, both internal and external to the organization, from which the users of the system may obtain messages' [39].

The problem is that the allocation of specific data to specific users for decision-making is becoming too complex, causing companies to miss various opportunities for better decision-making. The reasons for this are the increase in the amount of data and the increased complexity of user tasks. These factors lead to the data overload faced by many industrial companies. Reducing the problem of data overload caused by these two industrial developments is the main aim of this research.
2.2.1 Increasing amounts of data and data users

The amount of data is increasing rapidly every year. Precise numbers vary depending on the source and analysis. According to Manyika et al. [1], there is an increase of 40% every year; BAE Systems Detica [2] puts the figure at 54%; and some estimate an increase of up to 100% for the top companies [6]. Moreover, according to Feldman [4] and Bughin et al. [5], the amount of data doubles every 18 months.

Industrial companies store a large proportion of this data. Rolls-Royce collects huge amounts of data from its turbines [7], and a new Boeing 787 generates over half a terabyte of data on every flight [8]. In 2012, 18% of companies reported data silos and data volume among their top concerns [6]. The average 1,000-employee UK company already stores 870 terabytes of data [2], more than the Library of Congress holds [1], [3].

There are various drivers for this increasing amount of industrial data, including:

1. Increasing storage capacity
2. Improved sensing technologies
3. More publicly available datasets
4. More metadata generation

Increased storage capacity

The first driver of the increasing amount of data is growing storage capacity [43]. This makes storage cheaper and easier; hence companies store more data. Given Moore's law, which has continued to hold since 1965 [44]–[47], it can be assumed that this trend of increasing storage capacity will continue into the near to medium-term future. Although storage capacity is increasing, the amount of data produced by various systems is predicted to grow at an even faster rate [43], making relevance decisions about which data to store even more difficult in the future, especially considering the costs associated with storing data, such as running the storage equipment [47]. In some data-rich industries, such as credit card lending, retail, and health care, the data collected is outgrowing the reduction in storage costs, resulting in a net increase in storage spending [48].

Improved sensing technologies

Over the last few years, sensing technology has improved drastically, making sensors cheaper and more accurate. These improved sensing techniques include, for example, better cameras on mobile phones. With these techniques, industrial companies can collect large amounts of data from their operations, supply chains, products, etc. Recent trends such as the Internet of Things clearly reflect this tendency. In 2011, manufacturers embedded over 30 million sensors in their products, and this number increases by 30% every year [49]. While this offers great opportunities for businesses, such as better asset management [50] and improved supply chain management [51], it also creates an increasing amount of data.

More publicly available datasets

Besides the data created by single users or organisations on their own, there is also an increase in public data. This data comes in the form of datasets, such as Amazon's large dataset repository [52] and the various open data initiatives by governments2. The increase in public data has been facilitated over the last few years by technologies such as the semantic web [55]. Other drivers include the increase in users' video and picture sharing on the Internet [56], and in semi-public datasets that can be acquired from companies.

2 Examples include data.gov in the US [53] and data.gov.uk in the United Kingdom [54].

More metadata generation

Metadata is data about data [37]. Metadata has increased massively in recent years, growing twice as fast as other digital data [56].

The underlying trends behind all of these drivers are likely to continue. This presents companies with the challenge of leveraging this data to obtain the most benefit from it3.

3 Some estimates say that only 0.01 per cent of companies' data is valuable/relevant [3].
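Assuming steady compound growth, the growth rates cited above can be compared by converting them into doubling times. The following short calculation is illustrative only; the rates are those reported by the cited sources.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years needed to double under steady compound growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)

print(doubling_time_years(0.40))  # ~2.06 years at 40% per year [1]
print(doubling_time_years(0.54))  # ~1.60 years at 54% per year [2]

# Conversely, doubling every 18 months [4], [5] implies an annual
# growth rate of 2**(1/1.5) - 1, roughly 59% per year.
print(2 ** (1 / 1.5) - 1)
```

The estimates are therefore broadly consistent with one another, all implying that data volumes double every one and a half to two years.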
2.2.2 Increasing task and organisational complexity

Today's business environment is becoming more dynamic and complex, and it is continuously changing [17]. Indices that measure company complexity using the number of procedures, vertical layers, and other organisational complexity metrics show this complexity increasing by 6.7% every year [16]. Various business trends are causing this increased complexity, including the following:

1. More intense interaction and integration between suppliers and customers due to a reduction in production depth [57].
2. Increased task automation within companies, leaving users with the more complicated tasks that cannot be automated.
3. Specialisation within the workforce, which makes the requirements of the whole workforce more diverse. Especially in large industrial companies, employees increasingly perform one specific task.
4. A stronger focus on making decisions based on data, including recent industry trends such as data analytics and the data-driven organisation [1]. This requires more specific analysis and more tailored responses to the analytical results.
5. Decision-making itself is becoming more complicated.

Allocating the relevant data to an increasingly complicated decision-making process is hence becoming an increasingly difficult challenge. To overcome some of these challenges and the continuously increasing amount of data, software tools have evolved. Recent trends include tools and techniques such as data analytics [58] and master data management [59]. Some enterprises have up to 5,000 applications [60], and an increasing number of these are data management tools. While these tools make several aspects of data allocation easier and more efficient, they also make the decision and selection process for many companies even more complicated.

2.2.3 Data overload

The previous two sections showed the problems of increasing amounts of data and increased task complexity. Combined, they lead to the following data problems for industrial companies:

- Data overload of individual users, due to the increasing diversity in users' tasks and environment [11], [12]; and
- Lack of data for specific tasks, and specifically a lack of data sharing across organisational boundaries, such as between different departments within a company or between different companies [19], [21].

Users are asking for more data sharing among departments and industries4, while at the same time suffering from an overload of data [11] (see Table 3 for an illustration).

4 A lack of data sharing was, for example, one of the main failures of the various US secret services that could have prevented the 9/11 terror attacks [19].

- Data overload: Due to the increasing amount of data, every user is presented with more data from each data source. Due to the increasing diversity of user tasks, every user is presented with more varying data sources for each task.
- Lack of data: Due to the increasing amount of data, ensuring that the relevant data reaches the right user becomes more difficult. Due to the increasing diversity of user tasks, the user does not receive all the data required for each of these tasks.

Table 3: Illustration of the user problems caused by data overload and user task diversity
The underlying problem, the so-called data overload or file allocation problem, has been shown to be NP-complete [61], [62]. Solving this internal resource allocation problem in the best possible way can be a great source of operational advantage [63].

According to Eppler and Mengis [11], data overload can be categorised into the following groups:

- Cognitive overload
- Sensory overload
- Communication overload
- Knowledge overload
- Information fatigue syndrome

The causes of these issues can in turn be categorised into five groups [9]:

- The data itself (too much, too frequent, too intense, or of poor quality)
- The person receiving the data
- The processing and/or communication of this data
- The task or process that the user needs to complete
- The organisation and its design

Data overload usually arises from a combination of these issues. It often leads to disregarding low-priority inputs, paying less attention to each input presented to the user, shifting some of the data overload problem to other users, filtering specific data, refusing to receive communication, and creating institutions that receive the data [9].

Providing all data to users is not an option, given the large amounts available and the need to limit them drastically. Simply providing the most current data is also not a suitable approach [9], [48]. Therefore, approaches are needed that provide the user with the relevant data for the right task at the right time. Data is typically relevant when allocated in the right amount at the right time; too much or too little data is less relevant (see Figure 2).

Figure 2: Concept of information overload (referred to as data overload in this thesis) from Eppler et al. [11]

The relevance of data initially increases as more data is shown, but then decreases once a certain amount of data is reached. The ideal amount of data varies depending on the user and the context of that user's decision.
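The allocation problem behind this can be illustrated with a toy formulation: choose which datasets to show each user so that total relevance is maximised within a limited attention budget, a knapsack-like problem, which is why the exact version is NP-complete. The numbers and the greedy heuristic below are illustrative assumptions, not the model developed later in this thesis.

```python
# Toy data allocation: maximise total relevance for one user subject to
# an attention budget, using a greedy value-per-cost heuristic (exact
# optimisation is knapsack-like and NP-complete, as cited above).

datasets = {  # dataset -> (relevance to this user, attention cost)
    "supplier_orders": (9, 3),
    "supplier_risk": (7, 2),
    "inventory": (4, 4),
    "hr_records": (1, 2),
}

def allocate(budget):
    chosen, spent = [], 0
    # Consider datasets in order of relevance per unit of attention
    for name, (rel, cost) in sorted(datasets.items(),
                                    key=lambda kv: -kv[1][0] / kv[1][1]):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

print(allocate(budget=6))  # ['supplier_risk', 'supplier_orders']
```

The budget constraint captures the relevance curve of Figure 2: beyond a certain amount of presented data, adding more datasets stops improving the user's situation, so the question is which datasets to include, not how many can be produced.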
Good data management can help companies to obtain a competitive advantage [65], [66], and various existing techniques within data management can already be considered to address the problem of data overload. In order to achieve this overall objective, data management can be divided into five strategic and four non-strategic goals [37]. The strategic goals are:

1. 'To understand the information needs of the enterprise and all its stakeholders.' [37]
2. 'To capture, store, protect, and ensure the integrity of data assets.' [37]
3. 'To continually improve the quality of data and information […].' [37]
4. 'To ensure privacy and confidentiality, and to prevent unauthorized or inappropriate use of data and information.' [37]
5. 'To maximize the effective use and value of data and information assets.' [37]

The non-strategic goals are:

6. 'To control the cost of data management.' [37]
7. 'To promote a wider and deeper understanding of the value of data assets.' [37]
8. 'To manage information consistently across the enterprise.' [37]
9. 'To align data management efforts and technology with business needs.' [37]

All of these goals are covered by a broad range of existing research [37]. The aim of a market and recommender system approach is to overcome data overload by giving the relevant data to the right users. It therefore focuses on strategic goals 1 and 5 of data management. It also supports non-strategic goals 6 and 7 by providing cost and revenue estimates and a prioritisation of importance for different datasets. However, there is a series of current techniques that already address parts of these challenges:

1. Value of Information (VoI) techniques
2. Search
3. Data analytics and business intelligence
4. Data development
5. Data architecture management
6. Metadata management
7. User interface design

The following sections provide a detailed overview of the existing work in these fields regarding data management.

2.3.1 Value of Information techniques

The main fields that use VoI are computer science, economics, and business management [67]–[71]. In these fields, VoI is used to analyse data quality questions about specific issues in some datasets, as well as more strategic problems about the sharing of data with partners of an organisation. In order to make decisions related to these issues, the value of different pieces of data needs to be calculated or estimated.

Calculating the VoI is difficult because information is an experience good [72]. In order to calculate the value of information, most researchers have drawn on decision theory, the influence that a piece of information has on a decision, and the assessment of the economic value of this influence [43], [73]. Different decisions have different outcomes based on the action implied by the decision. In decision theory, information influences the way the actions are selected and therefore the outcome. When all other variables stay the same, it is then possible to analyse how different information impacts the outcome of a decision, which can be used to calculate the value of this information [39], [73]. Howard's first paper on the VoI [74], [75] already defined VoI in this way. VoI often relies on analytical techniques, such as Bayesian networks or other approaches [74], [76]–[79]. A different, less analytically focused research field relies mainly on surveys [80]–[84]. This implies asking users for their estimates or opinions regarding which data they would find relevant for their task.
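To make this decision-theoretic calculation concrete, the following minimal sketch computes the expected value of perfect information (EVPI) for a simple two-action decision. All states, probabilities, and payoffs are invented purely for illustration and are not drawn from the cited literature; the point is only the difference-in-expected-outcomes logic described above.

```python
# Minimal sketch of decision-theoretic Value of Information (VoI).
# All states, probabilities, and payoffs are invented for illustration.

prior = {"demand_high": 0.4, "demand_low": 0.6}   # prior beliefs over states

payoff = {                                        # outcome of (action, state)
    ("stock_up",   "demand_high"): 100, ("stock_up",   "demand_low"): -20,
    ("stock_down", "demand_high"):  30, ("stock_down", "demand_low"):  10,
}

actions = ["stock_up", "stock_down"]
states = list(prior)

def expected_payoff(action):
    """Expected outcome of an action under the prior beliefs."""
    return sum(prior[s] * payoff[(action, s)] for s in states)

# Best achievable expected outcome without any additional information.
ev_without_info = max(expected_payoff(a) for a in actions)

# With perfect information, the decision-maker learns the state first
# and then picks the best action for that state.
ev_with_info = sum(prior[s] * max(payoff[(a, s)] for a in actions)
                   for s in states)

# The value of the information is the improvement it enables.
evpi = ev_with_info - ev_without_info
print(f"EV without information: {ev_without_info:.1f}")    # 28.0
print(f"EV with perfect information: {ev_with_info:.1f}")  # 46.0
print(f"Value of (perfect) information: {evpi:.1f}")       # 18.0
```

The more elaborate Bayesian-network formulations cited above follow the same logic, replacing perfect information with imperfect signals.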
These techniques can be applied to specific information or to whole information systems [83]. In a sensor network, for example, value estimates are used to decide which sensors to keep and which to discard [85]–[87]. Yemini et al. [88] provide an example of how value estimates are used to dynamically adjust the allocation of services in an information system environment. Other research has shown that value-based file storage is a promising approach [89]. Further approaches use the Value at Risk of information [48] or policies [90] to identify which data is more relevant for a company. However, these also rely on estimates from administrators and experts to determine which data should be kept in which manner. In addition, they do not address the process of delivering the data to the user.

Overall, there are various issues with VoI techniques:

- More data does not always lead to better decision-making, as shown for example by Huber et al. [91].
- Information is an experience good, which often requires the user to use the data in order to decide on its value [72].
- These techniques often assume a certain user reaction when presented with the data. However, the user's reactions might be case-dependent and may vary over time, making the analysis or survey more complicated.
- The value of a piece of data can depend on the type of access, the time of acquiring it, or the specific content. A user can subscribe to a data source or pay for it on a per-use basis. For instance, a user can access limited data on the Financial Times website for free or pay for a subscription, while a user of Apple's iTunes pays for every separate song. Balazinska et al. [92] describe various forms of subscriptions and related issues in market approaches.
- The assessment of data can be a complex process. It requires identifying all potential inputs (or messages) from the data, the statistical output of all of these messages, and the relationships between different messages [39], [73]. It is therefore difficult to conduct this kind of analysis for a large number of different types of data and different combinations of users with different tasks. Performing this analysis for every piece of data is too complex, particularly because each piece of data could have various impacts within a company, which are not always predictable, or could be included in analytical models that tend to focus on a limited number of impacts on decisions. Moshowitz [93] mentions that this analytical or mathematical approach 'is not primarily economic value', meaning that an analytical approach does not primarily capture the true value of a piece of data because it does not consider the various economic influences of that data.
- VoI techniques are used in high-impact and specific cases, such as oil and gas, healthcare, plant, and manufacturing design [78]. This is because the effort of conducting these analyses is often only justifiable when the impact and the stakes are large enough. VoI approaches are therefore less scalable and not suitable for the large number of lower-impact VoI analyses that would be needed in many industrial companies.

2.3.2 Search

Search describes the process in which the user types in keywords that are then used to scan through a set of databases. It has the ability to find certain pieces of data within a larger mass of data. The most famous applications of search are on the Internet and include websites such as Google and Yahoo.
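To illustrate the keyword-matching step that underlies such search mechanisms, the following minimal sketch builds an inverted index over a handful of invented dataset descriptions and returns the entries matching every keyword in a query. It is a toy illustration of the principle, not a description of any production search engine.

```python
# Minimal sketch of keyword search using an inverted index.
# The dataset descriptions and the query are invented for illustration.
from collections import defaultdict

datasets = {
    "suppliers":  "supplier master data for procurement",
    "sensors":    "machine sensor readings from the assembly line",
    "purchasing": "procurement orders and supplier invoices",
}

# Inverted index: term -> set of dataset ids whose description contains it.
index = defaultdict(set)
for dataset_id, description in datasets.items():
    for term in description.lower().split():
        index[term].add(dataset_id)

def search(query):
    """Return the ids of datasets whose description contains every keyword."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

print(search("supplier procurement"))  # {'suppliers', 'purchasing'}
```

Note that the user must already know that 'supplier' and 'procurement' are the right keywords; this dependence on the user's prior knowledge is exactly the limitation of search discussed below.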
Tools such as data retrieval and indexing can also be seen as search mechanisms [94]. Search existed even before computers did, for example in the form of library registers. Search has long been used in computer science and especially in personal computers, for example with the Unix 'find' function. Within the current information systems environment, search finds terms in a database that are identical, or semantically or syntactically similar, to the terms written in the search query. In order to execute the search process, search engines typically rely on the following three kinds of techniques:

1. Syntax and semantics: Searching for similarity between the keywords typed into the search engine and the words in the database. The first search engines relied intensively on measures of semantic similarity in order to find the right website to match a user's request.
2. Structure: Using connections between different data items in order to identify the most relevant one for the user. Google, for example, uses PageRank [95], and other search engines use other kinds of network measures, such as those presented in Kleinberg [96] and Lempel et al. [97]. These measures use hyperlinks, for instance, to identify the most central elements within the network. This gives them an indication of the quality of a website: if a page is linked to by many other pages, it is likely to be a better source than others. These structural measures are often combined with syntax, semantics, and categorisation.
3. Categorisation: The structuring of content into specific categories to make it easier to find that content.

Search has already been applied to companies' databases, for example with techniques such as Google enterprise solutions [98]. By producing search results, search engines implicitly make assumptions about which data is the most relevant to the user, and provide data in a more flexible manner. Search is one of the solutions most similar to the approach that will be described in this thesis. However, it is limited because it requires the user to know what to search for and where to search for it. Neither is a given in large, complex organisations. Moreover, search often involves additional process steps that many users are not always willing to take to complete a task.

There are three types of search that are relevant to the present research. The first is when the user knows where to find data and which system or database provides it (called directed search within this thesis). This would be the case, for instance, if the user goes directly to Amazon to find a specific product to buy. The second type is when the user does not already know where to find this data (called undirected search within this thesis). This would be the case if the user uses Google to find certain data. The third is a combination, in which the user initially does not know where to find the data but improves at finding it over time by gaining additional experience (called learning search within this thesis).

2.3.3 Data analytics and business intelligence

Big data analytics uses machine-learning techniques to draw insights from datasets. It automatically analyses datasets to improve the decision-making of industrial companies [58]. Similar approaches have been used for several years and are known as business intelligence [99], [100], [104]. Data analytics can present a great advantage for industrial companies [59].
However, data analytics does not always help the user's decision-making, as the latter is often influenced by various factors and values. Data analytics focuses on the data of which the user is aware, and helps with decision-making once the data is available to the user. Identifying the relevant datasets for analysis and decision-making remains one of the main challenges for industrial companies.

2.3.4 Data development

Data development aims to identify the data requirements that exist in a company, to create solutions to the resulting problems, and then to implement them [37]. It uses specific modelling of the users' data requirements to inform the detailed design of the data in specific databases or data tools. It then develops the most suitable design for the data within the company and provides the data to the user using various techniques, such as data quality management and data integration. Finally, it implements these techniques, for example by converting data to different databases [37]. However, data development mainly addresses the overall design of the system, and not the specific allocation of additional datasets to users or the overcoming of data overload, which it addresses only implicitly. Data development faces the same issues of increasing data volume and increasing user complexity that have already been mentioned for data management in general. It does not overcome these problems, but instead applies existing solutions, such as search, as part of its toolbox.

2.3.5 Data architecture management

Data architecture management involves '[d]efining the data needs of the enterprise, and designing the master blueprints to meet those needs' [37]. It aims to design standards and architectures for data management based on the company's higher-level goals [37]. Similarly to data development, it does not help to truly overcome the problem of data overload. It provides more of a framework within which a market and recommender system approach operates rather than a true alternative for overcoming data overload.

2.3.6 Metadata management

Metadata management ensures that the right metadata is collected. Metadata can include data such as the time of data usage, the amount of data usage, and the time of data creation or change. This metadata can help to identify the relevance of data to users. However, metadata does not provide a specific allocation of data to users or overcome the problem of data overload. Nevertheless, it can inform further analysis or additional techniques [37].

2.3.7 User interface design

There are various techniques to improve the capability of users to comprehend data by presenting it in the right way. These approaches can help the user better understand the presented data [102], [103]. However, they do not solve the issue of data overload or identify the relevant data; instead, they improve its presentation5.

5 The approach developed in this thesis can integrate with these methods. However, this thesis focusses on the actual comprehension of data into information by the user and will therefore not address the issue of data presentation, although research has shown that it can have significant impact [86], [87]. It could be envisioned that these user interface techniques are incorporated into the presentation of the RecorDa approach.

2.3.8 Overview

The literature regarding the allocation of data to users can be separated into three types of analysis or approaches:

- Analytical analysis: using techniques from VoI and decision theory to analyse the impact of specific decisions;
- Interview/survey-based analysis: using surveys to identify which users require which data; and
- Search-based approaches: using search to find the data known to the user.

All three types have the following limitations in common:
1. They require users to know about the data in all its potential contexts and applications, which is becoming increasingly difficult given the increase in the volume of data.
2. It is difficult to maintain fine distinctions between different user groups that vary only slightly in the data in which they are interested. Users are often grouped together without any further differentiation between their tasks. With more task complexity and data volume, this user-group-based allocation can become difficult.
3. Users have to actively look or ask for additional data, and someone has to make an effort to obtain this data.

The market and recommender system approach presented in this thesis aims to overcome these limitations to better deal with data overload.

2.4 Market approaches in data management

2.4.1 Background

Markets are used for various applications in the current economy, such as supermarkets. They concern buyers interested in certain products and sellers who sell these products to them. Adam Smith identified the use of markets in resource allocation and value estimation in his book The Wealth of Nations in 1776 [104].

Computer science has adopted the concept of markets to solve certain problems using market-based algorithms. A market-based algorithm 'is the overall algorithmic structure within which a market mechanism or principle is embedded' [105]. It uses concepts from markets, such as auctions and negotiations, to solve algorithmic challenges (and is called a market approach throughout this thesis). Its features include 'decentralization, interacting agents, and some notion of resource that needs to be allocated' [106]. These features are effective in allocating resources and estimating value [105], [107], [108]. The reasons for this attribute6 of market approaches are still disputed, and various factors need to be considered when discussing them. Potential reasons are the distribution of the allocation problem to various participants, the individual incentive to improve the allocation and valuation, robustness towards a changing environment, and the increased flexibility of individual users.

6 Which Adam Smith called the 'invisible hand' [94].

Tucker and Berman [105] distinguish between strong market methods and quasi-market methods. A strong market mechanism is 'close in structure and behaviour to a human market' [105], in which the agents have a 'high degree of independence in their demand and utility functions and their endowments' [105]. Quasi-market methods give agents fewer degrees of freedom, in the sense of less flexibility available to them in the market [105], and therefore make less use of the market mechanism while following the same principles. The quasi-market approach offers better control for system-wide optimisation and can typically compute better results [105]. Strong market mechanisms are used more often in open systems with access by different parties [105].
This thesis will therefore follow a quasi-market approach, similarly to most research that uses market methods [105]. The strength of market approaches is that they compute complex problems with relatively simple properties [106]. Conversely, the related disadvantages are that it can be difficult to design the right properties for these approaches, and that their behaviour is difficult to predict [106]. The present research can build upon a large amount of existing research, as market approaches have been studied intensively. Criticism regarding the application of market approaches usually concerns the following problems that could occur: the risk of only finding a local optimum, and their reliance on simple game theory rules. Both can cause market approaches not to find the optimal solution [105].

One of the main components of market approaches is auctions [109]: 'An auction is a market institution with an explicit set of rules determining resource allocation and prices on the basis of bids from the market participant' [110]. An auction is the mechanism combining the buyers' utility or interest to pay for a certain item with the sellers' costs and willingness to sell for a certain price. There are different types of auctions, and they are usually used to allocate a resource from one or more sellers to a buyer or a selection of potential buyers. Auctions manage this process with a series of different items, buyers, and sellers, and have been shown to do so efficiently. Auctions have been studied intensively [109], [111]. The most common auction forms are English and Dutch auctions [109].

English auctions: In the English auction, the price is set low and then continuously raised until only one bidder remains, who wins and buys the item. The English auction is equivalent to a second-price bid, in which the person with the highest bid wins the auction but pays the price of the second-highest bid [109].

Dutch auctions: In the Dutch auction, the price is set high and then continuously reduced until the first buyer agrees to it. This type of auction is equivalent to a first-price bid, in which the highest bidder wins and pays the offered price [109].

There is a series of other pricing mechanisms in auction theory. However, these two, or slight adjustments of them, are the main ones used for most auctions [109]. Many differences exist among different types of auctions; an overview can be found in Krishna [109] or Klemperer [112]. However, there are some specialities regarding the auctions and market approaches used for this thesis that should be explained in further detail:

1. A user interacts with various combinations of datasets instead of just one dataset: This thesis deals with a specific type of auction, the combinatorial auction, for its market approaches [109]. In this type of auction, the user does not just bid for one item, but for a combination of items of interest. Various researchers have addressed these auctions in further detail [113]–[115], [125]. Combinatorial auctions lead to a better economic allocation but are also more computationally complex [117]. However, there are approaches for computing combinatorial auctions efficiently [118].
2. User utility: It is difficult to extract the utility from the user [119]. Goldberg et al. [120] describe various steps around these auctions.
However, they require a specific value of each item for each individual user, which is the main challenge that the approach needs to overcome.

3. Procurement auctions: Procurement auctions are auctions in which the sellers sell items with the goal of maximising their earnings [109].
4. Low variable costs: Data can easily be duplicated and shared. The incremental costs of reproduction are relatively low in comparison to other goods, such as manufactured products. This creates specific challenges with regard to pricing and valuing data [72].

Due to the intensive research on market approaches, various specialised and established algorithms have evolved, which are, for example, used to find a price equilibrium using a Walrasian approach [62], [121].

2.4.2 Applications of market approaches in industrial companies

Market approaches have been successfully used in various resource allocation tasks, often performing better than alternative resource allocation systems [27]. They have been applied in different industrial scenarios, such as supply chain management, radio spectrum sharing [122], workforce allocation [123], truck allocation [124], airport traffic control [125], project management [126], robot coordination [127], [128], and task scheduling [112], especially in dynamic and complex situations [129].

Applications of market approaches in information systems include the pricing of computation resource use [29], [31], such as memory space or available EC2 instances [32], [33]. Other applications are the protection of information systems with MarketNet [88], [107], [130]–[132], database management using market approaches for query management among various databases [34], bandwidth allocation [73], [31], allocation of CPU and IO capacity [35], and supply chain management systems [134]. The principle of applying market approaches to data management has long been suggested [71]. Market approaches are also used to facilitate interactions among different companies, such as supply chain interactions [134], and even to facilitate intercompany data exchange about products [135]. Market approaches to data are particularly difficult because of the low duplication costs [120].

Brydon [136] and others [137] identify market approaches as useful for resource allocation and as a solution to intra-company allocation problems. Brydon [136] mentions 'self-interest' and 'gains from trade' as the main sources of benefits because they allow the decomposition of the problem into various smaller problems. The author acknowledges that market approaches might solve the NP-complete resource allocation problem but at the same time create the winner determination problem in this market, which is also NP-complete [138]. Brydon [63] further presents various issues around developing these market approaches, such as 1) decomposition of the problem in a way that can later achieve global-level optimisation; 2) identification of value for the various agents and entities within the market; and 3) the decomposition of the problem using market approaches. Overall, market approaches seem to have a good 'time-quality trade-off', to be more flexible and robust, and to be usable for various objectives [63].

2.4.3 Applications of market approaches in data management

Besides this general work on market approaches in information systems, various authors have realised the potential use of market approaches with regard to internal data and data resource allocation and valuation (see Table 4).
Yemini et al. [88]: The authors introduce market approaches as a concept for application and service resource management in large-scale information systems, which also provides benefits of relevance estimation. They identify various elements that influence the market, such as user utility, user budgets, and optimisation targets, as well as the potential to apply more advanced market techniques, such as futures and options. However, this work does not show a concrete application of this type of market approach and does not specifically apply it to data, focusing instead on access to resources and services.

A et al. [87]: The authors describe a market-based approach to a sensor network that helps to identify which sensors provide benefits. Their idea is also tested in other research [85], [86]. They address the issue of sensor networks struggling with data overload due to the large number of sensors. Their work examines the data allocated based on user interests, combining different tasks with the sensors used to execute these tasks. However, the authors mainly focus on sensor use and not on data allocation, and they rely on user input to provide the relevance of data/sensors for a specific task.

Koifman et al. [139]: Koifman et al. describe a system in which various webpages trade data tuples with each other in a network. They use techniques to estimate the quality of a piece of data and develop the negotiation mechanism between the different pages. However, their model mainly aims at trade between different websites and does not address questions regarding companies' data allocation.

Christoffel [140]: The author describes a market approach used for integrating various data sources. The work shows that markets have abilities that make them more flexible, and introduces various agent types required to build a market-based approach. However, the work does not cover industrial data allocation.

Wang et al. [62]: The authors introduce market approaches to better manage the allocation of data from data sources to users. The idea they describe is to have agents compete to deliver data to users. However, they only address the relocation of data resources in order to be more attractive to users, and do not address industrial data allocation to specific users.

Koroni et al. [28]: Koroni et al. provide an overview of so-called 'internal data markets', and their approach is similar to the one that this research aims to develop. They introduce the idea that market approaches can be used for data evaluation, the evaluation of data quality, the costs of data, and the benefits that data can create. They also identify the main challenges in developing an 'internal data market': 1) organisational buy-in in the form of data transaction evaluation; 2) data quality problems; 3) standardisation, meaning issues around the development of a consistent data product that can be sold repeatedly; and 4) product packaging, since the data often needs to be pre-processed before it is shown to the user. Overall, they indicate some potential benefits and challenges, but they do not show ways to overcome these issues or concrete implementations of 'internal data markets'.

Wijnhoven et al. [141]: These authors present an approach to aligning internal database ontologies for a data market and show the importance of ontologies for internal data markets.
They provide insight into the standardisation of internal data products with regard to quality and ontology, but do not address data allocation.

Dignum and Dignum [22]: The authors describe the application of market approaches to knowledge management. In their work, market approaches serve to incentivise participation in knowledge management.

Koutris et al. [105K]: Koutris et al. describe an approach of trading with online datasets and user queries accessing these datasets, called a query market. This approach mainly addresses combining different datasets for user queries while still enabling payment to a combination of data providers. The authors describe how pricing a combination of datasets is a computationally difficult problem [142], [143], [144]. However, they do not address the issue of evaluating data relevance or of using fixed prices set by the data provider. They also do not influence the user's selection of the data.

Table 4: Overview of data-related applications of market approaches

Besides the explicit uses of market approaches in data management, various implicit uses also exist. These various applications use different auction mechanisms, market protocols, and other kinds of variations in market techniques [105]. Overall, however, the existing work on market approaches within companies has several limitations:

1. It does not address the issue of data allocation to users.
2. It only outlines the concept and leaves various open questions for practical application.
3. It does not show how data can be evaluated with limited user input (which is what can be expected in industrial companies).

This research aims to analyse the potential benefits of market approaches by focusing the application of markets on these three limitations and potentially combining them with recommender systems to overcome data overload.

2.5 Recommender systems in data management

2.5.1 Background

Recommender systems recommend items that they identify as relevant to a particular user. They are used intensively in online stores [24], such as Amazon [145]. Research on recommender systems started with Goldberg et al.'s work [146]. A review of the existing work on recommender systems can be found in Park et al. [147].

Recommender systems use items (the entity that is recommended) and users (the entity to which the item is recommended). To make their recommendations, they try to estimate the rating that a user is expected to give a previously unseen item. Ratings can either be given directly by the user, with the latter specifically stating the rating, or indirectly through the user's actions, such as clicking on a link, selecting an item to buy, or spending time on a website or product description. The estimation techniques for these ratings can be clustered into three categories of content-, user-, or item-based recommendation [25], [148]:

- Content-based filtering: These techniques suggest items to the user that are similar to the item that the user is looking at [25].
- Item-based filtering: These techniques suggest items that are similar (in terms of users' rating patterns) to items that the user has rated highly [146].
- User-based filtering: These techniques suggest items to the user that are rated highly by similar users [146].

All three techniques compute a similarity score for the unseen item based on the existing ratings and other similarity functions. They then use this similarity score to calculate the expected missing rating. This rating is subsequently used to generate suggestions for the user.
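To illustrate how such a similarity score can be turned into an expected rating, the following minimal sketch implements user-based filtering with cosine similarity. The users, datasets, and ratings are invented purely for illustration; a production system would add normalisation, neighbourhood selection, and similar refinements.

```python
# Minimal sketch of user-based collaborative filtering.
# The users, datasets, and ratings are invented for illustration.
import math

ratings = {
    "alice": {"dataset_A": 5, "dataset_B": 3, "dataset_C": 4},
    "bob":   {"dataset_A": 4, "dataset_B": 3, "dataset_C": 5},
    "carol": {"dataset_A": 1, "dataset_B": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity between two users over their co-rated items."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    neighbours = [(cosine_similarity(user, other), other_ratings[item])
                  for other, other_ratings in ratings.items()
                  if other != user and item in other_ratings]
    total_similarity = sum(sim for sim, _ in neighbours)
    if total_similarity == 0:
        return None  # no basis for a prediction (cold start)
    return sum(sim * r for sim, r in neighbours) / total_similarity

# Estimate the rating carol would give to the unseen dataset_C.
print(round(predict("carol", "dataset_C"), 2))  # roughly 4.5
```

Item-based filtering follows the same pattern with the roles of users and items swapped, and content-based filtering replaces the rating vectors with feature vectors describing the items themselves.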
In addition, hybrid approaches exist that combine these techniques [25]. They often outperform algorithms that belong strictly to one class in some practical applications [149]. Recommender systems have been shown to reduce search effort [150] and to address the data overload problem [151], [152]. Some researchers have claimed that recommender systems might make search redundant in the future [153]. However, these two techniques are often combined, such as in Google's auto-complete functionality.

2.5.2 Applications of recommender systems in industrial companies

Recommender systems have been used in various applications in industrial companies, the best-known being the presentation of items in ecommerce [24], [154], such as on Amazon, and the search for content [150], [153], such as in Google and Netflix. However, these applications are usually outward-facing, towards customers, suppliers, or other external entities. Besides these external-facing applications, there are also adaptations of recommender systems for internal usage. They have been applied to knowledge management [26], internal documents [151], corporate services [155], recommending datasets to users in the field of economics [156], and SQL query recommendation [157]. Although there are various similarities between existing approaches, recommender systems have not been applied to data allocation and data overload.

2.5.3 Applications of recommender systems in data management

Recommender systems are an intensively researched field, and various techniques and approaches have been tried with various adjustments, such as linked data [158], [159] and recommender systems for apps [160], for example. The five most important existing approaches with regard to the present research are the following:

- Market-based recommender systems: Market approaches have been applied in recommender systems and researched in various domains by Wei et al. [161]–[165], Bohte et al. [166], [167], Melamed et al. [168], and Bothos et al. [25]. The authors use a variety of different recommender systems that compete for the user's attention and have to make bids in order to obtain that attention [169], [170].
- User-focused recommender systems: Many existing recommender systems mainly work on the side of the selling company, such as Amazon. These systems' main goal is to increase the revenue of the selling company, and they only partially account for the interests of users. To overcome this limitation, recommender systems that increase the user's utility have been developed [171], [172].
- Recommender systems within data allocation: Recommender systems have also been applied to companies' internal data, such as in knowledge management [26]. Glance et al. [94] introduce an approach to using recommender systems within organisations called the Knowledge Pump. Users can bookmark data, receive recommendations, and make recommendations to other users. The authors describe various issues regarding the employment of recommender systems within companies, such as the smaller number of users, the need for intensive use by people, and issues around incentivising users to use the recommender systems and make recommendations7.

7 In order to incentivise users, they developed a virtual currency as a reward for good recommendations, and they specifically mention the potential of their system to help in the calculation of Value of Information [94].
- Distributed recommender systems: These comprise different recommender systems that exchange data with each other to improve their recommendations [173].
- Profitability-based recommender systems: These systems aim to improve the profit of the selling company [174] instead of simply finding what the user might like.

The overview of existing approaches shows that recommender systems are a good tool for allocating data items to users in a flexible way, and that various ways of doing this have already been analysed. They have been shown to deal well with data overload in online news [175], for example, but have not been used in any application regarding data allocation in companies.

2.6 Summary

This literature review has demonstrated that the following five types of approaches can potentially be used to address the data overload problem:

- Search
- Analytical approaches or decision theory
- Survey/interview-based approaches or requirement analysis
- Market approaches
- Recommender systems

These approaches have been applied to the data allocation problem to varying degrees. When examining the different types of implementation, this review assessed the degree to which each approach has been applied and implemented using the following criteria:

a. Industrial application: checks whether an approach has been used for other industrial applications.
b. Data allocation in non-industrial applications: checks whether an approach has been used for data allocation in a different domain.
c. Suggested for industrial data allocation: checks whether an approach has been suggested as a solution for data allocation.
d. Methods for industrial data allocation: checks whether an approach has been adjusted to work as a method for industrial data allocation (e.g. dedicated architectures).
e. Applied to industrial data allocation: checks whether there are implementations of this approach for industrial data allocation.
f. Tested benefits for industrial data allocation: checks whether this approach has shown proven benefits compared to existing techniques and whether the nature of these benefits is clear.

An overview of the degree of application of each of the approaches discussed in this chapter can be found in Table 5. The table shows the current lack of implemented and tested market approaches and recommender system approaches. While both recommender systems and market approaches have been suggested for industrial data allocation, to date no research has applied them to it. The suggested methods are vague and lack a detailed description of how this application in industrial data management or data allocation might work. Given the limitations of the existing techniques, such as the limited scalability of surveys and analytical approaches, and the need to know what data to look for in search, recommender systems and market approaches can create benefits in industrial data allocation. This thesis aims to address this research gap by proposing an architectural approach based on recommender systems and/or market approaches. The following chapter will describe how this thesis will do this, and which questions need to be answered.
Degree of application and implementation | Search | Analytical approaches | Survey/interview approaches | Market approaches | Recommender systems
Industrial application | ✓ | ✓ | ✓ | ✓ | ✓
Data allocation in non-industrial applications | ✓ | ✓ | ✓ | ✓ | ✓
Suggested | ✓ | ✓ | ✓ | (✓) | (✓)
Methods | ✓ | ✓ | ✓ | (✓) | (✓)
Applied | ✓ | ✓ | ✓ | |
Tested benefits | ✓ | ✓ | ✓ | |

Table 5: Overview of different approaches to the data allocation problem and their degree of application and implementation

3. Research methodology

3.1 Research questions

The previous chapter illustrated the research gap: a lack of scalable and flexible data allocation techniques for identifying the data relevant to a user, as well as the identified potential of recommender systems and market approaches in addressing this gap. This thesis aims to develop and test an approach based on recommender systems and/or market approaches for industrial data allocation and to close the research gap identified in chapter 2. This thesis therefore adopts the following hypothesis:

Based on the characteristics identified in the previous chapter, recommender systems and market approaches can be used to identify the relevant data for users in a company and increase the amount of relevant data allocated to the user while reducing the problem of data overload.

To test this hypothesis, this thesis must first answer the question:

1. What is the best way of using recommender systems and/or market approaches in industrial data allocation to improve performance in terms of precision, recall, novelty, coverage, and computation time?

The literature review discussed the potential of these techniques, but the lack of existing methods and applications for data allocation (see chapter 2) demonstrated the need to address this question first. The first research question is answered by comparing different ways of using recommender systems and market approaches for data allocation (chapter 4) and then further describing the detailed development of these approaches (chapters 5 and 6). The approach must then be tested by answering the following question:

2. Can the recommender system and market approach, individually or in combination, identify relevant data better than potential alternative techniques?

The second question is answered by comparing the accuracy of different techniques in providing different types of relevant data to users, thereby reducing data overload and improving data allocation.

3.2 Research approach

This section describes how this thesis aims to answer the questions identified above. It first describes the research philosophy underlying the epistemological approach of this thesis, which forms the basis for identifying possible existing methodologies. This leads to the methodology selected for this thesis.

3.2.1 Epistemological approach

This thesis adopts a realist ontology, assuming that the world is independent of the researcher's perspective and that science must observe nature in order to progress [176], [177]. Vessey et al. argued that information systems research is either descriptive or evaluative [178]. Since this study aims to test the benefits of recommender systems and market approaches in data allocation, it is evaluative by nature. With an evaluative approach, according to Vessey et al., the research can be either positivist or interpretivist [182]. A positivist approach is based on hypotheses, deductions, and causalities. Its results must be replicable and generalisable, as well as quantitative and measurable [177].
The interpretivist view assumes that the world is affected by the subjective judgement of people. Its results must be interpreted and generalised in context [176], [177]. The present research adopts mainly a positivist approach, which fits the underlying realist ontology and the hypothesis-driven nature of this research [179], [180]. It attempts mainly to quantitatively measure the benefits of the approaches developed and presented in it. However, for the design of the experiments and the case studies, it adopts an interpretivist view to gather qualitative input through expert opinions. The specific application of these two views will become clearer in the following subsection.

3.2.2 Selected research approach

To answer the first research question, this thesis discusses the limitations of the existing techniques identified in the previous chapter. It then identifies potential ways of using recommender systems and market approaches, either alone or in combination, and selects the approach most likely to improve precision, recall, coverage, and novelty for detailed analysis.

Leveraging the approach identified by the first question, the second research question uses a framework from Yin [179] for identifying the appropriate research method (see Table 6). This thesis adapts this framework in accordance with Kitchenham and Pickard [180], including the question of 'Which is better?' for experiments and case studies. Given the research questions and the focus on contemporary events, experiments and case studies were identified as possible approaches. These are the methods typically used in information systems research [180].

Strategy | Form of research question | Requires control of behavioural events? | Focus on contemporary events?
Experiment | How, why, which is better? | Yes | Yes
Survey | Who, what, where, how many, how much? | No | Yes
Archival analysis | Who, what, where, how many, how much? | No | Yes/No
History | How, why? | No | No
Case study | How, why, which is better? | No | Yes

Table 6: Framework for research method adaptation based on Yin [179] and Kitchenham and Pickard [180]

Experiments and case studies offer different benefits for answering the research questions. In addition to the difference in level of control identified by Yin [179], further differences are highlighted by Pfleeger [181] (see Table 7).

Factor | Experiments | Case studies
Level of control | High | Low
Difficulty of control | Low | High
Level of replication | High | Low
Cost of replication | Low | High

Table 7: Factors relating to the choice of research technique, identified by Pfleeger [181]

Importantly for answering the second research question, experiments and case studies have attributes that are desirable at different research stages. Due to the novelty of using recommender systems and market approaches for data allocation, various parameters within these approaches needed to be tested, requiring more replications and a higher level of control within the testing environment. Compared to informal experiments, formal experiments are often smaller in scale, more scientifically rigorous, and better for comparing different approaches [180], and they have a higher level of control (meaning the ability to adjust an experiment more directly, precisely, and systematically) [181], all of which were desirable when conducting the initial testing of the recommender systems, market approaches, and potential alternatives.
Therefore, experiments were adopted in the early stages of this research to identify the key factors influencing the performance of the different approaches to data allocation. However, the use of case studies is favourable for the following reasons: the control of behavioural events mentioned by Yin [179]; the desirable aspect of less control in a realistic environment, allowing the identification of behavioural variables (of companies and/or employees) not considered in the experiments; and the limited generalisability of experiments to a range of industrial problems8 [180]. Case studies provide a deeper, more valid, and more testable understanding of the true industrial environment [182]; they can help judge whether a technology can be used in a company [180]; and they help identify potentially previously unidentified variables [183]. This thesis uses different case studies to identify a broader range of variables [184] in different data allocation scenarios. Hence, a set of case studies follows the discussion of the initial experiments. This ensures that the most suitable configurations identified are tested in a more industrially relevant case study environment.

8 While Pfleeger [181] mentioned that experiments are more generalisable, the author also illustrated their limitation to the specific experimental setup. In an industrial context with a variety of variables, this specific setup can therefore not be generalised.

In the experiments and case studies, this research generally followed a hybrid approach combining qualitative and quantitative methods, as suggested by various researchers [176], [185], in line with the positivist and interpretivist views outlined in the epistemological approach. Qualitative approaches were used in the development and identification of the experiments and in the initial assessment of the different architectural approaches. Quantitative measures were used to evaluate these experiments and case studies.

The qualitative research in the development of the experiments and the identification of case studies ensured industrial relevance. It was mainly based on a literature review (research as well as industrial white papers) and focussed on unstructured interviews with industrial experts. The aim was to identify typical characteristics of data allocation scenarios in order to select representative experiments and case studies. Due to the relatively small sample of available experts with specific domain expertise on the various datasets, and due to the level of detail required, interviews offered the best option for obtaining the required information; surveys do not offer the required level of detail. In addition, quantitative approaches provide the measurable facts and evaluation needed to answer the research questions in a logical and structured manner, following the positivist approach.

3.3 Research methodology

The qualitative analysis identified the research gap, the current industrial problems, and the current industry standard. To close the research gap, this thesis developed an approach using recommender systems and/or market approaches. It initially compared various architectures and identified the most promising approach to improving precision, recall, novelty, and coverage using qualitative criteria. An architecture could be a market approach only, a recommender system only, or a combination of the two. Based on this analysis, the approach most likely to improve precision, recall, novelty, and coverage was then developed.
It was further adjusted and evaluated with quantitative analysis, using experiments to develop a set of suitable setup variables. Further case studies and experiments were used to compare this approach against alternatives (see Figure 3).

Figure 3: Research process

The experiments and case studies were based on the literature review and interviews with experts in different companies. The experiments followed the approaches outlined by Pfleeger [181] and Kitchenham et al. [180]. These approaches are similar to the steps suggested by Basili et al. [186], who categorised preparation and execution into 'experiment operation', and analysis, dissemination, and decision-making into 'experiment interpretation'.

3.4 Summary

This thesis used experiments and case studies to verify its hypothesis that recommender systems and market approaches can help companies identify relevant data and solve many of their data overload and data allocation problems. The combination of experiments and case studies ensured both the large number of repetitions needed to test this approach and its industrial relevance. The experiment design and case study selection were informed by interviews and a literature review. The research was based on a positivist view and a realist ontology.

The following three chapters first determine the most suitable approach (in terms of improving precision, recall, novelty, and coverage) to using recommender systems and/or market approaches to overcome the data allocation problem. Chapter 7 then compares this approach's performance to alternative approaches to solving the data allocation problem.

4. An approach to using recommender systems and markets

4.1 Introduction

Chapter 2 discussed the research gap regarding methods with tested benefits using recommender systems and market approaches. To address this gap, chapter 3 identified the following initial research question: '1. What is the best way of using recommender systems and / or market approaches in industrial data allocation to improve performance in terms of precision, recall, novelty, coverage and computation time?'. This chapter discusses this research question and defines the recommender systems and market approaches. Each approach can be divided into two elements:

- High-level architecture: ways of using recommender systems and market mechanisms to address the industrial data allocation problem
- Functionality: functional elements required for recommender systems and market mechanisms to successfully allocate data to users

This chapter first identifies the high-level architecture enabling the main functionalities and then reviews the key specific functional elements for recommender systems and market mechanisms used individually or in combination.

4.2 Selection of high-level architecture

4.2.1 Criteria for selection of high-level architecture

For any approach to presenting additional data to the user, its architecture must provide a specific set of high-level functionalities. Functionality-based evaluation is a main part of most architecture evaluations [187]–[190]. The approaches developed in this thesis aim to support information systems by providing better data to users9. To achieve this, each approach requires specific functionalities. The approach must identify the user and what the user is working on (the task; see criterion A), identify the datasets that the user needs for the current task (see criterion B), and then present these to the user (see criterion D).

9 For this thesis, information systems are defined as systems that 'use data stored in computer databases to provide needed information' [40].
However, as shown in subsection 2.2.3, the approaches tested in this thesis aim to reduce data overload with recommender systems and market approaches by finding the data most relevant to the user. To achieve this, these approaches require the functionality of ordering data by relevancy (see criterion C) and, ideally, input from the user to improve the ordering of data (see criterion E). In terms of this research, there are five main functional requirements for the architecture:

A. Identify the current task: relevant data must be applicable to the task of a user.
B. Identify datasets relevant to the current task: the approach requires a mechanism to identify which additional datasets may be relevant out of all the datasets existing in a company.
C. Order datasets by relevance: to present these datasets to a user, the approach must rank them. This way, the most relevant datasets are allocated to the user.
D. Present the most relevant datasets: after ranking these datasets, the approach needs a mechanism to display the relevant datasets.
E. (Optional) Improve on the selected datasets: this criterion is not always required for relevant data allocation. However, given the increasing numbers of datasets and the increasing complexity of data (see chapter 2), few approaches are likely to provide the correct information immediately; therefore, the approach requires some ability to adjust its relevance evaluation.

Without these steps, no approach can process the large numbers of available datasets and select the ones most relevant to the user.

4.2.2 Potential high-level architectures

There are various potential architectures for using markets and recommender systems for industrial data allocation. This subsection tests various combinations, considering naïve solutions as well as existing applications for arranging recommender systems and market approaches. The naïve architectures can generally be broken down into the following four archetypes:

- Recommender only (standalone recommender system): This architecture uses only a recommender system without a market approach.
- Market approach only: This architecture uses only a market approach without a recommender system.
- Recommender first: This architecture uses a recommender system first and then builds the market approach on top of it, using the recommender system results for its market analysis.
- Market approach first: This architecture uses a market first and then builds the recommender system on top of it, using the market results for its recommendations.

In addition to these naïve ways of using recommender systems and market approaches, the literature review (see chapter 2) also revealed the following types of combinations of market approaches and recommender systems:

- Market-based recommender systems: An approach in which a market is used for a competition between different recommender systems. Its architecture is similar to the recommender only archetype, with the difference that various types of sub-recommender systems compete to be allowed to present a recommendation to the user (see Figure 4).
- Recommender systems in online marketplaces: These are recommender systems like those used in online marketplaces (e.g., Amazon). The recommender system presents items.
However, if an item receives bad reviews or does not sell enough to make up its purchasing costs, it will no longer be offered to the user. The market therefore helps regulate the offers made by the recommender system. This recommender-system architecture is identical to the recommender first archetype.

An overview of each of these systems can be found in Figure 4. The following subsection compares these five approaches.

4.2.3 Comparison

The initial stage of architecture selection compares each of the five high-level architectures from subsection 4.2.2 against the criteria identified in subsection 4.2.1. An overview comparing each of these approaches against the five main criteria can be found in Table 8. Overall, market approach only and market approach first have the disadvantage of having no existing technique for initially presenting relevant datasets to the user. These approaches must rely on initial random input or another ordering system, which can then be used to further evaluate the dataset. The recommender system has the existing capability of quickly improving and evaluating initially presented items because it is used for this in other applications, for example ecommerce.

Figure 4: Overview of architectures using market approaches and / or recommender systems

Table 8 compares the five archetypes (recommender only, recommender first, market approach only, market approach first, and market-based recommender systems) criterion by criterion:

- Identify the current task / identify datasets relevant to the current task: All approaches can identify the current task and the current dataset.
- Rank datasets by relevance: All approaches can evaluate and therefore rank datasets. Content-based recommendations have been shown to work well without prior information, so the recommender-based architectures can deal with the cold-start problem. Market approaches have no method for dealing with the cold-start problem of initially presenting datasets. The setup of market-based recommender systems is often more complicated due to the large number of recommender systems involved; they also require the existence of well-functioning recommender only archetypes to compete within the market, which is not the case in data allocation.
- Present the most relevant datasets: All approaches can present the most relevant datasets to the user.
- Improve on the selected datasets: Recommender systems can receive ratings and therefore have a direct feedback mechanism. Markets must transform a rating into a utility and therefore rely on indirect input. Market-based recommender systems are similar to recommender only and recommender first in this respect, but their feedback mechanism might be more complicated because ratings must be attributed to the various sub-recommender systems.

Table 8: Comparison between different recommender system and market approach archetypes

Market-based recommender systems are an option for combining market approaches and recommender systems. However, they are more difficult to design and set up, and they are normally developed for domains that already use existing recommender systems. Several types of recommender first approaches should be successfully tested in a domain before market-based recommender systems are applied, which is not the case in data allocation. Therefore, the recommender only approach (referred to as the standalone recommender system for the remainder of this thesis) and the recommender first approach (referred to as the recommender system with market approach component for the remainder of this thesis) are the high-level architectures most suited to improving precision, recall, novelty, and coverage.
These are also the two most prominent ways of using recommender systems and/or market approaches on websites and in other applications. Standalone recommender systems are used on various websites (e.g., for presenting movies to sell to the customer), and recommender first systems with a market approach component are used on websites such as Amazon and eBay, where a recommender system shows recommended items but the market determines which items are profitable enough to remain on the website.

4.3 Main functionality
The following subsection describes the main elements of the recommender systems and the market approaches, and then determines the main functionality decisions for each of these approaches separately and in combination.

4.3.1 Recommender system functionality setup
As indicated in the literature review (see chapter 2), hybrid recommender systems often produce the best results. The recommender system component used in this thesis therefore adopts a hybrid approach combining all three types of recommender systems (i.e., content-, user-, and item-based recommender systems). In a hybrid approach, each system relies on a series of separate functions for computing its results based on the user input, and these results are then aggregated. The user- and item-based functions are usually based on standard similarity measures (e.g., cosine similarity, Euclidean distance). This thesis therefore compares multiple functions in the experiments of chapter 7. Content-based systems rely on comparison of the content, which for these recommender systems is data. Therefore, for the content-based system, a new approach for data comparison needed to be developed, which is outlined in chapter 5.
For the aggregation, there exists a potentially infinite number of functions for combining these different recommender systems, ranging from a simple average to more complex techniques such as neural networks. This thesis uses simpler functions, such as average, max, or min. This offers three specific benefits:
- Proven performance in other recommender systems: These relatively simple functions have worked successfully in various recommender systems [36], [191], [192].
- Initial nature of this research: This is the first application of recommender systems to industrial data allocation. Using established and simple algorithms that have been applied successfully and repeatedly reduces the risk of problems caused by overly complex techniques and establishes a performance baseline that can be improved through additional research.
- Attribution of benefits to specific sub-recommender systems: Simpler functions can be understood more easily, which makes it easier to attribute successful recommendations to one of the sub-recommender systems.
Further details on these two critical functional setups (recommender system functions and aggregation) can be found in chapter 5.

4.3.2 Market approach functionality setup
Chapter 2 showed that market approaches rely on two types of input to use their resource allocation capabilities: utility and costs [87], [131], [193]. In addition, the literature review showed that markets need an auction mechanism to combine these two inputs. The market approach therefore requires decisions on the following three main functional variations:
- Utility function: There are various potential utility functions that can be adopted, using inputs such as data quality and usage. A detailed assessment of the existing literature on potential criteria can be found in section 6.3.
It shows that one of the most common indicators is usage. Therefore, a usage-based utility function is used in this thesis.
- Cost description: This thesis attempts to capture all costs for maintaining and providing datasets to users in the future. A detailed breakdown of the costs can be found in section 6.5. Estimating these costs is difficult: there are various elaborate methods for estimating software development costs and the costs of datasets, but finding a method suitable for a large number of datasets is hard. This thesis found that experts can provide helpful estimates for these costs and hence applied an interview-based method for cost evaluation.
- Auction mechanism: An auction mechanism covers the type of auction and the mechanism controlling the participants.
o Type of auction: English auctions and Dutch auctions are used for similar problems [87] and are the two most commonly used auction types [112]. They also provide outcomes similar to various alternative auction types [112]. This thesis therefore applies these two auction types.
o Auction organisation: There are centralised approaches, in which a market maker takes all the price offers, and decentralised approaches, in which the market participants trade with each other [112]. This thesis uses a centralised approach because it is easier to compute10 and is a commonly used approach for similar problems [87].

10 Decentralised solutions are often more complex to develop because the various auction participants must be coordinated, and it is more difficult to ensure a good result. As long as the number of datasets does not reach several millions of tables, a centralised approach should remain computable.

For this initial research, this thesis therefore uses a market maker approach and tests the two most common auction types, English auctions and Dutch auctions. These three functional decisions were key to setting up the market approach. They determined the initial direction of development, which is detailed in chapter 6.

4.3.3 Setting up the interface between the recommender system and market approach
Finally, the market approach requires an output in the form of an impact on the recommender system. The main questions were a) how are the recommendations influenced, and b) by which measure are they influenced?
- Type of influence: The recommender system could be influenced by either the rank of the datasets in the market approach or their specific evaluations. Both techniques are developed and tested in chapter 7.
- Type of measure: The mechanism could potentially use revenue, costs, or profit as variables computed using the auctions and utility function. Of these, revenue and profit are mainly influenced by the dataset's relevancy to the user. They are therefore the two measures tested in chapter 7.
These different combinations (two types of influence times two measures) were tested to find the best way to influence the recommender system based on the market approach analysis.
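To make the four interface combinations concrete, the following Java sketch enumerates them. The class, the enums, and in particular the weighting formulas are purely illustrative assumptions, not the thesis's implementation; the concrete influence functions are developed and tested in chapter 7.

public class MarketInfluence {

    enum InfluenceType { RANK, EVALUATION } // a) influence via market rank or via the market evaluation itself
    enum Measure { REVENUE, PROFIT }        // b) measure; costs alone are not driven by relevance

    private final InfluenceType type;
    private final Measure measure;

    MarketInfluence(InfluenceType type, Measure measure) {
        this.type = type;
        this.measure = measure;
    }

    // Adjusts a recommender score with the market result of a dataset.
    // The two formulas below are placeholders to show where rank or
    // evaluation would enter; they are not the tested functions.
    double adjust(double recommenderScore, double revenue, double profit, int marketRank) {
        double m = (measure == Measure.REVENUE) ? revenue : profit;
        if (type == InfluenceType.RANK) {
            return recommenderScore / marketRank; // assumes marketRank >= 1
        }
        return recommenderScore * Math.max(m, 0.0);
    }
}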
4.4 The RecorDa approach
The analysis of architectural approaches has shown that, overall, either of the following two high-level architectural approaches seems most suitable (in terms of improving precision, recall, novelty, and coverage) for allocating data to users:
- Standalone recommender system, which provides data recommendations without a market approach
- Market approach based on a recommender system, which leverages the recommender system to determine the inputs to the market evaluation
Within each of these, the analysis in this chapter found the following key functional decisions to be the most suitable for future development.
- The following decisions relate to the recommender system:
o Use of all three types of sub-recommender systems (content-, user-, and item-based), adapted to the problem of data overload
o Use of a series of simpler functions for aggregating the sub-recommender systems, due to the benefits these functions showed in other recommender systems and the initial nature of this research, and to more clearly attribute benefits to specific sub-recommender systems
- The following decisions relate to the market approach:
o Use of a utility function based on usage, because this is the most established criterion for evaluating data relevancy
o Use of a cost assessment based on expert interviews, because this requires only a single effort per dataset and is hence potentially scalable and easiest to implement
o Use of the two most established auction mechanisms, that is, English auctions and Dutch auctions. They represent the most commonly used types of auctions and compute results equivalent to a series of other auction mechanisms. The approach combines these with a market-maker mechanism.
- For the impact of the market approach on the recommendations, the selected approach tests four potential ways of influencing the recommendations.
This way of using a recommender system, either as a standalone system or in combination with a market approach, with these functionality elements, is called RecorDa (shown in Figure 5). The first variation of the RecorDa approach consists of a market approach component built on the recommender system component; the standalone recommender system variation relies only on the recommender system component.
The recommender system with market approach architecture relies on both components. The recommender system component initially identifies which additional datasets are relevant to the user (using the user details captured in recommender system step 1) by providing the user with likely relevant datasets (step 3), generated with a combination of different recommender system techniques, such as content-based, user-based, and item-based recommendations (step 2). It improves the recommendations using ratings from the user on these additional datasets (step 4). The ratings and the logs serve as input to the RecorDa market approach component. A utility function that uses the number of times an additional dataset has been presented to the user evaluates the relevance of different combinations of datasets (step 5).

Figure 5: High-level architecture of data relevance evaluation and data allocation in RecorDa

Within the market component, the valuation of a combination of datasets is then allocated to individual datasets (step 6). Once the relevance of a dataset has been identified, it must influence which data is presented to the user. A function influencing the recommendations to the user manages this interaction (step 7). This process is repeated each time new datasets are shown to the user, continuously improving the presented data. This architecture enables the RecorDa approach to follow the steps required for successful data allocation.
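The overall iteration can be summarised in code. The following is a minimal Java sketch of steps 1-7 under assumed component interfaces; all names (Recommender, MarketComponent, and their methods) are hypothetical, and the concrete algorithms are given in chapters 5 and 6 (Algorithms 5.1 and 6.1).

public class RecordaLoop {

    interface Recommender {
        java.util.List<String> recommend(String user, String task);       // steps 1-3
        void addRating(String user, String task, String table, double r); // step 4
        void applyMarketInfluence(java.util.Map<String, Double> profits); // step 7
    }

    interface MarketComponent {
        // Steps 5-6: utility of viewed dataset combinations is computed
        // and auctioned down to profits per individual dataset.
        java.util.Map<String, Double> profitsPerDataset();
        void logView(String user, java.util.List<String> shownTables);
    }

    void iterate(Recommender rec, MarketComponent market, String user, String task) {
        java.util.List<String> tables = rec.recommend(user, task); // steps 1-3
        market.logView(user, tables);                              // log feeds the utility function (step 5)
        // Step 4: user ratings would be captured here via the user interface
        // and passed to rec.addRating(...).
        rec.applyMarketInfluence(market.profitsPerDataset());      // steps 5-7
    }
}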
Details on the two main components can be found in chapters 5 and 6. Chapter 7 then analyses various key configurations outlined in chapters 4-6 and compares the best-performing configuration against potential alternatives such as search.

5. Recommender system component
5.1 Introduction
The previous chapter identified the RecorDa approach as the most suitable way of using recommender systems and market approaches. This chapter describes the functionality of the recommender system component of the RecorDa approach, following the key functionalities identified for the RecorDa approach in chapter 4.

5.2 Data allocation with recommender systems
As seen in chapter 2, a recommender system is typically based on three types of recommender systems (user-, item-, and content-based), which are combined to recommend items to the user. Before using these approaches, the recommender system component of the RecorDa approach first needs to know who the user is and which datasets the user is working with. This allows the approach to approximate the user's task. The user is known from the login details. The data with which the user is working can be identified and captured from the user interface. The recommender system component therefore takes the data presented to the user within the current information system (called the operational record) (see step 1 of Figure 6 and Algorithm 5.1). Next, the recommender system component identifies the source (the data tables) of the currently presented data (called working tables in Algorithm 5.1). Only then does it ask the recommendation engine for additional data tables of relevance to the user (step 2). All further sub-recommender systems work on the basis of data tables and tasks per user11.
In order to identify data tables relevant to the current task, the recommender system component uses a recommendation engine. As outlined in chapter 4, the recommendation engine is based on the following three separate sub-recommender systems.
User recommender system: Identifies additional data tables by looking for users with similar rankings for other data tables12. The RecorDa user recommender system uses existing similarity measures for recommender systems (e.g. cosine similarity) from the standard Mahout library [194]. Different combinations of these similarity functions are tested in the experiments of subsection 7.4.3.

11 This ensures a high specificity for providing relevant datasets while remaining generic enough for the recommender systems to collect various user interactions. Changing the granularity by being more specific, e.g. at the level of data records, would create too many combinations (i.e. between all records in a database), while being more generic would lose a lot of granularity.
12 This thesis uses the Mahout recommender system library and Pearson correlation to find similar users as one implementation.

Item recommender system: Recommends data tables to the user by looking for data tables that are similar to the currently presented data table in the sense that they have received similar ratings13. It also uses standard similarity measures from the Mahout library [194], which are tested in the experiments to identify the best-performing similarity function.

13 This thesis uses the Mahout recommender system library and Pearson correlation to find similar items as one implementation.
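Before turning to the content-based approach, the user sub-recommender system can be made concrete with a minimal Java sketch using the Mahout taste API referenced above with Pearson correlation (cf. footnote 12). The ratings file name, the user/task profile ID, and the neighbourhood size of 10 are illustrative assumptions, not the configuration used in the experiments.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserTableRecommender {
    public static void main(String[] args) throws Exception {
        // Ratings file with lines of the form: userTaskId,dataTableId,rating
        // ("table-ratings.csv" is an illustrative file name).
        DataModel model = new FileDataModel(new File("table-ratings.csv"));

        // Pearson correlation between the table ratings of different users.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider the 10 most similar (user, task) profiles; 10 is an assumption.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend the top five data tables for user/task profile 42.
        List<RecommendedItem> tables = recommender.recommend(42L, 5);
        for (RecommendedItem t : tables) {
            System.out.println("table " + t.getItemID() + " score " + t.getValue());
        }
    }
}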
Content-based approach: Uses data characterisation [195] to identify similar data tables. It takes all columns from a data table and generates metadata about the data in each column (e.g. mean word length, fraction of NULL values). A neural network is used to find matches between columns. Data tables with columns that have a high likelihood of matching are recommended as similar content. The benefit of the content-based approach is that it does not require any input from the user. Data characterisation, or automatic schema matching, is a standard method, and various papers have been written about it [195]. It is used to pre-compute the similarity measures between data tables. However, adopting it for data recommendations by using its results in a content-based approach is one of the novelties of this thesis.
Both the user and the item recommender systems use previous ratings provided by the user to influence their recommendations. Each of the three sub-recommender systems provides a list of potentially relevant data tables and a relevance estimate for each data table (called table similarity in Algorithm 5.1). Each sub-recommender system is based on the standard techniques typically used to ensure the relevance of the provided recommendations.
In line with the second functionality from chapter 4, ranking the datasets, the results of these three separate sub-recommender systems are aggregated into a single list of recommended data tables (step 3) using the average, maximum, or minimum of the individually calculated recommendation scores (all variations are tested in chapter 7 and are typically used in recommender systems). There are potentially other implementations, which could be analysed in the future.
To ensure that the most relevant datasets are presented to the user, it is critical that the recommender system component does not simply present full tables, because the user would not be able to find the relevant records within such large tables. The RecorDa approach therefore uses the operational record to identify similar records in the recommended data tables by accessing the relevant database (step 4). Records with an identical join14 to the operational record are extracted from the system (step 5) and presented to the user in descending order of rating (step 6). The user is thus presented with datasets that are relevant to the datasets on which the user is working, and is only shown the matching records from these data tables, significantly reducing the search effort and improving the relevance of the presented data.

14 The current system works with identical joins. However, further approaches could also include non-identical joins and show similar items using techniques such as fuzzy matching [196].

In the current setting of RecorDa, the user is initially presented with the first five recommended tables (called top tables in Algorithm 5.1) alongside the existing information system, and has the option to click through to additional recommendations. The user also has the opportunity to rate the data (via the getUserRatings function in Algorithm 5.1); these ratings are then used to further improve the data presented by the recommender systems through the item- and user-based recommendations. An overview of the recommender system approach can be found in Figure 6. The figure and Algorithm 5.1 show all of the computational steps of the RecorDa recommender system component required to ensure that the recommender system can allocate data to the user.
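The aggregation can be illustrated with a short Java sketch that combines the three sub-recommender scores by average, maximum, or minimum and applies a total threshold (see Table 9). The table names and the threshold value are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class ScoreAggregator {

    static Map<String, Double> aggregate(Map<String, double[]> scoresPerTable,
                                         String mode, double totalThreshold) {
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, double[]> e : scoresPerTable.entrySet()) {
            double[] s = e.getValue(); // [user-based, item-based, content-based], each in [0, 1]
            double agg;
            switch (mode) {
                case "max": agg = Math.max(s[0], Math.max(s[1], s[2])); break;
                case "min": agg = Math.min(s[0], Math.min(s[1], s[2])); break;
                default:    agg = (s[0] + s[1] + s[2]) / 3.0;           // average
            }
            if (agg >= totalThreshold) { // only sufficiently high scores become recommendations
                result.put(e.getKey(), agg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> scores = new HashMap<>();
        scores.put("supplier_table", new double[] {0.8, 0.6, 0.9});
        scores.put("invoice_table",  new double[] {0.2, 0.1, 0.4});
        // With an illustrative total threshold of 0.5, only supplier_table survives.
        System.out.println(aggregate(scores, "average", 0.5));
    }
}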
Figure 6: Description of the different process steps of the recommender system

Algorithm 5.1: Recommender system algorithm
Variables:
Task = Defines the task that a user is working on
User = Defines the specific user (e.g. via ID or name)
Record = Defines the specific datasets that a user is working on
WorkingTables = Defines the tables that contain the data from the Record variable
TableSim = Defines a matrix of tables and similarity scores for the existing WorkingTables
SimScores = Matrix of all tables against all other tables, containing similarity scores based on their data characteristics. It is pre-computed with data characterisation algorithms.
TopTable = List of highest-rated tables that are recommended to the user
MatchingRecords = Records from the TopTable that have a direct syntactic match to the Record variable
Ratings = Contains a list of ratings of user, data table, and rating score for each element in the list
Functions:
getCurrentUserAndTask = Identifies the current task, user, and record that a user is working with in the information system
getWorkingTables = Identifies the tables that a user is currently working with
userRecommenderSystem = Applies the existing similarity measures for user recommendations from the Mahout library using the completed ratings from the user
itemRecommenderSystem = Applies the existing similarity measures for item recommendations from the Mahout library using the completed ratings from the user
contentRecommenderSystem = Identifies tables that likely have similar content from the pre-computed SimScores variable by selecting the tables with the highest ranking for the current working tables
aggregate = Combines the different TableSims by taking the min, max, or average of the values from the sub-recommender systems
selectTopTables = Selects the highest-rated tables from the aggregated table similarities
match = Takes a list of tables and the current record and identifies the records from the tables in the list that have a direct syntactic match to the current record
present = Shows the recommended datasets to the user
getUserRatings = Gets the rating from the user when a dataset is presented
Algorithm:
// Step 1:
Task, User, Record ← getCurrentUserAndTask()
// Step 2:
WorkingTables ← getWorkingTables(User, Task, Record)
// Step 3 (recommendation engine):
TableSimA ← userRecommenderSystem(User, Task, Ratings)
TableSimB ← itemRecommenderSystem(User, Task, Ratings)
TableSimC ← contentRecommenderSystem(User, Task, SimScores)
TableSimAggregate ← aggregate(TableSimA, TableSimB, TableSimC)
// Step 4:
TopTable ← selectTopTables(TableSimAggregate)
// Step 5:
MatchingRecords ← match(Record, TopTable)
// Step 6:
present(MatchingRecords)
// Capture ratings:
Ratings ← getUserRatings(Ratings, User, Task, TopTable)
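The match step of Algorithm 5.1 can be illustrated as follows: a minimal Java sketch that, for a recommended table, keeps only the rows sharing an identical cell value with the operational record (an identical join). The row representation is an illustrative simplification, the metadata-based column selection is omitted, and fuzzy matching (footnote 14) is deliberately out of scope.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RecordMatcher {

    // A row is modelled as a map from column name to cell value.
    static List<Map<String, String>> match(Map<String, String> operationalRecord,
                                           List<Map<String, String>> recommendedTable) {
        List<Map<String, String>> matches = new ArrayList<>();
        for (Map<String, String> row : recommendedTable) {
            boolean identicalCell = row.values().stream()
                    .anyMatch(v -> v != null && operationalRecord.containsValue(v));
            if (identicalCell) {
                matches.add(row); // row joins to the operational record on an identical value
            }
        }
        return matches;
    }
}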
There are different functions and approaches for recommender systems that can be applied in each of the six steps. An overview of the configurations per step can be found in Table 9. Chapter 7 provides further details regarding the evaluation of these configurations for each step.

Step 1: This step captures the currently presented tables and rows. There are no different configurations for this capturing process; it relies on existing logs or administrator input.15

15 The approach presented in this thesis works with tables of rows and columns in relational databases. However, the concepts could be extended to hierarchical data structures such as XML, or to unstructured data such as text files or PDF documents, assuming an alternative approach for the content-based recommendations.

Step 2: Each of the sub-recommender systems uses a series of different configurations, as mentioned in chapter 4.
- User sub-recommender system: there are various methods for identifying similar users given their ratings: Log Likelihood Similarity [197], City Block Similarity, Euclidean Distance Similarity [197], Pearson Correlation Similarity [198], Spearman Correlation Similarity [197], Tanimoto Coefficient Similarity [197], and Uncentered Cosine Similarity [198]. These are implemented based on the Mahout library [194].
- Item sub-recommender system: there are various methods for identifying similar items given their ratings: Log Likelihood Similarity [197], City Block Similarity [199], Euclidean Distance Similarity [197], Pearson Correlation Similarity [198], Tanimoto Coefficient Similarity [197], and Uncentered Cosine Similarity [198]. These are implemented based on the Mahout library [194].
- Content-based sub-recommender system: relies on external pre-computed input from data characterisers [195]. The specific setup of the data characteriser is not the focus of this research.
In addition, recommender systems typically rely on thresholds to decide which data to present. Different thresholds are tested in the experiments:
- Cut-off for low-rated data: If a data table has a low rating, the RecorDa approach will not present it to the user. This threshold ensures that the user will not see data that the user has rated low. It can be set between 1 and 5.
- Item sub-recommender threshold: threshold for the recommendations from the item sub-recommender system to be considered in the following step.
- User sub-recommender threshold: threshold for the recommendations from the user sub-recommender system to be considered in the following step.

Step 3: Each of the sub-recommender systems provides a rating from 0 to 1 for the potentially recommended tables. This step combines these three ratings into one rating. The following are approaches for this aggregation process (as described in chapter 4):
- Max: takes the maximum of the three recommendation scores.
- Min: takes the minimum of the three recommendation scores.
- Average: takes the arithmetic mean of the three recommendation scores.
- Total threshold: threshold that the aggregated score needs to reach to be considered as a recommendation.

Steps 4, 5, and 6: These steps take the operational records and find rows with an identical data cell value in the recommended table for the columns that have sufficiently close metadata. RecorDa then joins the tables, providing additional columns from the recommended tables where the data cell values match. This thesis does not use different configurations for these steps.

Table 9: Overview of the detailed configurations for the recommender system component

5.3 Summary
The RecorDa approach recommender system component shows that recommender systems can be adjusted to provide relevant data to the user. The key adjustments compared to existing recommender system techniques are the following.
- Defining users: Normal recommender systems work on a per-user basis.
However, for RecorDa this needs to be adjusted to the user and task levels to ensure that specific datasets are allocated to each user without confusing the different actions that a user might take in a given system.
- Item definition: Existing recommender systems often work with very granular items (e.g. products on Amazon). When addressing datasets, however, this granularity is often difficult to handle because there are several layers to these datasets. The RecorDa approach addresses this issue by aggregating data records at table level to provide the relevant records from a table to the user.
- Finding an approach for content-based sub-recommender systems: There is currently no approach that deals with content-based matching for recommender systems on datasets. However, content-based matching is often critical for recommender systems to overcome the 'cold-start problem' [156], [191] (as seen in chapter 4). By adopting techniques from other domains (e.g. data characterisation) and applying their rating scheme to recommender systems, this thesis closes this gap.
- Fine-tuning the systems: There are various variables involved in setting up the recommender system. Chapter 7 provides some insights into considerations and initial results.
The next chapter addresses the market approach component.

6. Market approach component
6.1 Introduction
Chapter 4 showed the key architectural decisions for the market approach. It is based on the recommender system (outlined in chapter 5) and uses the following four main elements:
- A utility function based on usage
- A cost assessment based on expert interviews
- Two of the most established auction mechanisms, i.e. English auctions and Dutch auctions
- Four potential ways of influencing the recommendations via their rankings
This chapter describes in more detail how the market approach works, based on the architectural decisions and the recommender system component introduced in the previous chapters. It also provides more detailed reasoning for specific design selections in addition to chapter 4 and gives a detailed rationale for using market approaches.

6.2 Overall market architecture
As seen in chapter 4, market approaches rely on two types of input to use their resource allocation capabilities: utility and costs [87], [131], [193]. These are then combined via auctions. For utility, the challenge is to identify the relevance of a dataset (see section 6.3). The utility functions are based on the datasets a user finds relevant, initially identified with the recommender system component; they identify the utility of combinations of datasets. The results from the utility functions, i.e. specific valuations of dataset combinations for specific users, are combined in the Value Map (see section 6.4). For costs, it is important to measure the cost of providing the data (see section 6.5).
The RecorDa market approach component (see Figure 7) uses these inputs of utility and cost. It has the data combinations and their utility on one side, and the different datasets with their costs on the other. As shown in chapter 2, identifying which datasets provide a high enough benefit in these different data combinations is a difficult NP-hard problem (see section 6.6). Market approaches overcome this problem with auctions (see section 6.7).
Based on the profits and losses for datasets, market approaches can then be used to influence the order in which datasets are presented. To further improve the datasets presented to a user, the RecorDa market approach component influences the data shown to the user (see section 6.9) and is thus able to continuously improve its allocation. A high-level architecture of the market approach within data allocation can be found in Algorithm 6.1.

Figure 7: Description of the overall market architecture

Algorithm 6.1: High-level market architecture
Variables:
NumberOfViewsPerTableComb: Counts the number of times a combination of tables has been viewed by each user
TableCombsUtil: Contains the calculated utility for each combination of tables
TableCosts: Contains the costs for providing each individual table
ProfitsPerTable: Contains revenues and profits generated by each individual table
Functions:
getViewedTablesCombs(): Provides the number of times a combination of tables has been viewed per user
getUtilities(): Calculates the utility per combination of data tables as shown in section 6.3
getCosts(): Calculates the costs per table as described in section 6.5
AuctionMechanism(): Executes the auction algorithm as shown in section 6.7
InfluencePresentedTables(): Influences which tables are presented, based on the profits or revenues, as shown in section 6.9
Algorithm:
NumberOfViewsPerTableComb = getViewedTablesCombs()
TableCombsUtil = getUtilities(NumberOfViewsPerTableComb)
TableCosts = getCosts()
ProfitsPerTable = AuctionMechanism(TableCombsUtil, TableCosts)
InfluencePresentedTables(ProfitsPerTable)

6.3 Utility function
The previous section showed the overall functionality of the RecorDa market approach component. This section shows the first step: the utility functions. The RecorDa market approach component requires an evaluation of the datasets in order to determine their relevancy. It has to transform the data presented by the recommender system component into the relevance created by this data. Utility functions are used in market approaches to identify the relevance of a specific product for a user. In this thesis, a 'pay-per-use' utility function is used, which is based on the number of views that a dataset receives.
Research has found a series of measures linking characteristics of a dataset and its usage to the relevance of the data [200]. Usage has a strong connection to file relevance: Wijnhoven et al. [89] state, 'We found that the perceived frequency of use and user grade determine file use value.' Wijnhoven et al. [89] include a table (see Table 10) describing different methods for deciding which files to keep (hence, files that are relevant to the company), which demonstrates the significance of usage for data valuation. Number of uses is the most prominent surrogate for value in a series of studies [89]. This makes it a good initial selection for a utility function.
While views do not necessarily equal use, this is assumed to be the case within this thesis. The reason is that the RecorDa approach learns which datasets the user wants to see based on the ratings provided by the users. The user is therefore presented with the most relevant datasets. Given that the most relevant datasets will most likely impact the user's decisions, they are also the datasets most likely to be used by the user. The number of views is therefore very similar to usage within the RecorDa approach, due to its capability to learn user interests.
This thesis therefore focuses on the number of times a dataset is viewed as the only element of metadata for evaluating combinations of datasets, due to the initial nature of the present research. This evaluation approach is based on the fact that the number of data items a user utilises is limited: each user can only comprehend so much data, and various studies indicate that humans can only remember around seven things at a time [201]. Future work could extend the list of valuation criteria to include some of the other attributes (e.g. time since last access) mentioned in Table 10.

Table 10 lists the following file retention policy determination methods, each with the goal of its data retention policy and the file attributes it considers important:
- [Chen 2005]: Capture the changing nature of file value throughout the lifecycle and present the differences in values among different files. Attributes: frequency of use; recency of use.
- [Turczyk et al. 2007]: Determine the probability of the future use of files in order to store them in the most cost-effective location. Attributes: time since last access; age of file; number of accesses; file type.
- [Bhagwan et al. 2005]: Lay out storage system mechanisms that can ensure high performance and availability. Attributes: frequency of use.
- [Verma et al. 2005]: Optimise storage allocation based on policies. Attributes: frequency of use; file type.
- [Mesnier et al. 2004]: Automatically classify and predict the properties of files as they are created. Attributes: frequency of use; file type; access mode.
- [Zadok et al. 2004]: Select files that can be compressed to reduce storage consumption. Attributes: directory; file name; user; application.
- [Strange 1992]: Optimise storage in a hierarchical storage management (HSM) solution. Attributes: least recently used.
- [Gibson and Miller 1999]: Reduce storage consumption on the primary storage location. Attributes: time since last access.
- [Shah et al. 2006]: Design a data placement plan that provides cost benefits while allowing efficient access to all important data. Attributes: metadata; user input; policies.
Table 10: 'File Retention Policy Determination Methods' by Wijnhoven et al. [89]

Because it uses the number of views as a proxy for utility, this thesis adopts a cardinal utility function [202]–[205] and assigns utility purely based on its only criterion, usage. The pay-per-use utility function works by giving each user a fixed utility budget, which is allocated towards dataset combinations based on the number of times each combination is viewed. Whenever the user is presented with a combination of datasets, the presented data combination receives part of the budget (see Figure 8). This allocation mechanism is in line with typical cardinal utility functions [202]–[205]. A cardinal utility function was selected because its underlying measure, the number of views, is on a cardinal scale, making a cardinal utility measure feasible based on a direct linear relationship between utility and number of views. In addition, a cardinal utility function has desirable properties such as adaptability among various users [202]–[205]. Overall, the utility function works as shown in Algorithm 6.2.

Algorithm 6.2 for getUtilities(): Utility function
Pre-set values from administrator:
B(u): Budget for user u
U: List of users
Important functions:
views(u,c): Outputs the number of times user u has looked at data combination c. (This data could be collected by logging the user's data use.)
datacombos(u): Outputs the set of data combinations user u viewed.
Relevancy calculation:
For each user u in U:
    For each c in datacombos(u):
        M(u,c) = 0
For each user u in U:
    view_count = 0
    For each c in datacombos(u):
        view_count = view_count + views(u,c)
    For each c in datacombos(u):
        M(u,c) = (B(u) / view_count) * views(u,c)
Output:
M(u,c) = Map of values of user u for a combination of datasets c

The budget represents an overall amount of relevance perception that a user assigns to combinations of datasets based on the number of times they are viewed; it corresponds to the amount of money a person can spend within a market. These relevance values are then fed into the Value Map (see the next subsection). This function also allows the RecorDa market approach component to be continuously updated: the values assigned with the budgets change as the user's usage of data changes, allowing a continuous improvement process.

Figure 8: Illustration of the data combination evaluation process for one user

6.4 The Value Map
After identifying the potential of utility functions, this subsection addresses the use of their results. The output of the utility functions is a relevance value for a combination of datasets for a specific user. The utility function evaluates dataset combinations when they are presented to the user, and the distribution of the budget is relative to the user's number of views of the different dataset combinations. The output of all of the utility functions can be combined in a table format, with the different combinations of datasets along one axis and the different users along the other (see Figure 9 for an example).

Figure 9: An example of relevance allocations for all users regarding different combinations of datasets

This Value Map contains all possible combinations of the different datasets that a user can potentially use for a decision-making problem, along with the valuation that the utility function has found for each data combination. The benefit of the Value Map is that it can be continuously updated based on new incoming evaluations. However, it runs the risk of growing quickly.
Growth with more users
The Value Map grows with the number of users: the number of rows increases linearly with the number of users. However, this thesis assumes that an employee making decisions in a group with other employees constitutes a different user than the same employee making a decision independently. This means that the number of users could potentially increase exponentially with the number of employees in a company. Given that the number of combined decision-making opportunities for a person is limited, however, and because this thesis focuses on datasets and their evaluation instead of employee interactions and group decision-making, this issue is not addressed further.
Growth with more datasets
The number of dataset combinations in the Value Map is

number of dataset combinations = 2^(number of datasets)

including the case in which no dataset is used. Precisely as Avasarala et al. [87] observe for combinations of sensors, this can lead to a dramatically growing number of dataset combinations for each user.
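A minimal Java sketch of the Value Map, filled by the budget allocation of Algorithm 6.2, illustrates both the structure and the growth: with n datasets there are up to 2^n possible combination keys per user. The worked example allocates a budget of 100 across two combinations viewed three times and once, yielding values of 75 and 25. All user and table names are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ValueMap {

    // user -> (combination of datasets -> relevance value M(u,c))
    private final Map<String, Map<Set<String>, Double>> values = new HashMap<>();

    // Allocates user u's budget across viewed combinations proportionally to views,
    // as in Algorithm 6.2.
    void update(String user, double budget, Map<Set<String>, Integer> viewsPerCombo) {
        int viewCount = viewsPerCombo.values().stream().mapToInt(Integer::intValue).sum();
        Map<Set<String>, Double> userValues = new HashMap<>();
        viewsPerCombo.forEach((combo, views) ->
                userValues.put(combo, (budget / viewCount) * views));
        values.put(user, userValues);
    }

    public static void main(String[] args) {
        ValueMap map = new ValueMap();
        // Worked example: budget 100; {suppliers, solvency} viewed 3 times,
        // {parts} viewed once -> values 75 and 25 respectively.
        Map<Set<String>, Integer> views = new HashMap<>();
        views.put(Set.of("suppliers", "solvency"), 3);
        views.put(Set.of("parts"), 1);
        map.update("userA", 100.0, views);
        System.out.println(map.values);
    }
}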
It is important to note that the utility functions do not necessarily provide relevance values for a single data source. This thesis is therefore not directly able to take the valuation for a data combination and allocate it to a separate dataset. Some of these datasets might be supplementary to each other, meaning that one dataset can be used in place of another without a significant change in user utility. A typical example of supplementary datasets would be relying on a Bloomberg financial dataset [206] instead of one internally generated by one's own finance division. Moreover, datasets may also be complementary, which is the case if two datasets obtain a higher relevance value when they are combined. This could be the case when a list of suppliers is combined with an external dataset containing details about these suppliers' solvency: when the two are combined, the user is able to make better decisions about the suppliers, whereas each dataset on its own provides little additional relevance. Supplementary and complementary datasets are just two examples of how the valuations of dataset combinations can interact, and such interacting valuations of product combinations are often seen in market situations. Identifying which products truly create value for a company means combining these values with the costs of obtaining the datasets and identifying the ideal combination. This type of calculation can be complex, and it is introduced in detail in sections 6.6, 6.7, 6.8, and 6.9.

6.5 The costs of data
Besides the utility function and its use in the Value Map, covered in the previous subsections, section 6.2 also introduced the importance of costs for market approaches. As its second key functionality decision, this thesis uses interviews to identify the costs (see subsection 4.3.2). It examines the following costs for dataset allocation:
- Maintenance costs: Internal costs for ensuring that the data remains available to the user in its current form.
- Development costs: Internal costs for ensuring that the data will be available in a different form in the future.
- Subscription costs: Payments to external parties for use of a dataset.
- Opportunity costs: Costs for the existence of the dataset in the system. Using one dataset might mean that this dataset takes the spot of another, more relevant dataset. Opportunity costs capture this issue.
The costs are considered with a future perspective, ignoring sunk costs: if a dataset created allocation costs in the past but will not in the future, then its costs are 0. However, if a large investment must be made to acquire a dataset, then these costs are included. All of these costs can be estimated using established approaches for cost estimation in software project management and by asking experienced experts. The opportunity costs are set as a low fixed value initially, but further research could use more elaborate techniques to identify them.
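The cost model can be sketched as a simple forward-looking sum of the four categories. The Java class below is an illustrative assumption (the field names and the fixed opportunity-cost value are not taken from the thesis implementation); actual figures would come from the expert interviews.

public class DatasetCosts {

    double maintenance;   // keeping the data available in its current form
    double development;   // making the data available in a different form in the future
    double subscription;  // payments to external data providers
    double opportunity;   // initially a low fixed value (see section 6.5)

    DatasetCosts(double maintenance, double development, double subscription) {
        this.maintenance = maintenance;
        this.development = development;
        this.subscription = subscription;
        this.opportunity = 1.0; // illustrative; more elaborate estimates are future work
    }

    // Total future cost of providing the dataset; sunk (past) costs are excluded
    // by construction, since only forward-looking figures are entered.
    double totalFutureCost() {
        return maintenance + development + subscription + opportunity;
    }
}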
6.6 The data allocation problem
The previous three subsections covered the utility and costs of datasets. This section addresses how these are combined to ensure that the most relevant data is presented to a user. The problem of identifying which datasets should be presented to the user requires identifying individual dataset valuations. Therefore, a breakdown of the valuations from the Value Map, which are based on dataset combinations rather than individual datasets, is needed. Section 6.4 showed that it can be difficult to allocate relevance values to individual datasets even for a small number of datasets: there are a large variety of possible evaluations for each dataset, depending on the comparison to the other datasets. Each evaluation is potentially valid, making it difficult to decide which one to use for further calculation.
Besides these computational problems, there are additional issues with the Value Map arising from the recommender system and the users' budgets:
A. It is continuously updated: The values in the Value Map change whenever the user looks at different datasets, gives different feedback for recommendations, or changes the user's preferences.
B. It is incomplete: The Value Map does not contain valuations for all individual users and all dataset combinations, because not all of these are shown to the user.
The algorithm to address this problem therefore needs to fulfil additional requirements (see Table 11).

Table 11 describes the impact these limitations of the Value Map have on algorithms using the Value Map for individual dataset evaluations:
- Limitation A (the Value Map is continuously updated): The algorithm needs to be able to handle this additional information while still computing relatively accurate results without a complete restart of the whole calculation process. It needs to update its calculation relatively quickly, within a couple of hours; it cannot take weeks or months to complete. However, the processing is still independent of the Value Map creation, so it does not require an update within seconds.
- Limitation B (the Value Map is incomplete): The algorithm cannot rely on relative comparisons of all kinds of dataset combinations, because these are not always available.
Table 11: Describing the impact that the limitations of the Value Map have on algorithms using the Value Map for individual dataset evaluations

This thesis proposes the use of market approaches, introduced in the previous chapters, to overcome this problem and find individual dataset evaluations16. While potentially offering some computation time improvements, market approaches especially offer the benefit of not requiring individual comparisons and of being sensitive to new incoming data without requiring the whole calculation to be redone.

16 Individual dataset evaluations are often required to make specific data management decisions and to select which datasets to present.

6.7 Market approaches for solving the data allocation problem
The previous sections discussed the difficulty of finding a good solution for the individual dataset evaluations from the Value Map with its dataset combinations. The literature review showed that market approaches can help to solve these types of problems. The market approach component of the RecorDa approach needs to manage the interactions between the costs of individual datasets and the utility of the combinations of datasets used by the user. The valuations of dataset combinations covered in the Value Map need to be broken down to individual datasets. The market-based algorithm needs to find the price that each individual dataset contributes to the different combinations of each user. This price represents the relevance that this dataset provides, and it can be compared to the costs of offering this dataset.
Market approaches use auctions for this challenge of price determination. An auction is based on two types of participants: buyers and sellers [109], [112]. Buyers are interested in acquiring a product and have a specific utility (or value) for this product. Sellers are interested in selling a product (in this case, data) for as much as possible in order to cover their costs.
Buyers continuously look for other options that they are interested in buying, and sellers continuously look for other people to whom they can sell their product. One can transfer the problem described within the Value Map to a market problem in which auctions manage these transactions. This brings the benefit of faster calculation times [63]. Data sources want to sell data, and users are interested in buying data. Therefore, the following types of sellers and buyers are used for the auction mechanism.
Buyers
The data buyer is the user. The data buyer's evaluation of a data combination is given by a utility function, which influences the data buyer's willingness to pay for a specific data combination. The buyer participates in a variety of auctions to find the highest gain in utility given the available data combinations. Data buyers pay based on the individual asking prices of the data sellers.
Sellers
The data seller is interested in selling data in order to best cover the costs associated with offering the dataset. Data sellers are individual datasets: they have to sell themselves to the data buyers (the users) by bidding for each individual user. The data seller tries to maximise its revenue in order to obtain a high relevance level for its data. If the revenue or income of a data seller is lower than its costs, the seller is put out of business and is no longer part of the market.
The data buyer therefore evaluates combinations of datasets, while the seller sells datasets individually to each data buyer. The difficulty is that the user buys a combination of datasets to use for decision-making. The auction mechanisms deal with situations in which a buyer evaluates and ultimately selects a combination of individual products (in this case, datasets), while the data seller only sells individual products. The following algorithms describe how data sellers and data buyers operate within this market approach component (see Algorithm 6.3 and Algorithm 6.4). The market runs a series of iterations until the prices for all datasets no longer change or until a fixed cut-off of 1,000 iterations is reached.

Algorithm 6.3 AuctionMechanism(): Data buyer's algorithm for evaluating price offers from the auction
Values given at start of market:
V: Value of data combination
DC: Dataset combination for this buyer
Variables:
P: Total price
AP: Auction prices from sellers
d and z: Generic variables for datasets
pd: Price for dataset d
Key functions:
Auction(): Obtains a set of all auction prices for all datasets
Get_Price(d,AP): Obtains the price for dataset d from the auction prices AP
Add_Buyer(z,b): Adds buyer b to seller z
Calculation at each market iteration:
For each auction iteration:
    P = 0
    AP = Auction()
    For each dataset d in DC:
        pd = Get_Price(d,AP)
        P = P + pd
    If P