The Centre of Excellence in Economics and Data Science (CEEDS) is part of the Department of Economics, Management and Quantitative Methods (DEMM). CEEDS is a multi-purpose technology infrastructure that supports empirical analysis across the Department's research lines.
CEEDS considers it essential to have adequate infrastructure for efficiently combining the data currently in use, mainly drawn from sample surveys, with new data sources that are now growing exponentially. Until now, DEMM scientists have contributed to the economic literature mainly using data samples that are rectangular in shape (N observations on K variables, with K << N), with a relatively simple observational structure and limited dependency among variables. These data are analysed with traditional econometric techniques, focusing on one or more coefficients of interest, which often represent the causal effect of a particular policy. However, empirical research in economics is experiencing an explosion in the availability of new data, which present new features and pose unprecedented demands. This data revolution, combined with the development of innovative techniques for analysing such data, is driving further methodological developments in economics; accordingly, CEEDS intends to invest a significant part of the funds recently awarded by the Italian government through the competitive call “Dipartimenti di Eccellenza” (http://www.miur.gov.it/dipartimenti-di-eccellenza) to respond to these emerging challenges.
New data sources, generally referred to as Big Data, are heterogeneous and rarely structured, and they share some key features: high generation frequency, high volume per unit of time, diversity of formats and types, and largely unexplored information content. These features make them a valuable complement to traditional data. Interestingly, many of these new data are available today at a very granular level of detail. In economics, notable applications of Big Data include the study of social networks (e.g., data from social media and interactions with websites), the localisation of economic agents (e.g., fitness devices) and the quantification of migration flows (e.g., mobile geolocation), individual productivity assessment (e.g., at educational institutions through Google Scholar), individual job-search activities (e.g., analysis of LinkedIn profiles), new hedging strategies for financial and insurance risk (e.g., risk profiling in online transactions), and the analysis of national health system costs (e.g., demand forecasting by territory). Finally, since these data are strongly heterogeneous in size, frequency and structure, they must be organised and reduced in dimension. This enables empirical analysis with new approaches that should replace traditional statistical techniques where those are no longer appropriate.
As an example, consider the linking of administrative data regularly produced by public institutions such as tax agencies, pension funds, schools and universities. Quite often, these data sets cover the entire population rather than limited random samples. Compared to sample data, these sources present fewer problems of missing data or imputation errors, while usually offering a long longitudinal dimension with limited sample-selection and attrition problems. These data are therefore a valuable resource for economists working in labour economics, international economics, public finance, health, education, innovation, finance, and related fields. Consider, for example, the impact of the work of Piketty and Saez, who used tax data to analyse income concentration in the top 1% of the population; of Taubman et al. on the effect of the expansion of state health insurance, linking data on hospital admissions with insurance data and surveys on individuals' physical health and financial independence; or of Akerman et al. on the impact of the progressive rollout of broadband on business productivity and wages, using Norwegian tax data. So-called data lakes are another important representative of the new data types. At present, such data are collected continuously in every sector of the economy, from banking to pharmaceuticals to supermarkets, and hold information about customers, suppliers, prices, costs, and more. Although they are collected for operational rather than statistical purposes, they can help analyse the operation of financial markets as well as the behaviour of consumers, workers and businesses. MIT's Billion Prices Project, for example, regularly collects data on the prices and features of hundreds of thousands of products sold over the internet, producing price indexes that track official inflation indexes closely while offering greater timeliness and, particularly in some developing countries, greater reliability. Varian et al. use Google Search data to provide real-time estimates (nowcasting) of unemployment, consumer confidence, and retail sales.
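To make the nowcasting idea concrete, the following is a minimal, purely illustrative sketch (not the cited authors' code): a simple autoregressive model of retail sales is augmented with a search-intensity regressor and compared with the baseline out of sample. Synthetic data stand in for an official statistics series and a Google Trends index.

    # Illustrative nowcasting sketch: augment an AR(1) model of retail sales
    # with a search-intensity index, in the spirit of Google Trends nowcasts.
    # All data are synthetic.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    T = 120  # months of synthetic data

    search = rng.normal(size=T)  # stand-in for a Google Trends index
    sales = np.zeros(T)
    for t in range(1, T):
        # sales depend on their own lag and on contemporaneous search interest
        sales[t] = 0.6 * sales[t - 1] + 0.8 * search[t] + rng.normal(scale=0.5)

    # Baseline AR(1) versus search-augmented nowcast, compared out of sample.
    y = sales[1:]
    X_ar = sm.add_constant(sales[:-1])                                  # lagged sales only
    X_aug = sm.add_constant(np.column_stack([sales[:-1], search[1:]]))  # plus search index

    split = 90  # first 90 months for estimation, the rest for evaluation
    for name, X in [("AR(1)", X_ar), ("AR(1) + search index", X_aug)]:
        fit = sm.OLS(y[:split], X[:split]).fit()
        rmse = np.sqrt(np.mean((y[split:] - fit.predict(X[split:])) ** 2))
        print(f"{name}: out-of-sample RMSE = {rmse:.3f}")

In this synthetic setting the search-augmented model should track the target series more closely, which is the sense in which high-frequency search data can anticipate official statistics.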
In most of the cases mentioned, using these data requires the development of innovative methods for reducing their dimensionality, which rely on data-driven model selection and focus on predictive performance through machine learning, deep learning and artificial intelligence. These methodologies, developed mainly in computer science, mathematics and statistics, are increasingly used in economics. On the one hand, economic theory can inform the latest data science techniques and help reduce data dimensions: integrating these new data sources and methods with traditional empirical methodology has made it possible, for example, to link the theory of incentives in auctions to data-driven predictive models, with a view to determining commercial policies and setting prices for online companies. On the other hand, combining traditional econometric methods with the granularity, frequency, and size of these data creates new opportunities for identifying causal relationships through experimental research designs: for example, Einav et al., using eBay data on online shopping and the heterogeneity of consumption taxes, were able to estimate the elasticity of consumption with respect to taxation.
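As an illustration of how data-driven selection can be combined with a conventional causal regression, the sketch below applies a post-double-selection strategy in the spirit of Belloni, Chernozhukov and Hansen: a Lasso selects controls for both the outcome and the treatment, and ordinary least squares then estimates the coefficient of interest. The data are synthetic and the example is not drawn from the studies cited above.

    # Illustrative sketch of post-double-selection: machine-learning variable
    # selection followed by a conventional OLS regression for one coefficient
    # of interest. All data are synthetic.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(1)
    n, p = 500, 200                          # many candidate controls
    X = rng.normal(size=(n, p))              # high-dimensional controls
    d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)      # "treatment" (e.g. a tax rate)
    y = 1.5 * d + X[:, 0] - X[:, 2] + rng.normal(size=n)  # outcome, true effect = 1.5

    # Step 1: Lasso of the outcome on the controls, and of the treatment on the controls.
    sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)
    sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, d).coef_)
    selected = np.union1d(sel_y, sel_d)      # union of controls picked in either step

    # Step 2: OLS of the outcome on the treatment plus the selected controls.
    Z = sm.add_constant(np.column_stack([d, X[:, selected]]))
    fit = sm.OLS(y, Z).fit()
    print(f"estimated effect of d: {fit.params[1]:.3f} (true value 1.5)")

The design choice the sketch highlights is the division of labour described in the text: the machine-learning step handles dimensionality reduction, while the final inferential step remains a standard econometric regression whose coefficient retains a causal interpretation under the usual assumptions.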
Moreover, CEEDS leverages the remote-access experience gained with the Luxembourg Income Study, headed by a member of the Department, as well as that acquired in the ISTAT Big Data Commission, which includes a member of DEMM. CEEDS seeks first and foremost to assist researchers in building and maintaining an updated archive (EconDataLake) of the data sets commonly used by members of DEMM (e.g., BHPS, EULFS, RFL, BvDOrbis, US-CPS, US-Census, PATSTAT, Bloomberg). Its secondary goal is to expand the technical skills needed to build and develop new data sets from public and private sources, whether of an administrative nature or arising from collaborations with private institutions. For example, a collaboration between INPS and the Lombardy Region has led to a project on the relationship between employability and job stability, on the one side, and individual demand for health services and use of psychotropic drugs, on the other; another project, using Google Scholar data and disambiguation methods based on machine-learning algorithms, has been able to measure scientific productivity even in non-bibliometric disciplines. CEEDS envisages investment in computing capacity and in fixed and human capital, with technical staff offering advanced skills in Big Data management who will coordinate a working group of young technicians and researchers in economics and data science.
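Purely as an illustration of the disambiguation task mentioned above (and not the Centre's actual pipeline), the sketch below groups hypothetical publication records by author, using plain string similarity as a stand-in for a trained machine-learning classifier.

    # Minimal, illustrative author-name disambiguation sketch: cluster
    # publication records whose author strings and affiliations are
    # sufficiently similar. Records and threshold are hypothetical.
    from difflib import SequenceMatcher

    records = [
        {"author": "Rossi, M.", "affil": "Univ. of Milan"},
        {"author": "Rossi, Maria", "affil": "University of Milan"},
        {"author": "Rossi, M.", "affil": "Politecnico di Torino"},
    ]

    def similar(a, b, threshold=0.75):
        """True if the two records plausibly refer to the same author."""
        name_sim = SequenceMatcher(None, a["author"].lower(), b["author"].lower()).ratio()
        affil_sim = SequenceMatcher(None, a["affil"].lower(), b["affil"].lower()).ratio()
        return name_sim >= threshold and affil_sim >= threshold

    # Greedy single-pass clustering: assign each record to the first compatible group.
    clusters = []  # each cluster is a list of records believed to share an author
    for rec in records:
        for group in clusters:
            if all(similar(rec, other) for other in group):
                group.append(rec)
                break
        else:
            clusters.append([rec])

    for i, group in enumerate(clusters):
        print(f"author cluster {i}: {[r['author'] for r in group]}")

In production, the hand-set similarity rule would be replaced by a classifier trained on labelled pairs of records, but the overall structure, pairwise comparison followed by clustering, is the same.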