When: 06 July – 13 July 2020

Course fees: FREE. The course is fully funded through the MIUR Department of Excellence grant.

Credits: 3 ECTS

Contact: MLsummerschool2020@unimi.it

Download the flyer with additional information here

Course Outline

Recent years have witnessed an unprecedented availability of information on social, economic, and health-related phenomena. Researchers, practitioners, and policymakers nowadays have access to huge datasets (so-called “Big Data”) on people, companies and institutions, web and mobile devices, satellites, etc., at ever-increasing speed and detail.

Machine learning is a relatively new approach to data analytics that sits at the intersection of statistics, computer science, and artificial intelligence. Its primary objective is to turn information into knowledge and value by “letting the data speak”. To this end, machine learning limits prior assumptions on data structure and relies on a model-free philosophy, favoring algorithm development, computational procedures, and graphical inspection over tight assumptions, algebraic derivation, and analytical solutions. Computationally infeasible only a few years ago, machine learning is a product of the computer era: of today’s machines’ computing power and ability to learn, of hardware development, and of continuous software upgrades.

This course is a primer on machine learning techniques using Stata. Stata today offers various packages for machine learning that are, however, little known to many Stata users. This course fills that gap by making participants familiar with (and knowledgeable of) Stata's potential to draw knowledge and value from raw, large, and possibly noisy data. The teaching approach will rely on graphical language and intuition more than on algebra. The training will use instructional as well as real-world examples and will evenly balance theory and practical sessions.

After the course, participants are expected to have an improved understanding of machine learning's potential, becoming able to master research tasks including, among others: (i) factor-importance detection, (ii) signal-from-noise extraction, (iii) correct model specification, and (iv) model-free classification, both from a data-mining and a causal perspective.

PROGRAM

Day 1: July 6th – Session 1: Fundamentals of Machine Learning

9:30 – 11:30: Lecture

Machine Learning: definition, rationale, usefulness

– Supervised vs. unsupervised learning

– Regression vs. classification problems

– Inference vs. prediction

– Sampling vs. specification error

Coping with the fundamental non-identifiability of E(y|x)

– Parametric vs. non-parametric models

– The trade-off between prediction accuracy and model interpretability

Goodness-of-fit measures

– Measuring the quality of fit: in-sample vs. out-of-sample prediction power

– The bias-variance trade-off and Mean Square Error (MSE) minimization (see the decomposition sketched after this list)

– Training vs. test mean square error

– The information criteria approach
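
As a pointer for the bias-variance bullet above, the standard decomposition (following James et al., 2013, on the pre-course reading list) splits the expected test MSE at a new point x_0 into variance, squared bias, and irreducible error:

    E\big[(y_0 - \hat{f}(x_0))^2\big] = \operatorname{Var}\big(\hat{f}(x_0)\big) + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \operatorname{Var}(\varepsilon)

More flexible models shrink the bias term while inflating the variance term; minimizing the test MSE means balancing the two.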

Machine Learning and Artificial Intelligence

The Stata/Python integration: an overview
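
As a taste of the Stata/Python overview above, a minimal sketch of Stata 16's python: environment (the auto toy dataset and the sfi module shipped with Stata are assumed):

    sysuse auto, clear
    python:
    # pull a Stata variable into Python through the sfi bridge
    from sfi import Data
    import numpy as np
    mpg = np.array(Data.get("mpg"))
    print("mean mpg:", mpg.mean())
    end

Results computed on the Python side can be passed back to Stata through the same sfi module.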

11:30 – 12:00: Break with questions preparation and collection

12:00 – 12:30: Replies to questions

Day 1: July 6th – Session 2: Resampling and validation methods

14:30 – 15:30: Lecture

Estimating training and test error

Validation

– The validation set approach

– Training and test mean square error

Cross-Validation

– K-fold cross-validation

– Leave-one-out cross-validation

Bootstrap

– The bootstrap algorithm

– Bootstrap vs. cross-validation for validation purposes
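
As a preview of the lab sessions, a minimal sketch of the validation-set approach and a one-line bootstrap in Stata 16 (the auto dataset and the official splitsample and bootstrap commands are assumed; regressors are illustrative):

    sysuse auto, clear
    set seed 12345
    splitsample, generate(sample) nsplit(2)    // 1 = training half, 2 = test half
    regress mpg weight length if sample == 1   // fit on the training data only
    predict yhat, xb
    generate sqerr = (mpg - yhat)^2
    summarize sqerr if sample == 2             // out-of-sample (test) MSE
    * the bootstrap, by contrast, resamples the data with replacement
    bootstrap _b, reps(200): regress mpg weight length

K-fold cross-validation repeats the same train/test logic over K complementary folds and averages the resulting test errors.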

15:30 – 16:00: Break with questions preparation and collection

16:00 – 16:30: Replies to questions

Day 2: July 7th – Session 3: Model selection and regularization

9:30 – 11:30: Lecture

Model selection as a correct specification procedure

The information criteria approach

Subset Selection

– Best subset selection

– Backward stepwise selection

– Forward stepwise selection

Shrinkage Methods

– Lasso, Ridge, and Elastic-net regression

– Adaptive Lasso

Information criteria and cross-validation for Lasso
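
As a preview of the lab, a minimal lasso sketch using Stata 16's official lasso suite (the auto dataset is assumed and the regressors are illustrative):

    sysuse auto, clear
    lasso linear mpg weight length displacement turn gear_ratio, selection(cv) rseed(12345)
    lassoknots    // the lambda grid with cross-validated prediction error
    lassocoef     // covariates selected at the CV-optimal lambda
    * the adaptive lasso and the elastic net follow the same pattern:
    * lasso linear mpg weight length displacement, selection(adaptive)
    * elasticnet linear mpg weight length displacement, rseed(12345)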

11:30 – 12:00: Coffee break

12:00 – 13:00: Computer lab

13:00 – 15:00: Lunch with questions preparation and collection

15:00 – 15:30: Replies to questions

Day 3: July 9th – Session 4: Discriminant analysis and nearest-neighbor classification

9:30 – 11:30: Lecture

The classification setting

Bayes optimal classifier and decision boundary

Misclassification error rate

Discriminant analysis

– Linear and quadratic discriminant analysis

– Naive Bayes classifier

The K-nearest neighbors classifier
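
A minimal sketch of Stata's discrim suite for this session (the group label class and the predictors x1-x3 are hypothetical placeholders for any dataset at hand):

    * linear discriminant analysis on a categorical label
    discrim lda x1 x2 x3, group(class)
    estat classtable    // confusion matrix with misclassification rates
    * k-nearest-neighbor classification with k = 5
    discrim knn x1 x2 x3, group(class) k(5)

The quadratic variant covered in the lecture is available as discrim qda.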

11:30 – 12:00: Coffee break

12:00 – 13:00: Computer lab

13:00 – 15:00: Lunch with questions preparation and collection

15:00 – 15:30: Replies to questions

Day 4: July 10th – Session 5: Tree-based methods

9:30 – 11:30: Lecture

Regression and classification trees

– Growing a tree via recursive binary splitting

– Optimal tree pruning via cross-validation

Tree-based ensemble methods

– Bagging, Random Forests, and Boosting
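
A minimal random-forest sketch using the community-contributed rforest package (assumed installed via ssc install rforest; the auto dataset is illustrative and option names should be checked against help rforest):

    sysuse auto, clear
    rforest foreign price mpg weight length, type(class) iterations(200) seed(12345)
    display e(OOB_Error)    // out-of-bag estimate of the classification error
    predict class_hat       // predicted class for each observation

Bagging corresponds to a forest allowed to consider all predictors at every split, while boosting instead grows trees sequentially on the residuals of the current ensemble.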

11:30 – 12:00: Coffee break

12:00 – 13:00: Computer lab

13:00 – 15:00: Lunch with questions preparation and collection

15:00 – 15:30: Replies to questions

Day 5: July 13th – Session 6: Neural networks

9:30 – 11:30: Lecture

The neural network model

– Neurons, hidden layers, and multiple outcomes

Training a neural network

– Back-propagation via gradient descent

– Fitting with high-dimensional data

– Practical remarks on fitting

Cross-validating neural network hyperparameters
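
A minimal neural-network sketch through Stata 16's Python integration, assuming scikit-learn is installed in the linked Python distribution (dataset, predictors, and network size are illustrative):

    sysuse auto, clear
    python:
    import numpy as np
    from sfi import Data
    from sklearn.neural_network import MLPClassifier
    # pull predictors and the binary label from the dataset in memory
    X = np.array(Data.get(["price", "mpg", "weight", "length"]))
    y = np.array(Data.get("foreign"))
    # one hidden layer of 10 neurons, fit by gradient-descent back-propagation
    net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    net.fit(X, y)
    print("in-sample accuracy:", net.score(X, y))
    end

Hyperparameters such as the number of hidden neurons can then be tuned by cross-validation, for instance with scikit-learn's GridSearchCV.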

11:30 – 12:00: Coffee break

12:00 – 13:00: Computer lab

13:00 – 15:00: Lunch with questions preparation and collection

15:00 – 15:30: Replies to questions

Prerequisites

Knowledge of basic statistics and econometrics including:

  • the notion of conditional expectation and its properties
  • point and interval estimation
  • the regression model and its properties
  • probit, logit, and multinomial regression

A working knowledge of Stata is also required.

Software

The course will mainly use Stata (versions 14, 15, and 16 are suitable). However, attendees also need to have R, RStudio, and Python 3.7 installed on their laptops. The Anaconda distribution of Python is appropriate.

Pre-course Reading List

  • Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013), An Introduction to Statistical Learning with Applications in R, Springer, New York.

Post-course Reading List

  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition, Springer, New York.
  • Cerulli, G. (2020), “A Super-Learning Machine for predicting economic outcomes”, MPRA Paper 99111, University Library of Munich, Germany.

Instructor
Giovanni Cerulli (IRCrES-CNR, Italy)

Scientific Committee
Tommaso Frattini (Chair)
Emanuele Bacchiocchi
Massimiliano Bratti
Fabrizio Iacone
Giancarlo Manzi
Francesco Rentocchini
Silvia Salini
Andreea Piriu

Organizing Committee
Anna Basoni (CEEDS)
Stefania Scuderi (DEMM)

Tutor
Andreea Piriu
Veronica Rattini (DEMM)
Giuseppe Gerardi (CEEDS)