Good knowledge of statistics (i.e. probability, inferential statistics, regression models).
Previous exposure to a programming language, such as R or Python, is useful.
The course aims to provide knowledge of cutting-edge statistical tools for modeling and understanding complex and big data. These methods are designed to detect patterns in data automatically (i.e. to “learn” from data); the analyst can then use the uncovered patterns to make accurate predictions and decisions under uncertainty.
By the end of the course, the student will be able to:
a) choose and apply the appropriate statistical tool, in the class of statistical learning methods, for the analysis of different types of complex data coming from real-world problems;
b) use the open-source statistical software R (freely available for download at http://www.r-project.org) for statistical analysis, modeling and prediction;
c) interpret the results from a decision-making perspective.
- Introduction to machine learning: supervised versus unsupervised learning, the bias-variance trade-off.
- Regression: review of simple, multiple and logistic regression, non-linear regression models, ridge and lasso regression.
- Resampling methods: cross-validation and bootstrap.
- Classification: regression trees, bagging, random forests, boosting.
- Unsupervised learning: principal components analysis and clustering.
- Elements of spatial data analysis: areal data, spatial autocorrelation, models for areal data, disease mapping.
The official course textbook is: James G., Witten D., Hastie T., Tibshirani R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.
Documentation for the R software is freely available at: https://www.r-project.org/other-docs.html
The course consists of classroom lectures and R lab sessions. The calendar of lectures and labs will be published on the e-learning platform before the course begins; labs will take place within the hours scheduled for the course (roughly one-third of classroom time).
The exam consists of:
- a written test with open-ended and closed-ended questions (covering theoretical topics or short applications of the methods studied);
- exercises to be solved using R (to assess the student's ability to analyze data and interpret software output).
The theoretical and practical parts are each worth approximately 50% of the total score.
Attending class lectures and R labs is strongly recommended.