Insights

Sat 21 Mar 2015

tags: r presentations machine learning rstats

Following my introductory presentation on machine learning to the Dublin R User Group, I had the pleasure of being invited back to give a second machine learning talk. The talk source and examples are in a GitHub repo. This talk was pitched at a more intermediate level than the first. It used two datasets: computing cluster job usage statistics (similar to data from an HPC job scheduler) and machine-level application and host usage statistics (similar to data from host monitoring software such as Nagios).

PDF version of this talk

Overview

The topics covered included:

Background on large scale clusters (e.g. HPC) and rationale for ML of machine/cluster operational metrics
Workflow for model building
Data transformation
Addressing feature selection
Model assessment and selection
Interpreting a confusion matrix
Interpreting a ROC plot
Approaches to handling prediction errors
Boosting and Bagging
Data set 1 - Job scheduling data
Data set 2 - Machine metrics data
Summary / recap
(Unused) A review of various ML techniques:
- Decision trees
- Random forests
- k Nearest Neighbors
- Support vector machines

Background

This talk focused on the two datasets and highlighted how important the business context is when developing models. It looked at using machine learning to improve the utilisation and scheduling of large scale clusters (significant hardware resources). It recapped some earlier material and focused a little more on feature selection and feature generation, as well as applying domain expertise to the data. The two datasets provide examples of how to use R to select, assess and create the models.
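As a rough illustration of the model assessment step, here is a minimal base R sketch that fits a classifier and builds a confusion matrix. It is not taken from the talk repo. The data is synthetic, the column names (`requested_cores`, `requested_hours`, `failed`) are hypothetical, and logistic regression via `glm` is used only because it needs no extra packages; the talk itself discusses other model families.

```r
# Synthetic stand-in for job scheduler data; all column names are hypothetical.
set.seed(42)
n <- 500
jobs <- data.frame(
  requested_cores = sample(c(1, 2, 4, 8, 16), n, replace = TRUE),
  requested_hours = runif(n, min = 0.5, max = 48)
)

# Invented rule for the simulation: larger, longer jobs fail more often.
p_fail <- plogis(-3 + 0.1 * jobs$requested_cores + 0.05 * jobs$requested_hours)
jobs$failed <- rbinom(n, size = 1, prob = p_fail)

# Hold out 30% of the rows for assessment.
test_idx <- sample(n, size = 0.3 * n)
train <- jobs[-test_idx, ]
test <- jobs[test_idx, ]

# Logistic regression on the training rows.
model <- glm(failed ~ requested_cores + requested_hours,
             data = train, family = binomial)

# Predicted failure probabilities, cut at 0.5 to get class labels.
prob <- predict(model, newdata = test, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
actual <- factor(test$failed, levels = c(0, 1))

# Rows are predictions, columns are the observed outcomes.
confusion <- table(predicted = pred, actual = actual)
print(confusion)

# Share of held-out jobs classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The off-diagonal cells are the two kinds of prediction error (false positives and false negatives). Which one matters more depends on the business context, for example whether a wrongly flagged job or a missed failure is more costly on the cluster.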

Machine Learning of Machines with R

Dublin R User Group

Talk slides