
Machine Learning of Machines with R

Dublin R User Group

Following my introductory presentation on machine learning to the Dublin R User Group, I had the pleasure of being invited back to give a second machine learning talk. The talk source and examples are in a GitHub repo. This talk was pitched at a more intermediate level than the first one. It used two datasets: computing cluster job usage statistics (similar to data from an HPC job scheduler) and machine-level application and host usage statistics (similar to data from host monitoring software such as Nagios).

PDF version of this talk

Overview

The techniques covered included:

  • Background on large scale clusters (e.g. HPC) and rationale for ML of machine/cluster operational metrics
  • Workflow for model building
  • Data transformation
  • Addressing feature selection
  • Model assessment and selection
  • Interpreting a confusion matrix
  • Interpreting a ROC plot
  • Approaches to handling prediction errors
  • Boosting and Bagging
  • Data set 1 - Job scheduling data
  • Data set 2 - Machine metrics data
  • Summary / recap
  • (Unused) A review of various ML techniques:
    • Decision trees
    • Random forests
    • k Nearest Neighbors
    • Support vector machines
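
As a small illustration of the confusion matrix item in the list above, here is a minimal sketch in base R. It is not taken from the talk materials: the "fail"/"ok" job labels and the ten example values are made up for illustration, and "fail" is assumed to be the class of interest.

```r
# Hypothetical actual and predicted outcomes for ten cluster jobs.
# These values are illustrative only and are not data from the talk.
actual <- factor(
  c("fail", "ok", "ok", "fail", "ok", "ok", "fail", "ok", "ok", "ok"),
  levels = c("fail", "ok")
)
predicted <- factor(
  c("fail", "ok", "fail", "fail", "ok", "ok", "ok", "ok", "ok", "ok"),
  levels = c("fail", "ok")
)

# Confusion matrix: rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Treat "fail" as the positive class (an assumption for this sketch).
tp <- cm["fail", "fail"]  # failed jobs correctly flagged
fp <- cm["fail", "ok"]    # healthy jobs wrongly flagged
fn <- cm["ok", "fail"]    # failed jobs that were missed
tn <- cm["ok", "ok"]      # healthy jobs correctly passed

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)  # of the jobs flagged, how many really failed
recall    <- tp / (tp + fn)  # of the real failures, how many were flagged

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

Which of these numbers matters most depends on the business context: if a missed failure is expensive, recall is probably the one to watch, while a high cost for false alarms argues for precision.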

Background

This talk focused on the two datasets and highlighted how important it is to consider the business context when developing models. It looked at using machine learning to improve the utilisation and scheduling of large scale clusters, which represent significant hardware resources. It recapped some earlier material and spent a little more time on feature selection and feature generation, as well as on applying domain expertise to the data. The two datasets served as worked examples of how to use R to select, assess and create the models.

Talk slides
