
Data Science Design Pattern: Train & Predict

Spotfire is a great tool that lets you run asynchronous R code right next to your data and visualizations. This makes for what I like to call the Data Science Trifecta: data, visualizations, and computation. There are lots of applications out there that provide the Trifecta, and I prefer Spotfire’s relational data model, snappy visualizations, and embedded R engine. So let’s talk about reusing predictive models in this Trifecta. If you’re eager to try it out, you can grab the template off of

Train and Predict Design Pattern
Use Spotfire to first TRAIN and then PREDICT. You can see the model parameters updating with each new training set.

Usually, when you build a predictive model, the first step is data quality (removing outliers, getting rid of bad data, and so on), and the next step is training the model. Once you have a trained and validated model, you want to use it to predict the response variable for other, non-training data. Basically, you want to use what you just built. R has this pipeline built in by convention: most packages that provide prediction algorithms return a model object that you can pass to predict() along with new data. That model object can also be turned into raw bytes with serialize() and restored with unserialize() (that’s what they call it in the R world).
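To make the serialize and unserialize round trip concrete, here is a minimal sketch in plain R, outside of Spotfire. The built-in mtcars dataset and the names model_raw, restored, and new_cars are stand-ins I picked for illustration; they are not part of the Spotfire template.

```r
# Fit a small linear model on a built-in dataset (mtcars is only a stand-in).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# serialize() with connection = NULL returns the model as a raw vector.
model_raw <- serialize(fit, NULL)

# unserialize() rebuilds an equivalent model object from those bytes.
restored <- unserialize(model_raw)

# The restored model should predict the same values as the original.
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))
pred_original <- predict(fit, newdata = new_cars)
pred_restored <- predict(restored, newdata = new_cars)
```

The raw vector is the piece that can travel between data functions; the model never has to be refit on the receiving side.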

Here’s the Design Pattern: one data function returns a serialized model, and you send that model to other data functions to use for prediction. The model can live in the analysis as a Binary Document Property. You can even save it to a file or database if you want to transfer it to other Spotfire analyses.
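Saving the serialized model to a file could look something like the sketch below. It is plain R rather than anything Spotfire-specific, the temp-file path is a hypothetical location, and mtcars is again only a stand-in dataset.

```r
# Train and serialize a model (mtcars is only a stand-in dataset).
fit <- lm(mpg ~ wt + hp, data = mtcars)
model_raw <- serialize(fit, NULL)

# Write the raw bytes to disk. The temp file is a hypothetical location;
# a shared drive or a binary database column would play the same role.
model_path <- tempfile(fileext = ".bin")
writeBin(model_raw, model_path)

# Later, possibly in another analysis: read the bytes back and restore.
raw_from_disk <- readBin(model_path, what = "raw", n = file.size(model_path))
restored <- unserialize(raw_from_disk)
```

If only R ever needs to read the file, saveRDS() and readRDS() do the same job in one call each; writing the raw vector yourself just keeps the bytes identical to what a binary property would hold.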

Spotfire Data Functions (the wrappers around R computations) can take a variety of inputs and return data, columns, and other properties. So when you are building a predictive model in Spotfire, it’s good to make two data functions: train and apply.
First, you train a model. This data function needs a training set of data and any other options the prediction algorithm requires.

# Hard-coded inputs could be dynamically generated, but basically you make your model here
fit <- lm(
  `Net Hourly Energy Output (EP)` ~ `Temperature (T)` + `Exhaust Vacuum (V)` +
    `Ambient Pressure (AP)` + `Relative Humidity (RH)`,
  data = input1
)
# You could write out a bunch of other model info here
# Save the model out as a raw vector
model <- serialize(fit, NULL)

Second, you apply the model in another Data Function. Here, I just return the column values (a vector in R) that I can append to the input table. This allows me to predict using my trained model IN PLACE. Meaning, if I have other data tables with the same input columns, I can create a new column called “Prediction” that applies the model. The function will also recompute if I change the model that gets sent in.

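# inputModel is the serialized model produced by the train data function
model <- unserialize(inputModel)

info <- "Successfully updated the model."

if (nrow(input1) < 1) {
	info <- "Insufficient rows provided."
	fit <- rep(NA, nrow(input1))
} else {
	fit <- predict(model, newdata = input1)
}
# Return the predictions as a column
output <- fit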
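If you want to try the two data functions outside of Spotfire, a rough sketch is to wrap each one in an ordinary R function and feed it simulated data. Everything below is an assumption for illustration: the plant table is randomly generated, only two of the four predictors are used to keep it short, and train_model and apply_model are hypothetical names, not Spotfire APIs.

```r
set.seed(42)  # simulated stand-in for the table Spotfire would pass in
n <- 200
plant <- data.frame(
  `Temperature (T)` = runif(n, 5, 35),
  `Exhaust Vacuum (V)` = runif(n, 25, 80),
  check.names = FALSE  # keep the non-syntactic column names as they are
)
plant$`Net Hourly Energy Output (EP)` <-
  500 - 2 * plant$`Temperature (T)` - 0.3 * plant$`Exhaust Vacuum (V)` + rnorm(n)

# Hypothetical wrapper for the "train" data function: fit, then serialize.
train_model <- function(input1) {
  fit <- lm(
    `Net Hourly Energy Output (EP)` ~ `Temperature (T)` + `Exhaust Vacuum (V)`,
    data = input1
  )
  serialize(fit, NULL)
}

# Hypothetical wrapper for the "apply" data function: unserialize, then predict.
apply_model <- function(inputModel, input1) {
  if (nrow(input1) < 1) {
    return(rep(NA_real_, nrow(input1)))  # nothing to predict on
  }
  model <- unserialize(inputModel)
  unname(predict(model, newdata = input1))
}

model <- train_model(plant[1:150, ])            # train on the first 150 rows
output <- apply_model(model, plant[151:200, ])  # predict the remaining 50
empty_output <- apply_model(model, plant[0, ])  # the empty-table branch
```

The only thing passed from train to apply is the raw vector, which is the same hand-off the Binary Document Property does inside the analysis.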

The power of this simple framework is that now I can easily train on a set of data – for instance performance in 2000-2010 – and then predict on a different set of data – say performance in 2011-2016. Because we can rig this all up to marking and filtering, I can interactively change my training set and prediction sets.
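As a sketch of that split, here is the same idea in plain R with a simulated table. In Spotfire the two sets would come from marking or filtering; here a year column does the job. The perf table and the load and performance columns are made up for the example.

```r
set.seed(1)  # simulated data, purely illustrative
perf <- data.frame(year = rep(2000:2016, each = 12))
perf$load <- runif(nrow(perf), 10, 100)
perf$performance <- 50 + 0.8 * perf$load + rnorm(nrow(perf), sd = 2)

# Stand-ins for what marking or filtering would select in Spotfire.
training_set <- perf[perf$year <= 2010, ]
prediction_set <- perf[perf$year >= 2011, ]

# Train on 2000-2010 and keep only the serialized model.
model <- serialize(lm(performance ~ load, data = training_set), NULL)

# Predict on 2011-2016 and append the result as a new column.
prediction_set$Prediction <- predict(unserialize(model), newdata = prediction_set)
```

Changing the year cutoffs and rerunning mimics what happens when you change the marking: the model is refit and the Prediction column follows.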

I can also use this methodology to build and apply MULTIPLE competing models. For instance, I could make several trained models – one based on ALL the rows of data, one based on a filtering set, and another based on marking. I could then predict using each model: Predict (all), Predict (filter), Predict (mark). I could use these to understand differences between groups.
Building models is critical for prediction, but it can also help you understand patterns in the underlying data through the idiosyncrasies that show up in the predicted values. Seeing that Predict (all) returns higher values than Predict (mark) would tell me that I just marked a low-performing set of rows, or at least a set the model is more pessimistic about.
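Here is a small sketch of the competing-models idea in plain R. The data is simulated, and the marked column is a hypothetical stand-in for Spotfire marking; I deliberately shift the marked rows down so that they behave like a low-performing group.

```r
set.seed(7)  # simulated data, purely illustrative
n <- 300
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 20 + 3 * dat$x + rnorm(n)

# Hypothetical "marking": the first 100 rows are a low-performing group.
dat$marked <- seq_len(n) <= 100
dat$y[dat$marked] <- dat$y[dat$marked] - 5

# Two competing models: one trained on all rows, one on the marked rows only.
model_all <- serialize(lm(y ~ x, data = dat), NULL)
model_mark <- serialize(lm(y ~ x, data = dat[dat$marked, ]), NULL)

# Apply both models to the same table, side by side.
dat$`Predict (all)` <- predict(unserialize(model_all), newdata = dat)
dat$`Predict (mark)` <- predict(unserialize(model_mark), newdata = dat)

# A positive average gap means the marked rows pull predictions down.
gap <- mean(dat$`Predict (all)` - dat$`Predict (mark)`)
```

With the two prediction columns in the same table, the gap between them is easy to plot or summarize per group.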

Check out the Spotfire template that illustrates this best practice. You can grab it off of here.
