This article shows you how to create a machine learning (ML) Job that uses real but anonymized data to build a predictive model with Talend Machine Learning components. While such models can cover a wide range of use cases across several industries, the basic principles are the same.
For more information, see Machine Learning in the Talend Help Center.
The source data used in this article is from a university hospital study on the outcome of patients who suffered a subarachnoid hemorrhage. The patients were admitted to the hospital, where their condition was monitored for a couple of weeks during treatment. Blood samples were taken every few days and tested for specific blood markers. Each patient was classified with a likely outcome, a survivability score known as a Hunt and Hess score. That data was collected alongside demographic data, such as sex and age, and clinical features, such as the predicted outcome for that patient.
The goal is to build a model that predicts the clinical outcome for patients based on various parameters, and then to compare those predictions against the actual outcomes to test and verify the Talend model. The ML Job uses the following specifications:
Spark MLlib is a fast, powerful, distributed machine learning (ML) framework on top of Spark Core. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, and these simplify large scale machine learning pipelines.
MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying primitives such as basic statistics, linear models (SVMs, logistic regression, linear regression), feature extraction and transformation, and optimization.
Many Talend ML components allow you to call and utilize these MLlib algorithms to process and build ML Jobs.
Talend ML components are grouped into four categories: Classification, Clustering, Recommendation, and Regression. This article focuses on the tPredict and tRandomForestModel components in the Classification category.
The tPredict component uses a given classification, clustering, or relationship model to analyze datasets incoming from its preceding component.
This example uses a random forest model. Random forests work by constructing a multitude of decision trees at training time and then outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
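To make the voting concrete, here is a minimal pure-Python sketch of the classification case. The trees and thresholds are invented for illustration; a real random forest trains each tree on a bootstrap sample of the data with random feature subsets.

```python
from collections import Counter

# Hypothetical hand-written "trees" purely to illustrate the voting step.
def tree_1(row): return "good" if row["age"] < 60 else "poor"
def tree_2(row): return "good" if row["marker"] < 1.5 else "poor"
def tree_3(row): return "good" if row["age"] < 70 and row["marker"] < 2.0 else "poor"

def forest_predict(trees, row):
    """Classification: the forest outputs the mode (majority vote) of the trees."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

patient = {"age": 55, "marker": 1.8}
print(forest_predict([tree_1, tree_2, tree_3], patient))  # two of three trees vote "good"
```

For regression, the forest would instead average the numeric outputs of the individual trees.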
A good analogy to consider is that you're walking blindly through a forest with your arms outstretched. Each time you hit a tree, you are deflected in a different direction.
The following diagram illustrates the architectural process involved.
The tRandomForestModel component is parameterized by the number of trees in the forest. In the diagram above, this is shown as Tree 1 through Tree B.
To replicate these Jobs or to build your own predictive model, based on your own use cases, download a free trial of Talend Big Data & Machine Learning Sandbox.
The sandbox has everything you need to build and run ML Jobs, without having to install all the components yourself. It also comes complete with ready-to-use sample Jobs.
The patient data, in spreadsheet format, is used to build a predictive model to predict the Hunt and Hess score based on specific data in the file.
To build your model, use the following process:
Step 2 is the key: Whatever your use case is, you need to understand these relationships and how they are linked. If there are dependencies between certain variables, then you can model these using certain MLlib functions.
This use case established that there is a relationship between a patient's survivability score (the Hunt and Hess score), their age (older patients are less likely to survive), and the results of specific markers in blood tests (excellent indicators of clinical outcome). This data is used to build a predictive model.
You can use certain Talend ML components to help identify these relationships in the data. For more information on Talend ML components, see Machine Learning in the Talend Help Center.
This section shows you how to create the predictive model by building Standard Jobs and Big Data Batch Jobs:
Set up the environment by building a Standard Job that takes in the raw data, then filters the data to select only valid data. For example, you may only want data from patients in a specific age range.
The Job has three sets of data created using the following components:
Note: If you only want a sample of the data for demo purposes, you can use a tSampleRow component.
This Job uses the following components:
This section details the components used to build a Big Data Job to train the model. The Job takes the training data produced in the first Job, passes it through a tModelEncoder component, and uses the result to train a random forest model.
This Job uses the following components:
tFileInput
This component takes the training data from the Hadoop file system. The only configuration is to specify the location of that file, that is, the same place where the setup Job wrote the training data.
tModelEncoder
This component performs operations to transform data into the format expected by the model training components. These operations consist of processing algorithms that transform given columns of the data and send the result to the model training component that follows, which eventually trains and creates a predictive model.
By default, this component contains a set of four different transformations; this example uses the RFormula transformation.
RFormula implements the transformations required for fitting data against an R model formula. Within the formula, a small set of R operators describes the required transformation.
This model uses the relationship between the Hunt and Hess score and the patient's age plus the results of various markers in blood tests. This example uses only two of these blood tests. Thus, the required transformation is defined as:
Hunt_Hess ~ Cyt_B + D_Loop + Age
Where Cyt_B and D_Loop are two of those blood tests.
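Conceptually, this transformation turns each patient record into a label (the Hunt and Hess score) plus a numeric feature vector built from the listed columns. The following plain-Python sketch illustrates the idea; the sample values are invented, and Spark's actual RFormula additionally handles categorical encoding and vector types.

```python
# Sketch of an RFormula-style transformation for
# "Hunt_Hess ~ Cyt_B + D_Loop + Age": the left-hand side becomes the label,
# the right-hand side columns become the feature vector.
def apply_formula(row, label_col, feature_cols):
    """Return (label, features) in the shape a model trainer expects."""
    label = row[label_col]
    features = [float(row[col]) for col in feature_cols]
    return label, features

# Invented sample record; column names follow the article.
row = {"Hunt_Hess": 2, "Cyt_B": 0.41, "D_Loop": 1.27, "Age": 63}
label, features = apply_formula(row, "Hunt_Hess", ["Cyt_B", "D_Loop", "Age"])
print(label, features)  # 2 [0.41, 1.27, 63.0]
```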
tRandomForestModel
This component analyzes feature variables, usually pre-processed by the tModelEncoder component, to generate a classifier model that is used by the tPredict component to classify given elements. It applies the Random Forest algorithm to the incoming dataset, generates a classification model from this analysis, and writes the model either in memory or to a given file system.
Note: It is necessary to configure the following settings:
Random forest hyper-parameters:
This is done for you in the configuration settings, as shown below.
After the Job is built and configured, you can run it. The Job runs through several stages, which depend on the configuration (defined by the depth and the number of trees in the model). Running the Job produces the following output:
You are now ready to use the model to predict results.
This section details the components used to build a Big Data Job to predict results.
This Job uses the following components:
tFileInputDelimited
This component takes the training data produced in the Standard Job and uses it as input for the predictions.
tPredict
This component takes the input data and applies the model built to make predictions about a patient's survivability score (Hunt and Hess score) based on the variables previously defined; the patient's age and the results of certain blood tests.
Note: It is necessary to configure the following settings:
This is done for you in the configuration settings, as shown below.
tFileOutputDelimited
This component outputs the results of the predictive model to a file on the Hadoop filesystem.
After the Job is built, run it. The Job runs through the stages again before finishing.
This section details the components used to build a Job to test the results of the model. You can compare the predictions the model made against the test data you already have, that is, the predicted Hunt and Hess score against the actual Hunt and Hess score.
This Job contains the following components:
tFileInputDelimited
This component takes the test data as your input.
tPredict
This component runs the test data through the model and sends the output to the next component.
tAggregateRow
This component takes the predicted Hunt and Hess score and compares it to the actual score. The component configuration is shown below:
tLogRow
This component outputs the results.
This section examines the model to see how accurately it performs. For the comparison, construct what is known as a Confusion Matrix or an Error Matrix, as shown below:
These four types are used as a graphical and straightforward way to display your results:
To determine the accuracy of the model, add the True Positives and the True Negatives, then divide that sum by the total number of data points.
In this case, the test data had 73 usable data points. The resulting scores are:
Overall, how accurate is your model? If you do the calculations, you'll find that you get the following result:
(TP+TN)/Total = (60+3)/73 = 0.863, or approximately 86% accurate
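As a concrete sketch, the following plain Python tallies the four confusion-matrix cells from paired actual/predicted labels and checks the accuracy arithmetic. The short label lists are invented examples; only the 60 true positives, 3 true negatives, and 73 usable points come from this study.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Tally TP, FP, TN, FN from paired actual/predicted labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

def accuracy(tp, tn, total):
    """Accuracy is (TP + TN) divided by the total number of data points."""
    return (tp + tn) / total

# Invented mini example to show the tallying:
actual    = ["good", "good", "poor", "poor", "good"]
predicted = ["good", "poor", "poor", "good", "good"]
print(confusion_counts(actual, predicted, positive="good"))  # TP=2, FN=1, FP=1, TN=1

# The article's counts: 60 true positives, 3 true negatives, 73 usable points.
print(round(accuracy(tp=60, tn=3, total=73), 3))  # 0.863
```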
The data had a total of 90 data points. However, only 73 of them could be used because not all of the data is complete. This is not a great deal of data, but building a model with an accuracy of 86% from it is quite good.
Having more data could improve the model and increase accuracy.
This article showed you how to use real-life data to build a predictive model with Talend Machine Learning components. You can use it as an example when constructing a machine learning Job for your own use cases with your own data.