Guided automation for machine learning

Implementing a web-based blueprint for semi-automated machine learning, using the open source Knime Analytics Platform

Guided automation for machine learning
Thinkstock

We propose here a blueprint solution for the automation of the machine learning lifecycle.

The price to pay for automated machine learning (aka AutoML) is the loss of control to a black box kind of model. While such a price might be acceptable for circumscribed data science problems on well-defined domains, it might prove a limitation for more complex problems on a wider variety of domains. In these cases, a certain amount of interaction with the end users is actually desirable. This softer approach to machine learning automation—the approach we take at Knime—is obtained via guided automation, a special instance of guided analytics.

As easy as the final application might look to the end user, the system running in the background can be quite complex and therefore not easy to create completely from scratch. To help you with this process, we created a blueprint of an interactive application for the automatic creation and training of machine learning classification models.

The blueprint was developed with Knime Analytics Platform, and it is available on our public repository.

We propose here a blueprint solution for the automation of the machine learning lifecycle.

The price to pay for automated machine learning (aka AutoML) is the loss of control to a black box kind of model. While such a price might be acceptable for circumscribed data science problems on well-defined domains, it might prove a limitation for more complex problems on a wider variety of domains. In these cases, a certain amount of interaction with the end users is actually desirable. This softer approach to machine learning automation—the approach we take at Knime—is obtained via guided automation, a special instance of guided analytics.

As easy as the final application might look to the end user, the system running in the background can be quite complex and therefore not easy to create completely from scratch. To help you with this process, we created a blueprint of an interactive application for the automatic creation and training of machine learning classification models.

The blueprint was developed with Knime Analytics Platform, and it is available on our public repository.

This article is a follow-up to our introductory article on the topic, “How to automate machine learning.” In this second post, we describe in more detail the techniques and algorithms happening behind the scenes during the execution of the web browser application.

knime guided automation for ml 2 figure 1 Knime

Figure 1: The main process behind the blueprint for guided automation: data upload, application settings, automated model training and optimization, dashboard for performance comparison, and model download.

Guided automation from a web browser

Let’s see what the guided automation blueprint looks like from a web browser via Knime Server.

At the start, we are presented with a sequence of interaction points:

  • Upload the data
  • Select the target variable
  • Remove unnecessary features
  • Select one or more machine learning algorithms to train
  • Optionally customize parameter optimization and feature engineering settings
  • Select the execution platform.

These steps are all summarized in Figure 2 below.

knime guided automation for ml 2 figure 2 Knime

Figure 2: This diagram follows the execution of the blueprint for guided automation on a web browser: (1) upload the dataset file, (2) select the target variable, (3) filter out undesired columns, (4) select the algorithms to train, (5) define the execution environment. At the top is the flowchart that will serve as the navigation bar throughout the process.

After crunching the numbers—i.e., data pre-processing, feature creation and transformation, parameter optimization and feature selection, and final model re-training and evaluation in terms of accuracy measures and computing performance—the final summary page appears showing the model performance metrics. At the end of this final page we will find the links to download one or more of the trained models for future usage, for example, as a RESTful API in production.

For a look at the full end-user experience, you can watch the guided automation application in action in this demo video, “Guided Analytics for Machine Learning Automation.”

Knime Analytics Platform

This blueprint for machine learning automation was developed using Knime Analytics Platform.

Knime Analytics Platform is open source software for data science, covering all your data needs from data ingestion and data blending to data visualization, from machine learning algorithms to data wrangling, from reporting to deployment, and more. Knime Analytics Platform is based on a graphical user interface for visual programming. This makes it very intuitive and easy to use, considerably reducing the learning time. Knime Analytics Platform has also been designed to be open to different data formats, data types, data sources, data platforms, and external tools.

Computing units are displayed in the GUI of Knime Analytics Platform as small colorful blocks, named “nodes.” Assembling nodes in a pipeline, one after the other, implements a data processing application. The pipeline is referred to as the “workflow” (Figure 3).

A workflow can also be executed from a web browser through the Knime Server. Dedicated nodes build input and output web components that can be assembled to compose a webpage. Inserting these special nodes within a workflow provides interaction points in the execution flow where an input is required or an output is displayed.

Given its openness, flexibility, visual programming, and optional web-based execution, Knime Analytics Platform is well suited for the implementation of our guided automation blueprint to automatically create and train classification models.

The workflow behind machine learning automation

The workflow behind the blueprint is available on the public Knime Examples Server from the Knime website or via Knime Analytics Platform under 50_Applications/36_Guided_Analytics_for_ML_Automation/01_Guided_Analytics_for_ML_Automation.

You can import the workflow into your Knime Analytics Platform, customize it to your needs, and run it from a web browser on the Knime Server. In this video, you can find more details on how to access and import workflows from Knime Examples Server to Knime Analytics Platform. The blueprint workflow is shown in Figure 3 below.

knime guided automation for ml 2 figure 3 Knime

Figure 3: Guided automation workflow implementing all required steps and webpages: settings configuration, data preparation and model training, and final dashboard.

In the workflow, you can recognize the three phases of the web-based application:

  • Settings configuration: Upload, select, fine-tune, and execute the automation
  • Behind the scenes: Data preparation and model training
  • Final dashboard: Compare and download models

Settings configuration: Upload, select, fine-tune, and execute the automation

In the first part of the workflow, each light gray node produces a view, i.e., a webpage with an action request. When running the workflow via a web browser to Knime Server, these webpages introduce as many interaction points where the end user can set preferences and guide the analytics process. You can see the nodes to upload the dataset file, select the target variable, and filter certain features.

Data columns can be excluded based on their relevance or on an expert’s knowledge. Relevance is a measure of column quality. This measure is based on the number of missing values in the column and on its value distribution; columns with too many missing values or with too constant or too spread values are penalized. Moreover, columns can be manually removed to prevent data leakage.

After that, you can select the machine learning models to train, optionally introduce settings for parameter optimization and feature engineering, and finally select the execution platform. The sequence of webpages generated during the execution of these special nodes is shown in Figure 2.

Behind the scenes: Data preparation and model training

In the following phase, the number crunching takes place behind the scenes. This is the heart of the guided automation application. It includes the following operations:

  • Missing value imputation and (optional) outlier detection
  • Model parameter optimization
  • Feature selection
  • Final optimized model training or retraining

After all settings have been defined, the application executes all of the selected steps in the background.

Data partitioning. First the data set is split into training and test sets, using an 80/20 split with stratified sampling on the target variable. Machine learning models will be trained on the training set and evaluated on the test set.

Data preprocessing. Here, missing values are imputed column by column using the average value or the most frequent value. If previously selected, outliers are detected using the interquartile range (IQR) technique and capped to the closest threshold.

Parameter optimization. The parameter optimization process implements a grid search over a selected set of hyperparameters. The granularity of the grid search depends on the model and type of hyperparameters. Each parameter set is tested with a four-fold cross-validation scheme and ranked by average accuracy.

Feature engineering and feature selection. For feature engineering, a number of new artificial columns are created according to previous settings. Four kinds of column transformations can be applied:

  • Simple transformation on a single column (ex, x2, x3, tanh(x), ln(x))
  • Combining together pairs of columns with arithmetical operations
  • Principal component analysis (PCA)
  • Cluster distance transformation, where the data are clustered by the selected features and the distance to a chosen cluster center is calculated for each data point

A feature optimization process is run on the new feature set, consisting of all original features and some newly created features. The best feature set is selected by random search through a four-fold cross-validation scheme ranked by average accuracy. Parameter optimization and feature engineering can be customized. Indeed, for small datasets and easy problems, model optimization can be skipped to avoid overfitting.

Model retraining and evaluation. Finally, using the optimal hyperparameters and the best input feature set, all of the selected machine learning models are retrained one last time on the training set and reevaluated on the test set for the final accuracy measures.

Final dashboard: Compare and download models

The last part of the workflow produces the views in the landing page. The node named Download Models contains prepackaged JavaScript-based views producing plots, charts, buttons, and descriptions visible in the final landing page.

ROC curves, accuracy measures, gain or lift charts, and confusion matrices are calculated on the test set and displayed in this final landing page to compare accuracy measures.

Model execution speed is evaluated during training and during deployment. Deployment execution speed is measured as the average speed to run the prediction for one input. Thus, two bar charts show respectively the model training time in seconds and the average time to produce a single prediction in milliseconds.

All dashboard views are interactive. Plot settings can be changed, and data visualizations can be explored on the fly.

The same node also produces the links at the end of the page to download one or more of the trained models for future usage.

An easily customized workflow for guided machine learning

We have reached the end of our journey in the realm of guided automation for machine learning.

We have shown what guided automation for machine learning is, our own interpretation for semi-automated (guided) machine learning applications, and the steps required.

We have implemented a blueprint via Knime Analytics Platform that can be downloaded for free, customized to your needs, and freely reused.

After introducing the relevant GUI and analytics steps, we have shown the workflow implementing the application behind the web browser view.

This workflow already works for binary and multiclass classification and it can be easily customized and adapted to other analytics problems—for example, to a document classification problem or to a time series analysis. The rest is up to you.

Paolo Tamagnini, a data scientist at Knime, holds a master’s degree in data science from Sapienza University of Rome and has research experience from NYU in data visualization techniques for machine learning interpretability.

Simon Schmid is currently studying for a master’s degree in computer science at the University of Konstanz, Germany. His research interest is the automation of machine learning. He has been at Knime since 2016 as a software engineer. He is now based in Austin, Texas, completing a further internship, diving more deeply into exploring guided automation for machine learning.

Christian Dietz holds a diploma in business informatics from the Business and Administration Academy in Stuttgart and received a master’s degree in computer science from the University of Konstanz. After working as a research programmer at the University of Konstanz, where he developed frameworks and libraries in the fields of bioimage analysis and machine learning, Christian joined Knime as a software engineer. Some of his recent projects include the deep learning, H2O machine learning and image processing integrations for Knime Analytics Platform. 

For more information on Knime, please visit www.knime.com and the Knime blog .

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

This story, "Guided automation for machine learning" was originally published by InfoWorld.