Premise
As developers incorporate statistical machine learning (ML) models into their work, warding off application decay depends on maintenance of those models within end-to-end DevOps pipelines. Maintaining traditional application code and training statistical ML models are two sides of the same DevOps coin. Essentially, training ensures that embedded ML models continue to achieve intended application outcomes, such as making data-driven predictions and classifications, with high accuracy.
Analysis
In the DevOps lifecycle for applications that incorporate supervised-ML models, training should be an integral process, not an afterthought. As a key application maintenance function, supervised-ML model training can succeed if developers have access to relevant labeled data sets, retrain as often as necessary to maintain model accuracy, and orchestrate all necessary human and automated training resources in a cost-effective, timely fashion. To achieve this, developers should:
- Incorporate ML model training into enterprise DevOps.
Maintaining traditional application code and training data-driven statistical algorithms are two sides of the same ML DevOps coin. The new generation of developers must train the supervised-ML assets in their apps in order to ensure their continued fitness as predictors, classifiers, and in other roles.
- Assess model-training requirements within the broader enterprise ML practice.
For developers, assessing their ongoing requirement for training data resources hinges on identifying the proportion of their ML-infused application portfolio that relies on supervised learning.
- Estimate training resource requirements throughout the ML lifecycle.
To the extent that developers are building or managing supervised-learning ML applications, they will need to estimate training requirements throughout application lifecycles. This requires an understanding of what they will need in hardware, software, data, and human resources at each stage of the supervised-ML DevOps lifecycle.
- Explore approaches for training ML models within resource constraints.
In order to constitute a sustainable DevOps practice, training of supervised-ML models should be managed within the financial, technical, and human resources available to the development, data science, and IT operations teams. Chief approaches for conducting training within these constraints include making the most of smaller data sets, generating labeled training data automatically, repurposing existing training data and labels, harvesting one's own training data and labels from free sources, leveraging prelabeled public datasets, and using crowdsourced labeling services.
Incorporate supervised-ML model training into enterprise DevOps
The objective of application maintenance is to keep programs fit for purpose. Typically, developers accomplish that through modifications to the code, schemas, metadata, and other artifacts from which their applications are composed.
Application decay is the process by which the fitness of some programmatic artifact declines. Though the app in question may remain bug-free, its value to users will decay if the code proves difficult or costly to modify, extend, or evolve to keep pace with changing conditions. For example, an app's value may decay because developers neglect to fix a confusing user interface, to include a key attribute in its data model, or to expose an API that other apps need in order to consume its output.
Application maintenance may involve modifications to layers of code, schemas, metadata, and other artifacts of which an application is composed. To the extent that developers have inadvertently built potential maintenance nightmares into their applications, the costs of keeping it all fit for purpose deepen and the inevitable value decay accelerates. Typically, these issues stem from reliance on problematic development practices such as opaque interfaces, convoluted interdependencies, tightly coupled components, and scanty documentation. These problems add fragility to the code and downstream costs and delays in the DevOps lifecycle.
Whereas traditional maintenance involves revisions to deterministic application logic, the new era of data science demands that developers also maintain—in other words, train—the probabilistic logic expressed in ML, deep learning, predictive analytics, and other statistical models. As developers incorporate ML models into their applications, they will need to ensure that those assets are also kept fit for purpose. Typically, those purposes revolve around ensuring that the ML model encodes which data features (e.g., independent variables such as customer attributes and behaviors) best predict the outcome of interest (e.g., dependent variables such as customer likelihood to churn).
Generally, developers test ML-model fitness, accuracy, or effectiveness through a methodology called supervised learning. This involves algorithmically assessing how accurately a model analyzes data that is a key input into some prediction, classification, or other challenge that the model was built to address. When used to fine-tune an ML model’s accuracy, this input data is generally referred to as “training data,” though it often goes by such other names as validation sets, test sets, design sets, reference sets, and ground truth (please note that professional data scientists make fine-grained distinctions among these concepts, so they are not strictly interchangeable).
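To make these training-data concepts concrete, here is a minimal sketch of a supervised training-and-validation pass. It assumes scikit-learn is available and uses a synthetic dataset and a logistic-regression model purely as stand-ins for real application data and whatever model a team actually maintains.

```python
# A minimal sketch of the train/validate loop described above, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# "Training data": labeled examples with known outcomes (e.g., churned vs. retained).
X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)

# Hold out a validation set so fitness is measured on data the model has not seen.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)                        # supervised training on labeled data

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")  # the "fit for purpose" check
```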
Figure 1 presents a graphic overview of the high-level supervised-learning steps involved in training ML models.
Figure 1: Training Machine Learning Models Through Supervised Learning
Assess model-training requirements within the broader enterprise ML practice
Supervised learning involves creating ML models that infer functions (predictive, classification, etc.) from labeled training data. To the extent that developers have trained it effectively, a supervised ML model will learn the approximate correlation of data features to desired outputs that are implicit in the training data. In terms of the underlying mathematics, supervised learning seeks to minimize an ML model’s statistical loss function over a training data set. The ML models being trained may incorporate diverse underlying ML algorithms geared to supervised learning, such as artificial neural networks, support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, and k-nearest neighbor.
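As an illustration of this loss-minimization framing and the diversity of algorithms just listed, the following sketch (again assuming scikit-learn and synthetic data) fits several of those candidate algorithms to the same labeled training set and compares their held-out log-loss, the statistical loss referred to above.

```python
# Illustrative comparison of a few supervised algorithms named above, each fit to the
# same labeled training data; log-loss is the statistical loss minimized during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1_000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=15),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    loss = log_loss(y_val, model.predict_proba(X_val))
    print(f"{name:22s} validation log-loss: {loss:.3f}")
```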
For developers, assessing their ongoing requirement for training data resources hinges on identifying the proportion of their ML-infused application portfolio that relies on supervised learning. Table 1 presents the most common alternatives to supervised learning, indicating when, if at all, each of them uses labeled training data.
ALTERNATIVE | DISCUSSION
Semi-supervised learning | Description: enables more accurate inference of the correct labels for the larger data set from which both the labeled and unlabeled examples were taken, without the cost of manually labeling the entire set. · Applications: indexing, webpage classification, and online content curation. · Use of labeled training data: uses a large amount of unlabeled data and a smaller amount of manually labeled training data from the same source (a minimal sketch of this approach follows Table 1).
Reinforcement learning | Description: involves iterative exploration by an ML-powered decision agent of an unfamiliar environment; the agent processes feedback on whether its iterative steps cumulatively bring it closer to maximizing a predefined statistical reward function, and adjusts its actions to exploit any knowledge gained that improves its chances of maximizing that reward. · Applications: ML-driven industrial applications, robotics, online gaming. · Use of labeled training data: none (however, there are myriad use cases, such as those cited here, where reinforcement learning is supplemented through supervised-learning methods in the training of ML algorithms).
Unsupervised learning | Description: uses cluster analysis algorithms (e.g., hierarchical clustering, k-means clustering, Gaussian mixture models, self-organizing maps, and hidden Markov models) to find hidden groupings and other patterns in unlabeled data. · Applications: computer vision, image segmentation, unstructured-data pattern mining. · Use of labeled training data: none.
Table 1: Most Common Alternatives to Supervised Learning
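To make the semi-supervised row of Table 1 concrete, the following is a minimal sketch (assuming scikit-learn and a synthetic dataset) in which a small manually labeled sample is combined with a large unlabeled pool via self-training.

```python
# A minimal sketch of the semi-supervised row in Table 1: a small set of manual labels
# plus a larger unlabeled pool, combined via scikit-learn's self-training wrapper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, random_state=1)

# Pretend only ~5% of the examples were manually labeled; the rest are marked -1,
# the scikit-learn convention for "unlabeled".
rng = np.random.default_rng(1)
y_partial = y.copy()
unlabeled = rng.random(len(y)) > 0.05
y_partial[unlabeled] = -1

# Self-training: the base classifier iteratively pseudo-labels the unlabeled pool.
model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"), threshold=0.9)
model.fit(X, y_partial)

held_out_acc = accuracy_score(y[unlabeled], model.predict(X[unlabeled]))
print(f"Accuracy on examples whose labels were withheld: {held_out_acc:.3f}")
```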
Estimate training resource requirements throughout the ML lifecycle
To the extent that developers are building or managing supervised-learning ML applications, they will need to estimate training requirements throughout those applications' lifecycles. That requires an understanding of what one will need in IT resources (cloud services, servers, software, data, etc.) and in skilled personnel at each stage in the supervised-learning ML pipeline. Table 2 presents the chief stages of this pipeline along with high-level resource requirements relevant to training of supervised-ML models.
STAGE | DISCUSSION
Acquire the type of training data required | At this stage, ML developers (including statistical modelers and domain experts) identify the historical data necessary to train the ML model in question. They will require tools for data discovery, visualization, and analysis. They will also require the ability to access and explore data from different sources, including internal enterprise data, external data marketplaces, and open-source repositories.
Prepare the training data | At this stage, developers (including all roles from the prior stage, plus data engineers) oversee the aggregation, cleansing, and other upfront functions needed to prepare data for building and training ML models. This requires the ability to set up and manage infrastructure and workflows for refining, storing, labeling, and otherwise managing the training data and all associated metadata.
Evaluate results of the current model version against the data | At this stage, developers (including all roles from prior stages) evaluate the current version of a built ML model by using it to analyze the current version of the training data. If the results of the analysis meet objectives, such as predicting the phenomenon of interest with high accuracy, the model is considered trained and is ready for deployment into the target application. If the results fail to meet objectives, the model may be revised and subsequently re-evaluated. Developers may revise the model by tweaking its feature set, altering its hyperparameters, or building it with a different statistical algorithm. In addition to or in lieu of these approaches, developers may also re-evaluate the model with different training data (drawn from the same or different sources as in the first training run) or with a different training methodology (such as any of the approaches listed in Table 1). Resource requirements in this stage encompass all of the hardware, software, data, and human resources from prior stages.
Table 2: Chief Stages and Resource Requirements in the Supervised ML Development and Training Pipeline
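The pipeline stages in Table 2 can be pictured as a compact script. The sketch below is illustrative only: the input file path, the feature handling, and the 0.90 accuracy objective are assumptions standing in for whatever data sources and fitness criteria a real project defines.

```python
# A compact sketch mapping Table 2's stages onto code: acquire, prepare, then
# evaluate-and-iterate until the accuracy objective is met.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

TARGET_ACCURACY = 0.90                      # the "meets objectives" bar from the text

# Stage 1 - acquire: pull historical, labeled observations (hypothetical file path).
raw = pd.read_csv("historical_observations.csv")

# Stage 2 - prepare: cleanse and split into features and the labeled outcome column.
raw = raw.dropna()
X = pd.get_dummies(raw.drop(columns=["outcome"]))
y = raw["outcome"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 3 - evaluate and iterate: adjust a hyperparameter until the objective is met.
for n_trees in (50, 100, 200, 400):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    print(f"{n_trees} trees -> validation accuracy {acc:.3f}")
    if acc >= TARGET_ACCURACY:
        break                                # model considered trained; ready to deploy
```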
The costs, risks, and delays involved in training supervised-learning ML apps may grow due to a number of issues. Table 3 presents the principal issues that may impede successful training of supervised-ML models.
ISSUE | DISCUSSION
Data scarcity | Lacking sufficient volumes of training data, or the right types of it, can frustrate efforts to boost supervised-ML model inference accuracy. Some supervised-ML algorithms may be applied in domains for which adequate training data sets are difficult or impossible to obtain. Consequently, the upfront data discovery and acquisition workloads may be commensurately greater, costlier, and slower to come to fruition. Sometimes this paucity stems from intractable issues of data access, such as in cybersecurity applications of supervised ML for which the necessary example data is highly sensitive.
Data-labeling resource constraints | All supervised-ML algorithms require that training datasets be labeled, typically through manual methods. Training data needs to be labeled so that supervised-ML algorithms can improve their accuracy as predictors, classifiers, or in whatever other role they have been built for. However, the requisite labeling personnel are often scarce, unreliable, or of low productivity. In particular, the availability of subject matter experts for training-data labeling and curation may be inconsistent, owing to their being tasked with responsibilities outside the scope of ML projects.
Frequent retraining cycles | Some supervised-ML algorithms may require frequent retraining, especially if the underlying predictors are poorly understood, interact in complicated patterns, or change dynamically. Even if a supervised-ML model was trained well at initial deployment, failure to keep retraining it at regular intervals with fresh observational data may cause it to become less effective over time. That is especially true if the original independent variables become less predictive of the phenomenon of interest. Also, most ongoing requirements to update supervised-ML models will require retraining them afresh with every app-update cycle. And if one considers how application maintenance may require parallel revisions to the corresponding app code, it is clear that maintenance burdens could intensify as developers build apps that rely on ML and other advanced analytics. (An illustrative retraining trigger appears after Table 3.)
Table 3: Issues That May Impede Successful Supervised ML Training
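As a rough illustration of how the retraining issue in Table 3 might be handled operationally, the following sketch scores a deployed model on the freshest labeled observations and retrains it when accuracy decays past a tolerance. The five-point tolerance and the function name are illustrative assumptions, not a prescribed policy.

```python
# An illustrative retraining trigger for the "frequent retraining cycles" issue above:
# score the deployed model on recent labeled observations and retrain when accuracy
# falls below a tolerance band. The 0.05 (5-point) tolerance is an assumption.
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def maybe_retrain(model, X_recent, y_recent, X_history, y_history, baseline_accuracy):
    """Retrain on accumulated history if recent accuracy decays past the tolerance."""
    recent_accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if recent_accuracy < baseline_accuracy - 0.05:
        refreshed = clone(model)                  # same algorithm and hyperparameters
        refreshed.fit(X_history, y_history)       # retrain on fresh plus historical data
        return refreshed, recent_accuracy
    return model, recent_accuracy
```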
For ML models that drive real-time applications, such as in e-commerce, mobile, or Internet of Things edge scenarios, another complicating factor may be the need to accommodate two-tier concurrent training. As Figure 2 illustrates, this might involve batch training of global supervised-ML models (in other words, those applicable to all users, such as in e-commerce recommendation engines) alongside continuously adaptive training of local supervised-ML models (e.g., those that drive fine-grained personalization, such as in mobile chatbots' conversational user interfaces).
Figure 2: Two-Tier Concurrent Training of Global and Local Supervised-ML Models
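A rough sketch of this two-tier pattern follows: a global model trained in batch over pooled history, plus a lightweight local model refined incrementally as new labeled events arrive. The use of scikit-learn's SGDClassifier and synthetic data is an assumption for illustration; production systems would substitute their own models and event streams.

```python
# Illustrative two-tier training: a global batch-trained model for all users plus a
# per-user incremental model updated as new labeled interactions stream in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Global tier: periodic batch training over the pooled, labeled interaction history.
X_global, y_global = make_classification(n_samples=5_000, n_features=10, random_state=2)
global_model = LogisticRegression(max_iter=1_000).fit(X_global, y_global)

# Local tier: an incremental learner per user, updated as labeled events stream in.
X_local, y_local = make_classification(n_samples=640, n_features=10, random_state=7)
local_model = SGDClassifier(random_state=7)
classes = np.unique(y_global)
for start in range(0, len(X_local), 32):          # stand-in for a stream of local events
    batch = slice(start, start + 32)
    local_model.partial_fit(X_local[batch], y_local[batch], classes=classes)

# At inference time the application can arbitrate between the global and local tiers,
# for example preferring the local model once it has seen enough of a user's data.
```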
As training becomes more complex to set up and administer, the end-to-end burden grows on IT, developers, and others involved in the supervised-ML model development, training, deployment, and administration cycle. Consequently, it may become more difficult to estimate the downstream maintenance burden amid the flurry of technical details across an ML application's lifecycle. The detailed nuances of supervised-ML training configuration and administration are discussed at length in recent articles and myriad other online discussions.
Explore approaches for training supervised-ML models within resource constraints
In order to constitute a sustainable DevOps practice, training of supervised-ML models should be managed within financial, technical, and human resources available to the development, data science, and IT operations teams. Table 4 presents approaches for ensuring adequate training of supervised-ML models within constraints.
APPROACH | DISCUSSION
Make the most of smaller training data sets | Data scientists have developed various approaches for boosting the accuracy of supervised-ML models trained from small, sparse, and otherwise limited training-data sets; HP has published interesting research on the topic.
Automate generation of labeled training data | Developers should explore the promising approaches now available for generating training datasets of sufficient volume, with good-enough labeling, under a less stringent "weak supervision" approach (a hand-rolled sketch of this idea follows Table 4). Alternatively, one might explore tools for automating the labeling of training data.
Repurpose existing training data and labels | This may be the cheapest, easiest, and fastest approach to training, assuming that the new learning task's domain is sufficiently similar to the domain of the original task. When taking this approach, "transfer learning" tools and techniques may help determine which elements of the source training dataset are repurposable to the new modeling domain. Repurposing training data is easier if one maintains a curated corpus of this data within the enterprise data lake, uses tools for visualizing the attributes of training data, and automates training in a structured data-science DevOps pipeline.
Harvest one's own training data and labels from free sources | The Web, social media, and other online sources are brimming with data that can be harvested with the right tools. In this era of cognitive computing, developers can acquire rich streams of natural language, social sentiment, and other training data from various sources. If one has access to a data crawler, this can be a good option for acquiring training datasets, as well as the associated labels, from source content and metadata. Clearly, one will need to grapple with a wide range of issues related to data ownership, data quality, semantics, sampling, and so forth when assessing the suitability of crawled data for model training.
Explore prelabeled public datasets | There is a wealth of free data available from open-source communities and even from various commercial providers, including free supervised-ML training datasets for building chatbots, performing image manipulations, detecting sarcasm on social media, delivering online music recommendations, and more. Data scientists should identify which, if any, of this data might be suitable at least for the initial training of their models. Ideally, the free dataset will have been prelabeled in a way that is useful for the learning task; if it has not been prelabeled, one will need to determine the most cost-effective way of doing so.
Retrain models on progressively higher quality labeled datasets | One's own data resources may be insufficient for training a specific ML model. To bootstrap training, one might pretrain with free public data that is roughly related to the domain; if the free datasets include acceptable labels, all the better. One might then retrain the model on smaller, higher-quality labeled datasets that are directly related to the learning task at hand. As developers progressively retrain their models on higher-quality datasets, the findings might allow them to fine-tune the feature engineering, classes, and hyperparameters in their models. This iterative process might also suggest other, higher-quality datasets to be acquired and/or higher-quality labeling to be done in future training rounds in order to refine models even further. Bear in mind, though, that these iterative refinements might require progressively pricier training datasets and labeling services.
Leverage crowdsourced labeling services | Developers might not have enough internal staff to label their training data, the available staff might be too expensive to use for labeling, or internal resources might be insufficient to label a huge amount of training data rapidly enough. Under those circumstances, and budget permitting, one might crowdsource labeling chores to commercial services such as Amazon Mechanical Turk, CrowdFlower, Mighty AI, or CloudFactory. Outsourcing labeling to crowd-oriented environments can be far more scalable than doing it internally, though one gives up some control over the quality and consistency of the resultant labels. On the positive side, these services tend to use high-quality labeling tools that make the process faster, more precise, and more efficient than one may be able to manage with in-house processes.
Embed labeling tasks in online apps | Web application data can be captured for use as training-data labels. For example, embedding training data in CAPTCHA challenges, which are common in bot-detection and account-verification scenarios, is a popular approach for training image- and text-recognition models. In a similar vein, one might consider presenting training data in gamified apps that give users incentives to identify, classify, or otherwise comment on images, text, objects, and other presented entities.
Rely on third-party models that have been pretrained on labeled data | Many learning tasks have already been addressed by good-enough models that have been trained with good-enough datasets, which presumably were adequately labeled prior to the training of the corresponding models. Pretrained models are available from various sources, including academic researchers, commercial vendors, and open-source data-science communities. Be aware that more ML-driven software and SaaS offerings will come bundled with pretrained models and even access to training data; these offerings are likely to become key vendor differentiators. Bear in mind that a pretrained model's utility will decline as one's domain, feature set, and learning task drift further from those of the source over time.
Table 4: Approaches for Successful Supervised ML Training Within Budgetary and Other Constraints
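As promised in Table 4, here is a hand-rolled sketch of the "weak supervision" idea: several noisy, heuristic labeling functions vote on each unlabeled example to produce good-enough labels in bulk. The spam-flavored heuristics and keywords are purely illustrative; dedicated weak-supervision tooling would typically add label-model calibration on top of this.

```python
# A hand-rolled illustration of "weak supervision" from Table 4: noisy heuristic
# labeling functions vote on each unlabeled example to produce good-enough labels.
from collections import Counter

ABSTAIN = None

def lf_contains_offer(text):          # heuristic 1: promotional wording suggests spam (1)
    return 1 if "limited offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):            # heuristic 2: a personal greeting suggests not spam (0)
    return 0 if text.lower().startswith(("hi ", "hello ")) else ABSTAIN

def lf_excess_punctuation(text):      # heuristic 3: shouting punctuation suggests spam (1)
    return 1 if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_offer, lf_has_greeting, lf_excess_punctuation]

def weak_label(text):
    """Majority vote across labeling functions; None when every function abstains."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else None

unlabeled = [
    "Hello team, the quarterly report is attached.",
    "LIMITED OFFER!!! Claim your prize now!!!",
]
print([weak_label(t) for t in unlabeled])   # weakly labeled examples, ready for training
```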
Another approach for moderating one's requirements for training resources is to reduce the percentage of ML projects that rely on supervised learning. However, given one's ML application requirements, the alternative learning approaches (semi-supervised, unsupervised, and reinforcement) may not be feasible substitutes. As noted above, those approaches are suited to particular use cases and may not be as accurate in the use cases where supervised learning is usually implemented.
Action Item
Developers need to institute practices for ongoing training of the supervised machine learning (ML) models within their applications. Training tasks should be built into structured DevOps pipelines to ensure that this critical function is adequately addressed in the supervised-ML lifecycle.