The Machine Learning Data Dilemma

To be effective, machine learning (ML) has a significant requirement: data. Lots of data. We can expect a child to understand what a cat is and identify other cats after just a few encounters or by being shown a few examples of cats, but ML algorithms require many, many more examples. Unlike humans, these algorithms can’t easily develop inferences on their own. For example, machine learning algorithms interpret a picture of a cat against a grassy background differently than a cat shown in front of a fireplace.

The algorithms need a lot of data to separate the relevant “features” of the cat from the background noise. It is the same for other noise such as lighting and weather. Unfortunately, such data hunger does not stop at the separation of signal from noise. The algorithms also need to identify meaningful features that distinguish the cat itself. Variations that humans do not need extra data to understand — such as a cat’s color or size — are difficult for machine learning.

Without an adequate number of samples, machine learning provides little benefit.

Not All ML Techniques Are Equally Hungry

Many types of machine learning techniques exist, and some have been around for several decades. Each has its own strengths and weaknesses. These differences also extend to the nature and amount of data required to build effective models. For instance, deep learning neural networks (DLNNs) are an exciting area of machine learning because they are capable of delivering dramatic results. DLNNs require a greater amount of data than more established machine learning algorithms as well as a hefty amount of computing horsepower. In fact, DLNNs were considered feasible only after the advent of big data (which provided the large data sets) and cloud computing (which provided the number-crunching capability).

Other factors affect the need for data. General machine learning algorithms do not include domain-specific information; they must overcome this limitation through large, representative data sets. Referring back to the cat example, these machine learning algorithms don’t understand basic features of cats, nor do they understand that backgrounds are noise. So they need many examples of this data to learn such distinctions.

To reduce the data required in these situations, machine learning algorithms can include a level of domain knowledge so that key features and attributes of the target data are already known. Then the focus of learning can be strictly on optimizing output. This need to “imbue” human knowledge into the machine learning system from the start is a direct result of the data-hungry nature of machine learning.

Training Data Sets Need Improvement

To truly drive innovation using machine learning, a good amount of innovation first needs to occur in how input data is selected.

Curating (that is, selecting the data for a training data set) is, at heart, about monitoring data quality. “Garbage-in, garbage-out” is especially true with machine learning. Exacerbating this problem is the relative “black box” nature of machine learning, which makes it hard to understand why a model produces a certain output. When machine learning creates unexpected output, the cause is often inappropriate input data, but determining the specific nature of the problem data is a challenge.

Two common problems caused by poor data curation are overfitting and bias. Overfitting is the result of a training data set that does not adequately represent the actual variance of production data; the resulting model performs well on its training data but can deal with only a portion of the entire production data stream.
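The gap between training performance and production performance can be illustrated with a toy sketch. Everything here is hypothetical and chosen only for illustration: the `train_memorizer` "model" simply memorizes its training pairs, the inputs are plain integers, and the labeling rule is invented. It is not a real learning algorithm, just a minimal picture of a model trained on a narrow slice of the data.

```python
def accuracy(predict, samples):
    """Fraction of (x, label) pairs that predict() labels correctly."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


def train_memorizer(samples):
    """Hypothetical 'model': memorizes training pairs, guesses 0 for anything unseen."""
    table = dict(samples)
    return lambda x: table.get(x, 0)


def label(x):
    """Invented ground truth for this sketch: 1 when x >= 50, else 0."""
    return int(x >= 50)


# The training set covers only a narrow slice of the input range (40..59).
train = [(x, label(x)) for x in range(40, 60)]
# Production data spans the full range (0..99).
production = [(x, label(x)) for x in range(0, 100)]

model = train_memorizer(train)
train_acc = accuracy(model, train)
prod_acc = accuracy(model, production)

print(f"training accuracy:   {train_acc:.2f}")
print(f"production accuracy: {prod_acc:.2f}")
```

A large difference between the two numbers is the usual warning sign: the model looks excellent on the data it was built from and noticeably worse on the wider data stream.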

Bias is a deeper problem that relates to the same root cause as overfitting but is harder to identify and understand. Biased data sets are not representative, have skewed distribution, or do not contain the correct data in the first place. This biased training data results in biased output, leading to incorrect conclusions that may be difficult to identify as incorrect. Although there is significant optimism about machine learning applications, data quality problems should be a major concern as machine-learning-as-a-service offerings come online.
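One simple check for a skewed distribution is to compare the share of each class in the training set against a reference sample believed to reflect production. The sketch below assumes such a reference sample exists; the function names, the cat-color labels, and the 10-point tolerance are all hypothetical choices for illustration, and a check like this catches only one kind of bias.

```python
from collections import Counter


def class_shares(labels):
    """Map each class to its fraction of the label list."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def skewed_classes(train_labels, reference_labels, tolerance=0.10):
    """Return classes whose training share differs from the reference share
    by more than `tolerance` (an assumed threshold), with the signed difference."""
    train = class_shares(train_labels)
    ref = class_shares(reference_labels)
    flagged = {}
    for cls in sorted(set(train) | set(ref)):
        diff = train.get(cls, 0.0) - ref.get(cls, 0.0)
        if abs(diff) > tolerance:
            flagged[cls] = round(diff, 2)
    return flagged


# Invented example data: the training set over-represents tabby cats
# and contains no white cats at all.
train_labels = ["tabby"] * 90 + ["black"] * 10
reference_labels = ["tabby"] * 50 + ["black"] * 30 + ["white"] * 20

flagged = skewed_classes(train_labels, reference_labels)
print(flagged)
```

Positive values mark classes that are over-represented in the training data, negative values mark classes that are under-represented or missing entirely.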

Read more on "The Machine Learning Data Dilemma": TDWI