Top 10 Machine Learning Algorithms You Should Know | Clarsentia Blog

Posted by Ivan Smith in Machine Learning & Artificial Intelligence

Before we begin, let’s look briefly at the difference between machine learning and artificial intelligence, as the two are often thought to be one and the same. We’ll also explore what we mean by ‘algorithms’ in the context of machine learning.

In the media, we often hear marketers use terms like machine learning, algorithms and artificial intelligence interchangeably. Although they are conceptually related, each is a distinct concept:

What is Machine Learning?

Machine Learning describes the methods through which we train computer systems to get better at specific tasks. These tasks can include various levels of complexity but are typically not intelligent (at least by standard definitions) in how they function. We use machine learning to help our computer system interpret, classify and make predictions on new data we can use to further enhance and develop our problem solving system.

What is Artificial Intelligence?

Artificial Intelligence is a field of computer research in which scientists attempt to emulate human intelligence in every way. Despite what is reported in the media, there is no true artificial intelligence system to date that meets all the requirements of mimicking human intelligence completely. When we read about advanced computing systems like ‘Google AI’ and ‘Microsoft AI’, what we’re really talking about are breakthroughs in and repurposing of machine learning techniques, both old and new, that have been popularized in the last few years to solve specific data problems.

Machine learning is an incredibly powerful tool when applied to big data problems. It can reduce the amount of manual effort required to process noisy data by making inferences on the data before our analysis begins. It can also classify and cluster data to give us a better understanding of where the differences between the data lie -- but it isn’t a cure-all. In order to apply machine learning techniques effectively, a component of human guidance with model selection and testing is usually required.

What is an Algorithm?

When we talk about algorithms in the context of machine learning, we’re referring to a process of solving a problem using a calculation or logic process typically executed using computer code. This can be either a simple mathematical calculation or a series of calculations that give us an expected (or sometimes unexpected) result.

Throughout this article, we’ll refer to mathematical notations used in machine learning algorithms which are an expression of the logic processes and calculations required to demonstrate a ‘learning’ computer program. Don't be intimidated. If you’re in the process of designing a machine learning algorithm, your job will be to translate these notations into functional programs which is typically much less complicated than it appears.

Types of Machine Learning Algorithms

Machine learning algorithms are typically broken down into three categories:

Supervised Learning
Supervised learning algorithms require a ‘trainer’ or set of training data that has been correctly organized and labeled. With supervised learning, machine learning systems are provided with ‘an answer’ in the form of training data used to correctly learn or categorize new data. Supervised learning algorithms are most effective where the relationship and correlation in our data is clear. Logistic regression, Naïve Bayes, kNN and SVMs are all examples of supervised learning algorithms.

Unsupervised Learning
Unsupervised learning algorithms do not require labeled training data, as associations are learned directly from the new 'input' data processed by our algorithm. We use unsupervised learning to determine what differences and similarities are present in the data and categorize or cluster them accordingly. Unsupervised learning algorithms are most effective with complex data, where relationships are either too complex to guess or the correlation is unknown. Clustering or grouping the information can then be used to determine what relationships, if any, exist. A data scientist can then analyze those relationships and determine whether they are meaningful and whether any causation exists that accounts for the correlations. Apriori and K-Means are examples of unsupervised learning algorithms; neural networks can be trained in either a supervised or an unsupervised way.

Reinforcement Learning
Reinforcement learning (sometimes written as ‘reinforced learning’) algorithms improve through feedback: the system takes an action, receives a score or reward that estimates how good the outcome was, and adjusts its future behavior accordingly. A related self-improving approach, usually called self-training, starts from a supervised model and updates its training data with new data. That data is typically selected using a score or predetermined value that estimates how well categorized it is, and is then either added to the training set for future comparison or discarded.

Common Algorithm Tasks

As with any complex computer system, different machine learning algorithms can be used to solve specific machine learning problems. In this article, we’ll break down our algorithms into the following two categories of typical machine learning tasks:

Predictive Tasks
Algorithms we use to make specific predictions with continuous output values. In machine learning, our system uses predictions to make assumptions about data, evaluates those assumptions and adjusts so that its predictions improve as more of them are made.

Classification Tasks
Algorithms we use to classify data to determine if it belongs to one class or another. Classification allows our machine learning system to interpret new data and categorize it as it relates to the dataset as a whole. This is often employed in deep learning systems, where broad categories of information must be broken down and analyzed as part of a subset group or class.

A Word About Deep Learning

Before we get started I want to comment on another buzzword making its way around the internet like the Halley’s Comet of unavoidable hype, and that is the concept of ‘deep learning’. It’s hard to open a discussion about machine learning without comments about the many advancements made with ‘deep learning’ and how it will change the internet and our lives forever. This may or may not be true, but before we jump to conclusions we should first understand what ‘deep learning’ actually means.

Deep Learning by standard definition is simply an extension or culmination of several different machine learning algorithms working together (typically based on deep neural networks discussed later in this article) that produce a theoretically improved result. The idea of a super system of machine learning techniques is an attractive one, but the computing requirements of deep learning systems are enormous, and they are typically ill-suited to solving specific business problems.

Companies like Google use deep learning systems to build tools capable of complex, previously unachievable human-like characteristics. This includes things like image recognition and generation, concept mapping and tone-of-voice analysis. Keep in mind that mimicking a subset of human-like qualities and building a fully intelligent computer system (often advertised as 'AI') are two very different things. It’s also important to note that many of these concepts, though recently repurposed and applied in innovative ways, are not new. Some of the techniques discussed in this article date back to the 1950s and have been in use in one form or another since that time.

So with that said and without further ado, here are the top 10 algorithms you should have in your toolbox to help solve your data problems:

Algorithm #1: Linear Regression

Linear Regression is one of the simplest implementations of machine learning predictive modeling algorithms available. If you’ve ever taken an intro statistics course, you’ve probably used linear regression before. Think of linear regression as fitting a straight line through a series of points. That line gives you an idea of what ‘best fits’ your data and lets you reasonably predict where the data is heading.

The most common method used to fit a linear regression is known as the least squares method. The equation of the fitted line looks like this:

Y = A + BX

A refers to the y-intercept and B refers to the slope of the least squares regression line. X is the input variable, and Y is the value we want to predict.

When to Use Linear Regression
Use linear regression for machine learning when you have a simple correlated data set with a visually apparent linear pattern you want to predict a specific outcome on. For example, consider the following dataset of height vs weight distribution:

The asterisk (*) symbol represents our data points, while the solid line represents our calculated regression line. You’ll notice the relationship of our data points visually appears to be linear in its distribution, which allows us to apply a straight ‘best fit’ line and predict for our value X:

Weight = 80 + 2 x (90) = 260 lbs.

If an individual is 90 inches tall, we can expect them to be approximately 260 lbs based on our data.
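
To make the least squares idea concrete, here is a minimal Python sketch. The four height and weight pairs are made up for illustration (they are not the dataset from the chart) and were chosen to lie on the line Weight = 80 + 2 x Height, so the fit should recover roughly A = 80 and B = 2. The function name fit_least_squares is our own.

```python
def fit_least_squares(xs, ys):
    """Return (A, B) for the line Y = A + B*X that minimises the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    # Intercept: the fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data (inches, lbs), chosen to sit on Weight = 80 + 2 * Height.
heights = [60, 65, 70, 75]
weights = [200, 210, 220, 230]

A, B = fit_least_squares(heights, weights)
predicted_weight = A + B * 90  # prediction for a height of 90 inches
print(A, B, predicted_weight)  # approximately 80, 2 and 260
```

With real, noisy data the points will not sit exactly on the line, and A and B will be the values that keep the total squared distance between the line and the points as small as possible.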

When to Avoid Linear Regression
Anytime you have noisy data or data that does not appear to have a linear relationship, avoid using linear regression for predictive purposes when building your machine learning algorithm. Linear regression makes a series of assumptions which include an existing relationship between your x and y values (in this case, height and weight). You’ll need to make sure the relationship exists and a straight ‘best fit’ line can be used before applying linear regression to your model.

Algorithm #2: Non-Linear Regression

Similar to linear regression, non-linear regression attempts to normalize and provide a method for predicting a value based on correlated data. Unlike linear regression however, non-linear regression attempts to solve the problem by applying a curved line instead of a straight one that best fits our data for predictive purposes.

When to Use Non-Linear Regression
If you have a data set that won’t fit a straight or linear line but a curved line will do the job, consider applying a non-linear regression method to fit your data.

From a machine learning perspective, just like linear regression, non-linear regression is typically only useful for creating predictive models. Depending on the type of curve your data appears to fit, you can select from a number of different mathematical equations to accurately predict an outcome based on your data. The role of the data scientist or algorithm designer is to select the model that best fits the data, which in turn gives us a model we can use to make more accurate predictions. Take a look at the following examples of non-linear regression equations and their associated curve lines:

Exponential Curve
Y = A + B C^X

Parabolic Curve
Y = A + B (X - C)^2

Gaussian Curve
Y = A e^(-B (X - C)^2)
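
Selecting the curve that best fits the data can be sketched in Python as follows. This is only an illustration: it assumes the three curves take the forms Y = A + B * C^X (exponential), Y = A + B * (X - C)^2 (parabolic) and Y = A * e^(-B * (X - C)^2) (Gaussian), and the data points and the A, B, C values for each candidate are hypothetical. Each candidate is scored by its sum of squared errors, and the lowest score wins.

```python
import math


def exponential(x, a, b, c):
    return a + b * c ** x


def parabolic(x, a, b, c):
    return a + b * (x - c) ** 2


def gaussian(x, a, b, c):
    return a * math.exp(-b * (x - c) ** 2)


def sum_squared_error(curve, params, xs, ys):
    """Total squared distance between the curve and the observed points."""
    return sum((curve(x, *params) - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations that happen to follow a parabola.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [19, 9, 3, 1, 3, 9, 19]

# Hypothetical candidate models, each with already chosen (A, B, C) values.
candidates = {
    "exponential": (exponential, (1.0, 1.0, 2.0)),
    "parabolic": (parabolic, (1.0, 2.0, 3.0)),
    "gaussian": (gaussian, (19.0, 0.5, 3.0)),
}

errors = {
    name: sum_squared_error(curve, params, xs, ys)
    for name, (curve, params) in candidates.items()
}
best_fit = min(errors, key=errors.get)
print(best_fit, errors)
```

In practice the A, B and C values would themselves be estimated from the data by a fitting routine rather than picked by hand; the comparison step stays the same.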

Visual inspection of your plotted data is normally required when applying a regression ‘best fit’ method for data predictions. Below is an example of a data set used with Clarsentia’s free-to-use search tool, where a non-linear regression method has been applied to predict how sentiment for the Retail Trade industry will change in the future:

By continuously improving our predictive models by evaluating existing models and developing new ones as new data becomes available, we’re able to accurately predict human sentiment about specific subjects and entire market segments which gives our customers an idea of where public sentiment is headed.

When to Avoid Non-linear Regression
You should avoid using non-linear regression techniques if your data does not appear to fit a curved line, if you need to predict more than one value, or if the correlation between variables (inferred relationships) is unknown.

Algorithm #3: Logistic Regression

The third and final regression technique we’ll discuss in the context of machine learning is logistic regression. Logistic regression is used in machine learning primarily to solve classification problems.

Up until this point, we’ve discussed how regression can be used to predict specific values (as with our linear regression height vs weight example) based on our data. With logistic regression, we’ll look at the probability (likelihood) of whether data can be classified in a specific group.

The key to understanding logistic regression is that we are transforming our data into a set of probabilities (likelihood of occurrence), then using those probabilities to determine whether data belongs to a group which we can use as a predictor for a specific outcome.

Logistic regression accomplishes this by dividing datasets into different regions through a boundary line (which is why it is still considered linear). Calculating the decision boundary is what allows us to classify the resulting probabilities and determine if a data set belongs to a specific class.

Have a look at the graph below to see a representation of how the logistic regression algorithm operates:

You’ll note in the above example that our boundary line has separated our data into two binary categories: 0 (data that does not belong to the class) and 1 (data that does belong to the class). Our y axis is always on a scale of 0 to 1, with the boundary discerning which data can be assigned to the class (a higher probability, within an acceptable margin of error) and which cannot.
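
The step from a probability to a class can be sketched in a few lines of Python. The weight and bias below are hypothetical values standing in for a trained model (they put the decision boundary at x = 4), and the 0.5 threshold is a common default rather than something the algorithm requires.

```python
import math


def sigmoid(z):
    """Squash any real number into the 0 to 1 range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight, bias):
    """Probability that x belongs to class 1 under a one-feature model."""
    return sigmoid(weight * x + bias)


def classify(x, weight, bias, threshold=0.5):
    """Return 1 if the probability reaches the threshold, otherwise 0."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0


# Hypothetical trained parameters: weight * x + bias is zero at x = 4,
# so x = 4 is the decision boundary.
weight, bias = 1.5, -6.0

for x in (2, 4, 6):
    print(x, round(predict_probability(x, weight, bias), 3), classify(x, weight, bias))
```

Points to the left of the boundary get probabilities below 0.5 and are assigned to class 0, and points to the right are assigned to class 1.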

When to Use Logistic Regression

With logistic regression, the model produces a probability between 0 and 1, which is then turned into a binary result (0 or 1, yes or no) by comparing it to the decision boundary. This means that in the context of machine learning it’s often best suited for solving classification problems where we need to know whether data belongs to a discrete set of classes (a specific group of data sets).

When to Avoid Logistic Regression

If you need to calculate a specific predicted value and your data does not need to be classified or broken down into a binary (yes or no) result, avoid logistic regression. If you have a limited dataset to work from, other classifier algorithms may perform better and converge faster than logistic regression.

Algorithm #4: Naïve Bayes

Naïve Bayes is a classifier algorithm made popular by email classifying systems used for identifying spam email. The word ‘naïve’ in the name Naïve Bayes derives from the fact that the algorithm uses Bayesian techniques but does not take into account dependencies that may exist between the features (hence ‘naïve’). This seems counterintuitive but yields surprisingly good results.

Like logistic regression discussed above, Naïve Bayes classifies by calculating the probability that a data set belongs to a certain class. Unlike logistic regression however, probabilities are not calculated by minimizing the error rate but by building a model that could generate the data.

To understand Bayesian technique, let’s have a look at the equation we use to calculate our probability:

P(c|x) = P(x|c) * P(c) / P(x)

Bayes theorem provides a way of calculating the posterior probability, P(c|x), from class prior probability (P(c)), predictor prior probability (P(x)), and likelihood (P(x|c)). Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of other predictors. This assumption is called class conditional independence.

To understand the formula better, consider the following example where we want to build a classifier that can identify whether a message is spam or not.

In our training set, we’ve identified the following:

35 emails out of a total of 74 are spam messages.
50 emails out of those 74 contain the word “xxx”.
25 emails containing the word “xxx” have been marked as spam.
28 emails out of the total contain the word “viagra”.
26 emails out of those have been marked as spam.

Using the Naïve Bayes formula above, what is the probability that an email is spam, given that it contains both “viagra” and “xxx”?

P(spam|xxx,viagra) = (P(xxx|spam) * P(viagra|spam) * P(spam)) / (P(xxx) * P(viagra))

P(spam|xxx,viagra) = ((25/35)*(26/35)*(35/74))/((50/74)*(28/74)) = 0.98

So our calculated probability is near certainty (0.98) that the email is spam.
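
The same calculation can be reproduced in Python using the counts from our training set. The variable names are our own, and the calculation treats the two words as independent of each other, which is exactly the ‘naïve’ assumption.

```python
# Counts taken from the training set described above.
total_emails = 74
spam_emails = 35
xxx_emails = 50        # emails containing "xxx"
xxx_and_spam = 25      # of those, marked as spam
viagra_emails = 28     # emails containing "viagra"
viagra_and_spam = 26   # of those, marked as spam

# Prior and likelihoods.
p_spam = spam_emails / total_emails
p_xxx_given_spam = xxx_and_spam / spam_emails
p_viagra_given_spam = viagra_and_spam / spam_emails

# Evidence, treating the two words as independent of each other.
p_xxx = xxx_emails / total_emails
p_viagra = viagra_emails / total_emails

posterior = (p_xxx_given_spam * p_viagra_given_spam * p_spam) / (p_xxx * p_viagra)
print(round(posterior, 2))  # 0.98
```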

When to Use Naïve Bayes
A key strength of Naïve Bayes is that it performs well when we have multiple classes (such as category classification for news articles). Due to the naïve nature of the algorithm (assumption based) it also requires less training data to implement which means we can accurately classify data with minimal training examples to start with.

When to Avoid Naïve Bayes
As with logistic regression, Naïve Bayes is generally best used for classification purposes where a probability of a ‘best fit’ needs to be calculated. Naïve Bayes tends to underperform on larger datasets where the inherent conditional independence assumption does not perform as well. As more training data becomes available other classifiers may perform better with large training sets.

Algorithm #5: Random Forests

Random forests are a supervised learning algorithm built on a standard machine learning technique called a “decision tree”, where an input is entered at the top, traverses down the tree and conditionally gets bucketed into smaller and smaller sets. If you’re familiar with programming, you can think of a decision tree as a series of nested ‘if / else’ statements. Below is an example of a simple decision tree:

Here we are determining if we can classify someone as a male or female based on their height and weight attributes. When we create a large collection of those decision trees and group them together to come up with a certain prediction or classification, we get what’s known as a ‘forest’.
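
To show the nested ‘if / else’ view of a single decision tree, here is a small Python sketch. The split values (70 inches and 160 lbs) are hypothetical and only illustrate the shape of the tree; a real tree would learn its splits from training data.

```python
def classify_person(height_in, weight_lb):
    """A hand-written decision tree with two levels of splits."""
    if height_in > 70:          # first split: height
        return "male"
    else:
        if weight_lb > 160:     # second split: weight
            return "male"
        else:
            return "female"


print(classify_person(72, 150))  # takes the first branch
print(classify_person(66, 175))  # falls through to the weight split
print(classify_person(64, 130))  # reaches the final else
```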

The ‘randomness’ to our forest comes from how our algorithm selects subsets of our training data and how we select the attributes to split decisions on (shown as height and weight as above). When we randomly select at least some of those attributes, we get interesting results which when combined, provide us with a better prediction overall.

A random forest is an example of an ensemble, which means it is a combination of predictive outputs from different models (other decision trees in this case). This makes it very powerful in both its predictive and classification capabilities as it’s able to leverage the ‘wisdom of the crowd’ by generating multiple decision trees and combining them to determine the best possible prediction on our data.
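
A heavily simplified forest can be sketched in Python to show where the randomness and the voting come in. Assumptions to keep in mind: each ‘tree’ here is only one level deep (a single split, often called a decision stump), the eight training rows are made up, and the function names are our own. Real random forest implementations grow much deeper trees, but the two random choices (which rows and which attribute each tree sees) and the final majority vote follow the same idea.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the single split on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if row[feature] <= threshold else right_label) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Randomness 1: a bootstrap sample (rows drawn with replacement).
        picks = [rng.randrange(len(rows)) for _ in rows]
        # Randomness 2: a randomly chosen attribute to split on.
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in picks], [labels[i] for i in picks], feature)
        )
    return forest


def predict_forest(forest, row):
    """Each tree votes; the most common answer wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical (height in inches, weight in lbs) rows with class labels 0 and 1.
rows = [(60, 110), (62, 120), (63, 115), (64, 130),
        (70, 170), (72, 180), (71, 175), (74, 190)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(rows, labels)
print(predict_forest(forest, (58, 100)), predict_forest(forest, (76, 200)))
```

An individual stump can be wrong when its random sample is unlucky, but the combined vote is much harder to fool, which is the ‘wisdom of the crowd’ effect described above.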

When to Use Random Forests
Because of their versatility, random forests are a good choice for both complex predictive and classification problems. Random forests are able to compute individual tree results quickly while the ensemble aspect allows them to perform with a high degree of accuracy.

When to Avoid Random Forests
Because random forest results are very difficult to interpret and visualize (known as a ‘black box’ problem), understanding the results of the model and identifying problems in predictive accuracy is challenging. Random forests can also overfit 'noisy' data (where random outliers in our data are unnecessarily included in calculating our result).

Random forests are traditionally computationally expensive and can be very complex to implement. Consider a simpler algorithm such as a regression technique if the data will closely fit using those methods.

Algorithm #6: K-Nearest Neighbors (kNN)

Widely used with a track record of success, K-Nearest Neighbors (referred to as kNN for short) is a simple classification and predictive algorithm that stores and compares all the attributes of a dataset to the attributes of a class to determine the similarity and likelihood that a new piece of data belongs to that class. This simple concept has proven to be enormously powerful and effective at classifying noisy and uncorrelated data.

The kNN algorithm is based on feature similarity: how closely out-of-sample features resemble our training set determines how we classify a given data point. kNN is also considered to be a ‘lazy’ (as opposed to ‘eager’) learning algorithm, in that training data typically does not need to be manipulated beyond being assigned to a specific class.

The similarity to the members of each class determines whether new data is categorized as belonging to one class or another:

In the example above, the test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles, depending on its ‘nearest neighbors’. If k = 3 (solid line circle), it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle.
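
The voting step can be sketched in Python as follows. The coordinates are hypothetical and were picked so that the three closest points to the test sample are 2 triangles and 1 square, mirroring the k = 3 case described above. Distance is measured with the straight-line (Euclidean) distance, which is a common choice but not the only one.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label the query point by majority vote among its k closest training points."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training points: ((x, y), class label).
training = [
    ((1.5, 2.2), "square"),
    ((0.8, 1.0), "square"),
    ((0.5, 3.5), "square"),
    ((3.5, 0.5), "square"),
    ((2.5, 2.5), "triangle"),
    ((2.2, 1.4), "triangle"),
    ((4.0, 4.0), "triangle"),
]
test_sample = (2.0, 2.0)

print(knn_classify(training, test_sample, k=3))  # triangle (2 triangles vs 1 square)
print(knn_classify(training, test_sample, k=5))  # square (3 squares vs 2 triangles)
```

Notice that the answer changes with k for this made-up data, which is why choosing k is one of the few decisions kNN asks of us.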

When to Use K-Nearest Neighbor
Because it’s insensitive to outliers, kNN is incredibly effective when used with large sets of noisy data. kNN also does not make any predetermined assumptions about data which makes it useful when comparing nonlinear data. Accuracy is also high out-of-the-box without the need for complex, highly manipulated training data which makes speed of implementation attractive. It can also be used for both classification and predictive problems making it a highly versatile solution.

When to Avoid K-Nearest Neighbor
Don’t use kNN if computational requirements are a concern. Because each new piece of data is compared to every attribute in a given class’s training set typically stored in memory, the CPU and memory required to support it are high.

If you’re working with clean linear data that does not contain a lot of outliers, kNN might not be the right choice for you.

Algorithm #7: Support Vector Machines (SVMs)

Made popular by their success in handwritten digit recognition, Support Vector Machines (or SVMs for short) are based on the idea that you can find a hyperplane that best divides a dataset into two classes. Support vectors are the data points closest to the hyperplane. The position of the hyperplane changes as the position of the support vectors shifts (as shown below):