# Top 10 Machine Learning Algorithms You Should Know

Posted by Ivan Smith in Machine Learning & Artificial Intelligence

Before we begin, let’s talk a little about the difference between machine
learning and artificial intelligence, as the two are often thought to be one
and the same. We’ll also briefly explore what we mean by ‘algorithms’ in the
context of machine learning.

In the media, we often hear marketers use terms like machine learning,
algorithms and artificial intelligence interchangeably. Although conceptually
related, each is a distinct concept:

**What is Machine Learning?**

Machine Learning describes the methods through which we train computer systems to get better at specific tasks. These tasks can vary in complexity but are typically not intelligent (at least by standard definitions) in how they function. We use machine learning to help our computer system interpret, classify and make predictions on new data, which we can use to further enhance and develop our problem-solving system.

**What is Artificial Intelligence?**

Artificial Intelligence is a field of computer research in which scientists attempt to emulate human intelligence in every way. Despite what is reported in the media, there is to date no artificial intelligence system that meets all the requirements of mimicking human intelligence completely. When we read about advanced computing systems like ‘Google AI’ and ‘Microsoft AI’, what we’re really talking about are breakthroughs in, and the repurposing of, machine learning techniques, both old and new, that have been popularized in the last few years to solve specific data problems.

Machine learning is an incredibly powerful tool when applied to big data problems. It can reduce the amount of manual effort required to process noisy data by making inferences on the data before our analysis begins. It can also classify and cluster data to give us a better understanding of where differences in the data lie -- but it isn’t a cure-all. To apply machine learning techniques effectively, a component of human guidance in model selection and testing is usually required.

**What is an Algorithm?**

When we talk about algorithms in the context of machine learning, we’re referring to a process of solving a problem using a calculation or logic process typically executed using computer code. This can be either a simple mathematical calculation or a series of calculations that give us an expected (or sometimes unexpected) result.

Throughout this article, we’ll refer to mathematical notations used in machine learning algorithms which are an expression of the logic processes and calculations required to demonstrate a ‘learning’ computer program. Don't be intimidated. If you’re in the process of designing a machine learning algorithm, your job will be to translate these notations into functional programs which is typically much less complicated than it appears.

**Types of Machine Learning Algorithms**

Machine learning algorithms are typically broken down into three
categories:

__Supervised Learning__

Supervised learning algorithms require a ‘trainer’ or set of training data that has
been correctly organized and labeled. With supervised learning, machine learning
systems are provided with ‘an answer’ in the form of training data used to
correctly learn or categorize new data. Supervised learning algorithms are most
effective where the relationships and correlations in our data are clear.
Logistic regression, Naïve Bayes, kNN and SVMs are all examples of supervised
learning algorithms.

__Unsupervised Learning__

Unsupervised learning algorithms do not require training data, as associations are
learned from new 'input' data that is processed by our algorithm. We use unsupervised learning to
determine what differences and similarities are present in the data and
categorize or cluster them accordingly. Unsupervised learning algorithms are
most effective with complex data, where relationships are either too complex to
guess or correlation is unknown. Clustering or grouping the information can
then be used to determine what relationships, if any, exist. A data scientist
can then analyze those relationships and determine whether they are meaningful
and whether any causation accounts for those correlations. Apriori and K-Means
are both examples of unsupervised learning algorithms.

__Reinforcement Learning__

You can think of reinforcement learning algorithms as supervised learning
algorithms that have self-improvement mechanisms built in. Typically,
reinforcement learning algorithms improve their models by updating the
training data with new data. That data is selected using a score or
predetermined value that estimates how well a data set has been categorized,
and is then either added to the training set for future comparison or discarded.

**Common Algorithm Tasks**

As with any complex computer system, different machine learning
algorithms can be used to solve specific machine learning problems. In this
article, we’ll break our algorithms down into the following two categories of
typical machine learning tasks:

__Predictive Tasks__

Algorithms we use to make specific predictions with continuous output values. In machine learning, we use predictions to let our system make assumptions about data, evaluate those assumptions and adjust its predictions, improving as more predictions are made.

__Classification Tasks__

Algorithms we use to classify data to determine if it belongs to one class or another. Classification allows our machine learning system to interpret new data and categorize it as it relates to the dataset as a whole. This is often employed in deep learning systems, where broad categories of information must be broken down and analyzed as part of a subset group or class.

**A Word About Deep Learning**

Before we get started, I want to comment on another buzzword making its way
around the internet like the Halley’s Comet of unavoidable hype: ‘deep
learning’. It’s hard to open a discussion about machine learning without
comments about the many advancements made with ‘deep learning’ and how it will
change the internet and our lives forever. This may or may not be true, but
before we jump to conclusions we should first understand what ‘deep learning’
actually means.

Deep Learning, by standard definition, is simply an extension or culmination of
several different machine learning algorithms working together (typically based
on deep neural networks, discussed later in this article) that produce a
theoretically improved result. The idea of a super system of machine learning
techniques is an attractive one, but the computing requirements of deep
learning systems are enormous and typically ill-suited to solving specific
business problems.

Companies like Google use deep learning systems to build tools capable of
complex, previously unachievable human-like characteristics. This includes
things like image recognition and generation, concept mapping and tone-of-voice
analysis. Keep in mind that mimicking a subset of human-like qualities and
building a fully intelligent computer system (often advertised as 'AI') are two
very different things. It’s also important to note that many of these concepts,
though recently repurposed and applied in innovative ways, are not new. Some of
the techniques discussed in this article date back to the 1950s and have been in use in one form or another since that time.

So with that said and without further ado, here are the top 10
algorithms you should have in your toolbox to help solve your data problems:

**Algorithm #1: Linear Regression**

Linear Regression is one of the simplest implementations of
machine learning predictive modeling algorithms available. If you’ve ever taken
an intro statistics course, you’ve probably used linear regression before.
Think of linear regression as fitting a straight line through a series of
points. That line gives you an idea of what ‘best fits’ your data and lets you
reasonably predict where the data is heading.

The most common method used in linear regression is known as least squares. The equation looks like this:

**Y = A + BX**

The A in the equation refers to the y-intercept and B refers to the slope of
the least squares regression line. X refers to the input variable, and Y is
the value we want to predict.

__When to Use Linear Regression__

Use linear regression for machine learning when you have a simple, correlated
data set with a visually apparent linear pattern you want to predict a specific outcome on. For
example, consider the following dataset of height vs. weight distribution:

The asterisk (*) symbol
represents our data points, while the solid line represents our calculated
regression line. You’ll notice the relationship of our data points visually
appears to be linear in its distribution, which allows us to apply a straight
‘best fit’ line and predict for our value X:

**Weight = 80 + 2 x (90) = 260 lbs.**

If an individual is 90 inches tall, we can expect them to be approximately 260
lbs based on our data.
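This worked example can be sketched in a few lines of Python with NumPy's least squares fit; the height/weight points below are made up so that the fit recovers the article's line (A = 80, B = 2):

```python
import numpy as np

# Hypothetical height (inches) vs. weight (lbs) observations,
# fabricated to lie exactly on the line Y = 80 + 2X.
heights = np.array([60, 63, 66, 69, 72, 75])
weights = np.array([200, 206, 212, 218, 224, 230])

# np.polyfit with deg=1 performs a least squares straight-line fit
# and returns the slope (B) and intercept (A) of Y = A + BX.
slope, intercept = np.polyfit(heights, weights, deg=1)

# Predict the weight of a 90-inch-tall individual.
predicted = intercept + slope * 90
print(round(predicted))  # → 260
```

With real, noisy data the fitted line will not pass through every point, but the prediction step stays the same.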

__When to Avoid Linear Regression__

Anytime you have noisy data, or data that does not appear to have a linear
relationship, avoid using linear regression for predictive purposes when
building your machine learning algorithm. Linear regression makes a series of
assumptions, including an existing relationship between your X and Y values
(in this case, height and weight). You’ll need to make sure the relationship
exists and a straight ‘best fit’ line can be used before applying linear
regression to your model.

**Algorithm #2: Non-Linear Regression**

Similar to linear regression, non-linear regression attempts to normalize and provide a method for predicting a value based on correlated data. Unlike linear regression however, non-linear regression attempts to solve the problem by applying a curved line instead of a straight one that best fits our data for predictive purposes.

**If you have a data set that won’t fit a straight or linear line but a curved line will do the job, consider applying a non-linear regression method to fit your data.**

__When to Use Non-Linear Regression__

From a machine learning perspective, just like linear regression, non-linear regression is typically only useful for creating predictive models. Depending on the type of curve your data appears to fit, you can select from a number of different mathematical equations to accurately predict an outcome based on your data. The role of the data scientist or algorithm designer is to select the model that best fits the data, which in turn gives us a model we can use to make more accurate predictions. Take a look at the following examples of non-linear regression equations and their associated curve lines:

__Exponential Curve__

**Y = A + BC^X**

__Parabolic Curve__

**Y = A + B ∗ (X – C)^2**

__Gaussian Curve__

**Y = A ∗ B^((X – C)^2)**

Visual inspection of your
plotted data is normally required when applying a regression ‘best fit’ method
for data predictions. Below is an example of a data set used with Clarsentia’s
free-to-use search tool, where a non-linear regression method has been applied
to predict how sentiment for the Retail Trade industry will change in the
future:

By continuously evaluating our existing predictive models and developing new
ones as new data becomes available, we’re able to accurately predict human
sentiment about specific subjects and entire market segments, giving our
customers an idea of where public sentiment is headed.
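As an illustrative sketch (not Clarsentia's actual method), the exponential model above can be fitted with SciPy's `curve_fit`; the data and parameter values here are synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

# The exponential model from above: Y = A + B * C**X.
def exponential(x, a, b, c):
    return a + b * c ** x

# Synthetic data generated from known parameters A=1, B=2, C=1.5.
x = np.linspace(0, 5, 20)
y = exponential(x, 1.0, 2.0, 1.5)

# Non-linear least squares recovers the parameters from the data;
# p0 is the initial guess the optimizer starts from.
params, _ = curve_fit(exponential, x, y, p0=[1.0, 1.0, 1.2])
print(params.round(2))  # approximately [1. 2. 1.5]
```

On real data, selecting the right model (exponential, parabolic, Gaussian) for the visible curve is the part that still requires human judgment.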

__When to Avoid Non-linear Regression__

You should avoid using non-linear regression techniques if your data does not
appear to fit a curved line, if you need to predict more than one value, or if
the correlation between variables (inferred relationships) is unknown.

**Algorithm #3: Logistic Regression**

The third and final regression
technique we’ll discuss in the context of machine learning is logistic
regression. Logistic regression is used in machine learning primarily to solve
classification problems.

Up until this point, we’ve discussed how regression can be used to predict
specific values (as with our linear regression height vs. weight example) based
on our data. With logistic regression, we’ll look at the probability
(likelihood) of whether data can be classified in a specific group.

The key to understanding logistic regression is that we are transforming our
data into a set of probabilities (likelihood of occurrence), then using those
probabilities to determine whether data belongs to a group which we can use as a predictor for a specific outcome.

Logistic regression accomplishes this by dividing datasets into different
regions through a boundary line (which is why it is still considered linear). Calculating the decision boundary is what allows us to classify
the resulting probabilities and determine if a data set belongs to a specific class.

Have a look at the graph below to see a representation of how the logistic
regression algorithm operates:

You’ll note in the above example that our boundary line has separated our data
into two binary categories -- 0 (which represents data that cannot be
classified) and 1 (which represents data that can). Our y-axis is always on a
scale of 0 to 1, with the boundary discerning what data can be classified
(higher probability within an acceptable margin of error), and what data
cannot.
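A minimal sketch of this idea in scikit-learn, using a made-up one-dimensional dataset (hours studied vs. a pass/fail outcome):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: hours studied; label: 1 = passed, 0 = failed.
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.5], [4.0], [4.5], [5.0], [5.5]])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# The model outputs probabilities; points where P(class 1) exceeds 0.5
# fall on one side of the decision boundary and are classified as 1.
print(model.predict([[1.0], [5.0]]))   # → [0 1]
print(model.predict_proba([[3.0]]))    # probabilities near the boundary
```

The `predict_proba` call exposes the underlying probabilities, which is what separates logistic regression from a plain hard classifier.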

__When to Use Logistic Regression__

With logistic regression, our result is always a binary probability (meaning it is either a 0 or 1 result, yes or no). This means that in the context of machine learning it’s often best suited for solving classification problems where we need to know whether data belongs to a discrete set of classes (a specific group of data sets).

__When to Avoid Logistic Regression__

If you need to calculate a specific predicted value and your data does not
need to be classified or broken down into a binary (yes or no) result, avoid
logistic regression. If you have a limited dataset to work from, other
classifier algorithms may perform better and converge faster than logistic
regression.

**Algorithm #4: Naïve Bayes**

Naïve Bayes is a classifier algorithm made popular by email classifying systems
used for identifying spam email. The
word ‘naïve’ in the name Naïve Bayes derives from the fact that the algorithm
uses Bayesian techniques but does not take into account dependencies that may
exist (hence ‘naïve’). This seems counterintuitive but yields surprisingly good
results.

Like logistic regression discussed above, Naïve Bayes classifies by calculating
a probability a data set belongs to a certain class. Unlike logistic regression
however, probabilities are not calculated by minimizing the error rate but by
building a model that could generate the data.

To understand the Bayesian technique, let’s have a look at the equation we use
to calculate our probability:

**P(c|x) = (P(x|c) ∗ P(c)) / P(x)**

Bayes theorem provides a way of calculating the posterior probability, P(c|x),
from class prior probability (P(c)), predictor prior probability (P(x)), and
likelihood (P(x|c)). The Naïve Bayes classifier assumes that the effect of the
value of a predictor (x) on a given class (c) is independent of the values of
other predictors. This assumption is called class conditional independence.

To understand the formula better, consider the following example where we want
to build a classifier that can identify whether a message is spam or not.

In our training set, we’ve identified the following:

**35 emails out of a total of 74 are spam messages.**
**50 emails out of those 74 contain the word “xxx”.**
**25 emails containing the word “xxx” have been marked as spam.**
**28 emails out of the total contain the word “viagra”.**
**26 of those emails have been marked as spam.**

Using the Naïve Bayes formula above, what is the probability
that an email is spam, given that it contains both “viagra” and “xxx”?

**P(spam|xxx,viagra) = (P(xxx|spam) ∗ P(viagra|spam) ∗ P(spam)) / (P(xxx) ∗ P(viagra))**

**P(spam|xxx,viagra) = ((25/35) ∗ (26/35) ∗ (35/74)) / ((50/74) ∗ (28/74)) = 0.98**

So our calculated probability is near certainty (0.98) that the email is spam.
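The calculation above is simple enough to check in a few lines of Python:

```python
# Counts from the training set described above.
total_emails, spam_emails = 74, 35
xxx_in_spam, xxx_total = 25, 50
viagra_in_spam, viagra_total = 26, 28

p_spam = spam_emails / total_emails                 # P(spam)
p_xxx_given_spam = xxx_in_spam / spam_emails        # P(xxx|spam)
p_viagra_given_spam = viagra_in_spam / spam_emails  # P(viagra|spam)
p_xxx = xxx_total / total_emails                    # P(xxx)
p_viagra = viagra_total / total_emails              # P(viagra)

# Naïve Bayes: multiply the per-word likelihoods, treating them
# as independent (the 'naïve' assumption).
posterior = (p_xxx_given_spam * p_viagra_given_spam * p_spam) / (p_xxx * p_viagra)
print(round(posterior, 2))  # → 0.98
```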

**A key strength of Naïve Bayes is that it performs well when we have multiple classes (such as category classification for news articles). Due to the naïve nature of the algorithm (assumption based) it also requires less training data to implement which means we can accurately classify data with minimal training examples to start with.**

__When to Use Naïve Bayes__

**As with logistic regression, Naïve Bayes is generally best used for classification purposes where the probability of a ‘best fit’ needs to be calculated.**

__When to Avoid Naïve Bayes__

**Naïve Bayes tends to underperform on larger datasets, where the inherent conditional independence assumption does not hold as well. As more training data becomes available, other classifiers may perform better with large training sets.**

**Algorithm #5: Random Forests**

Random forests start with a standard machine learning technique called a “decision tree”, where an input is entered at the top, traverses down the tree and conditionally gets bucketed into smaller and smaller sets. If you’re familiar with programming, you can think of a decision tree as a series of nested ‘if / else’ statements. Below is an example of a simple decision tree:

Here we are determining if we can classify someone as a male or female based on
their height and weight attributes. When we create a large collection of those
decision trees and group them together to come up with a certain prediction or
classification, we get what’s known as a ‘forest’.
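In code, the nested ‘if / else’ view of a single decision tree might look like this (the thresholds here are made up for illustration):

```python
def classify(height_in, weight_lbs):
    """A single hand-built decision tree: each branch tests one
    attribute and buckets the input into a smaller set."""
    if height_in > 67:
        if weight_lbs > 160:
            return "male"
        return "female"
    if weight_lbs > 150:
        return "male"
    return "female"

print(classify(72, 190))  # → male
print(classify(64, 125))  # → female
```

A random forest trains many such trees, each on a different random slice of the data, and lets them vote.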

The ‘randomness’ to our forest comes from how our algorithm selects subsets of
our training data and how we select the attributes to split decisions on (shown
as height and weight as above). When we randomly select at least some of those
attributes, we get interesting results which when combined, provide us with a
better prediction overall.

A random forest is an example of an ensemble, which means it is a combination
of predictive outputs from different models (other decision trees in this
case). This makes it very powerful in both its predictive and classification
capabilities as it’s able to leverage the ‘wisdom of the crowd’ by generating
multiple decision trees and combining them to determine the best possible
prediction on our data.
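A brief sketch with scikit-learn's ensemble implementation; the height/weight samples and labels below are fabricated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical [height (in), weight (lbs)] samples; 0 = female, 1 = male.
X = np.array([[63, 120], [64, 130], [65, 125], [66, 140],
              [69, 170], [70, 180], [71, 175], [72, 190]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# 100 decision trees, each trained on a bootstrap sample of the data
# with a random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The forest's prediction is the majority vote of its trees.
print(forest.predict([[68, 160]]))
```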

__When to Use Random Forests__

Because of their versatility, random forests are a good choice for both complex
predictive and classification problems. Random forests are able to compute
individual tree results quickly, while the ensemble aspect allows them to
perform with a high degree of accuracy.

__When to Avoid Random Forests__

**Because random forest results are very difficult to interpret and visualize (known as a ‘black box’ problem), understanding the results of the model and identifying problems in predictive accuracy is challenging. Random forests also tend to overfit ‘noisy’ data (where random outliers in our data are unnecessarily included in calculating our result).**

Random forests are also traditionally computationally expensive and can be complex to implement. Consider a simpler algorithm, such as a regression technique, if the data will fit closely using those methods.

**Algorithm #6: K-Nearest Neighbors (kNN)**

Widely used with a track record of success, K-Nearest Neighbors (referred to as kNN for short) is a simple classification and predictive algorithm that stores and compares all the attributes of a dataset against the attributes of a class to determine the similarity and likelihood that a new piece of data belongs to that class. This simple concept has proven to be enormously powerful and effective at classifying noisy and uncorrelated data.

The kNN algorithm is based on feature similarity: how closely out-of-sample features resemble our training set determines how we classify a given data point. kNN is also considered a ‘lazy’ (as opposed to ‘eager’) learning algorithm, in that training data typically does not need to be manipulated beyond being assigned to a specific class.

The similarity to another class determines whether new data is categorized as belonging to one class or another:

In the above example, the test sample (green circle) should be classified either into the first class of blue squares or into the second class of red triangles (its ‘nearest neighbors’). If k = 3 (solid line circle), it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle.
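The figure's logic is easy to sketch directly in Python; the points below are invented to mirror the two-triangles-one-square vote:

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest neighbors."""
    nearest = sorted(training, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data standing in for the figure's blue squares and red triangles.
training = [
    ((1.0, 1.0), "square"), ((1.5, 2.0), "square"), ((3.0, 4.0), "square"),
    ((5.0, 5.0), "triangle"), ((5.5, 4.5), "triangle"), ((6.0, 6.0), "triangle"),
]

# The 3 nearest neighbors here are 2 triangles and 1 square, so triangle wins.
print(knn_classify((4.0, 4.3), training, k=3))  # → triangle
```

Note that all the work happens at query time, which is exactly why kNN is called a ‘lazy’ learner.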

__When to Use K-Nearest Neighbor__

**Because it’s insensitive to outliers, kNN is incredibly effective when used with large sets of noisy data. kNN also does not make any predetermined assumptions about data, which makes it useful when comparing nonlinear data. Accuracy is also high out-of-the-box without the need for complex, highly manipulated training data, which makes speed of implementation attractive. It can also be used for both classification and predictive problems, making it a highly versatile solution.**

__When to Avoid K-Nearest Neighbor__

Don’t use kNN if computational requirements are a concern. Because each new piece of data is compared to every attribute in a given class’s training set, typically stored in memory, the CPU and memory required to support it are high.

If you’re working with clean linear data that does not contain a lot of outliers, kNN might not be the right choice for you.

**Algorithm #7: Support Vector Machines (SVMs)**

Made popular by its success in handwritten digit recognition, the Support Vector Machine (or SVM for short) is based on the idea that you can find a hyperplane that best divides a dataset into two classes. Support vectors are the data points closest to the hyperplane. The position of the hyperplane changes as the position of the support vectors shifts (as shown below):
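A minimal sketch using scikit-learn's `SVC` with a linear kernel on made-up, linearly separable data; the fitted model exposes the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (fabricated points).
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [6, 6], [6, 7], [7, 6], [7, 7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A linear-kernel SVM finds the maximum-margin hyperplane between classes.
svm = SVC(kernel="linear").fit(X, y)

# support_vectors_ holds the training points closest to the hyperplane.
print(svm.support_vectors_)
print(svm.predict([[2, 3], [6, 5]]))  # → [0 1]
```

Moving any non-support point would leave the hyperplane unchanged; only shifting the support vectors themselves repositions the boundary.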