Introduction to Naive Bayes

I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simple. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion.

Motivating Problem

Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.)

Suppose we own a professional networking site similar to LinkedIn. Users sign up, type some information about themselves, and then roam the network looking for jobs/connections/etc. Until recently, we only required users to enter their current job title, but now we’re asking them what industry they work in. New users are supplying this info as they sign up, but old users aren’t bothering to update their information. So, we need to build a text classification model to do it for them. Our data looks like this

| job_title                      | job_category |
|--------------------------------|--------------|
| underwriter manager            | finance      |
| mortgage data analyst          | finance      |
| junior underwriter             | finance      |
| sales manager                  | sales        |
| junior medical sales associate | sales        |
| senior sales manager           | sales        |
| sales manager                  | sales        |
| senior data analyst            | technology   |
| data analyst                   | technology   |
| data manager                   | technology   |
| data analyst manager           | NA           |
| junior data analyst            | NA           |
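
If you’d like to follow along without the linked scripts, here’s a minimal in-memory version of the table above in Python (the variable names `train` and `unlabeled` are just my own shorthand):

```python
# A minimal in-memory version of the data above; the variable names are my own.
train = [
    ("underwriter manager",            "finance"),
    ("mortgage data analyst",          "finance"),
    ("junior underwriter",             "finance"),
    ("sales manager",                  "sales"),
    ("junior medical sales associate", "sales"),
    ("senior sales manager",           "sales"),
    ("sales manager",                  "sales"),
    ("senior data analyst",            "technology"),
    ("data analyst",                   "technology"),
    ("data manager",                   "technology"),
]
unlabeled = ["data analyst manager", "junior data analyst"]
```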

Our goal is to estimate the probability that each of the two unlabeled job titles belongs to Technology, Sales, or Finance, at which point we can make our best guess for the user. Formalizing this a bit, we want to find $ p(C_k \vert \text{job\_title}) $ where $ C_1, C_2, \text{ and } C_3 $ are the classes Technology, Sales, and Finance. (Note: this type of problem is called Document Classification.)

How about that first unlabeled title, “data analyst manager”? We should probably label it as Technology, but how do we train a computer to figure that out? If we had trillions of training samples we might be able to estimate $ p(C_k \vert \textrm{job\_title}=\textrm{“data analyst manager”}) $ empirically (i.e. by measuring the relative frequency of each class for samples where job_title = “data analyst manager”). Unfortunately we only have 10 training samples (none of which have the title “data analyst manager”) so we’ll have to be a little more creative in our approach.

The word “data” seems pretty important. It occurs in all of the Technology samples, none of the Sales samples and only one of the Finance samples. On the other hand the word “manager” appears in every single category, so it’s probably not as useful. The big takeaway here is that we can use word occurrences to build a probabilistic model. Let’s start tracking words then…

Finance

| job_title              | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
|------------------------|---------|-----------|------|--------|---------|---------|----------|-------|--------|-------------|
| underwriter manager    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| mortgage data analyst  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| junior underwriter     | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

Sales

| job_title                      | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
|--------------------------------|---------|-----------|------|--------|---------|---------|----------|-------|--------|-------------|
| sales manager                  | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| junior medical sales associate | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| senior sales manager           | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| sales manager                  | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

Technology

| job_title           | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
|---------------------|---------|-----------|------|--------|---------|---------|----------|-------|--------|-------------|
| senior data analyst | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| data analyst        | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| data manager        | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

Unlabeled

| job_title            | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
|----------------------|---------|-----------|------|--------|---------|---------|----------|-------|--------|-------------|
| data analyst manager | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| junior data analyst  | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

Updating our model a bit, we want to find $ p(C_k \vert \mathbf{x}) $ where $ \mathbf{x} $ is our feature vector of word occurrences. In the case of “data analyst manager”, $ \mathbf{x} = (x_1, x_2, \ldots, x_{10}) = (1,0,1,0,1,0,0,0,0,0) $.
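
Here’s one way to build those binary word-occurrence vectors, continuing from the `train` list defined earlier. Sorting the vocabulary alphabetically just reproduces the column order of the tables above; `vocab` and `to_vector` are my own names:

```python
# Build the vocabulary (sorted, so columns match the tables above) and a
# binary word-occurrence vector for each title.
vocab = sorted({word for title, _ in train for word in title.split()})

def to_vector(title):
    words = set(title.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)                              # ['analyst', 'associate', 'data', ...]
print(to_vector("data analyst manager"))  # [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
```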

Bayes' Theorem

Before we continue, let’s review some probability rules and Bayes' Theorem

$$ \begin{aligned} P(A\mid B) &= \frac{P(A \cap B)}{P(B)} \Rightarrow P(A \cap B) = P(A\mid B)\,P(B) \\ P(B\mid A) &= \frac{P(A \cap B)}{P(A)} \Rightarrow P(A \cap B) = P(B\mid A)\,P(A) \\ \Rightarrow P(A\mid B) &= \frac{P(B\mid A)\,P(A)}{P(B)}, \text{ if } P(B) \neq 0 \text{ (Bayes' Theorem)} \end{aligned} $$

For us this means

$$ \boxed{p(C_k \vert \mathbf{x}) = \frac{p(\mathbf{x} \vert C_k)p(C_k)}{p(\mathbf{x})}} $$

Let’s break down those pieces.

  • $ p(C_k) $ = frequency of class $ C_k $ / total number of samples
  • $ p(\mathbf{x} \vert C_k) $ = (frequency of $ \mathbf{x} $ / number of samples) where class = $ C_k $
  • $ p(\mathbf{x}) $ = frequency of $ \mathbf{x} $ / total number of samples

$ p(C_k) $

This is the easy part. To calculate $ p(C_k) $ we can use empirical relative frequencies given by our training data.

$$ \begin{aligned} p(\text{Technology}) &= \frac{3}{10} = 0.3 \\ p(\text{Sales}) &= \frac{4}{10} = 0.4 \\ p(\text{Finance}) &= \frac{3}{10} = 0.3 \end{aligned} $$
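
In code, continuing from the earlier sketch, the empirical priors are just relative class frequencies:

```python
from collections import Counter

# Empirical priors: relative frequency of each class in the training data.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}
print(priors)  # {'finance': 0.3, 'sales': 0.4, 'technology': 0.3}
```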

These probabilities are called priors. Our method of estimating them using the training data is common, but not necessary. Suppose we have reason to believe that the true priors for Technology, Sales, and Finance are 0.2, 0.4, and 0.4 respectively. With a Naive Bayes model, we can just plug those babies in for $ p(C_k) $ and our model will adjust accordingly. By contrast, using priors for tree-based models like Random Forest is not nearly as easy.


$ p(\mathbf{x} \vert C_k) $

Now let’s consider $ p(\mathbf{x} \vert C_k) $. Using our training data we would calculate $ p(\mathbf{x} \vert \text{Technology}) $ to be 0. (Remember, $ p(\mathbf{x} \vert \text{Technology}) $ represents the probability that all of the words “data”, “analyst”, and “manager”, and none of the others, appear in a random job title given that it’s in the Technology class. In our training samples this never occurred, so our empirical probability estimate is 0.) This is a problem. We know $ p(\mathbf{x} \vert \text{Technology}) $ should be greater than 0, but we don’t have enough (or in this case, any) samples to accurately estimate it. The way we’ll get around this problem is to make a naive assumption: we’ll assume that the features (i.e. the occurrences of the individual words in a job title) are conditionally independent given the class. Obviously this is not strictly true. When the word “data” appears in a job title, it immediately increases the probability that the word “analyst” also appears in the title. However, let’s assume the assumption is valid (or close to being valid). Then

$ p(\mathbf{x} \vert C_k) = p(x_1 | C_k)p(x_2 | C_k) \ldots p(x_{10} | C_k) $

For our example “data analyst manager” this means

$ p(\mathbf{x}|\text{Technology}) = p(x_1=1|\text{Tech})p(x_2=0|\text{Tech}) \ldots p(x_{10}=0|\text{Tech}) = \frac{2}{3}\frac{3}{3} \ldots \frac{3}{3} = 0.1481 $

Similarly, we can calculate $ p(\mathbf{x}|\text{Sales})=0 $ and $ p(\mathbf{x}|\text{Finance})=0.005487 $.
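
Here’s a rough sketch of those likelihood calculations, continuing from the earlier snippets. The helper names `word_prob` and `likelihood` are hypothetical, and the `alpha` argument can be ignored for now (it will matter once we add smoothing):

```python
def word_prob(word, present, c, alpha=0.0):
    """Estimate p(x_i = present | C_k) from relative frequencies.
    alpha is an additive-smoothing parameter (0 = no smoothing, used later)."""
    titles = [t for t, label in train if label == c]
    count = sum((word in t.split()) == present for t in titles)
    return (count + alpha) / (len(titles) + alpha * 2)  # each x_i takes 2 values

def likelihood(title, c, alpha=0.0):
    """p(x | C_k) under the naive (conditional independence) assumption."""
    words = set(title.split())
    p = 1.0
    for w in vocab:
        p *= word_prob(w, w in words, c, alpha)
    return p

print(likelihood("data analyst manager", "technology"))  # ~0.1481
print(likelihood("data analyst manager", "sales"))       # 0.0
print(likelihood("data analyst manager", "finance"))     # ~0.005487
```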


$ p(\mathbf{x}) $

We run into the same issue as before when we try to estimate $ p(\mathbf{x}) $. According to our training data $ p(\mathbf{x}) = 0 $, but the true value of $ p(\mathbf{x}) $ should obviously be > 0. However, consider our end-goal which is to determine $ p(\text{Technology} \vert \mathbf{x}) $, $ p(\text{Sales} \vert \mathbf{x}) $, and $ p(\text{Finance} \vert \mathbf{x}) $. Per Bayes' Theorem, these probabilities are equivalent to

$$ \begin{aligned} p(\text{Technology} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Technology})p(\text{Technology})}{p(\mathbf{x})} \\ p(\text{Sales} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Sales})p(\text{Sales})}{p(\mathbf{x})} \\ p(\text{Finance} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Finance})p(\text{Finance})}{p(\mathbf{x})} \end{aligned} $$

Notice that $ p(\mathbf{x}) $ is just a scaling factor. It affects the values of $ p(\text{Technology} \vert \mathbf{x}) $, $ p(\text{Sales} \vert \mathbf{x}) $, and $ p(\text{Finance} \vert \mathbf{x}) $ equally. Since these probabilities must sum to 1, we can solve for $ p(\mathbf{x}) $:

$$ \begin{aligned} 1 &= p(\text{Technology} \vert \mathbf{x}) + p(\text{Sales} \vert \mathbf{x}) + p(\text{Finance} \vert \mathbf{x}) \\ \Rightarrow 1 &= \frac{p(\text{Technology})p(\mathbf{x} \vert \text{Technology})}{p(\mathbf{x})} + \frac{p(\text{Sales})p(\mathbf{x} \vert \text{Sales})}{p(\mathbf{x})} + \frac{p(\text{Finance})p(\mathbf{x} \vert \text{Finance})}{p(\mathbf{x})} \\ \Rightarrow p(\mathbf{x}) &= p(\text{Technology})p(\mathbf{x} \vert \text{Technology}) + p(\text{Sales})p(\mathbf{x} \vert \text{Sales}) + p(\text{Finance})p(\mathbf{x} \vert \text{Finance}) \end{aligned} $$

So, in our example we can calculate $ p(\mathbf{x}) $ to be 0.04609.
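
Continuing the sketch, $ p(\mathbf{x}) $ is just the prior-weighted sum of the class-conditional likelihoods:

```python
# p(x) is the prior-weighted sum of the class-conditional likelihoods.
x = "data analyst manager"
p_x = sum(priors[c] * likelihood(x, c) for c in priors)
print(round(p_x, 5))  # ~0.04609
```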


Putting it all together

We can now generate probability estimates for “data analyst manager”:

$$ \begin{aligned} p(\text{Technology} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Technology})p(\text{Technology})}{p(\mathbf{x})} = \frac{0.3 \times 0.1481}{0.04609} = 0.9643 \\ p(\text{Sales} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Sales})p(\text{Sales})}{p(\mathbf{x})} = \frac{0.4 \times 0}{0.04609} = 0 \\ p(\text{Finance} \vert \mathbf{x}) &= \frac{p(\mathbf{x} \vert \text{Finance})p(\text{Finance})}{p(\mathbf{x})} = \frac{0.3 \times 0.005487}{0.04609} = 0.03571 \end{aligned} $$
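
The same calculation as a small helper, continuing from the earlier snippets (the `posteriors` name is my own):

```python
def posteriors(title, alpha=0.0):
    """p(C_k | x) for every class, normalized so the values sum to 1."""
    joint = {c: priors[c] * likelihood(title, c, alpha) for c in priors}
    p_x = sum(joint.values())
    return {c: p / p_x for c, p in joint.items()}

print(posteriors("data analyst manager"))
# -> {'finance': ~0.0357, 'sales': 0.0, 'technology': ~0.9643}
```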

Finally, if we want to turn this model into a classifier we just need to use a decision rule such as “label the sample with the class that has the highest probability”.
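
In code, that decision rule is just an argmax over the posterior probabilities:

```python
def classify(title, alpha=0.0):
    """Decision rule: pick the class with the highest posterior probability."""
    post = posteriors(title, alpha)
    return max(post, key=post.get)

print(classify("data analyst manager"))  # 'technology'
```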

Awesome! But there’s still a nagging issue… Suppose we try to classify the title “junior data analyst”. With our current model, we’d get

$$ \begin{aligned} p(\mathbf{x}|\text{Technology}) &= p(x_1=1|\text{Tech})p(x_2=0|\text{Tech}) \ldots p(x_{4}=1|\text{Tech}) \ldots = \frac{2}{3}\frac{3}{3} \ldots \frac{0}{3} \ldots = 0 \\ p(\mathbf{x}|\text{Sales}) &= p(x_1=1|\text{Sales})p(x_2=0|\text{Sales}) \ldots p(x_{4}=1|\text{Sales}) \ldots = \frac{0}{4}\frac{3}{4} \ldots \frac{1}{4} \ldots = 0 \\ p(\mathbf{x}|\text{Finance}) &= p(x_1=1|\text{Finance})p(x_2=0|\text{Finance}) \ldots p(x_{4}=1|\text{Finance}) \ldots = \frac{1}{3}\frac{3}{3} \ldots \frac{1}{3} \ldots = 0.005487 \end{aligned} $$

Our intuition tells us that “junior data analyst” is more likely to be a Technology job than a Finance job, but since the word “junior” never appears in any of the Technology training samples, $ p(x_{4}=1|\text{Technology}) = 0 $, which ultimately forces $ p(\text{Technology} \vert \mathbf{x}) $ to be 0. To deal with this issue, we’ll introduce something called Additive Smoothing, which ensures that $ p(x_i|C_k) $ is never 0 by estimating $ p(x_i|C_k) = \frac{\text{frequency}(x_i \vert C_k) + \alpha}{\text{frequency}(C_k) + \alpha n_i} $, where $ n_i $ is the number of possible values for $ x_i $ (in our example, 2). When $ \alpha = 1 $ this is known as Laplace Smoothing. In our example, Laplace Smoothing produces the following:

$$ \begin{aligned} p(\mathbf{x}|\text{Technology}) &= \frac{2+1}{3+1 \cdot 2}\frac{3+1}{3+1 \cdot 2} \ldots \frac{0+1}{3+1 \cdot 2} \ldots = 0.01132 \Rightarrow p(\text{Technology}|\mathbf{x}) = 0.7431 \\ p(\mathbf{x}|\text{Sales}) &= \frac{0+1}{4+1 \cdot 2}\frac{3+1}{4+1 \cdot 2} \ldots \frac{1+1}{4+1 \cdot 2} \ldots = 0.0001058 \Rightarrow p(\text{Sales}|\mathbf{x}) = 0.00926 \\ p(\mathbf{x}|\text{Finance}) &= \frac{1+1}{3+1 \cdot 2}\frac{3+1}{3+1 \cdot 2} \ldots \frac{1+1}{3+1 \cdot 2} \ldots = 0.003775 \Rightarrow p(\text{Finance}|\mathbf{x}) = 0.2477 \end{aligned} $$
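
Because the earlier `likelihood` sketch already accepts an `alpha` argument, Laplace Smoothing is just a matter of passing `alpha=1`:

```python
# alpha=1 is Laplace Smoothing: zero counts no longer zero out the product.
print(posteriors("junior data analyst", alpha=1.0))
# -> {'finance': ~0.248, 'sales': ~0.009, 'technology': ~0.743}
print(classify("junior data analyst", alpha=1.0))  # 'technology'
```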

Generalizing

Recall our model, which is a direct translation of Bayes' Theorem:

$$ p(C_k \vert \mathbf{x}) = \frac{p(\mathbf{x} \vert C_k)p(C_k)}{p(\mathbf{x})} $$

From our naive assumption, we transformed $ p(\mathbf{x} \vert C_k) $ into the product $ \prod_{i=1}^n p(x_i \vert C_k) $. Then we calculated each $ p(x_i \vert C_k) $ using relative frequencies. The underlying argument for using relative frequencies is that each $ x_i $ is a Bernoulli random variable (in our example $ x_i $ is either 0 or 1). However, in the general case $ x_i $ can come from any distribution and our Naive Bayes model will still work as long as we can closely estimate $ p(x_i \vert C_k) $. This includes multinomial random variables (commonly used in document classification problems like ours, but where word repeats can occur) as well as continuous random variables. That means we can include things like a user’s bio (word frequencies) and age as additional features in our model.
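
If you’d rather not hand-roll this, scikit-learn (assuming it’s installed) covers these cases: `BernoulliNB` mirrors the binary word-occurrence model we built above, `MultinomialNB` handles word counts, and `GaussianNB` handles continuous features. A minimal sketch on our toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

titles = [t for t, _ in train]
labels = [c for _, c in train]

vec = CountVectorizer(binary=True)   # 0/1 word occurrences, like our tables
X = vec.fit_transform(titles)

model = BernoulliNB(alpha=1.0)       # alpha=1.0 -> Laplace smoothing
model.fit(X, labels)

X_new = vec.transform(unlabeled)
print(model.classes_)                        # column order of the probabilities
print(model.predict(X_new))                  # predicted job_category
print(model.predict_proba(X_new).round(3))   # posterior probability estimates
```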