The Engineering Placement Quiz was built by a group of five Management Engineering students for their Fourth Year Design Project.
The quiz is powered by a Naive Bayes classifier algorithm - one of the most efficient, intuitive and effective algorithms for applying machine learning to large data sets.
The Naive Bayes algorithm leverages Bayes’ Theorem to calculate the probability of a sample belonging to a certain category. Speaking in terms of The Quiz, it is calculating the probability that you (the sample) should belong to each of the fifteen engineering programs offered at the University of Waterloo (the categories). It is calculating these probabilities based on the ~2000 data points it has been trained with.
In machine learning we are often interested in selecting the best hypothesis (H) given a set of data (D). Bayes’ Theorem provides a way to calculate the probability of a hypothesis given the available data.
Bayes’ Theorem Equation:
$$P(H \mid D) = \frac{P(D \mid H) \times P(H)}{P(D)}$$
Where:
P(H | D) is the probability of hypothesis H given the data D. This is called the posterior probability.
P(D | H) is the probability of data D given that the hypothesis H was true.
P(H) is the probability of hypothesis H being true (regardless of the data). This is called the prior probability of H.
P(D) is the probability of the data (regardless of the hypothesis).
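As a minimal sketch, the theorem can be written as a one-line Python function. The function name `posterior` and the input values are illustrative assumptions, not part of the quiz's actual implementation.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(H | D) = P(D | H) * P(H) / P(D)."""
    if evidence <= 0:
        raise ValueError("P(D) must be positive")
    return likelihood * prior / evidence


# Illustrative values only: P(D | H) = 0.8, P(H) = 0.5, P(D) = 0.5
p = posterior(likelihood=0.8, prior=0.5, evidence=0.5)
print(p)
```

A larger prior or likelihood raises the posterior, while more common data (a larger P(D)) lowers it.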
Let’s say we have data on 1000 pieces of fruit. Each fruit is a lemon, a mango or some other fruit. Imagine we also know 3 features of each fruit: whether it’s sour or not, round or not and yellow or not. We’ve organized all this data in the table below.
| Fruit | Sour | Round | Yellow | Total |
|-------|------|-------|--------|-------|
| Lemon | 400  | 350   | 450    | 500   |
| Mango | 0    | 150   | 300    | 300   |
| Other | 100  | 150   | 50     | 200   |
| Total | 500  | 650   | 800    | 1000  |
Just from looking at the table, we already know that:
50% of the total fruits are lemons
30% are mangos
20% are other fruits
Of the 500 lemons, 400 (80%) are sour, 350 (70%) are round and 450 (90%) are yellow
Of the 300 mangos, 0 are sour, 150 (50%) are round and 300 (100%) are yellow
Of the other 200 fruits, 100 (50%) are sour, 150 (75%) are round and 50 (25%) are yellow
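The percentages above can be derived directly from the counts in the table. The following is a sketch under the assumption that the table is stored as a nested dictionary; the variable names (`counts`, `priors`, `likelihoods`) are hypothetical.

```python
# Counts copied from the fruit table.
counts = {
    "Lemon": {"Sour": 400, "Round": 350, "Yellow": 450, "Total": 500},
    "Mango": {"Sour": 0, "Round": 150, "Yellow": 300, "Total": 300},
    "Other": {"Sour": 100, "Round": 150, "Yellow": 50, "Total": 200},
}
features = ("Sour", "Round", "Yellow")

total_fruit = sum(row["Total"] for row in counts.values())

# Prior P(fruit): share of each fruit among all fruit.
priors = {fruit: row["Total"] / total_fruit for fruit, row in counts.items()}

# Likelihood P(feature | fruit): share of that fruit showing the feature.
likelihoods = {
    fruit: {feat: row[feat] / row["Total"] for feat in features}
    for fruit, row in counts.items()
}

for fruit in counts:
    print(fruit, priors[fruit], likelihoods[fruit])
```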
Let’s say we’re given the features of an additional piece of fruit and we want to predict what type of fruit it is (its class). We’re told that the fruit is sour, round and yellow. We can use Bayes’ Theorem to classify whether it’s a lemon, a mango or some other fruit.
General Formula:
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$
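Lemon:
$$P(\text{Lemon} \mid \text{Sour}, \text{Round}, \text{Yellow}) = \frac{P(\text{Sour} \mid \text{Lemon}) \times P(\text{Round} \mid \text{Lemon}) \times P(\text{Yellow} \mid \text{Lemon}) \times P(\text{Lemon})}{P(\text{Sour}) \times P(\text{Round}) \times P(\text{Yellow})}$$
The denominator, P(Sour) x P(Round) x P(Yellow) = (0.5) x (0.65) x (0.8) = 0.26 from the table totals, is the same for every fruit, so it does not change which fruit scores highest. We therefore compare only the numerators:
$$P(\text{Lemon} \mid \text{Sour}, \text{Round}, \text{Yellow}) \propto 0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252$$
Mango:
$$P(\text{Mango} \mid \text{Sour}, \text{Round}, \text{Yellow}) = \frac{P(\text{Sour} \mid \text{Mango}) \times P(\text{Round} \mid \text{Mango}) \times P(\text{Yellow} \mid \text{Mango}) \times P(\text{Mango})}{P(\text{Sour}) \times P(\text{Round}) \times P(\text{Yellow})}$$
$$P(\text{Mango} \mid \text{Sour}, \text{Round}, \text{Yellow}) \propto 0 \times 0.5 \times 1.0 \times 0.3 = 0$$
Other:
$$P(\text{Other} \mid \text{Sour}, \text{Round}, \text{Yellow}) = \frac{P(\text{Sour} \mid \text{Other}) \times P(\text{Round} \mid \text{Other}) \times P(\text{Yellow} \mid \text{Other}) \times P(\text{Other})}{P(\text{Sour}) \times P(\text{Round}) \times P(\text{Yellow})}$$
$$P(\text{Other} \mid \text{Sour}, \text{Round}, \text{Yellow}) \propto 0.5 \times 0.75 \times 0.25 \times 0.2 = 0.01875$$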
Therefore, based on the highest score (0.252 for lemon), we can predict that this sour, round and yellow fruit is, in fact, a lemon.
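The whole classification can be sketched in a few lines of Python. This is an illustration of the fruit example only, not the quiz's actual code; the names `score`, `observed` and `prediction` are hypothetical. Each score is the numerator of Bayes' Theorem (the prior times the per-feature likelihoods); the denominator is left out because it is identical for every class and does not affect which class wins.

```python
from math import prod

# Priors and likelihoods taken from the fruit table percentages.
priors = {"Lemon": 0.5, "Mango": 0.3, "Other": 0.2}
likelihoods = {
    "Lemon": {"Sour": 0.8, "Round": 0.7, "Yellow": 0.9},
    "Mango": {"Sour": 0.0, "Round": 0.5, "Yellow": 1.0},
    "Other": {"Sour": 0.5, "Round": 0.75, "Yellow": 0.25},
}


def score(fruit: str, observed: list[str]) -> float:
    """Unnormalized Naive Bayes score: P(fruit) * product of P(feature | fruit)."""
    return priors[fruit] * prod(likelihoods[fruit][feat] for feat in observed)


observed = ["Sour", "Round", "Yellow"]
scores = {fruit: score(fruit, observed) for fruit in priors}

# The predicted class is the one with the highest score.
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Note that the mango score is exactly zero because no sour mango appears in the table; a single zero likelihood eliminates a class regardless of its other features.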