# Naïve Bayes Classifier

## Instructions

It's called "naive" because it ignores word order ("dear friend give me money" is treated the same as "money give dear friend me"). This gives it high bias but low variance.

There are two models of the Naive Bayes classifier (minimal code sketches for both appear under Code Sketches at the end of this note):

### Multinomial

#### Model Training

Step 1: the model takes a training set of classified data (for example, spam and not-spam emails). We then calculate the Prior probability: the chance that a given message is spam or not, before looking at its words. For example, if 8 out of 10 messages are not spam, that prior is 0.8.

Step 2: for each word in the emails, we calculate the Likelihood of it appearing, which is the number of times that word appeared out of the total number of words (a histogram). Each word gets two likelihoods: one for spam messages and one for not-spam messages.

For example:

spam:
- prior: 0.2
- "friend": 0.32
- "dear": 0.15

not spam:
- prior: 0.8
- "friend": 0.5
- "dear": 0.7

#### Testing

We can now use the model to classify new messages. We calculate two probabilities: one for the message being spam given its words, and one for it being not spam given its words. Whichever has the higher probability becomes our selected classification.

The message "dear friend" will receive:

- if it's spam: $0.2 \times 0.32 \times 0.15 = 0.0096$
- if it's not spam: $0.8 \times 0.5 \times 0.7 = 0.28$

Since $0.28 > 0.0096$, we classify this message as not spam.

#### Limitations

Some words might not appear at all in the training data for one of the classes. Those words would then have a likelihood of 0, which causes errors on the test data, since multiplying by zero gives zero no matter what the other likelihoods are. To solve this, we usually add *alpha*, a pseudocount (typically 1) to every word's count, so that no likelihood is ever zero. This is known as Laplace smoothing.

### Gaussian

This model is very similar in its logic to the multinomial one, and works better with numeric values. For each feature, it takes the mean and the standard deviation to calculate the (Jump:: [[Gaussian curve]]), which is the curve of the distribution of values around the mean.

#### Training

For example: our population is divided into two groups, Lord of the Rings lovers and Star Trek lovers. Each group gets its own Gaussian curve for each of the numeric features (such as how many times they watch sci-fi movies, how many rings they own, and how much TV they watch).

#### Testing

For each test sample, we calculate the probability of it belonging to each group, based on the likelihood of its feature values under that group's Gaussian curves. Whichever group gets the highest probability is the one the sample is classified into.

#### Limitation

Multiplying many likelihoods produces values so close to zero that they cause underflow errors, so it is common practice to sum the *log* values of the probabilities instead.

## Overview

🔼Topic:: [[NLP (MOC)]]
◀Origin:: [[StatQuest]]
🔗Link:: https://www.youtube.com/embed/O2L2Uv9pdDA
https://www.youtube.com/watch?v=H3EjCKtlVog

<iframe width="560" height="315" src="https://www.youtube.com/embed/O2L2Uv9pdDA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<iframe width="560" height="315" src="https://www.youtube.com/embed/H3EjCKtlVog" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
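
## Code Sketches

A minimal sketch of the multinomial training and testing steps above, assuming a toy tokenized corpus; the function names, the corpus, and the `alpha` handling are illustrative assumptions, not from the video:

```python
from collections import Counter

def train(messages, labels, alpha=1):
    """messages: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    # Prior: the fraction of training messages in each class.
    priors = {c: labels.count(c) / len(labels) for c in classes}
    # Per-class word counts (the histograms from the training step).
    counts = {c: Counter() for c in classes}
    for tokens, label in zip(messages, labels):
        counts[label].update(tokens)
    vocab = {w for tokens in messages for w in tokens}
    # Likelihood of each word per class; alpha is added to every count
    # so that words unseen in one class never get probability zero.
    likelihoods = {
        c: {
            w: (counts[c][w] + alpha) / (sum(counts[c].values()) + alpha * len(vocab))
            for w in vocab
        }
        for c in classes
    }
    return priors, likelihoods, vocab

def classify(tokens, priors, likelihoods, vocab):
    # Score each class as prior * product of word likelihoods,
    # then pick the class with the higher probability.
    scores = {}
    for c in priors:
        score = priors[c]
        for w in tokens:
            if w in vocab:  # skip words never seen anywhere in training
                score *= likelihoods[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage, mirroring the "dear friend" example above:
msgs = [["dear", "friend"], ["money", "money", "friend"], ["dear", "friend", "lunch"]]
labs = ["not spam", "spam", "not spam"]
priors, likelihoods, vocab = train(msgs, labs)
print(classify(["dear", "friend"], priors, likelihoods, vocab))  # -> "not spam"
```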
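
And a matching Gaussian sketch under the same caveat: the two groups and their toy feature vectors are made up for illustration. It fits one Gaussian curve per feature per group and sums log probabilities, as the Limitation above recommends:

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Height of the Gaussian curve at x for a group with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def train(samples_by_group):
    """samples_by_group: {group name: list of equal-length numeric feature vectors}."""
    # Fit one (mean, std dev) pair per feature, i.e. one Gaussian curve each.
    return {
        group: [(mean(col), stdev(col)) for col in zip(*rows)]
        for group, rows in samples_by_group.items()
    }

def classify(x, params, priors):
    scores = {}
    for group, feats in params.items():
        # Sum log probabilities instead of multiplying raw ones,
        # to avoid the underflow mentioned above.
        score = math.log(priors[group])
        for value, (mu, sigma) in zip(x, feats):
            score += math.log(gaussian_pdf(value, mu, sigma))
        scores[group] = score
    return max(scores, key=scores.get)

# Toy usage: features are (sci-fi movies watched, rings owned, TV hours).
data = {
    "LOTR lovers": [[2.0, 5.0, 3.0], [3.0, 4.0, 2.5], [2.5, 6.0, 3.5]],
    "Star Trek lovers": [[9.0, 1.0, 6.0], [8.0, 0.0, 5.5], [10.0, 1.0, 7.0]],
}
params = train(data)
priors = {g: len(rows) / 6 for g, rows in data.items()}
print(classify([8.5, 1.0, 6.0], params, priors))  # -> "Star Trek lovers"
```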