
Notes on Naive Bayes from Statistical Learning Methods (《统计学习方法》)

Naive Bayes is a generative model and a Bayesian classification algorithm built on Bayes' theorem. To follow its derivation you need to understand prior, conditional, and posterior probabilities, as well as the conditional probability formula and the law of total probability, from which Bayes' formula itself can be derived. There are plenty of online resources on these concepts; a few that I found easy to follow:
Understanding the law of total probability and Bayes' formula
Conditional probability / total probability / Bayes' formula
An intuitive explanation of Bayes' formula (prior and posterior probability)
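
The chain from prior to posterior can be made concrete with a small worked example (the disease-testing numbers below are made up for illustration, not taken from the resources above):

```python
# Made-up numbers: a disease affects 1% of a population, and a test has
# P(positive | disease) = 0.95 and false-positive rate P(positive | healthy) = 0.05.

p_disease = 0.01            # prior probability P(D)
p_pos_given_disease = 0.95  # likelihood P(+ | D)
p_pos_given_healthy = 0.05  # likelihood P(+ | not D)

# law of total probability: P(+) = P(+|D)P(D) + P(+|not D)P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' formula: posterior P(D | +) = P(+|D)P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 4))  # → 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior at about 16%, which is exactly the kind of prior-versus-posterior intuition the linked articles explain.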

The underlying assumption of Naive Bayes is that the features are conditionally independent given the class. In practice this assumption rarely holds exactly, so classification quality can suffer when there are many features or when features are correlated. On the other hand, it is precisely this assumption that drastically reduces the number of conditional probabilities that must be estimated, simplifying both learning and prediction and making the method efficient.

Despite this restrictive assumption, Naive Bayes has real strengths (as noted, for example, in 6 Easy Steps to Learn Naive Bayes Algorithm (with codes in Python and R)):

4 Applications of Naive Bayes Algorithms

  • Real-time Prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used to make predictions in real time
  • Multi-class Prediction: the algorithm is also well known for multi-class prediction; it can estimate the probability of each class of the target variable
  • Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to strong results in multi-class problems and the independence assumption) and achieve higher success rates than many other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (identifying positive and negative customer sentiment in social media analysis)
  • Recommendation Systems: a Naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource

In short: Naive Bayes makes predictions simply and quickly, performs well on multi-class problems, works well for text classification / spam filtering / sentiment analysis, and is suitable for recommendation systems.

Three common Naive Bayes models:

  • Multinomial model: the most common, generally used for discrete feature values. Laplace smoothing (Bayesian estimation) is applied, because plain maximum likelihood estimation can produce probabilities of 0
  • Gaussian model: generally used for continuous feature values such as height and weight. It assumes each feature dimension follows a Gaussian distribution, so a mean and variance must be estimated per feature
  • Bernoulli model: also used for discrete feature values, but unlike the multinomial model the features are boolean (0 or 1), so an extra binarization step is needed
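
scikit-learn (already used below for train_test_split) ships all three variants; a minimal sketch on made-up random data, just to show which input each model expects:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20)  # two toy classes

# multinomial model: non-negative discrete counts (e.g. word frequencies)
X_counts = rng.integers(0, 5, size=(20, 6))
# Gaussian model: continuous features (e.g. height, weight)
X_cont = rng.normal(size=(20, 6))
# Bernoulli model: boolean features (the binarize parameter thresholds other input)
X_bool = rng.integers(0, 2, size=(20, 6))

preds = {}
for model, X in [(MultinomialNB(alpha=1.0), X_counts),
                 (GaussianNB(), X_cont),
                 (BernoulliNB(alpha=1.0), X_bool)]:
    # alpha is the Laplace-smoothing constant for the discrete models
    preds[type(model).__name__] = model.fit(X, y).predict(X)
```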

From the derivation in Statistical Learning Methods, Naive Bayes amounts to scoring every possible class: first estimate the prior probability of each class, then estimate the conditional probability of each feature value given that class, and finally output the class whose posterior probability is largest.
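
These three steps can be traced by hand on a tiny made-up dataset (invented here for illustration, not the book's Example 4.1):

```python
from collections import Counter

# toy training data: each sample is ((feature1, feature2), label)
samples = [(("a", "x"), 1), (("a", "y"), 1), (("b", "y"), 0),
           (("b", "x"), 0), (("a", "x"), 1)]
labels = [y for _, y in samples]
n = len(labels)

# step 1: prior P(Y = c)
prior = {c: cnt / n for c, cnt in Counter(labels).items()}

# step 2: conditional P(X_j = v | Y = c) for each feature dimension j
cond = {}
for c in prior:
    subset = [x for x, y in samples if y == c]
    for j in range(2):
        for v, cnt in Counter(x[j] for x in subset).items():
            cond[(c, j, v)] = cnt / len(subset)

# step 3: classify a new point by the largest (unnormalized) posterior
test = ("a", "x")
score = {c: prior[c] * cond.get((c, 0, test[0]), 0) * cond.get((c, 1, test[1]), 0)
         for c in prior}
print(max(score, key=score.get))  # → 1
```

No smoothing is used here, which is exactly why class 0 scores zero for a feature value it never saw; the multinomial implementation below adds Laplace smoothing to avoid that.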

Combining Algorithm 4.1 with the MNIST digit-recognition dataset, a simple implementation of the Naive Bayes algorithm (multinomial model) looks like this:

import numpy as np
from collections import defaultdict

class MultinomialNB:
    '''
    fit parameters:
        X     training data
        y     labels
        alpha the positive smoothing constant λ of the Bayesian estimate
    predict parameters:
        test  a single test sample
    '''
    def fit(self, X, y, alpha=0):
        # group samples by class and count class frequencies
        feature_data = defaultdict(list)
        label_data = defaultdict(int)
        for feature, lab in zip(X, y):
            feature_data[lab].append(feature)
            label_data[lab] += 1

        # prior probability of each class (with smoothing)
        self.label = y
        n_classes = len(np.unique(self.label))
        self.pri_p_label = {k: (v + alpha) / (len(self.label) + n_classes * alpha)
                            for k, v in label_data.items()}

        # conditional probability of each feature value given the class
        self.cond_p_feature = defaultdict(dict)
        for i, sub in feature_data.items():
            sub = np.array(sub)
            for f_dim in range(sub.shape[1]):
                values = np.unique(X[:, f_dim])
                for feature in values:
                    self.cond_p_feature[i][(f_dim, feature)] = \
                        (np.sum(sub[:, f_dim] == feature) + alpha) / (sub.shape[0] + len(values) * alpha)

    def predict(self, test):
        p_data = {}
        for sub_label in np.unique(self.label):
            # accumulate log-probabilities to avoid floating-point underflow
            p_data[sub_label] = np.log(self.pri_p_label[sub_label])
            for i in range(len(test)):
                cond_p = self.cond_p_feature[sub_label].get((i, test[i]))
                if cond_p:  # feature values never seen in training are skipped
                    p_data[sub_label] += np.log(cond_p)
        opt_label = max(p_data, key=p_data.get)
        return [opt_label, p_data.get(opt_label)]
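
The comment in predict mentions taking logs to guard against floating-point underflow; with MNIST's 784 pixel features this is not optional, as a quick illustration (probability value made up) shows:

```python
import math

# multiplying 784 probabilities of about 0.1 drops far below the smallest
# positive double (~1e-308), so the raw product collapses to exactly 0.0
prod = 1.0
for _ in range(784):
    prod *= 0.1
print(prod)  # → 0.0

# summing logs instead keeps the magnitudes finite and comparable across classes
log_p = 784 * math.log(0.1)
print(log_p)
```

Once every class underflows to 0.0, argmax over the products is meaningless, whereas the log-sums remain distinguishable.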

Now use the algorithm above to make predictions on the digit-recognition dataset and compute the test error rate. The data are split into an 80% training set and a 20% test set, and all non-zero pixel values are replaced with 1 (strictly speaking, after binarization the Bernoulli model would be the better fit, but let's stick with the multinomial model for now).

import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("train.csv")
dataset = np.array(dataset)
dataset[:, 1:][dataset[:, 1:] != 0] = 1  # binarize the pixels
label = dataset[:, 0]
# split into training and test sets
train_dat, test_dat, train_label, test_label = train_test_split(
    dataset[:, 1:], label, test_size=0.2, random_state=123456)
# build the NB model
model = MultinomialNB()
model.fit(X=train_dat, y=train_label, alpha=1)
# predict with the NB model
pl = {i: model.predict(test=test) for i, test in enumerate(test_dat)}
# print the test error rate (%)
error = sum(1 for k, v in pl.items() if test_label[k] != v[0])
print(error / len(test_label) * 100)

The final test error rate is 16.86%, i.e. the accuracy is lower than that of the KNN algorithm tested earlier.

References:
Naive Bayes algorithm (朴素贝叶斯算法)
Statistical Learning Methods | Naive Bayes: principles and implementation (统计学习方法|朴素贝叶斯原理剖析及实现)

This post originally appeared at http://www.bioinfo-scrounger.com. Please credit the source when reposting.