1  Proof of Bayes' Theorem

In this first chapter we prove the famous Bayes' theorem:

\[P(A|B) = \frac{P(A) P(B|A)}{P(B)} \]

We use the Penguins dataset to illustrate each step of the proof with concrete numbers.

import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
df = df.dropna()
df.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    MALE
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  FEMALE
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  FEMALE
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  FEMALE
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    MALE
df.shape
(333, 7)
set(df.species), set(df.island), set(df.sex)
({'Adelie', 'Chinstrap', 'Gentoo'},
 {'Biscoe', 'Dream', 'Torgersen'},
 {'FEMALE', 'MALE'})

The data contain three penguin species ('Adelie', 'Chinstrap', 'Gentoo') and two sexes ('FEMALE', 'MALE'). Only these two variables, species and sex, are used below.

We define event A as: the penguin is FEMALE. We define event B as: the penguin is Adelie.

1.1 A & B

First, let's look at \(P(A \& B)\), the probability that a penguin is both female and Adelie:

def prob_sex_and_species(df, sex_str, species_str):
    # Joint probability: share of ALL rows that match both the sex and the species.
    subset = df[(df['sex'] == sex_str) & (df['species'] == species_str)]
    return len(subset) / len(df)
prob_sex_and_species(df, sex_str='FEMALE', species_str='Adelie')
0.21921921921921922

1.2 A|B

Next we compute \(P(A|B)\), that is, P(Female|Adelie), directly from the data:

def prob_sex_given_species(df, sex_str, species_str):
    # Conditional probability: restrict to the species first, then take the share with the given sex.
    species_subset = df[df.species == species_str]
    sex_subset_within_species_subset = species_subset[species_subset.sex == sex_str]
    return len(sex_subset_within_species_subset)/len(species_subset)
prob_sex_given_species(df, 'FEMALE', 'Adelie')
0.5
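
As a cross-check on the two hand-written functions above, pandas can tabulate the joint and the conditional distribution in one call each. The following is a minimal sketch, assuming a small made-up sample named `toy` (it is not the penguins data); with the real data, `df` would be passed in its place.

```python
import pandas as pd

# Made-up toy sample (NOT the penguins data), small enough to check by hand.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex":     ["FEMALE", "MALE", "FEMALE", "MALE",
                "FEMALE", "MALE", "MALE", "MALE"],
})

# Joint distribution: each cell is P(sex & species); all cells together sum to 1.
joint = pd.crosstab(toy["sex"], toy["species"], normalize="all")

# Conditional distribution: each column is P(sex | species); every column sums to 1.
cond = pd.crosstab(toy["sex"], toy["species"], normalize="columns")

p_joint = joint.loc["FEMALE", "Adelie"]  # 2 of 8 rows are female Adelie
p_cond = cond.loc["FEMALE", "Adelie"]    # 2 of 4 Adelie rows are female
print(p_joint, p_cond)  # 0.25 0.5
```

The only difference between the two tables is the denominator: the whole sample for the joint probability, the species column total for the conditional one.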

1.3 Bayes' Theorem

First, we see that

\[P(A|B) = \frac{P(A\&B)}{P(B)} \]
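
For sample frequencies, this identity can be sketched directly in counts. The symbols below are notation introduced here for illustration and do not appear in the original: \(n\) is the total number of rows, \(n_B\) the number of rows where B holds, and \(n_{A\&B}\) the number of rows where both A and B hold. Assuming \(n_B > 0\):

\[P(A|B) = \frac{n_{A\&B}}{n_B} = \frac{n_{A\&B}/n}{n_B/n} = \frac{P(A\&B)}{P(B)} \]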

def prob_species(df, species_str):
    # Marginal probability: share of all rows that belong to the given species.
    subset = df[df.species == species_str]
    return len(subset)/len(df)
prob_species(df, 'Adelie')
0.43843843843843844
prob_sex_given_species(
    df, 'FEMALE', 'Adelie') == prob_sex_and_species(
    df, 'FEMALE', 'Adelie')/prob_species(df, 'Adelie')
True
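
The `==` comparison above returns True here, but exact equality between floating-point results is fragile in general. A minimal sketch of two safer checks follows, assuming the counts 73 and 146 recovered by multiplying the printed probabilities by the 333 rows; the variable names are illustrative only.

```python
from fractions import Fraction
import math

# Counts implied by the outputs above (assumption: recovered as probability * 333).
n = 333     # rows after dropna()
n_b = 146   # Adelie penguins:        0.43843843843843844 * 333
n_ab = 73   # female Adelie penguins: 0.21921921921921922 * 333

# Exact rational arithmetic: no rounding occurs, so == is safe.
p_ab = Fraction(n_ab, n)           # P(A & B)
p_b = Fraction(n_b, n)             # P(B)
p_a_given_b = Fraction(n_ab, n_b)  # P(A | B)
exact_match = (p_a_given_b == p_ab / p_b)

# Floating point: compare with a tolerance instead of ==.
float_match = math.isclose(n_ab / n_b, (n_ab / n) / (n_b / n))

print(p_a_given_b, exact_match, float_match)  # 1/2 True True
```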

We also know that

\[P(A\&B) = P(B\&A)\]

This needs no proof: \(A \& B\) and \(B \& A\) describe the same event.

Multiplying both sides of the definition of \(P(A|B)\) by \(P(B)\) gives

\[P(A\&B) = P(A|B) P(B)\]

By the same argument with the roles of A and B swapped,

\[P(B\&A) = P(B|A) P(A)\]

Therefore

\[ P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(B\&A)}{P(B)} = \frac{ P(A) P(B|A) }{P(B)} \]

This completes the proof.
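
The final formula can also be checked numerically, which requires \(P(A)\) and \(P(B|A)\) in addition to the quantities computed earlier. The following is a minimal sketch: the helpers `prob_sex` and `prob_species_given_sex` are hypothetical names modeled on the functions above, and `toy` is a made-up sample rather than the penguins data. With the real data, `df` would be passed in its place.

```python
import math
import pandas as pd

def prob_sex(df, sex_str):
    """P(A): share of all rows with the given sex (hypothetical helper)."""
    return (df["sex"] == sex_str).mean()

def prob_species(df, species_str):
    """P(B): share of all rows with the given species."""
    return (df["species"] == species_str).mean()

def prob_species_given_sex(df, species_str, sex_str):
    """P(B|A): share of the given species among rows with the given sex (hypothetical helper)."""
    sex_subset = df[df["sex"] == sex_str]
    return (sex_subset["species"] == species_str).mean()

def prob_sex_given_species(df, sex_str, species_str):
    """P(A|B): share of the given sex among rows of the given species."""
    species_subset = df[df["species"] == species_str]
    return (species_subset["sex"] == sex_str).mean()

# Made-up toy sample (NOT the penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex":     ["FEMALE", "MALE", "FEMALE", "MALE",
                "FEMALE", "MALE", "MALE", "MALE"],
})

# Left-hand side: P(A|B) computed directly.
lhs = prob_sex_given_species(toy, "FEMALE", "Adelie")

# Right-hand side: P(A) * P(B|A) / P(B), as in Bayes' theorem.
rhs = (prob_sex(toy, "FEMALE")
       * prob_species_given_sex(toy, "Adelie", "FEMALE")
       / prob_species(toy, "Adelie"))

# Compare with a tolerance, since both sides are floating-point numbers.
bayes_holds = math.isclose(lhs, rhs)
print(bayes_holds)  # True
```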