import pandas as pd
import numpy as np
1 Proof of Bayes' Theorem
In this first chapter we prove the celebrated Bayes' theorem:
\[P(A|B) = \frac{P(A) P(B|A)}{P(B)} \]
We verify it numerically on the Penguins dataset.
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
df = df.dropna()
df.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
df.shape
(333, 7)
set(df.species), set(df.island), set(df.sex)
({'Adelie', 'Chinstrap', 'Gentoo'},
{'Biscoe', 'Dream', 'Torgersen'},
{'FEMALE', 'MALE'})
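There are three penguin species ('Adelie', 'Chinstrap', 'Gentoo') and two sexes ('FEMALE', 'MALE'). Only these two variables are used below.
We define event A as "the penguin is FEMALE" and event B as "the penguin is an Adelie".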
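One way to make the two events concrete is to represent each as a boolean mask over the rows. The sketch below uses a small hypothetical frame named `toy` (not the real penguins data) so that it runs without a download; the mean of a boolean Series is the fraction of rows in the event.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned penguins frame.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex": ["FEMALE", "FEMALE", "MALE", "MALE",
            "FEMALE", "MALE", "FEMALE", "MALE"],
})

# Each event is a boolean Series: True where the row belongs to the event.
event_a = toy["sex"] == "FEMALE"      # A: the penguin is female
event_b = toy["species"] == "Adelie"  # B: the penguin is an Adelie

# The mean of a boolean Series is the fraction of True values.
p_a = event_a.mean()
p_b = event_b.mean()
p_a_and_b = (event_a & event_b).mean()
print(p_a, p_b, p_a_and_b)
```

The same three lines work unchanged on `df` once the real data is loaded.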
1.1 A & B
We first look at \(P(A \& B)\):
def prob_sex_and_species(df, sex_str, species_str):
    # Joint probability: rows matching both sex and species, over all rows.
    subset = df[(df['sex'] == sex_str) & (df['species'] == species_str)]
    return len(subset) / len(df)

prob_sex_and_species(df, sex_str='FEMALE', species_str='Adelie')
0.21921921921921922
1.2 A|B
We compute \(P(A|B)\), that is, P(Female|Adelie), directly:
def prob_sex_given_species(df, sex_str, species_str):
    # Conditional probability: restrict to the species first, then count the sex within it.
    species_subset = df[df.species == species_str]
    sex_subset_within_species_subset = species_subset[species_subset.sex == sex_str]
    return len(sex_subset_within_species_subset) / len(species_subset)

prob_sex_given_species(df, 'FEMALE', 'Adelie')
0.5
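As a cross-check on the hand-written function, `pd.crosstab` with `normalize="columns"` gives every P(sex | species) at once, because each column is divided by its own total. This is a minimal sketch on a hypothetical `toy` frame, not the real penguins data.

```python
import pandas as pd

# Hypothetical toy data: 4 Adelie (3 female), 2 Gentoo (1 female), 2 Chinstrap (0 female).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex": ["FEMALE", "FEMALE", "FEMALE", "MALE",
            "FEMALE", "MALE", "MALE", "MALE"],
})

# Rows are sexes, columns are species; each column sums to 1,
# so each cell is P(sex | species).
cond = pd.crosstab(toy["sex"], toy["species"], normalize="columns")
p_female_given_adelie = cond.loc["FEMALE", "Adelie"]
print(cond)
print(p_female_given_adelie)
```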
1.3 Bayes' Theorem
First, we observe that
\[P(A|B) = \frac{P(A\&B)}{P(B)} \]
def prob_species(df, species_str):
    # Marginal probability: rows of the given species, over all rows.
    subset = df[df.species == species_str]
    return len(subset) / len(df)

prob_species(df, 'Adelie')
0.43843843843843844
prob_sex_given_species(df, 'FEMALE', 'Adelie') == prob_sex_and_species(
    df, 'FEMALE', 'Adelie') / prob_species(df, 'Adelie')
True
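The exact `==` comparison succeeds here, but equality between floating-point quotients is not guaranteed in general, so a tolerance-based comparison is the safer habit. The sketch below assumes the counts implied by the printed outputs (146 Adelie and 73 female Adelie out of 333 rows, obtained by multiplying the probabilities by 333); the variable names are illustrative.

```python
import math

# Counts inferred from the outputs above (an assumption, not read from the data here):
# 0.438438... * 333 = 146 Adelie, 0.219219... * 333 = 73 female Adelie.
n_total, n_adelie, n_female_adelie = 333, 146, 73

p_a_and_b = n_female_adelie / n_total    # P(A & B)
p_b = n_adelie / n_total                 # P(B)
p_a_given_b = n_female_adelie / n_adelie # P(A|B), computed directly

# Compare with a relative tolerance instead of exact equality.
ratio = p_a_and_b / p_b
print(math.isclose(p_a_given_b, ratio))
```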
We also know that
\[P(A\&B) = P(B\&A)\]
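This hardly needs proof, since "A and B" and "B and A" describe the same event.
Rearranging the definition of conditional probability then gives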
\[P(A\&B) = P(A|B) P(B)\]
By the same reasoning with A and B swapped,
\[P(B\&A) = P(B|A) P(A)\]
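The product rule in this direction can be checked on data as well. The helpers `prob_sex` and `prob_species_given_sex` below are hypothetical additions written in the style of the functions above, and `toy` is made-up data used so the sketch runs on its own.

```python
import pandas as pd

# Hypothetical toy data: 4 females (3 of them Adelie) among 8 penguins.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex": ["FEMALE", "FEMALE", "FEMALE", "MALE",
            "FEMALE", "MALE", "MALE", "MALE"],
})

def prob_sex(df, sex_str):
    """P(A): fraction of rows with the given sex."""
    return len(df[df.sex == sex_str]) / len(df)

def prob_species_given_sex(df, species_str, sex_str):
    """P(B|A): fraction of the given sex that belongs to the species."""
    sex_subset = df[df.sex == sex_str]
    return len(sex_subset[sex_subset.species == species_str]) / len(sex_subset)

# P(B & A) counted directly versus P(B|A) * P(A).
joint = len(toy[(toy.sex == "FEMALE") & (toy.species == "Adelie")]) / len(toy)
product = prob_species_given_sex(toy, "Adelie", "FEMALE") * prob_sex(toy, "FEMALE")
print(joint, product)
```

Calling the two helpers with `df` instead of `toy` would give the values of P(B|A) and P(A) for the penguins data.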
Therefore
\[ P(A|B) = \frac{P(A \& B)}{P(B)} = \frac{P(B\&A)}{P(B)} = \frac{ P(A) P(B|A) }{P(B)} \]
This completes the proof.
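The full identity can also be confirmed from counts alone. A minimal sketch, assuming a hypothetical `toy` frame in place of the penguins data, uses `fractions.Fraction` so that both sides are compared exactly rather than up to floating-point rounding.

```python
from fractions import Fraction

import pandas as pd

# Hypothetical toy data: 8 penguins, 4 female, 4 Adelie, 3 female Adelie.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "sex": ["FEMALE", "FEMALE", "FEMALE", "MALE",
            "FEMALE", "MALE", "MALE", "MALE"],
})

is_a = toy["sex"] == "FEMALE"      # event A
is_b = toy["species"] == "Adelie"  # event B

# Plain Python ints so that Fraction receives exact integers.
n = len(toy)
n_a, n_b, n_ab = int(is_a.sum()), int(is_b.sum()), int((is_a & is_b).sum())

p_a = Fraction(n_a, n)            # P(A)
p_b = Fraction(n_b, n)            # P(B)
p_a_given_b = Fraction(n_ab, n_b) # P(A|B)
p_b_given_a = Fraction(n_ab, n_a) # P(B|A)

# Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B), checked with exact arithmetic.
bayes_rhs = p_a * p_b_given_a / p_b
print(p_a_given_b, bayes_rhs, p_a_given_b == bayes_rhs)
```

Because every quantity is a ratio of counts, the equality holds for any data in which both events occur at least once, which is the content of the derivation above.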