How to Perform Grubbs' Test in Python - Technical Experience - 卓越飞翔博客

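Introduction

Grubbs' test is a statistical hypothesis test used to detect outliers in a dataset. Outliers are observations that deviate from the distribution of the rest of the data, and are also called anomalies. Models trained on datasets with outliers tend to overfit more readily than models trained on data that follows a normal (Gaussian) distribution, so outliers need to be dealt with before machine learning modeling. Before handling them, we must first detect and locate the outliers in the dataset. The most popular outlier detection techniques are the Q-Q plot, the interquartile range, and the Grubbs statistical test. This article discusses only Grubbs' test. You will learn what Grubbs' test is and how to implement it in Python.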

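What is an outlier?

An outlier is an observation that lies numerically far from the other values in the data. Such values fall outside the range expected for normally distributed data. For data that follows a normal distribution, about 68% of the records lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three. A common quartile-based rule of thumb flags records more than 1.5 times the interquartile range below the first quartile or above the third quartile as outliers.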
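To make the quartile idea concrete, here is a minimal sketch of the 1.5 × IQR rule of thumb. The multiplier 1.5 and the variable names are assumptions chosen for illustration; the data is the sample array used later in this article.

```python
import numpy as np

# Same sample values the article uses later for the library examples.
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

# Quartiles and interquartile range (NumPy's default linear interpolation).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Assumed rule of thumb: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
iqr_outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, q3, iqr_outliers)
```

On this data only the value 40 falls outside the fences. Unlike Grubbs' test, this rule is a descriptive heuristic and not a hypothesis test.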

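Grubbs' statistical hypothesis test

Like any other statistical hypothesis test, Grubbs' test either rejects the null hypothesis (H0) in favor of the alternative hypothesis (H1) or fails to reject it. It is a test for detecting outliers in a dataset.

We can perform Grubbs' test in two ways, as a one-sided test or as a two-sided test, on a univariate dataset that is approximately normally distributed and has at least seven observations. The test is also known as the extreme studentized deviate test or the maximum normed residual test.

Grubbs' test uses the following hypotheses:

Null (H0): the dataset has no outliers.
Alternative (H1): the dataset has exactly one outlier.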
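Because the test assumes an approximately normal sample of sufficient size, a quick pre-check can be useful. The sketch below is one possible approach, not part of Grubbs' test itself: check_grubbs_assumptions is a hypothetical helper, the threshold of seven points follows the text above, and the Shapiro-Wilk test (scipy.stats.shapiro) is an assumed choice of normality check.

```python
import numpy as np
from scipy import stats

def check_grubbs_assumptions(data, alpha=0.05, min_n=7):
    """Hypothetical helper: rough pre-checks before running Grubbs' test."""
    values = np.asarray(data, dtype=float)
    # Shapiro-Wilk tests the null hypothesis that the sample is normal.
    statistic, p_value = stats.shapiro(values)
    return {
        "n": int(values.size),
        "enough_points": bool(values.size >= min_n),
        "shapiro_p": float(p_value),
        "looks_normal": bool(p_value > alpha),
    }

# Illustrative sample (the article's data without the suspected outlier).
sample = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29]
report = check_grubbs_assumptions(sample)
print(report)
```

Keep in mind that a strong outlier can itself cause a normality test to fail, so a check like this is best read as a rough guide.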

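Grubbs' test in Python

Python's large collection of libraries covers most programming tasks. These libraries provide built-in methods that can be used directly for many operations, including statistical tests. Python likewise has a library with methods for performing Grubbs' test to detect outliers. We will explore two ways to implement Grubbs' test in Python: the built-in functions of that library and a from-scratch implementation of the formula.

The outlier_utils library and smirnov_grubbs

Let's first install the outlier_utils library with the following command.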

!pip install outlier_utils

Now let's create a dataset that contains an outlier and perform Grubbs' test on it.

Two-sided Grubbs' test

Syntax

grubbs.test(data, alpha=.05)

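Parameters

data - a numeric vector of data values.

alpha - the significance level of the test.

Explanation

In this method, call the smirnov_grubbs.test() function from the outliers package (imported as grubbs in the example) and pass the data as input to run Grubbs' test.

Example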

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
 
#perform Grubbs' test
grubbs.test(data, alpha=.05)

Output

array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

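The code above loads the library and the data, then runs Grubbs' test on the data with the test method. This test checks both sides of the data, the minimum and the maximum, for an outlier. The data has only one outlier (40), which Grubbs' test removed from the returned array.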
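Since the call returns the filtered data and not the outliers themselves, one simple way to see what was dropped is to compare the input with the result. In this sketch the cleaned array is copied from the output shown above, not recomputed, and the variable names are illustrative.

```python
import numpy as np

data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
# Copied from the output shown above rather than recomputed here.
cleaned = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29])

# Unique values present in the input but absent from the filtered result.
removed = np.setdiff1d(data, cleaned)
print(removed)
```

np.setdiff1d works on unique values, so it would miss a removed point whose value also appears among the retained points. For this data that is not an issue.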

One-sided Grubbs' test

Syntax

grubbs.max_test(data, alpha=.05)

Explanation

For a one-sided Grubbs' test, call grubbs.min_test() to test whether the smallest value in the dataset is an outlier, or grubbs.max_test() to test whether the largest value is.

Example

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

#perform Grubbs' test to check whether the minimum value is an outlier
print(grubbs.min_test(data, alpha=.05))

#perform Grubbs' test to check whether the maximum value is an outlier
grubbs.max_test(data, alpha=.05)

Output

[ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29 40]
array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

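A one-sided Grubbs' test checks only one end of the data: the minimum or the maximum. Here min_test returns the data unchanged because the smallest value (5) is not flagged, while max_test removes the outlier (40) from the upper end.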
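For readers who want to see what a one-sided test computes, here is a from-scratch sketch using NumPy and SciPy. It is not the library's implementation: one_sided_grubbs is a hypothetical helper, it assumes the sample standard deviation (ddof=1), and it uses alpha / n for the one-sided critical value where the two-sided test uses alpha / (2n).

```python
import numpy as np
from scipy import stats

def one_sided_grubbs(data, side="max", alpha=0.05):
    """Hypothetical helper: test only the largest or only the smallest value."""
    values = np.asarray(data, dtype=float)
    n = values.size
    mean = values.mean()
    sd = values.std(ddof=1)  # assumption: sample standard deviation
    if side == "max":
        g = (values.max() - mean) / sd
    else:
        g = (mean - values.min()) / sd
    # One-sided test: alpha / n instead of alpha / (2 * n).
    t_crit = stats.t.ppf(1 - alpha / n, n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, bool(g > g_crit)

# Illustrative sample with one large value.
y = [12, 13, 14, 19, 21, 23, 45]
max_result = one_sided_grubbs(y, side="max")
min_result = one_sided_grubbs(y, side="min")
print(max_result)
print(min_result)
```

With this sample the maximum (45) is flagged and the minimum (12) is not.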

Formula implementation

Here we implement the following Grubbs' test formula in Python, using the NumPy and SciPy libraries.

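$$G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s}$$

Here $\bar{x}$ is the mean of the data and $s$ is its standard deviation.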

Syntax

g_calculated = numerator/sd_x
g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))

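Algorithm

The implementation steps are as follows:

Compute the mean of the dataset values.
Compute the standard deviation of the dataset values.
Compute the numerator of the Grubbs statistic as the largest absolute deviation of any value from the mean.
Divide the numerator by the standard deviation to get the calculated statistic.
Compute the critical value for the same sample size and significance level.
If the critical value is greater than the calculated value, the dataset has no outlier; otherwise it has one.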
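The steps above can be sketched as a small function that returns its results without printing them. This is an illustrative variant under stated assumptions: grubbs_two_sided is a hypothetical name, and the sketch uses the sample standard deviation (ddof=1), which gives a somewhat smaller statistic than NumPy's default ddof=0.

```python
import numpy as np
from scipy import stats

def grubbs_two_sided(data, alpha=0.05):
    """Sketch: return (G, critical value, index of the most extreme point)."""
    values = np.asarray(data, dtype=float)
    n = values.size
    deviations = np.abs(values - values.mean())  # steps 1 and 3
    sd = values.std(ddof=1)                      # step 2 (assumed ddof=1)
    idx = int(np.argmax(deviations))
    g = deviations[idx] / sd                     # step 4
    # Step 5: two-sided critical value from the t distribution.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, idx

g, g_crit, idx = grubbs_two_sided([12, 13, 14, 19, 21, 23, 45])
g_small, g_crit_small, _ = grubbs_two_sided([12, 13, 14, 19, 21, 23])
print(g > g_crit, idx)         # step 6 for the 7-point sample
print(g_small > g_crit_small)  # step 6 for the 6-point sample
```

Returning the index of the most extreme point makes it easy to see which observation the test is actually about.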

Example

import numpy as np
import scipy.stats as stats

## define data
x = np.array([12,13,14,19,21,23])
y = np.array([12,13,14,19,21,23,45])

## implement Grubbs test
def grubbs_test(x):
   n = len(x)
   mean_x = np.mean(x)
   # note: np.std defaults to the population standard deviation (ddof=0);
   # Grubbs' statistic is usually defined with the sample value (ddof=1)
   sd_x = np.std(x)
   # largest absolute deviation from the mean
   numerator = max(abs(x-mean_x))
   g_calculated = numerator/sd_x
   print("Grubbs Calculated Value:",g_calculated)
   # two-sided critical value at significance level 0.05
   t_value_1 = stats.t.ppf(1 - 0.05 / (2 * n), n - 2)
   g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
   print("Grubbs Critical Value:",g_critical)
   if g_critical > g_calculated:
      print("The calculated value is less than the critical value. Fail to reject the null hypothesis: no outlier detected\n")
   else:
      print("The calculated value exceeds the critical value. Reject the null hypothesis: an outlier is present\n")
grubbs_test(x)
grubbs_test(y)

Output

Grubbs Calculated Value: 1.4274928542926593
Grubbs Critical Value: 1.887145117792422
The calculated value is less than the critical value. Fail to reject the null hypothesis: no outlier detected

Grubbs Calculated Value: 2.2765147221587774
Grubbs Critical Value: 2.019968507680656
The calculated value exceeds the critical value. Reject the null hypothesis: an outlier is present

The results of Grubbs' test indicate that array x has no outlier, while y has one outlier.
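Grubbs' test addresses one outlier at a time, so a common extension is to repeat it, removing the flagged point each round until nothing more is flagged. The sketch below illustrates that idea under stated assumptions: grubbs_critical and remove_outliers_iteratively are hypothetical helper names, and the sample standard deviation (ddof=1) is used.

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    """Two-sided Grubbs critical value for a sample of size n."""
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

def remove_outliers_iteratively(data, alpha=0.05):
    """Hypothetical helper: repeat the two-sided test until no point is flagged."""
    values = np.asarray(data, dtype=float)
    removed = []
    while values.size > 2:  # the t distribution needs n - 2 >= 1
        sd = values.std(ddof=1)
        if sd == 0:
            break  # all remaining values are identical
        deviations = np.abs(values - values.mean())
        idx = int(np.argmax(deviations))
        if deviations[idx] / sd <= grubbs_critical(values.size, alpha):
            break  # most extreme point is not significant
        removed.append(float(values[idx]))
        values = np.delete(values, idx)
    return values, removed

cleaned, removed = remove_outliers_iteratively([12, 13, 14, 19, 21, 23, 45])
print(cleaned, removed)
```

For this sample the loop removes 45 and then stops. Repeated testing of this kind can miss outliers that mask one another, so treat it as a rough sketch and not a complete procedure.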

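Conclusion

In this article we looked at outliers and at Grubbs' test in Python. Let's summarize with a few key points.

Outliers are records that lie far outside the range of the rest of the data, for example well beyond the first and third quartiles.
Outliers do not fit the normal distribution followed by the rest of the dataset.
We can use the Grubbs statistical hypothesis test to detect outliers.
We can perform Grubbs' test with the built-in methods provided by the outlier_utils library.
The two-sided Grubbs' test detects and removes outliers on both the minimum and maximum sides.
The one-sided Grubbs' test, by contrast, checks only one side: either the minimum or the maximum.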
