Machine Learning in Practice (4): Logistic Regression, the Foundation of Classification

2025-09-03 10:09:01

Part 4: Logistic Regression, the Foundation of Classification

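In machine learning, logistic regression is one of the classic algorithms for classification. Despite the word "regression" in its name, it is a classification model, widely used for binary tasks such as spam detection and disease diagnosis. This article walks through the mathematics of logistic regression and then applies it to a binary classification task on the Iris dataset.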

The Mathematics of Logistic Regression

What Is Logistic Regression?

The core idea of logistic regression is to map the output of a linear model into the interval (0, 1) so that it can be read as a probability. The model is:

$$P(y=1|x) = \frac{1}{1 + e^{-(w_0 + w_1x_1 + w_2x_2 + ... + w_px_p)}}$$

where:

$P(y=1|x)$ is the probability that a sample belongs to class 1, given the input features $x$.

$w_0, w_1, ..., w_p$ are the model parameters.

$e$ is the base of the natural logarithm.

The final prediction is:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
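As a minimal sketch of the formula and threshold rule above, the snippet below computes $P(y=1|x)$ for a single sample and turns it into a class label. The bias `w0`, the weights `w`, and the sample `x` are made-up values chosen for illustration, not fitted parameters, and the helper names are hypothetical.

```python
import numpy as np

def predict_proba_one(x, w, w0):
    """Return P(y=1 | x) for one sample under a logistic regression model."""
    z = w0 + np.dot(w, x)  # linear part: w0 + w1*x1 + ... + wp*xp
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(x, w, w0, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return int(predict_proba_one(x, w, w0) >= threshold)

# Hypothetical parameters and input, for illustration only
w0 = -1.0
w = np.array([2.0, -1.0])
x = np.array([1.5, 0.5])

p = predict_proba_one(x, w, w0)  # here z = -1.0 + 3.0 - 0.5 = 1.5
label = predict_label(x, w, w0)
print(f"P(y=1|x) = {p:.3f}, predicted label = {label}")
```

Because the linear part is positive for this sample, the probability is above 0.5 and the predicted label is 1.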

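The Role of the Sigmoid Function

The key to logistic regression is the sigmoid function, which squashes the output of the linear model into the range (0, 1). The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = w_0 + w_1x_1 + w_2x_2 + ... + w_px_p$.

Figure 1: The sigmoid function (image description: as z approaches negative infinity, the function value approaches 0; as z approaches positive infinity, the function value approaches 1.)

The sigmoid function turns the continuous output of the linear model into a probability, which makes classification decisions straightforward.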
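To make the limiting behavior concrete, here is a small sketch that evaluates the sigmoid at a few values of z. It uses a numerically stable formulation (an implementation choice made here, not something the article prescribes) so that very large negative inputs do not overflow `np.exp`. It expects a 1-D array of inputs.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid for a 1-D array of inputs."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z_values = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
probs = sigmoid(z_values)
for z, p in zip(z_values, probs):
    print(f"z = {z:8.1f} -> sigma(z) = {p:.4f}")
```

Mathematically the output never reaches 0 or 1, but in floating point the extreme inputs round to those values. At z = 0 the function returns exactly 0.5, which is why the 0.5 threshold corresponds to the sign of z.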

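Decision Boundary and Probability Output

Decision Boundary

Logistic regression finds a hyperplane (a straight line in two dimensions) that separates the data into two classes. The decision boundary is defined by:

$$w_0 + w_1x_1 + w_2x_2 + ... + w_px_p = 0$$

All points that satisfy this equation form the decision boundary. On the boundary the linear part is zero, so $P(y=1|x) = \sigma(0) = 0.5$, which is exactly the classification threshold.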
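With two features, the boundary equation can be solved for $x_2$, giving the line $x_2 = -(w_0 + w_1x_1)/w_2$. The sketch below reads the fitted parameters from scikit-learn's `intercept_` and `coef_` attributes and checks that points on that line receive a probability of 0.5. For brevity it fits on all 100 Setosa and Versicolor samples, which is an assumption of this example and differs from the train/test split used in the practice section; `boundary_x2` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Two classes (Setosa, Versicolor) and two features (sepal length, sepal width)
iris = load_iris()
mask = iris.target != 2
X = iris.data[mask][:, :2]
y = iris.target[mask]

model = LogisticRegression()
model.fit(X, y)

w0 = model.intercept_[0]   # bias term
w1, w2 = model.coef_[0]    # one weight per feature

def boundary_x2(x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (assumes w2 != 0)."""
    return -(w0 + w1 * x1) / w2

x1_samples = np.array([4.5, 5.5, 6.5])
points = np.column_stack([x1_samples, boundary_x2(x1_samples)])

z = model.decision_function(points)         # linear part, should be about 0
probs = model.predict_proba(points)[:, 1]   # P(y=1|x), should be about 0.5
print("z on boundary:", z)
print("P(y=1|x) on boundary:", probs)
```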

Probability Output

Logistic regression does not only return a class label; it also outputs the probability that each sample belongs to a given class. For example:

If $P(y=1|x) = 0.8$, the model estimates an 80% probability that the sample belongs to class 1.
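In scikit-learn these probabilities are available through `predict_proba`. The following sketch, which assumes the same two-class, two-feature Iris subset and the same split settings as the practice section, prints the probabilities for a few test samples and compares thresholding at 0.5 with the labels returned by `predict`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target != 2
X = iris.data[mask][:, :2]
y = iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 0 holds P(y=0|x), column 1 holds P(y=1|x); each row sums to 1
proba = model.predict_proba(X_test)
labels_from_proba = (proba[:, 1] >= 0.5).astype(int)

for p, label in zip(proba[:5], labels_from_proba[:5]):
    print(f"P(y=0|x) = {p[0]:.3f}, P(y=1|x) = {p[1]:.3f} -> class {label}")

same = np.array_equal(labels_from_proba, model.predict(X_test))
print("Thresholding at 0.5 matches predict():", same)
```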

Evaluation Metrics for Classification Models

The following metrics are commonly used to evaluate a classification model:

1. Accuracy

Accuracy is the fraction of samples the model predicts correctly:

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}$$

2. Recall

Recall is the fraction of actual positive samples that the model correctly predicts as positive:

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

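3. F1 Score

The F1 score is the harmonic mean of precision and recall, where precision is the fraction of predicted positives that are actually positive, $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$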
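The three formulas above can be checked by hand on a small example. The sketch below counts true and false positives and negatives for ten made-up labels (hypothetical data, not taken from the Iris experiment), applies the formulas directly, and compares the results with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy:  {accuracy:.2f} (sklearn: {accuracy_score(y_true, y_pred):.2f})")
print(f"Precision: {precision:.2f} (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"Recall:    {recall:.2f} (sklearn: {recall_score(y_true, y_pred):.2f})")
print(f"F1 score:  {f1:.2f} (sklearn: {f1_score(y_true, y_pred):.2f})")
```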

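Practice: Binary Classification on the Iris Dataset with Logistic Regression

About the Dataset

The Iris dataset contains 150 records. Each record has 4 features (sepal length, sepal width, petal length, petal width) and 1 label (the iris species). We will use only the first two classes (Setosa and Versicolor) for the binary classification task.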
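Before training, it can help to confirm these numbers directly. This short sketch loads the dataset with scikit-learn's `load_iris`, prints its shape and class counts, and then keeps only the first two classes; the variable names `X_binary` and `y_binary` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
print("Shape:", iris.data.shape)                      # records x features
print("Features:", iris.feature_names)
print("Classes:", list(iris.target_names))
print("Samples per class:", np.bincount(iris.target))

# Keep only classes 0 (Setosa) and 1 (Versicolor)
mask = iris.target != 2
X_binary = iris.data[mask]
y_binary = iris.target[mask]
print("Binary subset shape:", X_binary.shape)
print("Binary samples per class:", np.bincount(y_binary))
```

The subset has 100 samples, 50 per class, so the binary task is balanced.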

Full Code

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix

# Load the data
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['Species'] = iris.target

# Keep only the first two classes (Setosa and Versicolor)
data = data[data['Species'] != 2]

# Extract features and labels
X = data.iloc[:, :2]  # first two features (sepal length and sepal width)
y = data['Species']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build and fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Model evaluation results:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print("Confusion Matrix:")
print(conf_matrix)

# Plot the decision boundary
plt.figure(figsize=(10, 6))

# Scatter plot of the training set
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, cmap='coolwarm',
            edgecolor='k', s=100, label='Training Data')

# Scatter plot of the test set
plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_test, cmap='coolwarm',
            marker='x', s=100, label='Testing Data')

# Predict over a dense grid and shade each region by its predicted class
x_min, x_max = X.iloc[:, 0].min() - 0.5, X.iloc[:, 0].max() + 0.5
y_min, y_max = X.iloc[:, 1].min() - 0.5, X.iloc[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
# Wrap the grid in a DataFrame with the training column names to avoid a feature-name warning
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X.columns)
Z = model.predict(grid).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')

# Add title and axis labels
plt.title('Logistic Regression Decision Boundary', fontsize=16)
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.legend()
plt.show()
```

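Results

Output (the plot is shown in Figure 2):

```
Model evaluation results:
Accuracy: 1.00
Recall: 1.00
F1 Score: 1.00
Confusion Matrix:
[[17  0]
 [ 0 13]]
```

Figure 2: Logistic regression decision boundary (image description: a two-dimensional plot showing the distribution of the training set (circles) and the test set (crosses), with the background color indicating the regions separated by the decision boundary. The blue region corresponds to class 0 and the red region to class 1.)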

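Summary

This article introduced the basic principles of logistic regression and its use in classification. In the practice section, we applied logistic regression to a binary classification task on the Iris dataset and plotted the decision boundary. Hopefully this helps you understand logistic regression better.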

References

Scikit-learn documentation: scikit-learn.org/stable/documentation.html

Iris dataset: archive.ics.uci.edu/ml/datasets/iris
