【Python】Data Mining and Machine Learning (Part 2)

码农世界 2024-05-17

【Experiment 1】Wheat Seed Classification (Softmax Regression)

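The Seeds dataset records, for wheat seeds of different varieties, the area (Area), perimeter (Perimeter), compactness (Compactness), kernel length (Kernel.Length), kernel width (Kernel.Width), asymmetry coefficient (Asymmetry.Coeff), kernel groove length (Kernel.Groove), and the class label (Type). The dataset has 210 records, 7 features, and 1 class column with 3 labels: 1, 2, 3. (The data file is seeds.csv.)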

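Use Softmax regression to fit a linear classification model:

(1) Split the data into training and test sets at 7:3 and report the model's weight coefficients;

(2) Report the confusion matrix and the classification report for the test set;

(3) Optional: draw a pairplot;

(4) Using the trained model, classify the following wheat seeds: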

14.56 14.39 0.88 5.57 3.27 2.27 5.22

18.68 16.23 0.89 6.23 3.72 3.22 6.08

12.47 13.55 0.85 5.34 2.91 4.37 5.14

12.97 13.73 0.86 5.39 3.01 4.79 5.28

14.48 14.49 0.85 5.71 3.14 5.30 5.62
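For reference, Softmax regression computes one linear score per class, z = Wx + b, and converts the scores into probabilities with p_k = exp(z_k) / sum_j exp(z_j); the predicted class is the one with the largest probability. The sketch below is a minimal NumPy illustration of that step. The weights `W`, bias `b`, and sample `x` are made-up values for illustration, not the model trained on seeds.csv.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp()
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical parameters for 3 classes and 2 features (illustration only).
W = np.array([[1.0, -0.5],
              [0.2, 0.3],
              [-1.0, 0.8]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([2.0, 1.0])

z = W @ x + b             # one linear score per class
p = softmax(z)            # class probabilities
pred = int(np.argmax(p))  # index of the most probable class
print(p, pred)
```

Subtracting the maximum score before exponentiating does not change the result, because the common factor cancels in the ratio.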

Code implementation

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# 0. Load the data: 7 feature columns followed by the class column (Type)
data = pd.read_csv('seeds.csv').values
xx = data[:, :-1]
yy = data[:, -1].astype(int)  # class labels: 1, 2, 3

# 1. Split into training and test sets (7:3) and fit the model
x_train, x_test, y_train, y_test = train_test_split(xx, yy, test_size=0.3, random_state=2022)
# In recent scikit-learn versions, a multiclass LogisticRegression with the default
# solver fits a multinomial (Softmax) model.
# max_iter defaults to 100, which is not enough to converge here, so it is set to 500.
clf = LogisticRegression(max_iter=500).fit(x_train, y_train)
print('coef:', clf.coef_)  # weight coefficients, one row per class
print('intercept:', clf.intercept_)  # intercepts, one per class

# 2. Confusion matrix and classification report on the test set
y_pred = clf.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix:\n', cm)
report = classification_report(y_test, y_pred)
print('Classification report:\n', report)

# 3. Pairplot
df = pd.DataFrame(data, columns=['Area', 'Perimeter', 'Compactness', 'Kernel.Length', 'Kernel.Width',
                                 'Asymmetry.Coeff', 'Kernel.Groove', 'Type'])
df['Type'] = df['Type'].astype(int).astype(str)  # treat the class as a category for coloring
sns.pairplot(df, hue='Type')

# 4. Classify the five new wheat seeds
new_seeds = np.array([[14.56, 14.39, 0.88, 5.57, 3.27, 2.27, 5.22],
                      [18.68, 16.23, 0.89, 6.23, 3.72, 3.22, 6.08],
                      [12.47, 13.55, 0.85, 5.34, 2.91, 4.37, 5.14],
                      [12.97, 13.73, 0.86, 5.39, 3.01, 4.79, 5.28],
                      [14.48, 14.49, 0.85, 5.71, 3.14, 5.30, 5.62]])

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max to avoid overflow in exp()
    return e_x / e_x.sum(axis=0)

# Manual Softmax prediction: argmax of softmax(W x + b), mapped back to the class labels
idx = [np.argmax(softmax(np.dot(clf.coef_, row) + clf.intercept_)) for row in new_seeds]
result = [int(clf.classes_[i]) for i in idx]
print(result)
plt.show()
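The manual prediction step in the code above (argmax of the softmax of the linear scores) should agree with what the fitted model returns from `predict` and `predict_proba`. The sketch below checks this on synthetic data. It assumes a recent scikit-learn version in which multiclass `LogisticRegression` is multinomial by default, and the `make_blobs` data is only a stand-in for seeds.csv.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for seeds.csv (an assumption for illustration).
X, y = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)
y = y + 1  # relabel the classes as 1, 2, 3 like the Seeds data

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Manual softmax over the linear scores, one row per sample.
scores = X @ clf.coef_.T + clf.intercept_
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
exp_scores = np.exp(scores)
proba_manual = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Map the argmax column index back to the actual class labels.
labels_manual = clf.classes_[np.argmax(proba_manual, axis=1)]

print(np.allclose(proba_manual, clf.predict_proba(X)))
print(np.array_equal(labels_manual, clf.predict(X)))
```

Indexing `clf.classes_` instead of adding 1 to the argmax keeps the mapping correct even when the labels are not consecutive integers starting at 1.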

【Experiment 2】Breast Tumor Classification (LDA)

Load the dataset from sklearn:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

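The breast cancer dataset has 569 samples, each described by 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The label is binary. In the object returned by load_breast_cancer, the 30 feature columns are in data.data and the label is in data.target, encoded as 0 = malignant, 1 = benign. The class counts are:

| Type | Count |
|---|---|
| Benign | 357 |
| Malignant | 212 |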
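The label encoding and the class counts can be read directly from the loaded object. A minimal sketch, using only attributes of the object returned by `load_breast_cancer`:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)             # feature matrix: samples x features
print(list(data.target_names))     # class name for label 0, then label 1

counts = np.bincount(data.target)  # number of samples per label
for label, name in enumerate(data.target_names):
    print(label, name, counts[label])
```

Checking `target_names` first is worthwhile because precision and recall in scikit-learn are reported for label 1 by default, which in this dataset is the benign class.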

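The 30 attributes are the mean, the standard error, and the "worst" value (the mean of the three largest values) of 10 measurements of the suspected tumor region, such as radius, texture (gray-scale), perimeter, area, and smoothness: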

(1) radius (mean of distances from center to points on the perimeter)

(2) texture (standard deviation of gray-scale values)

(3) perimeter

(4) area

(5) smoothness (local variation in radius lengths)

(6) compactness (perimeter^2 / area - 1.0)

(7) concavity (severity of concave portions of the contour)

(8) concave points (number of concave portions of the contour)

(9) symmetry

(10) fractal dimension (“coastline approximation” - 1)

The minimum and maximum of each attribute over the whole dataset:

| Attribute | Min | Max |
|---|---|---|
| radius (mean) | 6.981 | 28.11 |
| texture (mean) | 9.71 | 39.28 |
| perimeter (mean) | 43.79 | 188.5 |
| area (mean) | 143.5 | 2501.0 |
| smoothness (mean) | 0.053 | 0.163 |
| compactness (mean) | 0.019 | 0.345 |
| concavity (mean) | 0.0 | 0.427 |
| concave points (mean) | 0.0 | 0.201 |
| symmetry (mean) | 0.106 | 0.304 |
| fractal dimension (mean) | 0.05 | 0.097 |
| radius (standard error) | 0.112 | 2.873 |
| texture (standard error) | 0.36 | 4.885 |
| perimeter (standard error) | 0.757 | 21.98 |
| area (standard error) | 6.802 | 542.2 |
| smoothness (standard error) | 0.002 | 0.031 |
| compactness (standard error) | 0.002 | 0.135 |
| concavity (standard error) | 0.0 | 0.396 |
| concave points (standard error) | 0.0 | 0.053 |
| symmetry (standard error) | 0.008 | 0.079 |
| fractal dimension (standard error) | 0.001 | 0.03 |
| radius (worst) | 7.93 | 36.04 |
| texture (worst) | 12.02 | 49.54 |
| perimeter (worst) | 50.41 | 251.2 |
| area (worst) | 185.2 | 4254.0 |
| smoothness (worst) | 0.071 | 0.223 |
| compactness (worst) | 0.027 | 1.058 |
| concavity (worst) | 0.0 | 1.252 |
| concave points (worst) | 0.0 | 0.291 |
| symmetry (worst) | 0.156 | 0.664 |
| fractal dimension (worst) | 0.055 | 0.208 |

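Your tasks:

(1) Use LDA (linear discriminant analysis) to build a binary classifier for benign versus malignant tumors;

(2) Split the dataset into training and test sets (6:4), compute the accuracy on the test set, and report the confusion matrix and the classification report;

(3) Optional: report the average result of 4-fold cross-validation on the dataset;

(4) Optional: given the new test data in ceshi.csv, report the predictions for the samples in that file.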
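As background for task (1): LDA can produce at most n_classes - 1 discriminant directions, so with two classes the 30 features are projected onto a single axis. The sketch below is a minimal illustration of that projection on the same sklearn dataset; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=1)
z = lda.fit(X, y).transform(X)  # project 30 features onto one discriminant axis

print(z.shape)
# Mean projected value per class; well-separated means indicate a useful axis.
mean_0 = z[y == 0].mean()
mean_1 = z[y == 1].mean()
print(mean_0, mean_1)
```

A histogram of `z` split by class is a quick way to see how much the two classes overlap on this axis.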

Code implementation

from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Load the data
data = load_breast_cancer()
x_data = data.data
y_data = data.target  # 0 = malignant, 1 = benign
label_name = data['target_names']
feature_name = data['feature_names']

# 1. Fit an LDA model for the benign/malignant binary classification
lda = LDA(n_components=1)  # with two classes, LDA can project onto at most 1 dimension
lda.fit(x_data, y_data)
y_pred = lda.predict(x_data)  # predicted labels on the same data used for fitting
print('accuracy:\n', metrics.accuracy_score(y_data, y_pred))
# precision and recall are computed for label 1 (benign) by default
print('precision:\n', metrics.precision_score(y_data, y_pred))
print('recall:\n', metrics.recall_score(y_data, y_pred))

# 2. Split 6:4, then compute the test accuracy, confusion matrix, and classification report
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.4, random_state=2022)
clf = LDA(n_components=1).fit(x_train, y_train)
print('coef:', clf.coef_)
print('intercept:', clf.intercept_)
print('Train Accuracy:', clf.score(x_train, y_train))
print('Test Accuracy:', clf.score(x_test, y_test))  # compare with the training accuracy to check for overfitting
# Predict with clf (fitted on the training set only), not lda (fitted on all the data),
# otherwise the test samples would already have been seen during fitting.
y_predict = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
print('Confusion matrix:\n', cm)
report = metrics.classification_report(y_test, y_predict, target_names=label_name)
print('Classification report:\n', report)
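The code above does not cover optional task (3). One way to get the 4-fold cross-validation average is `cross_val_score`; the sketch below is a minimal version, with accuracy as the metric chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# With an integer cv and a classifier, the folds are stratified by class.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=4, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())
```

Because every sample is used for testing exactly once, the mean is less sensitive to one particular train/test split than the single 6:4 result.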

If you have read this far, congratulations on picking up another skill 👊
Keep at it: persistence is what wins 💪
I'm 寸铁! See you next time 💕

When reposting, please credit 码农世界. Article title: 【Python】Data Mining and Machine Learning (Part 2)
