How to do k-means clustering in Python (detailed worked example, with source code) - Zhihu


Preface

I recently helped a classmate build and test a k-means implementation, and I am writing it up here.

Python version: 3.8
pandas version: 1.2.4

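Assignment requirements

Write your own kMeans method and use it to cluster the following data:

The data file is dataset_circles.csv, in which

  • the first column is the x coordinate,
  • the second column is the y coordinate,
  • the third column is the class of the sample point.

Requirements:

  1. Cluster the data with the clustering method you wrote yourself.
  2. Visualize the data, analyze its characteristics yourself, find a transformation of the data, and run your own kMeans method on the data in the transformed space.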

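Implementation

The CSV data

First, open the CSV file in VS Code to take a look:

Each row is one data point: the first column is the x value, the second column is the y value, and the third column is the class label.
Next, read the contents with pandas: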

import pandas as pd

all_data = pd.read_csv("dataset_circles.csv")
print(all_data.head())

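Output:

   9.062095432950300733e+00  8.410568609122412553e+00  0.000000000000000000e+00
0                 -0.134341                  9.609815                       0.0
1                -11.767389                  0.048031                       0.0
2                 -0.818793                 10.492547                       0.0
3                  4.191121                 -4.859337                       0.0
4                  7.278099                  4.440088                       0.0

OK, the file can be read and printed. Note that the first data row ended up as the column header here, because read_csv treats the first line as a header by default; the later scripts pass header=None to avoid this.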
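To make the header behavior concrete, here is a minimal sketch of reading headerless data with header=None. The in-memory sample and the column names x, y, label are made up for illustration and are not taken from dataset_circles.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as the data file
# (x, y, class label); the values are invented for illustration.
sample_csv = io.StringIO(
    "1.0,2.0,0.0\n"
    "-3.5,4.25,0.0\n"
    "0.5,-0.5,1.0\n"
)

# header=None keeps the first row as data; names= assigns column labels.
df = pd.read_csv(sample_csv, header=None, names=["x", "y", "label"])
print(df.shape)
print(df.head())
```

With header=None all three rows stay in the data, so no sample point is lost to the header.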

The k-means implementation

First define the main function: k_means(dataset, k)
Here dataset is the data set and k is the number of clusters. The code for this part is:

def k_means(dataset, k):
    # Start from k random centers and compute the initial assignment.
    k_points = generate_k(dataset, k)
    assignments = assign_points(dataset, k_points)
    old_assignments = None
    # Repeat until the assignments stop changing.
    while assignments != old_assignments:
        new_centers = update_centers(dataset, assignments)
        old_assignments = assignments
        assignments = assign_points(dataset, new_centers)
    return assignments, dataset

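The function first calls generate_k to randomly generate k center points, then calls assign_points to split the data into k groups based on those centers (assignments is a list whose elements are the cluster index of each point). Then comes a loop that terminates only when the assignments no longer change: each iteration recomputes the centers and reassigns the points.

The generate_k function looks like this: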

from collections import defaultdict
from random import uniform


def generate_k(data_set, k):
    """
    Given `data_set`, which is an array of arrays,
    find the minimum and maximum for each coordinate, a range.
    Generate `k` random points between the ranges.
    Return an array of the random points within the ranges.
    """
    centers = []
    dimensions = len(data_set[0])
    min_max = defaultdict(int)

    # Track the smallest and largest value seen in every dimension.
    for point in data_set:
        for i in range(dimensions):
            val = point[i]
            min_key = 'min_%d' % i
            max_key = 'max_%d' % i
            if min_key not in min_max or val < min_max[min_key]:
                min_max[min_key] = val
            if max_key not in min_max or val > min_max[max_key]:
                min_max[max_key] = val

    # Draw k points uniformly inside that bounding box.
    for _k in range(k):
        rand_point = []
        for i in range(dimensions):
            min_val = min_max['min_%d' % i]
            max_val = min_max['max_%d' % i]
            rand_point.append(uniform(min_val, max_val))
        centers.append(rand_point)

    return centers

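In short, it finds the minimum and maximum of the whole data set along the x axis and along the y axis, then generates k random points inside that range as the initial centers.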
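The same bounding-box idea can be sketched more compactly. This is only an illustrative variant, not the code used in this article; the function name random_centers is hypothetical.

```python
from random import uniform

def random_centers(points, k):
    """Return k random points inside the bounding box of `points`."""
    # zip(*points) regroups the data by coordinate: all x values, all y values, ...
    bounds = [(min(axis), max(axis)) for axis in zip(*points)]
    return [[uniform(lo, hi) for lo, hi in bounds] for _ in range(k)]

points = [[1, 2], [2, 1], [5, 4], [14, 9]]
centers = random_centers(points, 3)
print(centers)  # three random [x, y] pairs with 1 <= x <= 14 and 1 <= y <= 9
```

Because the centers are drawn at random, the output differs on every run, and so can the final clustering.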

The assign_points function looks like this:

def assign_points(data_points, centers):
    """
    Given a data set and a list of center points,
    assign each point to the index of the center it is closest to.
    Return an array of center indexes that corresponds to the data set;
    that is, if there are N points in `data_points` the list we return
    will have N elements. Also, if there are Y points in `centers`
    there will be Y unique possible values within the returned list.
    """
    assignments = []
    for point in data_points:
        shortest = float("inf")  # positive infinity
        shortest_index = 0
        for i in range(len(centers)):
            val = distance(point, centers[i])
            if val < shortest:
                shortest = val
                shortest_index = i
        assignments.append(shortest_index)
    return assignments

In short: compute the distance from a point to every center. If the point is closest to the i-th center, its cluster label is i.
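A small worked example may help here. The sketch below uses math.dist from the standard library (Python 3.8+) instead of the article's own distance function, and the name nearest_center is hypothetical.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def nearest_center(point, centers):
    """Return the index of the center closest to `point`."""
    # min() returns the first index on ties, like the `<` comparison in assign_points.
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]
labels = [nearest_center(p, centers) for p in points]
print(labels)  # [0, 1, 0]
```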

Now the update_centers function:

from collections import defaultdict


def update_centers(data_set, assignments):
    """
    Accepts a dataset and a list of assignments; the indexes
    of both lists correspond to each other.
    Compute the center for each of the assigned groups.
    Return `k` centers where `k` is the number of unique assignments.
    """
    new_means = defaultdict(list)
    centers = []
    # Group the points by their current cluster label.
    for assignment, point in zip(assignments, data_set):
        new_means[assignment].append(point)

    # Centers follow the order in which labels first appear in `assignments`.
    for key, points in new_means.items():
        centers.append(point_avg(points))

    return centers

In short: for each cluster, sum the x coordinates and the y coordinates of its points, then take the averages as the new center.
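The averaging step can be checked by hand on a tiny example. This is a sketch with a hypothetical helper, cluster_means, which returns a dict keyed by label rather than the list that update_centers returns.

```python
from collections import defaultdict

def cluster_means(points, assignments):
    """Group points by label and return {label: mean point}."""
    groups = defaultdict(list)
    for label, point in zip(assignments, points):
        groups[label].append(point)
    # zip(*members) yields one tuple per coordinate; average each of them.
    return {
        label: [sum(axis) / len(members) for axis in zip(*members)]
        for label, members in groups.items()
    }

points = [[1, 2], [3, 4], [10, 10], [12, 14]]
means = cluster_means(points, [0, 0, 1, 1])
print(means)  # {0: [2.0, 3.0], 1: [11.0, 12.0]}
```

Keying the result by label keeps the label-to-center mapping explicit, at the cost of a different return type from the article's function.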

The distance used is the Euclidean distance.
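For reference, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. The sketch below writes that out (the name euclidean is hypothetical) and compares it with math.dist from the standard library.

```python
from math import dist, sqrt

def euclidean(a, b):
    """Square root of the summed squared differences, one term per coordinate."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(dist([0, 0], [3, 4]))       # 5.0
```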

Clustering the data

The code is straightforward:

import pandas as pd
from my_kmeans import k_means


def get_data(csv_file):
    # header=None: the file has no header row, so keep the first line as data.
    all_data = pd.read_csv(csv_file, header=None)
    return all_data


def pd2list(pd_data):
    # Keep only the first two columns (x, y) as a list of [x, y] pairs.
    return pd_data.iloc[:, 0:2].values.tolist()


pd_data = get_data("dataset_circles.csv")
data = pd2list(pd_data)
assign, _ = k_means(data, 2)
print(assign)

Next comes the visualization.

Visualizing the clustering result

import pandas as pd
from my_kmeans import k_means
import matplotlib.pyplot as plt


def get_data(csv_file):
    # header=None: the file has no header row, so keep the first line as data.
    all_data = pd.read_csv(csv_file, header=None)
    return all_data


def pd2list(pd_data):
    # Keep only the first two columns (x, y) as a list of [x, y] pairs.
    return pd_data.iloc[:, 0:2].values.tolist()


pd_data = get_data("dataset_circles.csv")
data = pd2list(pd_data)
assign, _ = k_means(data, 2)
# print(assign)

# Draw every point in the color of its cluster.
color = ["red", "green"]
for point, label in zip(data, assign):
    plt.scatter(point[0], point[1], color=color[label])
plt.show()

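Transforming the data space

Clearly, our result does not match the classes in the original plot. Could we transform the space? Polar coordinates came to mind: take the mean of all points as the origin, then compute the polar coordinates of every point. The code is: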

def repre(points, center_point):
    new_data = []
    for point in points:
        x = point[0] - center_point[0]
        y = point[1] - center_point[1]
        new_data.append(xy2pole(x, y))
    return new_data
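The helper xy2pole called here is defined in the full listing of new_representation.py at the end of this article. As a rough sketch of the conversion itself, the variant below uses math.hypot and math.atan2, which is a swapped-in choice and not the article's code (the article uses math.asin); the name to_polar is hypothetical.

```python
import math

def to_polar(x, y):
    """Convert Cartesian (x, y) to polar [r, theta], with theta in [-pi, pi]."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)  # covers all four quadrants and is defined at the origin
    return [r, theta]

print(to_polar(3.0, 4.0)[0])  # 5.0
print(to_polar(-1.0, 0.0))    # [1.0, 3.141592653589793]
```

For data arranged in concentric circles, the radius r is the coordinate that separates the rings, which is why clustering in polar space can succeed where clustering on raw x and y does not.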

Everything else stays the same. Let's look at the result:

Perfect.
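Beyond judging the plot by eye, the result could also be compared with the class labels in the third column of the data file. The sketch below is one possible check, assuming two clusters and class labels of 0 and 1; the function name agreement and the sample values are made up for illustration.

```python
def agreement(true_labels, cluster_ids):
    """Fraction of points whose cluster matches the true class (k = 2).

    Cluster indices are arbitrary, so the score is the better of the
    two possible label-to-cluster mappings.
    """
    same = sum(int(t) == c for t, c in zip(true_labels, cluster_ids))
    n = len(true_labels)
    return max(same, n - same) / n

true_labels = [0.0, 0.0, 1.0, 1.0]  # invented example values
print(agreement(true_labels, [1, 1, 0, 0]))  # 1.0
print(agreement(true_labels, [0, 1, 0, 1]))  # 0.5
```

A score of 1.0 means the clusters reproduce the classes exactly, up to swapping the two cluster indices.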


At the request of enthusiastic readers who messaged me privately, I am posting the complete code:

main.py

import pandas as pd
from my_kmeans import k_means
import matplotlib.pyplot as plt


def get_data(csv_file):
    # header=None: the file has no header row, so keep the first line as data.
    all_data = pd.read_csv(csv_file, header=None)
    return all_data


def pd2list(pd_data):
    # Keep only the first two columns (x, y) as a list of [x, y] pairs.
    return pd_data.iloc[:, 0:2].values.tolist()


pd_data = get_data("dataset_circles.csv")
data = pd2list(pd_data)
assign, _ = k_means(data, 2)
# print(assign)

# Draw every point in the color of its cluster.
color = ["red", "green"]
for point, label in zip(data, assign):
    plt.scatter(point[0], point[1], color=color[label])
plt.show()

my_kmeans.py

from collections import defaultdict
from random import uniform
from math import sqrt


def point_avg(points):
    """
    Accepts a list of points, each with the same number of dimensions.
    NB. points can have more dimensions than 2

    Returns a new point which is the center of all the points.
    """
    dimensions = len(points[0])

    new_center = []

    for dimension in range(dimensions):
        dim_sum = 0  # dimension sum
        for p in points:
            dim_sum += p[dimension]

        # average of each dimension
        new_center.append(dim_sum / float(len(points)))

    return new_center


def update_centers(data_set, assignments):
    """
    Accepts a dataset and a list of assignments; the indexes
    of both lists correspond to each other.
    Compute the center for each of the assigned groups.
    Return `k` centers where `k` is the number of unique assignments.
    """
    new_means = defaultdict(list)
    centers = []
    # Group the points by their current cluster label.
    for assignment, point in zip(assignments, data_set):
        new_means[assignment].append(point)

    # Centers follow the order in which labels first appear in `assignments`.
    for key, points in new_means.items():
        centers.append(point_avg(points))

    return centers


def assign_points(data_points, centers):
    """
    Given a data set and a list of center points,
    assign each point to the index of the center it is closest to.
    Return an array of center indexes that corresponds to the data set;
    that is, if there are N points in `data_points` the list we return
    will have N elements. Also, if there are Y points in `centers`
    there will be Y unique possible values within the returned list.
    """
    assignments = []
    for point in data_points:
        shortest = float("inf")  # positive infinity
        shortest_index = 0
        for i in range(len(centers)):
            val = distance(point, centers[i])
            if val < shortest:
                shortest = val
                shortest_index = i
        assignments.append(shortest_index)
    return assignments


def distance(a, b):
    """
    Euclidean distance between points `a` and `b`.
    """
    dimensions = len(a)

    _sum = 0
    for dimension in range(dimensions):
        difference_sq = (a[dimension] - b[dimension]) ** 2
        _sum += difference_sq
    return sqrt(_sum)


def generate_k(data_set, k):
    """
    Given `data_set`, which is an array of arrays,
    find the minimum and maximum for each coordinate, a range.
    Generate `k` random points between the ranges.
    Return an array of the random points within the ranges.
    """
    centers = []
    dimensions = len(data_set[0])
    min_max = defaultdict(int)

    # Track the smallest and largest value seen in every dimension.
    for point in data_set:
        for i in range(dimensions):
            val = point[i]
            min_key = 'min_%d' % i
            max_key = 'max_%d' % i
            if min_key not in min_max or val < min_max[min_key]:
                min_max[min_key] = val
            if max_key not in min_max or val > min_max[max_key]:
                min_max[max_key] = val

    # Draw k points uniformly inside that bounding box.
    for _k in range(k):
        rand_point = []
        for i in range(dimensions):
            min_val = min_max['min_%d' % i]
            max_val = min_max['max_%d' % i]
            rand_point.append(uniform(min_val, max_val))
        centers.append(rand_point)

    return centers


def k_means(dataset, k):
    # Start from k random centers and compute the initial assignment.
    k_points = generate_k(dataset, k)
    assignments = assign_points(dataset, k_points)
    old_assignments = None
    # Repeat until the assignments stop changing.
    while assignments != old_assignments:
        new_centers = update_centers(dataset, assignments)
        old_assignments = assignments
        assignments = assign_points(dataset, new_centers)
    return assignments, dataset


# points = [
#     [1, 2],
#     [2, 1],
#     [3, 1],
#     [5, 4],
#     [5, 5],
#     [6, 5],
#     [10, 8],
#     [7, 9],
#     [11, 5],
#     [14, 9],
#     [14, 14],
# ]
# assign, dataset = k_means(points, 3)
# print(assign)
# print(dataset)

new_representation.py

import pandas as pd
from my_kmeans import k_means
import matplotlib.pyplot as plt
import math


def get_data(csv_file):
    # header=None: the file has no header row, so keep the first line as data.
    all_data = pd.read_csv(csv_file, header=None)
    return all_data


def pd2list(pd_data):
    # Keep only the first two columns (x, y) as a list of [x, y] pairs.
    return pd_data.iloc[:, 0:2].values.tolist()


def get_avg_point(points):
    # Mean of all points, used as the origin of the polar coordinates.
    x_sum = 0
    y_sum = 0
    for point in points:
        x_sum += point[0]
        y_sum += point[1]
    return [x_sum / len(points), y_sum / len(points)]


def xy2pole(x, y):
    r = math.sqrt(x**2 + y**2)
    if r == 0:
        # A point exactly at the origin has no defined angle; avoid dividing by zero.
        return [0.0, 0.0]
    # asin only yields angles in [-pi/2, pi/2]; math.atan2(y, x) would give the full range.
    theta = math.asin(y / r)
    return [r, theta]


def repre(points, center_point):
    new_data = []
    for point in points:
        x = point[0] - center_point[0]
        y = point[1] - center_point[1]
        new_data.append(xy2pole(x, y))
    return new_data


pd_data = get_data("dataset_circles.csv")
data = pd2list(pd_data)
center_point = get_avg_point(data)
new_data = repre(data, center_point)
# Cluster in polar space, but plot the points at their original x, y positions.
assign, _ = k_means(new_data, 2)
# print(assign)

color = ["red", "green"]
for point, label in zip(data, assign):
    plt.scatter(point[0], point[1], color=color[label])
plt.show()