
[Time Series Clustering] KMedoids Clustering + the DTW Algorithm (CSDN blog)



Preface

KMedoids clustering sometimes gives better results than KMeans. I happen to have a batch of time-series data on hand, so today I'll try KMedoids on it and see how the clustering turns out.

Installation

KMedoids is available in scikit-learn-extra, an extension module for sklearn's clustering. It requires Python >= 3.6 and scikit-learn >= 0.22.

Installing scikit-learn-extra

```shell
# PyPI
pip install scikit-learn-extra

# Conda
conda install -c conda-forge scikit-learn-extra

# Git
pip install https://github.com/scikit-learn-contrib/scikit-learn-extra/archive/master.zip
```

Installing tslearn

```shell
# PyPI
python -m pip install tslearn

# Conda
conda install -c conda-forge tslearn

# Git
python -m pip install https://github.com/tslearn-team/tslearn/archive/master.zip
```

Why use both scikit-learn-extra and tslearn? Because sklearn has no built-in KMedoids and no time-series metric such as DTW, but combining the two modules solves both problems.

Testing

```python
import numpy as np
from sklearn_extra.cluster import KMedoids
import tslearn.metrics as metrics
import data_process  # the author's local preprocessing module
from tslearn.clustering import silhouette_score
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.generators import random_walks
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# np.float was removed in NumPy 1.24; the builtin float works the same here.
X = np.loadtxt("top100.txt", dtype=float, delimiter=",")
X = data_process.downsample(X, 30)
seed = 0

def test_elbow():
    global X, seed
    distortions = []
    dists = metrics.cdist_dtw(X)
    for i in range(2, 15):
        km = KMedoids(n_clusters=i, random_state=seed, metric="precomputed")
        km.fit(dists)
        distortions.append(km.inertia_)
    plt.plot(range(2, 15), distortions, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Distortion')
    plt.show()

def test_kmedoids():
    num_cluster = 5
    km = KMedoids(n_clusters=num_cluster, random_state=0, metric="precomputed")
    dists = metrics.cdist_dtw(X)
    y_pred = km.fit_predict(dists)
    np.fill_diagonal(dists, 0)
    score = silhouette_score(dists, y_pred, metric="precomputed")
    print(X.shape)
    print(y_pred.shape)
    print("silhouette_score: " + str(score))
    for yi in range(num_cluster):
        plt.subplot(3, 2, yi + 1)
        for xx in X[y_pred == yi]:
            plt.plot(xx.ravel(), "k-", alpha=.3)
        # The medoid is a real sample, so it can be plotted as the center.
        plt.plot(X[km.medoid_indices_[yi]], "r-")
        plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
                 transform=plt.gca().transAxes)
        if yi == 1:
            plt.title("KMedoids" + " + DBA-DTW")
    plt.tight_layout()
    plt.show()

test_kmedoids()
```

KMedoids + DBA-DTW clustering result  # silhouette_score: 0.5465097470777784

KMedoids + SoftDTW clustering result  # silhouette_score: 0.6528261125440392

Plain Euclidean-distance clustering result  # silhouette_score: 0.5209641775604567

Compared with KMeans, the cluster centers found by KMedoids (the red lines) seem visually better (the KMeans + DTW results can be seen in the last section of that article), yet its silhouette score falls short of KMeans's. With plain Euclidean distance, though, KMedoids's silhouette score is slightly better than KMeans's.