Commit c040753

Manifold Learning for data visualizing
1 parent 73f3c9b commit c040753

3 files changed

+288
-0
lines changed

@@ -0,0 +1,100 @@
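##1. The Concept of Manifold Learning
Manifold learning methods (manifold learning for short) have been a research hotspot in information science since they were first proposed in the journal *Science* in 2000. They matter both in theory and in applications.

Assume the data are sampled uniformly from a low-dimensional manifold embedded in a high-dimensional Euclidean space. Manifold learning recovers that low-dimensional structure from the high-dimensional samples: it finds the low-dimensional manifold in the high-dimensional space and computes the corresponding embedding map, in order to reduce dimensionality or visualize the data. In other words, it looks for the essence behind observed phenomena and for the intrinsic regularities that generated the data.

>The above is taken from [Baidu Baike](http://baike.baidu.com/link?url=vQmr30kzWc3gXfZM-6ANTtPdWJ1JyUsJR0pzoOWfjG79QK4zVZ_PvFN8BRfgHeGkqFPR-HZGsguaYuZrSTEcwK)

Put simply, manifold learning methods can reduce the dimensionality of high-dimensional data. If we reduce to 2 or 3 dimensions, we can visualize the original data, get an intuitive feel for its distribution, and perhaps spot some underlying patterns.
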
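The idea of "data sampled from a low-dimensional manifold inside a higher-dimensional space" can be made concrete with a toy example. The sketch below is only an illustration and is not part of the original experiment: the helper `sample_swiss_roll` and its constants are assumptions chosen for the demo. It draws points from a 2-D sheet that is rolled up inside 3-D space, so every 3-D point is fully described by two intrinsic coordinates.

```python
import numpy as np


def sample_swiss_roll(n_samples, seed=0):
    """Sample points from a 2-D sheet rolled up inside 3-D space.

    Returns the 3-D observations and the 2-D intrinsic coordinates
    that generated them (hypothetical helper for illustration only).
    """
    rng = np.random.RandomState(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.rand(n_samples))  # position along the roll
    h = 21 * rng.rand(n_samples)                     # position across the roll
    X = np.column_stack((t * np.cos(t), h, t * np.sin(t)))
    intrinsic = np.column_stack((t, h))
    return X, intrinsic


X, intrinsic = sample_swiss_roll(500)
print(X.shape, intrinsic.shape)
```

A manifold learning method sees only `X` and tries to recover coordinates that behave like `intrinsic`.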
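##2. Categories of Manifold Learning
Manifold learning methods can be divided into linear and nonlinear ones. Linear methods include the familiar principal component analysis (PCA); nonlinear methods include isometric mapping (Isomap), Laplacian eigenmaps (LE), and locally linear embedding (LLE).
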
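To show what "linear" means here, the following is a minimal NumPy sketch of PCA computed through the SVD. It is an illustration under my own assumptions (the function name `pca_project` and the random test data are hypothetical), not the code used in the experiment. The point is that the embedding is a single matrix multiplication applied to centered data.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal directions (a linear map)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T  # shape (n_features, n_components)
    return np.dot(Xc, W), W, mean


rng = np.random.RandomState(0)
data = rng.randn(100, 5) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
Z, W, mean = pca_project(data, 2)
print(Z.shape)

# Any new point is embedded with the same linear map.
new_point = rng.randn(5)
z_new = np.dot(new_point - mean, W)
```

Nonlinear methods such as Isomap or LLE have no such fixed projection matrix; they build the embedding from neighborhood relations between the samples.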
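Of course, there are more manifold learning methods than these. My knowledge is still limited, so I will not go into them here, and their principles cannot be explained in a single article anyway. For an introduction to the various methods, there is a good read online (I could not find the original source): [Manifold Learning](http://blog.csdn.net/zhulingchen/article/details/2123129)
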
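##3. Dimensionality Reduction and Visualization of High-Dimensional Data
One picture summarizes dimensionality reduction well (again, I do not know the original source):

![image](http://img.blog.csdn.net/20150522194801297)


The figure covers most manifold learning methods, but it does not include t-SNE. Compared with the other algorithms, t-SNE is relatively new and tends to give better results. It was proposed in 2008 by Laurens van der Maaten (lvdmaaten) and Geoffrey Hinton, the deep learning pioneer. lvdmaaten maintains a t-SNE page, [tsne](http://lvdmaaten.github.io/tsne/), with the papers and implementations in various programming languages.

Next comes a small experiment: reducing and visualizing the handwritten digits dataset bundled with sklearn (8*8 images, a small MNIST-like set) with about a dozen algorithms. All of the algorithms are available in sklearn, and matplotlib is used for plotting. Most of the experiment follows this sklearn [example](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), with slight modifications.

Matlab users can use the toolbox provided by lvdmaaten: [drtoolbox](http://lvdmaaten.github.io/drtoolbox/)
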
###**- Loading the data**


The digits data comes from the datasets module bundled with sklearn, as shown below. To make the later plots easier to read, I select only `n_class=5`, i.e. the 5 digits 0~4. Each image is 8*8, so it flattens to 64 dimensions.


    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

    digits = datasets.load_digits(n_class=5)
    X = digits.data
    y = digits.target
    print(X.shape)
    # Tile the first 20 x 20 samples into one image; each digit gets a
    # 10 x 10 cell with a 1-pixel margin around the 8 x 8 image.
    n_img_per_row = 20
    img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
    for i in range(n_img_per_row):
        ix = 10 * i + 1
        for j in range(n_img_per_row):
            iy = 10 * j + 1
            img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
    plt.imshow(img, cmap=plt.cm.binary)
    plt.title('A selection from the 64-dimensional digits dataset')


Running the code gives `X` with shape (901, 64), i.e. 901 samples. The figure below shows some of them:

![image](http://img.blog.csdn.net/20150522195128952)


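The relationship between an 8*8 image, its 64-dimensional vector, and its position in the tiled overview image can be checked in isolation. This is a small sketch with a synthetic stand-in image (the values and the chosen cell `i, j` are assumptions for the demo, not real digit data):

```python
import numpy as np

image = np.arange(64).reshape((8, 8))  # stand-in for one 8x8 digit
vector = image.ravel()                 # the 64-dimensional sample (row-major)
restored = vector.reshape((8, 8))      # what the plotting loop does per sample

n_img_per_row = 20
canvas = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
i, j = 2, 3                            # grid cell: row 2, column 3
ix, iy = 10 * i + 1, 10 * j + 1        # top-left corner, leaving a 1-pixel margin
canvas[ix:ix + 8, iy:iy + 8] = restored

print(vector.shape, canvas.shape)
```

Flattening and reshaping are exact inverses, so no pixel information is lost when the images are treated as 64-dimensional points.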
###**- Dimensionality reduction**
Taking t-SNE as an example, the code is below. `n_components` is set to 3, which reduces the 64 dimensions to 3. `init` sets how the embedding is initialized and can be `random` or `pca`; `pca` is used here because it is more stable than random initialization.


    print("Computing t-SNE embedding")
    tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
    t0 = time()
    X_tsne = tsne.fit_transform(X)
    plot_embedding_2d(X_tsne[:, 0:2], "t-SNE 2D")
    plot_embedding_3d(X_tsne, "t-SNE 3D (time %.2fs)" % (time() - t0))


The reduced result `X_tsne` has shape (901, 3). `plot_embedding_2d()` visualizes the first 2 dimensions, and `plot_embedding_3d()` visualizes all 3.


The function `plot_embedding_3d` is defined as follows:


    def plot_embedding_3d(X, title=None):
        # Scale the coordinates to the [0, 1] range
        x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
        X = (X - x_min) / (x_max - x_min)
        # The reduced coordinates are (X[i, 0], X[i, 1], X[i, 2]);
        # draw the corresponding digit at that position
        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1, projection='3d')
        for i in range(X.shape[0]):
            ax.text(X[i, 0], X[i, 1], X[i, 2], str(digits.target[i]),
                    color=plt.cm.Set1(y[i] / 10.),
                    fontdict={'weight': 'bold', 'size': 9})
        if title is not None:
            plt.title(title)


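The first step of the plotting helpers is a per-column min-max scaling to [0, 1]. The sketch below isolates that step. The guard for constant columns is my own addition (an assumption, not in the original code), included because `(x_max - x_min)` would otherwise be zero for such a column; the function name `scale_to_unit_range` is hypothetical.

```python
import numpy as np


def scale_to_unit_range(X):
    """Scale every column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = x_max - x_min
    span[span == 0] = 1.0  # avoid 0/0 for a constant column
    return (X - x_min) / span


points = np.array([[0.0, 5.0, -2.0],
                   [10.0, 5.0, 0.0],
                   [5.0, 5.0, 2.0]])
scaled = scale_to_unit_range(points)
print(scaled)
```

Scaling only changes the axis ranges, not the relative layout of the points, so all embeddings can be drawn on comparable axes.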
###**- Results**

The dozen or so algorithms give results of varying quality. Overall t-SNE performs best, but it is also the most computationally expensive. The results of PCA, LDA, and t-SNE are shown below:
![image](http://img.blog.csdn.net/20150522195334439)
![image](http://img.blog.csdn.net/20150522195314420)
![image](http://img.blog.csdn.net/20150522195347336)
![image](http://img.blog.csdn.net/20150522195443173)
![image](http://img.blog.csdn.net/20150522195502751)
![image](http://img.blog.csdn.net/20150522195440501)

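The comparison above is visual. One possible way to put a rough number on how well an embedding separates the classes is a leave-one-out 1-nearest-neighbor score computed in the embedded space. This is only a sketch of that idea, not something the original experiment measured; the function `loo_1nn_accuracy` and the toy inputs are assumptions.

```python
import numpy as np


def loo_1nn_accuracy(Z, labels):
    """Fraction of points whose nearest other point has the same label."""
    Z = np.asarray(Z, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * np.dot(Z, Z.T)
    np.fill_diagonal(d2, np.inf)  # a point may not vote for itself
    nearest = np.argmin(d2, axis=1)
    return np.mean(labels[nearest] == labels)


separated = loo_1nn_accuracy([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])
mixed = loo_1nn_accuracy([[0, 0], [0, 1], [0, 2], [0, 3]], [0, 1, 0, 1])
print(separated, mixed)
```

With the variables from the experiment, a call such as `loo_1nn_accuracy(X_tsne, y)` could be compared against the same score for the PCA or LDA embedding.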
@@ -0,0 +1,184 @@
# coding: utf-8
"""
Created on Fri May 22 2015
@author: wepon
@blog:

"""
from time import time
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D  # registers the '3d' projection
from sklearn import (manifold, datasets, decomposition, ensemble,
                     random_projection)
# The old sklearn.lda module is gone in newer scikit-learn releases;
# LDA now lives in sklearn.discriminant_analysis.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#%%
# Load the data and display a sample of it
digits = datasets.load_digits(n_class=5)
X = digits.data
y = digits.target
print(X.shape)
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.imshow(img, cmap=plt.cm.binary)
plt.title('A selection from the 64-dimensional digits dataset')

# LLE, Isomap and LTSA need the n_neighbors parameter
n_neighbors = 30


#%%
# Visualize the reduced data in 2D
def plot_embedding_2d(X, title=None):
    # Scale the coordinates to the [0, 1] range
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    # The reduced coordinates are (X[i, 0], X[i, 1]);
    # draw the corresponding digit at that position
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})

    if title is not None:
        plt.title(title)


#%%
# Visualize the reduced data in 3D
def plot_embedding_3d(X, title=None):
    # Scale the coordinates to the [0, 1] range
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    # The reduced coordinates are (X[i, 0], X[i, 1], X[i, 2]);
    # draw the corresponding digit at that position
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], X[i, 2], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})

    if title is not None:
        plt.title(title)


#%%
# Random projection
print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding_2d(X_projected, "Random Projection")

#%%
# PCA (computed with TruncatedSVD, which does not center the data)
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=3).fit_transform(X)
plot_embedding_2d(X_pca[:, 0:2], "PCA 2D")
plot_embedding_3d(X_pca, "PCA 3D (time %.2fs)" % (time() - t0))

#%%
# LDA
print("Computing LDA projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # Make X invertible
t0 = time()
X_lda = LinearDiscriminantAnalysis(n_components=3).fit_transform(X2, y)
plot_embedding_2d(X_lda[:, 0:2], "LDA 2D")
plot_embedding_3d(X_lda, "LDA 3D (time %.2fs)" % (time() - t0))


#%%
# Isomap
print("Computing Isomap embedding")
t0 = time()
X_iso = manifold.Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding_2d(X_iso, "Isomap (time %.2fs)" % (time() - t0))


#%%
# Standard LLE
print("Computing LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2,
                                      method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_lle, "Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# Modified LLE
print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2,
                                      method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_mlle, "Modified Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# HLLE
print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2,
                                      method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_hlle, "Hessian Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# LTSA
print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=2,
                                      method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_ltsa, "Local Tangent Space Alignment (time %.2fs)" % (time() - t0))

#%%
# MDS
print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding_2d(X_mds, "MDS (time %.2fs)" % (time() - t0))

#%%
# Random Trees
print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0,
                                       max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)

plot_embedding_2d(X_reduced, "Random Trees (time %.2fs)" % (time() - t0))

#%%
# Spectral
print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0,
                                      eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)
plot_embedding_2d(X_se, "Spectral (time %.2fs)" % (time() - t0))

#%%
# t-SNE
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)
plot_embedding_2d(X_tsne[:, 0:2], "t-SNE 2D")
plot_embedding_3d(X_tsne, "t-SNE 3D (time %.2fs)" % (time() - t0))

plt.show()

README.md

+4
@@ -43,6 +43,10 @@ CSDN:[wepon的专栏](http://blog.csdn.net/u012162613)
- **logistic regression**

Logistic regression (binary classification) implemented with python+numpy. Detailed write-up: [article link](http://blog.csdn.net/u012162613/article/details/41844495)

- **ManifoldLearning**

[DimensionalityReduction_DataVisualizing]() Reduces high-dimensional data with several manifold learning methods and visualizes the result with matplotlib (2D and 3D)

- **SVM**
