
Commit 440d2e6

added code and dataset
1 parent 2dce0b9 commit 440d2e6

17 files changed, +2528 −4 lines

.env

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
source activate compsumm > /dev/null 2>&1

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
.DS_Store
__pycache__

README.md

Lines changed: 34 additions & 4 deletions
@@ -1,14 +1,44 @@
 # compsumm
 Code, Datasets and Supplementary Appendix for AAAI paper **Comparative Document Summarisation via Classification**

-**Code**: Coming Soon
+**Supplementary Appendix**: [pdf](/appendix.pdf)

-**Dataset**: Coming Soon
+## How to use this repository?

-**Supplementary Appendix**: [pdf](/appendix.pdf)
+### Installing
+If you have miniconda or anaconda, please use `install.sh` to install a new env `compsumm` that has all dependencies; otherwise the dependencies are listed in `environment.yml`

-If you use this dataset, please cite this work at
+### 1. Dataset
+The datasets are in the `datasets` directory in `HDF5` format. There is one file for each of the three news topics used in the paper. Each file has the following structure:
+```
+-- data: averaged GloVe vectors of the title and first 3 sentences, 300-dimensional
+-- y: labels created by dividing time ranges into two groups
+-- yn: labels created using month for beefban and week for capital punishment and guncontrol.
+-- title: title of the article
+-- text: first three sentences
+-- datetime: date of publication
+
+The dataset was split 70-20-10 into train-test-val sets several times.
+-- train_idxs: Matrix with each row i containing the training indexes of split i.
+-- test_idxs: Matrix with each row i containing the test indexes of split i.
+-- val_idxs: Matrix with each row i containing the val indexes of split i.
+```
+Please see `news.py` for an example of loading this dataset.

+### 2. Code
+Please see the [demo notebook](/demo.ipynb) for example usage of `subm.py` and `grad.py`:
+- `subm.py` has utility functions and a greedy optimiser for discrete optimisation.
+- `grad.py` has utility functions and an SGD optimiser for continuous optimisation. The SGD optimiser wasn't used; LBFGS from scipy was used instead.
+
+`models.py` has several models for summarisation as classifiers. Models were abstracted into the `Summ` class. This is particularly useful for creating a common pattern across the different summariser methods and for tuning hyperparameters. Please see the [models notebook](/models.ipynb) for a demo of `news.py` and `models.py`.
+
+`utils.py` has common functions such as `balanced accuracy`, which is used for evaluation.
+
+### 3. Crowd-sourced evaluation results
+The crowd-sourced evaluation results are in the file `crowdflower.csv`. The design and settings for this experiment are explained in the paper.
+
+### 4. Citing
+If you use this dataset, please cite this work at
 ```
 @inproceedings{bista2019compsumm,
   title={Comparative Document Summarisation via Classification},
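The HDF5 layout described in the README above maps directly onto `h5py`. The snippet below is a minimal loading sketch for one topic file and one split; it is illustrative only (it is not the repository's `news.py`, and the key names are taken from the README description, so treat them as assumptions).

```python
import h5py
import numpy as np

# Illustrative sketch: keys follow the structure documented in the README above.
with h5py.File("datasets/guncontrol.h5", "r") as hf:
    X = hf["data"][:]                  # averaged GloVe vectors, shape (n_articles, 300)
    y = np.array(hf["y"][:])           # two-group time-range labels
    train_idxs = hf["train_idxs"][:]   # row i holds the training indexes of split i
    test_idxs = hf["test_idxs"][:]
    val_idxs = hf["val_idxs"][:]

# Use split 0 of the repeated 70-20-10 train-test-val splits.
X_tr, y_tr = X[train_idxs[0]], y[train_idxs[0]]
X_te, y_te = X[test_idxs[0]], y[test_idxs[0]]
X_va, y_va = X[val_idxs[0]], y[val_idxs[0]]
print(X_tr.shape, X_te.shape, X_va.shape)
```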

crowdflower.csv

Lines changed: 555 additions & 0 deletions
Large diffs are not rendered by default.

datasets/beefban.h5

4.27 MB
Binary file not shown.

datasets/capital.h5

25.3 MB
Binary file not shown.

datasets/guncontrol.h5

20.5 MB
Binary file not shown.

demo.ipynb

Lines changed: 320 additions & 0 deletions
Large diffs are not rendered by default.

environment.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
name: compsumm
dependencies:
  - python=3
  - numpy
  - joblib
  - pandas
  - scipy
  - jupyter
  - matplotlib
  - scikit-learn
  - h5py
  - seaborn
  - pip:
    - mypy
    - profilehooks

grad.py

Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
import scipy as sp
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels
from typing import Tuple, Callable, Any, List
from functools import partial
from profilehooks import profile
from math import ceil
import time

### MMD-Grad ###
# @profile
def ekxs_cost_grad(A: np.ndarray, X: np.ndarray, gamma: float, **kwargs: Any) -> Tuple[float, np.ndarray]:
    '''Cost and gradient of the cross term -2 * mean k(x, a) w.r.t. the prototypes A.'''
    n, d = X.shape
    m = A.shape[0] // d
    A = A.reshape(m, d)
    Kxa = pairwise_kernels(X, A, metric="rbf", gamma=gamma)
    cost = -2 * Kxa.mean()
    Grad = np.zeros((m, d))
    for l in range(A.shape[0]):
        Grad[l] -= ((X - A[l]).T * Kxa[:, l]).T.mean(axis=0)
    return cost, 4 * gamma / m * Grad.flatten()


def mmd_cost_grad(A: np.ndarray, X: np.ndarray, gamma: float, **kwargs: Any) -> Tuple[float, np.ndarray]:
    '''Cost and gradient of MMD^2(X, A) (dropping the constant k(x, x') term) w.r.t. A.'''
    n, d = X.shape
    m = A.shape[0] // d
    A = A.reshape(m, d)
    Kxa = pairwise_kernels(X, A, metric="rbf", gamma=gamma)
    Kaa = pairwise_kernels(A, A, metric="rbf", gamma=gamma)
    cost = -2 * Kxa.mean() + Kaa.mean()
    Grad = np.zeros((m, d))
    for l in range(A.shape[0]):
        Grad[l] -= ((X - A[l]).T * Kxa[:, l]).T.mean(axis=0)
        Grad[l] += ((A - A[l]).T * Kaa[:, l]).T.mean(axis=0)
    return cost, 4 * gamma / m * Grad.flatten()


def ekxs_cost(A: np.ndarray, X: np.ndarray, gamma: float, **kwargs: Any) -> float:
    n, d = X.shape
    return -2 * pairwise_kernels(X, A.reshape(A.shape[0] // d, d), metric="rbf", gamma=gamma).mean()


def mmd_cost(A: np.ndarray, X: np.ndarray, gamma: float, **kwargs: Any) -> float:
    n, d = X.shape
    m = A.shape[0] // d
    A = A.reshape(m, d)
    Kxa = pairwise_kernels(X, A, metric="rbf", gamma=gamma)
    Kaa = pairwise_kernels(A, A, metric="rbf", gamma=gamma)
    return -2 * Kxa.mean() + Kaa.mean()


#### MMD grad with labels: different than other data
# @profile
def mmd_cost_grad_labels(A: np.ndarray, X: np.ndarray, y: np.ndarray,
                         gamma: float, gamma2: float, lambdaa: float = 0.0,
                         diff="data", **kwargs: Any) -> Tuple[float, np.ndarray]:
    cost = 0.0
    n, d = X.shape
    m = A.shape[0] // d
    A = A.reshape(m, d)
    Grad = np.zeros((m, d))
    classes = sorted(set(y))
    perclass = m // len(classes)

    for c, k in enumerate(classes):
        Ac = A[c * perclass:(c + 1) * perclass, :].flatten()
        Xc = X[np.where(y == k)[0], :]
        Xk = X[np.where(y != k)[0], :]
        # pull the prototypes of class k towards the data of class k
        cost_c, Grad_c = mmd_cost_grad(Ac, Xc, gamma)
        cost = cost + cost_c
        Grad[c * perclass:(c + 1) * perclass, :] = Grad_c.reshape(perclass, d)
        if lambdaa > 0:
            # push the prototypes of class k away from the other classes' data
            cost_k, Grad_k = 0.0, 0.0
            if diff == "data":
                cost_k, Grad_k = mmd_cost_grad(Ac, Xk, gamma2)
                Grad_k = Grad_k.reshape(perclass, d)
            elif diff == "EKxs":
                cost_k, Grad_k = ekxs_cost_grad(Ac, Xk, gamma2)
                Grad_k = Grad_k.reshape(perclass, d)
            cost -= lambdaa * cost_k
            Grad[c * perclass:(c + 1) * perclass, :] -= lambdaa * Grad_k
    if diff == "params" and lambdaa > 0:
        cost_k, Grad_k = mmd_cost_grad_params(A.reshape(-1), d, len(classes), gamma2)
        cost -= lambdaa * cost_k
        Grad -= lambdaa * Grad_k.reshape(m, d)
    return cost, Grad.flatten()


### cost only
def mmd_cost_params(A: np.ndarray, d: int, K: int, gamma: float) -> float:
    # print(A.shape, d, K, gamma)
    m = A.shape[0] // d
    A = A.reshape(m, d)
    mk = m // K
    cost = 0
    Kaa = pairwise_kernels(A, metric="rbf", gamma=gamma)

    for k in np.arange(K):
        r = np.arange(k * mk, (k + 1) * mk, 1)
        msk = np.zeros(m, dtype=bool)
        msk[r] = True
        cost_k = Kaa[msk, :][:, msk].mean() + Kaa[~msk, :][:, ~msk].mean() - 2 * Kaa[msk, :][:, ~msk].mean()
        cost += cost_k
    return cost


def mmd_cost_labels(A: np.ndarray, X: np.ndarray, y: np.ndarray,
                    gamma: float, gamma2: float, lambdaa: float = 0.0,
                    diff="data", **kwargs: Any) -> float:
    cost = 0.0
    _, d = X.shape
    m = A.shape[0] // d
    A = A.reshape(m, d)
    classes = sorted(set(y))
    perclass = m // len(classes)

    for c, k in enumerate(classes):
        Ac = A[c * perclass:(c + 1) * perclass, :].flatten()
        Xc = X[np.where(y == k)[0], :]
        Xk = X[np.where(y != k)[0], :]
        cost_c, Grad_c = mmd_cost_grad(Ac, Xc, gamma)
        cost = cost + cost_c
        cost_k = 0.0
        if diff == "data" and lambdaa > 0:
            cost_k = mmd_cost(Ac, Xk, gamma2)
        elif diff == "EKxs":
            cost_k = ekxs_cost(Ac, Xk, gamma2)
        cost -= lambdaa * cost_k
    if diff == "params" and lambdaa > 0:
        cost_k = mmd_cost_params(A.reshape(-1), d, len(classes), gamma2)
        cost -= lambdaa * cost_k
    return cost


## MMD with labels: different prototypes
# @profile
def mmd_cost_grad_params(A: np.ndarray, d: int, K: int, gamma: float) -> Tuple[float, np.ndarray]:
    # print(A.shape, d, K, gamma)
    m = A.shape[0] // d
    A = A.reshape(m, d)
    mk = m // K
    cost = 0
    Grad = np.zeros((m, d))
    Kaa = pairwise_kernels(A, metric="rbf", gamma=gamma)

    for k in np.arange(K):
        r = np.arange(k * mk, (k + 1) * mk, 1)
        msk = np.zeros(m, dtype=bool)
        msk[r] = True
        # MMD^2 between the prototypes of group k and all other prototypes
        cost_k = Kaa[msk, :][:, msk].mean() + Kaa[~msk, :][:, ~msk].mean() - 2 * Kaa[msk, :][:, ~msk].mean()
        cost += cost_k

        for l in range(m):
            if l in r:
                Grad[l] += (4 * gamma / mk * ((A[msk, :] - A[l]).T * Kaa[msk, :][:, l]).T.mean(axis=0))
                Grad[l] -= (4 * gamma / mk * ((A[~msk, :] - A[l]).T * Kaa[~msk, :][:, l]).T.mean(axis=0))
            if l not in r:
                Grad[l] += (4 * gamma / (m - mk) * ((A[~msk, :] - A[l]).T * Kaa[~msk, :][:, l]).T.mean(axis=0))
                Grad[l] -= (4 * gamma / (m - mk) * ((A[msk, :] - A[l]).T * Kaa[msk, :][:, l]).T.mean(axis=0))
    return cost, Grad.flatten()


####################################################################################################
########################################## OPTIMIZATION ############################################
def step_decay(epochs: int, drop: float = 0.5, epochs_drop: int = 10) -> float:
    '''
    step decay of learning rate
    '''
    return drop ** ((1 + epochs) // epochs_drop)


def gd(func: Callable[..., Tuple[float, np.ndarray]],
       param0: np.ndarray, lr: float = 0.1, beta: float = 0.9,
       decay: Callable[[int], float] = partial(step_decay, drop=0.5, epochs_drop=10),
       max_epochs: int = 100, tol: float = 1e-6, **kwargs) -> Tuple[np.ndarray, List[float]]:
    '''
    Gradient descent with momentum
    args:
        - func => f: (param, *args) -> cost, grad
        - param0 => initial guess
        - kwargs => optional arguments to the cost/grad function
        - lr => learning rate
        - beta => momentum parameter
        - max_epochs => maximum number of epochs
        - tol => tolerance of param for stopping criteria
    returns:
        - param: optimised parameter
        - costs: list of costs evaluated in each iteration
    '''
    costs = []
    V = np.zeros_like(param0)
    for epoch in range(max_epochs):
        cost, grad = func(param0, **kwargs)
        V = beta * V + grad
        lr_ = lr * decay(epoch)
        param = param0 - lr_ * V
        costs.append(cost)
        if np.abs(param - param0).sum() <= tol:
            # if i >= 1 and np.abs(costs[-1] - costs[-2]) <= tol:
            break
        param0 = param
    return param, costs


def sgd(func, param0, X, y=None, batch_size=100, lr=0.1, beta=0.9,
        decay=partial(step_decay, drop=0.5, epochs_drop=5), tol=1e-6,
        max_epochs=100, **kwargs):
    '''
    Stochastic gradient descent with momentum
    args:
        - func => f: (param, *args) -> cost, grad
        - param0 => initial guess
        - args => optional arguments to the cost/grad function
        - lr => learning rate
        - beta => momentum parameter
        - max_epochs => maximum number of epochs
        - tol => tolerance of param for stopping criteria
    returns:
        - param: optimised parameter
        - costs: list of costs evaluated in each iteration
    '''
    costs = []
    N, _ = X.shape
    V = np.zeros_like(param0)
    num_batches = ceil(N / batch_size)
    print("starting sgd with {} batches".format(num_batches))
    for epoch in range(max_epochs):
        for i in range(num_batches):
            if y is not None:
                cost, grad = func(param0.flatten(),
                                  X=X[i * batch_size: (i + 1) * batch_size],
                                  y=y[i * batch_size: (i + 1) * batch_size],
                                  **kwargs)
            else:
                cost, grad = func(param0.flatten(),
                                  X=X[i * batch_size: (i + 1) * batch_size],
                                  **kwargs)

            V = beta * V + grad.reshape(param0.shape)
            lr_ = lr * decay(epoch)
            param = param0 - lr_ * V
            costs.append(cost)
            if np.abs(param - param0).sum() <= tol:
                break
            param0 = param
        # print("epoch {} => {:.4f}".format(epoch + 1, cost))
    # print("sgd costs:", costs)
    return param, costs


#############################################################################
#################### Just some tests from here #########################
import h5py


def hdf5(path):

    with h5py.File(path, 'r') as hf:
        X = hf.get("data_pca85")[:]
        y = np.array(hf.get("target")[:], dtype=np.uint8)
        train_idxs = hf.get("train_idxs")[:]
        test_idxs = hf.get('test_idxs')[:]

    return X[train_idxs[0]], y[train_idxs[0]], X[test_idxs[0]], y[test_idxs[0]]


def eval_sgd():
    from scipy.optimize import check_grad, approx_fprime, minimize
    from sklearn.cluster import KMeans
    import os
    X_tr, y_tr, X_te, y_te = hdf5(
        os.environ["HOME"] + "/Nextcloud/datasets/usps.h5"
    )

    m = 4
    gamma = 0.04
    A0 = []
    # initialise two k-means centres (fit on the training data) per class
    for c in sorted(set(y_tr)):
        kmeans = KMeans(n_clusters=2, init="k-means++", random_state=29)
        kmeans.fit(X_tr)
        A0.append(kmeans.cluster_centers_)
    A0 = np.concatenate(A0, axis=0)
    print("kmeans completed")
    A, costs = sgd(mmd_cost_grad_labels, A0, X_tr, y_tr,
                   gamma=gamma, gamma2=gamma, lambdaa=0.1)
    print("sgd", costs)


def eval_lbfgs():
    from scipy.optimize import check_grad, approx_fprime, minimize
    from sklearn.cluster import KMeans
    import os
    X_tr, y_tr, X_te, y_te = hdf5(
        os.environ["HOME"] + "/Nextcloud/datasets/usps.h5"
    )

    m = 4
    gamma = 0.04
    A0 = []
    for c in sorted(set(y_tr)):
        kmeans = KMeans(n_clusters=2, init="k-means++", random_state=29)
        kmeans.fit(X_tr)
        A0.append(kmeans.cluster_centers_)
    A0 = np.concatenate(A0, axis=0)

    print("kmeans completed")
    opt = sp.optimize.minimize(mmd_cost_grad_labels, A0.flatten(),
                               args=(X_tr, y_tr, gamma, gamma, 0.1),
                               method='L-BFGS-B', jac=True, tol=1e-6,
                               options={'maxiter': 100, 'disp': True})
    # A = opt.x.reshape(m, X_tr.shape[1])
    print("LBFGS", opt.x.shape)


def main():
    eval_sgd()
    eval_lbfgs()


if __name__ == '__main__':
    main()
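As a quick sanity check of the analytic gradients in `grad.py`, `mmd_cost_grad` can be compared against a finite-difference estimate on random data. This is a minimal sketch, not part of the commit, and it assumes `grad.py` (and its dependencies such as `profilehooks`) are importable from the working directory:

```python
import numpy as np
from scipy.optimize import check_grad

from grad import mmd_cost, mmd_cost_grad  # assumes grad.py is on the path

rng = np.random.RandomState(0)
X = rng.randn(50, 5)              # 50 data points in 5 dimensions
A0 = rng.randn(2, 5).flatten()    # 2 prototypes, flattened as the optimisers expect
gamma = 0.5

# check_grad returns the norm of the difference between the analytic gradient
# and a finite-difference estimate; it should be small if the two agree.
err = check_grad(lambda a: mmd_cost(a, X, gamma),
                 lambda a: mmd_cost_grad(a, X, gamma)[1],
                 A0)
print("gradient check error:", err)
```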

install.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
#!/bin/bash
echo "Installing/updating conda environment and dependencies"
conda env create -f environment.yml || conda env update
echo "creating .env for autoenv, follow README.md for its installation"
rm .env
echo "source activate compsumm > /dev/null 2>&1" >> .env
echo "Installation completed !!!"

0 commit comments

Comments
 (0)