This is an open-source framework for self-supervised/unsupervised graph embedding implemented in PyTorch, migrated from the earlier version implemented in TensorFlow.
- A unified framework: We provide a unified framework for self-supervised/unsupervised node representation learning. Our models include unsupervised network embedding (NE) methods (DeepWalk, Node2vec, HOPE, GraRep, LLE, Lap, TADW, GF, LINE, SDNE) and recent self-supervised graph embedding methods (GAE, VGAE).
- More datasets: We provide both unattributed datasets (Wiki, BlogCatalog, Flickr, Wikipedia, PPI) and attributed datasets (Cora, CiteSeer, Pubmed) of all sizes.
- Efficiency: We provide faster, more efficient models and better default hyper-parameter settings than the previous version. The table below shows the performance of OpenNE-PyTorch models on the Cora dataset compared with the previous version, where "F1/Accuracy" refers to accuracy for GCN and micro F1-score for the other models, and "Time" refers to training time. Hyperparameters are set to default values unless specified in "Remarks". We also list results of our new models, GAE and VGAE.
Method | Time (OpenNE old) | Time (OpenNE-PyTorch) | F1/Accuracy (OpenNE old) | F1/Accuracy (OpenNE-PyTorch) | Remarks |
---|---|---|---|---|---|
DeepWalk | 85.85 | 74.98 | .832 | .832 | - |
Node2vec | 143.67 | 38.18 | .814 | .807 | - |
HOPE | 2.66 | 2.45 | .634 | .743 | - |
GraRep | 44.27 | 4.04 | .770 | .776 | - |
TADW | 43.42 | 59.12 | .852 | .843 | - |
GF | 15.01 | 19.53 | .546 | .775 | default # epochs changed |
LINE | 86.75 | 98.69 | .417 | .722 | default # epochs changed |
SDNE | 195.02 | 10.22 | .532 | .742 | - |
GCN | 17.4 | 11.22 | .857 | .861 | --sparse |
GAE | - | 55.97 | - | .788 | - |
VGAE | - | 124.03 | - | .809 | - |
See Experimental Results for performances on Wiki and BlogCatalog.
- Modularity: We divide the code into three parts: Dataloader, Model and Task. Users can easily customize datasets, methods and tasks, and can also define their own datasets and methods.
We plan to add more models and tasks to the framework. Our future plans include:
- More self-supervised models such as ARGA/ARVGA, GALA and AGE.
- New tasks for link prediction, graph clustering and graph classification.
You are welcome to add your own datasets and methods by submitting pull requests.
- Clone this repo.
- Enter the directory where you cloned it, and run the following commands:
pip install -r requirements.txt
cd src
- You can start using OpenNE by simply changing directory to OpenNE/src.
If instead you want to install OpenNE as a site-package, run the following command in OpenNE/src:
python setup.py install
It is easy to get started with OpenNE. Here are some commands for basic usage with default values:
python -m openne --model gf --dataset blogcatalog
python -m openne --model gcn --dataset cora
Parameters like `--sparse` have action `store_true`, which means they are `False` by default and should be specified if you want to set them to `True`. Run GCN with sparse matrices with the following command:
python -m openne --model gcn --dataset cora --sparse
You can use `store_false` parameters, e.g. `--no-save`, in a similar way:
python -m openne --model gcn --dataset cora --sparse --no-save
OpenNE saves your models and training results to file by default, which may take extra time. Use the above command if you do not wish to save the results.
Use `--local-dataset` (which is also a `store_true` parameter!) and specify `--root-dir`, `--edgefile`/`--adjfile`, `--labelfile`, `--features` and `--status` to import a dataset from file.
Optionally, specify the `store_true` parameters `--weighted` and `--directed` to view the graph as weighted and/or directed.
If you wish to use your dataset in "~/mydataset", which includes edges.txt, an edgelist file, and labels.txt, a label file, input the following:
python -m openne --model gf --local-dataset --root-dir ~/mydataset --edgefile edges.txt --labelfile labels.txt
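For reference, a minimal sketch of what such files might contain is shown below. The exact formats are an assumption here (whitespace-separated "source target" pairs for the edgelist and "node label" pairs for the label file), so consult the Dataloader code for the authoritative specification.
edges.txt (one edge per line):
0 1
0 2
1 2
labels.txt (one node and its label per line):
0 A
1 A
2 B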
While all parameter names must be provided in lower case, string input values are case insensitive:
python -m openne --model SDnE --dataset coRA
To provide a Python list (as for `--encoder-layer-list` in SDNE and `--hiddens` in GCN), input each element separated by spaces:
python -m openne --model sdne --dataset cora --encoder-layer-list 1000 128
OpenNE uses CUDA by default if `torch.cuda.is_available() == True`. To disable CUDA, use `--cpu`.
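If you are unsure whether CUDA will be used, you can check what PyTorch reports before deciding on `--cpu`:
python -c "import torch; print(torch.cuda.is_available())"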
When CUDA is enabled, you can select multiple GPU devices with `--devices [device_ids]`, where `[device_ids]` is a list of integers; your model and input are stored on the first device. Use `--data-parallel` to utilize data parallelism on the chosen devices.
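As an illustration, a multi-GPU run might look like the following, assuming device ids are passed as space-separated integers (the same convention as the list parameters described above):
python -m openne --model line --dataset cora --devices 0 1 --data-parallel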
You can check out the other options available in OpenNE using:
python -m openne --help
- `--model {deepwalk, line, node2vec, grarep, tadw, gcn, lap, gf, hope, sdne}`, the specified NE model;
- `--dataset {ppi, wikipedia, flickr, blogcatalog, wiki, pubmed, cora, citeseer}`, a standard dataset provided by OpenNE;
If instead you want to create a dataset from file, you can provide your own graph by using the switch `--local-dataset` (action `store_true`; mutually exclusive with `--dataset`) and the following arguments:

- `--root-dir`, root directory of input files. If empty, you should provide absolute paths for graph files;
- `--edgefile`, description of input graph in edgelist format;
- `--adjfile`, description of input graph in adjlist format (mutually exclusive with `--edgefile`);
- `--label-file`, node label file;
- `--features`, node feature file for certain models (optional);
- `--name`, dataset name, "SelfDefined" by default;
- `--weighted`, view graph as weighted (action `store_true`);
- `--directed`, view graph as directed (action `store_true`);
For general training options:
- `--dim`, dimension of node representation, 128 by default;
- `--clf-ratio`, the ratio of training data for node classification, 0.5 by default;
- `--no-save`, choose not to save the result (action `store_false`, dest=save);
- `--output`, output file for vectors, which will be saved to "results" by default;
- `--sparse`, calculate by sparse matrices (action `store_true`) (only supports lle & gcn);
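As an illustration, the general options can be combined with any model; the values below are arbitrary examples rather than recommended settings:
python -m openne --model deepwalk --dataset wiki --dim 64 --clf-ratio 0.8 --no-save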
For models with multiple epochs:
- `--epochs`, number of epochs;
- `--validate`, `True` if validation is needed; by default it is `False` except with GCN;
- `--validation-interval`, number of epochs between two validations, 5 by default;
- `--debug-output-interval`, number of epochs between two debug outputs, 5 by default;
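For example, to train LINE for 40 epochs with less frequent debug output (the values are illustrative):
python -m openne --model line --dataset cora --epochs 40 --debug-output-interval 10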
For device options:
- `--cpu`, force OpenNE to run on CPU. Ignored if `torch.cuda.is_available() == False`.
- `--devices`, specify CUDA devices for OpenNE to run on (default 0). Devices other than `device_id[0]` are ignored except with `--data-parallel`. Ignored if `torch.cuda.is_available() == False`.
- `--data-parallel`, split input batch and perform data parallelism (action `store_true`). Only works for methods with `--batch-size` (i.e. line, sdne).
GraphFactorization:
- `--weight-decay`, weight for l2-loss of embedding matrix (1.0 by default);
- `--lr`, learning rate (0.003 by default)
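An illustrative GraphFactorization run with these options written out explicitly (the values are simply the documented defaults):
python -m openne --model gf --dataset wiki --lr 0.003 --weight-decay 1.0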
GraRep:
- `--kstep`, use k-step transition probability matrix (requires `dim % kstep == 0`).
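For example, the following choice satisfies the constraint, since 128 is divisible by 4 (the value 4 itself is only illustrative):
python -m openne --model grarep --dataset wiki --dim 128 --kstep 4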
HOPE:
- `--measurement {katz, cn, rpr, aa}`, measurement matrix, `katz` by default;
- `--beta`, parameter for the katz measurement, 0.02 by default;
- `--alpha`, parameter for the rpr measurement, 0.5 by default;
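An illustrative HOPE run using the katz measurement with its documented default `--beta`:
python -m openne --model hope --dataset wiki --measurement katz --beta 0.02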
LINE:
- `--lr`, learning rate, 0.001 by default;
- `--batch-size`, 1024 by default;
- `--negative-ratio`, 5 by default;
- `--order`, 1 for the 1st-order, 2 for the 2nd-order and 3 for 1st + 2nd; 3 by default;
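An illustrative LINE run with the documented defaults written out explicitly:
python -m openne --model line --dataset cora --order 3 --negative-ratio 5 --batch-size 1024 --lr 0.001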
SDNE:
- `--encoder-layer-list`, list of neuron numbers at each encoder layer. In SDNE, the last number of `--encoder-layer-list`, instead of `--dim`, is the dimension of the output node representation; [128] by default;
- `--alpha`, parameter that controls the first-order proximity loss, 1e-6 by default;
- `--beta`, parameter used to construct matrix B, 5 by default;
- `--nu1`, parameter that controls the l1-loss of the autoencoder weights, 1e-8 by default;
- `--nu2`, parameter that controls the l2-loss of the autoencoder weights, 1e-5 by default;
- `--bs`, batch size, 200 by default;
- `--lr`, learning rate, 0.001 by default;
- `--decay`, allow decay in learning rate (action `store_true`);
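An illustrative SDNE run; the encoder layer sizes are arbitrary examples (the last value, 128, becomes the embedding dimension):
python -m openne --model sdne --dataset cora --encoder-layer-list 512 128 --bs 200 --decay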
TADW: (requires an attributed graph, e.g. cora, pubmed, citeseer)

- `--lamb`, parameter that controls the weight of regularization terms, 0.4 by default;
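An illustrative TADW run on an attributed dataset with the documented default `--lamb`:
python -m openne --model tadw --dataset cora --lamb 0.4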
GCN: (requires attributed graph)
- `--lr`, learning rate, 0.01 by default;
- `--dropout`, dropout rate, 0.5 by default;
- `--weight-decay`, weight for l2-loss of embedding matrix, 0.0001 by default;
- `--hiddens`, list of neuron numbers in each hidden layer, [16] by default;
- `--max-degree`, maximum Chebyshev polynomial degree, 0 (disable Chebyshev polynomials) by default;
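An illustrative GCN run with the documented defaults written out and sparse matrices enabled:
python -m openne --model gcn --dataset citeseer --lr 0.01 --dropout 0.5 --hiddens 16 --sparse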
GAE and VGAE: (requires attributed graph) share the same parameter list as GCN.

- `--lr`, 0.01 by default;
- `--dropout`, 0.0 by default;
- `--weight-decay`, 1e-4 by default;
- `--early-stopping`, 100 by default;
- `--hiddens`, [32] by default;
- `--max-degree`, 0 by default;
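An illustrative VGAE run, assuming gae and vgae are selected through `--model` in the same way as the other methods:
python -m openne --model vgae --dataset cora --hiddens 32 --dropout 0.0 --early-stopping 100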
DeepWalk and node2vec:
- `--num-paths`, number of random walks starting at each node, 10 by default;
- `--path-length`, length of each random walk, 80 by default;
- `--window`, window size of the skip-gram model, 10 by default;
- `--q` (node2vec only), 1.0 by default;
- `--p` (node2vec only), 1.0 by default.
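An illustrative node2vec run; the `--p` and `--q` values here are arbitrary examples rather than tuned settings:
python -m openne --model node2vec --dataset blogcatalog --num-paths 10 --path-length 80 --window 10 --p 1.0 --q 0.25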
We provide experimental results of OpenNE models on the Wiki and BlogCatalog datasets. For performance on Cora, see the section "Overview" - "New Features".
Wiki
Algorithm | F1-micro | F1-macro | Time | Remarks |
---|---|---|---|---|
HOPE | 0.613 | 0.432 | 1.89 | - |
GF | 0.618 | 0.432 | 61.22 | - |
GraRep | 0.608 | 0.42 | 4.33 | - |
Node2vec | 0.656 | 0.535 | 49.18 | - |
DeepWalk | 0.662 | 0.522 | 97.47 | - |
SDNE | 0.655 | 0.522 | 81.19 | - |
LINE | 0.631 | 0.488 | 234.12 | epochs=40 |
BlogCatalog
Algorithm | F1-micro | F1-macro | Time | Remarks |
---|---|---|---|---|
HOPE | 0.336 | 0.157 | 96.63 | - |
GF | 0.235 | 0.066 | 800.02 | - |
GraRep | 0.399 | 0.233 | 103.27 | - |
Node2Vec | 0.396 | 0.26 | 1962.93 | - |
DeepWalk | 0.398 | 0.261 | 516.64 | - |
SDNE | 0.372 | 0.232 | 1323.93 | - |
LINE | 0.384 | 0.235 | 4739.79 | - |
If you find OpenNE useful for your research, please consider citing the following papers:
@InProceedings{perozzi2014deepwalk,
Title = {Deepwalk: Online learning of social representations},
Author = {Perozzi, Bryan and Al-Rfou, Rami and Skiena, Steven},
Booktitle = {Proceedings of KDD},
Year = {2014},
Pages = {701--710}
}
@InProceedings{tang2015line,
Title = {Line: Large-scale information network embedding},
Author = {Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu},
Booktitle = {Proceedings of WWW},
Year = {2015},
Pages = {1067--1077}
}
@InProceedings{grover2016node2vec,
Title = {node2vec: Scalable feature learning for networks},
Author = {Grover, Aditya and Leskovec, Jure},
Booktitle = {Proceedings of KDD},
Year = {2016},
Pages = {855--864}
}
@article{kipf2016semi,
Title = {Semi-Supervised Classification with Graph Convolutional Networks},
Author = {Kipf, Thomas N and Welling, Max},
journal = {arXiv preprint arXiv:1609.02907},
Year = {2016}
}
@InProceedings{cao2015grarep,
Title = {Grarep: Learning graph representations with global structural information},
Author = {Cao, Shaosheng and Lu, Wei and Xu, Qiongkai},
Booktitle = {Proceedings of CIKM},
Year = {2015},
Pages = {891--900}
}
@InProceedings{yang2015network,
Title = {Network representation learning with rich text information},
Author = {Yang, Cheng and Liu, Zhiyuan and Zhao, Deli and Sun, Maosong and Chang, Edward},
Booktitle = {Proceedings of IJCAI},
Year = {2015}
}
@Article{tu2017network,
Title = {Network representation learning: an overview},
Author = {TU, Cunchao and YANG, Cheng and LIU, Zhiyuan and SUN, Maosong},
Journal = {SCIENTIA SINICA Informationis},
Volume = {47},
Number = {8},
Pages = {980--996},
Year = {2017}
}
@inproceedings{ou2016asymmetric,
title = {Asymmetric transitivity preserving graph embedding},
author = {Ou, Mingdong and Cui, Peng and Pei, Jian and Zhang, Ziwei and Zhu, Wenwu},
booktitle = {Proceedings of the 22nd ACM SIGKDD},
pages = {1105--1114},
year = {2016},
organization = {ACM}
}
@inproceedings{belkin2002laplacian,
title = {Laplacian eigenmaps and spectral techniques for embedding and clustering},
author = {Belkin, Mikhail and Niyogi, Partha},
booktitle = {Advances in neural information processing systems},
pages = {585--591},
year = {2002}
}
@inproceedings{ahmed2013distributed,
title = {Distributed large-scale natural graph factorization},
author = {Ahmed, Amr and Shervashidze, Nino and Narayanamurthy, Shravan and Josifovski, Vanja and Smola, Alexander J},
booktitle = {Proceedings of the 22nd international conference on World Wide Web},
pages = {37--48},
year = {2013},
organization = {ACM}
}
@inproceedings{wang2016structural,
title = {Structural deep network embedding},
author = {Wang, Daixin and Cui, Peng and Zhu, Wenwu},
booktitle = {Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining},
pages = {1225--1234},
year = {2016},
organization = {ACM}
}
@inproceedings{kipf2016variational,
title = {Variational graph auto-encoders},
author = {Kipf, Thomas N and Welling, Max},
booktitle = {NIPS Workshop on Bayesian Deep Learning},
numpages = {3},
year = {2016}
}
The OpenNE-pytorch Project is contributed by Yufeng Du, Ganqu Cui and Jie Zhou.
- Zhiyuan Liu, Tsinghua University. Homepage
- Cheng Yang, Beijing University of Posts and Telecommunications. Homepage
This research is supported by Tencent.