Commit 706c372

Merge pull request #201 from vgteam/odgi_sort_psgd_k_default_val_eval
Odgi sort psgd k default val eval
2 parents 1396501 + b9ff2f3 commit 706c372

2 files changed: +161 -219 lines changed

docs/asciidocs/odgi_sort.adoc

Lines changed: 21 additions & 62 deletions
@@ -26,15 +26,13 @@ determine the node order:
2626
next node in the prior graph order that has not been sorted, yet. The cycle breaking algorithm applies a DFS sort until
2727
a cycle is found. We break and start a new DFS sort phase from where we stopped.
2828
- A random sort: The graph is randomly sorted. The node order is randomly shuffled from http://www.cplusplus.com/reference/random/mt19937/[Mersenne Twister pseudo-random] generated numbers.
29-
- A sparse matrix mondriaan sort: We can partition a hypergraph with integer weights and uniform hyperedge costs using the http://www.staff.science.uu.nl/~bisse101/Mondriaan/[Mondriaan] partitioner.
3029
- A 1D linear SGD sort: Odgi implements a 1D linear, variation graph adjusted, multi-threaded version of the https://arxiv.org/abs/1710.04626[Graph Drawing
3130
by Stochastic Gradient Descent] algorithm. The force-directed graph drawing algorithm minimizes the graph's energy function
3231
or stress level. It applies stochastic gradient descent (SGD) to move a single pair of nodes at a time.
33-
- A path guided, 1D linear SGD sort: The major bottleneck of the 1D linear SGD sort is that the memory allocation is quadratic
34-
in number of nodes. So it does not scale for large graphs. This issue is tackled by the path guided, 1D linear SGD sort.
35-
Instead of precalculating all terms, it can use a path index to pick the terms to move stochastically. If ran with 1 thread only,
36-
the resulting order of the graph is deterministic. Ony can vary the seed.
37-
- An eades algorithmic sort: Use http://www.it.usyd.edu.au/~pead6616/old_spring_paper.pdf[Peter Eades' heuristic for graph drawing].
32+
- A path guided, 1D linear SGD sort: Odgi implements a 1D linear, variation graph adjusted, multi-threaded version of the https://arxiv.org/abs/1710.04626[Graph Drawing
33+
by Stochastic Gradient Descent] algorithm. The force-directed graph drawing algorithm minimizes the graph's energy function
34+
or stress level. It applies stochastic gradient descent (SGD) to move a single pair of nodes at a time. The path index is used to pick the terms to move stochastically. If run with only 1 thread,
35+
the resulting order of the graph is deterministic. The seed is adjustable.
3836

3937
Sorting the paths in a graph may refine the sorting process. For the users' convenience, it is possible to specify a whole
4038
pipeline of sorts within one parameter.
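
The pairwise move described for the SGD sorts above can be illustrated with a minimal sketch. It follows the update rule from the cited Graph Drawing by Stochastic Gradient Descent paper, reduced to one dimension. The function name `sgd_pair_update`, the list-based layout `x`, and the target distance `d_ij` are hypothetical names chosen for illustration; this is not odgi's implementation.

```python
def sgd_pair_update(x, i, j, d_ij, eta):
    """Move nodes i and j along the 1D layout so that their separation
    approaches the target distance d_ij (illustrative sketch only)."""
    w = 1.0 / (d_ij ** 2)          # weight of the term, as in the SGD paper
    mu = min(w * eta, 1.0)         # capped step size
    delta = x[i] - x[j]
    mag = abs(delta)
    # Pick an arbitrary direction when both nodes share a position.
    unit = delta / mag if mag > 0 else 1.0
    step = mu * (mag - d_ij) / 2.0 * unit
    x[i] -= step                   # both nodes move by the same amount,
    x[j] += step                   # in opposite directions


positions = [0.0, 10.0]
sgd_pair_update(positions, 0, 1, d_ij=4.0, eta=100.0)
print(positions)
```

With a large learning rate the step size is capped at 1, so the pair lands exactly at the target distance; smaller learning rates move the pair only part of the way.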
@@ -80,62 +78,19 @@ pipeline of sorts within one parameter.
8078
*-r, --random*::
8179
Randomly sort the graph.
8280

83-
=== Mondriaan Sort
84-
85-
*-m, --mondriaan*::
86-
Use the sparse matrix diagonalization to sort the graph.
87-
88-
*-N, --mondriaan-n-parts*=_N_::
89-
Number of partitions for the mondriaan sort.
90-
91-
*-E, --mondriaan-epsilon*=_N_::
92-
Set the epsilon parameter for the mondriaan sort.
93-
94-
*-W, --mondriaan-path-weight*::
95-
Weight the mondriaan input matrix by the path coverage of edges.
96-
97-
=== 1D Linear SGD Sort
98-
99-
*-S, --linear-sgd*::
100-
Apply 1D linear SGD algorithm to sort the graph.
101-
102-
*-O, --sgd-bandwidth*=_sgd-bandwidth_::
103-
Bandwidth of linear SGD model. The default value is _1000_.
104-
105-
*-Q, --sgd-sampling-rate*=_sgd-sampling-rate_::
106-
Sample pairs of nodes with probability distance between them divided by the sampling rate. The default value is _20_.
107-
108-
*-K, --sgd-use-paths*::
109-
Use the paths to structure the distances between nodes in SGD.
110-
111-
*-T, --sgd-iter-max*=_sgd_iter-max_::
112-
The maximum number of iterations for the linear SGD model. The default value is _30_.
113-
114-
*-V, --sgd-eps*=_sgd-eps_::
115-
The final learning rate for the linear SGD model. The default value is _0.01_.
116-
117-
*-C, --sgd-delta*=_sgd-delta_::
118-
The threshold of the maximum node displacement, approximately in base pairs, at which to stop SGD.
119-
12081
=== Path Guided 1D Linear SGD Sort
12182

12283
*-Y, --path-sgd*::
12384
Apply path guided 1D linear SGD algorithm to organize the graph.
12485

125-
*-J, --path-sgd-sample-from-paths*::
126-
Instead of sampling the first node from all nodes we sample from all nucleotide positions of the paths. Default value is _FALSE_.
127-
128-
*-l, --path-sgd-sample-from-path-steps*::
129-
Instead of sampling the first node from all nodes we sample from all path steps of the paths. Default value is _FALSE_.
130-
131-
*-I, --path-sgd-deterministic*::
132-
Run the path guided 1D linear SGD in deterministic mode. Will automatically set the number of threads to 1, multithreading is not supported in this mode. Default value is _FALSE_.
86+
*-X, --path-index*=_FILE_::
87+
Load the path index from this _FILE_.
13388

13489
*-f, --path-sgd-use-paths*=FILE::
13590
Specify a line-separated list of paths to sample from for the on-the-fly term generation process in the path guided linear 1D SGD. The default is _all paths_.
13691

13792
*-G, --path-sgd-min-term-updates-paths*=_N_::
138-
The minimum number of terms to be updated before a new path guided linear 1D SGD iteration with adjusted learning rate eta starts, expressed as a multiple of total path length. The default value is _0.1_. Can be overwritten by _-U, -path-sgd-min-term-updates-nodes=N_.
93+
The minimum number of terms to be updated before a new path guided linear 1D SGD iteration with adjusted learning rate eta starts, expressed as a multiple of total path steps. The default value is _1.0_. Can be overwritten by _-U, --path-sgd-min-term-updates-nodes=N_.
13994

14095
*-U, --path-sgd-min-term-updates-nodes*=_N_::
14196
The minimum number of terms to be updated before a new path guided linear 1D SGD iteration with adjusted learning rate eta starts, expressed as a multiple of the number of nodes. By default, the argument is not set and the default of _-G, --path-sgd-min-term-updates-paths=N_ is used.
@@ -147,19 +102,28 @@ pipeline of sorts within one parameter.
147102
The final learning rate for path guided linear 1D SGD model. The default value is _0.01_.
148103

149104
*-v, --path-sgd-eta-max*=_N_::
150-
The first and maximum learning rate for path guided linear 1D SGD model. The default value is _number of nodes in the graph_.
105+
The first and maximum learning rate for path guided linear 1D SGD model. The default value is _squared steps of longest path in graph_.
151106

152107
*-a, --path-sgd-zipf-theta*=_N_::
153108
The theta value for the Zipfian distribution which is used as the sampling method for the second node of one term in the path guided linear 1D SGD model. The default value is _0.99_.
154109

155110
*-x, --path-sgd-iter-max*=_N_::
156-
The maximum number of iterations for path guided linear 1D SGD model. The default value is 30.
111+
The maximum number of iterations for path guided linear 1D SGD model. The default value is _30_.
157112

158-
*-F, --iteration-max-learning-rate::
159-
The iteration where the learning rate is max for path guided linear 1D SGD model. The default value is 0.
113+
*-F, --iteration-max-learning-rate*=_N_::
114+
The iteration where the learning rate is max for path guided linear 1D SGD model. The default value is _0_.
160115

161116
*-k, --path-sgd-zipf-space*=_N_::
162-
The maximum space size of the Zipfian distribution which is used as the sampling method for the second node of one term in the path guided linear 1D SGD model. The default value is the _maximum path lengths_.
117+
The maximum space size of the Zipfian distribution which is used as the sampling method for the second node of one term in the path guided linear 1D SGD model. The default value is the _longest path length_.
118+
119+
*-I, --path-sgd-zipf-space-max*=_N_::
120+
The maximum space size of the Zipfian distribution beyond which quantization occurs. Default value is _100_.
121+
122+
*-l, --path-sgd-zipf-space-quantization-step*=_N_::
123+
Quantization step size when the maximum space size of the Zipfian distribution is exceeded. Default value is _100_.
124+
125+
*-y, --path-sgd-zipf-max-num-distributions*=_N_::
126+
Approximate maximum number of Zipfian distributions to calculate. The default value is _100_.
163127

164128
*-q, --path-sgd-seed*=_N_::
165129
Set the seed for the deterministic 1-threaded path guided linear 1D SGD model. The default value is _pangenomic!_.
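
The Zipfian options above describe how the second node of a term is chosen. As a rough sketch, assuming the sampled value is an offset between 1 and the space size with probability proportional to `k ** -theta`, the sampling could look like the following. The helper names `zipf_weights` and `sample_zipf_offset` are hypothetical, and the quantization controlled by the space-max and quantization-step options is not modeled here.

```python
import random


def zipf_weights(space, theta):
    """Unnormalized Zipfian weights for offsets 1..space."""
    return [1.0 / (k ** theta) for k in range(1, space + 1)]


def sample_zipf_offset(space, theta=0.99, rng=random):
    """Draw one offset in [1, space]; small offsets are far more likely."""
    weights = zipf_weights(space, theta)
    return rng.choices(range(1, space + 1), weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed for reproducible draws
offsets = [sample_zipf_offset(100, 0.99, rng) for _ in range(1000)]
print(min(offsets), max(offsets))
```

Because nearby offsets dominate the draws, most sampled terms pair nodes that are close together, while distant pairs are still sampled occasionally.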
@@ -168,11 +132,6 @@ pipeline of sorts within one parameter.
168132
Set the prefix to which each snapshot graph of a path guided 1D SGD iteration should be written to. This is turned off per default.
169133
This argument only works when _-Y, --path-sgd_ was specified. Not applicable in a pipeline of sorts.
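
The learning rate options above (first and maximum rate, final rate, maximum iterations, and the iteration with the maximum learning rate) suggest a decaying schedule. A minimal sketch, assuming the exponential decay used in the cited SGD graph drawing paper and assuming the peak is shifted to the chosen iteration, might look like this. The function name and the exact formula are assumptions for illustration, not a confirmed description of odgi's code.

```python
import math


def learning_rate_schedule(eta_max, eps, iter_max, iter_with_max_learning_rate=0):
    """Return one learning rate per iteration, peaking at eta_max and
    decaying exponentially toward eps (assumed schedule, sketch only)."""
    if iter_max < 2:
        return [eta_max] * iter_max
    lam = math.log(eta_max / eps) / (iter_max - 1)
    return [
        eta_max * math.exp(-lam * abs(t - iter_with_max_learning_rate))
        for t in range(iter_max)
    ]


etas = learning_rate_schedule(eta_max=100.0, eps=0.01, iter_max=30)
print(etas[0], etas[-1])
```

With the peak at iteration 0, the schedule starts at the maximum rate and reaches the final rate on the last iteration.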
170134

171-
=== Eades Sort
172-
173-
*-e, --eades*::
174-
Use eades algorithm.
175-
176135
=== Path Sorting Options
177136

178137
*-L, --paths-min*::
