#LyX 2.0 created this file. For more info see http://www.lyx.org/
\lyxformat 413
\begin_document
\begin_header
\textclass article
\begin_preamble
\usepackage{amsmath}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\gini}{\mathtt{gini}}
\newcommand{\rce}{\mathtt{rce}}
\end_preamble
\use_default_options true
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding auto
\fontencoding global
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize default
\spacing single
\use_hyperref false
\papersize a4paper
\use_geometry true
\use_amsmath 1
\use_esint 1
\use_mhchem 1
\use_mathdots 1
\cite_engine basic
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\leftmargin 3cm
\topmargin 3cm
\rightmargin 3cm
\bottommargin 3cm
\headheight 3cm
\headsep 3cm
\footskip 3cm
\secnumdepth 3
\tocdepth 3
\paragraph_separation skip
\defskip bigskip
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\author 1 "" 
\author 3 "Eleftherios Garyfallidis,,," 
\end_header

\begin_body

\begin_layout Section
Highly Efficient Tractography Clustering 
\end_layout

\begin_layout Subsection
Overview
\end_layout

\begin_layout Standard
Current tractography propagation algorithms can generate massive tractographies
 which are difficult to interpret and visualize.
 A clustering of some kind seems to be a solution to simplify the complexity
 of these datasets and provide a useful segmentation; however, most proposed
 clustering algorithms are very slow and often need to calculate pairwise
 distances of size 
\begin_inset Formula $N\times N$
\end_inset

 where 
\begin_inset Formula $N$
\end_inset

 is the number of tracks.
 This number of comparisons puts a massive load on clustering algorithms,
 making them inefficient and therefore impractical for everyday analysis,
 as it is difficult to compute all these distances or even store them in
 memory.
 This adds a further overhead to the use of tractography for clinical applications
 and also puts a barrier to understanding and interpreting the quality
 of diffusion datasets.
 We show in this chapter that a stable, overall linear-time clustering algorithm
 exists and that we can generate meaningful clusters in seconds with minimal
 memory consumption.
 We can use this algorithm to simplify tractographies, identify hidden structures,
 find landmarks, create atlases, and compare and register tractographies.
 In our approach we do not need to calculate all pairwise distances, as most
 of the other existing methods do, and furthermore we can update our clustering
 online or in parallel.
 We show that we can generate meaningful clusters 
\begin_inset Formula $\sim1000$
\end_inset

 times faster than any other available method, even without parallelism.
 Moreover our method is multipurpose and it can be used as an input to other
 algorithms with higher order complexity which can now become more efficient.
 We show results from a few hundred to many millions of tracks.
\begin_inset Note Note
status open

\begin_layout Plain Layout
General comment - would it be fair to call this also GordianKnotBundles
 ? (Maybe GNB for Gordian_kNot_Bundles and GaryfallidisNimmo-SmithBrett
 :)).
 I mean that the method is a sort of cut-the-crap clustering.
 There is no reason to think this method will do a better job of clustering
 a set of close bundles than the other methods - correct? But it will do
 it faster.
 If it were me I would emphasize this more - like doing PCA before doing
 ICA or something, it is a crude but very fast first pass simplification
 of the data that can be used as input to a large variety of other methods,
 including, I suppose, itself, in order to find a global clustering (the
 most likely clustering for an infinite number of starting conditions).
\end_layout

\begin_layout Plain Layout
Maybe an emphasis on the advantage of being able to do clustering in real
 time for interactive applications and for display.
 
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
\begin_inset CommandInset label
LatexCommand label
name "sub:track-distances"

\end_inset

Track distances and preprocessing
\end_layout

\begin_layout Standard
For clarity we first give brief details of various metrics for distances
 between tracks as they are integral to an understanding of the track clustering
 literature.
 Numerous distance metrics between two trajectories have been proposed in
 the literature, such as in 
\begin_inset CommandInset citation
LatexCommand cite
key "Ding2003"

\end_inset

,
\begin_inset CommandInset citation
LatexCommand cite
key "MaddahIPMI2007"

\end_inset

,
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2005dti"

\end_inset

 with the most common being the Hausdorff distance found in 
\begin_inset CommandInset citation
LatexCommand cite
key "corouge2004towards"

\end_inset

 and many other papers.
 We mainly use a very simple distance proposed in 
\begin_inset CommandInset citation
LatexCommand cite
key "Visser2010"

\end_inset

 and by us 
\begin_inset CommandInset citation
LatexCommand cite
key "EGMB10"

\end_inset

 which we call the minimum average direct-flip (MDF) distance 
\begin_inset Formula $d_{mdf}(s_{A},s_{B})$
\end_inset

 between track 
\begin_inset Formula $s_{A}$
\end_inset

 and track 
\begin_inset Formula $s_{B}$
\end_inset

, see (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

).
 This distance can be applied only when both tracks have the same number
 of points.
 Therefore, an initial downsampling is necessary so that all segments on
 a track have the same length and all tracks have the same number of segments.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Is there somewhere else a discussion of what kind of characteristics the
 different distance methods might have? For example being sensitive to length,
 or allowing the tracks fairly short portions where they are close and thereafte
r allowing divergence as in track fanning etc? A discussion here of the
 problem of short vs long tracks?
\end_layout

\end_inset


\end_layout

\begin_layout Standard

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
\begin_inset Formula 
\begin{eqnarray}
d_{mdf} & = & \min(d_{d},d_{f})\label{eq:direct_flip_distance}\\
d_{d}(s_{A},s_{B}) & = & \frac{1}{K}\sum_{i=1}^{K}||x_{i}^{A}-x_{i}^{B}||_{2}\nonumber \\
d_{f}(s_{A},s_{B}) & = & \frac{1}{K}\sum_{i=1}^{K}||x_{i}^{A}-x_{K-i+1}^{B}||_{2}\nonumber
\end{eqnarray}

\end_inset


\end_layout

\begin_layout Standard
where 
\begin_inset Formula $K$
\end_inset

 is the number of points 
\begin_inset Formula $x$
\end_inset

 on both tracks 
\begin_inset Formula $A$
\end_inset

 and 
\begin_inset Formula $B$
\end_inset

.
\end_layout
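
\begin_layout Standard
As a small worked illustration of (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

), with hypothetical coordinates chosen only for this example, take 
\begin_inset Formula $K=3$
\end_inset

 and two parallel tracks stored in opposite directions, 
\begin_inset Formula $s_{A}=\{(0,0,0),(1,0,0),(2,0,0)\}$
\end_inset

 and 
\begin_inset Formula $s_{B}=\{(2,1,0),(1,1,0),(0,1,0)\}$
\end_inset

.
 Comparing the points of 
\begin_inset Formula $s_{A}$
\end_inset

 with those of 
\begin_inset Formula $s_{B}$
\end_inset

 in stored order for the direct term, and in reversed order for the flipped
 term, gives
\begin_inset Formula 
\begin{eqnarray*}
d_{d}(s_{A},s_{B}) & = & \frac{1}{3}(\sqrt{5}+1+\sqrt{5})\approx1.82\\
d_{f}(s_{A},s_{B}) & = & \frac{1}{3}(1+1+1)=1\\
d_{mdf}(s_{A},s_{B}) & = & \min(d_{d},d_{f})=1
\end{eqnarray*}

\end_inset

so in this example the flipped comparison recovers the unit separation of
 the two tracks regardless of the order in which their points happen to
 be stored.
\end_layout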

\begin_layout Standard
In some cases it is still valid to use a type of Hausdorff distance which
 for simplicity we denote the MAM distance, short for {minimum, maximum or
 mean} average minimum distance.
 We mostly use the mean version of that family of distances, see (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mean_average_distance"

\end_inset

) but the others are useful too as they could stress different properties
 of the datasets.
 These distances are slower to compute, but they can work with tracks that
 have different numbers of segments, which is useful for some applications.
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{eqnarray}
d_{avg}(s_{A},s_{B}) & = & \frac{1}{K_{A}}\sum_{i=1}^{K_{A}}d(x_{i}^{A},s_{B})\nonumber \\
d_{min}(s_{A},s_{B}) & = & \min_{i=1,...,K_{A}}d(x_{i}^{A},s_{B})\label{eq:mininum_distance}\\
d_{max}(s_{A},s_{B}) & = & \max_{i=1,...,K_{A}}d(x_{i}^{A},s_{B})\label{eq:maximum distance}\\
d(x_{i}^{A},s_{B}) & = & \min_{j=1,...,K_{B}}||x_{i}^{A}-x_{j}^{B}||_{2}\nonumber \\
m_{in} & = & \min(d_{avg}(s_{A},s_{B}),d_{avg}(s_{B},s_{A}))\label{eq:min_average_distance}\\
m_{ax} & = & \max(d_{avg}(s_{A},s_{B}),d_{avg}(s_{B},s_{A}))\nonumber \\
m_{ean} & = & (d_{avg}(s_{A},s_{B})+d_{avg}(s_{B},s_{A}))/2\label{eq:mean_average_distance}
\end{eqnarray}

\end_inset


\end_layout

\begin_layout Standard

\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
where here 
\begin_inset Formula $K_{A}$
\end_inset

 can be different from 
\begin_inset Formula $K_{B}$
\end_inset

.
 Finally, distances other than the average minimum, based on the minimum
 distance (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mininum_distance"

\end_inset

) or the maximum distance (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:maximum distance"

\end_inset

), can be used.
 However, we have not investigated them in this work in relation to clustering
 algorithms.
\end_layout
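
\begin_layout Standard
A small worked illustration of the MAM family, with hypothetical coordinates
 chosen only for this example, shows that 
\begin_inset Formula $d_{avg}$
\end_inset

 depends on the order of its arguments.
 Take 
\begin_inset Formula $s_{A}=\{(0,0,0),(2,0,0)\}$
\end_inset

 with 
\begin_inset Formula $K_{A}=2$
\end_inset

 and 
\begin_inset Formula $s_{B}=\{(0,1,0),(1,1,0),(2,1,0)\}$
\end_inset

 with 
\begin_inset Formula $K_{B}=3$
\end_inset

.
 Then
\begin_inset Formula 
\begin{eqnarray*}
d_{avg}(s_{A},s_{B}) & = & \frac{1}{2}(1+1)=1\\
d_{avg}(s_{B},s_{A}) & = & \frac{1}{3}(1+\sqrt{2}+1)\approx1.14\\
m_{in} & = & 1,\quad m_{ax}\approx1.14,\quad m_{ean}\approx1.07
\end{eqnarray*}

\end_inset

The combined forms 
\begin_inset Formula $m_{in}$
\end_inset

, 
\begin_inset Formula $m_{ax}$
\end_inset

 and 
\begin_inset Formula $m_{ean}$
\end_inset

 remove this dependence on the order of the arguments.
\end_layout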

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/distances.png
	scale 60
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:Distances_used"

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Distances used in this work.
 The main distance used is the minimum average direct-flip (MDF) distance 
\begin_inset Formula $d_{mdf}=\min(d_{d},d_{f})$
\end_inset

 which is a symmetric distance that can deal with the track direction problem
 and works on tracks which have the same number of points.
 Another distance is the mean average distance, which is again symmetric
 but does not need the tracks to have the same number of points, 
\begin_inset Formula $m_{ean}=(d_{avg}(s_{A},s_{B})+d_{avg}(s_{B},s_{A}))/2$
\end_inset

.
 In this figure the components of both distances are shown; the tracks are
 drawn with solid lines, and dashed lines connect the pairs of points on
 the two tracks which contribute a distance 
\begin_inset Formula $d$
\end_inset

 to the overall metric.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Coming back to the MDF distance (
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

), its main advantages are that it is fast to compute, it takes account of
 track direction issues, and it is easy to understand how it will behave,
 from the simplest case of parallel equi-length tracks to the most complicated
 case of very divergent tracks.
 Another advantage is that it will separate short tracks from long tracks,
 and as we will see this is a good way to find broken or erroneous
 tracks.
 Finally, an advantage of having tracks with the same number of points is
 that we can easily do statistical operations on them, for example adding
 them together.
 We will see in the next section that track addition is a key property of
 our clustering algorithm.
\begin_inset Note Note
status open

\begin_layout Plain Layout
Discussion here of the advantages and disadvantages of different numbers
 of points allowed in tracks? For example, short tracks having the same
 number of points as long tracks means that more of the curvature etc data
 from the long tracks will be lost relative to the short tracks - I suppose.
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Related Work
\end_layout

\begin_layout Standard
During the last 
\begin_inset Formula $10$
\end_inset

 years there have been numerous efforts from many researchers to address
 the unsupervised and supervised learning problems of brain tractography.
 All these methods suffer from low efficiency; however, they do provide many
 useful ideas.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
I got a bit lost in this section.
 Is there a way of summarizing the papers under different themes such as
 distance metric used, cluster number finding, clustering method or something
 like that?
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Moberts et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

 evaluated different hierarchical clustering methods including a less common
 one, shared nearest neighbor (SNN), against a gold standard segmentation
 by physicians.
 Wang et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "wang2010tractography"

\end_inset

 proposed a nonparametric Bayesian framework using a hierarchical Dirichlet
 process mixture model (HDPM).
 This is one of the very few methods not based on distances.
 A track is modeled here as a discrete distribution over a codebook of discretized
 orientations and voxel regions.
 In this paper the authors explain that calculating pairwise distances is
 very time consuming and therefore they try to avoid it.
 Their approach automatically learns the number of clusters from data with
 Dirichlet process priors.
 Visser et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Visser2010"

\end_inset

 used hierarchical clustering and fuzzy c-means together with recombination
 of subsets of the same tractography to reduce the effect of the large datasets
 on the distance matrix based on the minimum average flip distance (see
 section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:track-distances"

\end_inset

).
 The algorithm that we present in this chapter also uses the minimum average
 direct-flip (MDF) distance as a measure of distance between tracks.
 Gerig et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "gerig2004analysis"

\end_inset

 also used hierarchical clustering with minimum, closest and Hausdorff distances.
\end_layout

\begin_layout Standard
Guevara et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Guevara2010"

\end_inset

 used a cascade of different algorithms, from hierarchical clustering
 to 3D watershed on fibre extremities.
 They first divide the tractography into left and right hemisphere, inter-hemispheric
 and cerebellum subsets, then create subsets of different track length,
 use hierarchical clustering based on random voxel parcels, then watershed
 over extremities and finally use hierarchical clustering to merge the different
 sub-bundles using the Hausdorff distance (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:track-distances"

\end_inset

).
 Tsai et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Tsai2007"

\end_inset

 used a combination of clustering methods based on minimum spanning trees,
 locally linear embedding and k-means.
 They showed that they could incorporate both local and global structures
 by changing a few parameters.
 The advantage of this method was that it showed a way to merge a chain
 of neighbouring structures into one cluster.
 Zhang and Laidlaw 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2005dti"

\end_inset

 used an agglomerative hierarchical clustering using the same distance as
 in 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2003visualizing"

\end_inset

 and later in 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang2008identifying"

\end_inset

 combined distance-based single linkage hierarchical clustering with expert
 labeling of specific bundles.
\end_layout

\begin_layout Standard
Wakana et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Wakana2007NeuroImage"

\end_inset

 used regions of interest to include or exclude tracks generated by FACT
 and Hua et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Hua2008NeuroImage"

\end_inset

 used regions of interest together with probabilistic tractography.
 Zvitia et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "zvitia2008adaptive"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "Zvitia2010"

\end_inset

 used adaptive mean shift so that they do not need to provide the number
 of clusters; they also used this approach for registration of datasets
 from the same subject.
 El Kouby et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ElKouby2005"

\end_inset

 used ROI-based connectivity parcellation and k-means.
\end_layout

\begin_layout Standard
Brun et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "brun2004clustering"

\end_inset

 used the mean and covariance of the track as the feature space and normalized
 cuts based on a graph theoretic approach for the segmentation.
 Ding et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Ding2003a"

\end_inset

 used k-nearest neighbours, another agglomerative approach, applied to correspon
ding track segments.
 Corouge et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "corouge2004towards"

\end_inset

 used different types of track distances, e.g.
 Hausdorff distances, and other geometric properties such as torsion and
 curvature, and in 
\begin_inset CommandInset citation
LatexCommand cite
key "Corouge2004"

\end_inset

 and 
\begin_inset CommandInset citation
LatexCommand cite
key "Corouge2006"

\end_inset

 they used Generalized Procrustes Analysis and Principal Components Analysis
 (PCA) to analyze the shape of bundles.
\end_layout

\begin_layout Standard
O'Donnell et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ODonnell_IEEETMI07"

\end_inset

 generated a tractographic atlas using spectral embedding and expert anatomical
 labeling, and then automatically segmented new tractographies using further
 spectral clustering, embedding their tracks as points in the embedded space
 and assigning them to the closest existing atlas clusters.
 The full affinity matrix was too big to compute, therefore they used the
 Nystrom approximation, working on a subset and avoiding generation of the
 complete affinity/distance matrix.
 Later in 
\begin_inset CommandInset citation
LatexCommand cite
key "o2009tract"

\end_inset

 they tried group analysis on prespecified bundles.
\end_layout

\begin_layout Standard
Maddah et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Maddah_MICCA2005"

\end_inset

 used B-spline representations of fibre tracts referenced to an atlas, and
 then the subject’s fibre tracts were clustered based on the labeled atlas
 of the fibre tracts.
 Later Maddah et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "maddah2006statistical"

\end_inset

 using a similar track representation (quintic B-splines) calculated a model
 for each bundle as the average and standard deviation of that parametric
 representation.
 In that way they created an atlas which was used as a prior for expectation
 maximization (EM) clustering of the corpus callosal tracks into Witelson
 subdivisions
 
\begin_inset CommandInset citation
LatexCommand cite
key "witelson1989hand"

\end_inset

 using population averages.
 Later in 
\begin_inset CommandInset citation
LatexCommand cite
key "Maddah_IEEEBI2008"

\end_inset

 Maddah et al.
\begin_inset space ~
\end_inset

showed that they could combine spatial priors with metrics for the shape
 of the tracks to guide tractography clustering.
\end_layout

\begin_layout Standard
Jonasson et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "jonasson2005fiber"

\end_inset

 measured the similarity between two fibres by counting the number of points
 sharing the same voxel, and used this similarity with spectral clustering.
\end_layout

\begin_layout Standard
Jianu et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "jianu2009exploring"

\end_inset

 presented a new method for visualizing and navigating through tractography
 data combining dendrograms from hierarchical clustering, 3d- and 2d-embeddings
 using the approximation that Chalmers 
\begin_inset CommandInset citation
LatexCommand cite
key "chalmers1996linear"

\end_inset

 gives for the technique of Eades 
\begin_inset CommandInset citation
LatexCommand cite
key "eades1984heuristic"

\end_inset

.
\end_layout

\begin_layout Standard
Durrleman et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "Durrleman2009"

\end_inset

 introduced electrical current models of fibre bundles where a fibre is
 seen as a set of wires sending information in one direction at constant
 rate.
 Currents have good diffeomorphic properties and can be used for registration
 of bundles as shown in 
\begin_inset CommandInset citation
LatexCommand cite
key "Durrleman2009"

\end_inset

 and later in 
\begin_inset CommandInset citation
LatexCommand cite
key "durrleman2010registration"

\end_inset

.
 This methodology does not impose point-to-point or fibre-to-fibre correspondences;
 however, it is sensitive to the fibre density and orientation of the
 bundles.
 In common with all the methods above it is also computationally expensive.
\end_layout

\begin_layout Standard
Leemans and Jones 
\begin_inset CommandInset citation
LatexCommand cite
key "leemans17new"

\end_inset

 used affinity propagation (section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Affinity-Propagation"

\end_inset

) to cluster the fronto-occipital fibres, cingulum and arcuate fasciculus
 using additional frontal and occipital AND masks and a NOT mask on the
 right cerebrum.
 However, the authors worked on a very small part of the entire tractography
 where clustering is a much easier problem.
 Later Malcolm et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "malcolm2009filtered"

\end_inset

 used affinity propagation to cluster a full brain tractography created
 using filtered tractography and suggested that affinity propagation is
 not suitable for group clustering.
\end_layout

\begin_layout Standard
Ziyan et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "ziyan2009consistency"

\end_inset

 introduced a probabilistic registration and clustering algorithm based
 on the EM algorithm which creates a sharper atlas from a set of subjects
 on three bundles: Corpus Callosum, Cingulate and Fornix.
 This work used an initial spectral clustering 
\begin_inset CommandInset citation
LatexCommand cite
key "ODonnell_IEEETMI07"

\end_inset

 to label the bundles and then updated these labels iteratively while performing
 bundle-wise registration combined using polyaffine integration.
 
\end_layout

\begin_layout Standard
In summary, most researchers had to use complete inter-track distance matrices
 and/or simplify the datasets by length-biased filtering, selections,
 masks or extensive ROIs, or avoid track distances entirely because
 of CPU and memory constraints.
 We address these capacity problems by taking a different approach.
\end_layout

\begin_layout Subsection
Materials and methods
\end_layout

\begin_layout Standard

\series bold
[THIS NEEDS SPELLING OUT IN DETAIL.
 Is this the best place? Could the information on datasets go in an Appendix?]
\change_inserted 3 1319125536
 Yes this can all go to the appendix.
 The Appendix of the thesis has already most of the information
\change_unchanged

\end_layout

\begin_layout Standard

\series bold
Real subjects.
 
\series default
We collected data from 10 healthy subjects at the MRC-CBU 3T scanner (TIM
 Trio, Siemens), using the Siemens advanced diffusion work-in-progress sequence,
 and STEAM 
\begin_inset CommandInset citation
LatexCommand cite
key "merboldt1992diffusion,MAB04"

\end_inset

 as the diffusion preparation method.
 The field of view was 
\begin_inset Formula $240\times240\, mm^{2}$
\end_inset

, matrix size 
\begin_inset Formula $96\times96$
\end_inset

, and slice thickness 
\begin_inset Formula $2.5\, mm$
\end_inset

 (no gap).
 55 slices were acquired to achieve full brain coverage, and the voxel resolution
 was 
\begin_inset Formula $2.5\times2.5\times2.5\, mm^{3}$
\end_inset

.
 A 102-point half grid acquisition with a maximum b-value of 
\begin_inset Formula $4000\, s/mm^{2}$
\end_inset

 was used.
 The total acquisition time was 
\begin_inset Formula $14\, min\,21\, s$
\end_inset

 with TR=
\begin_inset Formula $8200\, ms$
\end_inset

 and TE=
\begin_inset Formula $69\, ms$
\end_inset

.
 XXXX Ethics Number - ASK Marta XXXX
\end_layout

\begin_layout Standard
For the reconstruction of the real data sets we used Generalized Q-sampling
 with diffusion sampling length 
\begin_inset Formula $1.2$
\end_inset

 and for the tractography propagation we used EuDX (Euler integration with
 trilinear interpolation) with 
\begin_inset Formula $1$
\end_inset

 million random seeds, angular threshold 
\begin_inset Formula $60^{\circ}$
\end_inset

, total weighting 
\begin_inset Formula $0.5$
\end_inset

, propagation step size 
\begin_inset Formula $0.5$
\end_inset

 and anisotropy stopping threshold 
\begin_inset Formula $0.0239$
\end_inset

.
\end_layout

\begin_layout Standard

\series bold
PBC real subjects
\series default
.
 We also used a few labeled data sets from the freely available tractography
 database used in the Pittsburgh Brain Competition Fall 2009 ICDM 
\begin_inset ERT
status open

\begin_layout Plain Layout

http://pbc.lrdc.pitt.edu
\end_layout

\end_inset

.
\end_layout

\begin_layout Standard

\series bold
Simulated trajectories.

\series default
 We generated 
\begin_inset Formula $3$
\end_inset

 different bundles of orbits of 
\begin_inset Formula $200$
\end_inset

 time points.
 The orbits were made from different combinations of sinusoidal and helicoidal
 functions.
 In total this data set contained 
\begin_inset Formula $450$
\end_inset

 orbits, see fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

.
\end_layout

\begin_layout Subsection
Linear-time QuickBundles (QB) Clustering 
\end_layout

\begin_layout Subsubsection
The QB Algorithm
\end_layout

\begin_layout Standard
QuickBundles (QB) is a remarkably
\begin_inset Note Note
status open

\begin_layout Plain Layout
Might want to omit sales-pitchey words here.
 The risk will be that your reviewer will find themselves trying to find
 fault, rather than enjoying this new addition to the list of tools they
 can use
\end_layout

\end_inset

 fast algorithm which can simplify tractography representation in an accessible
 structure in a time that is linear in the number of tracks 
\begin_inset Formula $N$
\end_inset

.
 QB is a linear time 
\begin_inset Formula $O(N)$
\end_inset

 distance-based clustering algorithm that we created in order to simplify
 huge trajectory datasets such as those produced by current tractography
 generation algorithms.
 In general, there are very few linear time clustering algorithms.
 Just two are well known: CLARANS 
\begin_inset CommandInset citation
LatexCommand cite
key "ng2002clarans"

\end_inset

 and BIRCH 
\begin_inset CommandInset citation
LatexCommand cite
key "zhang1997birch"

\end_inset

.
 QB is different from either of these methods; we will motivate it by describing
 some aspects of BIRCH as a starting point for the presentation of QB.
\end_layout

\begin_layout Standard
BIRCH has two key components: the first is relatively simple and involves
 the use and updating of clustering features; the second is the construction
 of a tree structure in which the accumulated clusters are held.
 This second component is aimed at maintaining efficient searchability of
 the database while balancing what is kept in memory and what is on disc
 for very large databases.
 BIRCH uses clustering features which are available for each item in the
 dataset; these are specific vectors of numerical values.
 Each cluster in turn has a clustering feature which is an aggregate of
 the clustering features of the items that belong to it (e.g.
 the sum or mean of the individual clustering feature vectors).
 Proceeding by a single sweep through the dataset, items are adjoined to
 clusters on the basis of their proximity to the clusters, subject to a
 maximum cluster size, or they are added as new leaves into the hierarchical
 tree structure in which the evolving clusters are held.
 There then follow updating steps which can involve the merging of previously
 created clusters in a k-means fashion.
\end_layout

\begin_layout Standard
It is the linear nature of BIRCH combined with the fixed dimensionality
 of its cluster features that makes it quite fast.
 However, the further steps involving reorganisation of the accumulated tree
 do add some major overheads to BIRCH's performance.
 QB capitalises on these positive features but does not try to create any
 kind of hierarchical structure for the clusters.
 In QB each item is either added to an existing cluster, on the basis of
 the distance between the cluster feature of the item and the cluster features
 of the current set of clusters, or used to start a new cluster.
 Clusters are held in a list which is extended according to need.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
More here on the comparison of BIRCH to QB? Can BIRCH be considered a superset
 of QB? What can BIRCH do that QB cannot? How about the other way round?
\end_layout

\end_inset


\end_layout

\begin_layout Standard
QB creates an online list of cluster nodes.
 The cluster node is defined as 
\begin_inset Formula $c=\{I,\mathbf{h},n\}$
\end_inset

 where 
\begin_inset Formula $I$
\end_inset

 is the list of the integer indices of the tracks in that cluster, 
\begin_inset Formula $\mathbf{h}$
\end_inset

 is a 
\begin_inset Formula $p\times3$
\end_inset

 matrix which is the most important feature of the cluster and 
\begin_inset Formula $n$
\end_inset

 is the number of tracks in that cluster.
 
\begin_inset Formula $\mathbf{h}$
\end_inset

 is a matrix which can be updated online when a track is added to a cluster
 and is equal to
\begin_inset Formula 
\begin{equation}
\mathbf{h}=\sum_{i=1}^{n}s_{i}
\end{equation}

\end_inset

where 
\begin_inset Formula $s_{i}$
\end_inset

 is the 
\begin_inset Formula $p\times3$
\end_inset

 matrix representing track 
\begin_inset Formula $i$
\end_inset

, 
\begin_inset Formula $\Sigma$
\end_inset

 represents here matrix addition along the second axis and 
\begin_inset Formula $n$
\end_inset

 is the number of tracks in the cluster.
 QB assumes that all tracks have the same number of points 
\begin_inset Formula $p$
\end_inset

; therefore an equidistant downsampling of tracks is necessary before QB
 starts.
 A short summary of the algorithm goes as follows.
 
\end_layout

\begin_layout Standard
Select the first track 
\begin_inset Formula $s_{0}$
\end_inset

 and place it in the first cluster 
\begin_inset Formula $c_{0}\leftarrow\{0,s_{0},1\}$
\end_inset

.
 Then, for all remaining tracks: i) go to the next track 
\begin_inset Formula $s_{i}$
\end_inset

; ii) calculate the MDF distance between this track and the virtual tracks
 of all existing clusters 
\begin_inset Formula $c_{k}$
\end_inset

, where a virtual track is defined on the fly as 
\begin_inset Formula $\mathbf{v}=\mathbf{h}/n$
\end_inset

; iii) if the minimum MDF distance is smaller than a distance threshold 
\begin_inset Formula $thr$
\end_inset

, add the track to the cluster 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $c_{j}=\{I,\mathbf{h},n\}$
\end_inset


\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
 with the minimum distance and update 
\begin_inset Formula $c_{j}\leftarrow\{I\cup\{i\},\mathbf{h}+s_{i},n+1\}$
\end_inset

; otherwise create a new cluster 
\begin_inset Formula $c_{|C|}\leftarrow\{i,s_{i},1\}$
\end_inset

, 
\begin_inset Formula $|C|\leftarrow|C|+1$
\end_inset

 where 
\begin_inset Formula $|C|$
\end_inset

 denotes the current total number of clusters.
 
\end_layout
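
\begin_layout Standard
As a brief worked illustration of why storing only the sum suffices, consider
 adding a track 
\begin_inset Formula $s_{i}$
\end_inset

 (already flipped if necessary) to a cluster with sum 
\begin_inset Formula $\mathbf{h}$
\end_inset

, size 
\begin_inset Formula $n$
\end_inset

 and virtual track 
\begin_inset Formula $\mathbf{v}=\mathbf{h}/n$
\end_inset

.
 Writing 
\begin_inset Formula $\mathbf{v}'$
\end_inset

 for the updated virtual track (a symbol introduced here only for this
 illustration), we have
\begin_inset Formula 
\[
\mathbf{v}'=\frac{\mathbf{h}+s_{i}}{n+1}=\frac{n\mathbf{v}+s_{i}}{n+1}=\mathbf{v}+\frac{s_{i}-\mathbf{v}}{n+1}
\]

\end_inset

so the update costs one matrix addition and does not require revisiting
 the tracks already in the cluster.
\end_layout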

\begin_layout Standard
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textbf{Input} tracks $T=
\backslash
{s_{0},...,s_{i},...,s_{|T|-1}
\backslash
}$, threshold
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
textbf{Output} clustering $C=
\backslash
{c_{0},...,c_{k},...,c_{|C|-1}
\backslash
}$ where class $c=
\backslash
{I,
\backslash
mathbf{h},N
\backslash
}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$c_{0}=
\backslash
left
\backslash
{0,s_{0},1
\backslash
right
\backslash
}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$C=
\backslash
left
\backslash
{c_{0}
\backslash
right
\backslash
}$, $|C|=1$ 
\backslash
# the first track becomes the first class
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{For}$ $i$ $
\backslash
textbf{From}$ $1$ $
\backslash
textbf{To}$ $|T|-1$ $
\backslash
textbf{Do}$ 
\backslash
# all tracks
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{t}=T_{i}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
texttt{alld}=
\backslash
textbf{0}$ 
\backslash
# distance buffer
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
texttt{flip}=
\backslash
textbf{0}$ 
\backslash
# flipping check buffer
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{For}$ $k$ $
\backslash
textbf{From}$ $0$ $
\backslash
textbf{To}$ $|C|-1$ 
\backslash
# all classes
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
mathbf{v}=C_{k}.
\backslash
mathbf{h}/C_{k}.N$
\backslash

\backslash
 
\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $d$=$d_{d}(
\backslash
mathbf{t},
\backslash
mathbf{v})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $f$=$d_{f}(
\backslash
mathbf{t},
\backslash
mathbf{v})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
textbf{If}$ $f < d$ $
\backslash
textbf{Then}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $d = f$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{6em} $
\backslash
texttt{flip}_{k} = 1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $
\backslash
texttt{alld}_{k} = d$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$m=
\backslash
min(
\backslash
texttt{alld})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$l=
\backslash
mathrm{arg min}(
\backslash
texttt{alld})$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{If}$ $m <$ threshold 
\backslash
# append in current class 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{If}$ $
\backslash
texttt{flip}_{l}=1$ $
\backslash
textbf{Then}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $C_{l}.
\backslash
mathbf{h}+=t'$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
textbf{Else}$ 
\backslash
# same direction, no flip needed 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{4em} $C_{l}.
\backslash
mathbf{h}+=t$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $C_{l}.N+=1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $C_{l}.I$.append($i$)
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{Else}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $|C|+=1$ 
\backslash
# number of classes increases
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $C_{|C|-1}.I_{0}=i$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $C_{|C|-1}.
\backslash
mathbf{h}=
\backslash
mathbf{t}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $C_{|C|-1}.N=1$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
caption{QuickBundles}
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Alg:QuickBundles"

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard

\change_inserted 3 1319710123
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout

\change_inserted 3 1319710123
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout

\change_inserted 3 1319710123
\begin_inset listings
lstparams "basicstyle={\small\rmfamily},breaklines=true,mathescape=true,numbers=left,numberstyle={\footnotesize},showstringspaces=false,tabsize=4"
inline false
status open

\begin_layout Plain Layout

\change_inserted 3 1319710123

$
\backslash
textrm{
\backslash
#the first track becomes the first class}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

$c_{0}=
\backslash
left
\backslash
{0,s_{0},1
\backslash
right
\backslash
}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

$C=
\backslash
left
\backslash
{c_{0}
\backslash
right
\backslash
}$, $|C|=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

$
\backslash
textrm{
\backslash
#for all the following tracks}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

$
\backslash
textbf{for}$ $i$ $
\backslash
textbf{from}$ $1$ $
\backslash
textbf{to}$ $|T|-1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
mathbf{t}=T_{i}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textrm{
\backslash
#store distances with cluster features}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
mathrm{alld}=
\backslash
mathrm{zeros}(|C|)$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textsc{store flipping if needed}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
mathrm{flip}= 
\backslash
mathrm{zeros}(|C|)$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textrm{
\backslash
#for all the clusters}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textbf{for}$ $k$ $
\backslash
textbf{from}$ $0$ $
\backslash
textbf{to}$ $|C|-1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
mathbf{v}=C_{k}.
\backslash
mathbf{h}/C_{k}.N$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$d$=$d_{d}(
\backslash
mathbf{t},
\backslash
mathbf{v})$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$f$=$d_{f}(
\backslash
mathbf{t},
\backslash
mathbf{v})$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
textsc{if flip distance is smaller than direct distance}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
textbf{if}$ $f < d$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

			$d = f$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

			$
\backslash
mathrm{flip}_{k}=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
mathrm{alld}_{k}=d$	
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textsc{find minimum distance and index}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$m=
\backslash
min(
\backslash
mathrm{alld})$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$l=
\backslash
mathrm{arg min}(
\backslash
mathrm{alld})$	
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textsc{If smaller than any class threshold then}$ 
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textsc{add track to that class}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textbf{if}$ $m < thr$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
textbf{if}$ $
\backslash
mathrm{flip}_{l}=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

			$C_{l}.
\backslash
mathbf{h}+=t'$ $
\backslash
textsc{reverse direction}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$
\backslash
textbf{else}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

			$C_{l}.
\backslash
mathbf{h}+=t$ $
\backslash
textsc{same direction}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$C_{l}.N+=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$C_{l}.I$.append(i)
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textsc{Otherwise create a new class}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	$
\backslash
textbf{else}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$|C|+=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$C_{|C|-1}.I_{0}=i$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$C_{|C|-1}.
\backslash
mathbf{h}=
\backslash
mathbf{t}$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

		$C_{|C|-1}.N=1$
\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123

	
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout

\change_inserted 3 1319710123
\begin_inset Caption

\begin_layout Plain Layout

\change_inserted 3 1319710123

\series bold
\begin_inset CommandInset label
LatexCommand label
name "alg:LSC-1"

\end_inset


\series default
QuickBundles 
\begin_inset Newline newline
\end_inset


\series bold
Input
\series default
: tractography 
\begin_inset Formula $T=\{s_{0},...,s_{i},...,s_{|T|-1}\}$
\end_inset

, distance threshold 
\begin_inset Formula $\texttt{thr}$
\end_inset


\begin_inset Newline newline
\end_inset


\series bold
Output
\series default
: clustering 
\begin_inset Formula $C=\{c_{0},...,c_{k},...,c_{|C|-1}\}$
\end_inset

, where cluster 
\begin_inset Formula $c=\{I,\mathbf{h},N\}$
\end_inset


\end_layout

\end_inset


\end_layout

\end_inset


\change_unchanged

\end_layout

\end_inset


\change_unchanged

\end_layout

\begin_layout Standard
Flipping can be an issue when using the MDF distance and adding tracks
 together, because tracks do not have a preferred direction.
 A step in QB takes account of the possibility of needing to perform a flip
 of a track before adding it to a representative track.
 The complete QB algorithm is described in detail in Algorithm 
\begin_inset CommandInset ref
LatexCommand ref
reference "Alg:QuickBundles"

\end_inset

 and a simple step-by-step visual example is given in fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Fig:LSC_simple"

\end_inset

.
 One of the reasons why QB runs in linear time is the structure of the
 cluster node: we only save the sum of the current tracks in the cluster.
 By contrast, with k-means we would have to recalculate the average at
 every iteration, which is computationally much more intensive.
 Another nice property of QB is that it goes through the tracks only once
 and that a track belongs to only one cluster.
 Furthermore, we can extend the cluster node to contain more information;
 for example, we could redefine 
\begin_inset Formula $c=\{I,\mathbf{h},n,\mathbf{v}\}$
\end_inset

, where the virtual track is saved as well.
 We could also add more interesting information; for example, we could
 redefine 
\begin_inset Formula $c=\{I,\mathbf{h},n,\mathbf{h}^{(2)}\}$
\end_inset

 to obtain second-order information, from which the variance of the cluster
 could be calculated, where 
\begin_inset Formula $\mathbf{h}^{(2)}\leftarrow\{\sum x_{i}^{2},\sum y_{i}^{2},\sum z_{i}^{2},\sum x_{i}y_{i},\sum y_{i}z_{i},\sum x_{i}z_{i}\}$
\end_inset

.
 Although this alternative would be very useful, as even more refined cluster
 distances could be used which take into account the additional information,
 this is not addressed in this document.
\end_layout
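
\begin_layout Standard
To make the second-order idea concrete, the following is a sketch under
 stated assumptions rather than part of QB as implemented: we assume that
 
\begin_inset Formula $x_{i}$
\end_inset

 and 
\begin_inset Formula $y_{i}$
\end_inset

 denote coordinates of the same downsampled point on track 
\begin_inset Formula $i$
\end_inset

 of a cluster of 
\begin_inset Formula $n$
\end_inset

 tracks, and we choose the 
\begin_inset Formula $1/n$
\end_inset

 normalisation.
 The first-order sums are already held in 
\begin_inset Formula $\mathbf{h}$
\end_inset

, so the accumulated quantities give
\begin_inset Formula 
\[
\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i},\qquad\mathrm{Var}(x)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\bar{x}^{2},\qquad\mathrm{Cov}(x,y)=\frac{1}{n}\sum_{i=1}^{n}x_{i}y_{i}-\bar{x}\bar{y}
\]

\end_inset

and similarly for the remaining coordinate pairs.
 Like the virtual track, these quantities can therefore be updated online
 without storing the individual tracks.
\end_layout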

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/LSC_algorithm.png
	scale 27
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Fig:LSC_simple"

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
QB step-by-step visual example.
 Initially, in panel (i), six unclustered tracks (A-F) are presented; imagine
 that the distance threshold used is the MDF distance (eq.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:direct_flip_distance"

\end_inset

) between B and E.
 The algorithm starts and in (ii) track A is selected; as no other clusters
 exist, track A becomes the first cluster (labeled with purple color) and
 the virtual track of that cluster is identical to A, as seen in (iii).
 Next, in (iv), track B is selected and we calculate the MDF distance between
 B and the virtual tracks of the existing clusters.
 For the moment there is only one cluster to compare with, so QB calculates
 MDF(B,virtual-purple), which is clearly bigger than the threshold MDF(B,E);
 therefore a new cluster is created for B and B becomes the virtual track
 of that cluster, as shown in (v).
 In (vi) the next track is selected and this is again far away from both
 purple and blue virtuals, therefore another cluster is created and B is
 the virtual of the blue cluster, as shown in (vii).
 In (viii) track D is current and, after we have calculated MDF(D,purple),
 MDF(D,blue) and MDF(D,green), it is clear that D belongs to the purple
 cluster, as MDF(D,purple) is the smallest and lower than the threshold,
 as shown in (ix).
 However, we can now see in (x) that things change for the purple cluster,
 because the virtual track is no longer made of only one track but is the
 average of D and A, shown with a dashed line.
 In (xi) E is the current track and will be assigned to the green cluster,
 as shown in (xii), because MDF(E,virtual green)=MDF(E,B)=thr, and in (xiii)
 we see the updated virtual track for the green cluster, which is equal
 to (B+E)/2 where + means track addition.
 In (xiv) the last track is picked and compared with the virtual tracks
 of the other three clusters; MDF(F,purple) is the only distance smaller
 than the threshold, so F is assigned to the purple cluster in (xv).
 Finally, in (xvi) the virtual purple track is refined as (D+A+F)/3.
 As there are no more tracks to select, the algorithm stops.
 We can see that all three clusters have been found and all tracks have
 been assigned successfully.
 
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
One of the disadvantages of most clustering algorithms is that they give
 different results with different initial conditions; for example this is
 recognised with k-means, expectation-maximization and k-centers where it
 is common practice to try a number of different random initial configurations.
 The same holds for QB: if there is no single distance threshold which
 can separate all clusters, then different permutations of the same tractography
 will give a similar number of clusters but different underlying clusters.
 
\end_layout

\begin_layout Subsubsection
QB's powerful simplifications
\end_layout

\begin_layout Standard
One of the major benefits of applying QB to tractographies is that it can
 provide meaningful simplifications and find structures that were previously
 unseen or difficult to find because of the high density of the tractography.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Is this not true for any automated clusting technique? Emphasize need for
 speed in interactive applications?
\end_layout

\end_inset

For example, we used QB to cluster the corticospinal tract (CST) selected
 by an expert.
 This tract was part of the datasets provided by the Pittsburgh Brain Competitio
n (PBC2009-ICDM).
 The result is shown in fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:cst_pbc"

\end_inset

 where every partition is represented by virtual tracks.
 To generate those clusters we used a tight threshold of 
\begin_inset Formula $10$
\end_inset

mm.
 We can notice that only a few tracks travel from bottom to top and that
 there are many tracks that are broken (i.e.
 shorter than what was initially expected) or highly divergent.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:cst_pbc"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/cst_simplification.png
	lyxscale 10
	scale 30
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
This is a part of the CST bundle consisting of 
\begin_inset Formula $11041$
\end_inset

 tracks merged by an expert (PBC2009 data) shown with red color.
 Visually it looks like all tracks have a similar shape, possibly merging
 towards the bottom and then spreading out towards the top.
 However, this is only an illusion
\begin_inset Note Note
status open

\begin_layout Plain Layout
Too strong? Maybe 'is a simplification of the underlying data'.
\end_layout

\end_inset

.
 QB can help us see the real structure of the bundle and identify its elements.
 Here on the right side we see a simplification (virtual tracks) of the
 red CST by running QB with distance threshold of 
\begin_inset Formula $10$
\end_inset

 mm and downsampling of 
\begin_inset Formula $12$
\end_inset

 points.
 We can easily perceive that many parts which looked homogeneous are actually
 broken bundles, e.g.
 dark green (bottom) and light blue (bottom), or bundles with a very different
 shape, e.g.
 the light green virtual track (top).
 Clustering this bundle took 
\begin_inset Formula $135$
\end_inset

 ms 
\begin_inset Formula $\simeq$
\end_inset

 
\begin_inset Formula $0.14$
\end_inset

 seconds.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Another interesting property of QB is that it can be used to merge or split
 different structures by changing the distance threshold.
 This becomes clear in fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

; on the left we see simulated orbits made from simple sinusoidal and helicoidal
 functions, colour-coded to distinguish the three different structures.
 With a lower threshold the three structures remain separated, but with
 a higher threshold the red and blue bundles are represented by a single
 cluster, shown as a purple virtual track.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:simulated_orbits"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/helix_phantom.png
	scale 80
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
On the left we see 3 bundles of simulated trajectories; red, blue and green
 consisting of 150 tracks each.
 All 450 tracks are clustered together using QB and the virtual tracks are
 shown for a threshold of 1 (middle) and of 8 (right).
 We can see that when the threshold is low enough the clustering is a good
 representation of the underlying geometry.
 However, when the distance threshold is higher, nearby bundles can merge
 together, as seen in the result on the right side where the red and blue
 bundles have merged into one cluster represented by the purple virtual
 track.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Similarly to the simulations shown in fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:simulated_orbits"

\end_inset

, we can see the same effect on real tracks such as those of the fornix
 shown in the left panel of fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

, where we obtain different numbers of clusters at different thresholds.
 In that way we can stress thinner or larger sub-bundles inside other bigger
 bundles.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:QB_fornix"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/LSC_simple.png
	lyxscale 30
	scale 60

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Left - QB clustering of the fornix bundle from the PBC2009 competition
 dataset.
 The initial fornix shown with black colour consists of 
\begin_inset Formula $1076$
\end_inset

 tracks.
 All tracks were equidistantly downsampled to 
\begin_inset Formula $3$
\end_inset

 points in this example.
 With a 
\begin_inset Formula $5$
\end_inset

mm threshold our method generates 
\begin_inset Formula $22$
\end_inset

 clusters (top right).
 With 
\begin_inset Formula $10$
\end_inset

mm it generates 
\begin_inset Formula $7$
\end_inset

 and with 
\begin_inset Formula $20$
\end_inset

mm the whole fornix is determined by one cluster only (bottom right).
 Right - an example of a full tractography of 
\begin_inset Formula $250,000$
\end_inset

 tracks being clustered using QB with a distance threshold of 
\begin_inset Formula $10$
\end_inset

mm.
 Here you see only 
\begin_inset Formula $763$
\end_inset

 virtual tracks depicted which produce a useful simplification of the initial
 tractography.
 Small bundles have been removed.
 Every track shown here represents an entire cluster from 
\begin_inset Formula $10$
\end_inset

 to 
\begin_inset Formula $5000$
\end_inset

 tracks each.
 These can be thought of as fast access points to explore the entire dataset.
 The colour here just encodes track orientation.
 Therefore, you can click on a track and obtain the entire cluster/bundle
 back.
 Visualizing an entire dataset of that size is impossible on standard graphics
 cards and most visualization tools can only show you a random sample of
 the tractography in real time.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
A full tractography containing 
\begin_inset Formula $250,000$
\end_inset

 tracks was clustered using QB with a distance threshold of 
\begin_inset Formula $10$
\end_inset

mm (fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:QB_fornix"

\end_inset

).
 We produced a useful reduction of the initial tractography leaving only
 
\begin_inset Formula $763$
\end_inset

 virtual tracks.
 Small clusters containing fewer than 
\begin_inset Formula $10$
\end_inset

 tracks were also removed.
 Every track shown here represents an entire cluster containing from 
\begin_inset Formula $10$
\end_inset

 to 
\begin_inset Formula $5000$
\end_inset

 tracks each.
 
\end_layout

\begin_layout Standard
The virtual tracks can be thought of as fast access points to explore the
 entire dataset.
 Therefore, you can click on a virtual track and obtain the entire cluster/bundle
 that it represents.
 Visualizing an entire dataset of that size is impossible on standard graphics
 cards and most medical visualization tools, e.g.
 Trackvis or DSI studio, will only show a random sample of the tractography
 in real time because of video memory size limitations.
 
\end_layout

\begin_layout Subsubsection
Complexity and timings
\end_layout

\begin_layout Standard
To apply QB to a tractography we need to specify three key parameters: 
\begin_inset Formula $p$
\end_inset

, the fixed number of downsampled points per track; 
\begin_inset Formula $d$
\end_inset

 the distance threshold, which controls the heterogeneity of clusters; and
 
\begin_inset Formula $N$
\end_inset

 the size of the subset of the tractography on which the clustering will
 be performed.
 When 
\begin_inset Formula $d$
\end_inset

 is higher, fewer, more heterogeneous clusters are assembled, and conversely
 when 
\begin_inset Formula $d$
\end_inset

 is low, more clusters of greater homogeneity are created.
\end_layout

\begin_layout Standard
\begin_inset Float table
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Tabular
<lyxtabular version="3" rows="4" columns="5">
<features tabularvalignment="middle">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<column alignment="center" valignment="top" width="0">
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Number of tracks
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Algorithms
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Timings (secs)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
QB (secs)
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Speedup
\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $1000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Wang et al.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $30$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $0.07$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $429$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $60,000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Wang et al.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $14400$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $14.7$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $980$
\end_inset


\end_layout

\end_inset
</cell>
</row>
<row>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $400,000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
Visser et al.
\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $75000$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $160.1$
\end_inset


\end_layout

\end_inset
</cell>
<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none">
\begin_inset Text

\begin_layout Plain Layout
\begin_inset Formula $468$
\end_inset


\end_layout

\end_inset
</cell>
</row>
</lyxtabular>

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
QB run on 
\begin_inset Formula $12$
\end_inset

 point tracks and distance threshold at 
\begin_inset Formula $10$
\end_inset

mm compared with some timings reported from other state of the art methods
 found in the literature.
 Timings have rarely been reported to date, as most algorithms were very
 slow on full datasets.
 Nonetheless, the speedup that QB offers is substantial, even allowing real-time
 clustering on subsets containing fewer than
 
\begin_inset Formula $20,000$
\end_inset

 tracks.
 QB was run on a simple PC using only one core.
 
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
The complexity of QB is on average linear time 
\begin_inset Formula $O(N)$
\end_inset

 with the number of tracks 
\begin_inset Formula $N$
\end_inset

 and worst case 
\begin_inset Formula $O(N^{2})$
\end_inset

 when every cluster contains
\begin_inset Note Note
status open

\begin_layout Plain Layout
sp
\end_layout

\end_inset

 only one track.
 We compared QB, run on 
\begin_inset Formula $12$
\end_inset

 point tracks with a distance threshold of 
\begin_inset Formula $10$
\end_inset

mm, with timings reported for other state of the art methods found in the
 literature.
 Timings have rarely been reported to date, as most algorithms were very
 slow on full data sets.
 However, the speedup that QB offers is substantial, even allowing real-time
 clustering on subsets of fewer than 
\begin_inset Formula $20,000$
\end_inset

 tracks.
 QB was run on a single thread of Intel Xeon(R) CPU E5420 @ 2.50GHz to generate
 these timings.
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:Speed1"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/speed_3_6.png
	lyxscale 40
	scale 80
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
QB is a very efficient algorithm whose performance is influenced by a few
 parameters.
 The first is the number of tracks, the second is the distance threshold
 in millimeters (shown with different colours) and the third is the amount
 of downsampling of the initial trajectories.
 A final parameter, not shown in these diagrams, is the underlying structure
 of the data, which is expressed by the number of final clusters.
 We used a full tractography to generate these figures without removing
 or preselecting any parts.
 These results were obtained on a single thread of an Intel(R) Xeon(R) CPU
 E5420 @ 2.50GHz in a simple PC.
 
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Furthermore, the memory usage of QB is 
\begin_inset Formula $O(M)$
\end_inset

, where 
\begin_inset Formula $M$
\end_inset

 is the number of clusters, and because this is usually much smaller than
 
\begin_inset Formula $N$
\end_inset

 we consider memory consumption to be negligible.
 Because in QB we store only the indices of the tracks, even for very large
 tractographies 
\begin_inset Formula $20$
\end_inset

 or more clusterings can be stored simultaneously in the RAM of a simple
 notebook without any problems.
 Therefore, another asset of QB is memory efficiency.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\noindent
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:Speed2"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/speed_12_18.png
	lyxscale 50
	scale 80

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Time comparisons of QB using different numbers of points per track, different
 distance thresholds and different numbers of tracks.
 Same as in Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:Speed1"

\end_inset

.
 Notice how the linearity only reduces slightly when we use a very low threshold
 of 
\begin_inset Formula $10$
\end_inset

mm.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Virtual tracks, exemplar tracks and other features
\end_layout

\begin_layout Standard
The virtual tracks created by QB have very nice properties as they represent
 an average track which can stand as the most important feature of the cluster
 that they belong to.
 However, now that we have segmented our tractography into small bundles
 we can calculate many more important features for the cluster.
 The first thing to keep in mind is the cluster variance, which can now be
 computed for every cluster 
\begin_inset Formula $c_{i}$
\end_inset

 as 
\begin_inset Formula $Var(c_{i})_{j}=Var(x_{j}-v_{j})$
\end_inset

 where 
\begin_inset Formula $x_{j}$
\end_inset

 is the 
\begin_inset Formula $j$
\end_inset

-th point in a track in that cluster and 
\begin_inset Formula $v_{j}$
\end_inset

 is the corresponding point of the virtual track.
 Many other similar or higher order statistics can be easily computed in
 an analogous fashion.
 One of the most useful features is the exemplar track.
\end_layout

\begin_layout Standard

\series bold
Exemplars
\series default
.
 Another fruitful idea relating to the virtual track is to try to identify
 a corresponding feature for the bundle which actually belongs to the tractography.
 In other words, to find an exemplar or centroid track.
 Remember that the virtual tracks are not real tracks; they are just the
 outcome of large amalgamations.
 There are many strategies for selecting good exemplars for the bundles.
 A very fast procedure that we use in this work is to find which real track
 from the cluster is closest (by MDF distance) to the virtual track.
 Let us call this exemplar track 
\begin_inset Formula $e_{0}$
\end_inset

 such that 
\begin_inset Formula $e_{0}=\mathrm{arg}\min_{x\in C}\mathrm{MDF}(v,x)$
\end_inset

.
 Finding 
\begin_inset Formula $e_{0}$
\end_inset

 is still linear in the number of tracks in the cluster, which is very useful
 for clusters with more than 
\begin_inset Formula $\sim5000$
\end_inset

 tracks (the exact limit depends on system memory).
 
\end_layout

\begin_layout Standard
Another exemplar could be defined as the most similar track among all tracks
 in the bundle, which we denote with 
\begin_inset Formula $e_{2}=\mathrm{arg}\min_{x\in C}\sum_{y\in C}||y-x||_{2}$
\end_inset

, or if we want to work with tracks of different sizes we could use 
\begin_inset Formula $e_{3}=\mathrm{arg}\min_{x\in C}\sum_{y\in C}\mathrm{MAM}(y,x)$
\end_inset

.
 Identification of exemplar tracks of type 
\begin_inset Formula $e_{2}$
\end_inset

 and 
\begin_inset Formula $e_{3}$
\end_inset

 will be efficient only for small bundles of fewer than 
\begin_inset Formula $5000$
\end_inset

 tracks because we need to calculate all pairwise distances in the bundle
 and then choose the track which has the most neighbours.
 We will see in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:The-bi-directionality-problem"

\end_inset

 how the exemplars can be used to simplify the bi-directionality problem
 when merging clusters.
\end_layout

\begin_layout Subsubsection
The bi-directionality problem
\begin_inset CommandInset label
LatexCommand label
name "sub:The-bi-directionality-problem"

\end_inset


\end_layout

\begin_layout Standard
Because a track is a sequence of points without a preferred direction, it
 has two possible orientations when comparing it with another track.
 Most tractography generation methods will create tracks with arbitrary
 directions; meaning that close and similar tracks can have opposite directions.
 Of course the tracks do not really carry directional information.
 By direction we mean the encoding of the sequence of points which define
 the track.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Thus a track may be ordered p_a, p_b ..
 p_y, p_z, or p_z, p_y ..
 p_b, p_a.
 
\end_layout

\end_inset

We call this the bi-directionality problem.
 Using the MDF distance we have already found a way in QB to reduce this
 problem locally in the clusters.
 However, if we want to merge clusters together we need to have a way to
 further reduce this difficult problem.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
I don't understand why this problem is not solved by just using the MDF
 distance between bundle virtuals or examplars.
 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
For this purpose we devised the following technique.
 Choose a fixed point or pole 
\begin_inset Formula $P$
\end_inset

 in the 3D space of the tractography, possibly away from the mid-sagittal
 plane.
 Then re-direct all tracks so that the first point of every track is the
 end closer to 
\begin_inset Formula $P$
\end_inset

.
 If the tractography is in native space it suffices to have the origin 
\begin_inset Formula $(0,0,0)$
\end_inset

 as the pole point; in MNI space we can use the point 
\begin_inset Formula $(100,100,100)$
\end_inset

.
 In our experience this method correctly redirects most tracks, in the sense
 that similar tracks will have the same direction.
 However, there will still be a small percentage for which the bi-directionality
 problem persists.
 We can correct for these by using exemplars rather than virtual tracks
 as virtual tracks can misrepresent a bundle if the bundle consists of tracks
 with ambiguous directionality.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Why the difference between virtuals and examplars in the bi-directionality
 problem?
\end_layout

\end_inset


\end_layout
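
\begin_layout Standard
The redirection rule can be written compactly; the notation below is introduced
 only for illustration and is a sketch rather than a definitive statement
 of the implementation.
 For a track 
\begin_inset Formula $x=(x_{1},\ldots,x_{K})$
\end_inset

 with 
\begin_inset Formula $K$
\end_inset

 points and a pole 
\begin_inset Formula $P$
\end_inset

, we set 
\begin_inset Formula 
\[
x\leftarrow\begin{cases}
(x_{K},x_{K-1},\ldots,x_{1}) & \mathrm{if}\;\|x_{1}-P\|_{2}>\|x_{K}-P\|_{2},\\
(x_{1},x_{2},\ldots,x_{K}) & \mathrm{otherwise},
\end{cases}
\]

\end_inset

so that after redirection the first point of every track is the end closer
 to 
\begin_inset Formula $P$
\end_inset

.
 We would expect tracks whose two ends are nearly equidistant from 
\begin_inset Formula $P$
\end_inset

 to be the ones for which this rule is least reliable.
\end_layout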

\begin_layout Subsection
Comparisons within and between subjects
\end_layout

\begin_layout Subsubsection
Measures to compare classifications
\end_layout

\begin_layout Standard
Considerable attention has been paid to measuring the performance of one
 or more classifiers in the context of supervised learning.
 We now outline some of these metrics before applying them to the comparisons
 we are interested in.
 Let 
\begin_inset Formula $\mathcal{A}=\{A_{1},A_{2},\ldots,A_{m}\}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}=\{B_{1},B_{2},\ldots,B_{n}\}$
\end_inset

 be two classifications of 
\begin_inset Formula $N$
\end_inset

 items.
 Let the number of items in 
\begin_inset Formula $A_{i}$
\end_inset

 and 
\begin_inset Formula $B_{j}$
\end_inset

 be 
\begin_inset Formula $a_{i}$
\end_inset

 and 
\begin_inset Formula $b_{j}$
\end_inset

, with 
\begin_inset Formula $t_{ij}$
\end_inset

 items in the intersection 
\begin_inset Formula $A_{i}\cap B_{j}$
\end_inset

.
 There are a number of ways of measuring the similarity or dissimilarity
 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

.
 The first two, Gini Impurity and Random Classification Errors, are based
 on ways we might estimate the 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels if we just have the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-labelling.
\end_layout

\begin_layout Paragraph*
Gini impurity
\end_layout

\begin_layout Standard
Suppose we have a probability distribution 
\begin_inset Formula $P=(p_{1},p_{2},\ldots,p_{k})$
\end_inset

 on 
\begin_inset Formula $k$
\end_inset

 labels 
\begin_inset Formula $(1,2,\ldots,k)$
\end_inset

.
 We draw an item whose true label follows 
\begin_inset Formula $P$
\end_inset

 and, independently, assign it a label drawn at random from the same distribution
 
\begin_inset Formula $P$
\end_inset

.
 The two labels agree with probability 
\begin_inset Formula $\sum_{i=1}^{k}p_{i}^{2}$
\end_inset

, so the probability of assigning the wrong label is 
\begin_inset Formula $1-{\displaystyle \sum_{i=1}^{k}}p_{i}^{2}$
\end_inset

.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
I got confused here.
 I guess this is because the probability of mislabeling is sum_i(p_i * (1-p_i))
 but then I didn't understand the role of the 'items sampled from P' above.
\end_layout

\end_inset

This is the Gini Impurity of 
\begin_inset Formula $P$
\end_inset

 -
\emph on
 
\emph default

\begin_inset Formula $\mathtt{\gini}(P)$
\end_inset

 - with 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $0\le\mathrm{\mathtt{\gini}}(P)\le1-\frac{1}{k}$
\end_inset

.
 The lower limit occurs when 
\begin_inset Formula $P$
\end_inset

 assigns probability 
\begin_inset Formula $1$
\end_inset

 to just one label (i.e.
 a very pure, concentrated distribution); the upper limit occurs when all
 labels have equal probability 
\begin_inset Formula $\frac{1}{k}$
\end_inset

.
\end_layout

\begin_layout Standard
If 
\begin_inset Formula $P_{\mathcal{A}|B_{j}}$
\end_inset


\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
 is the observed conditional probability distribution 
\begin_inset Formula $(p_{i|j}=\frac{t_{ij}}{b_{j}},\, i=1,\ldots,m)$
\end_inset

 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 given 
\begin_inset Formula $B_{j},$
\end_inset

 then we define the Gini Impurity of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 with respect to 
\begin_inset Formula $\mathcal{B}$
\end_inset

 as 
\begin_inset Formula $\gini(\mathcal{A}|\mathcal{B})={\displaystyle \sum_{j=1}^{n}\frac{b_{j}}{N}\thinspace\mathrm{\mathtt{\gini}}(P_{\mathcal{A}|B_{j}})}$
\end_inset

.
 In terms of the matrix 
\begin_inset Formula $T=(t_{ij})$
\end_inset

 this is the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-weighted average of the impurities of the columns of 
\begin_inset Formula $T.$
\end_inset

 We similarly define 
\begin_inset Formula $\gini(\mathcal{B}|\mathcal{A})$
\end_inset

 and it is equal to the 
\begin_inset Formula $\mathcal{A}$
\end_inset

-weighted average of the impurities of the rows of 
\begin_inset Formula $T.$
\end_inset

 It is convenient to arrange our comparison metrics so that they are similarity
 measures, with high values indicating high similarity.
 So we define and report the complementary 
\begin_inset Formula $\mathtt{{gini\_purity}}$
\end_inset

 as 
\begin_inset Formula $1-\gini$
\end_inset

.
 We also restrict ourselves to reporting the symmetrised value 
\begin_inset Formula $(\gini(\mathcal{A}|\mathcal{B})+\gini(\mathcal{B}|\mathcal{A}))/2$
\end_inset

.
\end_layout
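
\begin_layout Standard
As a small worked illustration (the overlap matrix is hypothetical, chosen
 only to make the arithmetic easy to follow), take 
\begin_inset Formula $N=8$
\end_inset

 items and 
\begin_inset Formula $m=n=2$
\end_inset

 with 
\begin_inset Formula 
\[
T=\begin{pmatrix}3 & 1\\
0 & 4
\end{pmatrix},
\]

\end_inset

where rows are indexed by 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels and columns by 
\begin_inset Formula $\mathcal{B}$
\end_inset

-labels, so that 
\begin_inset Formula $(a_{1},a_{2})=(4,4)$
\end_inset

 and 
\begin_inset Formula $(b_{1},b_{2})=(3,5)$
\end_inset

.
 Conditional on 
\begin_inset Formula $B_{1}$
\end_inset

 the distribution is 
\begin_inset Formula $(1,0)$
\end_inset

 with impurity 
\begin_inset Formula $0$
\end_inset

; conditional on 
\begin_inset Formula $B_{2}$
\end_inset

 it is 
\begin_inset Formula $(\frac{1}{5},\frac{4}{5})$
\end_inset

 with impurity 
\begin_inset Formula $1-\frac{1}{25}-\frac{16}{25}=\frac{8}{25}$
\end_inset

.
 Hence 
\begin_inset Formula 
\[
\gini(\mathcal{A}|\mathcal{B})=\frac{3}{8}\cdot0+\frac{5}{8}\cdot\frac{8}{25}=0.2,\qquad\gini(\mathcal{B}|\mathcal{A})=\frac{4}{8}\cdot\frac{6}{16}+\frac{4}{8}\cdot0=0.1875,
\]

\end_inset

giving a symmetrised impurity of 
\begin_inset Formula $0.19375$
\end_inset

 and a corresponding 
\begin_inset Formula $\mathtt{{gini\_purity}}$
\end_inset

 of 
\begin_inset Formula $0.80625$
\end_inset

.
\end_layout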

\begin_layout Paragraph
Random classification errors
\end_layout

\begin_layout Standard
In this case given 
\begin_inset Formula $P$
\end_inset

 we assign to each item the label with maximum probability 
\begin_inset Formula $i_{\mathrm{max}}=\argmax_{i}p_{i}$
\end_inset

.
 The Random Classification Error in this case is 
\begin_inset Formula $1-p_{\mathrm{max}}=1-\max_{i}p_{i}$
\end_inset

.
 When we do this conditional on the 
\begin_inset Formula $\mathcal{B}$
\end_inset

-label and average over those labels, we get the Random Classification Error
 of 
\begin_inset Formula $\mathcal{A}$
\end_inset

 conditional on 
\begin_inset Formula $\mathcal{B}$
\end_inset

, 
\begin_inset Formula $\mathrm{\rce}(\mathcal{A}|\mathcal{B})={\displaystyle \sum_{j=1}^{n}\frac{b_{j}}{N}\thinspace(1-p_{B_{j}}^{*})}$
\end_inset

, where 
\begin_inset Formula $p_{B_{j}}^{*}=\max_{i}p_{i|j}$
\end_inset

 is the largest conditional probability of an 
\begin_inset Formula $\mathcal{A}$
\end_inset

-label given 
\begin_inset Formula $B_{j}$
\end_inset

.
 We define 
\begin_inset Formula $\rce(\mathcal{B}|\mathcal{A})$
\end_inset

 similarly, and the complementary 
\begin_inset Formula $\mathtt{{random\_classification\_accuracy}}.$
\end_inset

 A further simplification is to use the symmetrised value 
\begin_inset Formula $(\rce(\mathcal{A}|\mathcal{B})+\rce(\mathcal{B}|\mathcal{A}))/2$
\end_inset

.

\end_layout
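
\begin_layout Standard
For a concrete feel of this measure, consider a hypothetical overlap matrix
 (illustrative only) with 
\begin_inset Formula $N=8$
\end_inset

, rows indexed by 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels and columns by 
\begin_inset Formula $\mathcal{B}$
\end_inset

-labels, 
\begin_inset Formula 
\[
T=\begin{pmatrix}3 & 1\\
0 & 4
\end{pmatrix},
\]

\end_inset

and read 
\begin_inset Formula $p_{B_{j}}^{*}$
\end_inset

 as the largest conditional probability 
\begin_inset Formula $\max_{i}t_{ij}/b_{j}$
\end_inset

 (this reading is our assumption).
 Then 
\begin_inset Formula $p_{B_{1}}^{*}=1$
\end_inset

 and 
\begin_inset Formula $p_{B_{2}}^{*}=\frac{4}{5}$
\end_inset

, so that 
\begin_inset Formula 
\[
\rce(\mathcal{A}|\mathcal{B})=\frac{3}{8}\cdot0+\frac{5}{8}\cdot\frac{1}{5}=0.125,\qquad\rce(\mathcal{B}|\mathcal{A})=\frac{4}{8}\cdot\frac{1}{4}+\frac{4}{8}\cdot0=0.125.
\]

\end_inset

The symmetrised error is therefore 
\begin_inset Formula $0.125$
\end_inset

 and the complementary accuracy is 
\begin_inset Formula $0.875$
\end_inset

.
\end_layout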

\begin_layout Paragraph*
Correctness and completeness (splitting and lumping pairs of items)
\end_layout

\begin_layout Standard
For the next two metrics the focus moves to the labels assigned by 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 to pairs of items.
 Differences in the partitions 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 are reflected in two ways.
 Items assigned the same label by 
\begin_inset Formula $\mathcal{A}$
\end_inset

 are said to be split by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 if their 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathcal{B}$
\end_inset

-labels are not equal; alternatively items assigned different 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels are said to be lumped 
\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 if they are assigned the same 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathcal{B}$
\end_inset

-label.
 Note that what is lumped (split) by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 will equally be split (lumped) by 
\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit

\begin_inset Formula $\mathcal{A}$
\end_inset

.
\end_layout

\begin_layout Standard
The total number of pairs from 
\begin_inset Formula $N$
\end_inset

 items is 
\begin_inset Formula $\mathtt{pairs(\mathcal{A})=}\binom{N}{2}=\frac{N(N-1)}{2}$
\end_inset

.
 The number of pairs assigned the same 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels is 
\begin_inset Formula ${\displaystyle \mathtt{together}(\mathcal{A})=\sum_{i=1}^{m}\binom{a_{i}}{2}}$
\end_inset

.
 The number of pairs assigned different labels is 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\mathtt{apart}(\mathcal{A})=\mathtt{pairs}(\mathcal{A})-\mathtt{together}(\mathcal{A})$
\end_inset

.
 This can also be written as 
\begin_inset Formula ${\displaystyle \sum_{1\le i<i'\le m}a_{i}a_{i'}}$
\end_inset

, which in turn can be expressed in terms of the cumulative sum of 
\begin_inset Formula $(a_{i})$
\end_inset

, an efficient way of programming the calculation of sums of all products
 with unequal subscripts.
 The number of 
\begin_inset Formula $\mathcal{A}$
\end_inset

-pairs split by 
\begin_inset Formula $\mathcal{B}$
\end_inset

 is 
\begin_inset Formula 
\[
\mathtt{split}(\mathcal{A}|\mathcal{B}){\displaystyle =\sum_{i=1}^{m}}\bigl({\displaystyle \sum_{1\le j<j'\le n}t_{ij}t_{ij'}}\bigr)=\mathtt{lumped}(\mathcal{B}|\mathcal{A}).
\]

\end_inset

 Similarly 
\begin_inset Formula 
\[
\mathtt{lumped}(\mathcal{A}|\mathcal{B}){\displaystyle =\sum_{j=1}^{n}}\bigl({\displaystyle \sum_{1\le i<i'\le m}t_{ij}t_{i'j}}\bigr)=\mathtt{split}(\mathcal{B}|\mathcal{A}).
\]

\end_inset


\end_layout

\begin_layout Standard

\emph on
Completeness
\emph default
 and 
\emph on
correctness
\emph default
 are defined in terms of these quantities: 
\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{A}|\mathcal{B})=1-\mathtt{split}(\mathcal{A}|\mathcal{B})/\mathtt{together}(\mathcal{A})
\]

\end_inset

 and 
\begin_inset Formula 
\[
\mathtt{correctness}(\mathcal{A}|\mathcal{B})=1-\mathtt{lumped}(\mathcal{A}|\mathcal{B})/\mathtt{apart}(\mathcal{A}).
\]

\end_inset

 A combined measure of 
\emph on
discordance
\emph default
 between 
\begin_inset Formula $\mathcal{A}$
\end_inset

 and 
\begin_inset Formula $\mathcal{B}$
\end_inset

 is defined as 
\begin_inset Formula 
\[
\mathtt{discord}(\mathcal{A},\mathcal{B})=(\mathtt{lumped}(\mathcal{A}|\mathcal{B})+\mathtt{split}(\mathcal{A}|\mathcal{B}))/\mathtt{pairs}(\mathcal{A}).
\]

\end_inset


\end_layout
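
\begin_layout Standard
As a quick check of these definitions, counting each unordered pair of items
 once, consider a hypothetical overlap matrix (illustrative only) with 
\begin_inset Formula $N=8$
\end_inset

, 
\begin_inset Formula 
\[
T=\begin{pmatrix}3 & 1\\
0 & 4
\end{pmatrix},
\]

\end_inset

with rows indexed by 
\begin_inset Formula $\mathcal{A}$
\end_inset

-labels and columns by 
\begin_inset Formula $\mathcal{B}$
\end_inset

-labels, so that 
\begin_inset Formula $(a_{1},a_{2})=(4,4)$
\end_inset

.
 Then 
\begin_inset Formula $\mathtt{pairs}(\mathcal{A})=\binom{8}{2}=28$
\end_inset

, 
\begin_inset Formula $\mathtt{together}(\mathcal{A})=\binom{4}{2}+\binom{4}{2}=12$
\end_inset

 and 
\begin_inset Formula $\mathtt{apart}(\mathcal{A})=28-12=16=a_{1}a_{2}$
\end_inset

.
 Within the rows of 
\begin_inset Formula $T$
\end_inset

 we find 
\begin_inset Formula $\mathtt{split}(\mathcal{A}|\mathcal{B})=3\cdot1+0\cdot4=3$
\end_inset

 and within the columns 
\begin_inset Formula $\mathtt{lumped}(\mathcal{A}|\mathcal{B})=3\cdot0+1\cdot4=4$
\end_inset

.
 Therefore 
\begin_inset Formula 
\[
\mathtt{completeness}(\mathcal{A}|\mathcal{B})=1-\frac{3}{12}=0.75,\quad\mathtt{correctness}(\mathcal{A}|\mathcal{B})=1-\frac{4}{16}=0.75,\quad\mathtt{discord}(\mathcal{A},\mathcal{B})=\frac{4+3}{28}=0.25.
\]

\end_inset


\end_layout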

\begin_layout Standard
For the clusterings we encounter in tractography typically the number of
 apart pairs in 
\begin_inset Formula $\mathcal{A}$
\end_inset

 is very high, and only a small percentage (e.g.
 0.5%) of these pairs will be lumped by 
\begin_inset Formula $\mathcal{B}$
\end_inset

.
 This is because [claim on the basis of little analysis] the average cluster
 size is small by comparison with the number of clusters.
 [Needs numerical evaluation.] As a consequence, the 
\bar under
correctness
\bar default
 measure is not a particularly useful metric.
 By contrast the number of together pairs is modest, and the 
\bar under
completeness
\bar default
 measure is more sensitive.
 Because apart pairs seem to dominate together pairs, the 
\bar under
discord
\bar default
 measure has the same limited sensitivity as correctness.
\end_layout

\begin_layout Paragraph
Maximum Agreement (
\begin_inset Formula $\kappa_{\max}$
\end_inset

)
\end_layout

\begin_layout Standard
Our fifth metric is Cohen's 
\begin_inset Formula $\kappa$
\end_inset

, which is a well-known measure of agreement between raters on the assignment
 of a set of items to a shared classification scheme.
 It adjusts the agreements (items on which the raters agree) for the number
 of agreements that might have occurred by chance:
\end_layout

\begin_layout LyX-Code
\begin_inset Formula 
\[
\kappa=\frac{\mathtt{observed\: proportion\: of\: agreements}-\mathtt{expected\: proportion\: of\: chance\: agreements}}{1-\mathtt{expected\: proportion\: of\: chance\: agreements}}.
\]

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
Sorry - I realize I am not thinking this through, but I wasn't clear what
 an 'expected' proportion of 'non-chance' agreements would be.
 From the formula I am tempted to describe the denominator as 'expected
 proportion of chance disagreements'.
\end_layout

\end_inset

This can be simply represented in terms of the overlap matrix 
\begin_inset Formula $T=(t_{ij})$
\end_inset

 by the formula:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\kappa(T)=\frac{{\displaystyle \sum_{i=1}^{M}}t_{ii}/N-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{i}/N^{2}}{1-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{i}/N^{2}},
\]

\end_inset

where 
\begin_inset Formula $r_{i}$
\end_inset

 and 
\begin_inset Formula $c_{j}$
\end_inset

 represent the row and column totals of 
\begin_inset Formula $T$
\end_inset

.
 We have extended 
\begin_inset Formula $T$
\end_inset

 to a square matrix of size 
\begin_inset Formula $M=\max(m,n)$
\end_inset

 by adding if necessary rows or columns of zeros.
 When we adapt this measure to the case of comparing two clusterings we
 need further to take account of the lack of prior correspondence between
 the two sets of labels.
 The 
\begin_inset Formula $\kappa_{\max}$
\end_inset

 statistic is the result of maximising 
\begin_inset Formula $\kappa$
\end_inset

 over all possible correspondences:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\[
\kappa_{\max}=\max_{\pi}\kappa(T_{\pi})=\max_{\pi}\frac{{\displaystyle \sum_{i=1}^{M}}t_{i\pi(i)}/N-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{\pi(i)}/N^{2}}{1-{\displaystyle \sum_{i=1}^{M}}r_{i}c_{\pi(i)}/N^{2}},
\]

\end_inset

where 
\begin_inset Formula $T_{\pi}$
\end_inset

 is the matrix 
\begin_inset Formula $T$
\end_inset

 with columns reordered by a permutation 
\begin_inset Formula $\pi.$
\end_inset

 The principal trouble with the 
\begin_inset Formula $\kappa_{\max}$
\end_inset

 statistic is that its computation is 
\begin_inset Formula $O(M!)$
\end_inset

 if all permutations are tried.
 One way out of the problem caused by the size of the search set might be
 to use a randomised search strategy for instance based on a simulated annealing
 approach.
 An alternative route is to redefine the objective.
 One obvious choice in this setting is the 
\begin_inset Formula $\mathtt{number\; of\; agreements}={\displaystyle \sum_{i=1}^{M}}t_{i\pi(i)}$
\end_inset

 corresponding to the permutation 
\begin_inset Formula $\pi$
\end_inset

, which is the leading term in the numerator of 
\begin_inset Formula $\kappa_{\max}$
\end_inset

.
 Seeking to maximise the 
\begin_inset Formula $\mathtt{number\; of\; agreements}$
\end_inset

 amongst all permutations 
\begin_inset Formula $\pi$
\end_inset

 is a combinatorial optimization problem (weighted assignment problem on
 a bipartite graph) that can be reformulated as a linear programming problem
 whose efficient solution by the `Hungarian Method' is well known.
\end_layout
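
\begin_layout Standard
To make the role of the permutation concrete, consider a small hypothetical
 overlap matrix (not taken from our data) with 
\begin_inset Formula $N=8$
\end_inset

 and 
\begin_inset Formula $M=2$
\end_inset

, 
\begin_inset Formula 
\[
T=\begin{pmatrix}3 & 1\\
0 & 4
\end{pmatrix},\qquad(r_{1},r_{2})=(4,4),\qquad(c_{1},c_{2})=(3,5).
\]

\end_inset

For the identity permutation the number of agreements is 
\begin_inset Formula $3+4=7$
\end_inset

 and the chance term is 
\begin_inset Formula $(4\cdot3+4\cdot5)/64=0.5$
\end_inset

, so 
\begin_inset Formula $\kappa=(7/8-0.5)/(1-0.5)=0.75$
\end_inset

.
 For the permutation that swaps the two columns the number of agreements
 is 
\begin_inset Formula $1+0=1$
\end_inset

 and the chance term is 
\begin_inset Formula $(4\cdot5+4\cdot3)/64=0.5$
\end_inset

, so 
\begin_inset Formula $\kappa=(1/8-0.5)/(1-0.5)=-0.75$
\end_inset

.
 Hence 
\begin_inset Formula $\kappa_{\max}=0.75$
\end_inset

, and in this example it is attained by the permutation that also maximises
 the number of agreements.
\end_layout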

\begin_layout Paragraph
`Hungarian' matching
\end_layout

\begin_layout Standard
We observe that it is the term 
\begin_inset Formula ${\displaystyle \mu_{\max}=\sum_{i=1}^{M}}t_{i\pi(i)}$
\end_inset

 that has a key role in the behaviour of 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none

\begin_inset Formula $\kappa_{\max}$
\end_inset

.
 Can we find a mapping 
\begin_inset Formula $\pi$
\end_inset


\family default
\series default
\shape default
\size default
\emph default
\bar default
\noun default
\color inherit
 that maximises 
\begin_inset Formula $\mu_{\max}$
\end_inset

? This turns out to be a classical assignment problem in bipartite graph
 theory for which the so-called Hungarian Method 
\begin_inset CommandInset citation
LatexCommand cite
key "Kuhn1955"

\end_inset

 is an efficient solution.
 We have tried various published coded implementations of the version by
 Lawler 
\begin_inset CommandInset citation
LatexCommand cite
key "Lawler2001"

\end_inset

 of the Hungarian Method and have found that the one published by Carpaneto
 et al.
 
\begin_inset CommandInset citation
LatexCommand cite
key "Carpaneto1988"

\end_inset

 and implemented by them 
\begin_inset CommandInset citation
LatexCommand cite
key "CarpanetoAPC"

\end_inset

 in 
\begin_inset Formula $\textsc{Fortran}$
\end_inset

 is both fast and capable of handling assignment problems of unlimited size.
\begin_inset Note Note
status open

\begin_layout Plain Layout
Are the gini, rce, splitting lumping and kappa measures used in this chapter?
\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Tightness comparisons
\begin_inset CommandInset label
LatexCommand label
name "sub:Tightness-comparisons-1"

\end_inset


\end_layout

\begin_layout Standard
We have found rather few systematic ways to compare different clustering
 results for tractographies in the literature
\series bold
 
\series default

\begin_inset CommandInset citation
LatexCommand cite
key "moberts2005evaluation"

\end_inset

.
 We think that being able to compare results of clusterings is crucial for
 creating stable brain imaging procedures.
 We would like to have a way to compare different clusterings of the same
 subject or different subjects.
 Although we recognise that this is a difficult problem, we propose the
 following solution with a metric which we call tight comparison (TC).
 Tight comparison works as follows.
 Let's assume that we have gathered the exemplar tracks from clustering
 
\begin_inset Formula $A$
\end_inset

 in 
\begin_inset Formula $E_{A}=\{e_{0},...,e_{|E_{A}|}\}$
\end_inset

 and from clustering 
\begin_inset Formula $B$
\end_inset

 in 
\begin_inset Formula $E_{B}=\{e_{0}^{'},...,e_{|E_{B}|}^{'}\}$
\end_inset

 where 
\begin_inset Formula $|E|$
\end_inset

 denotes the number of exemplar tracks of each clustering 
\begin_inset Formula $E$
\end_inset

.
 The size of set 
\begin_inset Formula $E_{A}$
\end_inset

 does not need to be the same as that of 
\begin_inset Formula $E_{B}$
\end_inset

 (i.e.
 
\begin_inset Formula $|E_{A}|\neq|E_{B}|$
\end_inset

 or 
\begin_inset Formula $|E_{A}|=|E_{B}|$
\end_inset

).
 Next, we calculate all pairwise MDF distances between the two sets and
 store them in a rectangular matrix 
\begin_inset Formula $D_{AB}$
\end_inset

, whose rows index the exemplars of 
\begin_inset Formula $A$
\end_inset

 and whose columns index the exemplars of 
\begin_inset Formula $B$
\end_inset

.
 From the minimum of each row of 
\begin_inset Formula $D_{AB}$
\end_inset

 we can find the correspondence pairs from 
\begin_inset Formula $A$
\end_inset

 to 
\begin_inset Formula $B$
\end_inset

 (
\begin_inset Formula $E_{A\rightarrow B}$
\end_inset

) and from the minimum of each column of 
\begin_inset Formula $D_{AB}$
\end_inset

 we can find the correspondence pairs from 
\begin_inset Formula $B$
\end_inset

 to 
\begin_inset Formula $A$
\end_inset

 (
\begin_inset Formula $E_{B\rightarrow A}$
\end_inset

).
 From these pairs we only keep those which have distances smaller than a
 small tight threshold 
\begin_inset Formula $t_{thr}$
\end_inset

.
 Then we define TC to be
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{equation}
TC=\frac{1}{2}\left(\frac{|E_{A\rightarrow B}\leq t_{thr}|}{|E_{A}|}+\frac{|E_{B\rightarrow A}\leq t_{thr}|}{|E_{B}|}\right)
\end{equation}

\end_inset


\end_layout

\begin_layout Standard
where 
\begin_inset Formula $|E_{A\rightarrow B}\leq t_{thr}|$
\end_inset

 denotes the number of exemplars from A which have a neighbour in B closer
 than 
\begin_inset Formula $t_{thr}$
\end_inset

 and similarly 
\begin_inset Formula $|E_{B\rightarrow A}\leq t_{thr}|$
\end_inset

 denotes the number of exemplars from B which have a neighbour in A closer
 than 
\begin_inset Formula $t_{thr}$
\end_inset

.
 When 
\begin_inset Formula $TC=0$
\end_inset

 no exemplar from one set is closer than 
\begin_inset Formula $t_{thr}$
\end_inset

 to any exemplar in the other set.
 When 
\begin_inset Formula $TC=1$
\end_inset

 every exemplar in each set has a close neighbour in the other set.
 This metric is especially useful when comparing tractographies from different
 subjects.
\end_layout
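
\begin_layout Standard
As a small worked illustration with hypothetical numbers (not taken from
 our data), suppose clustering 
\begin_inset Formula $A$
\end_inset

 has three exemplars, clustering 
\begin_inset Formula $B$
\end_inset

 has two, 
\begin_inset Formula $t_{thr}=5$
\end_inset

 mm and the MDF distances in mm are 
\begin_inset Formula 
\[
D_{AB}=\begin{pmatrix}2 & 9\\
8 & 3\\
12 & 11
\end{pmatrix},
\]

\end_inset

with rows indexed by the exemplars of 
\begin_inset Formula $A$
\end_inset

 and columns by those of 
\begin_inset Formula $B$
\end_inset

 (this orientation is our assumption).
 The row minima are 
\begin_inset Formula $2$
\end_inset

, 
\begin_inset Formula $3$
\end_inset

 and 
\begin_inset Formula $11$
\end_inset

, so two of the three exemplars of 
\begin_inset Formula $A$
\end_inset

 have a neighbour in 
\begin_inset Formula $B$
\end_inset

 within the threshold; the column minima are 
\begin_inset Formula $2$
\end_inset

 and 
\begin_inset Formula $3$
\end_inset

, so both exemplars of 
\begin_inset Formula $B$
\end_inset

 have a neighbour in 
\begin_inset Formula $A$
\end_inset

 within the threshold.
 Therefore 
\begin_inset Formula 
\[
TC=\frac{1}{2}\left(\frac{2}{3}+\frac{2}{2}\right)=\frac{5}{6}\approx0.83.
\]

\end_inset


\end_layout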

\begin_layout Standard
We ran an experiment in which we compared TC between pairs of 10 subjects
 with their tractographies warped into MNI space (SEE SECTION).
 This generated 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

 TC values.
 We did this experiment twice: once keeping only the bundles with more than
 10 tracks (TC10) and once keeping only the bundles with more than 
\begin_inset Formula $100$
\end_inset

 tracks (TC100).
 The average value for TC10 was 
\begin_inset Formula $0.47$
\end_inset

 with standard deviation 
\begin_inset Formula $0.026$
\end_inset

.
 As expected TC100 (bigger landmarks) did better with an average value of
 
\begin_inset Formula $0.53$
\end_inset

 and standard deviation 
\begin_inset Formula $0.049$
\end_inset

.
 We then calculated the t-statistic between TC10 and TC100, which was 
\begin_inset Formula $-8.036$
\end_inset

 with p-value 
\begin_inset Formula $3.58\times10^{-10}$
\end_inset

.
 This is strong evidence that TC100 is reliably higher than TC10, and it
 suggests that QB can be used to find agreements between different brains.
\begin_inset Note Note
status open

\begin_layout Plain Layout
Not sure how the t value difference between TC10 and TC100 shows the ability
 of QB to find agreements between brains - can you say more? 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
!!!!CREATE A DIAGRAM WHERE YOU SHOW COMPARISONS FROM DIFFERENT ORDERINGS
 AND DIFFERENT SUBJECTS
\end_layout

\begin_layout Plain Layout
There are correspondences between different subjects
\end_layout

\begin_layout Plain Layout
10 means more than we keep clusters with more than 10 tracks and 100 means
 more than 100 tracks.
 I think!
\end_layout

\begin_layout Plain Layout
For different subjects 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

 , TC10 mean 0.4690955208544213, TC10 std 0.02635549381315298, 
\end_layout

\begin_layout Plain Layout
TC100 mean 0.52516460626048289, TC100 std 0.049120754359623708, 
\end_layout

\begin_layout Plain Layout
TC100 is better than TC10
\end_layout

\begin_layout Plain Layout
Comparing TC10 with TC100 and showed that TC100 is a reliably higher figure
 than TC10
\end_layout

\begin_layout Plain Layout
t-test t-statistic -8.0359458956507464, p-value 3.5820657551255408e-10.
\end_layout

\end_inset


\change_inserted 3 1319203524

\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/TC_comparisons_diff_subjects.png
	lyxscale 60
	scale 60

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout

\end_layout

\end_inset


\begin_inset Note Note
status open

\begin_layout Plain Layout
This bar chart would make Edward Tufte very upset !
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout

\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Parallel version
\begin_inset CommandInset label
LatexCommand label
name "sub:Parallel-version"

\end_inset


\end_layout

\begin_layout Subsubsection
Algorithm
\end_layout

\begin_layout Standard
QB is already a very efficient algorithm; however, we wanted to make it even
 more efficient so that, for example, it becomes trivial to cluster hundreds of
 subjects together using many CPUs or computers simultaneously.
 With the speed of QB this would make it possible to
 create an atlas of hundreds of subjects in a few hours.
 Therefore, we have extended QB to a parallel version which we call pQB.
 This algorithm works as follows.
 First we redirect and downsample all tracks.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Can be done in parallel
\end_layout

\end_inset

Then we pool all tracks and split them into subsets.
 For every subset we start a new thread and run QB on that thread,
 so that many QBs run on different CPUs at the same time.
 Then we collect all individual clusterings and start merging them.
 We can either merge the results pairwise in a binary tree fashion or
 merge all clusterings into the first clustering.
 Merging can be done in many different ways.
 Here we present the simplest approach that we found useful.
\begin_inset Note Note
status open

\begin_layout Plain Layout
This will give you a different answer to the serial version surely?
\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection

\series bold
Merging
\series default
 two sets of bundles
\end_layout

\begin_layout Standard
We can merge bundles using exemplar tracks or virtual tracks.
 We first set a distance threshold 
\begin_inset Formula $\mathtt{thr}$
\end_inset

, usually the same as the one used for the QBs in the previous step.
 Let's assume now that we have gathered the virtual tracks from clustering
 
\begin_inset Formula $A$
\end_inset

 in 
\begin_inset Formula $V_{A}=\{v_{0},\ldots,v_{|V_{A}|-1}\}$
\end_inset

 and from clustering 
\begin_inset Formula $B$
\end_inset

 in 
\begin_inset Formula $V_{B}=\{v_{0}^{'},\ldots,v_{|V_{B}|-1}^{'}\}$
\end_inset

 where 
\begin_inset Formula $|V|$
\end_inset

 denotes the number of virtual tracks of each clustering.
 
\begin_inset Formula $|V_{A}|$
\end_inset

 can be different from 
\begin_inset Formula $|V_{B}|$
\end_inset

.
 (a) For every 
\begin_inset Formula $v_{i}^{'}$
\end_inset

 in set 
\begin_inset Formula $V_{B}$
\end_inset

 we find the closest 
\begin_inset Formula $v_{j}$
\end_inset

 in set 
\begin_inset Formula $V_{A}$
\end_inset

 and store the distance between these two tracks.
 Therefore we now have a set of minimum distances from 
\begin_inset Formula $V_{B}$
\end_inset

 to 
\begin_inset Formula $V_{A}$
\end_inset

.
 The size of this set is equal to 
\begin_inset Formula $|V_{B}|$
\end_inset

.
 (b) Finally, we merge those clusters from 
\begin_inset Formula $B$
\end_inset

 whose virtual tracks have minimum distances smaller than 
\begin_inset Formula $\mathtt{thr}$
\end_inset

 into the corresponding clusters of 
\begin_inset Formula $A$
\end_inset

, and if a virtual track in 
\begin_inset Formula $V_{B}$
\end_inset

 has no sub-threshold neighbour in 
\begin_inset Formula $V_{A}$
\end_inset

 then its cluster becomes a new cluster in the merged clustering.
 In this way clusters from the two sets which have very similar virtual tracks
 are merged, otherwise new clusters are created, and no information
 from either set of clusters is lost.
\end_layout
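
\begin_layout Standard
Steps (a) and (b) above can be sketched in a few lines of Python.
 This is only a minimal illustration under stated assumptions: tracks are
 lists of (x, y, z) points with the same number of points, the distance is
 taken to be the MDF distance, the names mdf and merge_clusterings are
 hypothetical helpers and not part of any library, and the virtual track
 of a cluster in A is left unchanged after a merge, which a full implementation
 would probably want to update.
\end_layout

\begin_layout Standard
```python
import math


def mdf(s, t):
    """Minimum average direct-flip (MDF) distance between two tracks.

    Both tracks are lists of (x, y, z) points with the same number of points.
    """
    direct = sum(math.dist(p, q) for p, q in zip(s, t)) / len(s)
    flipped = sum(math.dist(p, q) for p, q in zip(s, reversed(t))) / len(s)
    return min(direct, flipped)


def merge_clusterings(A, B, thr):
    """Merge clustering B into clustering A (hypothetical helper).

    Each clustering is a list of dicts with keys 'virtual' (the virtual
    track) and 'indices' (identifiers of the member tracks).
    Assumption: the virtual tracks of A are not updated after merging.
    """
    merged = [{'virtual': c['virtual'], 'indices': list(c['indices'])} for c in A]
    for cb in B:
        # (a) closest virtual track in A and the corresponding minimum distance
        dists = [mdf(cb['virtual'], ca['virtual']) for ca in A]
        j = min(range(len(dists)), key=dists.__getitem__) if dists else None
        if j is not None and dists[j] < thr:
            # (b) sub-threshold neighbour found: join the matching cluster of A
            merged[j]['indices'].extend(cb['indices'])
        else:
            # no sub-threshold neighbour: the cluster of B becomes a new cluster
            merged.append({'virtual': cb['virtual'], 'indices': list(cb['indices'])})
    return merged
```
\end_layout

\begin_layout Standard
Because every cluster of B is either absorbed by a cluster of A or appended
 as a new cluster, the merged clustering contains every track index exactly
 once.
\end_layout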

\begin_layout Subsection
Direct applications
\end_layout

\begin_layout Standard
We found that QB has numerous applications from detecting erroneous tracks
 to creating atlases, finding landmarks and guiding registration algorithms.
 Here we present just a few of the strategies that can be pursued further.
\end_layout

\begin_layout Subsubsection
Rapidly detecting erroneous tracks
\end_layout

\begin_layout Standard
It is well known that tractographies contain artifacts
 caused by subject motion, poor voxel reconstruction, incorrect tracking
 and many other factors.
 However, we are not aware of an automatic method which detects the affected tracks
 so that they can be removed from the datasets.
 The idea here is that QB speeds up the search for erroneous tracks.
 We will concentrate here on tracks that loop one or more times, something
 that is considered anatomically implausible.
\end_layout

\begin_layout Standard
Among the tracks which are most likely erroneous are those which wind
 more than once, like a spiral.
 We can detect them with the following algorithm.
 Let us assume that we have a track 
\begin_inset Formula $s$
\end_inset

 and we want to check if it winds: (a) we perform a singular value decomposition
 on the centered track 
\begin_inset Formula $U,\mathbf{d},V=\mathtt{SVD}(s-\bar{s})$
\end_inset

; (b) scale the first column of 
\begin_inset Formula $U$
\end_inset

, 
\begin_inset Formula $U_{0}$
\end_inset

, by the highest singular value 
\begin_inset Formula $\mathbf{d_{0}}$
\end_inset

 to obtain the first component 
\begin_inset Formula $p_{x}$
\end_inset

 of a two dimensional coordinate, and the second column 
\begin_inset Formula $U_{1}$
\end_inset

 by the second highest singular value 
\begin_inset Formula $\mathbf{d_{1}}$
\end_inset

 to obtain the second component 
\begin_inset Formula $p_{y}$
\end_inset

; (c) calculate the cumulative winding angle on the 2d plane; and (d) if
 the cumulative angle is more than 
\begin_inset Formula $400^{\circ}$
\end_inset

 then the initial track 
\begin_inset Formula $s$
\end_inset

 is considered to be winding and is therefore removed, see Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:winding"

\end_inset

.
\end_layout
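
\begin_layout Standard
A minimal NumPy sketch of steps (a) to (d) is given here for illustration.
 It assumes that the track is an array of shape (N, 3) and that each term
 of the cumulative angle is the angle between two consecutive projected
 points as seen from the centre of mass of the track; this reading of the
 winding angle, as well as the function names, are our assumptions for
 the purpose of the example.
\end_layout

\begin_layout Standard
```python
import numpy as np


def winding_angle(track):
    """Cumulative winding angle in degrees of a track given as an (N, 3) array."""
    s = np.asarray(track, dtype=float)
    # (a) singular value decomposition of the centred track
    U, d, Vt = np.linalg.svd(s - s.mean(axis=0), full_matrices=False)
    # (b) 2d coordinates: first two columns of U scaled by d_0 and d_1
    p = U[:, :2] * d[:2]
    # (c) angle between consecutive projected points, seen from the centre
    v0, v1 = p[:-1], p[1:]
    norms = np.linalg.norm(v0, axis=1) * np.linalg.norm(v1, axis=1)
    valid = norms > 0  # skip points which coincide with the centre
    cosines = np.sum(v0 * v1, axis=1)[valid] / norms[valid]
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.degrees(angles.sum()))


def is_winding(track, max_angle=400.0):
    # (d) flag the track when the cumulative angle exceeds the threshold
    return winding_angle(track) > max_angle
```
\end_layout

\begin_layout Standard
With this definition a helix which turns twice around its axis accumulates
 roughly 720 degrees and is flagged, whereas a straight track is not.
\end_layout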

\begin_layout Standard
Winding tracks can be problematic when we merge clusters because they could
 be close to many different clusters.
 We found that winding tracks often form bundles with many similar tracks.
 Also, they are usually long tracks, so they will not be removed by
 filters which remove short tracks.
 We could use QB with a low threshold to reduce the number of tracks while
 avoiding embedding winding tracks into otherwise ordinary clusters, and
 then run the winding algorithm just on the exemplar tracks of the bundles
 rather than on the entire tractography.
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:winding"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/winding.png
	scale 50

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Example of detecting a possibly erroneous 3d bundle (on the left) by projecting
 its exemplar track and computing the cumulative winding angle 
\begin_inset Formula $\sum_{i=0}^{N-1}\omega_{i}$
\end_inset

 on the 2d plane as shown on the right, where 
\begin_inset Formula $N$
\end_inset

 is the total number of track segments.
 Usually bundles with total angle higher than 400 degrees are removed from
 the datasets as most likely to be erroneous.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
QB can also simplify detection of tracks which are very dissimilar to others
 and are therefore very distant from all other clusters.
 Usually when we use a QB threshold of about 
\change_inserted 3 1319698782

\begin_inset Formula $10$
\end_inset


\change_unchanged
mm these tracks will be part of small bundles containing a few tracks (
\begin_inset Formula $<10$
\end_inset

) and the distance of the bundle they belong to from all other bundles will
 be much higher than average.
 This can give us another detection method for outliers.
\end_layout

\begin_layout Standard
Finally, QB can be used to remove small or broken tracks in an interactive
 way, for example see Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:cst_pbc"

\end_inset

 where the red large bundle has been merged by an expert and then with QB
 we can extract the skeleton of the bundle and see which parts create that
 structure.
 Without QB it would be very difficult to work out that this bundle consists
 of many small or divergent parts.
 In this figure strongly diverging, small and broken tracks can all be identified
 after the simplification provided by QB.
 
\end_layout

\begin_layout Standard
In summary, we have shown that QB can facilitate a fully automatic, efficient
 and robust detection system for erroneous tracks in specific bundles or
 entire tractographies.
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/erroneous_tracks.png
	lyxscale 30
	scale 65
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Flo:erroneous_tracks"

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
161 most likely erroneous bundles automatically detected by our winding
 method, all having a total winding angle higher than 500 degrees and shown
 with random colours per bundle.
 In the left panel we see the bundles at their exact position in the dataset
 from the top of the head, in the middle panel we see the same tractography
 from the side, and in the third panel we see the part of the middle panel inside the
 red box, slightly rotated and strongly zoomed so that some erroneous tracks
 can be easily seen.
 To cluster the initial tractography (not shown here) we used QB with threshold
 10mm.
 To our knowledge this is the first automatic detection system for outliers and erroneous
 tracks in tractography data.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Really? Nothing for short tracks for example?
\end_layout

\end_inset

The ratio of the number of winding tracks to the total
 number of tracks provides an indicator of the quality of a dataset.
\begin_inset Note Note
status open

\begin_layout Plain Layout
These are all winding tracks? The windingness is difficult to see - maybe
 some isolated examples?
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Alignments, landmarks and atlases
\begin_inset CommandInset label
LatexCommand label
name "sub:Atlases-made-easy"

\end_inset


\end_layout

\begin_layout Standard
We have used QB to construct a robust tractographic atlas in MNI space from
 data for 10 subjects.
 Here we explain the steps we used to achieve that.
\end_layout

\begin_layout Standard

\series bold
Alignment
\series default
.
 Tractographies were created using EuDX with QA produced by Generalized
 Q-Sampling with diffusion sampling length of 
\change_inserted 3 1319699084

\begin_inset Formula $1.2$
\end_inset


\change_unchanged
 and datasets from 
\change_inserted 3 1319699080

\begin_inset Formula $101$
\end_inset


\change_unchanged
 gradient directions with maximum b-value 4000 and 1 with b-value ~0, gathered
 from 10 healthy subjects at the MRC-CBU Siemens Trio using the 32 channel
 coil, STEAM sequence and voxel size 
\begin_inset Formula $2.5\times2.5\times2.5\textrm{mm}^{3}$
\end_inset

; see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Acquisition-sequences-in-use"

\end_inset

 for acquisition details.
 The EuDX parameters were: low QA threshold of 0.0239, propagation step size
 of 0.5, angular threshold of 60 degrees and total weighting of 0.5.
 The tractographies for all subjects were initially in native space and
 the goal was to warp them into MNI space using nonlinear registration.
 
\end_layout

\begin_layout Standard
Because registration is generally considered a difficult problem
 with a non-unique solution, we wanted to use a known,
 well established and robust method; we therefore chose 
\begin_inset Formula $\texttt{fnirt}$
\end_inset

 with the same parameters as used in the first steps of TBSS 
\begin_inset CommandInset citation
LatexCommand cite
key "Smith2006NeuroImage"

\end_inset

.
 For that reason FA volumes were generated from the same datasets using
 tensor fitting with weighted least squares after skull stripping with 
\begin_inset Formula $\texttt{bet}$
\end_inset

 and parameters '
\begin_inset Formula $\texttt{-F -f .2 -g 0}$
\end_inset

'.
 These FA volumes were again in native space, so we needed to warp
 them into MNI space.
 For this purpose a standard template 
\begin_inset Formula $\texttt{FMRIB58\_FA\_1mm}$
\end_inset

 from the FSL toolbox was used as the reference volume.
 However, we primarily wanted the displacements which perform a
 point-wise mapping from native space to MNI space, and we found this to
 be technically very difficult with the FSL tools as they assume that these
 displacements will be applied only to volumetric data and not to point
 data such as those used in tractographies.
 Finally, after considerable effort we found a combination of 
\change_inserted 1 1319032651

\begin_inset Formula $\texttt{flirt}$
\end_inset


\change_deleted 1 1319032671

\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
texttt{flirt}
\end_layout

\end_inset


\change_unchanged
, 
\begin_inset Formula $\texttt{invwarp}$
\end_inset

, 
\begin_inset Formula $\texttt{fnirtfileutils}$
\end_inset

 and 
\change_inserted 1 1319032687

\begin_inset Formula $\texttt{fnirtfileutils -withaff}$
\end_inset


\change_deleted 1 1319032700

\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
texttt{fnirtfileutils -withaff}
\end_layout

\end_inset


\change_unchanged
 which gave us the correct displacements.
 As this is very technical we will not describe it further here, but the
 code is available in the module 
\begin_inset Formula $\texttt{dipy.external.fsl}$
\end_inset

.
 It is also important to say that we did not use eddy correction with any
 datasets of this type, because eddy correction is unstable with volumes
 at high b-values: there is not much signal to guide a correct
 registration with the other volumes at lower b-values.
\end_layout

\begin_layout Standard
After creating the displacements for every subject, these were applied to
 all tractographies in native space so that they were mapped into MNI space
 with voxel size 
\begin_inset Formula $1\times1\times1\textrm{mm}^{3}$
\end_inset

.
 Having all tractographies in MNI space is very useful because
 we can now compare them against available templates or against each other
 and calculate different statistics.
 However, this is not where we stop; we proceed to generate a tractographic
 atlas using clustering.
\end_layout

\begin_layout Standard

\series bold
Tractographic Atlas.

\series default
 For all subjects: (a) load the warped tractography, (b) re-direct the tracks with respect to a static point
 
\begin_inset Formula $(100,100,100)$
\end_inset

 as explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:The-bi-directionality-problem"

\end_inset

, (c) downsample the tracks to have only 12 points, (d) calculate and store
 QB clustering with a 
\begin_inset Formula $10$
\end_inset

mm threshold.
 Then merge all clusterings again with 10 mm threshold as explained in section
 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Parallel-version"

\end_inset

 (merging).
 When creating an atlas by merging many different subjects, the most important
 issue is what to remove from the atlas as outliers.
 QB provides a possible solution to this problem.
 If we plot the number of tracks for each cluster sorted in ascending order
 we can see an interesting pattern: 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:atlas_big_bundles"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/big_bundles_atlas.png
	scale 60
	rotateOrigin center

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
14520 clusters were created by joining the QB clusterings of 10 subjects
 in MNI space.
 We found that most of the clusters had only a few tracks and only a few clusters
 had many.
 In the diagram above we can see that the largest 20% of the clusters contained more than
 90% of the total number of tracks.
 This is a very positive result, showing that there is a lot of agreement
 between different subjects, which would be useful for a solid atlas with
 the biggest bundles becoming landmark bundles and the small bundles removed
 as outliers.
\begin_inset Note Note
status open

\begin_layout Plain Layout
Not sure what largest bundles % is? Is the gradient at 100% then the number
 of tracks taken by the smallest bundles?
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\end_inset


\end_layout
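
\begin_layout Standard
The proportion reported in the figure caption can be computed directly
 from the cluster sizes.
 The following short Python function is a sketch of that calculation; the
 function name and the rule used to round the number of clusters are our
 own choices for illustration.
\end_layout

\begin_layout Standard
```python
def share_of_largest(cluster_sizes, fraction=0.2):
    """Fraction of all tracks held by the largest `fraction` of the clusters."""
    sizes = sorted(cluster_sizes, reverse=True)
    # number of clusters considered, at least one (rounding down is an assumption)
    k = max(1, int(fraction * len(sizes)))
    return sum(sizes[:k]) / sum(sizes)
```
\end_layout

\begin_layout Standard
For example, for five clusters with 90, 4, 3, 2 and 1 tracks the single
 largest cluster (20% of the clusters) holds 90% of the tracks.
\end_layout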

\begin_layout Standard

\series bold
Finding and Using Landmarks
\end_layout

\begin_layout Standard
One can use this atlas or similar atlases created from more subjects in
 order to select specific structures and study these structures directly
 in different subjects without using any of the standard region-of-interest
 based methods.
\end_layout

\begin_layout Standard
A simple example is given in Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:CloseToSelected"

\end_inset

.
 In the first row we see a tractographic atlas created by merging the QB
 clusterings of 10 healthy subjects as described in the previous section.
 From these clusters, represented by their virtual tracks, we keep only
 the 196 biggest clusters, i.e.
 those which contain the highest number of tracks, so that we are sure that
 there is enough agreement between the different tractographies.
 From these we pick, by way of example, 19 virtual tracks which correspond to
 bundle structures well known in the literature: 1 from the genu of the corpus callosum
 (GCC), 3 from the body of the corpus callosum (BCC), 1 from the splenium (SCC),
 1 from the pons cerebellar peduncle (CP), 1 from the left arcuate fasciculus
 (ARC-L), 1 from the right arcuate fasciculus (ARC-R), 1 from the left inferior
 occipitofrontal fasciculus (IFO-L), 1 from the right inferior occipitofrontal
 fasciculus (IFO-R), 1 from the right fornix (FX-R), 1 from the left fornix (FX-L),
 1 from the optic radiation (OR), 1 from the left cingulum (CGC-L), 1 from the right cingulum
 (CGC-R), 1 from the left corticospinal tract (CST-L), 1 from the right corticospinal
 tract (CST-R), 1 from the left uncinate (UNC-L) and 1 from the right uncinate (UNC-R).
 These 19 tracks are coloured randomly.
 Then on the second row we show, for the first 6 selected tracks, 
\begin_inset Note Note
status open

\begin_layout Plain Layout
selections of what?
\end_layout

\end_inset

 the tracks of 3 arbitrarily selected subjects which are closer than 20mm.
 Similarly, on the third row we show the tracks closer than 15mm to the next 7 picked
 tracks
\begin_inset Note Note
status open

\begin_layout Plain Layout
not sure what these 7 are
\end_layout

\end_inset

.
 Finally, on the last row we show the tracks from the same 3 subjects which
 are closer than 18mm to the remaining 6 picked tracks.
 The colours used for the selected tracks are automatically assigned from
 the colours of the tracks picked from the atlas.
 We can see that there is considerable consistency and continuity within
 and between different subjects even though we only selected a very small
 number of tracks.
 Using a similar procedure we could create a book of bundles for every subject
 and then compare the subjects at the level of bundles.
 
\end_layout
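
\begin_layout Standard
The selection of tracks close to a landmark can be sketched as follows.
 This illustration assumes that closeness is measured with the MDF distance,
 that all tracks have been downsampled to the same number of points as
 the landmark, and that tracks are lists of (x, y, z) points; the function
 names are hypothetical.
\end_layout

\begin_layout Standard
```python
import math


def mdf(s, t):
    """Minimum average direct-flip (MDF) distance between two tracks."""
    direct = sum(math.dist(p, q) for p, q in zip(s, t)) / len(s)
    flipped = sum(math.dist(p, q) for p, q in zip(s, reversed(t))) / len(s)
    return min(direct, flipped)


def tracks_near_landmark(landmark, tracks, thr):
    """Indices of the tracks whose distance to the landmark is below thr (mm)."""
    return [i for i, t in enumerate(tracks) if mdf(landmark, t) < thr]
```
\end_layout

\begin_layout Standard
Calling this function once per landmark and per subject, with the thresholds
 quoted above, gives for every subject the sets of tracks which can then
 be compared at the level of bundles.
\end_layout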

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/close_distance.png
	lyxscale 30
	scale 70

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Flo:CloseToSelected"

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
A novel way to do comparisons between subjects.
 Correspondence between different subjects (last 3 rows) and a few landmarks
 picked from the tractographic atlas generated by merging QB clusterings
 of 10 subjects (top row).
 That we can see this amount of agreement and continuity in the last 3 rows
 from so few skeletal tracks is encouraging for implementing new robust
 ways of statistical comparison using tractographic datasets.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
QB as input to other learning methods
\end_layout

\begin_layout Standard
We found that QB is of great value as an adjunct to many less efficient
 algorithms e.g.
 hierarchical clustering, affinity propagation, nearest neighbours, spectral
 clustering and other unsupervised and supervised learning methods.
 We present here one example with QB as input to affinity propagation and
 one with QB as input to hierarchical clustering.
\begin_inset Note Note
status open

\begin_layout Plain Layout
I wonder if you can use QB as input to QB? I mean to try and find the 'global
 minimum' clustering - the 'mean' clustering from all possible clusterings
 with QB - independent therefore of starting point.
 This might be somehow an optimum solution, and would fit naturally with
 parallel solutions so the serial version would be the same as the parallel
 version.
 
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Most clustering algorithms need to calculate all pairwise distances between
 tracks; this means that for a medium sized tractography of 250,000 tracks
 we would need 232 GBytes of RAM with single floating point precision,
 which is not available in personal computers and will not be soon.
 In such cases one might hope that sparse matrices could provide
 a good approximation; however, dense tractographies produce very dense distance
 matrices.
 The straightforward solution to this problem is to use QB in order to first
 segment the tractography into small clusters and then use the skeletons (i.e.
 exemplar or virtual tracks) of these clusters with other higher complexity
 operations which merge the clusters into bigger clusters.
\begin_inset Note Note
status open

\begin_layout Plain Layout
I remember Fernando talking about large scale diffusion clustering methods
 being partially based on the ability to ignore long distances when clustering.
 I guess QB here gives you a reliable measure of a long distance so that
 you can zero out this entry in your connection matrix.
\end_layout

\end_inset


\end_layout
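
\begin_layout Standard
The memory figure quoted above can be checked with a one-line calculation.
 The sketch below assumes that the full square matrix is stored with 4 bytes
 per value and that a GByte is taken as 2 to the power of 30 bytes; storing
 only one triangle of the symmetric matrix would halve the figure.
\end_layout

\begin_layout Standard
```python
def full_distance_matrix_gib(n_tracks, bytes_per_value=4):
    """Memory in GiB needed for a dense n_tracks x n_tracks distance matrix."""
    return n_tracks ** 2 * bytes_per_value / 2 ** 30
```
\end_layout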

\begin_layout Standard

\series bold
Procedure
\series default
:
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
We need a different name than algorithm so that we don't have any 
\end_layout

\begin_layout Plain Layout
conflicts with the float algorithms
\end_layout

\end_inset


\end_layout

\begin_layout Standard
(1) Cluster using QB as explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset


\end_layout

\begin_layout Standard
(2) Gather virtual tracks
\end_layout

\begin_layout Standard
(3) Calculate the MDF distance between every pair of virtual tracks.
\end_layout

\begin_layout Standard
(4) Use any other clustering method to segment this much smaller distance
 matrix 
\begin_inset Formula $D$
\end_inset

.
\end_layout
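
\begin_layout Standard
Steps (3) and (4) of the procedure can be sketched as follows.
 This is an illustration under stated assumptions rather than the code we
 used: the virtual tracks are lists of (x, y, z) points, and instead of
 calling a clustering package we use the fact that cutting a single linkage
 dendrogram at a threshold gives the connected components of the graph
 which links all pairs with distance at most that threshold.
 The function names are hypothetical.
\end_layout

\begin_layout Standard
```python
import math


def mdf(s, t):
    """Minimum average direct-flip (MDF) distance between two tracks."""
    direct = sum(math.dist(p, q) for p, q in zip(s, t)) / len(s)
    flipped = sum(math.dist(p, q) for p, q in zip(s, reversed(t))) / len(s)
    return min(direct, flipped)


def distance_matrix(virtuals):
    """Step (3): MDF distances between all pairs of virtual tracks."""
    n = len(virtuals)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = mdf(virtuals[i], virtuals[j])
    return D


def single_linkage_cut(D, thr):
    """Step (4): labels from cutting a single linkage dendrogram at thr.

    Two items share a label if a chain of pairs with distance <= thr
    connects them (union-find over the sub-threshold pairs).
    """
    parent = list(range(len(D)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            if D[i][j] <= thr:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(len(D))]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```
\end_layout

\begin_layout Standard
Since the distance matrix is computed only between virtual tracks, its size
 depends on the number of QB clusters and not on the number of tracks in
 the tractography.
\end_layout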

\begin_layout Standard
In Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

 in the left panel we show a result where we used hierarchical clustering
 with single linkage for step (4) with a threshold of 20mm using the package
 
\begin_inset Formula $\texttt{hcluster}$
\end_inset

 (see 
\begin_inset CommandInset citation
LatexCommand cite
key "eads-hcluster-software"

\end_inset

).
 A known drawback of single linkage is the so-called chaining phenomenon:
 clusters may be brought together due to single elements being close to
 each other, even though many of the elements in each cluster may be very
 distant to each other.
 Chaining is usually considered a disadvantage because it is driven too strongly
 by local neighbours.
 Nevertheless, we can use this property to cluster the corpus callosum (CC)
 all together (shown with dark red in the top left of Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

) creating a fully automatic CC detection system.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
How reliable is this do you think?
\end_layout

\end_inset

Furthermore, we can use different cutting thresholds on the underlying dendrogram
 to amalgamate different structures, e.g.
 see the cingulum bundles in the same panel.
\end_layout

\begin_layout Standard
In the right panel of Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

 we see implementation of step (4) using a more recent algorithm: affinity
 propagation (AP) 
\begin_inset CommandInset citation
LatexCommand cite
key "dueck2009affinity"

\end_inset

, which was earlier identified by us (see Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

) and 
\begin_inset CommandInset citation
LatexCommand cite
key "malcolm2009filtered"

\end_inset

 as being difficult or impossible to use for group analysis or for clustering
 entire tractographies of many thousands of tracks.
 A short outline of how this algorithm works is given in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Affinity-Propagation"

\end_inset

.
 Here we see in the bottom right panel (see Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:LSC+HC+AP"

\end_inset

) how well AP, after the simplification provided by QB, has clustered
 arcuate, longitudinal occipitofrontal fasciculus and other structures known
 from the literature.
 The input of AP was the negative distance matrix 
\begin_inset Formula $-D$
\end_inset

, the preference weights were set to 
\begin_inset Formula $\mathtt{median}(-D)$
\end_inset

 and the hierarchical clustering parameter was set to 
\begin_inset Formula $20$
\end_inset

mm.
\end_layout
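
\begin_layout Standard
The input given to AP can be sketched with NumPy as shown next.
 This is only an illustration: the function name is hypothetical and the
 median is taken over all entries of the negative distance matrix, which
 is our reading of the preference setting described above.
 The resulting matrix could then be passed, for example, to the AffinityPropagation
 class of scikit-learn with the option affinity='precomputed'.
\end_layout

\begin_layout Standard
```python
import numpy as np


def ap_similarities(D):
    """Similarity matrix for AP: S = -D with the preference on the diagonal."""
    S = -np.asarray(D, dtype=float)  # negation creates a new array
    preference = float(np.median(S))  # assumption: median over all entries
    np.fill_diagonal(S, preference)
    return S, preference
```
\end_layout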

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/LSC_with_others.png
	lyxscale 30
	scale 70

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Flo:LSC+HC+AP"

\end_inset


\begin_inset Caption

\begin_layout Plain Layout
Two examples where QB output is used to cluster an entire set of 
\begin_inset Formula $10$
\end_inset

 tractographies together and then the result is given as input to hierarchical
 clustering (HC) using single linkage on the left and to affinity propagation
 (AP) on the right.
 Colours encode cluster labels and on the left we see 
\begin_inset Formula $19$
\end_inset

 clusters and on the right 
\begin_inset Formula $23$
\end_inset

.
 QB greatly facilitates the operation of the other two algorithms, which
 would otherwise not be able to cluster the entire datasets on current computers.
 Note the top left panel, where QB+HC has managed to cluster
 the entire CC as one bundle.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
For the hierarchical clustering we used the software 
\begin_inset ERT
status open

\begin_layout Plain Layout

hcluster
\end_layout

\end_inset

 and for affinity propagation we used 
\begin_inset ERT
status open

\begin_layout Plain Layout

scikit-learn
\end_layout

\end_inset

, both implemented in 
\begin_inset ERT
status open

\begin_layout Plain Layout

Python
\end_layout

\end_inset

.
\end_layout

\begin_layout Subsubsection
Exemplar tracks vs ROIs vs Masks
\end_layout

\begin_layout Standard
Medical practitioners and neuroanatomists often argue that when they use
 multiple spherical or rectangular masks to select some bundles, many tracks
 are thrown away because they are small and the mask operations cannot get
 hold of them.
 Our method provides a solution to this problem as it can identify broken
 or smaller bundles inside other bigger bundles which are otherwise very
 difficult or sometimes even impossible to identify visually or with the
 use of masks.
 It offers a very efficient and robust
 approach to the unsupervised clustering of tractographies
 and greatly facilitates tractography exploration and interpretation.
 The point here is that one can now use exemplar tracks as access points into
 the full tractography and with a single click on an exemplar track obtain
 the entire bundle.
 Therefore a super-bundle can be created with just a few clicks based
 on a selection of exemplar tracks.
 
\end_layout

\begin_layout Standard
In order to create this system we implemented a collision detection system
 in Python and OpenGL similar to those used in commercial game engines.
 This project has attracted great interest from other researchers and
 has now become a new independent scientific software project available at
 
\begin_inset Formula $\texttt{http://www.fos.me}$
\end_inset

.
\end_layout

\begin_layout Subsection
Affinity Propagation
\begin_inset CommandInset label
LatexCommand label
name "sub:Affinity-Propagation"

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
You have already used and referred to AP above.
 Put this section earlier?
\end_layout

\end_inset

Affinity propagation (AP) is a very recent 
\begin_inset CommandInset citation
LatexCommand cite
key "frey2007clustering"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "dueck2009affinity"

\end_inset

 
\begin_inset Formula $O(N^{2})$
\end_inset

 clustering method which is based on loopy belief propagation 
\begin_inset CommandInset citation
LatexCommand cite
key "pearl1988probabilistic"

\end_inset

 and other recent innovations in graphical models and more specifically
 is an instance of the max-sum algorithm in factor graphs.
 AP is an exemplar based clustering method where the centre of a cluster
 is a real data point (exemplar), as in k-medoids and k-centres, rather than
 an average virtual point as in k-means.
 AP starts by simultaneously considering all data points as potential exemplars.
 Every data point is a node in a network and AP recursively transmits real-valued
 messages along the edges of the network until a good set of exemplars
 and corresponding clusters emerges.
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status collapsed

\begin_layout Plain Layout
\noindent
\align center
\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/affinity_propagation_ok2.png
	lyxscale 40
	scale 40
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "Fig:AP_2d"

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
Simple example of affinity propagation at work where it can precisely identify
 4 different normal distributions with means 
\begin_inset Formula $(1,1),\;(-1,-1),\:(1,-1),$
\end_inset


\begin_inset Formula $(-2,2)$
\end_inset

 and standard deviation 
\begin_inset Formula $0.5$
\end_inset

.
 The exemplars (the most representative actual points) are drawn as thicker
 dots, closely aligned with the means.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
AP takes as input a collection of similarities between data points, where
 the similarity 
\begin_inset Formula $S(i,k)$
\end_inset

 indicates how well the data point with index 
\begin_inset Formula $k$
\end_inset

 is suited to be the exemplar for data point 
\begin_inset Formula $i$
\end_inset

.
 To build intuition for AP, suppose for the moment that we want to cluster
 2D data points and that each similarity is the negative squared Euclidean
 distance 
\begin_inset Formula $S(i,k)=-||x_{i}-x_{k}||^{2}$
\end_inset

 (see Fig.
\begin_inset space ~
\end_inset


\begin_inset CommandInset ref
LatexCommand ref
reference "Fig:AP_2d"

\end_inset

); 
\begin_inset Formula $S$
\end_inset

 is then the negative of the complete squared distance matrix.
 Rather than requiring the number of clusters to be prespecified, AP places
 a real number (the preference weight) on each diagonal element of 
\begin_inset Formula $S$
\end_inset

, one for each data point, so that points with larger values of 
\begin_inset Formula $S(k,k)$
\end_inset

 are more likely to become exemplars.
 For simplicity we can choose 
\begin_inset Formula $\mathrm{median}(S)$
\end_inset

 as the common preference weight for all points; in this way we do not impose
 any 
\emph on
a priori
\emph default
 preference for one point to be an exemplar over any other point.
 For some applications this could be an appropriate requirement.
 There are two kinds of messages exchanged between points: (1) responsibilities
 
\begin_inset Formula $R(i,k)=S(i,k)-{\displaystyle \max_{k':k'\neq k}}[S(i,k')+A(i,k')]$
\end_inset

 and (2) availabilities which are 
\family roman
\series medium
\shape up
\size normal
\emph off
\bar no
\noun off
\color none
initially 
\begin_inset Formula $A(i,k)=0$
\end_inset

 and then equal to:
\end_layout

\begin_layout Standard
\begin_inset Formula 
\begin{equation}
\forall i,k:\: A(i,k)=\begin{cases}
{\displaystyle \sum_{i':i'\neq i}\max[0,\: R(i',k)]}, & \text{for }k=i\\
\min\left[0,\: R(k,k)+{\displaystyle \sum_{i':i'\notin\{i,k\}}\max[0,\: R(i',k)]}\right], & \text{for }k\neq i
\end{cases}
\end{equation}

\end_inset


\end_layout

\begin_layout Standard
A simple version of affinity propagation is shown in alg.
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:AP"

\end_inset

.
 The way the final exemplars are obtained in AP is of particular interest.
 
\end_layout

\begin_layout Standard
After the messages have converged, there are two ways to identify exemplars:
\end_layout

\begin_layout Standard
1) For data point 
\begin_inset Formula $i$
\end_inset

, if 
\begin_inset Formula $R(i,i)+A(i,i)>0$
\end_inset

, then data point 
\begin_inset Formula $i$
\end_inset

 is an exemplar 
\end_layout

\begin_layout Standard
2) For data point 
\begin_inset Formula $i$
\end_inset

, if 
\begin_inset Formula $R(i,i)+A(i,i)>R(i,j)+A(i,j)$
\end_inset

 for all 
\begin_inset Formula $j$
\end_inset

 not equal to 
\begin_inset Formula $i$
\end_inset

, then data point 
\begin_inset Formula $i$
\end_inset

 is an exemplar.
 
\end_layout

\begin_layout Standard
Therefore, the availabilities and responsibilities are added to identify
 exemplars.
 For point 
\begin_inset Formula $i$
\end_inset

, the value of 
\begin_inset Formula $k$
\end_inset

 that maximizes 
\begin_inset Formula $A(i,k)+R(i,k)$
\end_inset

 either identifies 
\begin_inset Formula $i$
\end_inset

 as an exemplar if 
\begin_inset Formula $k=i$
\end_inset

, or identifies the data point that is the exemplar for point 
\begin_inset Formula $i$
\end_inset

.
 The message passing procedure is terminated after a fixed number of iterations,
 or once the changes in the messages fall below a threshold, or once the local
 decisions stay constant for a number of iterations; the messages are also damped,
 combining the previous with the current message, to avoid numerical oscillations.
 
\end_layout

\begin_layout Standard
When we need to calculate similarities between many points, the similarity
 matrix can become too big for the available memory.
 If the data sets are sparse, AP can be run on sparse matrices; if they are
 not, as is the case with tractographies, we need to reduce the dimensionality
 of the data sets first, and this is where QB can be very useful.
 The complete algorithm for AP is given in alg.
\begin_inset CommandInset ref
LatexCommand ref
reference "alg:AP"

\end_inset

.
\end_layout

\begin_layout Standard
\begin_inset Float algorithm
wide false
sideways false
status open

\begin_layout Plain Layout
\begin_inset CommandInset label
LatexCommand label
name "alg:AP"

\end_inset


\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
textbf{Input} Similarity/affinity matrix $S$ where the diagonal elements
 of $S(k,k)$ indicate the a priori preference for $k$ to be chosen as an
 exemplar 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
textbf{Output} Clustering $C_{AP}=
\backslash
{c_{0},...,c_{k},...,c_{|C_{AP}|-1}
\backslash
}$, where a cluster $c=
\backslash
{I,
\backslash
mathbf{e},N
\backslash
}$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
forall i,k:
\backslash
: A(i,k)=R(i,k)=0$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$S=S+n$ 
\backslash
# add small random noise $n$ to remove degeneracies
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$d=0.5$ 
\backslash
# set damping factor
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

last
\backslash
_iter=$100$ 
\backslash
# last iteration 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
textbf{For}$ iter $
\backslash
textbf{From}$ $1$ $
\backslash
textbf{To}$ last
\backslash
_iter $
\backslash
textbf{Do}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $R_{old}=R$
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
forall i,k:
\backslash
: R(i,k)=S(i,k)-
\backslash
max_{k':k'
\backslash
neq k}[S(i,k')+A(i,k')]$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $R=(1-d)R+dR_{old}$ 
\backslash
# dampen responsibilities 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $A_{old}=A$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} 
\backslash
# update availabilities 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $
\backslash
forall i,k:
\backslash
: A(i,k)=
\backslash
begin{cases} 
\backslash
sum_{i':i'
\backslash
neq i}
\backslash
max[0,
\backslash
: R(i',k)], & 
\backslash
mathrm{for}
\backslash
: k=i
\backslash

\backslash
 
\backslash
min
\backslash
left[0,
\backslash
: R(k,k)+
\backslash
sum_{i':i'
\backslash
notin
\backslash
{i,k
\backslash
}}
\backslash
max[0,R(i',k)]
\backslash
right], & 
\backslash
mathrm{for}
\backslash
: k
\backslash
neq i
\backslash
end{cases}$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
hspace*{2em} $A=(1-d)A+dA_{old}$ 
\backslash
# dampen availabilities 
\backslash

\backslash

\end_layout

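\begin_layout Plain Layout

$I_{d}=
\backslash
{k:
\backslash
: R(k,k)+A(k,k)>0
\backslash
}$ 
\backslash
# indices of exemplars 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$
\backslash
forall i, I_{e}=argmax
\backslash
: S(i,I_{d})$ 
\backslash
# most similar exemplar for each point 
\backslash

\backslash

\end_layout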

\begin_layout Plain Layout

$I_{e}(I_{d})=1:size
\backslash
: (I_{d})$ 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$L=I_{d}(I_{e})$ 
\backslash
# assign labels 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout

$C_{AP}=
\backslash
{c_{0},...,c_{k},...,c_{|C_{AP}|-1}
\backslash
}$ 
\backslash
# clustering output
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
# where a cluster $c=
\backslash
{I,
\backslash
mathbf{e},N
\backslash
}$ holds the AP exemplars $
\backslash
mathbf{e}$,
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
# the indices $I$ of the cluster elements and $N$ the number of elements
 
\backslash

\backslash

\end_layout

\begin_layout Plain Layout


\backslash
caption{Affinity Propagation}
\end_layout

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
Using now the labels we can add an extra step in AP where we create a tree
 structure $C_{AP}=
\backslash
{c_{0},...,c_{k},...,c_{|C_{AP}|-1}
\backslash
}$, where a cluster $c=
\backslash
{I,
\backslash
mathbf{e},N
\backslash
}$ holds the exemplars $
\backslash
mathbf{e}$, the indices $I$ of the cluster elements and $N$ the number of
 elements in a cluster so that it resembles with the output of QB.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout
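
\begin_layout Standard
The message updates above can be sketched in a few lines of NumPy.
 This is a minimal illustration under stated assumptions, not the implementation
 used for the results in this chapter: the function name affinity_propagation
 and its arguments are hypothetical, the preferences are assumed to be already
 stored on the diagonal of S, the noise step that removes degeneracies is
 omitted, and the loop simply runs for a fixed number of iterations.
\end_layout

```python
import numpy as np


def affinity_propagation(S, damping=0.5, n_iter=100):
    """Sketch of AP on a similarity matrix S (preferences on the diagonal)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    idx = np.arange(n)
    R = np.zeros((n, n))  # responsibilities R(i, k)
    A = np.zeros((n, n))  # availabilities A(i, k)
    for _ in range(n_iter):
        # R(i,k) = S(i,k) - max_{k' != k} [S(i,k') + A(i,k')]
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = (1 - damping) * R_new + damping * R  # dampen responsibilities

        # Column sums of max(0, R(i',k)), keeping R(k,k) itself unclipped.
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp  # drops the i' = i term
        diag = A_new[idx, idx].copy()         # A(k,k) is not clipped at 0
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = (1 - damping) * A_new + damping * A  # dampen availabilities

    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Each point is labelled with the index of its most similar exemplar.
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```

\begin_layout Standard
The sketch stores the full N by N matrices, so it is only practical when
 the number of points is modest, for example when it is applied to the exemplars
 returned by QB rather than to a complete tractography.
\end_layout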

\begin_layout Subsection
Direct Tractography Registration
\end_layout

\begin_layout Standard
Direct tractography registration is a recently studied problem with a small
 number of publications and, to our knowledge, no publicly available
 solutions.
 By direct registration we mean that no other information apart from the
 tractographies themselves is guiding the registration.
 This is in contrast to the previous sections where we used FA registration
 mappings applied to tractographies (see section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

) which is also most commonly used in the literature along with other tensor
 based methods
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "goh2006algebraic"

\end_inset

.
 The current known methodologies on this subject are as follows.
 Leemans et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "leemans2006multiscale"

\end_inset

 used the invariance of curvature and torsion under rigid transformations along
 with Procrustes analysis to co-register different tractographies.
 Mayer et al.
\begin_inset space ~
\end_inset

used iterative closest point to register pre-selected bundles (bundles
 of interest - BOI) 
\begin_inset CommandInset citation
LatexCommand cite
key "mayer2008bundles"

\end_inset

, 
\begin_inset CommandInset citation
LatexCommand cite
key "mayerdirect"

\end_inset

 and extended it using probabilistic boosting tree classifiers for bundle
 segmentation in
\begin_inset CommandInset citation
LatexCommand cite
key "mayer2011supervised"

\end_inset

.
 Durrleman et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "durrleman2010registration"

\end_inset

 reformulated the tracks as currents and implemented a currents-based
 registration.
 Zvitia et al.
\begin_inset space ~
\end_inset


\begin_inset CommandInset citation
LatexCommand cite
key "zvitia2008adaptive"

\end_inset

 
\begin_inset CommandInset citation
LatexCommand cite
key "Zvitia2010"

\end_inset

 used adaptive mean shift clustering to extract a number of representative
 fibre modes.
 Each fibre mode was assigned a multivariate Gaussian distribution according
 to its population, thereby leading to a Gaussian mixture model (GMM) representation
 for the entire set of fibres.
 The registration between two fibre sets was treated as the alignment of
 two GMMs and was performed by maximizing their correlation ratio.
 A further refinement was added using RANSAC
\begin_inset CommandInset citation
LatexCommand cite
key "fischler1981random"

\end_inset

 to obtain all 
\begin_inset Formula $12$
\end_inset

 affine parameters.
 Ziyan et al.
\begin_inset CommandInset citation
LatexCommand cite
key "ZiyanMICCAI07"

\end_inset

 developed a nonlinear registration algorithm based on the log-Euclidean
 polyaffine framework
\begin_inset CommandInset citation
LatexCommand cite
key "Arsigny2009"

\end_inset

; however, this is not a direct tractography registration algorithm because they
 first create scalar volumes and therefore do not register the tracks
 themselves in their own space.
 
\end_layout

\begin_layout Standard
Our algorithm is efficient and simple to use, completely automatic,
 and provides a robust rigid direct tractography registration
 within seconds.
 This algorithm could be of great use when comparing healthy versus severely
 diseased brains e.g.
 stroke or vegetative state patients when non-rigid registration is not
 recommended because of severe asymmetries in the diseased brains.
 The algorithm relies on the ability of QB to robustly find good skeletons.
\end_layout

\begin_layout Standard

\series bold
Algorithm
\series default
.
 Here we describe a simple algorithm in which two tractographies 
\begin_inset Formula $T_{A}$
\end_inset

,
\begin_inset Formula $T_{B}$
\end_inset

 are aligned together in native space.
\end_layout

\begin_layout Enumerate
All tracks shorter than 
\begin_inset Formula $100$
\end_inset

mm or longer than 
\begin_inset Formula $300$
\end_inset

mm are removed from the datasets.
 This reduces the size of each tractography to about 
\begin_inset Formula $1/4$
\end_inset

 of its initial size (
\begin_inset Formula $\sim200,000$
\end_inset

 tracks).
\begin_inset Note Note
status open

\begin_layout Plain Layout
Why these numbers? Won't larger brains have different tracks removed compared
 to smaller brains?
\end_layout

\end_inset


\end_layout

\begin_layout Enumerate
Both tractographies are equidistantly downsampled so every track contains
 only 
\begin_inset Formula $12$
\end_inset

 points.
 
\end_layout

\begin_layout Enumerate
We run QB with a distance threshold of 
\begin_inset Formula $10$
\end_inset

mm for both tractographies.
\end_layout

\begin_layout Enumerate
Collect all exemplar tracks from clusters containing more than 
\begin_inset Formula $0.2\%$
\end_inset

 of the tracks.
 Let us denote these sets by 
\begin_inset Formula $E_{A}$
\end_inset

 and 
\begin_inset Formula $E_{B}$
\end_inset

.
\end_layout

\begin_layout Enumerate
Calculate all pairwise distances 
\begin_inset Formula $D=\mathtt{MDF}(E_{A},E_{B})$
\end_inset

 and save them in a rectangular matrix 
\begin_inset Formula $D$
\end_inset

.
 
\end_layout

\begin_layout Enumerate
Define the cost function to be minimized as the symmetric
 minimum distance 
\begin_inset Formula $SMD=\sum_{i}\min_{j}D(i,j)+\sum_{j}\min_{i}D(i,j)$
\end_inset

 .
\end_layout

\begin_layout Enumerate
Use modified Powell's method 
\begin_inset CommandInset citation
LatexCommand cite
key "fletcher1987practical"

\end_inset

 to minimize 
\change_inserted 3 1319745190

\begin_inset Formula $SMD$
\end_inset


\change_unchanged
 starting with zeroed initial conditions.
 In each iteration of the optimization 
\begin_inset Formula $E_{B}$
\end_inset

 will be transformed and 
\change_inserted 3 1319745183

\begin_inset Formula $SMD$
\end_inset


\change_unchanged
 will be recalculated.
 To ensure smooth rotations we use Rodrigues' rotation formula.
\end_layout
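
\begin_layout Standard
Steps 5 to 7 can be sketched as follows.
 This is a minimal illustration, not the code used for the experiments below.
 It assumes that the exemplar sets are stored as NumPy arrays of shape (number
 of tracks, 12, 3), that MDF is the minimum of the direct and flipped mean
 point-to-point distances, and that the six parameters are a rotation vector
 followed by a translation; the names mdf_matrix, smd, rodrigues and cost
 are hypothetical.
\end_layout

```python
import numpy as np


def mdf_matrix(EA, EB):
    """Pairwise MDF distances between track sets of shape (m, K, 3), (n, K, 3)."""
    direct = np.linalg.norm(EA[:, None] - EB[None], axis=-1).mean(axis=-1)
    flipped = np.linalg.norm(EA[:, None] - EB[None, :, ::-1], axis=-1).mean(axis=-1)
    return np.minimum(direct, flipped)


def smd(D):
    """Symmetric minimum distance: sum of row minima plus sum of column minima."""
    return D.min(axis=1).sum() + D.min(axis=0).sum()


def rodrigues(rotvec):
    """Rotation matrix from a rotation vector (axis times angle in radians)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def cost(params, EA, EB):
    """SMD between EA and EB after rotating and translating EB."""
    params = np.asarray(params, dtype=float)
    moved = EB @ rodrigues(params[:3]).T + params[3:6]
    return smd(mdf_matrix(EA, moved))
```

\begin_layout Standard
With SciPy available, one way to run the optimization from zeroed initial
 conditions would be scipy.optimize.minimize(cost, np.zeros(6), args=(EA, EB),
 method="Powell"); this call is given as an example of a Powell-type optimizer
 and is not necessarily the modified variant cited above.
\end_layout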

\begin_layout Standard
In Fig.
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:direct_registration"

\end_inset

A we see the result of this algorithm applied to two tractographies, represented
 by their exemplar tracks and depicted in orange and purple.
 We can see in the upper panel that the orange tractography is misaligned
 with respect to the purple one, and in the lower panel we see their improved
 alignment after applying our algorithm.
\end_layout

\begin_layout Standard

\series bold
Metric
\series default
.
 SMD is proposed here for registration of trajectory datasets, in the same way
 that one would use mutual information
\begin_inset CommandInset citation
LatexCommand cite
key "maes1997multimodality"

\end_inset

 or the correlation ratio 
\begin_inset CommandInset citation
LatexCommand cite
key "roche1998correlation"

\end_inset

 for registration of volumetric datasets.
 Nonetheless, the advantage of SMD is that it comes from robust landmarks
 generated by QB which bring together local and global components.
 Initially, it was not clear if we should use SMD or just the sum of all
 distances 
\begin_inset Formula $SD=\sum_{i,j}D(i,j)$
\end_inset

.
 Therefore, we ran a small experiment to assess the smoothness and convexity
 of these two cost functions.
 We plotted both functions under a single-axis translation or a single-angle
 rotation of the same tractography, as shown in Fig.
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:direct_registration"

\end_inset

 B and C.
 From these plots it was clear that, although only the SD was entirely convex
 under translations, under rotations the SD had stronger local minima, which
 is not a good property for registration.
 Furthermore, the SMD had steeper gradients towards the global minimum, which
 is a positive indicator for faster convergence.
 
\end_layout

\begin_layout Standard

\series bold
Experiments
\series default
.
 The first large-scale experiment used the tractography
 of a single individual, copied and transformed 
\change_inserted 3 1319745405

\begin_inset Formula $1000$
\end_inset


\change_unchanged
 times, with all three rotation angles ranging from 
\begin_inset Formula $-45$
\end_inset

 to 
\begin_inset Formula $45$
\end_inset

 degrees and all x, y, z translations ranging from 
\begin_inset Formula $-113$
\end_inset

 to 
\begin_inset Formula $113$
\end_inset

mm.
 Then we registered all transformed tractographies to the static one and
 calculated all pairwise MDF distances storing them in a square matrix 
\begin_inset Formula $D$
\end_inset

.
 We would expect that, if the registration were correct, the sum of all
 diagonal elements of 
\begin_inset Formula $D$
\end_inset

 would be close to 
\begin_inset Formula $0$
\end_inset

.
 This was confirmed for both cost functions, SD and SMD, which got close
 to zero 99.8% of the time; however, SMD was always closer to perfect alignment
 than SD, with a precision of more than 7 decimal places.
 Consequently, SMD was chosen as the better cost function for tractography
 registration.
\begin_inset Note Note
status open

\begin_layout Plain Layout
I seem to remember that registrations perform a bit differently when registering
 to transformed versions of themselves.
 I haven't got internet, but look for the AIR website (Roger Woods) and
 a comment about registrations to rotated versions of self.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
We used QA tractographies from 10 subjects and registered all possible
 pairs, 
\begin_inset Formula $\binom{10}{2}=45$
\end_inset

.
 Comparing different tractographies is not a trivial problem; however, we
 can use the tightness comparison (TC) metric explained in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Tightness-comparisons-1"

\end_inset

.
 The mean initial TC was 
\begin_inset Formula $34.8\%\pm8.0\%$
\end_inset

 and the mean final TC after applying our direct registration method was
 
\begin_inset Formula $48.1\%\pm6.1\%$
\end_inset

.
 This was a statistically highly significant improvement (
\begin_inset Formula $t_{\text{\textrm{paired}}}(44)=11.2$
\end_inset

 ,
\begin_inset Formula $p\leq10^{-13}$
\end_inset

 ).
\begin_inset Note Note
status open

\begin_layout Plain Layout
Isn't the metric used the registration related to the metric used to assess
 the registration? Maybe you could some measure of alignment of the structural
 image instead? I mean, transform the structural image with the same parameters.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:direct_registration"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/LSC_registration2.png
	lyxscale 30
	scale 80

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
In panel A we see two tractographies from different subjects before (up)
 and after rigid registration (down) using our method.
 In panel B we see the metric 
\begin_inset Formula $SMD$
\end_inset

 that we chose to optimize, computed for two copies of the same tractography with the
 second copy under translation (up) and rotation (down).
 This metric was found to be smooth with a single global minimum and only slightly
 non-convex, with small local minima.
 In panel C another possible candidate metric 
\begin_inset Formula $SD$
\end_inset

 is shown which, although more convex under translations, was much more problematic
 under rotations.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Strategies for Short Fibres
\end_layout

\begin_layout Standard
In many parts of this document we did not consider short tracks.
 This is valid because (a) the longer tracks are more likely to
 be useful landmarks when comparing or registering different subjects,
 since they are more likely to exist in most subjects, (b) removing
 short tracks facilitates the usage of distance-based clustering (no need
 for an adaptive distance threshold) 
\begin_inset Note Note
status open

\begin_layout Plain Layout
what is an adaptive distance threshold? Can this be in the earlier discussion?
\end_layout

\end_inset

and interaction with the tractography, and (c) one would usually first want to see
 the overall skeleton of the tractography and only later go into the details.
 Nonetheless, after having clustered the longer tracks there are many ways
 to assign the shorter bundles to their closest longer bundles.
 For this purpose we recommend using a distance other than 
\begin_inset Formula $d_{df}$
\end_inset

 (MDF), for example the minimum version of MAM, referred to as 
\begin_inset Formula $m_{in}$
\end_inset

 in eq.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:min_average_distance"

\end_inset

.
 
\end_layout

\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
status open

\begin_layout Plain Layout
\align center
\begin_inset CommandInset label
LatexCommand label
name "Flo:arcuate_close"

\end_inset


\begin_inset Graphics
	filename ../dev_trees/didaktoriko/last_figures/arcuate_small_fibers.png
	lyxscale 50
	scale 70
	rotateOrigin center

\end_inset


\end_layout

\begin_layout Plain Layout
\begin_inset Caption

\begin_layout Plain Layout
A simple and robust strategy for handling short and long tracks together
 by picking a track of interest from one of our atlases.
 Colourmap encodes track length.
 A: one selected atlas track, B: 245 subject tracks closer than 15mm
 (MDF distance), C: B tracks clustered in 23 skeletons, D: 3421 tracks closer
 than 6mm (MAM distance) from the skeletons of C are shown.
 We can see that a great number of shorter tracks have been brought together
 with the tracks in B.
\end_layout

\end_inset


\end_layout

\end_inset


\end_layout

\begin_layout Standard
Here are some simple strategies for clustering short fibres.
 The first is for unsupervised clustering and the second one is for supervised
 learning.
\end_layout

\begin_layout Standard

\series bold
Strategy 1
\series default
.
 Cluster the long tracks using QB with a distance threshold of 10mm, then
 cluster the short tracks (<100mm) with a lower threshold and assign them
 to their closest long-track bundle from the first clustering using the
 
\begin_inset Formula $m_{in}$
\end_inset

 distance.
\begin_inset Note Note
status open

\begin_layout Plain Layout
I could imagine a short track crossing a large track at right angles and
 still being matched with the larger track.
\end_layout

\end_inset


\end_layout

\begin_layout Standard

\series bold
Strategy 2
\series default
.
 Read the tractography of a single subject,
\change_deleted 3 1319745849
 
\change_unchanged
 use a tractographic atlas such as the one created in section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Atlases-made-easy"

\end_inset

 and pick one or more close skeletal tracks from that atlas and then find
 the closest tracks from the subject to that selected track using 
\begin_inset Formula $d_{df}$
\end_inset

,
\change_deleted 3 1319745849
 
\change_unchanged
 cluster the closest tracks found in the previous step and, for each one
 of these new skeletons, find the closest tracks using the 
\begin_inset Formula $m_{in}$
\end_inset

 distance.
 We should now have an amalgamation of shorter and longer fibres in one
 cluster.
 
\end_layout

\begin_layout Standard
An example of this second strategy is shown in Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:arcuate_close"

\end_inset

: (A) a track of interest from the arcuate fasciculus is selected from the
 tractographic atlas shown in Fig.
 
\begin_inset CommandInset ref
LatexCommand ref
reference "Flo:CloseToSelected"

\end_inset

(top row-middle), (B) the tracks of the subject closer than 15mm (
\begin_inset Formula $d_{df}$
\end_inset

) from the selected cluster are shown and clustered with a distance threshold
 of 6.25mm in (C), (D) from every skeleton track in C we find the closest
 tracks using the (
\begin_inset Formula $m_{in}$
\end_inset

) distance from the entire tractography.
\end_layout
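
\begin_layout Standard
The assignment step of Strategy 1 can be sketched as follows.
 This is a minimal illustration under stated assumptions: the minimum version
 of MAM is taken here to be the smaller of the two directed mean closest-point
 distances between two tracks, which should be checked against the definition
 in the equation referenced above, and the names mean_closest, mam_min and
 assign_short_tracks are hypothetical.
 Tracks are assumed to be NumPy arrays of shape (number of points, 3) and
 may have different numbers of points.
\end_layout

```python
import numpy as np


def mean_closest(s, t):
    """Mean, over the points of s, of the distance to the closest point of t."""
    d = np.linalg.norm(s[:, None] - t[None], axis=-1)
    return d.min(axis=1).mean()


def mam_min(s, t):
    """Assumed minimum version of MAM: smaller of the two directed distances."""
    return min(mean_closest(s, t), mean_closest(t, s))


def assign_short_tracks(short_tracks, long_exemplars, max_dist):
    """Label each short track with its closest long exemplar, or -1 if too far."""
    labels = []
    for s in short_tracks:
        d = [mam_min(s, e) for e in long_exemplars]
        k = int(np.argmin(d))
        labels.append(k if d[k] <= max_dist else -1)
    return labels
```

\begin_layout Standard
Because this distance uses only closest points, it does not take the orientation
 of the short track into account; an additional angular check may be needed
 if short tracks crossing a bundle should not be assigned to it.
\end_layout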

\begin_layout Subsection
Discussion and conclusion
\end_layout

\begin_layout Standard
In this document we presented a novel and powerful algorithm:
 QuickBundles (QB).
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
It is 'completely novel'? Compared to - say - BIRCH? Maybe just 'novel'?
\end_layout

\end_inset

This algorithm
\begin_inset Note Note
status open

\begin_layout Plain Layout
algorithm
\end_layout

\end_inset

 provides simplifications to the old problem of white matter anatomy,
 which has recently attracted much scientific attention, but it can equally be used
 for any trajectory clustering problem, especially when large data sets
 are involved.
 QB can be used with all types of diffusion MRI tractographies which generate
 streamlines, probabilistic or deterministic, and it is independent of the
 reconstruction model.
\end_layout

\begin_layout Standard
In common with mainstream clustering algorithms such as k-means, k-centres
 and expectation maximization, QB is not a global clustering method; therefore
 it can give different results under different initial conditions of the
 dataset when there is no obvious distance threshold which can separate
 the clusters into meaningful bundles.
 For example, we should expect different
 clusters under different permutations/orderings of the tracks in a densely
 packed tractography.
 However, we found that there is sufficient agreement between two clusterings
 of the same tractography with different orderings
\begin_inset Note Note
status open

\begin_layout Plain Layout
Need evidence for this assertion?
\end_layout

\end_inset

.
 If the clusters are truly separable by distances then there is a global
 solution independent of orderings.
 This is often perceivable in smaller subsets of the initial tractography.
 We empirically found that this problem is minimized even with real datasets
 when a low distance threshold of about 
\begin_inset Formula $10$
\end_inset

 mm is used.
 
\end_layout

\begin_layout Standard
Furthermore, the output of QB can now become the input for another recent,
 fast algorithm with quadratic average time 
\begin_inset Formula $O(M^{2})$
\end_inset

, affinity propagation, where now 
\begin_inset Formula $M\ll N$
\end_inset

; therefore the overall time stays linear in the number of tracks 
\begin_inset Formula $N$
\end_inset

.
 Other algorithms previously too slow to be used on the entire tractography
 can now be used efficiently too e.g.
 kNN, hierarchical clustering and many others.
\end_layout

\begin_layout Standard
We saw that QB is a clustering method based on track distances
 which runs on average in linear time 
\begin_inset Formula $O(N)$
\end_inset

 where 
\begin_inset Formula $N$
\end_inset

 is the number of tracks and with worst case 
\begin_inset Formula $O(N^{2})$
\end_inset

 when every track is a singleton cluster.
 To our knowledge QB is therefore the fastest tractography clustering method, and it
 even runs in real time on tractographies with fewer than 
\begin_inset Formula $20$
\end_inset

K tracks.
\end_layout
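
\begin_layout Standard
The complexity argument can be read directly off a simplified sketch of the
 clustering loop.
 The code below is an illustration under stated assumptions rather than the
 released implementation: the name quickbundles_sketch is hypothetical, all
 tracks are assumed to be downsampled to the same number of points and stored
 as NumPy arrays of shape (K, 3), and MDF is taken as the minimum of the
 direct and flipped mean point-to-point distances.
 Each track is compared only with the current virtual tracks, so the cost
 is proportional to N times the number of clusters M.
\end_layout

```python
import numpy as np


def quickbundles_sketch(tracks, threshold):
    """Single-pass clustering; returns labels and virtual (centroid) tracks."""
    sums, counts, labels = [], [], []
    for t in tracks:
        best, best_d, best_flip = -1, np.inf, False
        for k in range(len(sums)):
            centroid = sums[k] / counts[k]
            direct = np.linalg.norm(t - centroid, axis=1).mean()
            flipped = np.linalg.norm(t[::-1] - centroid, axis=1).mean()
            d = min(direct, flipped)
            if d < best_d:
                best, best_d, best_flip = k, d, flipped < direct
        if best_d < threshold:
            # Add the track (flipped if necessary) to the closest cluster.
            sums[best] = sums[best] + (t[::-1] if best_flip else t)
            counts[best] += 1
            labels.append(best)
        else:
            # No cluster is close enough: start a new one.
            sums.append(np.array(t, dtype=float))
            counts.append(1)
            labels.append(len(sums) - 1)
    return labels, [s / c for s, c in zip(sums, counts)]
```

\begin_layout Standard
When every track starts its own cluster the inner loop grows with the number
 of tracks already seen, which gives the quadratic worst case mentioned above.
\end_layout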

\begin_layout Standard
QB is fully automatic and robust: we find good
 agreement even between different subjects, and it can be used to create tractography
 atlases at high speed.
 Additionally, it can be used to explore multiple tractographies, find
 correspondences between tractographies, and create landmarks for registration
 or population comparisons.
 
\end_layout

\begin_layout Standard
QB can also be used for reducing the dimensionality of the data sets
 at the time of interaction, providing an alternative to ROIs through
 BOIs (bundles of interest) or TOIs (tracks of interest).
 We also showed that it can be used to find 
\begin_inset Quotes eld
\end_inset

hidden
\begin_inset Quotes erd
\end_inset

 tracks not visible to the user at first glance.
 Therefore QB opens the way to creating rapid tools for exploring massive
 tractographies.
\begin_inset Note Note
status collapsed

\begin_layout Plain Layout
Rotation, translation and scale invariant (check).
\end_layout

\begin_layout Plain Layout
Unlearned tracks will be added as new clusters as being very distant from
 all other clusters.
\end_layout

\begin_layout Plain Layout
Contains only one meaningful threshold i.e.
 distance threshold usually easily set in mm.
\end_layout

\begin_layout Plain Layout
Easy understand how it works when think of bundles as cylinders.
\end_layout

\begin_layout Plain Layout
Clusters hold the entire tractography information.
 Complete assignments - no fuzziness.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The main concept of this clustering method is that clusters can be represented
 by virtual tracks which are used only during cluster comparisons and not
 updated at every iteration.
\end_layout

\begin_layout Standard
A virtual (centroid) track is the average of all tracks in the cluster.
 We call it virtual because it does not exist in the real dataset
 and to distinguish it from exemplar (medoid) tracks which are again features
 of the cluster but are represented by real tracks.
 
\begin_inset Note Note
status open

\begin_layout Plain Layout
Could this explanation go earlier in the chapter?
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The clustering creates a book of bundles/clusters which have easy to obtain
 descriptors/features.
 When clusters are held in a tree structure this permits upwards amalgamations
 to form bundles out of clusters, and downwards disaggregation to split
 clusters into finer sub-clusters corresponding to a lower distance threshold.
 However, we have not explored this hierarchical extension of the algorithm here
 and have mostly concentrated on one-level amalgamations.
\end_layout

\begin_layout Standard
We worked mostly with long tracks but strategies for short tracks or bundles
 are straightforward and documented.
 We also showed an efficient method where QB can speed up finding erroneous
 bundles or detecting structures of specific characteristics.
\end_layout

\begin_layout Standard
We showed results with simulated data and with single and multiple real subjects.
 The code for QuickBundles is freely available at 
\begin_inset ERT
status open

\begin_layout Plain Layout

dipy.org
\end_layout

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
Further notes for discussion
\end_layout

\begin_layout Plain Layout
-No distance matrix
\end_layout

\begin_layout Plain Layout
-empirically average linear time on the #tracks
\end_layout

\begin_layout Plain Layout
-used for regist.
 /visual./landmarks/simplification
\end_layout

\begin_layout Plain Layout
-clinical relevance
\end_layout

\begin_layout Plain Layout
-simple concept
\end_layout

\begin_layout Plain Layout
-fast -> real time
\end_layout

\begin_layout Plain Layout
- we haven't worked with hierarchies
\end_layout

\begin_layout Plain Layout
-bundles of interest (BOIs) seem more useful than ROIs
\end_layout

\begin_layout Plain Layout
Further notes for complexity
\end_layout

\begin_layout Plain Layout
O(N) best case (one cluster only)
\end_layout

\begin_layout Plain Layout
O(MN) average case but with M usually much smaller than N that can empirically
 go similar to O(N)
\end_layout

\begin_layout Plain Layout
O(N^2) worst case scenario (every track is a cluster - not used in practice)
\end_layout

\end_inset


\end_layout

\begin_layout Standard

\lang british
\begin_inset CommandInset bibtex
LatexCommand bibtex
bibfiles "diffusion"
options "plain"

\end_inset


\end_layout

\end_body
\end_document