\NeedsTeXFormat{LaTeX2e}[1995/12/01]
\documentclass[doublespacing]{bmcart}
%\documentclass[twocolumn]{bmcart}% uncomment this for twocolumn layout and comment line below
%%% additional documentclass options:
% [doublespacing]
% [linenumbers] - put the line numbers on margins
% Load packages
\usepackage{url} % Formatting web addresses
\urlstyle{rm}
\usepackage[utf8]{inputenc} %unicode support
\usepackage{amssymb}
%\usepackage{cite}
\usepackage{graphicx}
\usepackage{multirow}
%% Comment out before submission
%%\usepackage[colorinlistoftodos,textsize=small]{todonotes}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% %%
%% If you wish to display your graphics for %%
%% your own use using includegraphic or %%
%% includegraphics, then comment out the %%
%% following two lines of code. %%
%% NB: These line *must* be included when %%
%% submitting to BMC. %%
%% All figure files must be submitted as %%
%% separate graphics through the BMC %%
%% submission process, not included in the %%
%% submitted article. %%
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\def\includegraphic{}
%\def\includegraphics{}
%%% Put your definitions there:
\startlocaldefs
\endlocaldefs
%%%%%%%%%%%%%%%%%%%%%%%%%%%% RESEARCH PAPER %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\def \cdkversion {\textbf{2.0}}
\def \cdkversion {v2.0}
\begin{document}
%%% Start of article front matter
\begin{frontmatter}
\begin{fmbox}
\dochead{Research}
% previous titles:
%
% The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics.
% Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics.
\title{The Chemistry Development Kit (CDK) \cdkversion{}: atom typing, depiction, molecular formulas, and substructure searching}
\author[
addressref={um}, % ORCID: 0000-0001-7542-0286
%corref={aff1},
%noteref={n1},
email={[email protected]}
]{\inits{ELW}\fnm{Egon L} \snm{Willighagen}}
\author[
addressref={nm}, % ORCID: 0000-0001-7730-2646
email={[email protected]}
]{\inits{JWM}\fnm{John W} \snm{Mayfield}}
\author[
addressref={uppsala},
email={[email protected]}
]{\inits{JA}\fnm{Jonathan} \snm{Alvarsson}}
\author[
addressref={uppsala},
email={[email protected]}
]{\inits{AB}\fnm{Arvid} \snm{Berg}}
\author[
addressref={azg}, % ORCID: 0000-0001-9491-4134
email={[email protected]}
]{\inits{LC}\fnm{Lars}~\snm{Carlsson}}
\author[
addressref={idea},
email={[email protected]}
]{\inits{NJ}\fnm{Nina}~\snm{Jeliazkova}}
\author[
addressref={leicester},
email={[email protected]}
]{\inits{SK}\fnm{Stefan} \snm{Kuhn}}
\author[
addressref={wi_mit}, % ORCID: 0000-0002-6940-3006
email={[email protected]}
]{\inits{TP}\fnm{Tomáš}~\snm{Pluskal}}
\author[
addressref={miquel}, % ORCID: 0000-0002-4659-1446
email={[email protected]}
]{\inits{MRJ}\fnm{Miquel}~\snm{Rojas-Chertó}}
\author[
addressref={uppsala},
email={[email protected]}
]{\inits{OS}\fnm{Ola} \snm{Spjuth}}
\author[
addressref={gilleain}, % ORCID: 0000-0002-8368-6954
email={[email protected]}
]{\inits{GT}\fnm{Gilleain} \snm{Torrance}}
\author[
addressref={um},
email={[email protected]}
]{\inits{CTE}\fnm{Chris~T}~\snm{Evelo}}
\author[
addressref={nih},
email={[email protected]}
]{\inits{RG}\fnm{Rajarshi} \snm{Guha}}
\author[
addressref={jena},
email={[email protected]}
]{\inits{CS}\fnm{Christoph}~\snm{Steinbeck}}
\address[id=um]{
\orgname{Dept of Bioinformatics - BiGCaT, NUTRIM, Maastricht University}, % university, etc
%\street{Waterloo Road},
\postcode{NL-6200 MD},
\city{Maastricht},
\cny{The Netherlands}
}
\address[id=nm]{
\orgname{NextMove Software Ltd},
%\street{D\"{u}sternbrooker Weg 20},
\postcode{CB4 0EY},
\city{Cambridge},
\cny{UK}
}
\address[id=uppsala]{
\orgname{Department of Pharmaceutical Biosciences, Uppsala University},
%\street{D\"{u}sternbrooker Weg 20},
\postcode{751 24},
\city{Uppsala},
\cny{Sweden}
}
\address[id=azg]{
\orgname{AstraZeneca, Innovative Medicines \& Early Development, Quantitative Biology},
%\street{},
%\postcode{LE1 7RH},
\city{Mölndal},
\cny{SE}
}
\address[id=idea]{
\orgname{Ideaconsult Ltd},
\street{A. Kanchev 4},
\postcode{1000},
\city{Sofia},
\cny{Bulgaria}
}
\address[id=leicester]{
\orgname{Department of Informatics, University of Leicester},
%\street{},
%\postcode{LE1 7RH},
\city{Leicester},
\cny{UK}
}
\address[id=wi_mit]{
\orgname{Whitehead Institute for Biomedical Research},
\street{455 Main Street},
\postcode{MA 02142},
\city{Cambridge},
\cny{USA}
}
\address[id=miquel]{
\orgname{Química Clínica Aplicada},
%\street{4 Hanway Place},
\postcode{43870},
\city{Amposta},
\cny{Spain}
}
\address[id=gilleain]{
%\orgname{Department of Informatics, University of Leicester},
\street{4 Hanway Place},
\postcode{W1T 1HD},
\city{London},
\cny{UK}
}
\address[id=nih]{
\orgname{National Center for Advancing Translational Science},
\street{9800 Medical Center Drive},
\postcode{MD 20878},
\city{Rockville},
\cny{USA}
}
\address[id=jena]{
\orgname{Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University},
\street{Lessingstr. 8},
\postcode{07743},
\city{Jena},
\cny{Germany}
}
%\begin{artnotes}
%\note{Sample of title note} % note to the article
%\note[id=n1]{Equal contributor} % note, connected to author
%\end{artnotes}
\end{fmbox}% comment this for two column layout
%% The Abstract begins here
\begin{abstractbox}
\begin{abstract}
% Do not use inserted blank lines (ie \\) until main body of text.
\parttitle{Background}
The Chemistry Development Kit (CDK) is a
widely used open source cheminformatics toolkit, providing data
structures to represent chemical concepts along with methods to
manipulate such structures and perform computations on them. The
library implements a wide variety of cheminformatics algorithms
ranging from chemical structure canonicalization to molecular
descriptor calculations and pharmacophore perception. It is used
in drug discovery, metabolomics, and toxicology. Over the
last ten years, however, the code base has grown significantly, resulting in
many complex interdependencies among components and poor
performance of many algorithms.
\parttitle{Results}
We report improvements to the CDK \cdkversion{} since the v1.2 release series,
specifically addressing the increased functional complexity and poor
performance. We first summarize the addition of new functionality, such as
atom typing and molecular formula handling, and improvements to existing
functionality that has led to significantly better performance for
substructure searching, molecular fingerprints, and rendering of molecules.
Second, we outline how the CDK has evolved with respect to
quality control and the approaches we have adopted to ensure stability, including
a code review mechanism.
\parttitle{Conclusions}
This paper highlights our continued efforts to provide a community
driven, open source
cheminformatics library, and shows that such collaborative projects can thrive over
extended periods of time, resulting in a high-quality and performant
library. By taking
advantage of community support and contributions, we show
that an open source cheminformatics project can act as a peer reviewed
publishing platform for scientific computing software.
\end{abstract}
\begin{keyword}
\kwd{Java}
\kwd{cheminformatics}
\kwd{bioinformatics}
\kwd{metabolomics}
\kwd{depiction}
\end{keyword}
% MSC classifications codes, if any
%\begin{keyword}[class=AMS]
%\kwd[Primary ]{}
%\kwd{}
%\kwd[; secondary ]{}
%\end{keyword}
\end{abstractbox}
%
%\end{fmbox}% uncomment this for twcolumn layout
\end{frontmatter}
%%%%%%%%%%%%%%%%
%% Background %%
%%
\section*{Background}
The open source cheminformatics community has made significant steps
forward recently~\cite{OBoyle2011b} as evidenced by the growing number
of tools and underlying toolkits, along with the usage of these
software components in a variety of applications.
The Chemistry Development Kit (CDK) is one of the tools developed
under the aegis of the Blue Obelisk, a movement promoting Open Data,
Open Source, and Open Standards in chemistry~\cite{OBoyle2011b,Guha2006}.
The CDK provides data structures to represent chemical concepts along with
methods to manipulate such structures and perform computations on them.
Previously documented CDK versions have been widely adopted~\cite{Steinbeck2003,Steinbeck2006}.
Use of the CDK ranges from inclusion of CDK functionality in
wrapper platforms such as Cinfony~\cite{OBoyle2008} and incorporation
within the R environment (rcdk~\cite{Guha2007}) to plugins for
Taverna~\cite{Truszkowski2011},
KNIME~\cite{Beisken2013}, Cytoscape (ChemViz2~\cite{ChemViz2}), and
Microsoft Excel (LICSS~\cite{Lawson2012}).
In contrast to scenarios that have made CDK functionality available in
larger systems, a number of projects have employed the CDK as a
general cheminformatics toolkit. Examples include
jCompoundMapper~\cite{Hinselmann2011}, ScaffoldHunter~\cite{wetzel2009interactive,Klein2013}, OMG~\cite{Peironcely2012},
PaDEL~\cite{yap2011padel}, ChemDes~\cite{Dong2015},
ReactPRED~\cite{ReactPRED}, SMSD~\cite{Rahman2009,Rahman2014,Rahman2016},
WhichCyp~\cite{Rostkowski2013}, MetaPrint2D~\cite{Carlsson2010}, MetFrag~\cite{Wolf2010},
and the IUPHAR/BPS Guide to Pharmacology~\cite{Southan2016}, BRENDA~\cite{Placzek2017} and
QSAR DataBank~\cite{Ruusmann2015} databases.
A number of such tools were initially developed using older versions
of the CDK and are updated to new releases as they are made
available. Examples include Bioclipse~\cite{spjuth2007bioclipse,
spjuth2009bioclipse} and
AMBIT~\cite{jeliazkova2011ambit,jeliazkova2011ambitsmarts,kochev2013ambit}. The
CDK has also played a role in a number of chemical studies, such as finding
the maximally bridging rings in chemical structures~\cite{Marth2015},
prediction of organic reactions~\cite{Segler2016}, and bioactivities
of compounds~\cite{Alvarsson2016}.
While the CDK has purported to be a general purpose cheminformatics
toolkit, older versions were designed by a community with specific
applications in mind, primary among them being structure elucidation. In
addition, an implicit goal of previous versions was to have the CDK
serve as an educational resource to enable students of cheminformatics
to understand the underlying algorithms. This resulted in certain
functionalities, such as molecular fingerprinting~\cite{Clark2014,Cannon2006},
receiving more attention than others, such as stereochemistry. The
outcome was significant variance in performance and features
throughout the toolkit.
The growth of open source software over the last 10 years is evidence % add REF?
of the ability of communities of developers to develop systems and
processes that lead to high quality software systems for long term
use. The CDK is no different. The adoption of automatic build systems
and quality control methodologies such as unit testing, automated
source code validation, and peer review by fellow developers have
greatly improved the stability of the library. While this has slowed
development somewhat, it has allowed for cleaning up interdependencies
between modules of functionality, and importantly, has improved the scalability
of the development model. This has resulted in significant new
functionality in core application programming interfaces (APIs)
while maintaining the quality of code depending on those core APIs.
Examples of new features supported by the improved development model
include InChI functionality~\cite{Spjuth2013}, greatly improved ring
detection algorithms~\cite{May2014}, improvements to the core atom
type perception module that now covers a much more comprehensive set
of elements, charge states and radical species than previous versions,
a more comprehensive fingerprinting API, new depiction functionality,
and many speed and stability improvements.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Results and Discussion %%
%%
\section*{Implementation and Results}
This section describes the specifics of new APIs and
improvements to pre-existing methods that are available in the latest
CDK. We then discuss how we have improved and formalized the
development model for the project using unit testing, code review and
guidelines for handling version control. Finally we report on the
availability of binary distributions of the library, allowing users to
include specific modules (and their dependencies) of the CDK in their
own projects (as opposed to developers who work on the CDK library
itself).
\subsection*{New APIs and improved implementations}
We here outline various new and improved APIs in the CDK library since the
two previous publications in 2003 and 2006~\cite{Steinbeck2003,Steinbeck2006}.
\subsubsection*{Atom Typing}
Atom type perception is core cheminformatics functionality: the
atom types describe chemical features of atoms, such as the number
of neighbors, possible formal charges, (approximate) hybridization,
electron distribution over orbitals and so on. However, previous
versions of the CDK implemented atom type perception as part of
different algorithms, resulting in duplicated and sometimes
divergent typing schemes. As a result it was cumbersome to add new
atom types and implement support for new charged and radical species
in a consistent manner.
This CDK version has a new, centralized atom typing framework,
removing the perception of atom types from various algorithms. This
allows for a consistent and extensive typing scheme that can
also be tested independently of other code. The new code defines
the atom types using a list that specifies for each type the element
symbol, hybridization, formal charge, number of lone pairs, and an
enumeration of the bond orders (see Figure~\ref{fig:atomtype}). This list of
properties captures the information needed for the various
algorithms in the CDK. For example, hybridization information can be
used in certain aromaticity models (see later), and the lone pair
information is needed for the resonance structure calculation used,
for example, for Gasteiger $\pi$-charges.
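
To make the shape of such an atom type list concrete, the following is a minimal, self-contained Java sketch. It does not use the CDK API: the class \texttt{AtomTypeTable}, its fields, and its three example entries are hypothetical and only mirror the properties named above (element symbol, hybridization, formal charge, lone pair count, and bond orders). It also assumes that all bonds, including those to hydrogens, are given explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a centralized atom type list; not the CDK API. */
final class AtomTypeTable {

    /** One entry: the properties the text lists for each atom type. */
    static final class Entry {
        final String name;
        final String symbol;
        final String hybridization;
        final int formalCharge;
        final int lonePairs;
        final int[] bondOrders; // sorted, one value per bonded neighbour

        Entry(String name, String symbol, String hybridization,
              int formalCharge, int lonePairs, int... bondOrders) {
            this.name = name;
            this.symbol = symbol;
            this.hybridization = hybridization;
            this.formalCharge = formalCharge;
            this.lonePairs = lonePairs;
            this.bondOrders = bondOrders.clone();
            Arrays.sort(this.bondOrders);
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    AtomTypeTable() {
        // Illustrative entries only; a real list is far longer.
        entries.add(new Entry("C.sp3", "C", "sp3", 0, 0, 1, 1, 1, 1));
        entries.add(new Entry("C.sp2", "C", "sp2", 0, 0, 1, 1, 2));
        entries.add(new Entry("O.sp2", "O", "sp2", 0, 2, 2));
    }

    /** Returns the name of the first matching entry, or null if none matches. */
    String findType(String symbol, int formalCharge, int... bondOrders) {
        int[] sorted = bondOrders.clone();
        Arrays.sort(sorted);
        for (Entry e : entries) {
            if (e.symbol.equals(symbol)
                    && e.formalCharge == formalCharge
                    && Arrays.equals(e.bondOrders, sorted)) {
                return e.name;
            }
        }
        return null;
    }
}
```

Because the list is data rather than logic spread over several algorithms, adding a new charged or radical species in this sketch amounts to adding one entry, and the lookup can be tested in isolation.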
A reference implementation, \texttt{CDKAtomTypeMatcher}, has been
written that perceives these atom types and automatically validates the
perception against the properties defined by the
ontology. This class handles a variety of types of missing
information, as commonly resulting from various (file) formats; for
example, it can handle undefined hydrogen counts and undefined
double bond positions if hybridization information is provided
instead. That makes the perception code flexible but also more
complex. Alternative algorithms for atom typing have not been
explored. This reference implementation can be used on a single
atom:
\vspace{0.2cm}
\begin{verbatim}
CDKAtomTypeMatcher matcher =
  CDKAtomTypeMatcher.getInstance(molecule.getBuilder());
for (IAtom atom : molecule.atoms()) {
  IAtomType type = matcher.findMatchingAtomType(molecule, atom);
}
\end{verbatim}
\vspace{0.2cm}
It can also be used on a full molecule, in which case the returned types are in the same
order as the atoms in the molecule object:
\vspace{0.2cm}
\begin{verbatim}
IAtomType[] types = matcher.findMatchingAtomTypes(molecule);
\end{verbatim}
\vspace{0.2cm}
\subsubsection*{Stereochemistry}
Previous versions of the API represented stereochemistry in different ways. This hindered
interconversion between and within file formats. CDK \cdkversion{} standardizes
upon a new core representation and procedures have been updated or added to
enable duplicate checking, pattern matching, and interconversion.
The preferred representation of stereochemistry is now for it to be stored at the molecule
level as a \texttt{\textbf{StereoElement}}. In abstract terms a stereo element describes local
geometry using a type, focus, carriers, and configuration (Figure~\ref{fig:stereodatastructure}).
Currently the most common types of stereochemistry are supported: Tetrahedral, Cis-trans isomerism
around a double bond, and Extended Tetrahedral. Rarer types of stereochemistry, such as Square
Planar, Trigonal Bipyramidal, and Octahedral, could easily be incorporated into the chosen description
given sufficient demand from the community.
Along with the new stereochemistry representation, algorithms were required in several areas. Generally,
a user does not need to invoke these procedures explicitly as they are called as needed within existing
APIs:
\vspace{0.2cm}
\begin{itemize}
\item perception from 2D coordinates,
\item perception from 3D coordinates,
\item wedge assignment,
\item graph (sub)isomorphism matching,
\item SMARTS matching, and
\item canonicalization.
\end{itemize}
\vspace{0.2cm}
The perception from coordinates and wedge assignment algorithms are fundamental for conversion
between formats that store stereochemistry implicitly based on coordinates (e.g. molfile\footnote{Molfiles
can also store tetrahedral stereochemistry as a parity value; this is read if no coordinates are specified.
In general there is no guarantee the parity value is read and the only portable way to store
stereochemistry in a molfile is with coordinates.}, CML) and
explicitly (e.g. SMILES, CML, InChI). Perception from 2D coordinates
can optionally identify
perspective projections, specifically Fischer, Haworth, and Chair projections. With the perception of
perspective projections enabled, database entries currently considered distinct can be merged
(Figure~\ref{fig:stereoprojections}).
% ChEMBL 21 ~2281 structures (~11454 TH center) drawn in perspective
Pattern matching of stereochemistry with the described representation is straightforward. Given
the atom-atom mapping from a query structure to a target molecule, the focus and carriers of
the query stereochemistry are mapped to the target. Using the permutation parity of this mapping,
the configurations are compared. SMARTS matching requires some special handling for complex
cases~\cite{May2014_SMARTS}. For canonicalization, a partial canonical ordering is used to assign an
absolute label which can then be integrated into the ordering. The algorithms used
for stereochemistry are thoroughly detailed in Chapter~6 of~\cite{May2015}. The perception
from projections is based on an algorithm briefly described by~\cite{Karapetyan2015}.
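
The parity-based comparison described above can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the CDK implementation: the class \texttt{StereoParity} and its method names are hypothetical, atoms are identified by integer indices, carriers are stored as arrays, and a configuration is encoded as $+1$ or $-1$.

```java
/** Hypothetical sketch of comparing tetrahedral configurations; not the CDK API. */
final class StereoParity {

    /** Returns +1 for an even permutation of 0..n-1 and -1 for an odd one. */
    static int parity(int[] perm) {
        int sign = 1;
        for (int i = 0; i < perm.length; i++) {
            for (int j = i + 1; j < perm.length; j++) {
                if (perm[i] > perm[j]) sign = -sign; // each inversion flips the sign
            }
        }
        return sign;
    }

    /**
     * queryCarriers/targetCarriers: atom indices of the carriers, in stored order.
     * queryConfig/targetConfig: +1 or -1 (for example clockwise / anticlockwise).
     * mapping[q] gives the target atom index matched to query atom q.
     */
    static boolean sameConfiguration(int[] queryCarriers, int queryConfig,
                                     int[] targetCarriers, int targetConfig,
                                     int[] mapping) {
        int n = queryCarriers.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            int mapped = mapping[queryCarriers[i]];
            perm[i] = indexOf(targetCarriers, mapped);
            if (perm[i] < 0) return false; // carrier not mapped onto a target carrier
        }
        // An odd reordering of the carriers inverts the apparent configuration.
        return queryConfig * parity(perm) == targetConfig;
    }

    private static int indexOf(int[] values, int value) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == value) return i;
        }
        return -1;
    }
}
```

In this sketch, swapping two carriers in the stored order of the target while keeping its configuration value makes the match fail, whereas swapping two carriers and also inverting the configuration value describes the same geometry and matches.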
\subsubsection*{Atomic and Molecular Signatures}
An implementation of the Signature structure descriptor for
molecules has been provided~\cite{Faulon2003}. Signatures act as a linear notation, like the SMILES format,
for the whole molecule as well as for connected substructures rooted at a single atom. The
descriptor can also be canonicalized to provide isomorphism-independent
representations~\cite{Faulon2004}. Signatures of depth two can be calculated
for atoms with:
\vspace{0.2cm}
\begin{verbatim}
for (IAtom atom : molecule.atoms()) {
String signature = new AtomSignature(atom, 2, molecule).toCanonicalString();
}
\end{verbatim}
\vspace{0.2cm}
They can also be calculated for full molecules:
\vspace{0.2cm}
\begin{verbatim}
String signature = new MoleculeSignature(molecule).toCanonicalString();
\end{verbatim}
\vspace{0.2cm}
Finally, a signature fingerprint can be calculated for molecules, to allow
similarity calculations. This can then be used in QSAR modeling~\cite{Alvarsson2016,
signaturefingerprints,Spjuth2011DS,Moghadam2015,Alvarsson2014,Spjuth2012OS,Norinder2013}.
\subsubsection*{Rendering API}
A new rendering API has been introduced to make the rendering code independent
from Java widget toolkits. The previous code was tightly linked to the Swing
toolkit, but other tools use different widget toolkits. For example, Bioclipse
is based on Eclipse which uses the Standard Widget Toolkit (SWT)~\cite{spjuth2007bioclipse}.
A second new design goal was introduced to balance between size restrictions
of some use cases, such as Java applets, and the rendering functionality. In
particular, some functionality, even after modularization, needed considerable
parts of the CDK library, making creation of a small-sized applet unfeasible.
Therefore, the rendering API was modularized to allow splitting up rendering
functionality into modules, with varying CDK dependencies.
Rendering is split up into several generation steps. Previous versions rendered
bonds and atoms separately: heteroatom symbols were simply drawn over the lines
representing bonds, using a white rectangle as a mask. A new \texttt{StandardGenerator}
has been introduced that does bond and atom rendering at the same time. It incorporates
many ideas described by Alex Clark~\cite{Clark10,Clark13}. The depictions generated are of much
higher quality and suitable for publication.
% DESCRIBE THE CORE APIs
Moreover, a simplified high-level API has been introduced that addresses most of the
common rendering needs, with the DepictionGenerator class. To depict a molecule
loaded into a variable `benzene' the following code can be used:
\vspace{0.2cm}
\begin{verbatim}
IAtomContainer benzene = ...;
new DepictionGenerator()
.withSize(300, 300)
.depict(benzene)
.writeTo("benzene.png");
\end{verbatim}
\vspace{0.2cm}
Many of the rendering options are available as parameters in the
core API and as methods on the DepictionGenerator class. This
includes substructure coloring, exemplified with an example reaction
shown in Figure~\ref{fig:depiction}.
When missing, 2D coordinates are generated on the fly with the new structure
diagram layout functionality.
\subsubsection*{Structure Diagram Layout}
The structure diagram layout has been improved and the new code solves a
number of long standing issues. In particular, collision avoidance has been
greatly improved. Figure~\ref{fig:sdg} shows the difference in output
between the old code base, with and without overlap resolution, and the
new refinement-based implementation~\cite{Helson07}. Generation of 2D coordinates is done as shown below:
\vspace{0.2cm}
\begin{verbatim}
IAtomContainer mol = ...;
StructureDiagramGenerator sdg = new StructureDiagramGenerator();
sdg.generateCoordinates(mol);
\end{verbatim}
\vspace{0.2cm}
While the API itself has not been significantly changed, the
internals have been revamped. In addition to improved overlap
resolution noted above, the engine appropriately handles large ring
systems, maintains input stereochemistry, and makes use of a large
template library. Templates are useful for laying out substructures.
While previous CDK versions partially supported
double bond stereochemistry, the new engine is more efficient in
using this information when generating 2D layouts. Furthermore, the
engine assigns wedge bond information based on tetrahedral
ste\-reo\-chem\-is\-try. These features are exemplified by the following
code and the resulting layout depiction in
Figure~\ref{fig:sdgstereo}:
\vspace{0.2cm}
\begin{verbatim}
String smi = "C[C@@H](\\N=C(/S)\\Nc1ccccn1)C2CCCCC2 CHEMBL305150";
IAtomContainer mol = parser.parseSmiles(smi);
sdg.generateCoordinates(mol);
\end{verbatim}
\vspace{0.2cm}
\subsubsection*{Molecular Formula}
A chemical formula is the simplest chemical representation of a
compound. It defines the number of isotopes or elements that
compose a compound without describing how atoms
are bonded. With the rise of metabolomics it has become increasingly
relevant to have full support for these in cheminformatics
libraries~\cite{Wolf2010,RojasCherto2011,Pluskal2012,Pluskal2010,Duhrkop2015}.
The CDK interfaces can handle several concepts related to chemical formulas:
the formula itself, sets of formulas, chemical formula ranges, adducts, isotope
containers and patterns, and rules to filter formula sets. These new tools can
be used for a number of tasks, including calculating the isotopic pattern from
a given chemical formula, determining the possible elemental compositions for a
given mass (mass decomposition), and calculating the exact mass from a given
chemical formula.
The CDK contains two algorithms for the decomposition of mass ranges into
possible elemental formulas. For most inputs, a Round Robin algorithm,
originally developed for the SIRIUS metabolite identification
tool~\cite{Bocker2009}, is used. The algorithm discretizes the real-value mass
decomposition problem into an integer-value knapsack
problem~\cite{Martello1990}. It first computes a dynamic programming table and
then backtracks within it to generate matching formulas~\cite{Duehrkop2013,
Boecker2008}. Data for the Round Robin algorithm is stored in an extended
residue table~\cite{Bocker2005}, resulting in a low memory footprint of several
kilobytes. For certain problem instances, such as very large mass values (above
$400\,000$ Da) or mass ranges spanning more than $1$ Da, the Round Robin algorithm
is not suitable and the CDK falls back to an optimized full enumeration search
method, originally developed as part of the MZmine~2 framework for mass
spectrometry data processing~\cite{Pluskal2012, Pluskal2010}.
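
To illustrate what mass decomposition computes, the following self-contained Java sketch enumerates element counts within given bounds and keeps the combinations whose mass falls inside a window. It is a deliberately naive enumeration with a single pruning rule, and is neither the Round Robin algorithm nor the optimized full enumeration search used in the CDK. The class \texttt{FormulaEnumerator} and its methods are hypothetical, and the monoisotopic masses are supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive mass decomposition by enumeration; a sketch, not the CDK implementation. */
final class FormulaEnumerator {

    private final String[] symbols;
    private final double[] masses;
    private final int[] minCounts;
    private final int[] maxCounts;

    FormulaEnumerator(String[] symbols, double[] masses,
                      int[] minCounts, int[] maxCounts) {
        this.symbols = symbols;
        this.masses = masses;
        this.minCounts = minCounts;
        this.maxCounts = maxCounts;
    }

    /** Returns all formulas whose mass lies in [minMass, maxMass]. */
    List<String> decompose(double minMass, double maxMass) {
        List<String> hits = new ArrayList<>();
        search(0, 0.0, new int[symbols.length], minMass, maxMass, hits);
        return hits;
    }

    private void search(int element, double mass, int[] counts,
                        double minMass, double maxMass, List<String> hits) {
        if (mass > maxMass) return; // prune: adding atoms only increases the mass
        if (element == symbols.length) {
            if (mass >= minMass) hits.add(format(counts));
            return;
        }
        for (int n = minCounts[element]; n <= maxCounts[element]; n++) {
            counts[element] = n;
            search(element + 1, mass + n * masses[element],
                   counts, minMass, maxMass, hits);
        }
        counts[element] = 0;
    }

    /** Writes the counts as a formula string in the order the symbols were given. */
    private String format(int[] counts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < symbols.length; i++) {
            if (counts[i] == 0) continue;
            sb.append(symbols[i]);
            if (counts[i] > 1) sb.append(counts[i]);
        }
        return sb.toString();
    }
}
```

The cost of this approach grows with the product of the allowed count ranges, which is why a dynamic programming formulation such as Round Robin is preferable for narrow mass windows.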
The following code calculates all possible chemical formulas for a given
accurate mass, within allowed counts for each element:
\vspace{0.2cm}
\begin{verbatim}
Isotopes ifac = Isotopes.getInstance();
MolecularFormulaRange range =
new MolecularFormulaRange();
range.addIsotope( ifac.getMajorIsotope("C"), 8, 20);
range.addIsotope( ifac.getMajorIsotope("H"), 0, 20);
range.addIsotope( ifac.getMajorIsotope("O"), 0, 1);
range.addIsotope( ifac.getMajorIsotope("N"), 0, 1);
MolecularFormulaGenerator tool =
new MolecularFormulaGenerator(
SilentChemObjectBuilder.getInstance(),
133.0, 133.1, range
);
IMolecularFormulaSet mfSet = tool.getAllFormulas();
for (IMolecularFormula mf : mfSet.molecularFormulas()) {
  System.out.println(
    MolecularFormulaManipulator.getString(mf) + " " +
    MolecularFormulaManipulator.getTotalExactMass(mf));
}
\end{verbatim}
\vspace{0.2cm}
This gives the following output:
\vspace{0.2cm}
\begin{verbatim}
C11H 133.007825032
C9H11N 133.089149352
C9H9O 133.065339908
C8H7NO 133.052763844
\end{verbatim}
\vspace{0.2cm}
To evaluate the performance of the CDK molecular formula generator, we
compared its runtimes to those of the classic, full enumeration-based HR2
formula generator~\cite{Kind2007} and those of a recently developed Parallel
Formula Generator (PFG)~\cite{Zhang2016} (Table~\ref{tab:formula_generators}).
As inputs, we used two sets of $10\,000$ small ($< 500$~Da) and $20$ large ($>
1\,500$ and $< 3\,500$~Da) molecular mass values downloaded from the Global
Natural Products Social Molecular Networking database~\cite{wang2016}. The mass
tolerance was set to 0.001 or 0.01 Da. The Round Robin
formula generator of CDK \cdkversion{} outperformed the other methods in all cases, despite running
in a single thread (PFG utilizes multiple threads). The performance gain of
the Round Robin algorithm was particularly apparent when narrow mass ranges
were queried (e.g. $\pm 0.001$ Da), thus showing its suitability for
applications in high-resolution mass spectrometry.
\subsubsection*{SMILES parser and generator}
The SMILES~\cite{Weininger1988} parsing has been replaced by code from the external Beam project~\cite{Beam}.
This BSD-licensed SMILES parser, written by one of the authors, is a complete
implementation of the SMILES and OpenSMILES (\url{http://opensmiles.org/})
specifications (including stereochemistry), and is independent of
the CDK library. The SmilesParser API uses this library underneath, and the
Beam API is hidden by this class. Basic usage is as follows:
\vspace{0.2cm}
\begin{verbatim}
IChemObjectBuilder bldr
  = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);
IAtomContainer mol = smipar.parseSmiles("[nH]1cccc1");
\end{verbatim}
\vspace{0.2cm}
The most significant functional change here is that the SMILES parser
automatically locates the positions of double bonds in de-localised aromatic
systems (Kekulisation). If this invariant cannot be met the SMILES is rejected
as invalid. It is possible to override this check but this is strongly
discouraged as rejected molecules do not have a fixed formula or tautomer~\cite{May2015}.
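
Kekulisation can be viewed as a matching problem: every aromatic atom that must carry a double bond has to be paired with exactly one such neighbour. The following Java sketch illustrates this with a simple backtracking search. It is not the Beam or CDK code; the class \texttt{KekuleSketch}, the adjacency-list input, and the \texttt{needsDoubleBond} flags are simplifying assumptions made for illustration.

```java
import java.util.Arrays;

/** Sketch of Kekulisation as a perfect-matching search; not the Beam or CDK code. */
final class KekuleSketch {

    /**
     * adjacency[i] lists the aromatic neighbours of atom i.
     * needsDoubleBond[i] is true if atom i must receive exactly one double bond.
     * Returns partner[i] (or -1 for atoms without a double bond), or null if
     * no valid assignment exists.
     */
    static int[] kekulise(int[][] adjacency, boolean[] needsDoubleBond) {
        int[] partner = new int[adjacency.length];
        Arrays.fill(partner, -1);
        return assign(0, adjacency, needsDoubleBond, partner) ? partner : null;
    }

    private static boolean assign(int atom, int[][] adjacency,
                                  boolean[] needsDoubleBond, int[] partner) {
        // Skip atoms that are already matched or do not need a double bond.
        while (atom < adjacency.length
                && (!needsDoubleBond[atom] || partner[atom] >= 0)) {
            atom++;
        }
        if (atom == adjacency.length) return true; // every atom is satisfied
        for (int nbr : adjacency[atom]) {
            if (needsDoubleBond[nbr] && partner[nbr] < 0) {
                partner[atom] = nbr;
                partner[nbr] = atom;
                if (assign(atom + 1, adjacency, needsDoubleBond, partner)) {
                    return true;
                }
                partner[atom] = -1; // backtrack and try the next neighbour
                partner[nbr] = -1;
            }
        }
        return false;
    }
}
```

Under this simplified model, the five-membered ring of \texttt{[nH]1cccc1} can be assigned because the nitrogen carries a hydrogen and needs no double bond, leaving four carbons to pair up. If all five ring atoms required a double bond, no pairing would exist and the input would be rejected.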
The SMILES generation API has also been simplified and made more flexible, and is able
to produce several different flavours. The \texttt{SmiFlavor} flags are used
to control the type of SMILES generated. The terms generic,
isomeric, unique, and absolute have historically been used in other toolkits and are also
supported.
\vspace{0.2cm}
\begin{verbatim}
SmilesGenerator smigen
    = new SmilesGenerator(SmiFlavor.Canonical |
                          SmiFlavor.StereoTetrahedral);
String smi = smigen.create(mol); // mol: an IAtomContainer
\end{verbatim}
\vspace{0.2cm}
Support for ChemAxon Extended SMILES (CXSMILES)~\cite{CXSMILES} layers has been added
to CDK \cdkversion{}. CXSMILES provides a powerful means of including auxiliary
information in a SMILES string, such as 2D/3D coordinates, atom values, generic labels,
repeat units, and positional variation. This is achieved by placing the additional
information between pipe characters (`\texttt{|}') in the SMILES title field. The information
is associated with atoms based on their order in the SMILES string. An example CXSMILES
for a generic structure is shown below.
\vspace{0.2cm}
\small
\begin{verbatim}
c1(:*:c2c(:*:c1*)C(N(N2)*)=O)* |$;Y;;;X;;R10;;;;Z;;R11$| US 2007/0129374 (I)
<- SMILES -------------------> <- CXSMILES ------------> <- Title --------->
\end{verbatim}
\normalsize
\vspace{0.2cm}
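As a rough illustration of how such a layer can be read, the following Java
sketch extracts the atom-label layer (the part delimited by \texttt{\$}
characters) and maps each non-empty label to its atom index. The class
\texttt{CxAtomLabels} is hypothetical and is not part of the CDK; it assumes
that the label layer is the first layer after the opening pipe, and it ignores
escaping and all other layers.
\vspace{0.2cm}
\begin{verbatim}
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the CDK.
final class CxAtomLabels {
    // Maps atom index -> label for a leading "$...$" layer;
    // returns an empty map if no such layer is found.
    static Map<Integer, String> parse(String line) {
        Map<Integer, String> labels = new LinkedHashMap<>();
        int open = line.indexOf("|$");
        int close = open < 0 ? -1 : line.indexOf('$', open + 2);
        if (close < 0) return labels;
        // limit -1 keeps trailing empty fields
        String[] fields =
            line.substring(open + 2, close).split(";", -1);
        for (int i = 0; i < fields.length; i++)
            if (!fields[i].isEmpty()) labels.put(i, fields[i]);
        return labels;
    }
}
\end{verbatim}
\vspace{0.2cm}
For the generic structure shown above, this yields the labels Y, X, R10, Z,
and R11 for the atoms at (zero-based) indices 1, 4, 6, 10, and 12.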
\subsubsection*{Substructure and SMARTS matching}
Substructure matching is a fundamental cheminformatics operation and
plays a key role in many other functions such as fingerprint and
descriptor generation, and atom typing. Since CDK v1.2, functionality
has been added to handle the SMARTS query language. The SMARTS
language is well supported, including features such as
stereochemistry, component grouping, and atom
maps (to match reaction transformations). A new \textit{Pattern} API
has been added to CDK \cdkversion{}, which simplifies finding,
filtering, and transforming search results. Patterns are immutable, allowing a
pattern to be initialized once and then matched against several
molecules or reactions across multiple threads. During initialization
the pattern is inspected so as to determine what invariants will be
needed (e.g. ring size) and only required invariants are calculated. The
internal matching algorithms provide a lazy iterator, such that the
next match is only computed when it is needed. The API handles
reactions in addition to molecules, and both can be specified as
either queries or targets.
\vspace{0.2cm}
\begin{verbatim}
// initialize SMARTS pattern API
// (bldr: an IChemObjectBuilder instance)
Pattern pattern =
    SmartsPattern.create("O=[C,N]aa[N,O;!H0]", bldr);
IAtomContainer mol = ...;
IReaction rxn = ...;
// check if the query matches, molecules and reactions
boolean mMatch = pattern.matches(mol);
boolean rMatch = pattern.matches(rxn);
// lazily iterate over all unique atom matches as query atom
// index bijection to atoms in 'mol', 'rxn' can also be used
for (int[] m : pattern.matchAll(mol)
.uniqueAtoms()) {
...
}
\end{verbatim}
\vspace{0.2cm}
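The lazy evaluation described above can be sketched independently of the CDK.
In the following minimal Java example, the hypothetical class
\texttt{LazyMatches} (not a CDK class) tests candidates only when the consumer
asks for the next match. Integer candidates stand in for atom mappings, and the
\texttt{evaluated} counter exists only to make the laziness observable.
\vspace{0.2cm}
\begin{verbatim}
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;

// Hypothetical illustration, not a CDK class: lazily yields
// the candidates in [0, limit) that are accepted by 'test'.
final class LazyMatches implements Iterable<Integer> {
    private final int limit;
    private final IntPredicate test;
    int evaluated = 0; // candidates examined so far

    LazyMatches(int limit, IntPredicate test) {
        this.limit = limit;
        this.test = test;
    }

    @Override public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int pos = 0;           // next candidate
            private boolean ready = false; // pos is a match

            @Override public boolean hasNext() {
                // work is done only when a match is requested
                while (!ready && pos < limit) {
                    evaluated++;
                    if (test.test(pos)) ready = true;
                    else pos++;
                }
                return ready;
            }

            @Override public Integer next() {
                if (!hasNext())
                    throw new NoSuchElementException();
                ready = false;
                return pos++;
            }
        };
    }
}
\end{verbatim}
\vspace{0.2cm}
Requesting only the first match of
\texttt{new LazyMatches(1000, i -> i \% 7 == 3)} examines four candidates
rather than all 1000.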
CDK \cdkversion{} includes substantial improvements in algorithm
efficiency. This is illustrated by the systematic benchmark
of MACCS-like 166 key generation (Table~5). The efficiency improvements
result from optimising both data structures and the key molecule processing
algorithms (e.g. kekulisation and aromaticity) needed before a SMARTS
match can be run~\cite{MayBlog2013a,MayBlog2013b,May2015}.
\subsubsection*{Ring finding}
Ring finding is another key function of a cheminformatics library, and
the CDK has a long history of supporting it~\cite{Berger2004,May2014}. Non-redundant
ring sets have received particular attention,
such as the smallest set of smallest rings (SSSR), for which the CDK
implements two classical algorithms~\cite{Figueras1996,Berger2004}.
Recent work has implemented a new, faster algorithm that allows
searching for various types of (non-redundant) ring
sets~\cite{May2014}. These are available via the new \texttt{Cycles} API:
\vspace{0.2cm}
\begin{verbatim}
// Cycles.all throws Intractable if there are too many cycles
Cycles allCycles       = Cycles.all(mol);
Cycles relevantCycles  = Cycles.relevant(mol);
Cycles essentialCycles = Cycles.essential(mol);
Cycles sssrCycles      = Cycles.sssr(mol);
\end{verbatim}
\vspace{0.2cm}
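The number of rings in an SSSR (or in any minimum cycle basis) is fixed by the
molecular graph: it equals the number of bonds minus the number of atoms plus
the number of connected components. A minimal Java sketch of this count
follows; the helper \texttt{RingCount} is hypothetical, is not part of the CDK,
and assumes that bonds are given as pairs of atom indices.
\vspace{0.2cm}
\begin{verbatim}
// Hypothetical helper, not part of the CDK.
final class RingCount {
    // bonds: pairs {u, v} of atom indices in [0, nAtoms)
    static int sssrSize(int nAtoms, int[][] bonds) {
        int[] parent = new int[nAtoms];
        for (int i = 0; i < nAtoms; i++) parent[i] = i;
        int components = nAtoms;
        // union-find: each bond joining two components
        // reduces the component count by one
        for (int[] b : bonds) {
            int u = find(parent, b[0]);
            int v = find(parent, b[1]);
            if (u != v) { parent[u] = v; components--; }
        }
        return bonds.length - nAtoms + components;
    }

    private static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }
}
\end{verbatim}
\vspace{0.2cm}
For naphthalene (10 atoms, 11 bonds, one component) this gives 2.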
\subsubsection*{Aromaticity}
Aromaticity has seen many definitions in the past, and in
cheminformatics it is frequently defined algorithmically. The outcome
of an aromaticity calculation depends on a number of atom type
features and heuristics, which are often ambiguously defined in the
published literature. Depending on the information used, several different
algorithmic definitions of aromaticity can be formulated. Older CDK
versions had various aromaticity models implemented but the code was scattered
throughout the library, resulting in an inconsistent API
to compute aromaticity and a significant maintenance
burden. The API was unified in the current version, resulting in three
models, of which two are based on the CDK atom typer. The difference
between these two models is how contributions from exocyclic double
bonds are handled.
The current CDK version further generalizes the idea that aromaticity
is a model, and provides an API that allows the user to select one of
several aromaticity models, leading to greater interoperability with
other toolkits. The new \texttt{Aromaticity} class allows a
custom model to be built by selecting and combining options. For example, to
reproduce the functionality of the previous \texttt{CDKAromaticity} class:
\vspace{0.2cm}
\begin{verbatim}
Aromaticity aromaticity = new Aromaticity(
ElectronDonation.cdk(), Cycles.cdkAromaticSet()
);
\end{verbatim}
\vspace{0.2cm}
Here, the CDK model for counting donated electrons is used, along with
the ring set identified by the algorithm of
previous versions, which was limited in the number of fused ring
systems it considered. An alternative aromaticity
calculator that considers all possible rings can now be
easily created with:
\vspace{0.2cm}
\begin{verbatim}
Aromaticity aromaticity = new Aromaticity(
ElectronDonation.cdk(), Cycles.all()
);
\end{verbatim}
\vspace{0.2cm}
For SMARTS matching and SMILES generation, a model based
on Daylight~\cite{DaylightCIS} can be used; it offers significant
speed improvements over the one based on CDK atom types.
This model has recently been documented as part of the
OpenSMILES specification (\url{http://opensmiles.org/}):
\vspace{0.2cm}
\begin{verbatim}
Aromaticity aromaticity = new Aromaticity(
ElectronDonation.daylight(), Cycles.all()
);
\end{verbatim}
\vspace{0.2cm}
The aromaticity algorithm is straightforward: the potential electron donation
is calculated for each atom as $-1$ (not aromatic), 0, 1, or 2. The set of cycles
specified in the constructor is then generated, and each cycle is checked against H{\"u}ckel's
rule ($4n+2$).
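The check itself can be sketched in a few lines of Java. The class
\texttt{HueckelCheck} is hypothetical and is not the CDK implementation; it
assumes that the per-atom donation values have already been assigned by an
electron donation model and that each cycle is given as an array of atom
indices.
\vspace{0.2cm}
\begin{verbatim}
// Hypothetical sketch, not the CDK implementation.
final class HueckelCheck {
    // donation[i]: electrons atom i can contribute,
    //              or -1 if it cannot be aromatic
    // cycle: indices of the atoms forming one ring
    static boolean isAromatic(int[] donation, int[] cycle) {
        int sum = 0;
        for (int atom : cycle) {
            if (donation[atom] < 0) return false;
            sum += donation[atom];
        }
        // 4n + 2 electrons for some integer n >= 0
        return sum >= 2 && (sum - 2) % 4 == 0;
    }
}
\end{verbatim}
\vspace{0.2cm}
With one electron per carbon, the six-membered ring of benzene sums to 6 and
passes the check, whereas cyclobutadiene sums to 4 and fails.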
\subsubsection*{CTfile Format Improvements}
The molfile format is still very popular and, despite being a proprietary
format, has become a de facto standard. The format forms the core of the larger
CTfile family which was originally developed by MDL Information Systems~\cite{Dalby92}. The
current format specification is published by BIOVIA and available on
request~\cite{ctfilespec}.
The CTAB block (connection table) of a molfile comes in two versions, V2000
and V3000. V3000 provides several enhancements, including but not
limited to the removal of atom and bond count limits, enhanced stereochemistry,
and link nodes. For backwards compatibility, V2000 is often preferred, resulting
in limited usage of V3000.
CDK \cdkversion{} adds support for V3000 and has optimized and extended
support for V2000. Currently, these are treated as separate formats, requiring
the user to know beforehand which version is being read. Future APIs will aim
to simplify this and provide a unified reader. An overview of currently
supported CTfile formats is given in Table~2\ref{tab:ctfileFormats}.
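Until a unified reader is available, the version can be determined before a
reader is chosen by inspecting the counts line (the fourth line of a molfile),
which normally carries a \texttt{V2000} or \texttt{V3000} stamp. A minimal Java
sketch follows, in which \texttt{CtabVersion} is a hypothetical helper rather
than a CDK class; files without a stamp are reported as unknown and left to
the caller.
\vspace{0.2cm}
\begin{verbatim}
// Hypothetical helper, not part of the CDK.
final class CtabVersion {
    // Returns "V2000", "V3000", or "unknown" for the text
    // of a single molfile.
    static String sniff(String molfile) {
        String[] lines = molfile.split("\r?\n", -1);
        if (lines.length < 4) return "unknown";
        // three header lines precede the counts line
        String counts = lines[3];
        if (counts.contains("V3000")) return "V3000";
        if (counts.contains("V2000")) return "V2000";
        return "unknown";
    }
}
\end{verbatim}
\vspace{0.2cm}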
CTfile Sgroups capture and organise high-level information about sets of atoms
and bonds~\cite{Gushurst91}. There are four types of Sgroup: Display Short-cuts, Polymers,
Mixtures, and Data. The most familiar Sgroups from an end-user perspective are structure
repeat units (e.g. bracketing) and abbreviations (Figure~\ref{fig:sgroups}).
CDK~\cdkversion{} adds support for the representation, reading, writing, and depiction of Sgroups.
% note, a pie chart would be cool, showing the various relative amounts of the
% various formats.
\subsubsection*{New Object Builders}
Originally, the CDK was developed as a shared library between
JChemPaint~\cite{krause2000jchempaint} and
Jmol~\cite{Willighagen2007jmol,Hanson2010}. JChemPaint used an MVC
(model-view-controller) approach with an event-passing mechanism to update the view when the
model was changed. This can cause a cascade of change events being
passed around, which is not always desirable, especially for
non-UI code. To address this, interfaces were introduced that allow
multiple implementations of the core data classes. Because much of the CDK
library no longer depends on the original data model implementation, a builder is needed to
create data model objects, such as an implementation of \texttt{IAtom}.
The new \texttt{IChemObjectBuilder}s allow
implementations of the interfaces to be instantiated without
explicitly referencing those implementations. This way, any algorithm
implementation in the CDK can use any of the data model interface
implementations.
The CDK v1.0 and v1.2 implementations of the \texttt{IChemObjectBuilder} had,
however, one method for each data object constructor, resulting in a very large
interface. Moreover, this interface API had to be updated each time a new class
was introduced, and when existing methods changed and constructors were updated.
To simplify the API, the new \texttt{IChemObjectBuilder} collapses all methods
into a single method, which takes as a first parameter the class of the
interface that is to be constructed. All further parameters are passed as
parameters to the class constructor.
For example, to construct a new atom from its element symbol, one
would previously write:
\vspace{0.2cm}
\begin{verbatim}
IChemObjectBuilder builder = ...;
IAtom atom = builder.newAtom("C");
\end{verbatim}
\vspace{0.2cm}
With the new builder, the code looks like:
\vspace{0.2cm}
\begin{verbatim}
IChemObjectBuilder builder = ...;
IAtom atom = builder.newInstance(IAtom.class, "C");
\end{verbatim}
\vspace{0.2cm}
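One way such a single generic method can be realised is by looking up a
registered implementation class and selecting a constructor by reflection.
The following Java sketch illustrates the idea under stated assumptions: all
types in it (\texttt{Atom}, \texttt{SimpleAtom}, \texttt{SketchBuilder}) are
hypothetical, it is not the mechanism used by the CDK builders, and it does not
handle primitive or \texttt{null} parameters.
\vspace{0.2cm}
\begin{verbatim}
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// All types here are hypothetical, not CDK classes.
interface Atom { String symbol(); }

final class SimpleAtom implements Atom {
    private final String symbol;
    public SimpleAtom(String symbol) { this.symbol = symbol; }
    public String symbol() { return symbol; }
}

final class SketchBuilder {
    // interface -> implementation class
    private final Map<Class<?>, Class<?>> impls =
        new HashMap<>();

    <T> void register(Class<T> type,
                      Class<? extends T> impl) {
        impls.put(type, impl);
    }

    // Creates an instance of 'type' using the first public
    // constructor whose parameters accept 'params'.
    <T> T newInstance(Class<T> type, Object... params) {
        Class<?> impl = impls.get(type);
        if (impl == null)
            throw new IllegalArgumentException("unregistered");
        for (Constructor<?> c : impl.getConstructors()) {
            if (!accepts(c.getParameterTypes(), params))
                continue;
            try {
                return type.cast(c.newInstance(params));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalArgumentException("no constructor");
    }

    private static boolean accepts(Class<?>[] types,
                                   Object[] params) {
        if (types.length != params.length) return false;
        for (int i = 0; i < types.length; i++)
            if (!types[i].isInstance(params[i])) return false;
        return true;
    }
}
\end{verbatim}
\vspace{0.2cm}
With this design, adding a new data class requires only a new registration,
not a new method on the builder interface.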
The CDK library is now mostly refactored and no longer depends on a specific
implementation of the \texttt{IChemObjectBuilder}, allowing the user of the
CDK to select a builder suitable for their software. If software
depends on event passing, the \texttt{DefaultChemObjectBuilder} can be
used. In most cases this is not needed and the \texttt{SilentChemObjectBuilder}
is preferred, resulting in a typical speed-up of 10\% to 20\%:
\vspace{0.2cm}
\begin{verbatim}
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
\end{verbatim}
\vspace{0.2cm}
The third builder is the \texttt{DebugChemObjectBuilder}, which generates debug
information for all changes to the content of the data classes. This
can be useful for debugging and other forms of code inspection.
\subsubsection*{Molecular Fingerprints}
Molecular fingerprints have also seen significant development in this CDK version.
Previously, fingerprints were represented using the \texttt{BitSet} class from
the Java standard library. While this class
allowed the use of pre-existing methods to manipulate bit strings, it
keeps the entire bit vector in memory. This solution was excellent for
relatively small hashed fingerprints, \textit{e.g.} 1024 bits,
\textit{i.e.} a $2^{10}$ indexing space (128 B). However, implementing a
fingerprint designed to avoid collisions with a $2^{32}$ bit
indexing space using this approach would be memory-inefficient (512 MiB).
To allow for multiple fingerprint representations, a bit
fingerprint interface was introduced: \texttt{IBitFingerprint}.
\vspace{0.2cm}
\begin{verbatim}
IFingerprinter ecfp = new CircularFingerprinter();
IBitFingerprint fp = ecfp.getBitFingerprint(mol);
\end{verbatim}
\vspace{0.2cm}
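The memory figures quoted above follow from storing one bit per index position,
and a sparse representation avoids this cost by storing only the indices of the
bits that are set. The following Java sketch shows both; the names
\texttt{FingerprintMemory} and \texttt{SparseBits} are hypothetical and are used
for illustration only, not taken from the CDK.
\vspace{0.2cm}
\begin{verbatim}
import java.util.TreeSet;

// Hypothetical illustrations, not CDK classes.
final class FingerprintMemory {
    // bytes needed for a dense vector of 2^log2Bits bits
    static long denseBytes(int log2Bits) {
        return (1L << log2Bits) / 8;
    }
}

final class SparseBits {
    // only the set bits are stored; the int range gives
    // a 2^32 indexing space
    private final TreeSet<Integer> on = new TreeSet<>();

    void set(int index) { on.add(index); }
    boolean get(int index) { return on.contains(index); }
    int cardinality() { return on.size(); }
}
\end{verbatim}
\vspace{0.2cm}
Here \texttt{denseBytes(10)} is 128 and \texttt{denseBytes(32)} is
536\,870\,912 (512 MiB), whereas the sparse form grows only with the number of
set bits.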
Although fingerprints are traditionally bit vectors, a count
fingerprint was also introduced, so that fingerprints based on integer
vectors are now supported in the CDK as well. Each count in the fingerprint
represents how often the corresponding substructure is found in the molecule.
\vspace{0.2cm}
\begin{verbatim}
IFingerprinter ecfp = new CircularFingerprinter(
    CircularFingerprinter.CLASS_ECFP4);
ICountFingerprint fp = ecfp.getCountFingerprint(mol);
\end{verbatim}
\vspace{0.2cm}
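The relationship between count and bit fingerprints can be sketched as follows:
each occurrence of a substructure increments the count stored under its hash,
and the corresponding bit is set whenever that count is non-zero. The class
\texttt{CountSketch} is hypothetical and is not a CDK class; the feature hashes
are assumed to be supplied by a fingerprinter.
\vspace{0.2cm}
\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not a CDK class.
final class CountSketch {
    // feature hash -> number of occurrences
    private final Map<Integer, Integer> counts =
        new HashMap<>();

    // record one occurrence of a hashed substructure
    void observe(int featureHash) {
        counts.merge(featureHash, 1, Integer::sum);
    }

    int count(int featureHash) {
        return counts.getOrDefault(featureHash, 0);
    }

    // the bit fingerprint view: present or absent
    boolean bit(int featureHash) {
        return count(featureHash) > 0;
    }
}
\end{verbatim}
\vspace{0.2cm}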
The fingerprints currently provided by the CDK are listed in Table~3\ref{tab:fingerprints}.
\subsection*{Improved Coding Standards}
As the CDK library grew over the years, so did the complexity of its
maintenance. The main branch frequently failed to compile and bug
fixes became more onerous due to unexpected side effects. Often,
fixing a bug in one part of the code broke other code that had made
incorrect assumptions about the fixed code. With the increased size of
the CDK developer community, such issues were inevitable in the
absence of any formal coding and testing standards.
To address these issues, we have adopted a number of coding
standards. While not a comprehensive implementation of software
engineering best practices, they attempt to find a balance between
increasing code maintainability and being flexible enough to allow
efficient code development. We appreciate the subjective
nature of this statement, and some adopted guidelines have been
heavily discussed and debated in the CDK community.
Arguably the biggest factor in improved code quality is a
peer review process in which any functionality-changing patch must
be reviewed by one independent, senior CDK developer for the
development branch, and by two reviewers for stable branches. This patch
development system is supported by a number of automated validation
steps as outlined below.
The next sections describe some approaches the project have adopted that allows