IUPAC SMILES+ Specification: Appendix 1 - Proposed and Known SMILES Extensions [working draft]

IUPAC SMILES+ Contributors: Vincent F. Scalfani (Chair), Evan Bolton, Helen Cooke, Chris Grulke, John Irwin, Oliver Koepler, Gregory Landrum, José L. Medina-Franco, Miguel Quirós Olozábal, Susan Richardson, and Issaku Yamada.

v0.1,2019-04-15: Working Draft
IUPAC SMILES+ Project No. 2019-002-024
Copyright © 2020, IUPAC
Content is available under GNU Free Documentation License 1.2

This IUPAC SMILES+ Specification Appendix [working draft] document is a modified derivative of the OpenSMILES Specification. We have endeavored to maintain all prior author names, contributor names, copyright notices, and revision history.

OpenSMILES Specification
Craig A. James
v1.0,2016-05-15: Current specification

Copyright © 2007-2016, Craig A. James
Content is available under GNU Free Documentation License 1.2

OpenSMILES Contributors: Richard Apodaca, Noel O’Boyle, Andrew Dalke, John van Drie, Peter Ertl, Geoff Hutchison, Craig A. James, Greg Landrum, Chris Morley, Egon Willighagen, Hans De Winter, Tim Vandermeersch, John May

1. Known Extensions

Coming soon…

2. Proposed Extensions

2.1. External R-Groups

Daylight proposed, and OpenEye implemented, an extension that specifies bonds to external R-groups. An external R-group is written as an ampersand '&' followed by a ring-closure specification (either a single digit, or '%' and two digits). Unlike a ring closure, however, the bond goes to an external, unspecified R-group. Example: n1c(&1)c(&2)cccc1 is a 2,3-substituted pyridine.
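The label syntax described above can be illustrated with a small Python sketch. It is not taken from any toolkit: the function name `external_labels` is hypothetical, and the sketch assumes that '&' appears only as an external-attachment marker and is always followed by a ring-closure style label.

```python
import re

# Assumption: '&' is only used as an external R-group marker, followed by
# a ring-closure style label (one digit, or '%' and two digits).
EXTERNAL_LABEL = re.compile(r"&(%\d{2}|\d)")

def external_labels(smiles):
    """Return the external R-group labels in order of appearance."""
    # Drop the leading '%' so that '&%12' is reported as label '12'.
    return [label.lstrip("%") for label in EXTERNAL_LABEL.findall(smiles)]

# The 2,3-substituted pyridine from the text carries labels 1 and 2.
print(external_labels("n1c(&1)c(&2)cccc1"))
```

A real parser would also have to record which atom each label is attached to; this sketch only extracts the labels themselves.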

2.2. Polymers and Crystals

Daylight (Weininger) proposed, but never implemented, an extension for crystals and polymers. It also uses the ampersand '&' character (which may conflict with the R-group proposal above), with the added rule that a number appearing more than once creates a repeating unit.

SMILES           Name
c1ccccc1C&1&1    polystyrene
C&1&1&1&1        diamond
c&1&1&1          graphite
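The "number appears more than once" rule can be checked mechanically. The following Python sketch is only an illustration under the same syntax assumptions as the proposal text; the helper name `repeating_labels` is hypothetical and the proposal itself was never implemented by Daylight.

```python
import re
from collections import Counter

# Assumption: labels follow ring-closure syntax (one digit, or '%' + two digits).
LABEL = re.compile(r"&(%\d{2}|\d)")

def repeating_labels(smiles):
    """Map each '&' label that occurs more than once to its occurrence count."""
    counts = Counter(LABEL.findall(smiles))
    return {label: n for label, n in counts.items() if n > 1}

for smiles in ("c1ccccc1C&1&1", "C&1&1&1&1", "c&1&1&1"):
    # A non-empty result means the string describes a repeating unit.
    print(smiles, repeating_labels(smiles))
```

Under this reading, a string whose labels each occur exactly once, such as the pyridine R-group example, yields an empty result and would not be treated as a repeating unit.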

2.3. Atom-based Double Bond Configuration

The directional '/' and '\' marks for cis/trans bonds seem simple on the surface but are problematic for complex systems. In conjugated systems, one directional bond can take part in defining the configuration of two double bonds, so when directional bonds are assigned, the existing labels must be considered or rewritten. In a long series of conjugated double bonds, changing the configuration of one bond can require rewriting dozens of bond symbols.

More importantly, there is a theoretical flaw in the use of '/' and '\'. It is possible to write a valid SMILES for cyclooctatetraene by alternating directional assignments for the cis configurations. However, as shown below, it is not possible to change just one configuration. Reassigning the directional labels of adjacent double bonds does not help: the reassignment propagates around the ring and the conflict is never resolved.

Adding directional labels on the bonds to explicit hydrogen atoms is a possible resolution, but it does not follow the standard form and complicates the assignment procedure.

Depiction            SMILES                 Comment
cyclooctatetraene    C/1=C/C=C\C=C/C=C\1    cyclooctatetraene
Todo                 C/1=C/C=C/C=C/C=C\1    one bond changes two configurations
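The conflict can be reproduced by brute force with a simplified model. The Python sketch below assumes a ring of four double bonds alternating with four single bonds, where each single bond carries one directional label in ring-traversal order and a double bond is cis when its two flanking labels differ (as in X/C=C\Y) and trans when they match (as in X/C=C/Y). The function name `ring_configurations` is hypothetical, and hydrogen labels are deliberately left out.

```python
from itertools import product

def ring_configurations(labels):
    """labels[i] is '/' or '\\' on single bond i, in ring-traversal order.

    Double bond i sits between single bond i-1 and single bond i
    (index -1 wraps around to the ring-closure bond).
    """
    return tuple(
        "cis" if labels[i - 1] != labels[i] else "trans"
        for i in range(len(labels))
    )

# Every configuration pattern reachable by some labelling of the 4 single bonds.
reachable = {ring_configurations(l) for l in product("/\\", repeat=4)}

print(ring_configurations(("/", "\\", "/", "\\")))  # alternating labels
print(ring_configurations(("/", "/", "/", "\\")))   # one label flipped
print(any(cfg.count("trans") == 1 for cfg in reachable))
```

In this model the alternating labelling gives four cis bonds, flipping a single label changes two configurations at once, and no labelling produces exactly one trans bond, which is the flaw the table illustrates.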

The proposed syntax for double bond configurations uses the '@' and '@@' atom-based specification. For example:

Depiction               SMILES             Name
trans difluoroethene    F[C@@H]=[C@H]F     trans-difluoroethene
                        F[C@H]=[C@@H]F
cis difluoroethene      F[C@H]=[C@H]F      cis-difluoroethene
                        F[C@@H]=[C@@H]F

Interpretation of '@' and '@@' follows the tetrahedral convention: The atoms, as encountered in the SMILES string, are either in anticlockwise '@' or clockwise '@@' order as viewed on the page. Since cis/trans configurations are planar, they can also be "viewed from underneath the page", which results in the two valid SMILES shown for each compound, above.
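The "viewed from underneath the page" equivalence amounts to swapping every '@' with '@@'. A minimal Python sketch follows; the function name `view_from_underneath` is hypothetical, and the swap is only meaningful for strings whose '@' marks all describe planar double-bond atoms, since the same swap would invert any tetrahedral center.

```python
import re

def view_from_underneath(smiles):
    """Swap '@' and '@@' everywhere, as if the planar molecule were flipped."""
    # '@@?' matches '@@' greedily before falling back to a single '@'.
    return re.sub(r"@@?", lambda m: "@" if m.group() == "@@" else "@@", smiles)

print(view_from_underneath("F[C@@H]=[C@H]F"))  # other trans form
print(view_from_underneath("F[C@H]=[C@H]F"))   # other cis form
```

Applying the function twice returns the original string, consistent with there being exactly two equivalent forms per compound.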

As with the other atom-based specifications, one must consider the relative position of implicit atoms. It is not always true that a trans form has opposite "clock-ness" ('@','@@' or '@@','@') and a cis form has the same "clock-ness" ('@','@' or '@@','@@').

Depiction               SMILES              Name
trans difluoroethene    F[C@@H]=[C@H]F      trans-difluoroethene
                        [C@H](F)=[C@H]F
cis difluoroethene      F[C@H]=[C@H]F       cis-difluoroethene
                        [C@@H](F)=[C@H]F
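The change of mark in the second form of each compound follows from the neighbor order. In F[C@@H]=..., the first carbon sees its neighbors in the order F, implicit H, C; in [C@H](F)=..., the order is implicit H, F, C. The Python sketch below assumes the usual rule that an odd permutation of the neighbor order swaps '@' and '@@'; the helper names are hypothetical.

```python
def permutation_is_odd(old_order, new_order):
    """True if new_order is an odd permutation of old_order."""
    positions = [old_order.index(x) for x in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return inversions % 2 == 1

def reorder_mark(mark, old_order, new_order):
    """Return the '@'/'@@' mark that keeps the same geometry after reordering."""
    if permutation_is_odd(old_order, new_order):
        return "@" if mark == "@@" else "@@"
    return mark

# First carbon of trans-difluoroethene: F[C@@H]=... rewritten as [C?H](F)=...
print(reorder_mark("@@", ["F", "H", "C"], ["H", "F", "C"]))
# First carbon of cis-difluoroethene: F[C@H]=... rewritten as [C?H](F)=...
print(reorder_mark("@", ["F", "H", "C"], ["H", "F", "C"]))
```

Moving the fluorine into a branch swaps it with the implicit hydrogen, which is a single transposition, so the mark on that carbon flips while the mark on the second carbon is unchanged.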

Atom-based '@' and '@@' stereo-specification of double bonds does not suffer from the theoretical flaw illustrated with cyclooctatetraene. The assignments are not shared, so adjacent configurations do not need to be considered. This is more flexible and simplifies the generation of canonical SMILES.

Depiction            SMILES                                                    Name
cyclooctatetraene    [C@H]1=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H]1    cyclooctatetraene

Note that the first stereo-specified carbon must be written as '@' because the ring-closure digit '1' follows the H, whereas the remaining carbons use '@@' to describe the cis configuration of each bond. Because the specification is on the atom rather than on the single bond, no conflict arises at the ring-closure bond.

2.4. Radical

This section needs considerable work. The following text is courtesy of Chris Morley, who commented: "I guess the last paragraph doesn’t look too good in a formal specification. There are two reasons for the frailty: lack of proof that the radical and aromatic uses can always be unambiguous (I doubt anybody has tried); and a known deficiency in the parser." However, it is a good starting point…

A single lowercase symbol is interpreted as a radical center. CCc is an alternative to CC[CH2] and is the 1-propyl radical; CcC or C[CH]C is the 2-propyl radical; Co is the methoxy radical. An odd number of adjacent lowercase symbols is a delocalized conjugated radical, so Cccccc is CC=CC=C[CH2] or CC=C[CH]C=C or C[CH]C=CC=C. Lowercase 'c' or 'n' can be used in a ring: C1cCCCC1 is the cyclohexyl radical.

The use of the non-aromatic lowercase symbol is a shorthand with improved intelligibility that allows the use of implicit hydrogens in radicals. However, it is intended only for simple, unambiguous molecules and is not reliable when combined with aromatic atoms.
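The odd-run rule can be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name `radical_runs`. It assumes a simple string in which adjacent lowercase symbols are also adjacent atoms, so it does not handle branches, ring-closure digits between lowercase atoms, bracket atoms, or genuinely aromatic rings, which is exactly the ambiguity the text warns about.

```python
import re

# Assumption: only the lowercase organic-subset symbols are considered.
LOWERCASE_RUN = re.compile(r"[bcnops]+")

def radical_runs(smiles):
    """Return the runs of adjacent lowercase symbols that have odd length."""
    return [run for run in LOWERCASE_RUN.findall(smiles) if len(run) % 2 == 1]

for smiles in ("CCc", "CcC", "Co", "Cccccc", "C1cCCCC1", "CC[CH2]"):
    # A non-empty list marks a radical center or a delocalized radical.
    print(smiles, radical_runs(smiles))
```

For benzene written as c1ccccc1 the same function would report two odd runs, because the ring-closure digit splits the six aromatic atoms; a reliable implementation would need to work on the parsed molecular graph rather than on the string.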

2.5. Twisted SMILES

An interesting extension that specifies conformational information via bond dihedral angles and bond lengths was proposed by McLeod and Peters:

3. Revision History

3.1. IUPAC SMILES+ Specification: Appendix 1 - Proposed and Known SMILES Extensions

Revision    Date          Description                                      Name
1.0         2020-09-24    Transfer proposed extensions to this appendix    Vincent F. Scalfani