The simplified molecular input line entry specification, or SMILES, is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings. Most molecule editors can import SMILES strings and convert them back into two-dimensional drawings or three-dimensional models of the molecules.
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.
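The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a canonical SMILES generator: the function name write_smiles and its input format (a list of element symbols plus a list of atom-index pairs) are assumptions made for the example, and it only handles connected, hydrogen-suppressed graphs of organic-subset atoms joined by single bonds.

```python
def write_smiles(symbols, bonds, start=0):
    """Print a hydrogen-suppressed molecular graph as a SMILES-like string.

    symbols: element symbols, one per atom (hypothetical input format)
    bonds:   pairs of atom indices, all treated as single bonds
    """
    neighbours = {i: [] for i in range(len(symbols))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    parent = {start: None}                   # spanning-tree parent of each visited atom
    children = {i: [] for i in neighbours}   # spanning-tree edges
    closures = {i: [] for i in neighbours}   # ring-closure labels on each atom
    labelled = set()                         # broken (non-tree) bonds already labelled

    def visit(atom):
        # Depth-first search: tree edges form the spanning tree,
        # every other bond is a broken cycle and receives a numeric label.
        for nbr in neighbours[atom]:
            if nbr == parent[atom]:
                continue
            if nbr not in parent:
                parent[nbr] = atom
                children[atom].append(nbr)
                visit(nbr)
            elif frozenset((atom, nbr)) not in labelled:
                labelled.add(frozenset((atom, nbr)))
                label = len(labelled)
                closures[atom].append(label)
                closures[nbr].append(label)

    def emit(atom):
        # Symbol, then ring-closure labels, then branches in parentheses;
        # the last child continues the main chain without parentheses.
        text = symbols[atom]
        for label in closures[atom]:
            text += str(label) if label < 10 else f"%{label}"
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    visit(start)
    return emit(start)


# Six carbons in a ring, then a three-carbon chain with an oxygen branch.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(write_smiles(["C"] * 6, ring))                                # C1CCCCC1
print(write_smiles(["C", "C", "O", "C"], [(0, 1), (1, 2), (1, 3)]))  # CC(O)C
```

The string produced depends on the starting atom and on the order in which neighbours are listed, so the same molecule can yield several equally valid strings.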
A SMILES string consists of characters (in ASCII) without spaces.
Atoms are represented by their element's symbol in square brackets. For example, mercury is [Hg]. The elements B, C, N, O, P, S, F, Cl, Br, and I (the "organic subset") can be written without brackets when the number of attached hydrogens conforms to the lowest normal valence consistent with the explicit bonds. Where it can be inferred, hydrogen atoms may be omitted. For example, hydrogen chloride (HCl) is just Cl; ammonia is just N (the valence of nitrogen is 3, so three hydrogen atoms are inferred).
The hydrogen atom rule only applies to organic-subset atoms written without brackets. For comparison, S refers to hydrogen sulfide (H2S, two hydrogen atoms are inferred) while [S] refers to elemental sulfur (S).
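The implicit-hydrogen rule can be illustrated with a short Python sketch. The valence table below follows the usual SMILES convention for the organic subset and is an assumption of this example, as is the function name implicit_hydrogens; aromatic atoms are not covered.

```python
# Normal valences of the organic subset, lowest first (assumed from the
# usual SMILES convention; not a complete valence model).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol, bond_order_sum=0, bracketed=False):
    """Return the number of hydrogens inferred for one atom.

    bond_order_sum: total order of the explicit bonds on the atom
    bracketed:      True if the atom was written inside [...]
    """
    if bracketed:
        # Inside brackets nothing is inferred; hydrogens must be written out.
        return 0
    for valence in NORMAL_VALENCES[symbol]:
        # Pick the lowest normal valence that accommodates the explicit bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


print(implicit_hydrogens("Cl"))                  # 1 (HCl)
print(implicit_hydrogens("N"))                   # 3 (NH3)
print(implicit_hydrogens("S"))                   # 2 (H2S)
print(implicit_hydrogens("S", bracketed=True))   # 0 ([S])
```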
Atoms in aromatic rings are written in lowercase; for example, c denotes an aromatic carbon atom.
Within brackets, any attached hydrogen atoms and formal charges must always be specified. The number of attached hydrogen atoms is shown by the symbol H followed by an optional digit. Similarly, a formal charge is shown by one of the symbols + or -, followed by an optional digit. If unspecified, charge is assumed to be zero. Multiple + or - signs are synonymous with the same sign followed by the magnitude of the charge; for example, [Fe++] can also be written [Fe+2]. Examples:
[H+] - proton (H+)
[NH4+] - ammonium (NH4+)
[C-]#N - cyanide (CN-)
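As an illustration of the bracket notation, the following Python sketch reads a single bracket atom with a regular expression. It is a simplified assumption of this example rather than a full grammar: the pattern BRACKET_ATOM and the function parse_bracket_atom are hypothetical names, and chirality marks and atom classes are not handled.

```python
import re

# [isotope? symbol Hcount? charge?], a reduced form of the bracket syntax.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|se|as|[bcnops])"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]"
)


def parse_bracket_atom(text):
    """Split one bracket atom into isotope, symbol, hydrogens and charge."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")

    hcount = match["hcount"]
    if hcount is None:
        hydrogens = 0                 # no H written: no attached hydrogens
    elif hcount == "H":
        hydrogens = 1                 # H without a digit means one hydrogen
    else:
        hydrogens = int(hcount[1:])

    charge = match["charge"]
    if charge is None:
        value = 0                     # unspecified charge is zero
    elif charge[-1].isdigit():
        value = int(charge)           # "+2" -> 2, "-1" -> -1
    else:
        value = len(charge) * (1 if charge[0] == "+" else -1)  # "++" -> 2

    isotope = int(match["isotope"]) if match["isotope"] else None
    return {"isotope": isotope, "symbol": match["symbol"],
            "hydrogens": hydrogens, "charge": value}


print(parse_bracket_atom("[NH4+]"))
# {'isotope': None, 'symbol': 'N', 'hydrogens': 4, 'charge': 1}
print(parse_bracket_atom("[Fe++]") == parse_bracket_atom("[Fe+2]"))  # True
```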
Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively. Adjacent atoms with no bond symbol between them are assumed to be connected by a single or aromatic bond, so - and : are usually omitted. Example:
O=C=O - carbon dioxide (CO2)
Branches from any atom in the sequence may be specified using parentheses, and may be nested. Examples:
CC(=O)O - acetic acid (CH3COOH)
CC(O)C - isopropyl alcohol (2-propanol)
CC(C)C(=O)O - isobutyric acid
Ring-closing bonds are specified by breaking one bond in each ring and marking the two atoms it joined with the same digit. Example:
C1CCCCC1 - cyclohexane. The string is a chain of six carbon atoms in which the first atom (labelled 1) and the sixth atom (also labelled 1) are bonded together.
A single atom may take part in more than one ring closure. For example, C12 means that the carbon atom is assigned to bond number 1 and to another bond number 2 (not to bond number 12).
A bond number can be reused once the second atom carrying that number has been written. This reduces the need for bond numbers of 10 or more. When such a number is required, a percent sign (%) must precede it. For example, C%12 is a carbon atom with bond number 12.
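The branch and ring-closure rules above can be seen working together in a small reader that turns a string into an atom list and a bond list. This is a sketch under stated assumptions, not a complete parser: the names TOKEN and parse_smiles are hypothetical, bracket atoms are kept as opaque tokens, stereochemistry marks are not supported, and every implied bond is recorded as "-" even between aromatic atoms.

```python
import re

# Bracket atoms, two-letter halogens, other organic-subset atoms, aromatic
# atoms, bond symbols, parentheses, %nn labels, single-digit labels, dots.
TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|%\d\d|\d|\."
)


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (index, index, bond symbol) tuples."""
    atoms, bonds = [], []
    prev = None        # atom that the next atom will be bonded to
    pending = None     # explicit bond symbol waiting for its second atom
    stack = []         # value of prev saved at each open parenthesis
    open_rings = {}    # ring label -> (atom index, bond symbol)
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tok, pos = match.group(), match.end()
        if tok == "(":
            stack.append(prev)            # remember the branch point
        elif tok == ")":
            prev = stack.pop()            # return to the branch point
        elif tok in ("-", "=", "#", ":"):
            pending = tok
        elif tok == ".":
            prev = None                   # start a disconnected component
        elif tok[0] == "%" or tok.isdigit():
            label = int(tok.lstrip("%"))
            if label in open_rings:       # second occurrence closes the ring
                other, symbol = open_rings.pop(label)
                bonds.append((other, prev, pending or symbol or "-"))
            else:                         # first occurrence opens it
                open_rings[label] = (prev, pending)
            pending = None
        else:                             # an atom
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending or "-"))
            pending = None
            prev = index
    if open_rings or stack:
        raise ValueError("unclosed ring label or parenthesis")
    return atoms, bonds


print(parse_smiles("CC(=O)O"))
# (['C', 'C', 'O', 'O'], [(0, 1, '-'), (1, 2, '='), (1, 3, '-')])
print(parse_smiles("C%12CCCCC%12") == parse_smiles("C1CCCCC1"))  # True
```

Because a label is removed from open_rings as soon as its ring is closed, the same digit can open a new ring later in the string, which is the reuse described above.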
Disconnected compounds are separated by a dot (.).
Isotopes are specified inside brackets by prefixing the atomic symbol with the isotope's integer mass number; for example, [13C] denotes carbon-13.
Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is trans-1,2-difluoroethene, while F/C=C\F is cis-1,2-difluoroethene.