Markup for chemical formulas

The Kelly lab has special markup rules that make it easy to write chemical formulas. There are two forms of markup: a simple (transparent) method and an explicit method for more advanced formulas.

Markup Rules and Syntax

Two types of rules are provided to support chemical formulas in text. A simple, transparent rule automatically subscripts numbers in text that appears to be a chemical formula. A more advanced rule is explicitly invoked by a leading ampersand (&) character, and offers much more powerful markup options.

Simple (transparent) chemical formula markup:

Any string that appears to be a common atom followed by a number is rewritten to subscript the number. Strings like CH3CH2CH2OH, polymer-NH2, and H2SO4 are automatically converted to chemical formulas.

Recognized atoms are C, H, O, N, P, S, Na, Si, Cl, and Br. These are the atoms most commonly encountered in biochemistry; extending the rule to other atoms is trivial.
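The transparent rule can be approximated with a single regular expression. The Python sketch below is an illustration only, not the lab's actual implementation: the names ATOM_NUMBER and subscript_simple, and the choice of HTML <sub> tags as output, are assumptions. Two-letter symbols are listed first so that "Na" is tried before "N".

```python
import re

# Atoms named in the text. Two-letter symbols come first so that the
# alternation tries "Na" before "N" and "Si" before "S".
# ATOM_NUMBER and subscript_simple are hypothetical names for this sketch.
ATOM_NUMBER = re.compile(r"(Na|Si|Cl|Br|C|H|O|N|P|S)(\d+)")


def subscript_simple(text: str) -> str:
    """Wrap the digits that directly follow a recognized atom in <sub> tags."""
    return ATOM_NUMBER.sub(r"\1<sub>\2</sub>", text)


if __name__ == "__main__":
    for sample in ("CH3CH2CH2OH", "polymer-NH2", "H2SO4", "CaCl2(s)"):
        print(sample, "->", subscript_simple(sample))
```

A pattern this loose also matches ordinary words that happen to contain a recognized atom followed by a digit, so the real rule is presumably somewhat stricter.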

This markup makes it natural to write about (for instance) H2N-C6H4-COOH (p-aminobenzoic acid), or simple reactions like this:

3NaOH + H3PO4 ↔ Na3PO4 + 3H2O

Some examples of transparent chemical formula markup are:

Source text | Displays as
Na2PO4 | Na2PO4
CaCl2(s) | CaCl2(s)

The transparent method can be combined with explicit markup to extend the repertoire of chemical formulas that can be created:

Source text | Displays as
(C6H5)'_2_'NCH2CH3 | (C6H5)2NCH2CH3
CH3CH((CH2)'_3_'NH2)NH3'^+^' | CH3CH((CH2)3NH2)NH3+
The sulfate ion (SO4'^2–^') ... | The sulfate ion (SO42–) …
FeCl3·6H2O | FeCl3·6H2O
Nitroxide radical (R'_2_'NO•) ... | Nitroxide radical (R2NO•) …

Clearly, writing the explicit markup quickly becomes cumbersome, and the resulting source is often hard to read and understand while editing.

More advanced chemistry markup:

For more complicated formulas, the author must signal the start of the formula text with a leading ampersand ('&') character. The formula continues until a character other than numbers, letters, or certain punctuation is encountered.

The text "&formula" is interpreted as follows:

  • Strings of numbers following letters, periods or parentheses in the chemical formula are subscripted.
  • The '.' character is replaced with an HTML middot (·) character, as in 'CuSO4 · 6H2O'.
  • The string '==' is converted to a triple bond (rendered as a mathematical "≡" character), as in 'CH3C≡N'.
  • Trailing '*' characters are converted to HTML bullet (•) characters (free radicals, e.g. 'HO•').
  • Trailing '+' or '-' characters become ionic charges (e.g. the text 'SO4-- ion' becomes 'SO42– ion').
  • Lowercase letters following the ')' character in polymer formulas, such as '(PEO)m(PPO)n', are automatically subscripted and italicized ('(PEO)m(PPO)n').
  • Chemical "R-groups" with numerical indices are superscripted instead of subscripted. For example, 'R1-NH2CH2-R2' is rendered as 'R1-NH2CH2-R2'.
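The rules listed above can be sketched as a short chain of substitutions. This is a hedged illustration, not the wiki's real code: render_formula is a hypothetical function that takes the formula text after the leading '&' (finding where a formula ends in running text is not attempted here), HTML <sub>/<sup>/<i> tags are an assumed output format, and only closing parentheses are treated as "parentheses" in the subscript step.

```python
import re

EN_DASH = "\u2013"  # typographic minus used for negative charges
BULLET = "\u2022"   # free-radical dot
TRIPLE = "\u2261"   # triple-bond symbol
MIDDOT = "\u00b7"   # hydrate separator


def render_formula(formula: str) -> str:
    """Render the text that follows '&' as HTML (hypothetical helper)."""
    core, charge, radical = formula, "", ""

    # Trailing run of '+' or '-' becomes an ionic charge, e.g. '--' -> 2-.
    m = re.search(r"(\++|-+)$", core)
    if m:
        run = m.group(1)
        sign = "+" if run[0] == "+" else EN_DASH
        count = str(len(run)) if len(run) > 1 else ""
        charge = f"<sup>{count}{sign}</sup>"
        core = core[: m.start()]

    # Trailing '*' characters become free-radical bullets.
    m = re.search(r"\*+$", core)
    if m:
        radical = BULLET * len(m.group(0))
        core = core[: m.start()]

    # R-group indices are superscripted; doing this first keeps them
    # out of reach of the generic subscript step below.
    core = re.sub(r"R(\d+)", r"R<sup>\1</sup>", core)
    # Digits after a letter, a period, or ')' are subscripted (rule as worded).
    core = re.sub(r"(?<=[A-Za-z.)])\d+",
                  lambda d: f"<sub>{d.group(0)}</sub>", core)
    # A lowercase letter right after ')' is an italic polymer subscript.
    core = re.sub(r"\)([a-z])", r")<sub><i>\1</i></sub>", core)
    # Bond and hydrate punctuation.
    core = core.replace("==", TRIPLE).replace(".", MIDDOT)

    return core + radical + charge


if __name__ == "__main__":
    for sample in ("SO4--", "HO*", "CH3C==N", "(PEO)m(PPO)n", "R1-C==C-R2"):
        print(sample, "->", render_formula(sample))
```

The order of the steps matters in this sketch: the charge and radical suffixes are removed before any tags are inserted, and the generic subscript step runs after the R-group step so that an R-group index is never wrapped twice.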

Despite the relative simplicity of these rules, this markup scheme allows a wide range of formulas to be rendered with minimal markup. See below for more illustrative examples.

Note: No effort is made to ensure that the formula conforms to feasible chemical bonding, nor to IUPAC or SMILES naming conventions. It is up to the author to ensure that the formula makes sense from a physicochemical standpoint… no carbon atoms with 5 substituents, please!

Some examples will help demonstrate the power of this markup:

Source text | Displays as
&Na3PO4 | Na3PO4
&H2PO4- | H2PO4–
&(C6H5)2NCH2CH3 | (C6H5)2NCH2CH3
&CH3CH((CH2)3NH2)NH3+ | CH3CH((CH2)3NH2)NH3+
Pluronics [&(PEO)n(PPO)m(PEO)n] ... | Pluronics [(PEO)n(PPO)m(PEO)n] …
The sulfate ion (&SO4--) ... | The sulfate ion (SO42–)
Iron (III) chloride, &FeCl3.6H2O, ... | Iron (III) chloride, FeCl3·6H2O, …
Nitroxide radicals (&RNO*) ... | Nitroxide radicals (RNO•) …
... ionic &Cu++, &SO4-- or &PO4--- ... | … ionic Cu2+, SO42– or PO43–
The &Na+&Cl- crystal lattice... | The Na+Cl– crystal lattice…
Laurate salts [&CH3(CH2)10C(=O)O-&Na+] ... | Laurate salts [CH3(CH2)10C(=O)O–Na+] …
Blue-colored &CuSO4.xH2O was added... | Blue-colored CuSO4·xH2O was added…
Dabsyl chloride [&(CH3)2N(C6H4)N=N(C6H4)SO2Cl] reacts with primary amines (-NH2). | Dabsyl chloride [(CH3)2N(C6H4)N=N(C6H4)SO2Cl] reacts with primary amines (-NH2).
NaN3 (sodium azide) has a toxicity similar to cyanides (e.g. &NaC==N). | NaN3 (sodium azide) has a toxicity similar to cyanides (e.g. NaC≡N).
&-C-C==N bonds are common. The acetylides (&R1-C==C-R2) also exist. | -C-C≡N bonds are common. The acetylides (R1-C≡C-R2) also exist.
H2O2 exposed to UV light can generate a hydroxyl radical (&HO*). Organic peroxides (&R1OOH) can create &R1O* radicals. | H2O2 exposed to UV light can generate a hydroxyl radical (HO•). Organic peroxides (R1OOH) can create R1O• radicals.
The absorbance (&A490) in aqueous solution... | The absorbance (A490) in aqueous solution…

Note that the '+' or '-' characters that represent a charge must come at the end of a formula. To write several charges (e.g. an ion pair), place two formula statements back to back:

Source text | Displays as
Zwitterionic amino acids (&H3N+-&CHR1-COO-) ... | Zwitterionic amino acids (H3N+-CHR1-COO–) …
&Ca++&EDTA-- complexes are used for ... | Ca2+EDTA2– complexes are used for …

Implementation considerations and limitations

Note that the markup needs to be done after links are resolved, to avoid incorrectly adding HTML markup to link names like:

  • This [[CHEMSKETCH2 | link to CHEMSKETCH2]] doesn't work because the '2' in 'H2' is subscripted.

The simple markup rule is fairly specific, but occasionally it will incorrectly mark up a word. To prevent this, wrap the offending text in [= ... =].

For example, the string 'MP3' appears to be a simple chemical formula (because of the P followed by a digit). To prevent the subscript on the '3', just enclose the string in '[= ... =]' markers:

Source text | Displays as
I listen to my MP3 player all the time. | I listen to my MP3 player all the time.
I listen to my [=MP3=] player all the time. | I listen to my MP3 player all the time.
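One way to honor the '[= ... =]' markers is to split the text on them and apply the transparent rule only to the pieces in between. The Python sketch below assumes this approach; ESCAPE, ATOM_NUMBER, and markup_with_escapes are hypothetical names, the atom pattern is a simplified stand-in for the real rule, and HTML <sub> tags are an assumed output format.

```python
import re

# Hypothetical patterns for this sketch, not the wiki's actual rules.
ESCAPE = re.compile(r"\[=(.*?)=\]")
ATOM_NUMBER = re.compile(r"(Na|Si|Cl|Br|C|H|O|N|P|S)(\d+)")


def markup_with_escapes(text: str) -> str:
    """Subscript atom counts everywhere except inside [= ... =] markers."""
    # re.split with one capturing group returns the escaped contents at
    # odd indices and the surrounding text at even indices.
    parts = ESCAPE.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2:
            out.append(part)  # escaped: emit verbatim, markers removed
        else:
            out.append(ATOM_NUMBER.sub(r"\1<sub>\2</sub>", part))
    return "".join(out)


if __name__ == "__main__":
    print(markup_with_escapes("I listen to my MP3 player all the time."))
    print(markup_with_escapes("I listen to my [=MP3=] player all the time."))
```

With this sketch the first line is rendered with a subscripted 3, while the second is emitted as plain 'MP3' with the markers stripped.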

Chemistry markup is supported by Rat. If you have questions or comments, please direct them to him.