CML; where Open Source and Open Data meet

DouglasConnect, 2004-10-18, Cyberspace

Power corrupts; PowerPoint corrupts absolutely (Tufte); slides in XML

This talk will be publicly archived in the University Repository: http://www.dspace.cam.ac.uk/handle/1810/749

The foresight of JD Bernal

However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means. (Bernal, 1965)
[we need to] get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not (my emphasis).

quoted by PM-R in The Globalization of Crystallographic Knowledge, Acta Cryst. D, 54, 1065-1070 (1998)

Visions for the Chemical Semantic Web

"The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C)

Examples that a robot could do:

...more ambitious...

Open Programs

Programs = algorithms + data (Wirth)

Therefore for Open Science we require:

The tools include:

Chemical publication

Chemistry is micropublished by humans and re-aggregated by humans



The chembots of the semantic grid cannot currently use this information

Loss during Publication

Most chemical information (80%+) is never published


Every stage during publication involves information loss...

Could you repeat a published in silico literature experiment?

CrystEngComm

Crystal Engineering Communications

CrystEngComm, 2004, 6 (27), 130 - 158, DOI: 10.1039/b313104a
"CH/pi hydrogen bonds in crystals", Motohiro Nishio

Chemistry needs a better way of communicating structure ...


EPO - information loss

Bioinformatics

Information, infrastructure, data and ontology are Free and Open

Chemoinformatics

Major information providers are inward-looking and offer proprietary solutions; information is neither Open nor Free

EPO - information retention

Lossless Publication

XML can be lossless, and re-used in many ways

Threats and Opportunities

Threats

Opportunities

Budapest Open Access Initiative

from BOAI declaration
... By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
http://www.soros.org/openaccess/read.shtml, my emphasis

Anything less (e.g. "green OA") is unacceptable to me

Open Scientific Information Robotically Managed

Any payments, logins, licenses, forms, etc. are currently insuperable obstacles.

OpenSource and OpenData

Open Source and Open Data liberate thought

Open Chemistry

NOT comprehensive
Software:

Data:

The Rights of Molecules

The Inter-Union Bioinformatics Group of the International Scientific Unions has strongly urged that scientific data be freely available

Funding bodies such as NSF/NIH require data to be made Open unless good cause is shown

Molecular Rights

NOTE: Agreed current copyright must be respected; a different model is required for the future

Our "Manifesto for Open Chemistry" in our recent article in the RSC's Org. Biomol. Chem. is archived at http://www.dspace.cam.ac.uk/handle/1810/741

Central role of CML

Overview of Chemical Markup Language (CML)

Based on dictionaries, ontologies

[1] P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci., 2003, 43, issue 4.

CML Architecture

CML architecture is extensible through OO design:

Menu-based Schema

A collection of context-free components which can be combined in any sensible manner

STMML

A set of domain-independent components for Scientific/Technical/Medical information. Key elements:

STM Schema

XML in 1 minute

XML defines a document structure and a vocabulary

  <list>
    <item>First item</item>
    <item>Second item</item>
  </list>
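
As a rough sketch of what "a document structure and a vocabulary" gives a program, the list fragment above can be read with Python's standard library. The function name `read_items` is illustrative only and is not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# The list fragment from the slide, held as a string.
LIST_XML = """
<list>
  <item>First item</item>
  <item>Second item</item>
</list>
"""

def read_items(xml_text):
    """Return the text of every <item> child of the root <list> element."""
    root = ET.fromstring(xml_text)
    # The vocabulary tells us what to expect: a <list> containing <item>s.
    if root.tag != "list":
        raise ValueError("expected a <list> root element, got <%s>" % root.tag)
    return [item.text for item in root.findall("item")]

items = read_items(LIST_XML)
print(items)
```

The program never has to guess where one item ends and the next begins; the markup carries that structure explicitly.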

Chemical Markup Language for Nitrogen Dioxide (.O-N=O)

<molecule id="no2">
  <atomArray>
    <atom id="n1" elementType="N" hydrogenCount="0"/>
    <atom id="o1" elementType="O" hydrogenCount="0"/>
    <atom id="o2" elementType="O" hydrogenCount="0"/>
  </atomArray>
  <bondArray>
    <bond id="bo1" atomRefs2="n1 o1" order="2"/>
    <bond id="bo2" atomRefs2="n1 o2" order="1"/>
  </bondArray>
</molecule>
...and briefer...
<molecule id="no2">
  <atomArray atomID="n1 o1 o2"
    elementType="N O O" hydrogenCount="0 0 0"/>
  <bondArray atomRef1="n1 n1" atomRef2="o1 o2" order="2 1"/>
</molecule>
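
The two serializations above are meant to carry the same information. A minimal sketch, assuming only Python's standard library, that reads either form into the same atom and bond lists; the function name `atoms_and_bonds` is hypothetical, and the molecule `id` is deliberately not compared.

```python
import xml.etree.ElementTree as ET

VERBOSE = """<molecule id="no2">
  <atomArray>
    <atom id="n1" elementType="N" hydrogenCount="0"/>
    <atom id="o1" elementType="O" hydrogenCount="0"/>
    <atom id="o2" elementType="O" hydrogenCount="0"/>
  </atomArray>
  <bondArray>
    <bond id="bo1" atomRefs2="n1 o1" order="2"/>
    <bond id="bo2" atomRefs2="n1 o2" order="1"/>
  </bondArray>
</molecule>"""

COMPACT = """<molecule id="no2">
  <atomArray atomID="n1 o1 o2"
    elementType="N O O" hydrogenCount="0 0 0"/>
  <bondArray atomRef1="n1 n1" atomRef2="o1 o2" order="2 1"/>
</molecule>"""

def atoms_and_bonds(cml_text):
    """Return ({atom id: element}, [(atom1, atom2, order)]) for either form."""
    mol = ET.fromstring(cml_text)
    atom_array = mol.find("atomArray")
    bond_array = mol.find("bondArray")

    if "atomID" in atom_array.attrib:
        # Compact form: whitespace-separated values in parallel attributes.
        ids = atom_array.get("atomID").split()
        elements = atom_array.get("elementType").split()
        atoms = dict(zip(ids, elements))
    else:
        # Verbose form: one <atom> child per atom.
        atoms = {a.get("id"): a.get("elementType")
                 for a in atom_array.findall("atom")}

    if "atomRef1" in bond_array.attrib:
        bonds = list(zip(bond_array.get("atomRef1").split(),
                         bond_array.get("atomRef2").split(),
                         bond_array.get("order").split()))
    else:
        bonds = []
        for b in bond_array.findall("bond"):
            first, second = b.get("atomRefs2").split()
            bonds.append((first, second, b.get("order")))
    return atoms, bonds

print(atoms_and_bonds(VERBOSE) == atoms_and_bonds(COMPACT))
```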

"Advanced" CML

tert-Butyl hydroperoxide (40 ml of a 5M solution in decanes, 200 mmol, 2 eq.) was added via cannula to a solution of (2E)-oct-2-en-1-ol 23 (MW=128.21, d=0.850, 12.8 g, 15.1 ml, 100 mmol) in DCM (200 ml)

XML version
  <reaction id="r12312"
      xmlns:nist="http://units.nist.gov"
      xmlns:iupac="http://www.iupac.org/solvents"
      xmlns:rsc="http://www.rsc.org/reagents"
      xmlns="http://www.xml-cml.org">
    <substance role="reagent" ref="rsc:a213123"
      title="tert-Butyl hydroperoxide">
      <amount units="nist:ml">40</amount>
      <concentration units="nist:molar">5</concentration>
      <substance role="solvent" ref="s324235"
      title="decanes"/>
    </substance>
    <substance role="rsc:reagent" ref="g32423"
      title="(2E)-oct-2-en-1-ol">
      <amount units="nist:g">12.8</amount>
    </substance>
    <substance role="rsc:solvent" ref="s324234"
      title="DCM">
      <amount units="nist:ml">200</amount>
    </substance>
  </reaction>
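
One thing a robot could do with the XML version is check the quantities quoted in the prose (200 mmol of hydroperoxide, 100 mmol of the alcohol, 2 eq.). The sketch below assumes Python's standard library; the `MOLAR_MASS` table and the function names are hypothetical, with the molar mass taken from the prose (MW=128.21) because the XML fragment does not carry it.

```python
import xml.etree.ElementTree as ET

CML = "{http://www.xml-cml.org}"  # default namespace declared on <reaction>

REACTION_XML = """<reaction id="r12312"
    xmlns:nist="http://units.nist.gov"
    xmlns:iupac="http://www.iupac.org/solvents"
    xmlns:rsc="http://www.rsc.org/reagents"
    xmlns="http://www.xml-cml.org">
  <substance role="reagent" ref="rsc:a213123" title="tert-Butyl hydroperoxide">
    <amount units="nist:ml">40</amount>
    <concentration units="nist:molar">5</concentration>
    <substance role="solvent" ref="s324235" title="decanes"/>
  </substance>
  <substance role="rsc:reagent" ref="g32423" title="(2E)-oct-2-en-1-ol">
    <amount units="nist:g">12.8</amount>
  </substance>
  <substance role="rsc:solvent" ref="s324234" title="DCM">
    <amount units="nist:ml">200</amount>
  </substance>
</reaction>"""

# Assumption: molar masses in g/mol supplied from outside the XML.
MOLAR_MASS = {"(2E)-oct-2-en-1-ol": 128.21}

def millimoles(substance):
    """Amount in mmol, or None when the fragment does not determine it."""
    amount = substance.find(CML + "amount")
    if amount is None:
        return None
    value = float(amount.text)
    units = amount.get("units")
    if units == "nist:ml":
        conc = substance.find(CML + "concentration")
        if conc is None or conc.get("units") != "nist:molar":
            return None  # a plain volume of solvent, no molar quantity
        return value * float(conc.text)  # ml x mol/l = mmol
    if units == "nist:g":
        mass = MOLAR_MASS.get(substance.get("title"))
        return None if mass is None else 1000.0 * value / mass
    return None

def reaction_millimoles(xml_text):
    """Map each top-level substance title to its amount in mmol."""
    root = ET.fromstring(xml_text)
    return {s.get("title"): millimoles(s) for s in root.findall(CML + "substance")}

amounts = reaction_millimoles(REACTION_XML)
equivalents = amounts["tert-Butyl hydroperoxide"] / amounts["(2E)-oct-2-en-1-ol"]
print(amounts, round(equivalents, 2))
```

Because the units are explicit attributes rather than free text, the arithmetic needs no natural-language parsing.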

IUPAC Unique Chemical Identifier

C7H6O2,1H3-5-4H-6(8)2H-3H-7(5)9

Molecules and Properties

Molecular Properties

Beilstein records at least 600 properties...

CIFSAXDOM

CIF2CML

CIFCML benefits

Why "semantic"?

CML dictionaries

Macie

Dictionary - 1

Linkage of document to single dictionary

Dictionary - 2

Linkage of document to multiple dictionaries

Job Control

Platform- and application-independent Computational Chemistry

Component-based architecture

- Read 1000 crystal structures (CIF format)
- find largest molecule
- run MOPAC (given params)
- load results into database indexed on chemical structure
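
The four steps above can be sketched as interchangeable components. This is only an illustration under stated assumptions: the structures are taken as already read from CIF (no CIF parser is shown), `fake_mopac` is a hypothetical stand-in for a real MOPAC run, and a plain dictionary keyed by structure name stands in for a database indexed on chemical structure.

```python
from typing import Callable, Dict, List, Set, Tuple

Atoms = List[str]               # atom ids
Bonds = List[Tuple[str, str]]   # pairs of bonded atom ids

def largest_molecule(atoms: Atoms, bonds: Bonds) -> Set[str]:
    """Return the largest connected component of the bond graph."""
    neighbours: Dict[str, Set[str]] = {atom: set() for atom in atoms}
    for first, second in bonds:
        neighbours[first].add(second)
        neighbours[second].add(first)
    seen: Set[str] = set()
    best: Set[str] = set()
    for start in atoms:
        if start in seen:
            continue
        # Depth-first walk collecting everything bonded to `start`.
        component: Set[str] = set()
        stack = [start]
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(neighbours[atom] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def run_pipeline(structures: Dict[str, Tuple[Atoms, Bonds]],
                 calculate: Callable[[Set[str]], dict],
                 store: Dict[str, dict]) -> Dict[str, dict]:
    """Largest fragment -> calculation -> store, for every structure."""
    for name, (atoms, bonds) in structures.items():
        fragment = largest_molecule(atoms, bonds)
        store[name] = calculate(fragment)
    return store

def fake_mopac(fragment: Set[str]) -> dict:
    """Hypothetical placeholder for the calculation step."""
    return {"n_atoms": len(fragment)}

# Toy input: a three-atom fragment plus an isolated counter-ion.
toy = {"entry1": (["c1", "c2", "o1", "na1"], [("c1", "c2"), ("c2", "o1")])}
database = run_pipeline(toy, fake_mopac, {})
print(database)
```

Each stage takes and returns plain data, so any one of them could be swapped for a different program without touching the others.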

The DATUMENT [1] - Hypermedia for science

Many theses approximate to (paper) datuments!!

[1] Murray-Rust and Rzepa, Data Science (CODATA), 2002
[1] Murray-Rust and Rzepa, J. Digital Information (OUP/Soton), 2004, in press

Peer2Peer 1

Examples are Napster and Gnutella for music*

Motivation ("why should I donate?"):

The future

Thanks

OpenSource: Christoph Steinbeck, Egon Willighagen (CDK/JCP/Jmol), Joerg Wegner (CDK, Tübingen), Geoff Hutchison, Michael Banck (OB)

http://wwmm.ch.cam.ac.uk; http://www.xml-cml.org; http://cml.sourceforge.net

URLs

some URLs