DouglasConnect, 2004-10-18, Cyberspace
Power corrupts; Powerpoint corrupts absolutely (Tufte); slides in XML
This talk will be publicly archived in the University Repository: http://www.dspace.cam.ac.uk/handle/1810/749
However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means. (Bernal, 1965)
[we need to] get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not (my emphasis).
quoted by PM-R in The Globalization of Crystallographic Knowledge, Acta Cryst. D, 54, 1065-1070 (1998)
"The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C)
Examples that a robot could do:
Programs = algorithms + data (Wirth)
Therefore for Open Science we require:
The tools include:
Chemistry is micropublished by humans and re-aggregated by humans
The chembots of the semantic grid cannot currently use this information
Most chemical information (80%+) is never published
Every stage during publication involves information loss...
Could you repeat a published in silico literature experiment?
2004, 6 (27), 130 - 158, DOI: 10.1039/b313104a
"CH/pi hydrogen bonds in crystals", Motohiro Nishio
Chemistry needs a better way of communicating structure ...
Information, infrastructure, data and ontology are Free and Open
Major information providers point inward, proprietary solutions, information is neither Open nor free
XML can be lossless, and re-used in many ways
... By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
(http://www.soros.org/openaccess/read.shtml, my emphasis)
Anything less (e.g. "green OA") is unacceptable to me
Any payments, logins, licenses, forms, etc. are currently insuperable obstacles.
The Inter-Union Bioinformatics Group of the International Scientific Unions has strongly urged that scientific data be freely available
Funding bodies such as NSF/NIH require data to be made Open unless good cause shown
Our "Manifesto for Open Chemistry" in our recent article in the RSC's Org. Biomol. Chem. is archived at http://www.dspace.cam.ac.uk/handle/1810/741
Based on dictionaries, ontologies
P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci., 2003, 43, issue 4.
CML architecture is extensible through OO design:
A collection of context-free components which can be combined in any sensible manner
A set of domain-independent components for Scientific/Technical/Medical information. Key elements:
XML defines a document structure and a vocabulary
<list> <item>First item</item> <item>Second item</item> </list>
Chemical Markup Language for Nitrogen Dioxide (.O-N=O)
<molecule id="no2">
  <atomArray>
    <atom id="n1" elementType="N" hydrogenCount="0"/>
    <atom id="o1" elementType="O" hydrogenCount="0"/>
    <atom id="o2" elementType="O" hydrogenCount="0"/>
  </atomArray>
  <bondArray>
    <bond id="bo1" atomRefs2="n1 o1" order="2"/>
    <bond id="bo2" atomRefs2="n1 o2" order="1"/>
  </bondArray>
</molecule>
...and briefer...
<molecule id="no">
  <atomArray atomID="n1 o1 o2" elementType="N O O" hydrogenCount="0 0 0"/>
  <bondArray atomRef1="n1 n1" atomRef2="o1 o2" order="2 1"/>
</molecule>
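A minimal sketch of how a program might read both forms above and confirm that they describe the same atoms and bonds. It uses only Python's standard `xml.etree.ElementTree`; the function names `read_verbose` and `read_compact` are illustrative and are not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# The two nitrogen dioxide fragments from the slide, copied as strings.
VERBOSE = """<molecule id="no2">
  <atomArray>
    <atom id="n1" elementType="N" hydrogenCount="0"/>
    <atom id="o1" elementType="O" hydrogenCount="0"/>
    <atom id="o2" elementType="O" hydrogenCount="0"/>
  </atomArray>
  <bondArray>
    <bond id="bo1" atomRefs2="n1 o1" order="2"/>
    <bond id="bo2" atomRefs2="n1 o2" order="1"/>
  </bondArray>
</molecule>"""

COMPACT = """<molecule id="no">
  <atomArray atomID="n1 o1 o2" elementType="N O O" hydrogenCount="0 0 0"/>
  <bondArray atomRef1="n1 n1" atomRef2="o1 o2" order="2 1"/>
</molecule>"""


def read_verbose(xml_text):
    """Read one <atom>/<bond> element per atom/bond."""
    root = ET.fromstring(xml_text)
    atoms = [(a.get("id"), a.get("elementType"), int(a.get("hydrogenCount")))
             for a in root.iter("atom")]
    bonds = [(*b.get("atomRefs2").split(), b.get("order"))
             for b in root.iter("bond")]
    return atoms, bonds


def read_compact(xml_text):
    """Read the array form: whitespace-separated values, one column per attribute."""
    root = ET.fromstring(xml_text)
    aa = root.find("atomArray")
    ba = root.find("bondArray")
    atoms = list(zip(aa.get("atomID").split(),
                     aa.get("elementType").split(),
                     map(int, aa.get("hydrogenCount").split())))
    bonds = list(zip(ba.get("atomRef1").split(),
                     ba.get("atomRef2").split(),
                     ba.get("order").split()))
    return atoms, bonds


atoms, bonds = read_verbose(VERBOSE)
same = (atoms, bonds) == read_compact(COMPACT)
```

Here `same` is true: both forms yield three atoms (N, O, O) and two bonds from `n1`, one of order 2 and one of order 1.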
tert-Butyl hydroperoxide (40 ml of a 5M solution in decanes, 200 mmol, 2 eq.) was added via cannula to a solution of (2E)-oct-2-en-1-ol 23 (MW=128.21, d=0.850, 12.8 g, 15.1 ml, 100 mmol) in DCM (200 ml)
XML version
<reaction id="r12312"
    xmlns:nist="http://units.nist.gov"
    xmlns:iupac="http://www.iupac.org/solvents"
    xmlns:rsc="http://www.rsc.org/reagents"
    xmlns="http://www.xml-cml.org">
  <substance role="reagent" ref="rsc:a213123" title="tert-Butyl hydroperoxide">
    <amount units="nist:ml">40</amount>
    <concentration units="nist:molar">5</concentration>
    <substance role="solvent" ref="s324235" title="decanes"/>
  </substance>
  <substance role="rsc:reagent" ref="g32423" title="(2E)-oct-2-en-1-ol">
    <amount units="nist:g">12.8</amount>
  </substance>
  <substance role="rsc:solvent" ref="s324234" title="DCM">
    <amount units="nist:ml">200</amount>
  </substance>
</reaction>
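A hedged sketch of what a "chembot" could do with the reaction above: list each top-level substance with its amount, resolve the `nist:` unit prefix to its namespace URI, and recompute the 200 mmol quoted in the prose from 40 ml of a 5 M solution. The helper names are hypothetical. Treating prefixed attribute values such as `nist:ml` as namespace-qualified, and reading `molar` as mol/L, are assumptions of this sketch rather than rules stated on the slide.

```python
import io
import xml.etree.ElementTree as ET

REACTION = """<reaction id="r12312"
    xmlns:nist="http://units.nist.gov"
    xmlns:iupac="http://www.iupac.org/solvents"
    xmlns:rsc="http://www.rsc.org/reagents"
    xmlns="http://www.xml-cml.org">
  <substance role="reagent" ref="rsc:a213123" title="tert-Butyl hydroperoxide">
    <amount units="nist:ml">40</amount>
    <concentration units="nist:molar">5</concentration>
    <substance role="solvent" ref="s324235" title="decanes"/>
  </substance>
  <substance role="rsc:reagent" ref="g32423" title="(2E)-oct-2-en-1-ol">
    <amount units="nist:g">12.8</amount>
  </substance>
  <substance role="rsc:solvent" ref="s324234" title="DCM">
    <amount units="nist:ml">200</amount>
  </substance>
</reaction>"""

CML = "{http://www.xml-cml.org}"  # default namespace declared on <reaction>


def prefix_map(xml_text):
    """Collect prefix -> URI from the xmlns declarations in the document."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    return {prefix: uri
            for _, (prefix, uri) in ET.iterparse(source, events=("start-ns",))}


def resolve(value, prefixes):
    """Split 'nist:ml' into (namespace URI, 'ml'); unprefixed values get None."""
    if ":" not in value:
        return None, value
    prefix, local = value.split(":", 1)
    return prefixes[prefix], local


def reagent_rows(xml_text):
    """One row per top-level substance: title, amount, unit, unit namespace, mmol."""
    prefixes = prefix_map(xml_text)
    root = ET.fromstring(xml_text)
    rows = []
    for substance in root.findall(CML + "substance"):
        amount = substance.find(CML + "amount")
        namespace, unit = resolve(amount.get("units"), prefixes)
        conc = substance.find(CML + "concentration")
        # Assumption: 'molar' means mol/L, so ml * mol/L gives mmol.
        mmol = None
        if conc is not None and unit == "ml":
            mmol = float(amount.text) * float(conc.text)
        rows.append((substance.get("title"), float(amount.text), unit, namespace, mmol))
    return rows


rows = reagent_rows(REACTION)
```

For the hydroperoxide row this gives 40.0 ml and 200.0 mmol, which agrees with the human-readable sentence; the other two substances carry no concentration, so their mmol entry is `None`.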
Linkage of document to single dictionary
Linkage of document to multiple dictionaries
Platform and application-independent Computational Chemistry
- Read 1000 crystal structures (CIF format)
- find largest molecule
- run MOPAC (given params)
- load results into database indexed on chemical structure
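The four steps above can be sketched as a small driver. This is only a skeleton under stated assumptions: the CIF reader, the MOPAC runner and the structure key are passed in as caller-supplied functions (hypothetical stand-ins here, since the talk does not name specific tools), a molecule is represented simply as a list of element symbols, and the database is an in-memory SQLite table.

```python
import sqlite3


def largest_molecule(molecules):
    """Pick the molecule with the most atoms (each molecule is a list of symbols)."""
    return max(molecules, key=len)


def run_pipeline(cif_paths, read_cif, run_mopac, structure_key, conn):
    """Read each CIF, take its largest molecule, compute it, store the result.

    read_cif, run_mopac and structure_key are supplied by the caller; this
    sketch does not assume any particular CIF parser or MOPAC interface.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "structure_key TEXT PRIMARY KEY, source TEXT, energy REAL)"
    )
    for path in cif_paths:
        molecules = read_cif(path)
        if not molecules:
            continue  # nothing usable in this file
        mol = largest_molecule(molecules)
        energy = run_mopac(mol)
        # Index on the structure, not on the file name.
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (structure_key(mol), path, energy),
        )
    conn.commit()


# Demonstration with invented stand-ins (no real CIF files or MOPAC runs).
fake_cifs = {
    "a.cif": [["O", "H", "H"], ["C", "H", "H", "H", "H"]],
    "b.cif": [["N", "O", "O"]],
}
conn = sqlite3.connect(":memory:")
run_pipeline(
    fake_cifs,                            # iterating a dict yields its keys (paths)
    fake_cifs.get,                        # stand-in CIF reader
    lambda mol: -1.0 * len(mol),          # stand-in "energy"
    lambda mol: "".join(sorted(mol)),     # stand-in structure key
    conn,
)
stored = conn.execute(
    "SELECT structure_key, energy FROM results ORDER BY structure_key"
).fetchall()
```

Passing the stages in as functions keeps the driver platform and application independent: a real CIF parser or MOPAC wrapper can be swapped in without changing the loop. In practice the structure key would be a canonical chemical identifier rather than the sorted symbol string used here.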
Examples are Napster and Gnutella for music
Motivation ("why should I donate?"):
OpenSource: Christoph Steinbeck, Egon Willighagen (CDK/JCP/Jmol), Joerg Wegner (CDK, Tubingen), Geoff Hutchinson, Michael