The Scientist 14[15]:40, Jul. 24, 2000

PROFESSION

The Language of Bioinformatics

Format, exchange standards strain to keep pace

By Potter Wickware

Once the world had a single language and not too many words, but then clarity deteriorated into clamor. Today in the small but prolific world of bioinformatics, another Tower of Babel is rising up, with the miscommunication due as much to the rapid expansion of information as to basic changes in how it is processed. "Horrible problems" crop up as more information is computed on instead of read by a human researcher, according to Ewan Birney, a group leader in the Ensembl genome annotation project at the European Bioinformatics Institute (EBI) in Cambridge, England.

In the early days of bioinformatics, human-readable data exchange formats such as ASN.1, the format adopted for GenBank by the National Center for Biotechnology Information (NCBI) 10 years ago, were the norm. Easily editable with a text utility, ASN.1 is syntactically loose, which makes it congenial to the human user but not to the machine, which likes its inputs defined with dictatorial rigidity.

While syntax and data exchange formats are important, says David Benton, a pharmaceutical scientist at SmithKline Beecham in King of Prussia, Pa., and life sciences cochair at the Object Management Group (OMG), a software standards organization, they ought to be concealed from the end user. Objects, not formats, are what count, he says. "The user wants the information--the molecular weight of the compound or the number of phosphates in the binding site--as a number that can be passed directly into a program. He doesn't want to have to find and interpret a tagged string."
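Benton's distinction between a tagged string and a usable value can be illustrated with a minimal Python sketch. The record layout, the tag names (NAME, MW, PHOS), the values, and the Compound class are all hypothetical, invented here for illustration, and are not drawn from any real database format.

```python
from dataclasses import dataclass

# Hypothetical tagged flat-file record; tags, layout, and values are invented.
RAW_RECORD = """\
NAME  adenosine triphosphate
MW    507.18
PHOS  3
"""


@dataclass
class Compound:
    """The 'object' view: typed fields a program can compute on directly."""
    name: str
    molecular_weight: float
    phosphate_count: int


def parse_compound(text: str) -> Compound:
    """Hide the tagged-string format behind a typed object."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split at the first space: the tag comes before it, the value after.
        tag, _, value = line.partition(" ")
        fields[tag] = value.strip()
    return Compound(
        name=fields["NAME"],
        molecular_weight=float(fields["MW"]),
        phosphate_count=int(fields["PHOS"]),
    )


atp = parse_compound(RAW_RECORD)
# The caller works with numbers, never with the raw tagged text.
total_mass = 2 * atp.molecular_weight
```

The parsing step happens once, inside parse_compound; everything downstream sees only numbers and names, which is the sense in which the format is concealed from the end user.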

OMG's approach to objectifying bioinformatics methods relies on CORBA middleware, the behind-the-scenes bit and byte brokerage that mediates between different languages, information sources, and analysis tools, each with its own way of representing data and methods. Benton explains, "The goal is to build bioinformatics software the same way we build computers. As long as your PC has a PCI bus, you buy a PCI hard drive and plug it in and it works. You don't worry about low-level details of how to address sectors or the order that bytes come off the drive."
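The plug-in idea Benton describes can be sketched in a few lines of Python. This is an in-process analogy only: an abstract base class stands in for an interface definition, and there is no CORBA, no object request broker, and no network involved. The class and function names (SequenceSource, FlatFileSource, InMemorySource, gc_fraction) are assumptions made up for the example.

```python
from abc import ABC, abstractmethod


class SequenceSource(ABC):
    """Hypothetical interface: the only thing client code depends on."""

    @abstractmethod
    def get_sequence(self, identifier: str) -> str:
        """Return the sequence for an identifier as an uppercase string."""


class FlatFileSource(SequenceSource):
    """Stores records as 'id<TAB>sequence' lines (an invented layout)."""

    def __init__(self, text: str) -> None:
        self._records = dict(
            line.split("\t") for line in text.splitlines() if line
        )

    def get_sequence(self, identifier: str) -> str:
        return self._records[identifier]


class InMemorySource(SequenceSource):
    """Stores lowercase sequences in a dict; normalizes case on the way out."""

    def __init__(self, records: dict) -> None:
        self._records = dict(records)

    def get_sequence(self, identifier: str) -> str:
        return self._records[identifier].upper()


def gc_fraction(source: SequenceSource, identifier: str) -> float:
    """Analysis code written once against the interface, not the storage."""
    seq = source.get_sequence(identifier)
    return sum(base in "GC" for base in seq) / len(seq)


flat = FlatFileSource("seq1\tGATTACA\nseq2\tGGCC\n")
mem = InMemorySource({"seq1": "gattaca"})
```

Calling gc_fraction(flat, "seq1") and gc_fraction(mem, "seq1") gives the same answer even though the two sources store their data differently, which is the software counterpart of plugging any PCI drive into a PCI bus.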

Birney agrees that the technology choice for standards is reasonably clear: CORBA interfaces for methods and XML, the more flexible cousin of HTML, for data formats, implemented in an open source environment. Welcoming in those who wish to contribute expertise, code, and bridging systems, Birney aims to coordinate between legacy data formats such as EMBL/GenBank and newer ones including GAME-XML, a syntax for exchanging genomic annotations that was originally put together so the Drosophila and Celera Genomics Group databases could talk to each other.
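Part of the appeal of XML as an exchange format is that a generic parser can read any well-formed document without format-specific code. The sketch below uses Python's standard xml.etree.ElementTree module on a small annotation record. The element and attribute names (annotations, gene, span, start, end, strand) are invented for illustration and are not the GAME-XML schema; the inclusive start and end coordinates are likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation document; not the GAME-XML schema.
ANNOTATION_XML = """\
<annotations>
  <gene id="g1" name="example_gene">
    <span start="100" end="250" strand="+"/>
    <span start="400" end="520" strand="+"/>
  </gene>
</annotations>
"""


def span_lengths(xml_text: str) -> dict:
    """Map each gene id to the lengths of its annotated spans.

    Assumes start and end are inclusive coordinates.
    """
    root = ET.fromstring(xml_text)
    lengths = {}
    for gene in root.findall("gene"):
        lengths[gene.get("id")] = [
            int(span.get("end")) - int(span.get("start")) + 1
            for span in gene.findall("span")
        ]
    return lengths


result = span_lengths(ANNOTATION_XML)
```

Nothing in span_lengths deals with whitespace, line layout, or column positions; the parser handles the syntax, and the program addresses the data by name.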

Despite efforts to establish order, though, nonconformists still abound, such as Peter Karp's MetaCyc and EcoCyc pathway databases, which are built on a syntax called Ocelot. Gene expression, too, has been marked by a cacophony of vendor-specific formats, such as Affymetrix and Molecular Dynamics' GATC. The noninteroperability of microarray systems has complicated life for scientists faced with terabytes of new data, and according to Alvis Brazma, who works with Birney at EBI, there is now "overwhelming support" in the expression community for standardization. At a meeting ("Microarray Gene Expression Databases-2") in Heidelberg, Germany, in May attended by 250 players in the field, five work groups (data exchange format, annotations, ontology, data normalization, future user) were established. A follow-up meeting will be held at Stanford University next March.

Nomenclature is another area where anarchy reigns. The fly community is famous for gene names such as saxophone, lush, and son of sevenless. Though they sound dreamed up in the wee hours at a frat party, these names signify a deeper problem. Benton points out that gene names try to represent some knowledge of what the gene is, or what its product does, and that knowledge is always changing. "Sometimes it may really make more sense to call a gene sonic hedgehog than make a bad guess as to its function in the cell," he says.

Christopher Hogue, who was an NCBI software developer and now heads a protein-folding lab at the Samuel Lunenfeld Research Institute in Toronto, says inconsistent, ambiguous, and nonunique names are an "awesome" obstacle. But he believes that the name game is a "social problem" that is not solvable at the level of database architecture. "We forgot to invite engineering practice into our new discipline," says Hogue, explaining that developing and adhering to standards requires an engineer's methodical mindset, which does not jibe particularly well with the ad hoc solutions of the rapidly expanding field of bioinformatics.

Nevertheless, all is not lost, says Benton. One group that attempts to bring some order to nomenclature is the Bio Ontologies consortium, cochaired by Robin McEntire at SmithKline Beecham and Karp at SRI International, Menlo Park, Calif. A related group is the Gene Ontology Consortium, including Michael Ashburner from EBI, Suzanna Lewis from FlyBase, a Drosophila database at the University of California, Berkeley, and Mike Cherry and David Botstein from the SacDB yeast database at Stanford. Their goal is to come up with a consistent terminology for cellular function so that the same keyword will work in all gene databases. Probably, though, there will always be some degree of conflict between scientists' natural inclination to do things differently and the prophetic vision of one language and few words.

Potter Wickware is a freelance science writer in Mill Valley, Calif.

Standards-related Web Sites

EBI's Microarray Gene Expression Database Meetings
www.ebi.ac.uk/microarray/MGED/index.html

Microarray activities at EBI
www.ebi.ac.uk/microarray

Ewan Birney's Bioperl project
bio.perl.org

OMG's Life Sciences Research groups
www.omg.org/homepages/lsr/wgs.html

Peter Karp's XOL (Ontology Exchange Language)
www.ai.sri.com/~pkarp/xol


© Copyright 2000, The Scientist, Inc. All rights reserved.