This document discusses experiences applying logic programming techniques in bioinformatics. It describes Obol, a system that used definite clause grammars to parse biological terms, and Blipkit, a reusable bioinformatics toolkit built for SWI-Prolog. Blipkit includes domain models, I/O modules, and tools for integrating with relational databases and web services. The document discusses applications of logic programming for tasks like genome inference, phenotype matching, and consistency checking biological data. It evaluates different logic programming approaches for representing genomic data and rules.
Experiences with logic programming in bioinformatics
1. Experiences using logic programming in bioinformatics Chris Mungall Berkeley Bioinformatics and Ontologies Group http://berkeleybop.org Lawrence Berkeley National Laboratory ICLP 2009
2. Outline Biology and biological data integration: a brief introduction Obol: First experiences applying LP Blipkit: a reusable bioinformatics developer’s toolkit Modular structure I/O and relational database connectivity Some applications of Blipkit and LP Genes and genomics Phenotype matching Web applications Conclusions Where next? Some recommendations for the LP community
3. The promise and challenges of biological research Why study biological systems? Because they’re fascinating Improve health Improve the environment BUT: Biology is hard Biological systems are extremely diverse Biology deals with phenomena at multiple levels of granularity There is a deluge of data Bioinformatics Biology as an information science Computational methods vital to understanding
8. bio-databases 1200 Biological Databases published in Nucleic Acids Research many more unpublished many of these are database federations (e.g. Ensembl) Heterogeneous systems Storage mechanism: Relational XML Flat files Ad-hoc, semi-structured, natural language Limited APIs lack of standards limited query expressivity Poorly integrated Limited integration beyond identifier cross-references Users must manually integrate Bioinformatics runs on perl glue (Figure labels: metabolic pathways, mutants, genes, fruit flies, tumors)
9. Data interrogation and discovery Sample of tasks Find mutations in regions upstream of neurotransmitter-producing genes Find drug targets or animal models for neurodegenerative diseases What biological pathways are enriched in high acidity environments? Answering each of these is difficult Manual aggregation from lots of databases Various kinds of inference required
11. Obol: First experience with LP in bioinformatics Problem Many existing bio-ontologies were in fact more like terminologies Basic axioms, is_a hierarchies Deeper logical structure implicit in terms Long noun phrases, recursively composed “regulation of transcription during G1 phase of mitotic cell cycle” Existing solutions (2004) Take advantage of semi-controlled syntax of terms Parse using ad-hoc regular expressions Influence of perl in bioinformatics! But context-free grammars (at least) were required
12. A better solution: Definite Clause Grammars Obol: A collection of domain specific DCGs Significant improvement over Perl regular expressions Declarative More expressive Integration with simple reasoning Bi-directional: can be used for term generation from logical expressions
13. Example process grammar
  process(P) --> regulation(P) | specification(P) | transcription(P) | ...
  process(P and during(W)) --> process(P), [during], process(W).
  process(P and part_of(W)) --> process(P), [of], process(W).
  regulation(regulates(P)) --> [regulation,of], process(P).
  specification(specifies(C)) --> [specification,of], cell(C).
  cell(C and part_of(O)) --> organ(O), cell(C).
Example parses:
  “regulation of transcription during G1 phase of mitotic cell cycle”
    regulates(transcription) and during(g1_phase and part_of(mitosis))
  “regulation of transcription from RNA polymerase II promoter involved in ventral spinal cord interneuron specification”
    regulates(transcription and has_signal(rna_pol_ii)) and part_of(specifies(interneuron and part_of(ventral_spinal_cord)))
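A minimal runnable sketch of the grammar idea, assuming SWI-Prolog. It is not the Obol grammar itself: the predicate names tokens/2, token_atom/2 and parse_label/2, the three-word vocabulary, and the priority chosen for the `and` operator are all assumptions made for illustration. The rules as printed on the slide are left-recursive, which loops under plain Prolog execution, so this sketch parses a core term first and then attaches modifiers.

```prolog
% 'and' is not a standard operator; the priority 200 is an arbitrary choice.
:- op(200, xfy, and).

% tokens(+Label, -Tokens): lower-case a label and split it on spaces.
tokens(Label, Tokens) :-
    string_lower(Label, Lower),
    split_string(Lower, " ", "", Parts),
    maplist(token_atom, Parts, Tokens).

token_atom(String, Atom) :- atom_string(Atom, String).

% process//1: a core process followed by zero or more modifiers.
% Parsing the core first avoids the left recursion of the slide's rules.
process(P) --> core(P0), modifiers(P0, P).

modifiers(P0, P0 and during(W))  --> [during], process(W).
modifiers(P0, P0 and part_of(W)) --> [of], process(W).
modifiers(P, P) --> [].

% A tiny hypothetical vocabulary; a real grammar would look terms up
% in an ontology instead of listing them here.
core(transcription) --> [transcription].
core(g1_phase)      --> [g1, phase].
core(mitosis)       --> [mitotic, cell, cycle].
core(regulates(P))  --> [regulation, of], core(P).

% parse_label(+Label, -Term): parse a whole label into a logical term.
parse_label(Label, Term) :-
    tokens(Label, Tokens),
    phrase(process(Term), Tokens).
```

Calling parse_label/2 on the first example label from the slide yields the same term the slide shows. Restricting regulation to a core term is one way to pick that reading first; a broader grammar would be ambiguous here and would need an explicit disambiguation policy.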
20. Results Obol grammars applied successfully to generate axioms for multiple ontologies particularly the Gene Ontology Still used frequently Lessons learned Small amount of basic LP goes a long way LP techniques not widely known in bioinformatics Different LP systems have different strengths Choosing between them is hard – and frustrating
21. Could LP prove as successful in the wider bioinformatics arena? Rule-based analysis pipelines prolog > make Integration of ontology reasoning and database queries prolog > datalog > sql Pathways graphs, ASP Genomics Linear transformations, CLP Phylogenetics operations on trees
33. Anatomy of a blip domain package Model(s) of the domain dependencies to other domain modules extensional and intensional predicates I/O parsers/writers for small subset of bioinformatics file formats DCGs or external perl translators for common XML schemas Native prolog serialization of model ‘for free’ Web UI Bridges Relational Other prolog models Ontology models
34. Domain model modules A model consists of extensional + intensional predicates Extensional predicates Unit clauses / facts - Asserted and/or compiled from fact files Akin to relational tables Intensional predicates Declarative: No I/O side effects Prolog has no built in extensional/intensional distinction All clauses treated equally Facts conventionally declared dynamic/1 and multifile/1 Some metamodeling is useful Easy to roll own A standard metamodel module would be useful optional type system + relational DDL style constraints Works as documentation
35. Example from systems biology model
  :- module(sb_db,[ reaction_product/2, reaction_reactant/2, reaction_modifier/2, derivation_link/3, …]).
  :- use_module(bio(dbmeta)). % metamodel
  %% reaction_product(?R,?P) is nondet
  % relation between a biochemical reaction and a molecular constituent produced in the reaction
  :- extensional(reaction_product/2).
  %% reaction_reactant(?R,?P) is nondet
  % relation between a biochemical reaction and a molecular constituent that is consumed in the reaction
  :- extensional(reaction_reactant/2).
  %% reaction_modifier(?R,?P) is nondet
  % relation between a biochemical reaction and a molecular constituent that plays a role in the process but is unmodified
  :- extensional(reaction_modifier/2).
  % --- INTENSIONAL PREDICATES ---
  %% derivation_link(?Input,?Output,?Via)
  % two species directly linked via a connecting reaction (excludes modifiers)
  derivation_link(Input,Output,R) :- reaction_reactant(R,Input), reaction_product(R,Output).
  %...[snip]…
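A sketch of how an extensional declaration could be rolled by hand in SWI-Prolog. This is not Blipkit's dbmeta module: declare_extensional/1 and extensional_predicate/1 are hypothetical names, and the sketch only covers the two behaviours the talk mentions, namely marking the predicate dynamic and multifile and recording it for introspection.

```prolog
% Hypothetical metamodel table: one fact per declared extensional predicate.
:- dynamic extensional_predicate/1.

% declare_extensional(+Name/Arity): mark a fact predicate as dynamic and
% multifile, and record it so tools can later list or save the model.
declare_extensional(Name/Arity) :-
    dynamic(Name/Arity),
    multifile(Name/Arity),
    (   extensional_predicate(Name/Arity)
    ->  true
    ;   assertz(extensional_predicate(Name/Arity))
    ).

:- declare_extensional(reaction_reactant/2).
:- declare_extensional(reaction_product/2).

% Intensional predicate from the slide, defined over the declared facts.
derivation_link(Input, Output, R) :-
    reaction_reactant(R, Input),
    reaction_product(R, Output).
```

Because the fact predicates are declared dynamic before any data is loaded, a query such as derivation_link(X,Y,R) simply fails on an empty database instead of raising an existence error, and extensional_predicate/1 can be enumerated to see which tables the model expects.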
36. Integrating with relational databases Most biological data stored in relational databases Many provide open SQL ports for distributed queries RDBs scale well with large quantities of data …but RDBs lack necessary deductive capabilities Expressivity hierarchy: FOL > pure Prolog > Datalog > relational model Using prolog with RDBs should be easy… right?
37. sql_compiler Given a mapping to a relational schema: rewrites prolog terms as SQL queries Used in conjunction with db connectivity module History Draxler, 1992 Source forked, modified versions available with various prologs Blip includes extensions to Rewrite sub-optimal queries Rewrite non-recursive prolog clauses Integrate with SWI ODBC
38. Example query rewriting
  program:
    ?- sqlbind(sb_db:all, mydb).
    derivation_link(Input,Output,R) :- reaction_reactant(R,Input), reaction_product(R,Output).
  call goal:
    ?- derivation_link(X,Y,R).
  mapping:
    reaction_reactant(R,P) <- reac_in(R,P).
    reaction_product(R,P) <- reac_out(R,P).
  schema metadata:
    relation(reac_in,2). attribute(1,reac_in,reac_id,int). attribute(2,reac_in,input_id,int).
    relation(reac_out,2). attribute(1,reac_out,reac_id,int). attribute(2,reac_out,output_id,int).
  query rewriting (program + mapping + schema metadata) produces SQL, executed via odbc.pl:
    SELECT reac_in.reac_id, reac_in.input_id, reac_out.output_id FROM reac_in, reac_out WHERE reac_in.reac_id=reac_out.reac_id;
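A toy illustration of the rewriting step, assuming SWI-Prolog. It is not Draxler's sql_compiler and not Blip's extension of it: column/3, goal_bindings/2, split/4, seen/3 and conj_sql/2 are hypothetical names, and the sketch handles only a conjunction of table goals whose arguments are all variables (no constants, no negation, no table aliases for self-joins). A variable that occurs in two goals becomes a join condition.

```prolog
% Hypothetical schema metadata: column(Table, Position, ColumnName).
column(reac_in,  1, reac_id).
column(reac_in,  2, input_id).
column(reac_out, 1, reac_id).
column(reac_out, 2, output_id).

% goal_bindings(+Goal, -Pairs): pair each argument variable with the
% qualified column it is read from.
goal_bindings(Goal, Pairs) :-
    Goal =.. [Table|Args],
    args_bindings(Args, 1, Table, Pairs).

args_bindings([], _, _, []).
args_bindings([V|Vs], N, Table, [V-Col|Ps]) :-
    column(Table, N, Name),
    format(atom(Col), "~w.~w", [Table, Name]),
    N1 is N + 1,
    args_bindings(Vs, N1, Table, Ps).

% split(+Pairs, +Seen, -Select, -Where): the first occurrence of a variable
% is selected; every later occurrence becomes an equality condition.
split([], _, [], []).
split([V-Col|Ps], Seen, Select, Where) :-
    (   seen(V, Seen, First)
    ->  format(atom(Cond), "~w=~w", [First, Col]),
        Select = Select1, Where = [Cond|Where1], Seen1 = Seen
    ;   Select = [Col|Select1], Where = Where1, Seen1 = [V-Col|Seen]
    ),
    split(Ps, Seen1, Select1, Where1).

% seen(+Var, +Seen, -Col): look a variable up by identity, not unification.
seen(V, [V0-Col0|Rest], Col) :-
    (   V == V0 -> Col = Col0 ; seen(V, Rest, Col) ).

% conj_sql(+Goals, -SQL): translate a list of table goals to one SELECT.
conj_sql(Goals, SQL) :-
    maplist(goal_bindings, Goals, PairLists),
    append(PairLists, Pairs),
    split(Pairs, [], Select, Where),
    findall(T, (member(G, Goals), functor(G, T, _)), Tables),
    atomic_list_concat(Select, ', ', S),
    atomic_list_concat(Tables, ', ', F),
    (   Where == []
    ->  format(atom(SQL), "SELECT ~w FROM ~w", [S, F])
    ;   atomic_list_concat(Where, ' AND ', W),
        format(atom(SQL), "SELECT ~w FROM ~w WHERE ~w", [S, F, W])
    ).
```

For the body of derivation_link/3 after applying the mapping, conj_sql([reac_in(R,I), reac_out(R,O)], SQL) builds a two-table join on reac_id. A real compiler additionally unfolds the Prolog rules through the mapping, handles constants and comparison operators, and optimises the result.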
39. Obtaining data from web services Many large bioinformatics data providers provide RESTful APIs NCBI caBIG SWI libraries used http_client sgml (for parsing XML payloads) XML -> Models Direct translation of sgml too low level XSLT-inspired prolog template-oriented processing language Application: ontology enhanced search term expansion E.g. “find all genes implicated in neurodegenerative disease” ‘parkinsons’ OR ‘alzheimers’ OR …
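The search term expansion mentioned on this slide can be sketched as follows, assuming SWI-Prolog. The subclass/2 facts are a made-up fragment of a disease ontology, and subclass_rt/2 and expanded_query/2 are hypothetical names; the sketch only builds the OR-query string and does not call any web service.

```prolog
% Hypothetical is_a facts: subclass(Child, Parent).
subclass(parkinsons, neurodegenerative_disease).
subclass(alzheimers, neurodegenerative_disease).
subclass(early_onset_alzheimers, alzheimers).

% subclass_rt(?Class, ?Ancestor): reflexive transitive closure of subclass/2.
% Terminates as long as the hierarchy is acyclic.
subclass_rt(C, C).
subclass_rt(C, A) :-
    subclass(C, P),
    subclass_rt(P, A).

% expanded_query(+Term, -Query): the term and all its descendants,
% sorted and joined with OR.
expanded_query(Term, Query) :-
    setof(C, subclass_rt(C, Term), Terms),
    atomic_list_concat(Terms, ' OR ', Query).
```

The resulting atom can then be passed as the query parameter of a REST search request, so that a search for the general class also returns records annotated with its more specific subclasses.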
40. Applications of Blipkit and LP techniques Genomics and DNA sequences Deduction of implicit information Consistency checking of genome datasets Phenotype matching Finding similarities of mutational effects
41. Genome inference Deluge of genomic data Cost per genome decreasing Soon we will all know our genome sequence But what does it mean? Effective use of genomics data relies on deductive inference Many rules are logical: genome calculus Currently encoded using ad-hoc imperative code Probabilistic inference also useful But must be built on top of the logical inference
42. DNA human chromosome 1: 247m base pairs, 4220 genes Entire genome: 3×10^9 bps, 20k genes T A G C
43. DNA human chromosome 1: 247m base pairs, 4220 genes Entire genome: 3×10^9 bps, 20k genes T A G C Gene expression: transcription splicing translation
44. Transcription A subsequence of a DNA sequence is transcribed to an RNA sequence regulated by sequences called promoters and enhancers
47. Genomics databases Genome databases are important for biomedicine understanding evolution at a molecular level Problem: genome databases are incomplete stating all implicit features leads to redundancy integration and complex queries difficult ad-hoc rules embedded in imperative code Problem: genome databases are inconsistent Different interpretation of gene, exon, UTR etc
48. Solution: Sequence Ontology + Deductive Database The Sequence Ontology standardizes sequence terms Additional axioms are being added Encoding genome calculus Genome relations based on Allen Interval Algebra Can be used in conjunction with a deductive genome database consistency checking does this genome dataset make sense? inference and querying what entities are present in region X?
49. Sequence relationship predicates based on Allen Interval Algebra: no recursion; conjunction of binary terms; uses arithmetic (for efficiency). Extensions: strands, circular genomes
  upstream_of(X,Y) :- has_end(X,XE), has_start(Y,YS), XE < YS.
  ?- upstream_of(X,exon3).
  X = exon1 ;
  X = exon2
  (Figure: exon1 to exon5 laid out along a sequence, exon3 highlighted)
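A small runnable version of the interval predicates, assuming SWI-Prolog. The feature/3 coordinates are invented for illustration, and contains/2 is an added relation in the same style as the slide's upstream_of/2; strands and circular genomes are not handled.

```prolog
% feature(?Id, ?Start, ?End): hypothetical coordinates with Start =< End.
feature(exon1, 100, 200).
feature(exon2, 300, 400).
feature(exon3, 500, 600).
feature(exon4, 700, 800).
feature(exon5, 900, 1000).
feature(gene1, 100, 1000).

has_start(F, S) :- feature(F, S, _).
has_end(F, E)   :- feature(F, _, E).

% upstream_of(?X, ?Y): X ends before Y starts (Allen's 'before').
upstream_of(X, Y) :-
    has_end(X, XE),
    has_start(Y, YS),
    XE < YS.

% contains(?X, ?Y): the interval of Y lies within that of a different X.
contains(X, Y) :-
    has_start(X, XS), has_end(X, XE),
    has_start(Y, YS), has_end(Y, YE),
    X \== Y,
    XS =< YS,
    YE =< XE.
```

With these facts, upstream_of(X, exon3) enumerates exon1 and exon2, and contains(gene1, Y) enumerates the five exons. Each relation is a non-recursive conjunction with arithmetic comparisons, which is what makes this class of rule straightforward to translate to SQL.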
51. Possibility of recursion through negation
  exon(exon1).
  exon(exon2).
  has_end(exon1,1000,t1).
  has_start(exon2,2000,t2).
  ?- intron(X).
  X = i(t1,1000,2000)
  (Figure: exon1 and exon2 on t1)
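One way an intron rule with negation could look, assuming SWI-Prolog. This is a guess at the shape of the rule, not the Sequence Ontology axiom: it treats the third argument of has_start/3 and has_end/3 as the transcript, places all example exons on t1, and uses invented coordinates. An intron is inferred between the end of one exon and the start of another when no exon of the same transcript starts in between.

```prolog
% Hypothetical exons on transcript t1.
exon(exon1).
exon(exon2).
exon(exon3).

has_start(exon1,  500, t1).
has_start(exon2, 2000, t1).
has_start(exon3, 3000, t1).

has_end(exon1, 1000, t1).
has_end(exon2, 2500, t1).
has_end(exon3, 3500, t1).

% intron(?I): I = i(Transcript, Start, End) spans the gap between two
% exons of the same transcript with no exon starting inside the gap.
% The unnamed intron is represented by a compound term, not by a new fact.
intron(i(T, S, E)) :-
    exon(X), exon(Y),
    has_end(X, S, T),
    has_start(Y, E, T),
    S < E,
    \+ ( exon(Z), has_start(Z, ZS, T), ZS > S, ZS < E ).
```

The negation is harmless here because exon/1 is given as facts. The recursion through negation named on the slide appears once exons are themselves inferred from introns, at which point the rule set is no longer stratified.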
52. OWL implementation Many axioms cannot be expressed in OWL Interval relations – no arithmetic in OWL option 1: use SWRL option 2: enumerate all base pairs and use property chain axioms Cannot infer properties of unnamed individuals E.g. introns from exons Cyclic structures cannot be described Requires Description Graph extension Open World Assumption useful for semantic web CWA is more convenient for genomics
53. Deductive database implementation Methods: Convert sequence ontology OWL->DLP via Thea2 Manually edit Add rules that cannot be expressed in OWL Tested on XSB and Yap requires tabling Results Currently scales to small regions more debugging required difficult to eliminate unstratified negation
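Why tabling is required can be seen on a small cyclic example. The sketch below uses the `:- table` directive, which is also available in recent SWI-Prolog versions in addition to XSB and Yap; next_feature/2 and downstream/2 are hypothetical names describing adjacency on a circular plasmid.

```prolog
% With tabling, this left-recursive rule over cyclic data terminates;
% without the directive, plain SLD resolution would loop forever.
:- table downstream/2.

% next_feature(?F1, ?F2): hypothetical adjacency on a circular genome.
next_feature(ori, gene_a).
next_feature(gene_a, gene_b).
next_feature(gene_b, ori).

% downstream(?X, ?Y): Y is reachable from X by following next_feature/2.
downstream(X, Y) :- downstream(X, Z), next_feature(Z, Y).
downstream(X, Y) :- next_feature(X, Y).
```

Querying downstream(ori, Y) returns each of the three features exactly once, including ori itself because the genome is circular. Tabling alone does not resolve unstratified negation, which is the harder problem reported on the slide.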
54. Disjunctive datalog implementation Adds: Constraints Disjunctions in rule heads Implementation DLV-Complex : allows functions in arguments Program written from scratch: Rules must be ‘safe’ Results Scales over small regions Useful for detecting inconsistencies in data More research needed More efficient programs Use of relational database backend Further exploration of ASP semantics Genomic rules have many exceptions
55. Prolog implementation Removes: rules that cause cycles with backtracking Implementation Optional use of Nested Containment List library (C + SWI FLI) Results Results can be incomplete due to missing rules E.g. intron :- exon, but not exon :- intron Ruleset can be tailored for dataset Scales over medium sized datasets
56. Hybrid Prolog-Relational implementation Uses same program as prolog implementation Relational database stores facts (extensional) can be distributed Uses sql_compiler + mappings to genomics databases Ensembl Chado Non-recursive prolog rules dynamically translated to complex SQL Recursive subclass rules translated by query compiler using UNIONs precomputed and stored in relational database Scales to full genomes
57. LP for genomics: conclusions No one paradigm is perfect Many axioms cannot be expressed in OWL but tools are good Disjunctive Datalog good for consistency checking in small regions More research required on efficiency of tabling solution, ASPs WAM solution most efficient Manually rewriting programs is tedious! Hybrid solutions useful RDBs for asserted facts
58. Application: match.com for diseases Organisms have phenotypes characteristics under the control of the genes of that organism Related genes can have similar phenotypic effects even when the least common ancestor of the gene is 500m years ago Finding these genes can help understand disease evolution
60. Semantic Similarity Given a collection of features $F = \{f_1, f_2, \dots\}$, attributes $A = \{a_1, a_2, \dots\}$, and feature-attribute mappings $a \subseteq F \times A$, where $a(f)$ denotes the set of attributes of feature $f$. For any feature pair $x, y$, calculate: Jaccard coefficient $|a(x) \cap a(y)| / |a(x) \cup a(y)|$; maximum IC, where $IC(a) = -\log_2 p(a)$ and $maxIC(x,y) = \max\{IC(a) : a \in a(x) \cap a(y)\}$
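The maximum-IC measure can be sketched in SWI-Prolog as below. The slide does not define p(a); this sketch assumes it is the fraction of features annotated with a. The feature_attributes/2 facts and the predicate names attribute_ic/2 and max_ic/3 are invented for illustration.

```prolog
% Hypothetical annotations: feature_attributes(Feature, Attributes).
feature_attributes(human1,     [phenotype, muscle_phenotype, decreased_size]).
feature_attributes(zebrafish7, [phenotype, muscle_phenotype]).
feature_attributes(mouse3,     [phenotype, eye_phenotype]).
feature_attributes(fly9,       [phenotype]).

% attribute_ic(+A, -IC): IC(a) = -log2 p(a), with p(a) assumed to be the
% fraction of features annotated with a.
attribute_ic(A, IC) :-
    aggregate_all(count, feature_attributes(_, _), Total),
    aggregate_all(count, (feature_attributes(_, As), memberchk(A, As)), N),
    N > 0,
    IC is -log(N / Total) / log(2).

% max_ic(+X, +Y, -Max): highest IC among the attributes shared by X and Y.
% Fails when the two features share no attribute.
max_ic(X, Y, Max) :-
    feature_attributes(X, AX),
    feature_attributes(Y, AY),
    intersection(AX, AY, Shared),
    Shared \== [],
    maplist(attribute_ic, Shared, ICs),
    max_list(ICs, Max).
```

In this toy data set an attribute carried by every feature has an IC of zero, so two features that share only such an attribute score zero, while sharing a rarer attribute raises the score.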
61. SWI-Prolog implementation Uses GMP: normal prolog programs have unbounded integer arithmetic, which allows fast bitwise implementations of set intersection/union. Encode feature attribute lists as integers: $m : A \to \{0, \dots, |A|-1\}$, $a_i(f) = \sum_{a \in a(f)} 2^{m(a)}$. Set intersection and union computed using bitwise and/or. Fast implementation of Jaccard coefficient: J is popcount(A1 /\ A2) / popcount(A1 \/ A2)
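A minimal sketch of the bit-vector encoding, assuming SWI-Prolog with its popcount arithmetic function. The attribute_index/2 table is a hypothetical four-attribute vocabulary, and attributes_bitvector/2, set_bit/3 and jaccard/3 are illustrative names rather than Blipkit predicates.

```prolog
% attribute_index(?Attribute, ?Bit): the mapping m : A -> {0..|A|-1}.
attribute_index(atrophied,        0).
attribute_index(dystrophic,       1).
attribute_index(decreased_size,   2).
attribute_index(muscle_phenotype, 3).

% attributes_bitvector(+Attributes, -Int): Int has bit m(a) set for every
% attribute a. For distinct attributes this equals the sum of 2^m(a).
attributes_bitvector(Attributes, Int) :-
    foldl(set_bit, Attributes, 0, Int).

set_bit(A, Acc0, Acc) :-
    attribute_index(A, Bit),
    Acc is Acc0 \/ (1 << Bit).

% jaccard(+A1, +A2, -J): |intersection| / |union| on two bit vectors.
% Fails if both sets are empty, where the coefficient is undefined.
jaccard(A1, A2, J) :-
    Union is popcount(A1 \/ A2),
    Union > 0,
    J is popcount(A1 /\ A2) / Union.
```

Because integers are unbounded, the same code works unchanged when the attribute vocabulary has tens of thousands of entries; each comparison is then a pair of big-integer bitwise operations followed by two population counts.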
62. Similarity metrics + reasoning Attributes are description logic class expressions; rarely exact matches across species. a(human1) = {dystrophic ∩ ∃quality_of.arm_muscle}; a(zebrafish7) = {atrophied ∩ ∃quality_of.pectoral_fin_muscle}; a(human1) ≠ a(zebrafish7), and a(human1) ∩ a(zebrafish7) = {}
63. Use reasoning to find subsumer Find Least Common Ancestor expression: typically a class expression, not a named class. a(human1) = {dystrophic ∩ ∃quality_of.arm_muscle} and a(zebrafish7) = {atrophied ∩ ∃quality_of.pectoral_fin_muscle} are both subsumed by decreased_size ∩ ∃quality_of.muscle_of_upper_limb, so a*(human1) ∩ a*(zebrafish7) = {decreased_size ∩ ∃quality_of.muscle_of_upper_limb}
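A simplified sketch of the subsumer search, assuming SWI-Prolog. It reads a* as the attribute set closed under subsumption (an assumption, since the slide does not define it) and replaces the class expressions with named classes linked by hypothetical is_a/2 facts; in the actual approach the subsumption links would come from a description logic reasoner.

```prolog
% Hypothetical named stand-ins for the class expressions on the slide.
is_a(dystrophic_arm_muscle,           decreased_size_upper_limb_muscle).
is_a(atrophied_pectoral_fin_muscle,   decreased_size_upper_limb_muscle).
is_a(decreased_size_upper_limb_muscle, abnormal_muscle).

attribute(human1,     dystrophic_arm_muscle).
attribute(zebrafish7, atrophied_pectoral_fin_muscle).

% subsumed_by(?C, ?A): reflexive transitive closure of is_a/2.
subsumed_by(C, C).
subsumed_by(C, A) :- is_a(C, P), subsumed_by(P, A).

% closure(+F, -Set): a*(F), the attributes of F plus all their subsumers.
closure(F, Set) :-
    setof(A, C^(attribute(F, C), subsumed_by(C, A)), Set).

% common_subsumers(+X, +Y, -Common): a*(X) intersected with a*(Y).
common_subsumers(X, Y, Common) :-
    closure(X, SX),
    closure(Y, SY),
    ord_intersection(SX, SY, Common).

% least_common_subsumers(+X, +Y, -LCS): common subsumers that have no
% more specific common subsumer below them.
least_common_subsumers(X, Y, LCS) :-
    common_subsumers(X, Y, Common),
    exclude(has_more_specific_in(Common), Common, LCS).

has_more_specific_in(Common, A) :-
    member(D, Common),
    D \== A,
    subsumed_by(D, A).
```

For human1 and zebrafish7 the plain attribute sets are disjoint, but their closures share two classes, and the filter keeps only the more specific one. Similarity metrics such as the Jaccard coefficient can then be computed over the closures instead of the raw attribute sets.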
64. Implementation: Uses Thea2 Thea2 is a prolog package for OWL2 http://github.com/vangelisv/thea reads/writes RDF/XML OWL-XML Native prolog form Description Logic Programs (DLPs) Reasoning strategies Prolog DL reasoners (via JPL/OWLAPI) SQL DB + forward chaining
65. Phenotype matching: Results Proof of concept on 10 human disease genes publication forthcoming Currently applying to neurodegenerative diseases Funding to extend to all Mendelian diseases
66. Web Applications http://berkeleybop.org/obo Web interface to Open Bio Ontologies Implemented in perl + SWI-Prolog Prototype for future development SWI-Prolog Production version in perl and/or java
67. Experiences using LP for bioinformatics: conclusions A little bit of LP goes a long way The theory-application gap is largely untapped A variety of LP paradigms are useful ASP, datalog, DLs, prolog, ILP, … Interoperation can be hard! LP for ‘real world’ applications It is possible! Declarative approach arguably superior Web/database applications are a sweet spot We need to show more success stories ..and to dispel myths
68. Recommendation: make it easier for users Documentation: Unify community knowledge in a single wiki Create a general LP mail list c.f. OWL/SemWeb community Tools: Program analysis Lint-like tool for tabled prologs, ASP Visualization Libraries CPAN for Prolog
69. Recommendation: make it open-source Why Encourages collaboration Bioinformaticians love open source The people who fund bioinformaticians love open source Open source can still generate revenue How Deposit code in open source code repositories github, sourceforge, googlecode, etc Embrace Web 2.0 blog it, put it on a wiki
70. Recommendation: interoperate with RDBs Why? RDBs and LP should be a natural match Application developers are conservative and familiar with RDBs lightweight in-memory embedded RDBs are becoming more popular How: Hide LP systems behind pseudo-SQL interface SQL queries and DDL translated behind the scenes. c.f. sql_compiler Users can use native LP syntax and semantics as they feel comfortable Embed LP systems directly in RDBs E.g. PostgreSQL extensions Improve prolog->SQL interfaces Common API c.f. JDBC (Java), DBI (Perl)
71. Recommendation: A unified API to all LP systems Use case: calling LP system from host language (java, perl, ruby, even other prolog) Problem: No standardization amongst APIs Analogous problem: RDB APIs Solved: a 20th century problem Proposal: Common REST interface Single interface per host language
72. Interoperation between LP systems LP systems (ILP, ASP, Prolog, …) differ in whether they accept: Foo(x). 'Foo'(x). 'foobar'(x). foo('xy'). foo("xy"). Non-prolog systems should: Adhere to ISO standard for intersection with pure prolog Or at least provide ISO mode Also: ISO Common Logic W3C RIF
74. Robot scientist The Automation of Science King et al. Science 3 April 2009: 85-89 DOI: 10.1126/science.1165620 http://news.bbc.co.uk/2/hi/science/nature/7979113.stm
75. Acknowledgments Vangelis Vassiliadis (Thea) Stephen Veitch (intervaldb) Christoph Draxler (sql_compiler) Jan Wielemaker + SWI Mail list Paulo Moura Vítor Santos Costa + Yap developers Terrance Swift + XSB developers
Editor's notes
data: curse and a blessing
typically download flat files of data and manually integrate. References: L. Stein (1996), "How Perl saved the human genome project", The Perl Journal 1(0001). L. Stein (2002), "Creating a bioinformatics nation", Nature 417(6885):119-120. L. D. Stein (2003), "Integrating biological databases", Nature Reviews Genetics 4(5):337-345.
Damascene conversion
Reference: Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., Fuellen, G., Gilbert, J. G., Korf, I., Lapp, H., Lehvaslaiho, H., Matsalla, C., Mungall, C. J., Osborne, B. I., Pocock, M. R., Schattner, P., Senger, M., Stein, L. D., Stupka, E., Wilkinson, M. D. and Birney, E. (2002), "The Bioperl toolkit: Perl modules for the life sciences", Genome Res 12(10):1611-8.
The empty extension problem
extensional is a macro for dynamic + multifile. Also asserts facts in metamodel module allowing introspection, saving etc. Still some repetition. pldoc comments. harder to extract metamodel info. more typing would be good. metamodel directives don’t do much: graceful failure when no data. i/o
same code for both in-memory and rdb – amazing!! v powerful. non-recursive only. choose when to swap out prolog store and use rdb. with recursive clauses, can choose to bind only the fact predicates
The term ‘junk DNA’ is outdated
Pax6 is master regulator. Shared ancestor ~0.5bn years ago. Fly eyes vastly different.