Graph DB + Bioinformatics: Bio4j, recent applications and future directions
1. Graph DB + Bioinformatics: Bio4j, recent applications and future directions
www.ohnosequences.com www.bio4j.com
2. But who’s this guy talking here?
I currently work as a bioinformatics consultant/developer/researcher at Oh no sequences!, and I have spent the last two months here at the Ohio State University as a Visiting Scholar.
Oh no what!?
We are the R&D group at Era7 Bioinformatics.
We like bioinformatics, cloud computing, NGS, category theory, bacterial genomics…
well, lots of things.
What about Era7 Bioinformatics?
Era7 Bioinformatics is a bioinformatics company specialized in sequence analysis, knowledge management and sequencing data interpretation.
Our area of expertise revolves around biological sequence analysis, particularly Next Generation Sequencing data management and analysis.
3. We’re a small but quite peculiar company! (in the good sense, of course)
Currently we have offices in:
Madrid (Spain)
Boston MA (USA)
Granada (Spain)
Yeah, I know what you’re thinking: they are not precisely ugly cities…
4. Our team is multidisciplinary: bioinformaticians, mathematicians, lab researchers, immunologists, biologists specialized in biochemistry and IT professionals.
A team formed by people with different backgrounds is able to analyze the same problem from different points of view.
We are based on research
In a fast-changing area, our activity depends on being able to offer cutting-edge solutions. This is only possible by maintaining continuous research and innovation activity.
In addition, since many of our customers are researchers, being part of that community allows us to be really customer-oriented.
5. Everything we do is 100% open source!
Yes, we hate patents.
And no, we’re not crazy (or maybe just a bit…)
OK, that’s really nice, but how can that actually work??
• Free marketing and dissemination
• We can use other bioinformatics open source tools/DBs/etc…
• Faster adaptation to a fast-changing field (bioinformatics, genomics)
• You may not earn a lot of money, but you earn enough while doing many creative things
6. Money? Where from ??
• Providing services
• Adapting services to different infrastructures and frameworks…
OK, but you could probably get much more money with
a different business model…
Yeah, but this is our philosophy!
7. We are also based on Cloud Computing
Cloud computing has revolutionized the world of computing because in this paradigm you get the infrastructure as a service (IaaS). We are experts in using the leader in this field: Amazon Web Services (AWS).
So, what do we get?
a) No investment in infrastructure. Pay per use.
b) Scalability: for example, we can launch just one virtual server for two hours, or more than one hundred for ten hours, depending on the amount of data to be analyzed in each project.
8. What’s Bio4j?
Bio4j is a graph-based bioinformatics DB including most of the data available in:
Uniprot KB (Swiss-Prot + Trembl)
Gene Ontology (GO)
UniRef (50, 90, 100)
NCBI Taxonomy
RefSeq
Enzyme DB
9. What’s Bio4j?
It provides a completely new and powerful framework for querying and managing protein-related information.
Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
10. What’s Bio4j?
Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database".
Thanks both to being based on Neo4j and to the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the most of it.
11. What’s Bio4j?
Everything in Bio4j is open source, released under AGPLv3!
12. Highly interconnected, overlapping knowledge spread throughout different DBs
Outline:
- Bioinformatics DBs and Graphs
- Initial motivation
- Bio4j structure
- Some samples
- Why Bio4j?
- Bio4j and the Cloud
13. However, all this data is in most cases modeled in relational databases, sometimes even just as plain CSV files.
As the amount and diversity of data grows, domain models become crazily complicated!
14. With a relational paradigm, the double implication
Entity <=> Table
does not go both ways.
You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.
You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.
Integrating/incorporating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
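To make the point about auxiliary tables and artificial IDs concrete, here is a minimal sketch (not taken from Bio4j or from the GO schema) that stores the same protein-to-GO-term annotations twice: once relationally, with a join table, and once as plain edges. The table names, column names and GO identifiers are hypothetical placeholders chosen for illustration; only the accession P12345 appears elsewhere in this talk.

```python
import sqlite3

# Relational version: protein_go is an 'auxiliary' table, and the integer
# ids exist only to connect rows (they mean nothing in biology).
# All table/column names and GO ids are illustrative placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE protein (id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE go_term (id INTEGER PRIMARY KEY, go_id TEXT);
    CREATE TABLE protein_go (protein_id INTEGER, go_term_id INTEGER);
    INSERT INTO protein VALUES (1, 'P12345');
    INSERT INTO go_term VALUES (1, 'GO:0000001');
    INSERT INTO go_term VALUES (2, 'GO:0000002');
    INSERT INTO protein_go VALUES (1, 1);
    INSERT INTO protein_go VALUES (1, 2);
""")
rows = conn.execute(
    """
    SELECT g.go_id
    FROM protein p
    JOIN protein_go pg ON pg.protein_id = p.id
    JOIN go_term g ON g.id = pg.go_term_id
    WHERE p.accession = ?
    ORDER BY g.go_id
    """,
    ("P12345",),
).fetchall()
relational_result = [go_id for (go_id,) in rows]

# Graph version: the annotation is simply an edge between two entities,
# addressed by the identifiers that 'live in reality'.
annotation_edges = {"P12345": ["GO:0000001", "GO:0000002"]}
graph_result = sorted(annotation_edges["P12345"])

print(relational_result)
print(graph_result)
```

Both lookups return the same terms; the difference is that the relational version needs two joins and an extra table just to express "this protein has this annotation".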
15. Life in general and biology in particular are probably not 100% like a graph…
but one thing’s sure, they are not a set of tables!
16. NoSQL (not only SQL)
NoSQ… what!??
Let’s see what Wikipedia says…
“NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.”
17. NoSQL data models
18. Neo4j is a high-performance, NoSQL graph database with all the features of a mature and robust database.
The programmer works with an object-oriented, flexible network structure rather than with strict and static tables.
All the benefits of a fully transactional, enterprise-strength database.
For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
19. OK, but why start all this? Were you so bored…?!
It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations.
At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:
- I went crazy ‘deciphering’ how to extract Uniprot protein annotations from the official GO table schema
- Uniprot and GO official protein annotations were not always consistent
- Populating my own DB took really long due to all the joins and subqueries needed to get and store the protein annotations
Soon enough we also needed massive access to basic protein information.
20. These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data).
The Uniprot web services available were too limited:
- Slow
- Limited number of queries
- Too little information available
So I downloaded the whole Uniprot DB in XML format (Swiss-Prot + Trembl) and started to have some fun with it!
21. BG7 algorithm
1) Selection of the specific reference protein set
2) Prediction of possible genes by BLAST similarity
3) Gene definition: merging compatible similarity regions, detecting start and stop
4) Solving overlapped predicted genes
5) RNA prediction by BLAST similarity
6) Final annotation and complete deliverables. Quality control.
www.era7bioinformatics.com
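As a rough illustration of the "merging compatible similarity regions" step of the BG7 algorithm, the sketch below merges BLAST similarity regions into candidate gene spans. It is not BG7's actual implementation: the Region type, the max_gap parameter and the definition of "compatible" used here (same reference protein, same strand, overlapping or within max_gap bases) are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A similarity region on a contig (hypothetical type for this sketch)."""
    start: int
    end: int
    strand: str
    protein: str  # reference protein that produced the BLAST hit


def merge_regions(regions, max_gap=0):
    """Merge regions that are 'compatible' under this sketch's assumption:
    same reference protein, same strand, and overlapping or separated by
    at most max_gap bases."""
    merged = []
    ordered = sorted(regions, key=lambda r: (r.protein, r.strand, r.start))
    for region in ordered:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.protein == region.protein
            and last.strand == region.strand
            and region.start <= last.end + max_gap
        ):
            # Extend the current candidate gene span.
            last.end = max(last.end, region.end)
        else:
            # Start a new span (copy so the input is not mutated).
            merged.append(
                Region(region.start, region.end, region.strand, region.protein)
            )
    return merged


regions = [
    Region(100, 200, "+", "P1"),
    Region(180, 300, "+", "P1"),
    Region(500, 600, "+", "P1"),
    Region(150, 250, "-", "P1"),
]
merged = merge_regions(regions)
for r in merged:
    print(r.start, r.end, r.strand, r.protein)
```

Detecting start and stop codons, and resolving overlaps between genes predicted from different reference proteins, would be separate steps on top of these merged spans.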
22. We got used to having massive direct access to all this protein-related information…
So why not add other resources that we needed quite often in most projects, and which were now becoming a sort of bottleneck compared to all those already included in Bio4j?
Then came:
- Isoform sequences
- Protein interactions and features
- Uniref 50, 90, and 100
- RefSeq
- NCBI Taxonomy
- Enzyme Expasy DB
23. Let’s dig a bit into the Bio4j structure:
Data sources and their relationships:
24. Bio4j domain model
25. The Graph DB model: representation
Core abstractions:
- Nodes
- Relationships between nodes
- Properties on both
26. Let’s dig a bit into the Bio4j structure:
How are things modeled?
Couldn’t be simpler!
Entities -> Nodes
Associations / Relationships -> Edges
27. Some examples of nodes would be:
- Protein
- GO term
- Genome Element
and relationships:
Protein --PROTEIN_GO_ANNOTATION--> GO term
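The property-graph idea (nodes, relationships between nodes, and properties on both) can be sketched in a few lines. This is a toy in-memory model for illustration only, not the Bio4j or Neo4j API: the Node and Relationship classes, the outgoing helper, and the property names and values are all hypothetical; only the PROTEIN_GO_ANNOTATION relationship type and the accession P12345 come from this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, e.g. a protein or a GO term."""
    node_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed edge; it can carry properties just like a node."""
    rel_type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def outgoing(node, relationships, rel_type):
    """Return the end nodes of all rel_type relationships starting at node."""
    return [
        r.end for r in relationships
        if r.start is node and r.rel_type == rel_type
    ]


# Placeholder property values, for illustration only.
protein = Node("Protein", {"accession": "P12345"})
go_term = Node("GO term", {"id": "GO:0000001"})
annotation = Relationship(
    "PROTEIN_GO_ANNOTATION", protein, go_term, {"evidence": "placeholder"}
)

annotated_terms = outgoing(protein, [annotation], "PROTEIN_GO_ANNOTATION")
print([n.properties["id"] for n in annotated_terms])
```

Note that the relationship is a first-class object with its own type and properties, rather than a row in an auxiliary table.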
28. We have developed a tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer
Bio4jExplorer allows you to:
• Navigate through all nodes and relationships
• Access the javadocs of any node or relationship
• Graphically explore the neighborhood of a node/relationship
• Look up the indexes that may serve as an entry point for a node
• Check incoming/outgoing relationships of a specific node
• Check start/end nodes of a specific relationship
29. Entry points and indexing
There are two kinds of entry points for the graph:
Auxiliary relationships going from the reference node, e.g.
- CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology
- MAIN_DATASET: leads to both main datasets: Swiss-Prot and Trembl
Node indexing
There are two types of node indexes:
- Exact: Only exact values are considered hits
- Fulltext: Regular expressions can be used
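The difference between the two index types can be illustrated with a toy model. This is only a sketch of the lookup semantics, not how Neo4j's indexes are implemented: the ExactIndex and PatternIndex classes are hypothetical, and the second protein entry is a made-up placeholder.

```python
import re
from collections import defaultdict


class ExactIndex:
    """Toy exact index: only a value identical to the stored key is a hit."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, value, node):
        self._entries[value].append(node)

    def get(self, value):
        return list(self._entries.get(value, []))


class PatternIndex:
    """Toy stand-in for a fulltext index: values are matched by pattern."""

    def __init__(self):
        self._entries = []

    def add(self, value, node):
        self._entries.append((value, node))

    def query(self, pattern):
        regex = re.compile(pattern, re.IGNORECASE)
        return [node for value, node in self._entries if regex.search(value)]


p1 = {"accession": "P12345",
      "full_name": "Aspartate aminotransferase, mitochondrial"}
p2 = {"accession": "X00001", "full_name": "Example kinase"}  # placeholder

accession_index = ExactIndex()
name_index = PatternIndex()
for protein in (p1, p2):
    accession_index.add(protein["accession"], protein)
    name_index.add(protein["full_name"], protein)

exact_hit = accession_index.get("P12345")     # found: exact key
partial_miss = accession_index.get("P1234")   # empty: partial key is no hit
pattern_hit = name_index.query(r"aminotrans.*")
print(len(exact_hit), len(partial_miss), len(pattern_hit))
```

An exact index suits identifiers such as accessions, while a pattern-capable index suits free-text properties such as names.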
30. Querying Bio4j with Cypher
Getting a keyword by its ID:
    START k=node:keyword_id_index(keyword_id_index = "KW-0181")
    return k.name, k.id
Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
    // p must belong to Swiss-Prot; p2 and p3 close the cycle back to p
    START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
    MATCH d <-[r:PROTEIN_DATASET]- p,
          circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                        -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                        -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
    return p.accession, p2.accession, p3.accession
Check this blog post for more info and our Bio4j Cypher cheat sheet.
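For readers who do not know Cypher, the cycle query can be read as three nested neighbor lookups. The following sketch mirrors that logic over a plain adjacency dictionary; it is an illustration of what the pattern matches, not how Neo4j evaluates it. The three_cycles function and the toy data are hypothetical, and the explicit distinctness check is this sketch's reading of "simple cycles".

```python
def three_cycles(interactions, seeds):
    """Return (p, p2, p3) triples with p -> p2 -> p3 -> p, where p is a
    seed (e.g. a Swiss-Prot protein) and the three proteins are distinct."""
    found = []
    for p in sorted(seeds):
        for p2 in sorted(interactions.get(p, ())):
            for p3 in sorted(interactions.get(p2, ())):
                closes_cycle = p in interactions.get(p3, ())
                if closes_cycle and len({p, p2, p3}) == 3:
                    found.append((p, p2, p3))
    return found


# Toy directed interaction graph: protein -> set of proteins it points to.
interactions = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A"},
    "D": {"A"},
}
swiss_prot = {"A", "D"}  # hypothetical dataset membership

cycles = three_cycles(interactions, swiss_prot)
print(cycles)
```

As in the query, a cycle is reported once for each member that belongs to the seed dataset, since each such member can play the role of p.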
31. Querying Bio4j with Gremlin, a graph traversal language
Get a protein by its accession number and return its full name:
    gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
    ==> Aspartate aminotransferase, mitochondrial
Get proteins (accessions) associated with an InterPro motif (limited to 4 results):
    gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
    ==> E2GK26
    ==> G3PMS4
    ==> G3Q865
    ==> G3PIL8
Check our Bio4j Gremlin cheat sheet.
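The second Gremlin query chains four steps: index lookup, incoming edges of a given type, the node at the start of each edge, and a range limit. A minimal sketch of that pipeline with Python generators follows. The data structures, the helper names and the ACC000x accessions are placeholders for illustration; this is not the Gremlin or Bio4j API.

```python
from itertools import islice

# Toy data (placeholders): index from InterPro id to a node key, edges as
# (start node, relationship type, end node), and an accession per protein.
interpro_id_index = {"IPR023306": "motif-1"}
edges = [
    ("prot-1", "PROTEIN_INTERPRO", "motif-1"),
    ("prot-2", "PROTEIN_INTERPRO", "motif-1"),
    ("prot-3", "PROTEIN_INTERPRO", "motif-1"),
    ("prot-4", "PROTEIN_INTERPRO", "motif-1"),
    ("prot-5", "PROTEIN_INTERPRO", "motif-1"),
]
accessions = {f"prot-{i}": f"ACC000{i}" for i in range(1, 6)}


def in_edges(node, rel_type):
    """Lazily yield edges of rel_type that end at node (like inE)."""
    return (e for e in edges if e[2] == node and e[1] == rel_type)


def out_vertex(edge):
    """Return the node an edge starts from (like outV)."""
    return edge[0]


motif = interpro_id_index["IPR023306"]              # index lookup
pipeline = (
    accessions[out_vertex(e)]                       # .outV.accession
    for e in in_edges(motif, "PROTEIN_INTERPRO")    # .inE('PROTEIN_INTERPRO')
)
first_four = list(islice(pipeline, 4))              # [0..3] is inclusive
print(first_four)
```

Because every step is lazy, the limit stops the traversal after four results instead of visiting every protein linked to the motif.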
32. Visualizations (1): REST Server Data Browser
Navigate through Bio4j data in real time!
33. Visualizations (2): Bio4j + Gephi
Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
34. Visualizations (3): Bio4j GO Tools
35. Why would I use Bio4j?
- Massive access to protein/genome/taxonomy… related information
- Integration of your own DBs/resources around common information
- Development of services tailored to your needs built around Bio4j
- Network analysis
- Visualizations
Besides many others I cannot think of myself…
If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
36. Bio4j + Cloud (1)
We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
Interoperability and data distribution
Releases are available as public EBS Snapshots, so AWS users can create Bio4j DB volumes that are 100% ready and attach them to their instances in just a few seconds.
CloudFormation templates:
- Basic Bio4j DB Instance
- Bio4j REST Server Instance
37. Bio4j + Cloud (2)
Backup and storage using S3 (Simple Storage Service)
We use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files).
What kind of benefits do we get from this?
• Easy to use
• Flexible
• Cost-effective
• Reliable
• Scalable and high-performance
• Secure
38. Bio4j + Cloud (3)
Web servers and service providers in the cloud
Deploying your own web server in AWS using Bio4j as back-end is really simple.
A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.
39. Community
Bio4j has a fast-growing internet presence:
- Twitter: check @bio4j for updates
- Blog: go to http://blog.bio4j.com
- Mail-list: ask any question you may have in our list.
- LinkedIn: check the Bio4j group
- Github issues: don’t be shy! Open a new issue if you think something’s going wrong.
40. And the best part of all this is:
You have the latest version of Bio4j already imported and fully working in EgStation! ;)
41. Bio4j + MG7 for the integration and analysis of Chip-seq data
42. Bio4j + MG7 + 24 Chip-Seq samples
Some numbers:
• 157 639 502 nodes
• 742 615 705 relationships
• 632 832 045 properties
• 148 relationship types
• 44 node types
And it works just fine!
44. What’s MG7?
MG7 is a new system for massive analysis of sequences from metagenomics samples, specially designed for next generation sequencing technologies.
MG7 uses cloud computing to solve the problem of massive data analysis, providing scalable, real-time, on-demand computing for metagenomics data analysis.
MG7 is able to obtain annotation and functional profiles for shotgun genomic sequences and taxonomic assignment for any type of read.
The inference of function and the assignment of taxonomical origin for each sequence are based on massive BLAST similarity analysis.
45. What’s MG7?
MG7 lets you choose different parameters to set the thresholds for filtering the BLAST hits:
i. E-value
ii. Identity and query coverage
It allows exporting the results of the analysis to different data formats such as:
• XML
• CSV
• Gexf (Graph exchange XML format)
It also provides the user with heat maps and graph visualizations, and includes a user-friendly interface, MG7Viewer, that gives access to the alignment responsible for each functional or taxonomical read assignment and displays the frequencies in the taxonomical tree.
49. Bio4j + GRG
A completely new approach for modeling genomic information and gene regulatory networks
50. Bio4j + GRG
Integrating genomic information from organisms such as:
• Zea mays subsp. mays
• Oryza sativa Japonica Group
• Sorghum bicolor
• Brachypodium distachyon
• Arabidopsis thaliana Columbia
• Arabidopsis lyrata lyrata MN47
51. Bio4j + GRG domain model
52. Bio4j + GRG
Get all the advantages of Bio4j and graph DBs while modeling genomic data for grasses (although it could also be applied to other species/families).
Possibility of integrating data from other projects here at CAPS/EGLab in a common framework.
Data mining of data that is currently not accessible, or simply not structured well enough to explore, both for external genomic data from sites like Phytozome and for data coming directly from the experiments/analyses performed in the lab.
Common framework for accessing all this information together with other “universal” resources such as Uniprot, RefSeq or Gene Ontology.
53. Bio4j + GRG
A chance for the Lab to enter the Cloud and Graph DB world, pioneering access to this sort of data for a whole range of different users.
No more worrying about possible problems with backups, maintaining infrastructure or things like that…
And what’s more important:
Scalability: being able to adapt to the specific needs of new projects as they go along.
54. And the best part… Acknowledgments!
Bio4j + MG7 + Chip-Seq results
Bio4j + GRG
55. The other guys from the basement…
(Brett)
(Matias)
(Andrew)
56. And of course the rest of the Lab !
57. That’s it!
Thanks for your time ;)