2. History
http://biogps.org/ is a user-defined and user-extensible tool to
analyze genes. Given a gene of interest, different people are
interested in different data about the gene. BioGPS allows
you to select and display the data you are interested in about
your gene.
To power the backend queries, a database of gene information
was abstracted from BioGPS. The database contained
information about all genes, aggregated on Entrez gene ID and
updated weekly.
To access the data seamlessly from BioGPS, a REST API was
implemented, providing an annotation lookup service (/gene/)
and a full-text query service (/query/). The combination of
the two (data aggregation plus API front end) became MyGene.info.
3. MyGene.info
• MyGene.info provides simple-to-use REST web services to
query and retrieve gene annotation data, aggregated on Entrez
ID.
• Examples:
– http://mygene.info/v2/gene/1017
– http://mygene.info/v2/query?q=cdk*&fields=pdb
– http://mygene.info/v2/metadata
• Hosted entirely on AWS cloud computers (three 8 GB, 2-core data
nodes and two 4 GB, 2-core web nodes). Currently serves
millions of requests per month.
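The example URLs above follow a simple pattern, which can be sketched in Python as below. This is only an illustration of how the two services are addressed: the helper names `gene_url` and `query_url` are hypothetical, and only the base URL, the `q` parameter and the `fields` parameter come from the examples.

```python
from urllib.parse import urlencode

# Base URL taken from the slide examples.
BASE_URL = "http://mygene.info/v2"


def gene_url(entrez_id):
    """Build a URL for the annotation lookup service (/gene/)."""
    return f"{BASE_URL}/gene/{entrez_id}"


def query_url(q, fields=None):
    """Build a URL for the full-text query service (/query)."""
    params = {"q": q}
    if fields:
        # Several fields are passed as one comma-separated value.
        params["fields"] = ",".join(fields)
    # Leave '*' and ',' unescaped so the URL reads like the slide examples.
    return f"{BASE_URL}/query?{urlencode(params, safe='*,')}"


print(gene_url(1017))
print(query_url("cdk*", ["pdb"]))
```

The resulting strings can be fetched with any HTTP client; the services return JSON.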
4. MyVariant.info
• MyVariant.info provides simple-to-use REST web services to
query and retrieve variant annotation data, aggregated from
many popular data resources and keyed on HGVS ID.
• Examples:
– http://myvariant.info/v1/variant/chr6:g.152708291G%3EA
– http://myvariant.info/v1/query?q=clinvar.chrom:10&fields=clinvar
– http://myvariant.info/v1/query?q=chr1:69000-70000&fields=dbnsfp,dbsnp
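Because HGVS IDs contain the character ">", they have to be percent-encoded when placed in a URL path (the %3E in the first example). A minimal sketch of that step, assuming a hypothetical helper named `variant_url`:

```python
from urllib.parse import quote

# Base URL taken from the slide examples.
BASE_URL = "http://myvariant.info/v1"


def variant_url(hgvs_id):
    """Build an annotation URL for a variant, percent-encoding the HGVS ID."""
    # '>' becomes %3E; ':' is kept readable via the `safe` argument.
    return f"{BASE_URL}/variant/{quote(hgvs_id, safe=':')}"


print(variant_url("chr6:g.152708291G>A"))
```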
5. Biothings.api - abstracting web front end
From the point of view of the front end, the
nature of the document is inconsequential,
i.e., whether we serve documents of genes
or variants or chemicals isn’t particularly
important => How much can we abstract out
of mygene and myvariant and apply it to
6. Motivation
• Isolate the common aspects of MyGene and
MyVariant codebases and make them
available in a separate framework:
biothings.api
• Allows easier development of additional
biothings APIs (Disease, Drug/Chemical, GO,
Species… -> JSON, aggregate on a single field)
• Allows easier maintenance and development
of current biothings (gene, variant).
7. System Overview
• The Tornado HTTP server consists of handlers that contain the code to run
when a particular URL pattern is matched, e.g. /variant/ or /metadata
• The biothings codebase essentially connects the appropriate Tornado HTTP
request handler for a request to the Elasticsearch query that executes that
request. Conceptually, this is very similar to a model-controller framework,
where the model is the Elasticsearch Python component and the controller is
the Tornado HTTP server.
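The handler-to-query connection described above can be illustrated with a small standard-library sketch. It does not use Tornado or Elasticsearch: `FakeESQuery`, `ROUTES` and `dispatch` are hypothetical stand-ins that only show the idea of matching a URL pattern, picking a handler, and letting the handler call the query layer.

```python
import re


class FakeESQuery:
    """Hypothetical stand-in for the Elasticsearch query layer (the "model")."""

    def __init__(self, docs):
        self.docs = docs  # maps document ID -> document

    def get_biothing(self, bid):
        return self.docs.get(bid)


def annotation_handler(esq, bid):
    """Handler for /variant/<id>: look up one document by ID."""
    doc = esq.get_biothing(bid)
    if doc is None:
        return 404, {"error": "not found"}
    return 200, doc


def metadata_handler(esq):
    """Handler for /metadata: report simple statistics."""
    return 200, {"total": len(esq.docs)}


# URL pattern -> handler table (the "controller" side).
ROUTES = [
    (re.compile(r"^/variant/(.+)$"), annotation_handler),
    (re.compile(r"^/metadata/?$"), metadata_handler),
]


def dispatch(esq, path):
    """Run the first handler whose URL pattern matches the path."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(esq, *match.groups())
    return 404, {"error": "no handler"}


esq = FakeESQuery({"chr1:g.1A>T": {"_id": "chr1:g.1A>T"}})
print(dispatch(esq, "/variant/chr1:g.1A>T"))
print(dispatch(esq, "/metadata"))
```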
8. Biothings – HTTP Handling
• tornado.web.RequestHandler: base tornado class for HTTP request handling. Important class methods:
get/post, get_arguments, write
• biothings.www.helper.BaseHandler: contains methods common to all biothings RequestHandlers.
Important class methods: get_query_params, return_json
• biothings.www.api.handlers.QueryHandler: contains methods to implement the biothings query
endpoint. Important class methods: get, post, _examine_kwargs
• biothings.www.api.handlers.BiothingHandler: contains methods to implement the biothings annotation
endpoint. Important class methods: get, post, _examine_kwargs
• biothings.www.api.handlers.MetaDataHandler: contains methods to implement the metadata endpoint
• biothings.www.api.handlers.StatusHandler: contains methods to implement a status endpoint for AWS
ELB
9. Biothings – HTTP Handling
• biothings.www.api.handlers.BiothingHandler:
– GET request (e.g. /variant/chr6:g.152708291G>A)
– POST request (e.g. /variant/)
10. Biothings – HTTP Handling
• biothings.www.api.handlers.QueryHandler:
– GET request (e.g. /query?q=_exists_:dbsnp)
– POST request (e.g. /query/)
11. Biothings – Elasticsearch query
• biothings.www.api.es.ESQuery – contains the python code
for constructing the elasticsearch query and formatting the resulting data
– query(q, **kwargs) – Contains the elasticsearch query to run with data obtained from a
GET or POST to the /query/ endpoint.
– get_biothing(bid, **kwargs) – Contains the elasticsearch query to run with data
obtained from a GET to the /annotation/ endpoint.
– mget_biothings(bid_list, **kwargs) – Contains the elasticsearch query to run with data
obtained from a POST to the /annotation/ endpoint.
– _cleaned_res(res) – Contains the code to format the return object for get_biothing and
mget_biothings.
– _cleaned_res2(res) – Contains the code to format the return object for query.
– _get_biothingdoc(hit) – Contains the code to format a single biothing object from any
elasticsearch query. Called by _cleaned_res and _cleaned_res2.
– _modify_biothingdoc(doc) – Contains the code to modify a biothing_doc. Called in
_get_biothingdoc. Currently empty -> for overriding.
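The formatting chain above, with _modify_biothingdoc as the override hook, might look roughly like the following sketch. The method names are taken from the slide, but the bodies are assumptions written for illustration, not the biothings source; `SketchESQuery`, `VariantESQuery` and the `_internal` field are hypothetical. The input follows the usual shape of an Elasticsearch search response (`hits.hits[]` with `_id`, `_score`, `_source`).

```python
class SketchESQuery:
    """Illustrative stand-in for the result-formatting part of ESQuery."""

    def _get_biothingdoc(self, hit):
        # Flatten one Elasticsearch hit into a single biothing document.
        doc = dict(hit.get("_source", {}))
        doc["_id"] = hit["_id"]
        if "_score" in hit:
            doc["_score"] = hit["_score"]
        return self._modify_biothingdoc(doc)

    def _modify_biothingdoc(self, doc):
        # Empty by default; projects override this hook.
        return doc

    def _cleaned_res2(self, res):
        # Format the result of a /query/ request.
        hits = res["hits"]
        return {
            "total": hits["total"],
            "hits": [self._get_biothingdoc(h) for h in hits["hits"]],
        }


class VariantESQuery(SketchESQuery):
    """Project-specific subclass that customizes each returned document."""

    def _modify_biothingdoc(self, doc):
        # Hypothetical customization: hide a field not meant for users.
        doc.pop("_internal", None)
        return doc


raw = {"hits": {"total": 1, "hits": [
    {"_id": "x", "_score": 1.0, "_source": {"a": 1, "_internal": 2}},
]}}
print(VariantESQuery()._cleaned_res2(raw))
```

Keeping the hook separate means a new project only has to override one small method to change the shape of every returned document.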
12. Biothings - Settings
• Problem: Until now, we have left out the problem of how to
refer to things that MUST be project specific (e.g., the name
of the elasticsearch index to search, the type of the
document, etc). How do we do this?
• Solution: We make a settings module in biothings that all
code within biothings refers to. That module looks for an
environment variable called BIOTHING_SETTINGS with the
name of a module that can be imported to set project specific
variables.
– export BIOTHING_SETTINGS='biothings.config'
• Similar to Django's DJANGO_SETTINGS_MODULE.
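One way such a settings module could work is sketched below. Only the environment variable name BIOTHING_SETTINGS comes from the slide; the `Settings` class, the convention of copying upper-case names, and the example variable `ES_INDEX_NAME` are assumptions made for illustration.

```python
import importlib
import os

ENV_VAR = "BIOTHING_SETTINGS"  # variable name from the slide


class Settings:
    """Load project-specific variables from the module named in ENV_VAR."""

    def __init__(self, default_module=None):
        module_name = os.environ.get(ENV_VAR, default_module)
        if not module_name:
            raise RuntimeError(f"{ENV_VAR} is not set")
        module = importlib.import_module(module_name)
        # Assumption: only UPPER_CASE names count as settings,
        # a convention borrowed from Django.
        for name in dir(module):
            if name.isupper():
                setattr(self, name, getattr(module, name))
```

Framework code would then read, for example, `Settings().ES_INDEX_NAME` instead of hard-coding a project-specific index name.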
14. Biothings – Project template
• At this point, we have the tools necessary to easily create and
subclass 4 types of biothings handlers (BiothingHandler,
QueryHandler, MetaDataHandler, StatusHandler), and the
elasticsearch query class (ESQuery)
• Could definitely stop here and have a useful tool, but we
wanted to make it even easier to create a new project (also
enforces a uniform project structure across all biothings APIs).
• To do this we have a project template folder containing the
project directory structure and some skeleton code:
– config.py,
– URL patterns to Handlers connection
– Handlers to ESQuery connection
15. Biothings - Project template
• To create the actual project directory from the
template, we wrote a small script: biothings-admin.py
– Usage: biothings-admin.py <path-to-project-directory> <biothing-object-name>
– Example: biothings-admin.py ~ variant
• Any folder or file in the template directory will be
created in the project directory. The contents of each
file are passed through Python's string.Template
substitution before the file is written to the project
directory.
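The template step can be sketched with the standard library as follows. This is not the biothings-admin.py source: the function name `render_project` and the placeholder `$biothing_name` are hypothetical, and `safe_substitute` is used here so that unknown placeholders are left untouched rather than raising an error.

```python
import os
from string import Template


def render_project(template_dir, project_dir, context):
    """Copy a template tree, substituting $placeholders in every file."""
    for root, _dirs, files in os.walk(template_dir):
        # Recreate the same relative folder inside the project directory.
        rel = os.path.relpath(root, template_dir)
        target_root = os.path.normpath(os.path.join(project_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            with open(os.path.join(root, name)) as src:
                text = Template(src.read()).safe_substitute(context)
            with open(os.path.join(target_root, name), "w") as dst:
                dst.write(text)
```

For example, `render_project("template", "myvariant", {"biothing_name": "variant"})` would turn a template line such as `ES_DOC_TYPE = "$biothing_name"` into `ES_DOC_TYPE = "variant"`.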
17. Recreating MyVariant.info using biothings.api
• Recreated current MyVariant.info service using the
biothings.api framework
– Very little extra code required (~100 lines)
– Less than a day of time to create the web front end from start.
– https://github.com/SuLab/myvariant.info/tree/biothings.variant
• Seems disingenuous to gauge the utility of a tool by
recreating a codebase if that tool was itself created from the
codebase => Should try implementing other APIs, especially
MyGene.info (has more varied gene specific query options),
and modify biothings as needed.
18. MyGene.info v3
• Sebastien reimplemented MyGene using
biothings framework
• Currently live at mygene.info/v3 for testing
purposes
• Some structural changes to data also
• Examples:
– http://mygene.info/v3/gene/1017
– http://mygene.info/v3/query?q=cdk*&fields=pdb
19. Small Biothing Cluster
• With biothings, new front end frameworks are very easy to
set up => We are limited only by our ability to parse,
aggregate, index etc. new data.
• For small ES indices (under 1 to 2 GB), we set up a small biothings
cluster with one m4.large data node serving all search requests
and one t2.micro web node per biothing.
• Currently, this consists of:
– small biothing data/master node: m4.large
– Taxonomy web node: t2.micro
– Chemical web node: t2.micro
20. Taxonomy biothing
• Using a taxonomy parser written by Greg.
Aggregated on NCBI taxonomy ID.
• Currently live at http://52.34.211.113
• Examples:
– http://52.34.211.113/v1/species/9606
– http://52.34.211.113/v1/query?q=human
• Soon to become http://s.biothings.io
21. Chemicals biothing
• Data from several chemical databases aggregated by Julee
on InChIKey (hash of string representation of chemical):
https://en.wikipedia.org/wiki/International_Chemical_Identifier#InChIKey
• Currently live at http://52.38.192.121/
• Examples:
– http://52.38.192.121/v1/drug/CHEMBL1201666
– http://52.38.192.121/v1/query?q=chembl.pref_name:neo*&fields=chembl.pref_name
• Soon to become http://c.biothings.io
22. Future work
• Integrate data load and data index functions into biothings
(WIP, large project)
• Documentation! – Projects like this need very good
documentation to be of any use to an API developer (on the
level of tornado’s excellent documentation:
http://www.tornadoweb.org/en/stable/web.html) (also, WIP)
• Host API services for external users' data (essentially possible
already without too much work).
• Auto-generate clients (python client, R client)
• Auto-generate ansible-playbook to create cluster hardware
on AWS
• One-click API…