3. Questions: If we wanted to start studying a gene of unknown function, which one(s) should we study first? How many unannotated genes could be annotated? What proportion of unknown genes (hypothetical proteins) are probably not real proteins (e.g., pseudogenes, mis-predicted ORFs)? What proportion of unknown gene families are probably phage-related? Can some of these families (hopefully the top-ranking ones) be characterized using non-similarity-based bioinformatic approaches?
6. Phylogenetic profiling (Wu et al., PLOS Genetics, 2005): For C. hydrogenoformans, identified the presence or absence of homologs in all other completely sequenced genomes. Identified many hypothetical proteins that had the same profile as known sporulation proteins.
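A minimal sketch of the profile comparison described on this slide, assuming each gene is represented as a tuple of 0/1 flags (homolog absent/present) over a fixed list of genomes. The gene names, the toy profiles, and the helper functions are hypothetical illustrations, not data or code from Wu et al.

```python
from collections import defaultdict

def group_by_profile(profiles):
    """Group gene names that share an identical presence/absence profile."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[tuple(profile)].append(gene)
    return groups

def candidates_for(known_genes, profiles):
    """Return genes (not in known_genes) whose profile exactly matches a known gene."""
    groups = group_by_profile(profiles)
    known = set(known_genes)
    hits = set()
    for gene in known:
        for other in groups[tuple(profiles[gene])]:
            if other not in known:
                hits.add(other)
    return sorted(hits)

# Hypothetical toy data: columns are genomes, 1 = homolog present.
profiles = {
    "spoA": (1, 1, 0, 0, 1),
    "spoB": (1, 1, 0, 0, 1),
    "hyp1": (1, 1, 0, 0, 1),  # same profile as the sporulation genes
    "hyp2": (0, 1, 1, 1, 0),
}
print(candidates_for(["spoA", "spoB"], profiles))
```

Exact matching is the simplest choice; a real analysis might instead tolerate a few mismatches between profiles.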
8. Community Profiling: Look across multiple metagenomic samples. Gene families that have similar profiles may have similar functions. This is similar to using co-expression to identify genes with similar functions.
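As a rough illustration of comparing profiles across samples, the sketch below computes the Pearson correlation between the per-sample counts of two gene families. The family names and counts are made up, and Pearson correlation is an assumed similarity measure, chosen here only because it is the usual one for co-expression-style comparisons.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length count vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts of three gene families in four metagenomic samples.
fam_a = [10, 20, 30, 40]
fam_b = [1, 2, 3, 4]      # rises and falls with fam_a
fam_c = [40, 30, 20, 10]  # opposite trend

print(pearson(fam_a, fam_b), pearson(fam_a, fam_c))
```

Families whose profiles correlate strongly (like `fam_a` and `fam_b` here) would be the candidates for shared function.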
9. So what have I done? Took "all metagenomics peptides" from CAMERA: 43M sequences (mostly GOS). Searched against 11,000 Pfams using HMMER 3. Used “cluster” to group genes and samples.
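One way to picture the input to the clustering step is a Pfam-by-sample count matrix. The sketch below assumes the HMMER 3 hits have already been reduced to (sample, Pfam accession) pairs; the sample names, accessions, and the `count_matrix` helper are hypothetical and do not parse real HMMER output.

```python
from collections import Counter

def count_matrix(hits):
    """Tally (sample, pfam) pairs into a matrix with one row per Pfam, one column per sample."""
    counts = Counter(hits)
    samples = sorted({sample for sample, _ in counts})
    pfams = sorted({pfam for _, pfam in counts})
    # Counter returns 0 for pairs that were never observed.
    matrix = [[counts[(sample, pfam)] for sample in samples] for pfam in pfams]
    return samples, pfams, matrix

# Hypothetical hits: each tuple is one peptide matching one Pfam.
hits = [
    ("siteA", "PF00001"),
    ("siteA", "PF00001"),
    ("siteB", "PF00001"),
    ("siteB", "PF00002"),
]
samples, pfams, matrix = count_matrix(hits)
print(samples, pfams, matrix)
```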
10. Results (heatmap; axes: metagenomic samples, Pfams): Red = above-average number of Pfams; green = below-average number of Pfams. Have not normalized for number of sequences per sample or for number of Pfams.
12. Measuring functional relatedness: Need to measure community profiling performance. The hierarchical clusters were broken into 575 groups using a correlation cutoff of 0.90 or above. Pfams were mapped to GO terms using pfam2GO. 1893 Pfams had no associated GO term; 695 of these were Domains of Unknown Function (DUFs). 3377 Pfams had one or more associated GO terms and could be used for further analysis. Only 67 (of 575) clusters contained 4 or more Pfams with at least one GO term.
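The grouping step can be approximated as follows. This is a simplified sketch, not the clustering actually used: it merges any two families whose profiles have a Pearson correlation of at least 0.90 (single-linkage grouping via union-find), whereas the linkage method behind the 575 groups is not stated on the slide. The family names and counts are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def correlation_groups(profiles, cutoff=0.90):
    """Single-linkage grouping: join families whose profiles correlate >= cutoff."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical per-sample counts for four families.
profiles = {
    "famA": [1, 2, 3, 4],
    "famB": [2, 4, 6, 9],
    "famC": [4, 3, 2, 1],
    "famD": [9, 6, 4, 2],
}
print(correlation_groups(profiles))
```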
13. Measuring GO similarity: G-SESAME measures the semantic similarity of any two GO terms. It is not downloadable, so queries had to be made to its web server (not fun). Pairwise similarity was measured for each pair of GO terms in each cluster; had to check that both terms were in the same namespace.
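The per-cluster scoring can be sketched as below, assuming a `similarity(a, b)` callable that stands in for a G-SESAME query. Nothing here contacts the G-SESAME server; the term labels, namespaces, and the 0.8 score in the toy lookup table are invented for illustration.

```python
from itertools import combinations

def cluster_score(terms, namespace, similarity):
    """Mean similarity over all term pairs in the same namespace; None if no pair qualifies."""
    scores = [
        similarity(a, b)
        for a, b in combinations(terms, 2)
        if namespace[a] == namespace[b]  # skip pairs from different GO namespaces
    ]
    return sum(scores) / len(scores) if scores else None

# Hypothetical terms and a toy lookup table standing in for the web service.
namespace = {
    "term1": "molecular_function",
    "term2": "molecular_function",
    "term3": "biological_process",
}
toy_scores = {frozenset(("term1", "term2")): 0.8}

def toy_similarity(a, b):
    """Order-independent lookup of a made-up similarity value."""
    return toy_scores[frozenset((a, b))]

print(cluster_score(["term1", "term2", "term3"], namespace, toy_similarity))
```

Passing the similarity function in as a parameter keeps the averaging logic separate from however the scores are actually obtained.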
14. Results: Average G-SESAME scores for each cluster. The average of all cluster averages was 0.484. 10 clusters had a score of 0.60 or greater. The data were then randomized by using the same GO terms but in different random clusters, giving a score of 0.412-0.420 over 4 iterations. Each of the 4 iterations had only 1 or 0 clusters with a score of 0.60 or greater.
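The randomization control might look roughly like this: the same terms are shuffled into clusters of the original sizes and the mean of the cluster means is recomputed. This is a sketch under assumptions; whether cluster sizes were preserved in the original analysis is not stated, and the toy clusters and `toy_score` function are hypothetical stand-ins for real GO terms and G-SESAME scores.

```python
import random
from itertools import combinations

def shuffle_clusters(clusters, rng):
    """Reassign the same terms to random clusters, keeping each cluster's size."""
    pool = [term for cluster in clusters for term in cluster]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for cluster in clusters:
        shuffled.append(pool[start:start + len(cluster)])
        start += len(cluster)
    return shuffled

def mean_of_cluster_means(clusters, score):
    """Average the per-cluster scores, ignoring clusters with no score."""
    values = [v for v in (score(c) for c in clusters) if v is not None]
    return sum(values) / len(values)

def toy_score(cluster):
    """Made-up score: fraction of term pairs that share a first letter."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return None
    return sum(a[0] == b[0] for a, b in pairs) / len(pairs)

# Hypothetical clusters whose members are perfectly coherent before shuffling.
clusters = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
observed = mean_of_cluster_means(clusters, toy_score)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
randomized = [
    mean_of_cluster_means(shuffle_clusters(clusters, rng), toy_score)
    for _ in range(4)  # 4 iterations, as on the slide
]
print(observed, randomized)
```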
18. BitTorrent: A peer-to-peer file sharing protocol. ~27-55% of all Internet traffic. Mostly illegal file sharing. Files are shared in small pieces between several users.
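To make the "small pieces" idea concrete, the sketch below splits a byte string into fixed-size pieces and computes a SHA-1 digest for each, which is how the original BitTorrent protocol lets a downloader verify every piece independently. The piece length and the toy data are arbitrary choices for illustration, and the helper names are hypothetical.

```python
import hashlib

def split_pieces(data: bytes, piece_length: int):
    """Cut data into consecutive pieces; the last piece may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]

def piece_hashes(data: bytes, piece_length: int):
    """Return one SHA-1 hex digest per piece."""
    return [hashlib.sha1(piece).hexdigest() for piece in split_pieces(data, piece_length)]

def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """Check a downloaded piece against its published digest."""
    return hashlib.sha1(piece).hexdigest() == expected_hex

# Toy "dataset": 4000 bytes split into 1024-byte pieces.
data = b"ACGT" * 1000
hashes = piece_hashes(data, 1024)
pieces = split_pieces(data, 1024)
print(len(pieces), verify_piece(pieces[0], hashes[0]))
```

Because each piece is checked on its own, pieces can arrive from different users in any order and still be trusted.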
19. Torrents for Biology: Why use torrent technology? Download large datasets much faster. Searchable central listing. Decentralization of data.
20. What is BioTorrents? A legal file sharing website for scientists. Users can upload their own research results, data, and software. Users can browse or search through all datasets. Data is not hosted on BioTorrents.
27. Who will upload data? Everyone! Realistically: large organizations (e.g., NCBI, CAMERA), which may need some convincing to host their data via torrents in addition to FTP, HTTP, etc.; and scientists who really support open science, sharing data before it is formally complete and published.
28. Technical Challenges: Many institutions frown on BitTorrent technology. A port must be opened/forwarded. The client program and computer must be left running. Ensuring data is legal, virus-free, etc.; users that upload many legitimate torrents will give more confidence to people downloading. Making downloading and uploading easy.