Big Data and Analytics on AWS
Chicago AWS user group event Nov 12, 2019
"Big Data in Higher Education" - Rebecca Schmidt and Alana Alfeche // @rebeccaschmidtm and @alanaalfeche
2. Presentation Breakdown
1. Share pain points in our field
2. Our field with cloud technology
3. Q & A
Disclaimer: None of the material presented here reflects the work we do in our professional roles.
It is knowledge we gained from our graduate programs.
4. Whole Genome Sequencing
1995 First free-living organism to have its entire genome
sequenced (Haemophilus influenzae Rd.)
2003 Human Genome Project completed with a price tag
of $2.7 billion
2015 The cost to generate a whole-exome sequence is
estimated to be below $1500
Moore’s Law states that computing power doubles every
two years. Technologies that ‘keep up’ with Moore’s Law
are widely regarded as doing well.
NIH, 2019
5. Information Explosion
Data Volume
- By 2020, 40% of IoT devices will be related to
health and medicine
- By 2025, biomedical data will exceed the growth
of other big data domains such as astronomy,
physics, and social media
Data Velocity
- Next-generation sequencing (NGS) delivers on the
order of 30 GB of data in real time
Data Variety
- Biological data are heterogeneous
- No standard annotation
- Each database has its own data format
NCBI, October 2019
Rossi, 2018
9. CV Through the Years
● Data mining now utilizes machine learning
algorithms as tools to extract potentially valuable
patterns held within datasets
○ Informs image recognition
● Advancements in the study of Computer Vision are
influencing almost every industry
○ Automotive
○ Healthcare
○ Retail
○ Agriculture
○ Banking
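The pattern-extraction idea above can be sketched with a toy example. The code below is illustrative only (the "images" and the 1-nearest-neighbour rule are our assumptions, not anything from the slides): it classifies tiny 3x3 binary grids as vertical or horizontal bars by finding the closest pattern in a labeled dataset.

```python
import numpy as np

# Toy "images": 3x3 binary grids, flattened row by row.
# Label 1 = vertical bar, label 0 = horizontal bar.
train_X = np.array([
    [0, 1, 0,  0, 1, 0,  0, 1, 0],   # vertical bar (middle column)
    [1, 0, 0,  1, 0, 0,  1, 0, 0],   # vertical bar (left column)
    [0, 0, 0,  1, 1, 1,  0, 0, 0],   # horizontal bar (middle row)
    [1, 1, 1,  0, 0, 0,  0, 0, 0],   # horizontal bar (top row)
])
train_y = np.array([1, 1, 0, 0])

def predict(x):
    # 1-nearest-neighbour: return the label of the closest
    # training pattern under Manhattan (L1) distance.
    dists = np.abs(train_X - x).sum(axis=1)
    return train_y[np.argmin(dists)]

# A noisy vertical bar (one extra pixel in the bottom row)
noisy = np.array([0, 1, 0,  0, 1, 0,  0, 1, 1])
print(predict(noisy))  # -> 1 (recognized as a vertical bar)
```

Real image-recognition pipelines operate the same way in spirit, just with far larger datasets and learned features instead of raw pixels, which is exactly where the big-data challenges on the next slide come in.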
10. Challenges with Big Data in CV
Availability of Public Data
● Companies like Waymo are moving toward making their data publicly
available, but not necessarily in a common/centralized way
● Difficult to monitor the effectiveness of data integration
Quantity
● ML algorithms are not necessarily designed to handle big data
● Adapting through new processing paradigms such as MapReduce (parallel
execution across multiple nodes) and distributed processing frameworks
(Hadoop)
● Computational Complexity and Processing Performance
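The MapReduce paradigm mentioned above can be sketched in a few lines. This is not Hadoop itself, just a minimal illustration of the idea using Python's standard-library `multiprocessing` pool: independent map tasks produce partial results in parallel, and a reduce step merges them.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk):
    # Map: each worker counts words in its own chunk, independently
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Reduce: merge two partial counts into one
    return a + b

if __name__ == "__main__":
    chunks = ["big data big", "data velocity data"]
    with Pool(2) as pool:              # map tasks run in parallel
        partials = pool.map(map_phase, chunks)
    total = reduce(reduce_phase, partials)
    print(total["data"])  # -> 3
```

Because each map task touches only its own chunk, the same pattern scales from two local processes to thousands of cluster nodes, which is what frameworks like Hadoop automate.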
Non-Linearity of Data
● Difficult to observe relationships
Variance and Bias
● As the volume of data increases, the learner can fit too closely to the
training set and fail to generalize adequately to new data
● Regularization is used to mitigate this, but it requires more computation time