This document summarizes Mendeley's transition to big data frameworks such as Hadoop and to cloud services from AWS to handle its large and growing dataset. It discusses how Mendeley outgrew its original MySQL database and needed scalable solutions, and describes implementing recommendations and analytics using Hadoop and running batch jobs on AWS EMR. While the cloud provides scalability, it also reduces control and introduces orchestration challenges.
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Mendeley
1. Cloud Elephants and Witches: A Big Data Tale from Mendeley
Kris Jack, PhD
Data Mining Team Lead
2. Overview
➔
What's Mendeley?
➔
The curse that comes with success
➔
A framework for scaling up (Hadoop + MapReduce)
➔
Moving to the cloud (AWS)
➔
Conclusions
4. What is Mendeley?
...a large data technology
startup company
...and it's on a mission to
change the way that
research is done!
5. Mendeley Last.fm
Last.fm works like this:
1) Install “Audioscrobbler”
2) Listen to music
3) Last.fm builds your music profile and recommends you music you also could like... and it’s the world’s biggest open music database
6. Mendeley Last.fm
music libraries ↔ research libraries
artists ↔ researchers
songs ↔ papers
genres ↔ disciplines
genres disciplines
13. In the beginning, there was...
➔
MySQL:
➔
Normalised tables for storing and serving:
➔
User data
➔
Article data
➔
The system was happy
➔
With this, we launched
the article catalogue
➔
Lots of number crunching
➔
Many joins for basic stats
14. Here's where the curse of success comes
➔
More articles came
➔
More users came
➔
The system became unhappy
➔
Keeping data fresh was a burden
➔
Algorithms relied on global counts
➔
Iterating over tables was slow
➔
Needed to shard tables to grow catalogue
➔
In short, our system didn't scale
15. 1.6 million+ users; the 20 largest userbases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
16. Real-time data on 28m unique papers
[Chart; axis label: 50m]
Thomson Reuters’ Web of Knowledge (dating from 1934)
Mendeley after 16 months: >150 million individual articles (>25TB)
17. We had serious needs
➔
Scale up to the millions (billions for some items)
➔
Keep data fresh
➔
Support newly planned services
➔
Search
➔
Recommendations
➔
Business context
➔
Agile development (rapid prototyping)
➔
Cost effective
➔
Going viral
19. What is Hadoop?
The Apache Hadoop project develops open-source
software for reliable, scalable, distributed
computing
www.hadoop.apache.org
20. Hadoop
➔
Designed to operate on a cluster of computers
➔
1...thousands
➔
Commodity hardware (low cost units)
➔
Each node offers local computation and storage
➔
Provides framework for working with petabytes of data
➔
When learning about Hadoop, you need to learn about:
➔
HDFS
➔
MapReduce
21. HDFS
➔
Hadoop Distributed File System
➔
Based on Google File System
➔
Replicates data storage (reliability, x3, across racks)
➔
Designed to handle very large files (split into large blocks, e.g. 64MB)
➔
Provides high throughput
➔
File access through Java and Thrift APIs, command line and web app
➔
Name node is a single point of failure (availability issue)
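To make the block and replication bullets concrete, here is a minimal sketch of the arithmetic, assuming the 64MB block size and x3 replication mentioned on the slide. The function names and the 1GB example file are hypothetical, and nothing here calls a Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replication factor quoted on the slide


def blocks_needed(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Raw cluster storage used once every block has been replicated."""
    return file_size_mb * replication


# Hypothetical 1GB file: how many blocks, and how much raw storage?
example_size_mb = 1024
example_blocks = blocks_needed(example_size_mb)
example_raw_mb = raw_storage_mb(example_size_mb)
print(example_blocks, example_raw_mb)
```

The point of the sketch is that replication multiplies storage needs, which matters when budgeting hardware for a cluster.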
22. MapReduce
➔
MapReduce is a programming model
➔
Allows distributed processing of large data sets
➔
Based on Google's MapReduce
➔
Inspired by functional programming
➔
Take the program to the data, not the data to the program
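The functional-programming inspiration can be sketched in a few lines. This is a single-process illustration only, assuming a hypothetical `map_reduce` helper and made-up reader records; real Hadoop runs the map and reduce steps on the nodes that hold the data.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, fold each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each key's values are combined pairwise, like a functional fold.
    return {key: reduce(reducer, values) for key, values in groups.items()}


def emit_reader(record):
    """Mapper: emit (doc_id, 1) for every (doc_id, reader_id) record."""
    doc_id, _reader_id = record
    yield doc_id, 1


# Hypothetical input: one record per (article, reader) pair.
readers = [
    ("doc_id1", "reader_id1"),
    ("doc_id2", "reader_id2"),
    ("doc_id1", "reader_id3"),
]
reader_counts = map_reduce(readers, emit_reader, lambda a, b: a + b)
print(reader_counts)
```

Because the mapper looks at one record at a time and the reducer only sees values for one key, both steps can in principle be run in parallel across many machines.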
23. MapReduce Example: Article Readers by Country
Input (HDFS): large file (150M entries), flattened data, stored across nodes
doc_id1, reader_id1, usa, 2010, …
doc_id2, reader_id2, austria, 2012, …
doc_id1, reader_id3, china, 2010, …
...
Map (pivot countries by doc id):
doc_id1, {usa, china, usa, uk, china, china...}
doc_id2, {austria, austria, china, china, uk …}
...
Reduce (calc. document stats):
doc_id1, usa, 0.27
doc_id1, china, 0.09
doc_id1, uk, 0.09
doc_id2, austria, 0.99
...
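The readers-by-country example could look roughly like this in plain Python. This is a sketch under assumptions: the slide does not define the statistic, so it is taken here to be each country's share of a document's readers; the `reader_id4` row and the function names are hypothetical, and the numbers are not meant to reproduce the slide's figures.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Pivot countries by doc id: collect the reader countries for each document."""
    grouped = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        doc_id, country = fields[0], fields[2]
        grouped[doc_id].append(country)
    return grouped


def reduce_phase(doc_id, countries):
    """Calculate document stats: share of readers per country, rounded to 2 places."""
    counts = Counter(countries)
    total = len(countries)
    return [(doc_id, country, round(n / total, 2))
            for country, n in sorted(counts.items())]


sample = [
    "doc_id1, reader_id1, usa, 2010",
    "doc_id2, reader_id2, austria, 2012",
    "doc_id1, reader_id3, china, 2010",
    "doc_id1, reader_id4, usa, 2011",  # hypothetical extra row
]
stats = [row
         for doc_id, countries in map_phase(sample).items()
         for row in reduce_phase(doc_id, countries)]
for row in stats:
    print(row)
```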
24. Hadoop
➔
HDFS for storing data
➔
MapReduce for processing data
➔
Together, bring the program to the data
26. We make a lot of use of HDFS and MapReduce
➔
Catalogue Stats
➔
Recommendations (Mahout)
➔
Log Analysis (business analytics)
➔
Top Articles
➔
… and more
➔
Quick, reliable and scalable
27. Beware that these benefits have costs
➔
Migrating to a new system (data consistency)
➔
Setup costs
➔
Learn black magic to configure
➔
Hardware for cluster
➔
Administrative costs
➔
Steep learning curve to administer Hadoop
➔
Still an immature technology
➔
You may need to debug the source code
➔
Tips
➔
Get involved in the community (e.g. meetups, forums)
➔
Use good commodity hardware
➔
Consider moving to the cloud...
29. What is AWS?
Amazon Web Services (AWS) delivers a set of
services that together form a reliable, scalable,
and inexpensive computing platform “in the
cloud”
www.aws.amazon.com
30. Why move to AWS?
➔
The cost of running your own cluster can be high
➔
Monetary (e.g. hardware)
➔
Time (e.g. training, setup, administration)
➔
AWS takes on these problems, renting their
services to you based on your usage
31. Article Recommendations
➔
Aim: help researchers to find interesting articles
➔
Combat information deluge
➔
Keep up-to-date with recent movements
➔
1.6M users
➔
50M articles
➔
Batch process for generating regular
recommendations (using Mahout)
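As an illustration of what a batch recommender computes from user libraries, here is a minimal co-occurrence sketch. It is not Mahout's API and not necessarily the algorithm used in production; the user names, article ids, and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(libraries):
    """Count how often each ordered pair of articles appears in the same library."""
    counts = {}
    for articles in libraries.values():
        for a, b in permutations(set(articles), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(user, libraries, top_n=3):
    """Score articles the user lacks by co-occurrence with articles they own."""
    owned = set(libraries[user])
    counts = cooccurrence(libraries)
    scores = Counter()
    for article in owned:
        for other, n in counts.get(article, {}).items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken by article id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


# Hypothetical user libraries (user -> article ids).
libraries = {
    "alice": ["p1", "p2"],
    "bob": ["p1", "p2", "p3"],
    "carol": ["p2", "p3", "p4"],
}
print(recommend("alice", libraries))
```

At 1.6M users and 50M articles the pair counting is exactly the kind of work that needs to be spread across a cluster, which is why it runs as a batch job.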
32. Article Recommendations in EMR
➔
Use Amazon's Elastic MapReduce (EMR)
➔
Upload input data (user libraries)
➔
Upload Mahout jar
➔
Spin up cluster
➔
Run the job
➔
You decide the number of nodes (cost vs time)
➔
You decide the spec of the nodes (cost vs quality)
➔
Retrieve the output
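The cost vs time trade-off when choosing the number of nodes can be sketched as simple arithmetic. The assumptions are loud ones: work that parallelises perfectly, billing per started node-hour, and a made-up price and job size. None of these figures come from Mendeley or from AWS pricing.

```python
import math


def estimate_job(total_node_hours, nodes, price_per_node_hour):
    """Return (wall-clock hours, cost), assuming perfect scaling and hourly billing."""
    hours = total_node_hours / nodes
    billed_hours = math.ceil(hours)          # each started hour is billed in full
    cost = billed_hours * nodes * price_per_node_hour
    return hours, cost


# Hypothetical job: 40 node-hours of work at 0.5 currency units per node-hour.
small_cluster = estimate_job(total_node_hours=40, nodes=4, price_per_node_hour=0.5)
large_cluster = estimate_job(total_node_hours=40, nodes=16, price_per_node_hour=0.5)
print(small_cluster, large_cluster)
```

Under these assumptions the larger cluster finishes four times sooner but costs a little more, because partly used hours are rounded up on every node.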
33. Catalogue Search
➔
50 million articles
➔
50GB index in Solr
➔
Variable load (over 24 hours)
➔
1AM is quieter (100 q/s), 1PM is busier (150 q/s)
34. Catalogue Search in Context of Variable Load
At 1AM, 100 queries/second; at 1PM, 150 queries/second
[Diagram: queries (100/s to 150/s) ➔ AWS elastic load balancer ➔ AWS instances]
➔
Amazon's Elastic Load Balancer
➔
Only pay for nodes when you need them
➔
Spin up when load is high
➔
Tear down when load is low
➔
Cost effective and scalable
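The spin-up and tear-down decision boils down to sizing the pool for the current load. Here is a minimal sketch, assuming a hypothetical capacity of 50 queries/second per instance; it does not call any AWS API, and the function name is illustrative.

```python
import math


def instances_needed(queries_per_second, capacity_per_instance, minimum=1):
    """Smallest number of instances that can absorb the given query rate."""
    return max(minimum, math.ceil(queries_per_second / capacity_per_instance))


CAPACITY_PER_INSTANCE = 50  # hypothetical queries/second one search node can serve

# Load figures from the slides: quieter at 1AM, busier at 1PM.
load_by_hour = {1: 100, 13: 150}
plan = {hour: instances_needed(qps, CAPACITY_PER_INSTANCE)
        for hour, qps in load_by_hour.items()}
print(plan)
```

With these numbers one instance can be torn down overnight, so you only pay for the third node during the busy part of the day.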
35. Problems we've faced
➔
Lack of control can be an issue
➔
Trade-off between administration and control
➔
Orchestration issues
➔
We have many services to coordinate
➔
CloudFormation & Elastic Beanstalk
➔
Migrating live services is hard work
37. Conclusions
➔
Mendeley has created the world's largest scientific
database
➔
Storing and processing this data is a large scale
challenge
➔
Hadoop, through HDFS and MapReduce, provides a
framework for large scale data processing
➔
Be aware of administration costs when doing this in
house
38. Conclusions
➔
AWS can make scaling up efficient and cost
effective
➔
Tap into the rich big data community out there
➔
We plan to make no more substantial
hardware purchases and to use AWS instead
➔
Scaling up isn't a trivial problem; to save pain,
plan for it from the outset