John Hammink's Talk at Great Wide Open 2016. We discuss: 1.) the need for data analytics infrastructure that can scale exponentially and 2.) what such an infrastructure must contain and finally 3.) the need for an infrastructure to be able to handle un - and semi-structured data.
10. A solution?
• There are trade-offs to consider
• Any trade off should make it easy to collect data
• Easy does it! un- and semi-structured data (multi-
structured data)
• Open source means it’s free; also means that you need
someone on hand to maintain and implement
• Cloud storage means you don’t have to scale and/or
shard; tradeoff means performance hit against bare metal
Image: John Hammink
17. # logs from a file
<source>
type tail
path /var/log/
httpd.log
format apache2
tag web.access
</source>
# logs from client
libraries
<source>
type forward
port 24224
</source>
# store logs to ES and
HDFS
<match *.*>
type copy
<store>
type elasticsearch
logstash_format
23. an open-source bulk data loader that helps data
transfer between various databases, storages, file
formats, and cloud services
embulk.org/docs
24.
25.
26. Hivemall
Hivemall is a scalable machine learning library that
runs on Apache Hive.
Hivemall is designed to be scalable to the number
of training instances as well as the number of
training features.
• Classification
• Regression
• Recommendation
• k-nearest neighbor
• Anomaly Detection
• Feature Engineering
https://github.com/myui/hivemall
27. The Hadoop Story on MongoDB
Image courtesy of Steven Francia @ Docker