SlideShare una empresa de Scribd logo
1 de 67
100% Big Data
0% Hadoop
0% Java
Pavlo Baron, codecentric AG
Saturday, April 27, 13
pavlo.baron@codecentric.de
@pavlobaron
github.com/pavlobaron
Saturday, April 27, 13
So here is the short story...
Saturday, April 27, 13
sitting there, listening...
Saturday, April 27, 13
presented as Houdini magic...
Saturday, April 27, 13
so you telling me...
it’s smoke and mirrors?
Saturday, April 27, 13
Smells like a bunch of queues,
pipes and filters...
Saturday, April 27, 13
Looks like some NLP...
Saturday, April 27, 13
Sounds like some math...
Saturday, April 27, 13
Seems like some basic ML...
Saturday, April 27, 13
methinks: I can tinker that.
I have 2 nights in the hotel...
Saturday, April 27, 13
Fire!
Saturday, April 27, 13
Know the use cases...
Saturday, April 27, 13
Consume a feed where people
say what they think before
they think what they say...
Saturday, April 27, 13
Drink Big Data warm, straight
from the fire hose...
Saturday, April 27, 13
Then fork for immediate
notification and batch
analytics...
Saturday, April 27, 13
Some bubbles
queuefeeds filter
sentiment analysis
formalizestore
aggregatereportreact
alert
queue
queue
react
map/reduce
fork
queue
Saturday, April 27, 13
Some tech
Languages: Python, Erlang
Feeds: Tweepy, crawlers, feed readers
Queueing: RabbitMQ through Pika
Store: Riak through protobufs
Map/reduce: modified Disco to run workers on Riak-
nodes data-locally
Active event push to the browser through RabbitMQ’s
Web-STOMP
jVectorMap, jQuery, stomp and sock.js in the browser
wpcorpus built with RabbitMQ, indexed with PyTables
Saturday, April 27, 13
Some math
Analytics: NLP with NLTK
Algo training: nltk-trainer with pickle=true
Algos: naive Bayes, decision tree, binary
classification based on trigram frequencies
simple name and antiword filtering based on
public and own corpora
sentiment analysis based on public and own
corpora
troll filtering using Wikipedia as active
corpus
Saturday, April 27, 13
Some numbers...
...‘cause numbers are sexy
Saturday, April 27, 13
When numbers
become too
sexy for your
[hat|car|cat],
they mutate
into
numbers pr0n
Saturday, April 27, 13
Some numbers, revisited
I’m not into numbers pr0n
numbers need to be just good enough for
what you’re trying to solve
Saturday, April 27, 13
But it’s still the
easiest way to
impress,
especially
without
solving a
concrete
problem
Saturday, April 27, 13
So, finally, some numbers
(on my MBA)
Feed: ~10000 chaotic text msg/min
Store: ~8000 formalized msg/min, N=3,
quorum, 3 nodes
Analytics: ~7000 msg/min (filtered, pos/neg
aggregation, location based aggregation)
Demo: ~1500000 tweets, pos/neg
aggregation, stream processing in ~7min,
map/red in ~15sec
Saturday, April 27, 13
Some lessons learned
Saturday, April 27, 13
The Beliebers...
Saturday, April 27, 13
More than 60% of the
Twitter sample stream is
useless garbage...
Saturday, April 27, 13
Further 20% are trolls...
Saturday, April 27, 13
So I ended up implementing
wpcorpus - active NLP corpus
based on Wikipedia, using its
categories and their
combination as classes and
anti-classes
Saturday, April 27, 13
Real names...
Saturday, April 27, 13
Absurd profile bios...
Saturday, April 27, 13
Location...
Saturday, April 27, 13
Language...
For trigrams in NLTK, use
Spanish as “anti-class” to tell
English/German from the rest
Saturday, April 27, 13
Disco workers on Riak nodes...
PITA and a lot of tinkering, but necessary
for data locality
Extending Disco is relatively easy, but
changing it is hard...
Flooding, asynchronous, separate key/value
listing in low-level Riak goes very well with
Erlang port based Python/Erlang message
exchange in Disco. Not
Low-level vnode-data-consume needed
probabilistic correction due to N=3
Extended Disco to use RabbitMQ between
the worlds (h/t Dan North for the idea)
Saturday, April 27, 13
Mixing Python and Erlang in
one project...
Forgetting punctuation in Erlang code all the
time when quickly switching from Python
Terribly missing pattern matching in Python
Considering to embed Python in Erlang, but
it might become a double PITA then
Saturday, April 27, 13
Sentiment analysis...
Saturday, April 27, 13
Well, actually, strong
sentiment analysis...
Saturday, April 27, 13
Very unreliable given the
human nature...
Saturday, April 27, 13
In addition to the NLTK’s
movie reviews corpus, use
these for “neg” classification
Saturday, April 27, 13
FAQ
Saturday, April 27, 13
Q: Why the heck are you
doing this?
Saturday, April 27, 13
A
Because I can
Because I want
Because I want to learn
Because I want to go deep on low-level
Because I value speed over abstraction
Because it’s very interesting to combine
computer science with math
Saturday, April 27, 13
Q: Why not just use Hadoop?
Saturday, April 27, 13
A
Because I didn’t want to run this on the JVM
Because I have 2 use cases, and only one of
them is suitable for batch map/reduce
Saturday, April 27, 13
Q: Why didn’t you want to
run this on the JVM?
Saturday, April 27, 13
A: well, technically seen,
Big Data area is growing
on the JVM
Hadoop
Pig
Storm, Kafka, Esper
Mahout
OpenNLP
Saturday, April 27, 13
A: but I didn’t want this
Big Data on my drive
~/.m2
Saturday, April 27, 13
A: and I am evaluating some
alternatives to the ecosystem
Saturday, April 27, 13
Q: Why are you queueing at
all? Others do gazillions of
msg/sec without queues
Saturday, April 27, 13
A
I could, if instead of filters and batch analytics of
chaotic text, it would be just about building trivial
sums from fixed-sized ticks
with growable numbers like this, you want to
protect any sort of reliable data store from getting
flooded by writes, RDBMS or NoSQL store
Because I need to do some pipes and filters
Because I’m mixing and crossing borders of data
sources and technologies
Because (almost) all frameworks that you might
consider also do some queueing or buffering
Saturday, April 27, 13
Q: Why did you use Erlang
and Python?
Saturday, April 27, 13
A
Because reliability and distribution are built
into the Erlang VM and I don’t need separate
coordinators or to reinvent the wheel
Because both, Python and Erlang, are
“functional” enough for what I need day-by-
day
Because Python has been for many years the
platform of choice for scientists, thus there
are available clever and mature math
libraries
Because Disco is on Python and Erlang, Riak
and RabbitMQ are on Erlang
Saturday, April 27, 13
Q: isn’t Python slow like hell?
Saturday, April 27, 13
A
it’s not operating at the speed of light
yes, it is slower at some points
I’ve also been testing PyPy to improve
performance for the case I should need it,
‘cause right now it works just fast enough
without explicit bottle-necks in the given
architecture, even on one single MBA
Saturday, April 27, 13
Q: MBA is boring. Can you
make it real web scale?
Saturday, April 27, 13
A
well, to be precise, I’m operating on web
data
I can scale queues with RabbitMQ
I can scale storage with Riak
I can scale the map/reduce supported
analytics with Disco/Riak
I can scale data sources/feeds, machines,
hardware, networks, infrastructure, logins
etc. You name it
Saturday, April 27, 13
Q: what’s in the future?
Saturday, April 27, 13
A
I don’t have my crystal ball with me
I’m toying with the idea to write Pig Latin
engine in Python called “Sau” (German for
pig), to offer data scientists a comfortable
interface and to allow them to run existing
Pig scripts on this stack
I could add more data sources, improve
throughput where necessary and work on
some low level Disco modifications to change
the way it utilizes Erlang in my case
I will integrate my Disco extensions with
Disco upstream one day
Saturday, April 27, 13
Q: what do we learn about
Big Data here?
Saturday, April 27, 13
A
Big Data is about the “what”, followed by
the “how” and enabled by the “what with”
Saturday, April 27, 13
A
It’s about gathering data, filtering out most
of it as garbage, analyzing it, gaining useful
information out of it immediately, finding new
ways to gather and use information and
deriving steps for business improvements,
strategy planning, doing soft intelligence aka
enterprise level stalking or, even more
important, helping make the world a better
place - it’s up to you
Saturday, April 27, 13
A
It’s not about building SkyNet - even if this
will be built one day, it will be pretty
boring. It’s about building recommender and
decision support systems, thus letting
machines do stupid, repeated jobs fast and
human beings make high quality decisions
Saturday, April 27, 13
A
It’s not about plain numbers. It’s about
numbers that are good enough to carry the
solution. Not less, but also not more than
that
Consider the dilemma: if you want to be as
precise and as fast as possible in your “Big
Data”, you don’t crunch it blindly on tons of
machines. Instead, you go at the low-level
and optimize the stack, filter upfront
If it’s not leading to value in near realtime,
it’s useless, though probably Big
Saturday, April 27, 13
A
It’s a huge field for geeks with aspiration to
learn new things, dig into math and computer
science, play with different platforms and
tools and pick the right tool chain
Saturday, April 27, 13
Oh, and did the demo run?
Saturday, April 27, 13
Thank you!
Saturday, April 27, 13
Most images originate from
istockphoto.com
except few ones taken
from Wikipedia or Flickr (CC)
and product pages
or generated through public
online generators
Saturday, April 27, 13

Más contenido relacionado

Destacado

Destacado (11)

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it
 
Madre feliz dia¡
Madre feliz dia¡Madre feliz dia¡
Madre feliz dia¡
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)
 
Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)
 
Expo antiagregantes plaquetarios
Expo antiagregantes plaquetariosExpo antiagregantes plaquetarios
Expo antiagregantes plaquetarios
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
 

Más de Pavlo Baron

Más de Pavlo Baron (12)

data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)
 
Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)
 
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTP
 
Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

100% Big Data, 0% Hadoop, 0% Java (@pavlobaron)

  • 1. 100% Big Data 0% Hadoop 0% Java Pavlo Baron, codecentric AG Saturday, April 27, 13
  • 3. So here is the short story... Saturday, April 27, 13
  • 5. presented as Houdini magic... Saturday, April 27, 13
  • 6. so you telling me... it’s smoke and mirrors? Saturday, April 27, 13
  • 7. Smells like a bunch of queues, pipes and filters... Saturday, April 27, 13
  • 8. Looks like some NLP... Saturday, April 27, 13
  • 9. Sounds like some math... Saturday, April 27, 13
  • 10. Seems like some basic ML... Saturday, April 27, 13
  • 11. methinks: I can tinker that. I have 2 nights in the hotel... Saturday, April 27, 13
  • 13. Know the use cases... Saturday, April 27, 13
  • 14. Consume a feed where people say what they think before they think what they say... Saturday, April 27, 13
  • 15. Drink Big Data warm, straight from the fire hose... Saturday, April 27, 13
  • 16. Then fork for immediate notification and batch analytics... Saturday, April 27, 13
  • 17. Some bubbles queuefeeds filter sentiment analysis formalizestore aggregatereportreact alert queue queue react map/reduce fork queue Saturday, April 27, 13
  • 18. Some tech Languages: Python, Erlang Feeds: Tweepy, crawlers, feed readers Queueing: RabbitMQ through Pika Store: Riak through protobufs Map/reduce: modified Disco to run workers on Riak- nodes data-locally Active event push to the browser through RabbitMQ’s Web-STOMP jVectorMap, jQuery, stomp and sock.js in the browser wpcorpus built with RabbitMQ, indexed with PyTables Saturday, April 27, 13
  • 19. Some math Analytics: NLP with NLTK Algo training: nltk-trainer with pickle=true Algos: naive Bayes, decision tree, binary classification based on trigram frequencies simple name and antiword filtering based on public and own corpora sentiment analysis based on public and own corpora troll filtering using Wikipedia as active corpus Saturday, April 27, 13
  • 20. Some numbers... ...‘cause numbers are sexy Saturday, April 27, 13
  • 21. When numbers become too sexy for your [hat|car|cat], they mutate into numbers pr0n Saturday, April 27, 13
  • 22. Some numbers, revisited I’m not into numbers pr0n numbers need to be just good enough for what you’re trying to solve Saturday, April 27, 13
  • 23. But it’s still the easiest way to impress, especially without solving a concrete problem Saturday, April 27, 13
  • 24. So, finally, some numbers (on my MBA) Feed: ~10000 chaotic text msg/min Store: ~8000 formalized msg/min, N=3, quorum, 3 nodes Analytics: ~7000 msg/min (filtered, pos/neg aggregation, location based aggregation) Demo: ~1500000 tweets, pos/neg aggregation, stream processing in ~7min, map/red in ~15sec Saturday, April 27, 13
  • 27. More than 60% of the Twitter sample stream is useless garbage... Saturday, April 27, 13
  • 28. Further 20% are trolls... Saturday, April 27, 13
  • 29. So I ended up implementing wpcorpus - active NLP corpus based on Wikipedia, using its categories and their combination as classes and anti-classes Saturday, April 27, 13
  • 33. Language... For trigrams in NLTK, use Spanish as “anti-class” to tell English/German from the rest Saturday, April 27, 13
  • 34. Disco workers on Riak nodes... PITA and a lot of tinkering, but necessary for data locality Extending Disco is relatively easy, but changing it is hard... Flooding, asynchronous, separate key/value listing in low-level Riak goes very well with Erlang port based Python/Erlang message exchange in Disco. Not Low-level vnode-data-consume needed probabilistic correction due to N=3 Extended Disco to use RabbitMQ between the worlds (h/t Dan North for the idea) Saturday, April 27, 13
  • 35. Mixing Python and Erlang in one project... Forgetting punctuation in Erlang code all the time when quickly switching from Python Terribly missing pattern matching in Python Considering to embed Python in Erlang, but it might become a double PITA then Saturday, April 27, 13
  • 37. Well, actually, strong sentiment analysis... Saturday, April 27, 13
  • 38. Very unreliable given the human nature... Saturday, April 27, 13
  • 39. In addition to the NLTK’s movie reviews corpus, use these for “neg” classification Saturday, April 27, 13
  • 41. Q: Why the heck are you doing this? Saturday, April 27, 13
  • 42. A Because I can Because I want Because I want to learn Because I want to go deep on low-level Because I value speed over abstraction Because it’s very interesting to combine computer science with math Saturday, April 27, 13
  • 43. Q: Why not just use Hadoop? Saturday, April 27, 13
  • 44. A Because I didn’t want to run this on the JVM Because I have 2 use cases, and only one of them is suitable for batch map/reduce Saturday, April 27, 13
  • 45. Q: Why didn’t you want to run this on the JVM? Saturday, April 27, 13
  • 46. A: well, technically seen, Big Data area is growing on the JVM Hadoop Pig Storm, Kafka, Esper Mahout OpenNLP Saturday, April 27, 13
  • 47. A: but I didn’t want this Big Data on my drive ~/.m2 Saturday, April 27, 13
  • 48. A: and I am evaluating some alternatives to the ecosystem Saturday, April 27, 13
  • 49. Q: Why are you queueing at all? Others do gazillions of msg/sec without queues Saturday, April 27, 13
  • 50. A I could, if instead of filters and batch analytics of chaotic text, it would be just about building trivial sums from fixed-sized ticks with growable numbers like this, you want to protect any sort of reliable data store from getting flooded by writes, RDBMS or NoSQL store Because I need to do some pipes and filters Because I’m mixing and crossing borders of data sources and technologies Because (almost) all frameworks that you might consider also do some queueing or buffering Saturday, April 27, 13
  • 51. Q: Why did you use Erlang and Python? Saturday, April 27, 13
  • 52. A Because reliability and distribution are built into the Erlang VM and I don’t need separate coordinators or to reinvent the wheel Because both, Python and Erlang, are “functional” enough for what I need day-by- day Because Python has been for many years the platform of choice for scientists, thus there are available clever and mature math libraries Because Disco is on Python and Erlang, Riak and RabbitMQ are on Erlang Saturday, April 27, 13
  • 53. Q: isn’t Python slow like hell? Saturday, April 27, 13
  • 54. A it’s not operating at the speed of light yes, it is slower at some points I’ve also been testing PyPy to improve performance for the case I should need it, ‘cause right now it works just fast enough without explicit bottle-necks in the given architecture, even on one single MBA Saturday, April 27, 13
  • 55. Q: MBA is boring. Can you make it real web scale? Saturday, April 27, 13
  • 56. A well, to be precise, I’m operating on web data I can scale queues with RabbitMQ I can scale storage with Riak I can scale the map/reduce supported analytics with Disco/Riak I can scale data sources/feeds, machines, hardware, networks, infrastructure, logins etc. You name it Saturday, April 27, 13
  • 57. Q: what’s in the future? Saturday, April 27, 13
  • 58. A I don’t have my crystal ball with me I’m toying with the idea to write Pig Latin engine in Python called “Sau” (German for pig), to offer data scientists a comfortable interface and to allow them to run existing Pig scripts on this stack I could add more data sources, improve throughput where necessary and work on some low level Disco modifications to change the way it utilizes Erlang in my case I will integrate my Disco extensions with Disco upstream one day Saturday, April 27, 13
  • 59. Q: what do we learn about Big Data here? Saturday, April 27, 13
  • 60. A Big Data is about the “what”, followed by the “how” and enabled by the “what with” Saturday, April 27, 13
  • 61. A It’s about gathering data, filtering out most of it as garbage, analyzing it, gaining useful information out of it immediately, finding new ways to gather and use information and deriving steps for business improvements, strategy planning, doing soft intelligence aka enterprise level stalking or, even more important, helping make the world a better place - it’s up to you Saturday, April 27, 13
  • 62. A It’s not about building SkyNet - even if this will be built one day, it will be pretty boring. It’s about building recommender and decision support systems, thus letting machines do stupid, repeated jobs fast and human beings make high quality decisions Saturday, April 27, 13
  • 63. A It’s not about plain numbers. It’s about numbers that are good enough to carry the solution. Not less, but also not more than that Consider the dilemma: if you want to be as precise and as fast as possible in your “Big Data”, you don’t crunch it blindly on tons of machines. Instead, you go at the low-level and optimize the stack, filter upfront If it’s not leading to value in near realtime, it’s useless, though probably Big Saturday, April 27, 13
  • 64. A It’s a huge field for geeks with aspiration to learn new things, dig into math and computer science, play with different platforms and tools and pick the right tool chain Saturday, April 27, 13
  • 65. Oh, and did the demo run? Saturday, April 27, 13
  • 67. Most images originate from istockphoto.com except few ones taken from Wikipedia or Flickr (CC) and product pages or generated through public online generators Saturday, April 27, 13