May 2012 HUG: The Changing Big Data Landscape
- 2. Agenda
Backdrop
Observations
Solution
Demo
© 2012 Datameer, Inc. All rights reserved.
- 3. Big Data Landscape
Challenge
• Dramatic data growth
• Structured and unstructured data
• Scale economically
• Maintain agility
Enablers
• Low cost storage and CPUs
• Disruptive new technologies
• Availability of cloud infrastructure
Needs
• Democratize data access
• Crowd-source insights
• Just-In-Time Delivery
Source: Forrester
- 4. Hadoop - A Disruptive Response
Advantages
• Economics
• Flexibility
• Scalability
Challenges
• Raw technology, complexity
• Requires significant resources
• No packaged applications
Rapid Adoption
• Led by Yahoo, Facebook, etc.
• Data-driven companies followed
• Fortune 500 rapidly deploying
Goal
• Make Big Data analytics accessible to business users
• Shorten time-to-insight
• Seamless integration to all data types
• Low cost of ownership
• Demystify
- 5. Current State
Volume problem was solved with MPP DBs
• TCO sometimes lower than Hadoop
Variety problem is tractable with Hadoop
People still struggle with velocity
Time-to-insight is too high
Will business agility decline?
- 7. Observations
“The Wild Wild West”
• We’re in a lawless era for data formats
“Amateur Night”
• People re-invent crooked wheels all over their pipelines
“Open Mic Night”
• Dealing with data that talks too much
“Social Data Gold Rush”
• A rush to judgement on social media data leads to silos
- 8. 1. “The Wild Wild West” ...
JSON (Twitter, Facebook, MongoDB, etc)
• Not always well-formed
• Difficult to split raw input (backtrack to what, a ‘{’?)
Sequence Files
• Metadata is completely open-ended
• Triple-packed content (Flume JSON w/ compressed files, etc.)
Raw
• “the delimiter of the week”
‣ \u0001 (Hive)
‣ Þ (DoubleClick)
• Various text encoding schemes
‣ ISO-8859 vs. UTF-8
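The delimiter and encoding problems above can be handled defensively at read time. A minimal Python sketch, assuming hypothetical helpers `decode_line` and `split_record` and a two-step decoding fallback (UTF-8 first, then ISO-8859-1); neither the helper names nor the fallback order come from the slides:

```python
def decode_line(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to ISO-8859-1 (assumed fallback order)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this never fails.
        return raw.decode("iso-8859-1")


def split_record(raw: bytes, delimiter: str) -> list[str]:
    """Split one raw record on a caller-supplied delimiter."""
    return decode_line(raw).split(delimiter)


# Ctrl-A (\u0001) delimited record, as listed for Hive above.
hive_fields = split_record(b"42\x01alice\x01US", "\u0001")

# Thorn-delimited record stored as ISO-8859-1 (capital thorn is byte 0xDE there).
thorn_fields = split_record(b"42\xdealice\xdeUS", "\u00de")
```

The delimiter is passed in rather than guessed, so a new "delimiter of the week" only changes the call site.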
- 9. 2. “Amateur Night...”
Naive collection strategies
• e.g. 1 file per record (Facebook user)
• rudimentary use of batch requests / store-and-forward
Naive ingestion strategies
• e.g. per-minute log ingestion with no compaction --> millions of small files
• Partitioning for ease-of-ingestion, not analytics
‣ e.g. create files/keys/partitions by the server of origin
Naive storage strategies
• Uncompressed, all-String storage of mostly numerical fields
• Shimming compressed SEQ onto big compressed files --> not splittable
• Mixing compression codecs with data formats (e.g. LzoTextInputFormat)
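One way to avoid the small-files outcome described above is to batch small files into larger ones before loading. The sketch below only plans the batches; `plan_compaction` and the 128 MB target are hypothetical choices for illustration, not something the slides prescribe:

```python
def plan_compaction(files, target_bytes=128 * 1024 * 1024):
    """Group (name, size) pairs into batches of at most target_bytes each.

    Files are taken in order, so per-minute logs stay in time order
    within and across batches. A single file larger than target_bytes
    still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would overshoot.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# One day of per-minute logs at 1 MB each: 1440 small files.
minute_logs = [(f"log-{i:04d}.txt", 1024 * 1024) for i in range(1440)]
batches = plan_compaction(minute_logs)
```

Each batch would then be concatenated (and ideally written in a splittable, compressed container) as a single file.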
- 10. 3. “Open Mic Night” ...
Data can be verbose
• e.g. repeating key/value pairs
Semi-structured is the norm
Deep hierarchies that explode unexpectedly
• Even beyond task JVM memory (too many friends/fans!)
Low signal-to-noise ratio
Content in various languages
• Makes sentiment analysis tricky
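Deeply nested records are often flattened before analysis, with a cap on how many list elements are expanded so that one record with too many friends cannot exhaust task memory. A minimal sketch; `flatten` and `max_items` are hypothetical names, and the cap value is arbitrary:

```python
def flatten(value, prefix="", max_items=100, out=None):
    """Flatten nested dicts/lists into {dotted.path: leaf} pairs.

    Lists longer than max_items are truncated; this is a deliberate,
    lossy guard against records that explode in size.
    """
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}{key}.", max_items, out)
    elif isinstance(value, list):
        for index, child in enumerate(value[:max_items]):
            flatten(child, f"{prefix}{index}.", max_items, out)
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        out[prefix.rstrip(".")] = value
    return out


record = {
    "id": "10011666",
    "hometown": {"name": "West Chester, Pennsylvania"},
    "languages": [{"name": "English"}, {"name": "Hindi"}],
}
flat = flatten(record, max_items=1)
```

With `max_items=1` only the first language survives, which shows the trade-off: bounded memory at the cost of dropped detail.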
- 11. Example: FB Profile
{"id":"10011666","name":"Test user","first_name":"Test","last_name":"user","link":"http://www.facebook.com/test.user","username":"test.user","birthday":"09/19/
1983","hometown":{"id":"103102203064024","name":"West Chester, Pennsylvania"},"location":{"id":"","name":null},"bio":"I'm an honorary Sean Connery, born '83\r\n
There's only one of me\r\nSingle-handedly raising the economy\r\nAin't no chance of the record company dropping me\r\nPress be asking do I care for sodomy\r\nI don't
know, yeah, probably\r\nI've been looking for serial monogamy\r\nNot some bird that looks like Billy Connolly\r\nBut for now I'm down for ornithology\r\nGrab your
binoculars, come follow me","quotes":"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for - in order to
get to the job you need to pay for the clothes and the car, and the house you leave vacant all day so you can afford to live in it. -Ellen Goodman\r\n\r\nThe entire economy of
the Western world is built on things that cause cancer.-From the movie \"Bliss\"\r\n\r\nNever give a party if you will be the most interesting person there. -Mickey Friedman\r\n
\r\nAhhh. A man with a sharp wit. Someone ought to take it away from him before he cuts himself. -Peter da Silva\r\n\r\nNow it seems the music Industry's working on
marketing ploys. I remember back when it wasn't about looks or color but about the voice. -Jay Sean\r\n\r\nWhy are you trying so hard to fit in, when you were born to stand
out? -Random\r\n\r\nI think if you're ready to go out with Johnny. Now's the time to tell him about your one month limit. He wont mind he'll apreciate your fresh look on
dating. And once you've dated someone else you can date him again. I'm sure he'll like it. Everyone will appreciate it. You so novel what a good idea. You can keep your time
to your self. You don't need date insurance.You can go out with whoever you want to. Every boy, every boy, in the whole world could be yours. If you'll just listen to my plan\r\n
THE TEENAGE GUIDE TO POPULARITY -Nada Surf\r\n\r\nThe difference between now and the future is simply greater destruction and more universal chaos_-Stephen
Hawking \r\n\r\nIn archaeology you uncover the unknown. In diplomacy you cover the known. -Thomas Pickering\r\n\r\nYou know the disease u get when u get
married..Onegina -Russel Peters\r\n\r\nI saw you standing in my headlights. (Blink, blink, blink.)\r\nI thought I'd run you down for the weight you left on me.\r\nInstead I pushed
rewind, reversed and drove away.\r\nAnd seeing you disappear in my rearview brought to me the word\r\n'Reciprocity!' -Incubus\r\n\r\nFew people are capable of expressing
with equanimity opinions which differ from the prejudices of their social environment. Most people are even incapable of forming such opinions. -Albert Einstein\r\n\r\nNinety-
eight percent of the adults in this country are decent, hard-working, honest Americans. It's the other lousy two percent that get all the publicity. But then--we elected them. -Lily
Tomlin\r\n\r\nWhen You Are Not Practicing, Remember: Someone Somewhere Is Practicing And When You Meet Him- He Will Win\r\n\r\nIf not I, who? If not here, where? If
not now, when?\r\n\r\nAll that is necessary for evil to triumph is for good people to stand by and do nothing -Unknown\r\n\r\nWe are the people our parents warned us
about. -Jimmy Buffett\r\n\r\nNever explain--your friends do not need it and your enemies will not believe you anyway. -Elbert Hubbard\r\n\r\nMy definition of a free society is a
society where it is safe to be unpopular. -Adlai E. Stevenson Jr.\r\n\r\nToo many have dispensed with generosity in order to practice charity. -Albert Camus","work":
[{"employer":{"id":"6185812851","name":"American Express"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":
{"id":"133619273341785","name":"Lead Programmer Analyst"},"start_date":"2012-01"},{"employer":{"id":"190876464341724","name":"Cardiac group"},"position":
{"id":"105630109469647","name":"Executive Producer"},"description":"We create music for Artist Placement and TV/Film.","start_date":"2002-01"},{"employer":
{"id":"6185812851","name":"American Express"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"116439401740213","name":"Senior
Database Administrator"},"start_date":"2007-10","end_date":"2012-01"},{"employer":{"id":"110067355684846","name":"Saint Joseph Hospital"},"location":
{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"202489236428627","name":"Pharmacy IT
Coordinator"},"start_date":"2005-10","end_date":"2007-10"},{"employer":{"id":"110067355684846","name":"Saint Joseph Hospital"},"location":
{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"144703015548786","name":"Pharmacy
Tech"},"start_date":"2001-02","end_date":"2005-10"}],"sports":[{"id":"108606435830479","name":"Karate"}],"favorite_teams":
[{"id":"87169796810","name":"Philadelphia Flyers"},{"id":"93625750491","name":"Philadelphia Phillies"},{"id":"45898408995","name":"Phoenix Suns"},
{"id":"120163518021430","name":"Philadelphia Eagles"}],"favorite_athletes":[{"id":"77922840249","name":"Steve Nash"},{"id":"105590659475179","name":"Wayne
Gretzky"},{"id":"62975399193","name":"Michael Jordan"}],"inspirational_people":[{"id":"106676942701904","name":"Gandhi"}],"education":[{"school":
{"id":"109324275761313","name":"Corona del Sol High School"},"type":"High School"},{"school":{"id":"23680344606","name":"Arizona State
University"},"type":"College"}],"gender":"male","interested_in":["female"],"relationship_status":"Single","religion":"Hinduism (One with all
things)","political":"Liberal (Left of Center)","email":"app+22c90gj.
9hh9d.f7304b58ac646e08b5f0f10a73547e34\u0040proxymail.facebook.com","website":"www.slashdot.org\r\n
www.gizmodo.com","timezone":-7,"locale":"en_US","languages":[{"id":"106059522759137","name":"English"},
{"id":"112969428713061","name":"Hindi"}],"verified":true,"updated_time":"2012-03-22T17:24:25+0000"}
- 12. Example: Email (MBOX)
From common-user-return-16923-apmail-hadoop-common-user-archive=hadoop.apache.org@hadoop.apache.org Thu Aug 20 14:02:59 2009
Return-Path: <common-user-return-16923-apmail-hadoop-common-user-archive=hadoop.apache.org@hadoop.apache.org>
Delivered-To: apmail-hadoop-common-user-archive@www.apache.org
Received: (qmail 83137 invoked from network); 20 Aug 2009 14:02:58 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3)
by minotaur.apache.org with SMTP; 20 Aug 2009 14:02:58 -0000
Received: (qmail 23328 invoked by uid 500); 20 Aug 2009 14:03:14 -0000
Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org
Received: (qmail 23266 invoked by uid 500); 20 Aug 2009 14:03:14 -0000
Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:common-user-help@hadoop.apache.org>
List-Unsubscribe: <mailto:common-user-unsubscribe@hadoop.apache.org>
List-Post: <mailto:common-user@hadoop.apache.org>
List-Id: <common-user.hadoop.apache.org>
Reply-To: common-user@hadoop.apache.org
Delivered-To: mailing list common-user@hadoop.apache.org
Received: (qmail 23254 invoked by uid 99); 20 Aug 2009 14:03:14 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Aug 2009 14:03:14 +0000
X-ASF-Spam-Status: No, hits=-0.0 required=10.0
tests=SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (nike.apache.org: local policy)
Received: from [209.85.219.209] (HELO mail-ew0-f209.google.com) (209.85.219.209)
by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 20 Aug 2009 14:03:05 +0000
Received: by ewy5 with SMTP id 5so181532ewy.36
for <common-user@hadoop.apache.org>; Thu, 20 Aug 2009 07:02:45 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.216.39.85 with SMTP id c63mr1821542web.103.1250776964866; Thu,
20 Aug 2009 07:02:44 -0700 (PDT)
In-Reply-To: <597eea000908200259o8e3bd78l385059f2b5d31555@mail.gmail.com>
References: <597eea000908191855v579b9c4r8baeb638630cfb27@mail.gmail.com>
<e01b80590908192249s5302cd26m7984a32816c0d58c@mail.gmail.com>
<597eea000908200209o176aefacjca2a45369301c296@mail.gmail.com>
<e01b80590908200230x608ad35en5f372a9fd5aba325@mail.gmail.com>
<597eea000908200259o8e3bd78l385059f2b5d31555@mail.gmail.com>
Date: Thu, 20 Aug 2009 15:02:44 +0100
Message-ID: <ac79ea400908200702u309a4fcey9ab1a7b358f313ce@mail.gmail.com>
Subject: Re: File Chunk to Map Thread Association
From: Tom White <tom@cloudera.com>
To: common-user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org
Hi Roman,
Have a look at CombineFileInputFormat - it might be related to what
you are trying to do.
Cheers,
Tom
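Most of the message above is transport metadata. A minimal sketch of pulling out only the fields an analysis is likely to need, using Python's standard `email` package; the shortened sample message and the choice of fields to keep are assumptions for illustration:

```python
from email import message_from_string

# A shortened version of the message; most routing headers are omitted.
raw = (
    "Received: (qmail 83137 invoked from network); 20 Aug 2009 14:02:58 -0000\n"
    "Precedence: bulk\n"
    "Date: Thu, 20 Aug 2009 15:02:44 +0100\n"
    "Subject: Re: File Chunk to Map Thread Association\n"
    "From: Tom White <tom@cloudera.com>\n"
    "To: common-user@hadoop.apache.org\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Hi Roman,\n"
    "Have a look at CombineFileInputFormat.\n"
)

msg = message_from_string(raw)

# Keep a handful of headers plus the body; everything else is dropped.
wanted = ("From", "To", "Date", "Subject")
signal = {name: msg[name] for name in wanted}
signal["body"] = msg.get_payload()
```

For a whole archive, the standard `mailbox.mbox` class can iterate over messages in the same way.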
- 13. What do we need?
- 14. Just-In-Time Supply Chain
ETL (slow) --> Data Warehouse (expensive) --> Business Intelligence (expertise)
- 15. Just-In-Time Supply Chain
ETL (slow) --> Data Warehouse (expensive) --> Business Intelligence (expertise)
Raw Load (fast) --> Hadoop (economical) --> Spreadsheets + drag ‘n drop (self service)
“schema on read”
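“Schema on read” means the raw data is loaded untouched and a schema is applied only when a question is asked. A minimal sketch of the idea; the field names, types, and tab-separated sample lines are invented for illustration:

```python
# Raw lines are stored exactly as they arrived; nothing is parsed at load time.
raw_store = [
    "2012-05-01\talice\t3",
    "2012-05-01\tbob\tn/a",
    "2012-05-02\talice\t5",
]


def read_with_schema(lines, schema, delimiter="\t"):
    """Apply (name, converter) pairs to each raw line at read time.

    Lines that do not fit the schema are skipped rather than failing the read.
    """
    for line in lines:
        parts = line.split(delimiter)
        if len(parts) != len(schema):
            continue
        try:
            yield {name: convert(part) for (name, convert), part in zip(schema, parts)}
        except ValueError:
            continue


# The schema lives with the query, so it can change without reloading data.
schema = [("day", str), ("user", str), ("clicks", int)]
rows = list(read_with_schema(raw_store, schema))
total_clicks = sum(row["clicks"] for row in rows)
```

A different question can bring a different schema (for example, reading the third field as a string to count the `n/a` values) against the same stored lines.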
- 16. A “One Stop Shop”
Compressing “Time-To-Insight”
• Fast: Raw Load
• Self Service: Spreadsheet, Drag and Drop Visualization
• Economical: Hadoop
- 17. What We Do:
- 18. Datameer Capabilities
Seamless Data Integration
• Wizard-based integration
• Structured, semi- and unstructured
• No complex mappings/schemas
• Pluggable data integration API
Powerful Analytics
• Interactive spreadsheet UI
• Cleansing, transformation, analysis
• Over 200 built-in functions
• Pluggable function API
Self-Service Dashboards
• Drag and drop
• Powerful visualizations
• Mash-up anything
• Integrate into existing portals
- 20. Demo...
- 21. Q/A
- 22. Please Download Our Trial Edition!
www.datameer.com