This document discusses Hadoop and its use for managing large and growing amounts of data beyond what traditional systems can handle. It outlines the different layers, technologies, and distributions that make up Hadoop platforms today. It notes that while many organizations initially adopt Hadoop to save money on data management and queries, they then face the challenge of determining the next steps. It recommends asking domain experts what unanswered questions they have and finding ways to obtain the necessary data to answer those questions, either from within or outside the organization. Building data products is also presented as a way for organizations to explore their data assets. A few examples of real-world Hadoop uses are briefly described.
6. Your data is growing beyond your ability to manage & query it
CC flickr kakadu
@wattsteve
7. Save money when asking the same questions of your data
CC flickr martijnsnels
8. Hadoop Customer, “Great, but now what?”
[Figure: Geoffrey Moore’s Technology Adoption Lifecycle (Innovators, Early Adopters, Early Majority, Late Majority, Laggards), with the chasm marked]
10. • Ask your domain experts and LOB folks what unanswered questions they have.
• Where can you get the data you need to answer that question? (Domain experts should know where to get it.)
• Some of this data may be outside your organization (social media, sensor data, data brokerages/marketplaces, web pages) and some of it may be inside.
• If the data for the query doesn’t exist, figure out how to instrument or gather it.
• Pair your domain experts with your data engineers so they can work out how to obtain and massage the data given the types of queries desired.
CC flickr birdwatcher63
11. • Building data products is a similar exercise, except that it involves typical product planning, such as identifying a market.
• This is also a great way for an organization to explore what assets it has within its data.
CC flickr syume
Source: Gartner Hype Cycle - http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
“Big Data is a fad”, “It’s just BI 2.0”, “This is all just hype”, “We can’t figure out how to use it”, “There’s nothing new here”, “It’s not ready”, “Too few support options”, “It’s too hard”
- You’re sharding your RDBMS infrastructure and it’s becoming brittle and a nightmare to maintain.
- Twitter has a good quote: they stated that it used to take them two weeks to run an ALTER TABLE statement.
Using Hadoop for ETL to save money by displacing ETL vendors
Using Hive to offload datasets and their corresponding queries from your EDW and lower your EDW bill
A great way to competitively differentiate with arbitrarily structured data
Hadoop’s power is in its single storage repository and its support for arbitrary data structures. You have the technology to ask any question, as long as you have the data.