Big Data Implementation: Hadoop and Beyond
CONDUCTED BY:
Tabor Communications Custom Publishing Group
IN CONJUNCTION WITH:
Dear Colleague,
Welcome to the first in a series of professionally
written reports and live online events around topics
related to big data, high-performance computing,
cloud computing, green computing, and digital manu-
facturing that comprise the media portfolio produced
by Tabor Communications, Inc.
Most upcoming Tabor Communications reports and
events will be offered at no charge to viewers through
underwriting by a limited group of sponsors. Our first report, titled Big Data Implementation: Hadoop and Beyond, is graciously underwritten by IBM, Omnibond and Xyratex. More
about how these organizations tackle challenges associated with big data can
be found in the sponsor section.
Tabor Communications has been writing about the most progressive technol-
ogies driving science, manufacturing, finance, healthcare, energy, government
research, and more for nearly 25 years. The depth of our new report series extends beyond the limited space we can devote to coverage of specific topics through our regular media resources. The writers of these reports are typically contributing editors for Tabor Communications and/or content partners like Intersect360 Research, which developed the research cited within this report.
The best topics to cover typically come from the user community. We en-
courage you to offer any suggestions for upcoming reports that you feel will be
helpful to you in better reaching your business goals. Please send your ideas to
our SVP of Digital Demand Services, Alan El Faye (alan@taborcommunications.
com). Sponsorship inquiries for future reports would also go to Alan.
Thank you for reading Big Data Implementation: Hadoop and Beyond.
Please visit our corporate website to learn more about the suite of Tabor pub-
lications: http://www.taborcommunications.com/publications/index.htm
Cordially,
Nicole Hemsoth
Editorial Director-Tabor Communications, Inc.
Big Data Implementation: Hadoop and Beyond
Conducted by: Tabor Communications Custom Publishing Group
A Tabor Communications, Inc. (TCI) publication © 2013. This report cannot be duplicated without prior permission from TCI. While every effort is made to assure accuracy of the contents, we do not assume liability for that accuracy, its presentation, or for any opinions presented.
Study Sponsors
Media and Research Partners
By Mike May
The big data phenomenon is far
more than a set of buzzwords
and new technologies that fit into
a growing niche of application areas.
Further, all the hype around big data
goes far beyond what many think is the
key issue, which is data size. While data
volume is a critical issue, the related matters of fast movement and operations on that data, not to mention the incredible diversity of that data, demand a new movement—and a new set of complementary technologies.
But we’ve heard all of this before, of
course. Without context, it’s hard to
communicate just how important this
movement is. Consider, for example,
President Obama’s recent campaign,
which many touted as by far the most
comprehensive Web and data-driven
campaign in history.
The team backing the big data efforts
for Obama utilized vast, diverse wells of
information ranging from GIS, social me-
dia, demographic and other open and
proprietary data to seek out the most
likely voters to lure from challenger Mitt
Romney. This approach to wrangling
big data for a big win worked, although
it’s still a matter of speculation just how
much these efforts tipped the scale.
So let’s take a more concrete set of
Big Data Implementation:
Hadoop and Beyond
In this new era of data-intensive computing, the next generation of tools
to power everything from research to business operations is emerging.
Keeping up with the influx of emerging big data technologies presents its
own challenges—but the rewards for adoption can be of incredible value.
High-performance computing and machine learning help users track their family tree at Ancestry.com.
This screen shot features the Shaky Leaf, which employs predictive analytics to suggest possible new
information in a user’s family tree. (Image courtesy of Ancestry.com.)
measurable examples. Healthcare
data are being processed at incredible
speed, pulling in data from a number
of sources to turn over rapid insights.
Insurance, retail and Web companies
are also tapping into big, fast data
based on transactions. And life sci-
ences companies and research enti-
ties alike are turning to new big data
technologies to enhance the results
from their high performance systems
by allowing more complex data inges-
tion and processing.
There is little doubt that the next
wave of big data technologies will rev-
olutionize computing in business and
research environments, but we should
back up for a moment and take in
the 50,000-foot view to put this all in
clearer context.
The roots of big data are not in the
traditional places one used to look for
the cutting edge of computational tech-
nologies—they are entrenched in the
daily needs of the world’s largest, most
complex and earliest web companies.
As Jack Norris, VP of marketing at Hadoop distribution startup MapR, reminds us, “Google is the pioneer of big
data and showed the dramatic power
and ability to change industries through
better analysis.” He adds, “Not only did
Google do better analysis, but they did
it on a larger dataset and much cheap-
er than traditional alternatives.” Mak-
ing use of big data requires powerful
new computing tools, and perhaps the
best-known one is called Hadoop.
As the case of Google calls to mind,
even though big data are often big, size
alone does not adequately define the
space. Likewise, the field of big data
extends beyond any particular applica-
tion, as indicated by the examples giv-
en above. As Norris explains, “Big data
is not just about the kind or volume of
data. It’s a paradigm shift.” He adds,
“It’s a new way to process and analyze
existing data and new data sources.”
The point at which a computing chal-
lenge turns into big data depends on the
specific situation. For some research and enterprise end users, existing applications are being strained by data complexity and volume overload; for others, performance is lagging due to these factors. Another group of users
simply needs to abandon old applica-
tions, systems or even methodologies
completely and build again to create a
big data-ready environment.
As Yonggang Hu, chief architect for
Platform Computing, IBM Systems
and Technology Group says, “One of
the simplest ways to define big data is
when an organization outgrows their
current data-management systems,
and they are experiencing problems
with the scalability, speed and reliabil-
ity of current systems in the data they
are seeing or expect to get.” When
faced with such a situation, says Hu,
“They need to find a cost-effective
way to move forward.” (See ‘A to Z for
Big Data Challenges.’)
When viewed in total, the scalability,
performance, storage, network and
other concerns continue to mount—
which has opened the door for the
new breed of big data technologies.
This includes everything from fresh
approaches to systems (new appli-
ances purpose-built for high-volume
and -velocity data ingestion and pro-
cessing) to new applications, data-
bases and even frameworks (Hadoop
being a prime example).
The Deluge of Data
to Decode
There is no one definition of a big
data application, but it’s safe to say
that many existing applications need
to be retooled, reworked or even
thrown overboard in the pursuit to
keep pace with data volume, variety
and velocity needs.
Throw in the tricky element of real-
time applications—which require high
performance on top of complex in-
gestion, storage and processing—and
Benefits of BigQuery
To explore big data, people want to ask questions about the information. “Google’s
BigQuery is a tool that allows you to run ad hoc queries,” says Michael Manoochehri,
one of Mountain View, California-based Google’s developer programs engineers. “It
lets you ask questions on large data sets, like terabytes.”
To make this tool accessible to more users, Google implemented a SQL-style
syntax. Moreover, the company had used this tool—in a version called Dremel—internally for years before releasing a form of it as its generally available service, BigQuery, in 2012. “It’s the only tool that lets you get results in seconds from terabytes,” Manoochehri says.
This tool brings big data to many users because it is hosted on Google’s infrastruc-
ture. “You take your data, put it in our cloud and use this application,” Manoochehri ex-
plains. Because of that simplicity, a diverse group of users already employs BigQuery.
As examples, Manoochehri mentions Claritics—another Mountain View–based
company, which describes its services as providing “social gaming and social com-
merce analytics”—and Blue Box—a company in Brazil that provides mobile adver-
tisements. Regarding Blue Box, Manoochehri says, “They started with MySQL, and
when the data got too big they moved to Hadoop.” He adds, “Hadoop is cool, but it
takes setup and administration.” Then, Blue Box tried BigQuery. “It fit,” Manoocheh-
ri says, “because Blue Box is a small shop with lots of data to process. They wanted
to run queries on data and integrate it into their application.”
Various other industries also use BigQuery. As an example, Manoochehri points
out the hospitality industry, which includes hotel and travel data. Overall, he says,
“Almost all of the case studies for BigQuery are small shops, like two or three engi-
neers, that have no time to set up hardware.”
In general, BigQuery’s biggest benefits arise from speed. “It’s optimized for speed,”
Manoochehri explains. It’s fast to set up and it brings back results fast, even from
very big datasets.
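To make the idea concrete, here is a minimal sketch of an ad hoc query as it might look through the google-cloud-bigquery Python client (a present-day client; the interface available when this report was written differed). The project, dataset, table and column names are hypothetical placeholders.

```python
# A minimal sketch of an ad hoc BigQuery query, assuming the
# google-cloud-bigquery Python client and default Google Cloud credentials.
# The project/dataset/table names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# SQL-style syntax over a (hypothetically) multi-terabyte events table.
query = """
    SELECT user_id, COUNT(*) AS event_count
    FROM `my_project.my_dataset.events`
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
"""

# BigQuery runs the scan on Google's infrastructure and streams back rows.
for row in client.query(query).result():
    print(row.user_id, row.event_count)
```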
users are left with a lot of questions
about how to meet their goals. While
there is no shortage of new technolo-
gies that are aimed at meeting those
targets, the challenges can be over-
whelming to many users.
When asked about the hurdles that
users face today in working with big
data, Michael Manoochehri, developer
programs engineer at Google headquarters, says, “They are often looking
at complexity, trying to manage com-
plexity with limited resources.” He adds,
“That’s a big problem for lots of people.”
In the past, the lack of resources kept
many potential users out of big data.
As Manoochehri says, “Google and
Amazon and Microsoft could handle
big data,” but not everyone. Today, big
data tools exist for many more users.
For example, Manoochehri says, “Our
products make these seemingly impos-
sible big data challenges more acces-
sible.” (See ‘Benefits of BigQuery.’)
A prime example of this can be
found in large-scale data mining,
which can quickly turn into a big data
problem, even if it’s a process that
used to fit neatly in a relational data-
base and commodity system. For life
sciences companies, data mining has
always been a key area of computa-
tional R&D, but the addition of new,
diverse, large data sets continues to
push them toward new solutions.
The data mining, analytics and pro-
cessing challenges for life sciences are
great. Genomics explores the chromo-
somes of a wide range of organisms—
everything from seeds to trees and hu-
mans. Chromosomes consist of genes,
which code for proteins, and nucleotides
make up the genes. The nucleotides
are arranged, more or less, like beads
on a necklace, and the order of the
nucleotides and the types—they come
in four forms—are like letters spelling
out words. In the case of genomics,
the words translate into proteins, which
control all of the biochemical processes
that fuel life. Hidden in those words are
complex programming and process-
ing challenges as life sciences users
attempt to piece these bits together to
find “needle in the haystack” discov-
eries amidst millions (and more often,
billions) of pieces at a time.
Sequencing determines the order
and identity of the nucleotides, and
that information can be used in basic
research, biotechnology, medicine and
more. Today’s sequencing machines
produce data so fast that this quickly
generates big data. In 2010, for ex-
ample, BGI—a sequencing center in
Beijing, China—bought 128 Illumina
HiSeq 2000 sequencing platforms. Ac-
cording to specifications from Illumina,
one of these machines can produce up
to 600 gigabytes of data in 8.5 days.
So 128 of these sequencing machines
could produce about one-quarter mil-
lion gigabytes of data a month, and
that’s big data. Today’s sophisticated
life science tools and big data applica-
tions also impact the healthcare indus-
try. (See ‘Boosting Big Pharma.’)
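As a quick back-of-the-envelope check of those sequencing figures, the arithmetic runs as follows (a sketch using the numbers cited above, with a 30-day month as an approximation):

```python
# Back-of-the-envelope check of the sequencing throughput cited above:
# 128 machines, each producing up to 600 GB per 8.5-day run.
machines = 128
gb_per_run = 600
days_per_run = 8.5
days_per_month = 30  # approximation

gb_per_month = machines * gb_per_run * (days_per_month / days_per_run)
print(f"{gb_per_month:,.0f} GB per month")  # ~271,059 GB: about a quarter-million
```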
The challenges across the com-
putational spectrum are not hard to
point to in such a case, especially
when it comes to large scale genomic
sequence analysis. With the terabytes
of data rolling off sequencers, the vol-
ume side of the big data equation is
incredible, necessitating innovative
approaches to storage, ingestion and
data movement. In terms of the nee-
dle in a haystack side of the analytics
Boosting Big Pharma
“The big data question hasn’t really been fully exploited in big pharma at the mo-
ment,” says Gavin Nichols, vice president of strategy and R&D IT at Quintiles in
Research Triangle Park, North Carolina. Nonetheless, many big data opportunities
exist in healthcare. “If you took the big data around genomics, the ability to master
investigator and doctor information, the information from clinical trials and what goes
on drug labels and much more, you get a picture of where you could influence the
field,” says Nichols. In short, big data could impact the pharmaceutical industry from
discovering new drugs to commercializing them. “There is the ability to influence the
whole lifecycle,” Nichols says, “but that has not been fully realized.”
Quintiles, on the other hand, started working with big data some time ago. “For the
past seven or eight years,” Nichols says, “we’ve been very strategic about the data
that we consider.” This includes both enterprise and technical computing. For ex-
ample, Quintiles analyzes large datasets in an effort to improve efficiencies across its
various organizations. Likewise, the company takes advantage of electronic health
records (EHRs) and registries in an effort to better understand the populations that
need new pharmaceuticals.
As one crucial example, Quintiles uses big data in an effort to make clinical trials
more efficient. In a late-phase trial, for example, better use of EHRs could reduce
time and cost. “Instead of going to a trial site and asking hundreds of questions, like what
drugs the people take,” Nichols explains, “we could ask the EHRs those questions,
and maybe we can answer 40 percent of the questions.” Beyond increasing efficien-
cy, a similar approach could also enhance the information gained from a clinical trial.
“Information from EHRs could help us design more-targeted studies on a smaller
population,” Nichols says, “and that’s a win-win for everyone.”
Although big data offers a gold mine to the pharmaceutical industry, companies
must use this resource wisely. “Big data in the life sciences without good clinical
interpretations could be a hazard,” Nichols says. “So we bring data together and
enrich it.” Quintiles combines data from, say, clinical trials with the experience of
clinical experts.
To make the most of big data opportunities, the experts at Quintiles consider vari-
ous options. “We look long and hard at things,” Nichols says. “We’re looking at Ha-
doop, and we’ve also used Solr on a SQL frontend.” He adds, “We’re also looking at
techniques developed by companies that are building on top of Hadoop.” For patient
privacy, however, Quintiles segregates analytics from the raw data.
In many cases, though, each pharmaceutical question requires its own big data
implementation. For efficiency, Quintiles relies heavily on data scientists. “By using
research analysts—individuals who sit between the knowledge of what the data are
and the analytical approaches—we can see if we’ve examined similar questions in
the past,” Nichols says. “Then, we might be able to go back to some application that
you’ve used before.” Nonetheless, Nichols adds, “Clinical research has more and
more one-off questions.”
problem, standard software that fits
into neat boxes is no longer able to
keep pace with the incredible speed
and complexity of the puzzle-piecing.
Hence life sciences is one of the key
use cases representing the larger
challenges of big data operations.
Big data goes beyond our bodies: another example of data-intensive computing in action could soon change our homes.
The “smart grid” is another hot ap-
plication area for a wide range of big
data technologies across the entire
stack. The US Department of En-
ergy describes the smart grid as “a
class of technology people are us-
ing to bring utility electricity delivery
systems into the 21st century, using
computer-based remote control and
automation.” Of course, without ample processing, and without frameworks and applications designed from the ground up to handle the many diverse data feeds that power the smart grid, the project couldn’t be a success.
Making the smart grid work requires
real innovation on the data front,
according to Ogi Kavozovic, vice
president of strategy and product at
Opower in Arlington, Virginia. “Smart
grid deployment will create a three-
orders-of-magnitude jump in the
amount of energy-usage data that the
utility industry will have,” he says.
The energy exec went on to note
that when this data spike happens,
“Then there will be a several-orders-
of-magnitude jump above that from
energy-management devices that go
inside people’s homes, like Wi-Fi ther-
mostats that read data by the sec-
ond.” Opower is already working with
Honeywell, the global leader in ther-
mostats, to make such devices. “This
is all just emerging now, and the utility
industry is preparing for it,” Kavozovic
says. “The data presents an unprec-
edented opportunity to bring energy
management and efficiency into the
mainstream.” (See ‘Evolution in En-
ergy Efficiency.’)
For the smart grid and energy sector,
moving to big data technologies is a
priority because it will enable new, criti-
cal efficiencies that can create a new
foundation for energy consumption and
distribution. But of course, the frame-
works, both on the hardware and pro-
gramming side, need to be in place.
Data-Based Decisions
In the field of big data, the wide
range of applications mirrors the sorts
of computing situations that gener-
ate big data challenges. Some users
run into big data because of han-
dling many files. In other cases, the
challenge comes from working with
very large files. Big data issues also
emerge from extensive data sharing,
allowing multiple users to explore or
analyze the same data set.
“Time plays a role in big data,” says
Steve Paulhus, senior director, strate-
gic business development at Xyratex.
“Depending on how the data is used, it
may be valuable today or it may become
valuable tomorrow. For instance, with
financial information, the data can lose
Evolution in Energy Efficiency
Utility companies that want to use big energy data to enhance efficiency can turn
to Opower in Arlington, Virginia. “We partner with about 80 utilities around the world
to help their customers understand and control their energy usage through our tools,”
says Anand Babu, lead on all data and platform marketing efforts at Opower. With
these tools, customers can compare their energy usage to that of their neighbors
and learn to improve their energy efficiency. Likewise, this use of big energy data can
help utilities reduce peak energy demands, which can reduce the need to develop or
purchase more energy.
To suggest some of the volume of big energy data, Babu says, “In the United
States alone, consider what happens once every one of our 120 million households
has a smart meter and WiFi thermostat. That’s 120 million times 100,000 data points
per month.” That’s 12 trillion data points acquired every month. Keep in mind too
that most households today only get one monthly meter read. So big energy data
includes high volume and fast growth.
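A quick check of Babu’s arithmetic (a sketch using the figures as cited above):

```python
# Sanity check of the smart-meter arithmetic cited above:
# 120 million households, each generating ~100,000 data points per month.
households = 120_000_000
points_per_household_per_month = 100_000

points_per_month = households * points_per_household_per_month
print(f"{points_per_month:.2e} data points per month")  # 1.20e+13, i.e., 12 trillion
```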
As utilities start to roll out smart meters, most companies lack the infrastructure
to store the data. That’s one place where Opower comes in on big energy data. This
software-as-a-service company provides the infrastructure and the analytical exper-
tise. “We operate a managed service where utilities securely transfer data to us, and
we do the analytics,” Babu explains.
To provide this service, Opower uses several tools. Babu says, “We are the biggest
user of Hadoop in our industry, operating at the scale of nearly 50 million homes.” In
other cases, different tools work better. “For some things, we still use MySQL,” Babu
explains. “For example, to manage how we communicate with certain segments of
the customer base, that can be done more efficiently using a traditional database.”
“The exciting thing here is that customers see value on day one,” Babu says, “and
they can start to take action right away.”
In “Different Perspectives on Big Data: A Broad-Based, End-User View” from Intersect360 Research,
people polled rated a series of big data challenges on a scale of 1–5, ranging from ‘not a challenge’ to an
‘insurmountable challenge,’ respectively. As the results show, users run into the most trouble managing
many files and performing computations and analyses fast. (Image courtesy of Intersect360 Research.)
Challenging dimension of Big Data                          Business      Technical     All Data
                                                           Avg.  %4/5    Avg.  %4/5    Avg.  %4/5
Management of very large files                             2.7   25%     3.5   56%     3.3   46%
Management of very large number of files (many files)      3.5   58%     3.9   70%     3.7   66%
Providing concurrent access to data for multiple users     3.1   40%     3.5   52%     3.4   48%
Completing computation or analysis within a short
  timeframe (short lifespan of data)                       3.8   65%     3.7   61%     3.7   62%
Long-term access to data (long lifespan of data)           3.6   61%     3.5   56%     3.5   57%
Number of respondents                                      102           203           305

Source: Intersect360 Research, 2012
value if you don’t get it right now. In other
cases, like weather data, you might want
to look at it over a long period of time.”
(See ‘Crafting Storage Solutions.’)
Addison Snell, CEO of Intersect360
Research in Sunnyvale, California, de-
scribes another example where data
must be stored for a long time: “A pe-
diatric hospital, for example, keeps in-
formation for the life of a patient plus
some additional number of years.”
Depending on the particular ap-
plication and the kind of data that it
includes, end-users must select a
system that works for them. No sin-
gle solution handles every big data
problem. That said, most users of big
data face one consistent problem: in-
put/output scalability. “From a
technical perspective,” says
Snell, “this is the number one
big data problem.”
Moreover, I/O scalability im-
pacts both technical and en-
terprise computing. Technical
computing often involves a
user’s top-line mission, such
as car design for an automotive
manufacturer or drug discovery
for a pharmaceutical company.
Enterprise computing, on the
other hand, consists of the steps
that keep a company running,
such as handling the mechanics
of an e-commerce website.
In the past, technical comput-
ing focused almost entirely on
unsolved problems. As Snell ex-
plains, “You don’t need to design
the same bridge twice.” Since
this area of computing moves
to harder and harder problems,
the users quickly adopt new
technologies and algorithms. In
enterprise computing, by com-
parison, once users find solu-
tions, they stick with them. “In
enterprise computing, there are
people using approaches that
are 30 years old because they
still work,” says Snell.
In many ways, the big data
era changes this distinction. By
definition, big data scenarios
push an end-user to a point
where previous methods won’t
work. This applies to technical
and enterprise computing. “So,
for the first time, you have enter-
prise-class customers viewing
performance specifications of
infrastructure as the dominant
purchase criteria,” Snell says.
The purchase criterion used the
most is I/O performance.
In addition to I/O performance
scalability, users must ask other
questions about their big-data
Computing Family Connections
At Ancestry.com, data come in many forms. This family-history company stores data from
historical records, such as census data and military and immigration records. This historical
information also includes birth and death certificates. Overall, the historical data at Ancestry.
com consists of around 11 billion records. In addition, this web-based company stores user-
generated data.
“When users come to this site, their goal is to build a family tree,” explains Leonid Zhukov,
the company’s director of machine learning and development. In this process, a user creates
a profile of themselves, plus their parents, grandparents and so on. “We have more than 40
million family trees and 4 billion people represented here,” Zhukov says. Users can also upload
images, such as photographs or even handwritten stories, and Ancestry.com stores about 185
million of these records. On top of all of that, this site keeps track of user activity. “On aver-
age, Ancestry.com servers handle 40 million searches a day,” says Zhukov, “and we log user
searches and use it in machine learning.”
The kinds of data being stored at Ancestry.com also evolve over time. As an example of that,
the company can now collect a user’s DNA data. The company provides the DNA test and pro-
cesses the information. As the company explains: “AncestryDNA uses recent advances in DNA
technology to look into your past to the people and places that matter to you most. Your test
results will reach back hundreds—maybe even a thousand years—to tell you things that aren’t
always in historical records, but are recent enough to be important parts of your family story.”
This range of data—a variety of types from various sources—can only be stored and analyzed
with powerful tools. “There’s no black-box solution for our data challenges,” Zhukov explains. “We
use different pipelines for processing different types of data. Plus, we deal with big and versatile
data, so we need to do lots of hand work on it.” He also points out that moving so much data
creates challenges. “We must customize servers and how they are connected—where to collect
data and where to process it,” Zhukov says. “The logistics of the data are quite important.”
Some of the most interesting data analysis at Ancestry.com involves machine learning. This
works on two types of objects: a person or a record, which is any document with information
about that person. Then, users can find information about ancestors by searching for names,
places of birth and so on. These focused searches provide precise results. On the other hand,
users can also turn to exploratory analysis. “In this case, a user is looking for suggestions,”
says Zhukov. “It’s similar to when you log on Amazon, and they show you the books that you
might be interested in.” So instead of a search, the exploratory analysis provides suggestions
of records or family-tree information based on what the user has done in the past, and that
depends on machine learning.
The system at Ancestry.com creates the suggestions for exploratory analysis offline. “We
train the system to make suggestions based on the feedback of previous suggestions that were
accepted or rejected,” Zhukov explains. “The system looks for similarity. It asks: How similar is
a person to a particular record?” The software uses a range of features to make these compari-
sons. It can start with a first and last name, but maybe a document includes a misspelling, and
then the system relies on a phonetic comparison. Also, a record could have been handwritten,
scanned and turned into data with optical character recognition. “You can imagine how many
mistakes there are in hand-written historical records,” Zhukov says, “so we do fuzzy matching
instead of exact matches.” The features of a record serve as the input and the training data for
machine learning come from the users rejecting or accepting the suggestions.
For many parts of this processing, Ancestry.com uses Hadoop. As Zhukov says, “Hadoop
is really great technology, but it’s not at the point where you can open the box and things
miraculously work.” Consequently, this company relies on Hadoop experts and also experts
who understand how to search huge databases for similarities between disparate sources of
information. As Zhukov adds, “Hadoop is not only for data storage, but it also provides a data-
processing pipeline. So you need to feed it with the right things.”
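To make the fuzzy-matching idea concrete, the sketch below blends edit-distance similarity with a simplified phonetic code, using only the Python standard library. It is illustrative only; the weights and the simplified Soundex are assumptions, not Ancestry.com’s actual pipeline.

```python
# A minimal sketch of fuzzy name matching of the sort described above:
# blend edit-distance similarity with a simplified Soundex phonetic code.
# Illustrative only; the 0.7/0.3 weights are arbitrary assumptions.
from difflib import SequenceMatcher

def soundex(name: str) -> str:
    """Simplified Soundex: similar-sounding names get the same 4-char code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded, last = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        last = code
    return (encoded + "000")[:4]

def name_similarity(a: str, b: str) -> float:
    """Score two names: string similarity plus a phonetic-match bonus."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic = 1.0 if soundex(a) == soundex(b) else 0.0
    return 0.7 * ratio + 0.3 * phonetic

print(name_similarity("Smith", "Smyth"))  # high: a misspelling, same sound
print(name_similarity("Smith", "Jones"))  # low: different name entirely
```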
infrastructure needs. For example, Ken
Claffey, senior vice president and gen-
eral manager, ClusterStor at Xyratex,
asks: “Once you have all of your data
in a common environment you need to
think about the reliability, availability and
serviceability of that infrastructure as you
want that data to be consistently avail-
able.” He adds, “You need to think about
the constant with big data—multi-faceted growth in capacity, data sources, data
types, number of clients, applications,
workloads and the new ways to use the
data. It is key to ask yourself how are you
going to efficiently manage such multi-faceted exponential growth over the next
number of years against the constraints
of available data scientists and data cen-
ter space, power, cooling and budget.
Your big data infrastructure needs to be
scalable to meet performance and ca-
pacity demands, flexible to respond to
change and extremely efficient.” Delving even deeper, Norris noted that when looking at new architecture and what’s possible, the real focus is on what types of applications will be used: “What data sources are we going to exploit? What is the desired outcome?” In short, a
user must answer questions like these
to know where to start. After that, a user
decides how to integrate new infrastruc-
ture into an existing computing environ-
ment. “How does the new computing fit
my current security?” Norris asks. “How
do I make it enterprise grade?” As Nor-
ris points out, a user might not start ask-
ing these questions until working on the
second or third big data application and
sharing it across a group of users. “You
can generalize about the best platform,
but you have to have it working well
with the existing environment and ap-
plications,” Norris says. “So you’ll end
up with a little bit of a hybrid solution.
You’ll find yourself asking: How do I take
advantage of this new technology, but
on a platform that delivers what I need
in terms of reliability, ease of integration
and so forth?”
Among the frameworks emerging in the key sectors we’ve discussed are Hadoop and MapReduce, which, again, were born out of the largest-scale data operations the world has seen to date—the search engines and web companies.
Mapping the MapReduce
Paradigm
For a wide range of industries, MapRe-
duce makes up one of the most com-
mon big data application frameworks.
This often gets implemented as Hadoop
MapReduce, which makes it easier for a
user to develop applications that handle
large data sets—even in multiple tera-
bytes—on computer clusters.
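To ground the paradigm, here is a minimal, self-contained sketch of the MapReduce pattern, the classic word count, in plain Python. On a real Hadoop cluster the map and reduce steps would run as distributed tasks across many nodes (for example, via Hadoop Streaming); running them locally here simply illustrates the programming model.

```python
# A minimal, self-contained sketch of the MapReduce pattern: word count.
# On a real cluster, map and reduce would run as distributed tasks;
# running them locally here only illustrates the programming model.
from collections import defaultdict

def map_phase(document: str):
    """Emit (key, value) pairs: one ('word', 1) per occurrence."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Combine all values for one key into a final result."""
    return key, sum(values)

documents = ["big data is big", "hadoop handles big data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```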
The complete range of big data ap-
plications could always remain in flux.
As Boyd Wilson, executive at Omnibo-
nd in Pendleton, South Carolina, says,
“As you iterate on various methods of
teasing out the information from the
data, that learning can lead to new ar-
eas of big data.” As an example, Wilson
mentions cross-disciplinary analytics.
Overall, he says, “The big data
sweet spot is lots of—potentially di-
verse—data that you want to com-
bine to learn new correlations, and
you need high-performance medium,
something like OrangeFS, to aggre-
gate and analyze in large volume and
with high throughput.” (See ‘A Big-Data File System for the Masses.’)
As this schematic indicates, a wide range of information plays a part in big energy data. Providing benefits to both utilities and customers requires ad-
vanced storage and analytics. (Image courtesy of Opower.)
That “sweet spot” covers a lot of
ground: consumer-behavior analysis,
dynamic pricing, inventory and logis-
tics, scientific analysis, searching and
pattern matching, and more. More-
over, these applications of big data
arise in many vertical markets, includ-
ing financial services, retail, transpor-
tation, research and development, in-
telligence and the web.
In some cases, the use of big data
will come from some unexpected plac-
es. For instance, Leonid Zhukov, direc-
tor of machine learning and develop-
ment at Ancestry.com, says, “We’re a
technology company, and we do fam-
ily history, but we are a big data com-
pany.” He adds, “We use data science
to drive family-history advances. So
we’re hiring data people.” (See ‘Com-
puting Family Connections.’)
As MapReduce and Hadoop con-
tinue to mature, these use cases will
continue to extend further into main-
stream enterprise contexts, but for
now, there are some bigger problems
to address.
Hadoop’s High Points
Although Hadoop only made up
slightly more than 10 percent of the
applications mentioned in Inter-
sect360 Research’s poll (see below
for details), many experts endorse
this approach. When asked about the
benefits of applying a Hadoop solu-
tion, Norris of MapR says, “The way
to summarize the benefits of Hadoop
is: It’s about flexibility; it’s about pow-
er; it’s about new insights.” (See ‘M’s from MapR.’)
In terms of flexibility, according to
Norris, Hadoop does not require that
you understand ahead of time what
you’re going to ask. “That seems sim-
ple,” he says, “but it’s a really dramatic
difference in terms of how you set up
and establish an application.” He adds,
“Setting up a typical data warehouse is
initially a large process, because you
are asking how you will use the data,
and that requires a whole set of pro-
cesses.” With Hadoop, on the other
hand, a user can just start putting in
data, according to Norris. “It’s really
flexible in the analysis you can perform
and the data you can perform it on,
and that really surprises people.”
Norris points out several fea-
tures of the power behind Hadoop.
“You can use different data sets, and
the volume that you can use is quite
broad.” He adds, “This eliminates the
sampling. For example, you can store
a whole year of transactions instead of
just the last 10 days. Instead of sam-
pling, you just analyze everything.”
By using such large volumes of data
in analytical procedures, users can
apply new kinds of analysis and also
more easily identify outliers. Because
of that ability to work with all of the
data, says Norris, “Relatively simple
approaches can produce some dra-
matic results.” For instance, just the
added ability to explore outliers could
reveal some pleasant surprises, may-
be some tactic that generated an un-
expectedly high amount of revenue.
On the other hand, analyzing outli-
ers can reveal failures. As an exam-
ple, Norris says, “You might find an
anomaly like an expensive part of the
supply chain that you want to elimi-
nate. Overall, analyzing these outliers
makes dramatic returns possible.”
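As a toy illustration of the full-dataset outlier scan Norris describes, the sketch below scores every record rather than a sample and flags the extremes. The revenue figures and the 1.5-standard-deviation threshold are invented for the example.

```python
# A toy full-dataset outlier scan: score every record, no sampling,
# and flag anything more than 1.5 standard deviations from the mean.
# All figures below are made-up illustration data.
from statistics import mean, stdev

revenue_by_tactic = {
    "email": 1200, "search": 1350, "display": 1100,
    "referral": 1280, "partner_promo": 9800,  # the surprise outlier
}

values = list(revenue_by_tactic.values())
mu, sigma = mean(values), stdev(values)

outliers = {tactic: rev for tactic, rev in revenue_by_tactic.items()
            if abs(rev - mu) > 1.5 * sigma}
print(outliers)  # {'partner_promo': 9800}
```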
In terms of new insights, says Norris,
“With Hadoop, you can recognize pat-
terns much more easily and drive new
applications.” He also adds that Hadoop
M’s from MapR
MapR in San Jose, California, developed a series of Apache Hadoop solutions,
including M3, M5 and, most recently, M7. The M7 version adds capabilities for HBase, the Hadoop database. “This provides real-time database operations side by side
with analytics,” says Jack Norris, the company’s vice president of marketing. “This
provides lots of flexibility for the developers, because you can run applications from
deep analytics to real-time database operations.”
All of the MapR tools make it possible to treat a Hadoop cluster like standard stor-
age. “Consequently, you can use industry-standard protocols and existing tools to
access and manipulate data in the Hadoop clusters,” Norris explains. “This makes
it dramatically simpler and easier to support and use.” This tool can also be used as
an enterprise-grade solution.
For all of the MapR editions, the developers aimed at making Hadoop easy, fast
and reliable to use. For example, these tools provide full data protection. Norris adds,
“In terms of hardware systems, the beauty of MapR is that it supports standard,
off-the-shelf hardware. So you don’t need a specialized server.” He continues, “You get
better hardware utilization with MapR, because, for example, you can use a greater
density of disk drives and faster drives.”
To get an idea of how MapR can be applied, consider these case studies with
M5. “One of the largest credit-card issuers is using M5 to create new products and
services that better detect fraud,” says Norris. “This includes some broad uses of
analytics that benefit the cardholders.” Some large online retailers also use M5 to
better analyze shopping behavior and then use that to recommend items or simplify
the user’s experience. Other websites also use M5 in digital advertising and digital-
media platforms. For example, says Norris, “Ancestry.com is using it as part of their
service.” Norris also points out: “Manufacturing customers use M5 for basically ana-
lyzing failures. They collect sensor data on a global basis and detect precursors to
problems and schedule maintenance to avoid or prevent downtime.” In summary,
Norris calls the MapR users “really, really broad.”
A user just starting with big data might start with Apache Hadoop. “When you
really put this into production use,” Norris explains, “having enterprise grade with
advanced features is a real benefit.” Some of the advanced features of MapR tools
include a reduced need for service. For instance, the MapR platforms are self-heal-
ing.
To ensure that MapR platforms include the features that users really want and
need, the company works with an extensive network of partners. “In some cases,
these partners provide vertical expertise within a market or even do consulting work
to identify likely use cases to help us select the best implementations of MapR,”
Norris says. In addition, this partner network also helps develop some of the training
available from the MapR Academy, which provides free training. “We want to ensure
that users can easily share information and applications on a cluster,” Norris says.
“That is basically a unique value proposition of MapR.”
can be an economical solution. “A high-
end data warehouse can cost $1 million
per terabyte, whereas Hadoop is only
hundreds of dollars per terabyte.”
At Ancestry.com, Zhukov and his
colleagues use Hadoop to solve
many problems. “For different data
types and different business units, we
use Hadoop in different ways,” Zhu-
kov explains. For example, the team
that provides optical character rec-
ognition for documents scanned into
Ancestry.com uses Hadoop for that.
Also, when this company acquires
newspaper data that are structured
and in a text format, facts—such as
dates, names and places—must be
extracted, and Hadoop provides the
technology for the parallel processing
that achieves this. Zhukov adds, “In
the DNA pipeline, where we compare
customer DNAs and do matching for
relatives, we look at the percentage
of overlap and ethnicity. That’s a big
data job.” Hadoop also gets used at
this company for the machine learn-
ing that generates the possible family
connections for users. As Zhukov con-
cludes: “We do analytics in more of a
traditional setting, but we are turning
to Hadoop for all business units in the
company.”
Emerging Options
Hadoop does not solve every big
data problem or make the best fit for
every user. In some cases, the actual
implementation of Hadoop creates
the biggest hurdle. As Zhukov says,
“If you compare Hadoop to tradition-
al database systems, it’s in the early
stages of its technology. So there’s
lots of tweaking and twisting for it to
run smoothly.”
In fact, the Hadoop challenges vary
by user. “The biggest challenge really
depends on the use case the customer
has,” says Hu of IBM. For example, he
says, “If they are just analyzing data
in some standalone system and they
have Hadoop clusters that operate in
isolation, then the challenge is just to
keep those clusters up and running.”
Hu also points out some general
concerns with Hadoop. He says, “It
has challenges in single points of
failure, and massive amounts of sys-
tem must be stitched together, which
makes system provisioning and man-
agement a challenge.” Beyond that,
he lists several other issues, including
the latency of execution in an applica-
tion and the inability to execute work-
loads outside the Hadoop paradigm.
To explore the general concerns even
more, Hu directs readers to “Top 5
Challenges for Hadoop MapReduce
in the Enterprise.”
Beyond concerns of implementa-
tion and execution, Zhukov points
out that it’s not easy to hire Hadoop
experts. As he says, “Here in Silicon
Valley, it’s proved to be quite difficult
to hire people with Hadoop skills. So
we’re trying to grow and train our own
people to work with Hadoop.”
Sometimes, users assume that Ha-
doop is the only big data option. As
Snell says, “In many cases, we find
that Hadoop is a technology in search
of solutions.” To explain this, he says,
“Imagine that I’ve designed a new
screwdriver, and I’m trying to find how
many things I can do with it. In prac-
tice, you might be able to drive in a
Phillips screw with a flathead screw-
driver, but the flathead screwdriver
might not have been the best tool.”
Likewise, Hadoop is not the best tool
for every big data situation.
Other options keep emerging for us-
ers. In some cases, companies even
develop new approaches to creating
the new tools. For example, Omni-
bond makes all of its advanced tools
available to the open-source commu-
nity. As Ray Keys, the company’s mar-
keting manager, explains, “We have a
very unique business model. It’s pure
open source, and pushing everything
out to the open-source community
is really powerful.” He adds, “There
is goodwill involved, but it’s goodwill
from people we do development for,
and they all realize that it’s goodwill
for the entire community. As a result,
the community contributes, as well.”
Beyond just goodwill, this company’s
OrangeFS provides a file system that
can handle millions and millions of files
in a single directory without reducing
the performance of the solution.
End-User Applications
To get a quantitative grasp of how
users apply big data, Intersect360 Re-
search conducted a survey and pro-
duced this report: “Different Perspec-
tives on Big Data: A Broad-Based,
End-User View.”
This study arose from surveying—in
partnership with the Gabriel Consult-
ing Group in Beaverton, Oregon—
more than 300 users of big data
applications, which included both
high-performance computing (HPC)
and non-HPC. The survey sampled a
wide range of users, including com-
mercial, government and academic
(or not-for-profit) users. For example,
the commercial users ranged from
consulting and computing to energy
and engineering, as well as manu-
facturing and medical care. The gov-
ernment users worked in national re-
search laboratories, defense, security
and other areas. The users also came
from a range of group sizes, cover-
ing a range from a few to more than
100,000 employees. For instance,
nearly 20 percent of the groups had
only 1–49 employees, and almost 10
percent had 20,000–99,999.
Such a diverse sample should pro-
vide a broad overview of big data ap-
plications, and the survey results con-
firm that. Among those polled, the big
data applications included enterprise,
research and real-time analytics, plus
MapReduce, complex-event process-
ing, data mining and visualization. For
example, more than two-thirds of those
We have a very
unique business
model. It’s pure
open source, and
pushing everything
out to the open-
source community
is really powerful.
“
”
polled said that they work with big data
applications in research analytics, and
more than half apply data mining to
big data situations. A small portion of
those surveyed—6 percent—reported
no current applications, although they
were planning future ones.
Those surveyed also rated a series of
big data challenges on a scale of 1–5,
ranging from ‘not a challenge’ to an ‘in-
surmountable challenge,’ respectively.
Overall, the users ranked the most signif-
icant challenges as managing many files
and performing computations and anal-
yses fast, where the average scores for
both were 3.7 on the 1–5 scale. Among
the technical users, 70 percent of them
ranked managing many files as a 4 or 5
on the scale—making it an insurmountable, or nearly insurmountable, challenge for 7 out of 10 of those polled. Among
business users, 65 percent of them
rated performing computations and
analyses fast as a 4 or 5 on the scale of
big data challenges. In general, though,
most users rated most of the challenges
as significant. For example, users also
reported serious challenges—average
scores above 3—for managing large
files, giving multiple users simultaneous
access to data and providing long-term
access to data.
This survey also explored where
users get their software for big data
applications. The results show some
unexpected, or at least inefficient, re-
sults: Half of those polled developed
the applications internally; one-quar-
ter of them purchased the applica-
tions; and about another quarter used
open-source applications. “For those
who developed the applications inter-
nally,” says Snell, “those are all one-
offs by definition.” With challenges as
complicated as big data, developing
a unique solution to every problem
does not make for efficient comput-
ing. Moreover, the wide use of one-
offs limits these applications to groups
that maintain the personnel to develop
these advanced applications.
This survey also examined the spe-
cific tools the users apply to big data
scenarios. Out of the 300-plus re-
sponses to this survey, 225 of them
provided details on their applica-
tions, and Hadoop came out on top.
Nonetheless, only 11 percent of the
users reported that they make use of
Hadoop. Even so, says Snell, “If
people have a big data problem, they
often think they must use Hadoop.”
Future Fits
By definition, big data problems
outstrip the capabilities of a user’s
current system. So some of the tools
typically used only in HPC will work
their way into non-HPC environments.
A key example of that could arise from
the use of parallel file systems.
The Intersect360 Research survey
asked users about the file systems
that they employ. The replies included
a diverse set of approaches. Across all
types of users—academic, commer-
cial and government—most selected
the Network File System (NFS), which
was developed by Sun Microsystems
in 1984. Overall, nearly 70 percent of
the users employ NFS. Not many of
the users implement parallel file systems at this point. For example, less than 20 percent of the people polled
used the Hadoop Distributed File
System (HDFS), IBM’s General Paral-
lel File System (GPFS) or Lustre.
As Snell says, “Parallel file systems
are still poorly penetrated into com-
mercial markets.” He adds, “Com-
mercial users will need to find solu-
tions for parallel I/O.”
Beyond learning to build the infra-
structure for big data, much remains
to be learned about where to use it.
Gavin Nichols, vice president of strat-
egy and R&D IT at Quintiles in Re-
search Triangle Park, North Carolina,
provides a recent example. During the
2012 influenza season, big data ex-
perts in Scotland sequenced the gen-
otype of every person who contracted
the flu over a two-week period. From
that data, the analysts extracted infor-
mation that revealed who was most
susceptible to the illness, and who
might require hospitalization. Scot-
land’s government then used those
data to adjust its healthcare policy.
“They are using big data in how
they deal with influenza,” Nichols ex-
plains. “This is a good example that we
haven’t even seen the starting point for
what we can do with this technology.”
He adds, “If big data solutions grow in
the right way, then there’s the opportu-
nity to make meaningful changes in the
way drugs are developed, put through
to market and utilized.”
The variety of infrastructures be-
ing employed and the questions being
considered reflect the diversity of ap-
proaches that today’s users take to big
data challenges. This reveals that many
technologies in computing can be modi-
fied to handle big data problems in many
enterprises. Moreover, many technolo-
gies besides Hadoop can be explored.
For any particular user, the best technol-
ogy depends on the big data problem at
hand. Even for the same user, different
questions and applications will often
require different approaches to mak-
ing use of big data. As Snell concludes,
“Hadoop is not all things to all people.
Plus, there is not one perfect answer for
all big data problems.”
About the Author:
Mike May, Ph.D., is a science and technology writer, editor and book author. Mike has published hundreds of articles in prestigious publications during his 20-year professional writing career, and is currently editorial direc-
tor for Scientific American Worldview.
He holds a Ph.D. from Cornell Univer-
sity in neurobiology and behavior.
When a customer runs into a big-data problem—whether struggling with data velocity, volume or variety—IBM offers a solution. Such
solutions, however, must supply flex-
ible options, because of the relativity of
the term big in big data. “The ‘big’ in big
data is like the ‘high’ in high performance
computing,” says Yonggang Hu, chief
architect for Platform Computing, IBM
Systems and Technology Group. “It’s dif-
ferent to different people.”
To provide flexibility, IBM orchestrates
a very broad big-data initiative. It
ranges from complete systems—like
the xSeries and pSeries architectures
that handle big-data workflows—to
various aspects of computing solu-
tions. For example, IBM offers vari-
ous forms of storage technology that
benefit big-data situations. The com-
pany also creates networking com-
ponents and interconnects needed
for extremely high-throughput sys-
tems. The data in any system does not
live in isolation. Big data come from a
location and the analysis goes to a des-
tination. So IBM solutions can provide
connectivity between big-data systems
and the databases and data warehous-
es in the enterprise.
The Power of Parallel Files
To provide an overview of big-data
challenges, Hu divides the computer
processing of big-data problems into dif-
ferent patterns. The first data workload
might require only a monthly assessment.
This does not require any immediate anal-
ysis. Instead, data can be collected and
analyzed over time, and the results get
used for decision-making as needed. At
the other end of the spectrum, some ap-
plications—such as a traffic-monitoring
system or software that maintains power
across a utility grid—require real-time de-
cisions. This requires a system with interactive task latencies on the order of milliseconds. “The current Hadoop system
doesn’t have this kind of near real-time
response capability,” Hu says.
Working effectively with this variety of
big-data applications requires a power-
ful file system. IBM’s new General Paral-
lel File System File Placement Optimizer
(GPFS FPO), developed specifically for
big data, allows for a model very similar to the Hadoop Distributed File System (HDFS).
This file system provides a range of
necessary features. For example, it in-
cludes enterprise capability. GPFS FPO
is also extremely scalable, so it can
grow with a user’s data needs. Fur-
thermore, this file system includes no
single points of failure, which provides
the robust computing needed to create
sophisticated data solutions. Moreover,
its POSIX support allows applications to
A to Z for Big-Data
Challenges
IBM provides every tool—from applications to complete
systems—for cost-effective solutions
IBM Platform Symphony brings mature multi-tenancy and low-latency scheduling to Hadoop. Symphony
enables multiple MapReduce workloads to run on the same cluster at the same time, dynamically sharing
resources based on priorities that can change at run-time.
IBM
A globally integrated technology and
consulting company
Founded: 1911 (as Computing
Tabulating Recording Company)
Number of Employees: 433,362
Headquarters: Armonk, NY
Top Executive: Virginia M. Rometty
(chairman, president and CEO)
get the benefits of a UNIX-com-
patible file system.
Orchestrating a Solution
In many cases, big-data applica-
tions require advanced analytics.
For those situations, IBM created Platform Symphony. “This
software implements a Hadoop
MapReduce that is very fast,”
says Hu. Moreover, this software
provides very low latency. For in-
stance, the overhead provisioning
for an executed task takes less
than 1 millisecond. “Open-source
tool sets take an order of magni-
tude longer,” Hu says. On average,
performance tests reveal that Plat-
form Symphony is 7.3 times faster
than open-source Hadoop for jobs
comprised of a representative mix
of social media workloads.[1]
For someone who has been run-
ning Hadoop MapReduce on a
cluster, one challenge involves
efficient utilization. In many cas-
es, the user lacks sufficient jobs
to keep the cluster occupied. In
fact, utilization often runs as low as
20 percent. Platform Symphony helps
users extract more from a cluster by
getting workloads there faster and not
keeping them waiting to run. In ad-
dition, users can run heterogeneous
workloads, such as Hadoop MapRe-
duce mixed with C++ jobs. Moreover,
a business can divide a cluster among
multiple units with Platform Sympho-
ny, giving each a slice of the structure.
Then, the units can operate virtually
independently of each other.
In a recent return-on-investment
study, IBM explored the cost of run-
ning a big-data system, covering ev-
erything from the cost of the hardware
and administration to cooling the data
center. By using Platform Symphony,
instead of a standard Hadoop system,
the results showed a potential savings
of almost $1 million depending on the
customer’s use case.
For Real-Time Needs
IBM’s InfoSphere offers further options
in big-data analytics. This approach han-
dles big-data applications that need to
run in real or almost real time. It includes
ready-made components for people
who already use Hadoop analytics.
Furthermore, InfoSphere’s accelera-
tors and frameworks speed up the de-
velopment of big-data analytics. One
ready-made tool, for example, is BigSheets. This provides a spreadsheet-
like interface to the Hadoop stack. It
extracts and manipulates data from
the spreadsheet interface.
Infosphere also lets users com-
bine BigSheets with other software.
It supports all of the components of
the open-source Hadoop ecosystem.
So a user could select open-source
Hadoop layers in any of the stacks.
Hu adds, “Customers can choose the
layers where they want enterprise-ca-
pable technology.”
Building Relationships
In tailoring big-data solutions to
customers, IBM supplies a wide
range of customer services. This
includes assessing a customer’s
existing infrastructure and pro-
viding business-level needs. This
combines IBM’s technical- and
business-service arms.
Getting the right solution often
requires both levels of analysis.
For example, an efficient big-
data solution requires the right
computing infrastructure and
software systems, as well as
the business knowledge to de-
termine what reports should be
generated from the analysis.
The very basis of big data often
implies an analytical challenge. For
example, Vestas—an energy sup-
plier in Denmark—turned to IBM
to determine the best locations
for electricity-generating wind tur-
bines. The resulting analysis includ-
ed weather information, such as
wind and climate, as well as the lo-
cal geography. This company used
IBM technology to put together a
system that could produce accurate
answers in a timely fashion. The results
generated the best locations for the
wind turbines and how to arrange them
to harness the most energy possible.
As big-data problems grow and evolve, IBM’s solutions keep adapting and advancing. It takes the teamwork of hardware and software infrastructure plus powerful analytics to perform the needed computation. In addition, technical and business solutions must work together to generate the most useful result. From the xSeries and pSeries systems to powerful interconnects and advanced analytical algorithms, IBM supplies big-data tools and solutions that extend from high-performance infrastructure to sophisticated business intelligence and planning. Only an approach this well orchestrated can turn big-data opportunities into business growth and success.
Reference
1. STAC Report: A Comparison of IBM Platform Symphony and Apache Hadoop MapReduce Using Berkeley SWIM, 2012.
IBM provides a variety of System x and Power-based solutions optimized to run big-data tools, like IBM InfoSphere BigInsights, IBM Platform Symphony and IBM GPFS.
Sponsor Profile
A Big-Data File System for the Masses
The computing community and Omnibond developed open-source infrastructure software for storage clusters

To break down big-data problems, many data scientists are looking to the Hadoop MapReduce model. Currently those systems run on dedicated Hadoop clusters. Many companies and researchers have existing clusters that run all sorts of code but cannot yet leverage Hadoop in their shared environments. Dedicated clusters are required because Hadoop MapReduce is closely tied to the Hadoop Distributed File System (HDFS). To solve that problem, Omnibond has extended the open-source Orange File System (OrangeFS) to natively support Hadoop MapReduce without the HDFS requirement.
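One concrete way to picture a MapReduce job that is not welded to any particular storage layer is Hadoop Streaming, a standard Hadoop facility that accepts any executable as mapper and reducer. The word count below is a minimal sketch; the ofs:// URI in the comments is an illustrative assumption, since the exact OrangeFS connector configuration is not spelled out here.

#!/usr/bin/env python
# Minimal word-count mapper and reducer for Hadoop Streaming.
# A hypothetical invocation over a non-HDFS file system might look like:
#   hadoop jar hadoop-streaming.jar \
#       -D fs.defaultFS=ofs://namehost/        # assumed OrangeFS URI scheme
#       -input /logs -output /counts \
#       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce'
import sys

def map_phase():
    # Emit one "word<TAB>1" record per word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Hadoop sorts mapper output by key, so identical words arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()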
According to Boyd Wilson, Omnibond executive, “Our main business model for OrangeFS is all open-source, so you can download and start running today.” In addition, Omnibond provides support and development services. For example, if a customer needs a specific feature added to OrangeFS, Omnibond will price out the cost and create it; the customer must agree that the resulting code goes back into open source. Furthermore, OrangeFS works with a wide range of high-performance and enterprise computing applications.
The need to make parallel file systems easier to use, including leveraging MapReduce on existing clusters, ignited the original work on OrangeFS. To bring such systems to more users, the parallel file system development work needs to continue. And OrangeFS is not limited to traditional data science alone: its broadening list of use cases is expanding from research to finance, entertainment and analytics.
Enhancing the open-source business model, Omnibond provides everything that a customer would expect from a company that develops closed-source code. “If you have a problem, just call. If you need a feature, we can add it, and in addition, you get the peace of mind in having the source code, too,” Wilson says. Customers can purchase this service and directed development on an annual subscription basis, giving them the assurance of stable development and support. This model is a win for the support customer, for Omnibond and for the entire open-source community, assuring the ongoing development of OrangeFS.
Inside OrangeFS
In general, OrangeFS is the next-generation development of the Parallel Virtual File System, version 2 (PVFS2). From the start, PVFS2 delivered desirable features, including a modular design and distributed metadata. It also supports MPI-IO, the parallel I/O layer of the Message Passing Interface (MPI). Other features include PVFS2’s high scalability, which makes it a great choice for large data sets or files. Beyond working with such large files, PVFS2 is fast: accessed through the MPI-IO interface, for example, it dramatically outruns a network file system (NFS).
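To show what the MPI-IO interface looks like from application code, here is a minimal sketch using the mpi4py bindings. The mount point is hypothetical; in a typical deployment, the MPI library’s ROMIO layer routes these calls to the parallel file system.

# Each MPI rank writes its own block of a shared file in parallel.
# Run with, e.g.: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One megabyte of data per rank (matching the small-file sizes
# Omnibond tests against).
block = np.full(1024 * 1024, rank, dtype=np.uint8)

fh = MPI.File.Open(comm, "/mnt/orangefs/demo.dat",   # hypothetical mount point
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Collective write: all ranks participate, each at its own offset,
# letting the MPI-IO layer coalesce requests for the file system.
fh.Write_at_all(rank * block.nbytes, block)
fh.Close()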
Omnibond Systems develops open-source tools that can be used in a wide range of big-data applications. (Image courtesy of Omnibond Systems.)
Omnibond Systems, LLC
A software engineering and support company focused on infrastructure software.
Founded: 1999
Employees: 45
Headquarters: Clemson, SC
Top Executive: Boyd Wilson
Omnibond started with those features and generated OrangeFS by adding even more. The file system now benefits from a focused development team, and it includes improvements in working with smaller files; Omnibond does all of its testing on 1-megabyte files. The 2.8.6 version, released on June 18, 2012, provided users with advanced client libraries, Linux kernel support and Direct Interface libraries for POSIX and low-level system calls. Additionally, users can get support for WebDAV and S3 via the new OrangeFS Web Pack.
Future releases promise even more features, such as distributed directories, which will speed access to large directories, and capability-based security, which will make the entire file system more secure.
Added Incentives
For many big-data users, the list of OrangeFS features already mentioned sounds enticing enough, especially since it combines the best of the open- and closed-source worlds. Nonetheless, even more incentives exist.
The object-based design of OrangeFS appeals to many users. In addition, this file system separates data and metadata, and it provides optimized MPI-IO requests. Furthermore, the software can be installed on a range of architectures: users can run OrangeFS over multiple network architectures and on many commodity hardware platforms. To protect files against server failures, OrangeFS will also provide cross-server redundancy in the near future.
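The payoff of separating data from metadata can be pictured with a small striping calculation: a file’s bytes are spread round-robin across data servers in fixed-size stripe units, while a metadata service records the layout. The stripe size and server count below are illustrative assumptions, not OrangeFS defaults.

# Map a byte offset in a striped file to (server, offset on that server).
STRIPE_SIZE = 64 * 1024   # assumed stripe unit
N_SERVERS = 4             # assumed number of data servers

def locate(offset):
    stripe_index = offset // STRIPE_SIZE
    server = stripe_index % N_SERVERS            # round-robin placement
    local_stripe = stripe_index // N_SERVERS     # stripes already on that server
    return server, local_stripe * STRIPE_SIZE + offset % STRIPE_SIZE

# The first four stripes land on four different servers,
# so a large read proceeds from all servers in parallel.
for off in range(0, 4 * STRIPE_SIZE, STRIPE_SIZE):
    print(off, locate(off))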
For an open-source project, thorough documentation can sound like a rarity, but Omnibond goes even further: the recent OrangeFS Installation Instructions supply complete documentation, all available online.
Many applications of big data require aggregation. For example, a system might gather real-time data and store it for later access; the stored data can then be served to the cluster’s Hadoop nodes, queried and analyzed with MapReduce.
Likewise, a big-data system might require other forms of data aggregation. For instance, real-time data could be collected, combined with private data and then processed. Alternatively, a user might take ‘snapshots’ while processing real-time data to collect a subset of the data, and then analyze it. OrangeFS can provide storage for all of these big-data scenarios.
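A compact sketch of that snapshot-then-analyze pattern, with an invented record format: records stream in, land in a store, and a subset is later counted with a map/reduce-style pass. In practice the store would be a file system such as OrangeFS and the analysis a Hadoop job.

from collections import Counter

# Simulated real-time feed; each record is (user, action). Invented data.
stream = [("ana", "click"), ("bo", "view"), ("ana", "view"),
          ("cy", "click"), ("bo", "click"), ("ana", "click")]

store = []                       # stands in for files on a parallel file system
for record in stream:
    store.append(record)         # 1) aggregate incoming data into storage

snapshot = [r for r in store if r[1] == "click"]   # 2) snapshot a subset

# 3) analyze later: a map/reduce-style count of clicks per user.
mapped = ((user, 1) for user, _ in snapshot)
reduced = Counter()
for user, n in mapped:
    reduced[user] += n
print(reduced)                   # Counter({'ana': 2, 'cy': 1, 'bo': 1})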
Advanced Applications
The open-source community stays at the cutting edge of computation and data analysis. Consequently, experts at Omnibond track the trends in high-performance and big-data environments, and the information they glean continues to shape future versions of OrangeFS.
As an example, ongoing work in data analysis keeps spawning previously unimagined approaches, and new ways of even thinking about data keep emerging. Social-media analysis and other large data streams, for instance, can trigger entirely new ways of analyzing information, such as log, customer and other real and near-real-time data. In other cases, new perspectives on existing data will generate novel approaches to data integration that were never considered before.
As Wilson explains, “The new concept of data analysis covers a wide range. In some cases, it involves huge amounts of data, and other times not so much.” He adds, “You can bring together different data points that you wouldn’t think would correlate, but sometimes they do.”
This kind of thinking, seeing today’s challenges and turning them into tomorrow’s opportunities, lies at the heart of Omnibond’s philosophy. “We are adding exciting new features on our development path to OrangeFS 3.0,” says Wilson. “These should be real game-changing enhancements.”
As a company, Omnibond focuses on building the OrangeFS that customers want, and then sharing those advances with the wider community.
Omnibond’s open-source Orange File System (OrangeFS) supports Hadoop MapReduce and provides a simplified parallel file system. (Image courtesy of Omnibond Systems.)
Sponsor Profile
Crafting Storage Solutions
Xyratex builds scalable capacity and performance into big data systems

Big data varies from one customer to the next. As Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, said, “What’s big data to one guy is tiny to another.” No matter how you define big data, achieving analytic results rapidly requires one thing: fast, reliable data storage. And with more than 25 years of experience as a leading provider of data-storage technology, this is exactly where Xyratex leads the market.
Xyratex’s innovative technology solutions help make data storage faster, more reliable and more cost-effective. As an organization, Xyratex provides a range of products, from disk-drive heads and disk-media test equipment to enterprise storage platforms and scale-out storage for high-performance computing (HPC). For big data applications, where users need storage that scales in both capacity and performance, Xyratex solutions support everything from high-performance processing of relatively small data quantities to some of the largest supercomputers in the world.
“Performance is often overlooked,” Claffey said, “but it’s the key driver. Once you have your data together, you need performance for analytics to transform that data into information you can use to make more-informed decisions.”
Combining Features
Xyratex achieves its big data storage success by providing solutions, not just parts. The company has developed a fresh, groundbreaking scale-out storage architecture that meets user performance and scalability needs while also providing the industry’s highest levels of integrated solution reliability, availability and serviceability.
“You can go from a single pair of scalable storage nodes with less than 100 terabytes of capacity to solutions that support more than 30 petabytes of capacity and deliver terabyte-per-second performance,” Claffey said. “Then we embed the analytics engine right into the storage solution, helping you make better use of all the data you are storing and maximize operational efficiency within the solution.”
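A back-of-envelope sketch of what scaling in both dimensions means when sizing such a system. The per-pair figures below are invented placeholders, not ClusterStor specifications; a system needs enough node pairs to satisfy whichever target, capacity or bandwidth, is harder to reach.

# Back-of-envelope sizing for a scale-out storage system.
# Per-pair figures are illustrative assumptions, not product specs.
CAPACITY_PER_PAIR_TB = 100       # assumed usable capacity per node pair
THROUGHPUT_PER_PAIR_GBS = 10     # assumed aggregate bandwidth per pair

def pairs_needed(capacity_tb, throughput_gbs):
    """Node pairs required to meet both capacity and bandwidth targets."""
    by_capacity = -(-capacity_tb // CAPACITY_PER_PAIR_TB)         # ceiling division
    by_throughput = -(-throughput_gbs // THROUGHPUT_PER_PAIR_GBS)
    return max(by_capacity, by_throughput)

# A 30 PB system that must sustain 1 TB/s:
print(pairs_needed(30_000, 1_000))  # -> 300 pairs

Sizing on the maximum of the two requirements is the design point the quote above describes: capacity and performance grow together as pairs are added.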
Adding Engineering
Traditional HPC users often tried to build their own data storage infrastructure by stitching together different pieces of off-the-shelf hardware, an approach that often ran into trouble. Xyratex saw an opportunity to engineer a high-performance, singly managed, appliance-like data storage solution called ClusterStor. ClusterStor is a tightly integrated scale-out solution that is easy to deploy, scale and manage.
What is now termed big data has been used in practice for many years within compute-intensive environments and is evolving toward an enterprise class of requirements. Enterprise users, however, do not have the time to struggle with the level of complexity these environments bring with them, and supporting exponential growth in such environments requires data storage solutions that keep costs in check. As with general-purpose storage acquisitions, enterprise users should consider the total cost of ownership, including the cost of upgrades and time to productivity, as well as operational efficiency. ClusterStor allows enterprise-class users to move into an engineered high-performance data storage solution with enterprise reliability, serviceability and availability, without the hidden costs associated with stitching components together.
XYRATEX
A leading provider of data-storage technology.
Founded: 1994
Employees: 1,900
Headquarters: Havant, United Kingdom

To help big-data users, Xyratex provides its ultra-dense 5U84 enclosure. (Image courtesy of Xyratex.)
Breadth of Benefits
With a Xyratex ClusterStor storage solution, a user gets an integrated product. While Xyratex leverages open-source and standards-based hardware platforms, it also adds its own blend of ease of deployment and reliability. It all adds up to the fastest and most reliable data storage on the market today. In addition, all ClusterStor systems come pre-configured from the factory, enabling turnkey deployment, which significantly reduces installation time.
Many sectors, such as digital manufacturing, government and commercial industry, are finding that big-data initiatives can provide new understanding of the data they already have, but many organizations in these sectors lack the in-house expertise to create and manage big-data solutions.
“Coming to users with a big-data solution that they can easily deploy and manage removes a barrier for them to get access to the technology they need,” said Claffey. “The complexity of the solution is completely hidden from the user, and you don’t see all of the thousands of man-hours that it took to create that experience. We help you reduce complexity, get a great out-of-the-box experience and maximize the uptime of the system.”
“Instead of building and tweaking to get a system up and running, our turnkey solutions reduce the time it takes to start computing. We even have architects who can come in and design how the solution will fit into your environment,” added Steve Paulhus, senior director, strategic business development at Xyratex. “We are known for high-quality storage products—in the way we engineer, test, validate and deliver preconfigured ClusterStor solutions. Through all of this, we offer the quickest path to the performance and results users seek.”
As big data continues to evolve quickly, users’ storage needs will as well, and data storage systems and vendors will have to adapt to stay relevant in the market.
“We continue to leverage our knowledge and experience-based understanding of end-user data storage needs to relentlessly push beyond traditional boundaries,” said Paulhus. “That’s how we will continue to advance in big data and provide solutions ideally suited to serving the needs of the market.”
The ClusterStor 6000 provides the ultimate integrated HPC data-storage solution. (Image courtesy of Xyratex.)
  • 1. Big Data Implementation: Hadoopand Beyond CONDUCTED BY: Tabor Communications Custom Publishing Group IN CONJUNCTION WITH:
  • 2. Dear Colleague, Welcome to the first in a series of professionally written reports and live online events around topics related to big data, high-performance computing, cloud computing, green computing, and digital manu- facturing that comprise the media portfolio produced by Tabor Communications, Inc. Most upcoming Tabor Communications reports and events will be offered at no charge to viewers through underwriting by a limited group of sponsors. Our first report titled Big Data Implementation: Hadoop and Be- yond, is being graciously underwritten by IBM, Omnibond and Xyratex. More about how these organizations tackle challenges associated with big data can be found in the sponsor section. Tabor Communications has been writing about the most progressive technol- ogies driving science, manufacturing, finance, healthcare, energy, government research, and more for nearly 25 years. The extended depth of our new report series extends beyond the limited space we can devote for coverage on specific topics through our regular media resources. The writers of these reports are typically contributing editors for Tabor Communications, and/or content part- ners like Intersect360 Research from whom research cited within this report was developed. The best topics to cover typically come from the user community. We en- courage you to offer any suggestions for upcoming reports that you feel will be helpful to you in better reaching your business goals. Please send your ideas to our SVP of Digital Demand Services, Alan El Faye (alan@taborcommunications. com). Sponsorship inquiries for future reports would also go to Alan. Thank you for reading Big Data Implementation: Hadoop and Beyond. Please visit our corporate website to learn more about the suite of Tabor pub- lications: http://www.taborcommunications.com/publications/index.htm Cordially, Nicole Hemsoth Editorial Director-Tabor Communications, Inc. Big Data Implementation: Hadoop and Beyond Conducted by: Tabor Communications Custom Publishing Group A Tabor Communications, Inc. (TCI) publication © 2013. This report cannot be duplicated without prior permission from TCI. While every effort is made to assure accuracy of the contents we do not assume liability for that accuracy, its presentation, or for any opinions presented. Study Sponsors OOMMUN NI ICC AT S Media and Research Partners
  • 3. By Mike May T he big data phenomenon is far more than a set of buzzwords and new technologies that fit into a growing niche of application areas. Further, all the hype around big data goes far beyond what many think is the key issue, which is data size. While data volume is a critical issue, the related matters of fast movement and opera- tions on that data, not to mention the incredible diversity of that data demand a new movement—and a new set of complementary technologies. But we’ve heard all of this before, of course. Without context, it’s hard to communicate just how important this movement is. Consider, for example, President Obama’s recent campaign, which many touted as by far the most comprehensive Web and data-driven campaign in history. The team backing the big data efforts for Obama utilized vast, diverse wells of information ranging from GIS, social me- dia, demographic and other open and proprietary data to seek out the most likely voters to lure from challenger Mitt Romney. This approach to wrangling big data for a big win worked, although it’s still a matter of speculation just how much these efforts tipped the scale. So let’s take a more concrete set of Big Data Implementation: Hadoop and Beyond In this new era of data-intensive computing, the next generation of tools to power everything from research to business operations is emerging. Keeping up with the influx of emerging big data technologies presents its own challenges—but the rewards for adoption can be of incredible value. High-performance computing and machine learning help users track their family tree at Ancestry.com. This screen shot features the Shaky Leaf, which employs predictive analytics to suggest possible new information in a user’s family tree. (Image courtesy of Ancestry.com.)
  • 4. measurable examples. Healthcare data are being processed at incredible speed, pulling in data from a number of sources to turn over rapid insights. Insurance, retail and Web companies are also tapping into big, fast data based on transactions. And life sci- ences companies and research enti- ties alike are turning to new big data technologies to enhance the results from their high performance systems by allowing more complex data inges- tion and processing. There is little doubt that the next wave of big data technologies will rev- olutionize computing in business and research environments, but we should back up for a moment and take in the 50,000-foot view to put this all in clearer context. The roots of big data are not in the traditional places one used to look for the cutting edge of computational tech- nologies—they are entrenched in the daily needs of the world’s largest, most complex and earliest web companies. As Jack Norris, VP of Marketing at Hadoop distribution startup, MapR, reminds, “Google is the pioneer of big data and showed the dramatic power and ability to change industries through better analysis.” He adds, “Not only did Google do better analysis, but they did it on a larger dataset and much cheap- er than traditional alternatives.” Mak- ing use of big data requires powerful new computing tools, and perhaps the best-known one is called Hadoop. As the case of Google calls to mind, even though big data are often big, size alone does not adequately define the space. Likewise, the field of big data extends beyond any particular applica- tion, as indicated by the examples giv- en above. As Norris explains, “Big data is not just about the kind or volume of data. It’s a paradigm shift.” He adds, “It’s a new way to process and analyze existing data and new data sources.” The point at which a computing chal- lenge turns into big data depends on the specific situation. For some research and enterprise end users, existing ap- plications are being strained due to the data complexity and volume overload, for others performance is lagging due to these factors. Another group of users simply needs to abandon old applica- tions, systems or even methodologies completely and build again to create a big data-ready environment. As Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group says, “One of the simplest ways to define big data is when an organization outgrows their current data-management systems, and they are experiencing problems with the scalability, speed and reliabil- ity of current systems in the data they are seeing or expect to get.” When faced with such a situation, says Hu, “They need to find a cost-effective way to move forward.” (See ‘A to Z for Big Data Challenges.’) When viewed in total, the scalability, performance, storage, network and other concerns continue to mount— which has opened the door for the new breed of big data technologies. This includes everything from fresh approaches to systems (new appli- ances purpose-built for high-volume and -velocity data ingestion and pro- cessing) to new applications, data- bases and even frameworks (Hadoop being a prime example). The Deluge of Data to Decode There is no one definition of a big data application, but it’s safe to say that many existing applications need to be retooled, reworked or even thrown overboard in the pursuit to keep pace with data volume, variety and velocity needs. 
Throw in the tricky element of real- time applications—which require high performance on top of complex in- gestion, storage and processing—and Benefits of BigQuery To explore big data, people want to ask questions about the information. “Google’s BigQuery is a tool that allows you to run ad hoc queries,” says Michael Manoochehri, one of Mountain View, California-based Google’s developer programs engineers. “It lets you ask questions on large data sets, like terabytes.” To make this tool accessible to more users, Google implemented a SQL-style syntax. Moreover, the company has used this tool—in a version called Dremel— internally for years before releasing a form of it in their generally available service, BigQuery, in 2012. “It’s the only tool that lets you get results in seconds from tera- bytes,” Manoochehri says. This tool brings big data to many users because it is hosted on Google’s infrastruc- ture. “You take your data, put it in our cloud and use this application,” Manoochehri ex- plains. Because of that simplicity, a diverse group of users already employs BigQuery. As examples, Manoochehri mentions Claritics—another Mountain View–based company, which describes its services as providing “social gaming and social com- merce analytics”—and Blue Box—a company in Brazil that provides mobile adver- tisements. Regarding Blue Box, Manoochehri says, “They started with MySQL, and when the data got too big they moved to Hadoop.” He adds, “Hadoop is cool, but it takes set up and administration.” Then, Blue Box tried BigQuery. “It fit,” Manoocheh- ri says, “because Blue Box is a small shop with lots of data to process. They wanted to run queries on data and integrate it into their application.” Various other industries also use BigQuery. As an example, Manoochehri points out the hospitality industry, which includes hotel and travel data. Overall, he says, “Almost all of the case studies for BigQuery are small shops, like two or three engi- neers, that have no time to set up hardware.” In general, BigQuery’s biggest benefits arise from speed. “It’s optimized for speed,” Manoochehri explains. It’s fast to set up and it brings back results fast, even from very big datasets. They are often looking at complexity, trying to manage complexity with limited resources. “ ”
  • 5. users are left with a lot of questions about how to meet their goals. While there is no shortage of new technolo- gies that are aimed at meeting those targets, the challenges can be over- whelming to many users. When asked about the hurdles that users face today in working with big data, Michael Manoochehri, developer programs engineer at Google head- quarters says, “They are often looking at complexity, trying to manage com- plexity with limited resources.” He adds, “That’s a big problem for lots of people.” In the past, the lack of resources kept many potential users out of big data. As Manoochehri says, “Google and Amazon and Microsoft could handle big data,” but not everyone. Today, big data tools exist for many more users. For example, Manoochehri says, “Our products make these seemingly impos- sible big data challenges more acces- sible.” (See ‘Benefits of BigQuery.’) A prime example of this can be found in large-scale data mining, which can quickly turn into a big data problem, even if it’s a process that used to fit neatly in a relational data- base and commodity system. For life sciences companies, data mining has always been a key area of computa- tional R&D, but the addition of new, diverse, large data sets continues to push them toward new solutions. The data mining, analytics and pro- cessing challenges for life sciences are great. Genomics explores the chromo- somes of a wide range of organisms— everything from seeds to trees and hu- mans. Chromosomes consist of genes, whichcodeforproteins,andnucleotides make up the genes. The nucleotides are arranged, more or less, like beads on a necklace, and the order of the nucleotides and the types—they come in four forms—are like letters spelling out words. In the case of genomics, the words translate into proteins, which control all of the biochemical processes that fuel life. Hidden in those words are complex programming and process- ing challenges as life sciences users attempt to piece these bits together to find “needle in the hackstack” discov- eries amidst millions (and more often, billions) of pieces at a time. Sequencing determines the order and identity of the nucleotides, and that information can be used in basic research, biotechnology, medicine and more. Today’s sequencing machines produce data so fast that this quickly generates big data. In 2010, for ex- ample, BGI—a sequencing center in Beijing, China—bought 128 Illumina HiSeq 2000 sequencing platforms. Ac- cording to specifications from Illumina, one of these machines can produce up to 600 gigabytes of data in 8.5 days. So 128 of these sequencing machines could produce about one-quarter mil- lion gigabytes of data a month, and that’s big data. Today’s sophisticated life science tools and big data applica- tions also impact the healthcare indus- try. (See ‘Boosting Big Pharma.’) The challenges across the com- putational spectrum are not hard to point to in such a case, especially when it comes to large scale genomic sequence analysis. With the terabytes of data rolling off sequencers, the vol- ume side of the big data equation is incredible, necessitating innovative approaches to storage, ingestion and data movement. In terms of the nee- dle in a haystack side of the analytics Boosting Big Pharma “The big data question hasn’t really been fully exploited in big pharma at the mo- ment,” says Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina. Nonetheless, many big data opportunities exist in healthcare. 
“If you took the big data around genomics, the ability to master investigator and doctor information, the information from clinical trials and what goes on drug labels and much more, you get a picture of where you could influence the field,” says Nichols. In short, big data could impact the pharmaceutical industry from discovering new drugs to commercializing them. “There is the ability to influence the whole lifecycle,” Nichols says, “but that has not been fully realized.” Quintiles, on the other hand, started working with big data some time age. “For the past seven or eight years,” Nichols says, “we’ve been very strategic about the data that we consider.” This includes both enterprise and technical computing. For ex- ample, Quintiles analyzes large datasets in an effort to improve efficiencies across its various organizations. Likewise, the company takes advantage of electronic health records (EHRs) and registries in an effort to better understand the populations that need new pharmaceuticals. As one crucial example, Quintiles uses big data in an effort to make clinical trials more efficient. In a late-phase trial, for example, better use of EHRs could reduce time and cost. “Instead of going to a trial site and asking 100s of questions, like what drugs the people take,” Nichols explains, “we could ask the EHRs those questions, and maybe we can answer 40 percent of the questions.” Beyond increasing efficien- cy, a similar approach could also enhance the information gained from a clinical trial. “Information from EHRs could help us design more-targeted studies on a smaller population,” Nichols says, “and that’s a win-win for everyone.” Although big data offers a gold mine to the pharmaceutical industry, companies must use this resource wisely. “Big data in the life sciences without good clinical interpretations could be a hazard,” Nichols says. “So we bring data together and enrich it.” Quintiles combines data from, say, clinical trials with the experience of clinical experts. To make the most of big data opportunities, the experts at Quintiles consider vari- ous options. “We look long and hard at things,” Nichols says. “We’re looking at Ha- doop, and we’ve also used Solar on a SQL frontend.” He adds, “We’re also looking at techniques developed by companies that are building on top of Hadoop.” For patient privacy, however, Quintiles segregates analytics from the raw data. In many cases, though, each pharmaceutical question requires its own big data implementation. For efficiency, Quintiles relies heavily on data scientists. “By using research analysts—individuals who sit between the knowledge of what the data are and the analytical approaches—we can see if we’ve examined similar questions in the past,” Nichols says. “Then, we might be able to go back to some application that you’ve used before.” Nonetheless, Nichols adds, “Clinical research has more and more one-off questions.”
problem, standard software that fits into neat boxes is no longer able to keep pace with the incredible speed and complexity of the puzzle-piecing. Hence life sciences is one of the key use cases representing the larger challenges of big data operations.

Big data goes beyond our bodies; another example of data-intensive computing in action could soon change our homes. The "smart grid" is another hot application area for a wide range of big data technologies across the entire stack. The US Department of Energy describes the smart grid as "a class of technology people are using to bring utility electricity delivery systems into the 21st century, using computer-based remote control and automation." Of course, without ample processing, frameworks and applications designed from the ground up to handle the many diverse data feeds that power the smart grid, the project couldn't be a success.

Making the smart grid work requires real innovation on the data front, according to Ogi Kavozovic, vice president of strategy and product at Opower in Arlington, Virginia. "Smart grid deployment will create a three-orders-of-magnitude jump in the amount of energy-usage data that the utility industry will have," he says. The energy exec went on to note that when this data spike happens, "Then there will be a several-orders-of-magnitude jump above that from energy-management devices that go inside people's homes, like Wi-Fi thermostats that read data by the second." Opower is already working with Honeywell, the global leader in thermostats, to make such devices. "This is all just emerging now, and the utility industry is preparing for it," Kavozovic says. "The data presents an unprecedented opportunity to bring energy management and efficiency into the mainstream." (See 'Evolution in Energy Efficiency.')

For the smart grid and energy sector, moving to big data technologies is a priority because it will enable new, critical efficiencies that can create a new foundation for energy consumption and distribution. But of course, the frameworks, both on the hardware and programming side, need to be in place.

Data-Based Decisions

In the field of big data, the wide range of applications mirrors the sorts of computing situations that generate big data challenges. Some users run into big data because of handling many files. In other cases, the challenge comes from working with very large files. Big data issues also emerge from extensive data sharing, allowing multiple users to explore or analyze the same data set.

"Time plays a role in big data," says Steve Paulhus, senior director, strategic business development at Xyratex. "Depending on how the data is used, it may be valuable today or it may become valuable tomorrow. For instance, with financial information, the data can lose

Evolution in Energy Efficiency

Utility companies that want to use big energy data to enhance efficiency can turn to Opower in Arlington, Virginia. "We partner with about 80 utilities around the world to help their customers understand and control their energy usage through our tools," says Anand Babu, lead on all data and platform marketing efforts at Opower. With these tools, customers can compare their energy usage to that of their neighbors and learn to improve their energy efficiency. Likewise, this use of big energy data can help utilities reduce peak energy demands, which can reduce the need to develop or purchase more energy.
To suggest some of the volume of big energy data, Babu says, "In the United States alone, consider what happens once every one of our 120 million households has a smart meter and WiFi thermostat. That's 120 million times 100,000 data points per month." That's 12 trillion data points acquired every month. Keep in mind, too, that most households today only get one monthly meter read. So big energy data includes high volume and fast growth.

As utilities start to roll out smart meters, most companies lack the infrastructure to store the data. That's one place where Opower comes in on big energy data. This software-as-a-service company provides the infrastructure and the analytical expertise. "We operate a managed service where utilities securely transfer data to us, and we do the analytics," Babu explains.

To provide this service, Opower uses several tools. Babu says, "We are the biggest user of Hadoop in our industry, operating at the scale of nearly 50 million homes." In other cases, different tools work better. "For some things, we still use MySQL," Babu explains. "For example, to manage how we communicate with certain segments of the customer base, that can be done more efficiently using a traditional database." "The exciting thing here is that customers see value on day one," Babu says, "and they can start to take action right away."

In "Different Perspectives on Big Data: A Broad-Based, End-User View" from Intersect360 Research, people polled rated a series of big data challenges on a scale of 1–5, ranging from 'not a challenge' to an 'insurmountable challenge,' respectively. As the results show, users run into the most trouble managing many files and performing computations and analyses fast. (Image courtesy of Intersect360 Research.)

Challenging dimension of Big Data | Business (Avg. / % rating 4–5) | Technical (Avg. / % rating 4–5) | All Data (Avg. / % rating 4–5)
Management of very large files | 2.7 / 25% | 3.5 / 56% | 3.3 / 46%
Management of very large number of files (many files) | 3.5 / 58% | 3.9 / 70% | 3.7 / 66%
Providing concurrent access to data for multiple users | 3.1 / 40% | 3.5 / 52% | 3.4 / 48%
Completing computation or analysis within a short timeframe (short lifespan of data) | 3.8 / 65% | 3.7 / 61% | 3.7 / 62%
Long-term access to data (long lifespan of data) | 3.6 / 61% | 3.5 / 56% | 3.5 / 57%
Number of respondents | 102 | 203 | 305
Source: Intersect360 Research, 2012
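The data-volume claims in the sidebar above can be sanity-checked in a few lines. In the sketch below, the 15-minute meter-read interval is our assumption; the 120 million households and 100,000 points per home per month are the figures Babu cites:

```python
# Rough sanity check of the smart-grid data-volume claims above.

MONTHLY_READS = 1                    # traditional meter: one read per month
SMART_METER_READS = 30 * 24 * 4      # assumed: one read every 15 minutes
HOUSEHOLDS = 120_000_000             # per the text
POINTS_PER_HOME_PER_MONTH = 100_000  # meter plus thermostat, per the text

jump = SMART_METER_READS / MONTHLY_READS
print(f"Smart metering alone: ~{jump:,.0f}x more readings "
      "(about three orders of magnitude, as Kavozovic says)")

total = HOUSEHOLDS * POINTS_PER_HOME_PER_MONTH
print(f"US-wide: {total:,} points/month")   # 12,000,000,000,000 = 12 trillion
```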
value if you don't get it right now. In other cases, like weather data, you might want to look at it over a long period of time." (See 'Crafting Storage Solutions.')

Addison Snell, CEO of Intersect360 Research in Sunnyvale, California, describes another example where data must be stored for a long time: "A pediatric hospital, for example, keeps information for the life of a patient plus some additional number of years."

Depending on the particular application and the kind of data that it includes, end-users must select a system that works for them. No single solution handles every big data problem. That said, most users of big data face one consistent problem: input/output scalability. "From a technical perspective," says Snell, "this is the number one big data problem." Moreover, I/O scalability impacts both technical and enterprise computing. Technical computing often involves a user's top-line mission, such as car design for an automotive manufacturer or drug discovery for a pharmaceutical company. Enterprise computing, on the other hand, consists of the steps that keep a company running, such as handling the mechanics of an e-commerce website.

In the past, technical computing focused almost entirely on unsolved problems. As Snell explains, "You don't need to design the same bridge twice." Since this area of computing moves to harder and harder problems, the users quickly adopt new technologies and algorithms. In enterprise computing, by comparison, once users find solutions, they stick with them. "In enterprise computing, there are people using approaches that are 30 years old because they still work," says Snell.

In many ways, the big data era changes this distinction. By definition, big data scenarios push an end-user to a point where previous methods won't work. This applies to technical and enterprise computing. "So, for the first time, you have enterprise-class customers viewing performance specifications of infrastructure as the dominant purchase criteria," Snell says. The purchase criterion used the most is I/O performance. In addition to I/O performance scalability, users must ask other questions about their big-data

Computing Family Connections

At Ancestry.com, data come in many forms. This family-history company stores data from historical records, such as census data and military and immigration records. This historical information also includes birth and death certificates. Overall, the historical data at Ancestry.com consists of around 11 billion records. In addition, this web-based company stores user-generated data. "When users come to this site, their goal is to build a family tree," explains Leonid Zhukov, the company's director of machine learning and development. In this process, a user creates a profile of themselves, plus their parents, grandparents and so on. "We have more than 40 million family trees and 4 billion people represented here," Zhukov says. Users can also upload images, such as photographs or even handwritten stories, and Ancestry.com stores about 185 million of these records. On top of all of that, this site keeps track of user activity. "On average, Ancestry.com servers handle 40 million searches a day," says Zhukov, "and we log user searches and use it in machine learning."

The kinds of data being stored at Ancestry.com also evolve over time. As an example of that, the company can now collect a user's DNA data. The company provides the DNA test and processes the information.
As the company explains: "AncestryDNA uses recent advances in DNA technology to look into your past to the people and places that matter to you most. Your test results will reach back hundreds—maybe even a thousand years—to tell you things that aren't always in historical records, but are recent enough to be important parts of your family story."

This range of data—a variety of types from various sources—can only be stored and analyzed with powerful tools. "There's no black-box solution for our data challenges," Zhukov explains. "We use different pipelines for processing different types of data. Plus, we deal with big and versatile data, so we need to do lots of hand work on it." He also points out that moving so much data creates challenges. "We must customize servers and how they are connected—where to collect data and where to process it," Zhukov says. "The logistics of the data are quite important."

Some of the most interesting data analysis at Ancestry.com involves machine learning. This works on two types of objects: a person or a record, which is any document with information about that person. Then, users can find information about ancestors by searching for names, places of birth and so on. These focused searches provide precise results. On the other hand, users can also turn to exploratory analysis. "In this case, a user is looking for suggestions," says Zhukov. "It's similar to when you log on to Amazon, and they show you the books that you might be interested in." So instead of a search, the exploratory analysis provides suggestions of records or family-tree information based on what the user has done in the past, and that depends on machine learning.

The system at Ancestry.com creates the suggestions for exploratory analysis offline. "We train the system to make suggestions based on the feedback of previous suggestions that were accepted or rejected," Zhukov explains. "The system looks for similarity. It asks: How similar is a person to a particular record?" The software uses a range of features to make these comparisons. It can start with a first and last name, but maybe a document includes a misspelling, and then the system relies on a phonetic comparison. Also, a record could have been handwritten, scanned and turned into data with optical character recognition. "You can imagine how many mistakes there are in hand-written historical records," Zhukov says, "so we do fuzzy matching instead of exact matches." The features of a record serve as the input, and the training data for machine learning come from users rejecting or accepting the suggestions.

For many parts of this processing, Ancestry.com uses Hadoop. As Zhukov says, "Hadoop is really great technology, but it's not at the point where you can open the box and things miraculously work." Consequently, this company relies on Hadoop experts and also experts who understand how to search huge databases for similarities between disparate sources of information. As Zhukov adds, "Hadoop is not only for data storage, but it also provides a data-processing pipeline. So you need to feed it with the right things."
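The phonetic comparison Zhukov describes maps onto standard record-linkage techniques. The sketch below is a simplified Soundex encoder, a textbook phonetic algorithm rather than anything from Ancestry.com's production pipeline, showing how spelling variants of a surname collapse to one comparable key:

```python
# A minimal illustration of phonetic matching for record linkage, in the
# spirit of the fuzzy matching Zhukov describes. Simplified Soundex only;
# not Ancestry.com's code.

CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
    for c in letters}

def soundex(name: str) -> str:
    """Encode a name (assumed to start with a letter) as a 4-char key."""
    name = name.lower()
    # 0 marks vowels and other letters that act as separators.
    digits = [CODES.get(c, 0) for c in name if c.isalpha()]
    code = name[0].upper()
    prev = CODES.get(name[0], 0)
    for d in digits[1:]:
        if d != 0 and d != prev:   # skip separators and adjacent repeats
            code += str(d)
        prev = d
    return (code + "000")[:4]

# Misspellings of the same surname collapse to one code:
for surname in ("Smith", "Smyth", "Smithe"):
    print(surname, "->", soundex(surname))   # all map to S530
```

In a linkage pipeline, keys like these would be one feature among many; the accept/reject feedback Zhukov mentions would then train a model over such features.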
infrastructure needs. For example, Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, asks: "Once you have all of your data in a common environment, you need to think about the reliability, availability and serviceability of that infrastructure, as you want that data to be consistently available." He adds, "You need to think about the constant with big data—multifaceted growth in capacity, data sources, data types, number of clients, applications, workloads and the new ways to use the data. It is key to ask yourself how you are going to efficiently manage such multifaceted exponential growth over the next number of years against the constraints of available data scientists and data center space, power, cooling and budget. Your big data infrastructure needs to be scalable to meet performance and capacity demands, flexible to respond to change and extremely efficient."

Delving even deeper, Norris noted that when looking at new architecture and what's possible, the real focus is on what types of applications will be used. "What data sources are we going to exploit? What is the desired outcome?" In short, a user must answer questions like these to know where to start. After that, a user decides how to integrate new infrastructure into an existing computing environment. "How does the new computing fit my current security?" Norris asks. "How do I make it enterprise grade?" As Norris points out, a user might not start asking these questions until working on the second or third big data application and sharing it across a group of users. "You can generalize about the best platform, but you have to have it working well with the existing environment and applications," Norris says. "So you'll end up with a little bit of a hybrid solution. You'll find yourself asking: How do I take advantage of this new technology, but on a platform that delivers what I need in terms of reliability, ease of integration and so forth?"

Among the frameworks emerging in the key sectors we've discussed are Hadoop and MapReduce, which were, again, born out of the largest-scale data operations the world has seen to date—the search engines and web companies.

As this schematic indicates, a wide range of information plays a part in big energy data. Providing benefits to both utilities and customers requires advanced storage and analytics. (Image courtesy of Opower.)

Mapping the MapReduce Paradigm

For a wide range of industries, MapReduce makes up one of the most common big data application frameworks. This often gets implemented as Hadoop MapReduce, which makes it easier for a user to develop applications that handle large data sets—even in multiple terabytes—on computer clusters. (A minimal sketch of the programming model appears below.)

The complete range of big data applications could always remain in flux. As Boyd Wilson, executive at Omnibond in Pendleton, South Carolina, says, "As you iterate on various methods of teasing out the information from the data, that learning can lead to new areas of big data." As an example, Wilson mentions cross-disciplinary analytics. Overall, he says, "The big data sweet spot is lots of—potentially diverse—data that you want to combine to learn new correlations, and you need a high-performance medium, something like OrangeFS, to aggregate and analyze in large volume and with high throughput." (See 'A Big-Data File System for the Masses.')
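To make the programming model concrete before returning to its use cases, here is a minimal, self-contained sketch of MapReduce in plain Python. On a real Hadoop cluster, the same map and reduce functions would run distributed across many nodes (for example, via Hadoop Streaming); the local sort-and-group step below simply stands in for the framework's shuffle:

```python
# A minimal sketch of the MapReduce pattern, run locally in plain Python.

from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map: emit (key, value) pairs; here, one (word, 1) per word.
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    # Reduce: combine all values seen for one key.
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
pairs = sorted(kv for rec in records for kv in map_phase(rec))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_phase(word, (count for _, count in group)))
# ('brown', 1) ... ('fox', 2) ... ('the', 3)
```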
That "sweet spot" covers a lot of ground: consumer-behavior analysis, dynamic pricing, inventory and logistics, scientific analysis, searching and pattern matching, and more. Moreover, these applications of big data arise in many vertical markets, including financial services, retail, transportation, research and development, intelligence and the web.

In some cases, the use of big data will come from unexpected places. For instance, Leonid Zhukov, director of machine learning and development at Ancestry.com, says, "We're a technology company, and we do family history, but we are a big data company." He adds, "We use data science to drive family-history advances. So we're hiring data people." (See 'Computing Family Connections.')

As MapReduce and Hadoop continue to mature, these use cases will continue to extend further into mainstream enterprise contexts, but for now, there are some bigger problems to address.

Hadoop's High Points

Although Hadoop only made up slightly more than 10 percent of the applications mentioned in Intersect360 Research's poll (see below for details), many experts endorse this approach. When asked about the benefits of applying a Hadoop solution, Norris of MapR says, "The way to summarize the benefits of Hadoop is: It's about flexibility; it's about power; it's about new insights." (See 'M's from MapR.')

In terms of flexibility, according to Norris, Hadoop does not require that you understand ahead of time what you're going to ask. "That seems simple," he says, "but it's a really dramatic difference in terms of how you set up and establish an application." He adds, "Setting up a typical data warehouse is initially a large process, because you are asking how you will use the data, and that requires a whole set of processes." With Hadoop, on the other hand, a user can just start putting in data, according to Norris. "It's really flexible in the analysis you can perform and the data you can perform it on, and that really surprises people."

Norris points out several features of the power behind Hadoop. "You can use different data sets, and the volume that you can use is quite broad." He adds, "This eliminates the sampling. For example, you can store a whole year of transactions instead of just the last 10 days. Instead of sampling, you just analyze everything." By using such large volumes of data in analytical procedures, users can apply new kinds of analysis and also more easily identify outliers. Because of that ability to work with all of the data, says Norris, "Relatively simple approaches can produce some dramatic results." For instance, just the added ability to explore outliers could reveal some pleasant surprises, maybe some tactic that generated an unexpectedly high amount of revenue. On the other hand, analyzing outliers can reveal failures. As an example, Norris says, "You might find an anomaly like an expensive part of the supply chain that you want to eliminate. Overall, analyzing these outliers makes dramatic returns possible."

In terms of new insights, says Norris, "With Hadoop, you can recognize patterns much more easily and drive new applications." He also adds that Hadoop

M's from MapR

MapR in San Jose, California, developed a series of Apache Hadoop solutions, including M3, M5 and the latest M7. The M7 version adds HBase capabilities, which is the Hadoop database.
"This provides real-time database operations side by side with analytics," says Jack Norris, the company's vice president of marketing. "This provides lots of flexibility for the developers, because you can run applications from deep analytics to real-time database operations."

All of the MapR tools make it possible to treat a Hadoop cluster like standard storage. "Consequently, you can use industry-standard protocols and existing tools to access and manipulate data in the Hadoop clusters," Norris explains. "This makes it dramatically simpler and easier to support and use." This tool can also be used as an enterprise-grade solution.

For all of the MapR editions, the developers aimed at making Hadoop easy, fast and reliable to use. For example, these tools provide full data protection. Norris adds, "In terms of hardware systems, the beauty of MapR is that it supports standard, off-the-shelf hardware. So you don't need a specialized server." He adds, "You get better hardware utilization with MapR, because, for example, you can use a greater density of disk drives and faster drives."

To get an idea of how MapR can be applied, consider these case studies with M5. "One of the largest credit-card issuers is using M5 to create new products and services that better detect fraud," says Norris. "This includes some broad uses of analytics that benefit the cardholders." Some large online retailers also use M5 to better analyze shopping behavior and then use that to recommend items or simplify the user's experience. Other websites also use M5 in digital advertising and digital-media platforms. For example, says Norris, "Ancestry.com is using it as part of their service." Norris also points out: "Manufacturing customers use M5 for basically analyzing failures. They collect sensor data on a global basis and detect precursors to problems and schedule maintenance to avoid or prevent downtime." In summary, Norris calls the MapR users "really, really broad."

A user just starting with big data might start with Apache Hadoop. "When you really put this into production use," Norris explains, "having enterprise grade with advanced features is a real benefit." Some of the advanced features of MapR tools include a reduced need for service. For instance, the MapR platforms are self-healing.

To ensure that MapR platforms include the features that users really want and need, the company works with an extensive network of partners. "In some cases, these partners provide vertical expertise within a market or even do consulting work to identify likely use cases to help us select the best implementations of MapR," Norris says. In addition, this partner network also helps develop some of the training available from the MapR Academy, which provides free training. "We want to ensure that users can easily share information and applications on a cluster," Norris says. "That is basically a unique value proposition of MapR."
can be an economical solution. "A high-end data warehouse can cost $1 million per terabyte, whereas Hadoop is only hundreds of dollars per terabyte."

At Ancestry.com, Zhukov and his colleagues use Hadoop to solve many problems. "For different data types and different business units, we use Hadoop in different ways," Zhukov explains. For example, the team that provides optical character recognition for documents scanned into Ancestry.com uses Hadoop for that. Also, when this company acquires newspaper data that are structured and in a text format, facts—such as dates, names and places—must be extracted, and Hadoop provides the technology for the parallel processing that achieves this. Zhukov adds, "In the DNA pipeline, where we compare customer DNAs and do matching for relatives, we look at the percentage of overlap and ethnicity. That's a big data job." Hadoop also gets used at this company for the machine learning that generates the possible family connections for users. As Zhukov concludes: "We do analytics in more of a traditional setting, but we are turning to Hadoop for all business units in the company."

Emerging Options

Hadoop does not solve every big data problem or make the best fit for every user. In some cases, the actual implementation of Hadoop creates the biggest hurdle. As Zhukov says, "If you compare Hadoop to traditional database systems, it's in the early stages of its technology. So there's lots of tweaking and twisting for it to run smoothly."

In fact, the Hadoop challenges vary by user. "The biggest challenge really depends on the use case the customer has," says Hu of IBM. For example, he says, "If they are just analyzing data in some standalone system and they have Hadoop clusters that operate in isolation, then the challenge is just to keep those clusters up and running." Hu also points out some general concerns with Hadoop. He says, "It has challenges in single points of failure, and massive numbers of systems must be stitched together, which makes system provisioning and management a challenge." Beyond that, he lists several other issues, including the latency of execution in an application and the inability to execute workloads outside the Hadoop paradigm. To explore the general concerns even more, Hu directs readers to "Top 5 Challenges for Hadoop MapReduce in the Enterprise."

Beyond concerns of implementation and execution, Zhukov points out that it's not easy to hire Hadoop experts. As he says, "Here in Silicon Valley, it's proved to be quite difficult to hire people with Hadoop skills. So we're trying to grow and train our own people to work with Hadoop."

Sometimes, users assume that Hadoop is the only big data option. As Snell says, "In many cases, we find that Hadoop is a technology in search of solutions." To explain this, he says, "Imagine that I've designed a new screwdriver, and I'm trying to find how many things I can do with it. In practice, you might be able to drive in a Phillips screw with a flathead screwdriver, but the flathead screwdriver might not have been the best tool." Likewise, Hadoop is not the best tool for every big data situation.

Other options keep emerging for users. In some cases, companies even develop new approaches to creating the new tools. For example, Omnibond makes all of its advanced tools available to the open-source community. As Ray Keys, the company's marketing manager, explains, "We have a very unique business model.
It's pure open source, and pushing everything out to the open-source community is really powerful." He adds, "There is goodwill involved, but it's goodwill from people we do development for, and they all realize that it's goodwill for the entire community. As a result, the community contributes, as well." Beyond just goodwill, this company's OrangeFS provides a file system that can handle millions and millions of files in a single directory without reducing the performance of the solution.

End-User Applications

To get a quantitative grasp of how users apply big data, Intersect360 Research conducted a survey and produced this report: "Different Perspectives on Big Data: A Broad-Based, End-User View." This study arose from surveying—in partnership with the Gabriel Consulting Group in Beaverton, Oregon—more than 300 users of big data applications, which included both high-performance computing (HPC) and non-HPC. The survey sampled a wide range of users, including commercial, government and academic (or not-for-profit) users. For example, the commercial users ranged from consulting and computing to energy and engineering, as well as manufacturing and medical care. The government users worked in national research laboratories, defense, security and other areas. The users also came from a range of group sizes, from a few to more than 100,000 employees. For instance, nearly 20 percent of the groups had only 1–49 employees, and almost 10 percent had 20,000–99,999.

Such a diverse sample should provide a broad overview of big data applications, and the survey results confirm that. Among those polled, the big data applications included enterprise, research and real-time analytics, plus MapReduce, complex-event processing, data mining and visualization. For example, more than two-thirds of those
polled said that they work with big data applications in research analytics, and more than half apply data mining to big data situations. A small portion of those surveyed—6 percent—reported no current applications, although they were planning future ones.

Those surveyed also rated a series of big data challenges on a scale of 1–5, ranging from 'not a challenge' to an 'insurmountable challenge,' respectively. Overall, the users ranked the most significant challenges as managing many files and performing computations and analyses fast, where the average scores for both were 3.7 on the 1–5 scale. Among the technical users, 70 percent ranked managing many files as a 4 or 5 on the scale—making it feel like an insurmountable, or nearly insurmountable, challenge for 7 out of 10 of those polled. Among business users, 65 percent rated performing computations and analyses fast as a 4 or 5 on the scale of big data challenges. In general, though, most users rated most of the challenges as significant. For example, users also reported serious challenges—average scores above 3—for managing large files, giving multiple users simultaneous access to data and providing long-term access to data.

This survey also explored where users get their software for big data applications. The results show some unexpected, or at least inefficient, results: Half of those polled developed the applications internally; one-quarter of them purchased the applications; and about another quarter used open-source applications. "For those who developed the applications internally," says Snell, "those are all one-offs by definition." With challenges as complicated as big data, developing a unique solution to every problem does not make for efficient computing. Moreover, the wide use of one-offs limits these applications to groups that maintain the personnel to develop these advanced applications.

This survey also examined the specific tools the users apply to big data scenarios. Out of the 300-plus responses to this survey, 225 provided details on their applications, and Hadoop came out on top. Nonetheless, only 11 percent of the users reported that they make use of Hadoop. Still, says Snell, "If people have a big data problem, they often think they must use Hadoop."

Future Fits

By definition, big data problems outstrip the capabilities of a user's current system. So some of the tools typically used only in HPC will work their way into non-HPC environments. A key example of that could arise from the use of parallel file systems. The Intersect360 Research survey asked users about the file systems that they employ. The replies included a diverse set of approaches. Across all types of users—academic, commercial and government—most selected the Network File System (NFS), which was developed by Sun Microsystems in 1984. Overall, nearly 70 percent of the users employ NFS. Not many of the users implement parallel file systems at this point. For example, less than 20 percent of the people polled used the Hadoop Distributed File System (HDFS), IBM's General Parallel File System (GPFS) or Lustre. As Snell says, "Parallel file systems are still poorly penetrated into commercial markets." He adds, "Commercial users will need to find solutions for parallel I/O."

Beyond learning to build the infrastructure for big data, much remains to be learned about where to use it.
Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina, provides a recent example. During the 2012 influenza season, big data experts in Scotland sequenced the genotype of every person who contracted the flu over a two-week period. From that data, the analysts extracted information that revealed who was most susceptible to the illness, and who might require hospitalization. Scotland's government then used those data to adjust its healthcare policy. "They are using big data in how they deal with influenza," Nichols explains. "This is a good example that we haven't even seen the starting point for what we can do with this technology." He adds, "If big data solutions grow in the right way, then there's the opportunity to make meaningful changes in the way drugs are developed, put through to market and utilized."

The variety of infrastructures being employed and the questions being considered reflect the diversity of approaches that today's users take to big data challenges. This reveals that many technologies in computing can be modified to handle big data problems in many enterprises. Moreover, many technologies besides Hadoop can be explored. For any particular user, the best technology depends on the big data problem at hand. Even for the same user, different questions and applications will often require different approaches to making use of big data. As Snell concludes, "Hadoop is not all things to all people. Plus, there is not one perfect answer for all big data problems."

About the Author: Mike May, Ph.D. is a science and technology writer, editor and book author. Mike has published hundreds of articles in prestigious publications during his 20-year professional writing career, and is currently editorial director for Scientific American Worldview. He holds a Ph.D. from Cornell University in neurobiology and behavior.
Sponsor Profile

A to Z for Big-Data Challenges
IBM provides every tool—from applications to complete systems—for cost-effective solutions

When a customer runs into a big-data problem—from struggling with data velocity, volume or variety—IBM offers a solution. Such solutions, however, must supply flexible options, because of the relativity of the term big in big data. "The 'big' in big data is like the 'high' in high performance computing," says Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group. "It's different to different people."

To provide flexibility, IBM orchestrates a very broad big-data initiative. It ranges from complete systems—like the xSeries and pSeries architectures that handle big-data workflows—to various aspects of computing solutions. For example, IBM offers various forms of storage technology that benefit big-data situations. The company also creates networking components and interconnects needed for extremely high-throughput systems. The data in any system does not live in isolation. Big data come from a location, and the analysis goes to a destination. So IBM solutions can provide connectivity between big-data systems and the databases and data warehouses in the enterprise.

IBM: A globally integrated technology and consulting company
Founded: 1911 (as Computing Tabulating Recording Company)
Number of Employees: 433,362
Headquarters: Armonk, NY
Top Executive: Virginia M. Rometty (chairman, president and CEO)

IBM Platform Symphony brings mature multi-tenancy and low-latency scheduling to Hadoop. Symphony enables multiple MapReduce workloads to run on the same cluster at the same time, dynamically sharing resources based on priorities that can change at run-time.

The Power of Parallel Files

To provide an overview of big-data challenges, Hu divides the computer processing of big-data problems into different patterns. The first data workload might require only a monthly assessment. This does not require any immediate analysis. Instead, data can be collected and analyzed over time, and the results get used for decision-making as needed. At the other end of the spectrum, some applications—such as a traffic-monitoring system or software that maintains power across a utility grid—require real-time decisions. This requires a system with interactive task latencies on the order of milliseconds. "The current Hadoop system doesn't have this kind of near real-time response capability," Hu says.

Working effectively with this variety of big-data applications requires a powerful file system. IBM's new General Parallel File System File Placement Optimizer (GPFS FPO), developed specifically for big data, allows for a model very similar to the Hadoop Distributed File System (HDFS). This file system provides a range of necessary features. For example, it includes enterprise capability. GPFS FPO is also extremely scalable, so it can grow with a user's data needs. Furthermore, this file system includes no single points of failure, which provides the robust computing needed to create sophisticated data solutions. Moreover, its POSIX support allows applications to
get the benefits of a UNIX-compatible file system.

Orchestrating a Solution

In many cases, big-data applications require advanced analytics. For those situations, IBM created IBM Platform Symphony. "This software implements a Hadoop MapReduce that is very fast," says Hu. Moreover, this software provides very low latency. For instance, the overhead provisioning for an executed task takes less than 1 millisecond. "Open-source tool sets take an order of magnitude longer," Hu says. On average, performance tests reveal that Platform Symphony is 7.3 times faster than open-source Hadoop for jobs comprised of a representative mix of social media workloads.[1]

For someone who has been running Hadoop MapReduce on a cluster, one challenge involves efficient utilization. In many cases, the user lacks sufficient jobs to keep the cluster occupied. In fact, utilization often runs as low as 20 percent. Platform Symphony helps users extract more from a cluster by getting workloads there faster and not keeping them waiting to run. In addition, users can run heterogeneous workloads, such as Hadoop MapReduce mixed with C++ jobs. Moreover, a business can divide a cluster among multiple units with Platform Symphony, giving each a slice of the structure. Then, the units can operate virtually independently of each other.

In a recent return-on-investment study, IBM explored the cost of running a big-data system, covering everything from the cost of the hardware and administration to cooling the data center. By using Platform Symphony instead of a standard Hadoop system, the results showed a potential savings of almost $1 million, depending on the customer's use case.

For Real-Time Needs

IBM's InfoSphere offers further options in big-data analytics. This approach handles big-data applications that need to run in real or almost real time. It includes ready-made components for people who already use Hadoop analytics. Furthermore, InfoSphere's accelerators and frameworks speed up the development of big-data analytics. One ready-made tool, for example, is BigSheets. This provides a spreadsheet-like interface to the Hadoop stack. It extracts and manipulates data from the spreadsheet interface. InfoSphere also lets users combine BigSheets with other software. It supports all of the components of the open-source Hadoop ecosystem. So a user could select open-source Hadoop layers in any of the stacks. Hu adds, "Customers can choose the layers where they want enterprise-capable technology."

Building Relationships

In tailoring big-data solutions to customers, IBM supplies a wide range of customer services. This includes assessing a customer's existing infrastructure and providing business-level needs. This combines IBM's technical- and business-service arms. Getting the right solution often requires both levels of analysis. For example, an efficient big-data solution requires the right computing infrastructure and software systems, as well as the business knowledge to determine what reports should be generated from the analysis.

The very basis of big data often implies an analytical challenge. For example, Vestas—a wind-turbine maker in Denmark—turned to IBM to determine the best locations for electricity-generating wind turbines. The resulting analysis included weather information, such as wind and climate, as well as the local geography. This company used IBM technology to put together a system that could produce accurate answers in a timely fashion.
The results generated the best locations for the wind turbines and how to arrange them to harness the most energy possible.

As big-data problems grow and evolve, IBM's solutions keep adapting and advancing. It takes the teamwork of hardware and software infrastructure plus powerful analytics to perform the needed computation. In addition, technical and business solutions must work together to generate the most useful solution. From the xSeries and pSeries systems to powerful interconnects and advanced analytical algorithms, IBM supplies big-data tools and solutions that extend from high-performance infrastructure to sophisticated business intelligence and planning. Only approaches as orchestrated as this can turn big-data opportunities into business growth and success.

Reference
1. STAC Report: A Comparison of IBM Platform Symphony and Apache Hadoop MapReduce using Berkeley SWIM, 2012.

IBM provides a variety of System x and Power-based solutions optimized to run big-data tools, like IBM InfoSphere BigInsights, IBM Platform Symphony and IBM GPFS.
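One practical consequence of the POSIX support noted under 'The Power of Parallel Files' above is that ordinary file-handling code needs no changes to run against a POSIX-compatible parallel file system. A minimal sketch follows; the mount point and file name are hypothetical, and nothing in the code is GPFS-specific, which is exactly the point:

```python
# Plain POSIX I/O against a parallel file system mount. Because GPFS FPO
# presents a UNIX-compatible file system, unmodified code like this works;
# the path below is a hypothetical mount point.

import os

MOUNT = "/gpfs/bigdata"                      # hypothetical GPFS mount
path = os.path.join(MOUNT, "events.log")

with open(path, "a") as f:                   # ordinary open/append
    f.write("sensor-42,2013-01-07T12:00:00,ok\n")

with open(path) as f:                        # ordinary read
    print(sum(1 for _ in f), "records on the parallel file system")
```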
Sponsor Profile

A Big-Data File System for the Masses
The computing community and Omnibond developed open-source infrastructure software for storage clusters

To break down big-data problems, many data scientists are looking to the Hadoop MapReduce model. Currently those systems run on dedicated Hadoop clusters. Many companies and researchers have existing clusters that run all sorts of code but currently cannot leverage Hadoop in their shared environments. Dedicated clusters are required due to the close ties of Hadoop MapReduce and the Hadoop Distributed File System (HDFS). To solve that problem, Omnibond has extended the open-source Orange File System (OrangeFS) to natively support Hadoop MapReduce without the HDFS requirement.

According to Boyd Wilson, Omnibond executive, "Our main business model for OrangeFS is all open-source, so you can download and start running today." In addition, Omnibond provides support and development services. For example, if a customer needs a specific feature added to OrangeFS, Omnibond will price out the cost and create it. The customer must agree that the resulting code goes back into open source. Furthermore, OrangeFS offers a wide range of applications that can be used with high-performance and enterprise computing.

The need to make parallel file systems easier to use, including leveraging MapReduce on existing clusters, ignited the original work on OrangeFS. To bring such systems to more users, the parallel file system development work needs to continue. OrangeFS is not limited to traditional data science alone, as the broadening list of use cases is expanding from research to finance, entertainment and analytics.

Enhancing the open-source business model, Omnibond provides everything that a customer would expect from a company that develops closed-source code. "If you have a problem, just call. If you need a feature, we can add it, and in addition, you get the peace of mind in having the source code, too," Wilson says. Customers can purchase this service and directed development on an annual subscription basis, giving them the assurance of stable development and support. Moreover, this model provides a triple-win situation that's great for the support customer, Omnibond and the entire open-source community, thus assuring the ongoing development of OrangeFS.

Omnibond Systems, LLC: A software engineering and support company focused on infrastructure software.
Founded: 1999
Employees: 45
Headquarters: Clemson, SC
Top Executive: Boyd Wilson

Omnibond Systems develops open-source tools that can be used in a wide range of big-data applications. (Image courtesy of Omnibond Systems.)

Inside OrangeFS

In general, OrangeFS provides the next-generation approach to the Parallel Virtual File System, version 2 (PVFS2). From the start, PVFS2 delivered desirable features, including a modular design and distributed metadata. It also supports the Message Passing Interface (MPI). Other features include PVFS2's high scalability, which makes it a great choice for large data sets or files. Beyond working with such large files, PVFS2
works fast. For example, it outruns—really outruns—a network file system (NFS) using the MPI-IO interface.

Omnibond started with those features and generated OrangeFS by adding even more. For example, this file system comes from a focused development team. It also includes improvements in working with smaller files; Omnibond does all of its testing on 1-megabyte files. In the 2.8.6 version, released on June 18, 2012, users were provided with advanced client libraries, Linux kernel support and Direct Interface libraries for POSIX calls and low-level system calls. Additionally, users can receive support for WebDAV and S3 via the new OrangeFS Web Pack. Future releases promise even more features, such as distributed directories and capability-based security, which will make the access time for large directories even faster and the entire file system more secure.

Added Incentives

For many big-data users, the list of OrangeFS features already mentioned sounds enticing enough, especially with the best of open- and closed-source worlds. Nonetheless, even more incentives exist. The object-based design of OrangeFS appeals to many users. In addition, this file system separates data and metadata. OrangeFS also provides optimized MPI-IO requests. Furthermore, this software can be installed on a range of architectures. For example, users can apply OrangeFS to multiple network architectures. In addition, OrangeFS works on many commodity hardware platforms. To ensure file safety, OrangeFS in the near future will also provide cross-server redundancy.

Given the open-source nature of OrangeFS, any documentation might sound like a rarity, but Omnibond goes even further. For instance, the recent OrangeFS Installation Instructions supply complete documentation—all available online.

Many applications of big data require aggregation. For example, a system might gather real-time data and store it for access. That stored data might have access to clustering nodes for Hadoop. Then, the collected data can be queried and analyzed later with MapReduce. Likewise, a big-data system might require other forms of data aggregation. For instance, real-time data could be collected, combined with private data and then processed. On the other hand, a user might take 'snapshots' while processing real-time data to collect a subset of data, and then analyze it. OrangeFS can provide storage for all of these big-data scenarios.

Advanced Applications

The open-source community stays at the cutting edge of computation and data analysis. Consequently, experts at Omnibond track all of the trends in high-performance and big-data environments. The information gleaned by Omnibond continues to impact future versions of OrangeFS. As an example, ongoing work in data analysis spawns unimagined approaches. New ways of even thinking about data keep emerging. Social-media data analysis, for instance, and other large streams of data can trigger entirely new ways of analyzing information, such as log, customer and other real and near real-time data. In other cases, new perspectives on existing data will generate novel approaches to data integration that were never considered.

As Wilson explains, "The new concept of data analysis covers a wide range.
In some cases, it involves huge amounts of data, and other times not so much." He adds, "You can bring together different data points that you wouldn't think would correlate, but sometimes they do."

This kind of thinking—seeing today's challenges and turning them into tomorrow's opportunities—lies at the heart of Omnibond's philosophy. "We are adding exciting new features on our development path to OrangeFS 3.0," says Wilson. "These should be real game-changing enhancements." As a company, Omnibond focuses on building the OrangeFS that customers want, and then sharing those advances with the wider community.

Omnibond's open-source Orange File System (OrangeFS) supports Hadoop MapReduce and provides a simplified parallel file system. (Image courtesy of Omnibond Systems.)
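To illustrate the MPI-IO access pattern this profile mentions, here is a minimal mpi4py sketch in which every rank writes its own slice of a shared file in one collective call, the kind of concurrent access a parallel file system such as OrangeFS is built to service. The file path is illustrative, and the example is a generic MPI-IO sketch rather than anything OrangeFS-specific:

```python
# Collective MPI-IO write: each rank lands its slice at a disjoint offset
# of one shared file, so a parallel file system can serve all ranks at once.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

CHUNK = 1024                                  # values per rank
data = np.full(CHUNK, rank, dtype=np.int32)   # this rank's slice

fh = MPI.File.Open(comm, "/mnt/orangefs/demo.dat",   # illustrative path
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Rank i writes at byte offset i * CHUNK * 4 in a single collective call.
fh.Write_at_all(rank * data.nbytes, data)
fh.Close()
# Run with, e.g.: mpiexec -n 8 python mpiio_demo.py
```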
Sponsor Profile

Crafting Storage Solutions
Xyratex builds scalable capacity and performance into big data systems

Big data varies from one customer to the next. As Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, said, "What's big data to one guy is tiny to another." No matter how you define big data, achieving analytic results rapidly needs one thing: fast, reliable data storage. And with more than 25 years of experience as a leading provider of data storage technology, this is exactly where Xyratex leads the market. Xyratex' innovative technology solutions help make data storage faster, more reliable and more cost-effective.

XYRATEX: A leading provider of data-storage technology.
Founded: 1994
Employees: 1,900
Headquarters: Havant, United Kingdom

As an organization, Xyratex provides a range of products, from disk-drive heads and disk media test equipment to enterprise storage platforms and scale-out storage for high-performance computing (HPC). For big data applications, where users need storage that scales both in capacity and performance, Xyratex solutions support everything from high-performance data processing of relatively small data quantities to some of the largest supercomputers in the world. "Performance is often overlooked," Claffey said, "but it's the key driver. Once you have your data together, you need performance for analytics to transform that data into information you can use to make more-informed decisions."

Combining Features

Xyratex achieves its big data storage success by providing solutions, not just parts. The company has developed a fresh, groundbreaking scale-out storage architecture that meets user performance and scalability needs while also providing the industry's highest levels of integrated solution reliability, availability and serviceability. "You can go from a single pair of scalable storage nodes with less than 100 terabytes of capacity to solutions that support more than 30 petabytes of capacity and deliver terabyte-per-second performance," Claffey said. "Then we embed the analytics engine right into the storage solution, helping you make better use of all the data you are storing and maximize operational efficiency within the solution."

To help big-data users, Xyratex provides its ultra-dense 5U84 enclosure. (Image courtesy of Xyratex.)

Adding Engineering

Traditional HPC users often tried to build their own data storage infrastructure by stitching together different pieces of off-the-shelf hardware. This approach often encountered challenges. Xyratex saw an opportunity to engineer a high-performance, singly managed, appliance-like data storage solution called ClusterStor. ClusterStor is a tightly integrated scale-out solution that is easy to deploy, scale and manage.

What is now termed big data has been used in practice for many years within compute-intensive environments and is evolving toward an enterprise class of requirements. However, enterprise users do not have the time to struggle with the level of complexity these environments bring with them. To support exponential growth in these types of environments requires data storage solutions that can keep costs in check. As with general-purpose storage acquisitions, enterprise users should consider the total cost of ownership, including the cost of upgrades and time to productivity, as well as operational efficiency. ClusterStor allows enterprise-class users to move into an engineered high-performance data storage solution
with enterprise reliability, serviceability and availability, without the hidden costs associated with stitching together components.

Breadth of Benefits

With a Xyratex ClusterStor storage solution, a user gets an integrated product. While Xyratex leverages open-source and standards-based hardware platforms, it also adds its own blend of ease of deployment and reliability. It all adds up to the fastest and most reliable data storage on the market today. In addition, all ClusterStor systems come pre-configured from the factory, enabling turnkey deployment, which significantly reduces installation time.

Many sectors—such as digital manufacturing, government, commercial industry and others—are finding that big-data initiatives can provide new understanding of the data they already have, but many organizations in these sectors lack the in-house expertise to create and manage big-data solutions. "Coming to users with a big-data solution that they can easily deploy and manage removes a barrier for them to get access to the technology they need," said Claffey. "The complexity of the solution is completely hidden from the user, and you don't see all of the thousands of man-hours that it took to create that experience. We help you reduce complexity, get a great out-of-the-box experience and maximize the uptime of the system."

"Instead of building and tweaking to get a system up and running, our turnkey solutions reduce the time it takes to start computing. We even have architects who can come in and design how the solution will fit into your environment," added Steve Paulhus, senior director, strategic business development at Xyratex. "We are known for high-quality storage products—in the way we engineer, test, validate and deliver preconfigured ClusterStor solutions. Through all of this, we offer the quickest path to the performance and results users seek."

As big data continues to quickly evolve, users' storage needs will as well, and data storage systems and vendors will have to adapt to stay relevant in the market. "We continue to leverage our knowledge and experience-based understanding of end-user data storage needs to relentlessly push beyond traditional boundaries," said Paulhus. "That's how we will continue to advance in big data and provide solutions ideally suited to serving the needs of the market."

The ClusterStor 6000 provides the ultimate integrated HPC data-storage solution. (Image courtesy of Xyratex.)