Big Data Implementation: Hadoop and Beyond
CONDUCTED BY:
Tabor Communications Custom Publishing Group
IN CONJUNCTION WITH:
Dear Colleague,
Welcome to the first in a series of professionally
written reports and live online events around topics
related to big data, high-performance computing,
cloud computing, green computing, and digital manu-
facturing that comprise the media portfolio produced
by Tabor Communications, Inc.
Most upcoming Tabor Communications reports and
events will be offered at no charge to viewers through
underwriting by a limited group of sponsors. Our first report, titled Big Data Implementation: Hadoop and Beyond, is graciously underwritten by IBM, Omnibond and Xyratex. More
about how these organizations tackle challenges associated with big data can
be found in the sponsor section.
Tabor Communications has been writing about the most progressive technol-
ogies driving science, manufacturing, finance, healthcare, energy, government
research, and more for nearly 25 years. The depth of our new report series extends beyond the limited space we can devote to coverage of specific topics through our regular media resources. The writers of these reports are typically contributing editors for Tabor Communications and/or content partners like Intersect360 Research, which developed the research cited within this report.
The best topics to cover typically come from the user community. We en-
courage you to offer any suggestions for upcoming reports that you feel will be
helpful to you in better reaching your business goals. Please send your ideas to
our SVP of Digital Demand Services, Alan El Faye (alan@taborcommunications.
com). Sponsorship inquiries for future reports would also go to Alan.
Thank you for reading Big Data Implementation: Hadoop and Beyond.
Please visit our corporate website to learn more about the suite of Tabor pub-
lications: http://www.taborcommunications.com/publications/index.htm
Cordially,
Nicole Hemsoth
Editorial Director-Tabor Communications, Inc.
Big Data Implementation: Hadoop and Beyond
Conducted by: Tabor Communications Custom Publishing Group
A Tabor Communications, Inc. (TCI) publication © 2013. This report cannot be duplicated without prior permission from TCI. While every effort is made to assure accuracy of the contents, we do not assume liability for that accuracy, its presentation, or for any opinions presented.
Study Sponsors
Media and Research Partners
By Mike May
The big data phenomenon is far
more than a set of buzzwords
and new technologies that fit into
a growing niche of application areas.
Further, all the hype around big data
goes far beyond what many think is the
key issue, which is data size. While data
volume is a critical issue, the related matters of fast movement and operations on that data, not to mention the incredible diversity of that data, demand a new movement—and a new set of complementary technologies.
But we’ve heard all of this before, of
course. Without context, it’s hard to
communicate just how important this
movement is. Consider, for example,
President Obama’s recent campaign,
which many touted as by far the most
comprehensive Web and data-driven
campaign in history.
The team backing the big data efforts
for Obama utilized vast, diverse wells of
information ranging from GIS, social me-
dia, demographic and other open and
proprietary data to seek out the most
likely voters to lure from challenger Mitt
Romney. This approach to wrangling
big data for a big win worked, although
it’s still a matter of speculation just how
much these efforts tipped the scale.
So let’s take a more concrete set of
Big Data Implementation:
Hadoop and Beyond
In this new era of data-intensive computing, the next generation of tools
to power everything from research to business operations is emerging.
Keeping up with the influx of emerging big data technologies presents its
own challenges—but the rewards for adoption can be of incredible value.
High-performance computing and machine learning help users track their family tree at Ancestry.com.
This screen shot features the Shaky Leaf, which employs predictive analytics to suggest possible new
information in a user’s family tree. (Image courtesy of Ancestry.com.)
measurable examples. Healthcare
data are being processed at incredible
speed, pulling in data from a number
of sources to turn over rapid insights.
Insurance, retail and Web companies
are also tapping into big, fast data
based on transactions. And life sci-
ences companies and research enti-
ties alike are turning to new big data
technologies to enhance the results
from their high performance systems
by allowing more complex data inges-
tion and processing.
There is little doubt that the next
wave of big data technologies will rev-
olutionize computing in business and
research environments, but we should
back up for a moment and take in
the 50,000-foot view to put this all in
clearer context.
The roots of big data are not in the
traditional places one used to look for
the cutting edge of computational tech-
nologies—they are entrenched in the
daily needs of the world’s largest, most
complex and earliest web companies.
As Jack Norris, VP of marketing at Hadoop distribution startup MapR, reminds us, “Google is the pioneer of big
data and showed the dramatic power
and ability to change industries through
better analysis.” He adds, “Not only did
Google do better analysis, but they did
it on a larger dataset and much cheap-
er than traditional alternatives.” Mak-
ing use of big data requires powerful
new computing tools, and perhaps the
best-known one is called Hadoop.
As the case of Google calls to mind,
even though big data are often big, size
alone does not adequately define the
space. Likewise, the field of big data
extends beyond any particular applica-
tion, as indicated by the examples giv-
en above. As Norris explains, “Big data
is not just about the kind or volume of
data. It’s a paradigm shift.” He adds,
“It’s a new way to process and analyze
existing data and new data sources.”
The point at which a computing chal-
lenge turns into big data depends on the
specific situation. For some research and enterprise end users, existing applications are being strained by data complexity and volume overload; for others, performance is lagging due to these factors. Another group of users
simply needs to abandon old applica-
tions, systems or even methodologies
completely and build again to create a
big data-ready environment.
As Yonggang Hu, chief architect for
Platform Computing, IBM Systems
and Technology Group says, “One of
the simplest ways to define big data is
when an organization outgrows their
current data-management systems,
and they are experiencing problems
with the scalability, speed and reliabil-
ity of current systems in the data they
are seeing or expect to get.” When
faced with such a situation, says Hu,
“They need to find a cost-effective
way to move forward.” (See ‘A to Z for
Big Data Challenges.’)
When viewed in total, the scalability,
performance, storage, network and
other concerns continue to mount—
which has opened the door for the
new breed of big data technologies.
This includes everything from fresh
approaches to systems (new appli-
ances purpose-built for high-volume
and -velocity data ingestion and pro-
cessing) to new applications, data-
bases and even frameworks (Hadoop
being a prime example).
The Deluge of Data
to Decode
There is no one definition of a big
data application, but it’s safe to say
that many existing applications need
to be retooled, reworked or even
thrown overboard in the pursuit to
keep pace with data volume, variety
and velocity needs.
Throw in the tricky element of real-
time applications—which require high
performance on top of complex in-
gestion, storage and processing—and
Benefits of BigQuery
To explore big data, people want to ask questions about the information. “Google’s
BigQuery is a tool that allows you to run ad hoc queries,” says Michael Manoochehri,
one of Mountain View, California-based Google’s developer programs engineers. “It
lets you ask questions on large data sets, like terabytes.”
To make this tool accessible to more users, Google implemented a SQL-style
syntax. Moreover, the company had used this tool—in a version called Dremel—internally for years before releasing a form of it as its generally available service, BigQuery, in 2012. “It’s the only tool that lets you get results in seconds from terabytes,” Manoochehri says.
This tool brings big data to many users because it is hosted on Google’s infrastruc-
ture. “You take your data, put it in our cloud and use this application,” Manoochehri ex-
plains. Because of that simplicity, a diverse group of users already employs BigQuery.
As examples, Manoochehri mentions Claritics—another Mountain View–based
company, which describes its services as providing “social gaming and social com-
merce analytics”—and Blue Box—a company in Brazil that provides mobile adver-
tisements. Regarding Blue Box, Manoochehri says, “They started with MySQL, and
when the data got too big they moved to Hadoop.” He adds, “Hadoop is cool, but it
takes setup and administration.” Then, Blue Box tried BigQuery. “It fit,” Manoocheh-
ri says, “because Blue Box is a small shop with lots of data to process. They wanted
to run queries on data and integrate it into their application.”
Various other industries also use BigQuery. As an example, Manoochehri points
out the hospitality industry, which includes hotel and travel data. Overall, he says,
“Almost all of the case studies for BigQuery are small shops, like two or three engi-
neers, that have no time to set up hardware.”
In general, BigQuery’s biggest benefits arise from speed. “It’s optimized for speed,”
Manoochehri explains. It’s fast to set up and it brings back results fast, even from
very big datasets.
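To make the idea concrete, here is a minimal sketch of an ad hoc query as it might look through the google-cloud-bigquery Python client (a present-day client; the interface available when this report was written differed). The project, dataset, table and column names are hypothetical placeholders.

```python
# A minimal sketch of an ad hoc BigQuery query, assuming the
# google-cloud-bigquery Python client and default Google Cloud credentials.
# The project/dataset/table names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# SQL-style syntax over a (hypothetically) multi-terabyte events table.
query = """
    SELECT user_id, COUNT(*) AS event_count
    FROM `my_project.my_dataset.events`
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
"""

# BigQuery runs the scan on Google's infrastructure and streams back rows.
for row in client.query(query).result():
    print(row.user_id, row.event_count)
```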
users are left with a lot of questions
about how to meet their goals. While
there is no shortage of new technolo-
gies that are aimed at meeting those
targets, the challenges can be over-
whelming to many users.
When asked about the hurdles that
users face today in working with big
data, Michael Manoochehri, developer
programs engineer at Google headquarters, says, “They are often looking
at complexity, trying to manage com-
plexity with limited resources.” He adds,
“That’s a big problem for lots of people.”
In the past, the lack of resources kept
many potential users out of big data.
As Manoochehri says, “Google and
Amazon and Microsoft could handle
big data,” but not everyone. Today, big
data tools exist for many more users.
For example, Manoochehri says, “Our
products make these seemingly impos-
sible big data challenges more acces-
sible.” (See ‘Benefits of BigQuery.’)
A prime example of this can be
found in large-scale data mining,
which can quickly turn into a big data
problem, even if it’s a process that
used to fit neatly in a relational data-
base and commodity system. For life
sciences companies, data mining has
always been a key area of computa-
tional R&D, but the addition of new,
diverse, large data sets continues to
push them toward new solutions.
The data mining, analytics and pro-
cessing challenges for life sciences are
great. Genomics explores the chromo-
somes of a wide range of organisms—
everything from seeds to trees and hu-
mans. Chromosomes consist of genes,
which code for proteins, and nucleotides
make up the genes. The nucleotides
are arranged, more or less, like beads
on a necklace, and the order of the
nucleotides and the types—they come
in four forms—are like letters spelling
out words. In the case of genomics,
the words translate into proteins, which
control all of the biochemical processes
that fuel life. Hidden in those words are
complex programming and process-
ing challenges as life sciences users
attempt to piece these bits together to
find “needle in the haystack” discov-
eries amidst millions (and more often,
billions) of pieces at a time.
Sequencing determines the order
and identity of the nucleotides, and
that information can be used in basic
research, biotechnology, medicine and
more. Today’s sequencing machines
produce data so fast that this quickly
generates big data. In 2010, for ex-
ample, BGI—a sequencing center in
Beijing, China—bought 128 Illumina
HiSeq 2000 sequencing platforms. Ac-
cording to specifications from Illumina,
one of these machines can produce up
to 600 gigabytes of data in 8.5 days.
So 128 of these sequencing machines
could produce about one-quarter mil-
lion gigabytes of data a month, and
that’s big data. Today’s sophisticated
life science tools and big data applica-
tions also impact the healthcare indus-
try. (See ‘Boosting Big Pharma.’)
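As a quick back-of-the-envelope check of those sequencing figures, the arithmetic runs as follows (a sketch using the numbers cited above, with a 30-day month as an approximation):

```python
# Back-of-the-envelope check of the sequencing throughput cited above:
# 128 machines, each producing up to 600 GB per 8.5-day run.
machines = 128
gb_per_run = 600
days_per_run = 8.5
days_per_month = 30  # approximation

gb_per_month = machines * gb_per_run * (days_per_month / days_per_run)
print(f"{gb_per_month:,.0f} GB per month")  # ~271,059 GB: about a quarter-million
```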
The challenges across the com-
putational spectrum are not hard to
point to in such a case, especially
when it comes to large scale genomic
sequence analysis. With the terabytes
of data rolling off sequencers, the vol-
ume side of the big data equation is
incredible, necessitating innovative
approaches to storage, ingestion and
data movement. In terms of the nee-
dle in a haystack side of the analytics
Boosting Big Pharma
“The big data question hasn’t really been fully exploited in big pharma at the mo-
ment,” says Gavin Nichols, vice president of strategy and R&D IT at Quintiles in
Research Triangle Park, North Carolina. Nonetheless, many big data opportunities
exist in healthcare. “If you took the big data around genomics, the ability to master
investigator and doctor information, the information from clinical trials and what goes
on drug labels and much more, you get a picture of where you could influence the
field,” says Nichols. In short, big data could impact the pharmaceutical industry from
discovering new drugs to commercializing them. “There is the ability to influence the
whole lifecycle,” Nichols says, “but that has not been fully realized.”
Quintiles, on the other hand, started working with big data some time ago. “For the
past seven or eight years,” Nichols says, “we’ve been very strategic about the data
that we consider.” This includes both enterprise and technical computing. For ex-
ample, Quintiles analyzes large datasets in an effort to improve efficiencies across its
various organizations. Likewise, the company takes advantage of electronic health
records (EHRs) and registries in an effort to better understand the populations that
need new pharmaceuticals.
As one crucial example, Quintiles uses big data in an effort to make clinical trials
more efficient. In a late-phase trial, for example, better use of EHRs could reduce
time and cost. “Instead of going to a trial site and asking hundreds of questions, like what
drugs the people take,” Nichols explains, “we could ask the EHRs those questions,
and maybe we can answer 40 percent of the questions.” Beyond increasing efficien-
cy, a similar approach could also enhance the information gained from a clinical trial.
“Information from EHRs could help us design more-targeted studies on a smaller
population,” Nichols says, “and that’s a win-win for everyone.”
Although big data offers a gold mine to the pharmaceutical industry, companies
must use this resource wisely. “Big data in the life sciences without good clinical
interpretations could be a hazard,” Nichols says. “So we bring data together and
enrich it.” Quintiles combines data from, say, clinical trials with the experience of
clinical experts.
To make the most of big data opportunities, the experts at Quintiles consider vari-
ous options. “We look long and hard at things,” Nichols says. “We’re looking at Ha-
doop, and we’ve also used Solr on a SQL frontend.” He adds, “We’re also looking at
techniques developed by companies that are building on top of Hadoop.” For patient
privacy, however, Quintiles segregates analytics from the raw data.
In many cases, though, each pharmaceutical question requires its own big data
implementation. For efficiency, Quintiles relies heavily on data scientists. “By using
research analysts—individuals who sit between the knowledge of what the data are
and the analytical approaches—we can see if we’ve examined similar questions in
the past,” Nichols says. “Then, we might be able to go back to some application that
you’ve used before.” Nonetheless, Nichols adds, “Clinical research has more and
more one-off questions.”
problem, standard software that fits
into neat boxes is no longer able to
keep pace with the incredible speed
and complexity of the puzzle-piecing.
Hence life sciences is one of the key
use cases representing the larger
challenges of big data operations.
Big data goes beyond our bodies: another example of data-intensive computing in action could soon change our homes.
The “smart grid” is another hot ap-
plication area for a wide range of big
data technologies across the entire
stack. The US Department of En-
ergy describes the smart grid as “a
class of technology people are us-
ing to bring utility electricity delivery
systems into the 21st century, using
computer-based remote control and
automation.” Of course, without ample processing, and without frameworks and applications designed from the ground up to handle the many diverse data feeds that power the smart grid, the project couldn’t be a success.
Making the smart grid work requires
real innovation on the data front,
according to Ogi Kavozovic, vice
president of strategy and product at
Opower in Arlington, Virginia. “Smart
grid deployment will create a three-
orders-of-magnitude jump in the
amount of energy-usage data that the
utility industry will have,” he says.
The energy exec went on to note
that when this data spike happens,
“Then there will be a several-orders-
of-magnitude jump above that from
energy-management devices that go
inside people’s homes, like Wi-Fi ther-
mostats that read data by the sec-
ond.” Opower is already working with
Honeywell, the global leader in ther-
mostats, to make such devices. “This
is all just emerging now, and the utility
industry is preparing for it,” Kavozovic
says. “The data presents an unprec-
edented opportunity to bring energy
management and efficiency into the
mainstream.” (See ‘Evolution in En-
ergy Efficiency.’)
For the smart grid and energy sector,
moving to big data technologies is a
priority because it will enable new, criti-
cal efficiencies that can create a new
foundation for energy consumption and
distribution. But of course, the frame-
works, both on the hardware and pro-
gramming side, need to be in place.
Data-Based Decisions
In the field of big data, the wide
range of applications mirrors the sorts
of computing situations that gener-
ate big data challenges. Some users
run into big data because of han-
dling many files. In other cases, the
challenge comes from working with
very large files. Big data issues also
emerge from extensive data sharing,
allowing multiple users to explore or
analyze the same data set.
“Time plays a role in big data,” says
Steve Paulhus, senior director, strate-
gic business development at Xyratex.
“Depending on how the data is used, it
may be valuable today or it may become
valuable tomorrow. For instance, with
financial information, the data can lose
Evolution in Energy Efficiency
Utility companies that want to use big energy data to enhance efficiency can turn
to Opower in Arlington, Virginia. “We partner with about 80 utilities around the world
to help their customers understand and control their energy usage through our tools,”
says Anand Babu, lead on all data and platform marketing efforts at Opower. With
these tools, customers can compare their energy usage to that of their neighbors
and learn to improve their energy efficiency. Likewise, this use of big energy data can
help utilities reduce peak energy demands, which can reduce the need to develop or
purchase more energy.
To suggest some of the volume of big energy data, Babu says, “In the United
States alone, consider what happens once every one of our 120 million households
has a smart meter and WiFi thermostat. That’s 120 million times 100,000 data points
per month.” That’s 12 trillion data points acquired every month. Keep in mind too
that most households today only get one monthly meter read. So big energy data
includes high volume and fast growth.
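A quick check of Babu’s arithmetic (a sketch using the figures as cited above):

```python
# Sanity check of the smart-meter arithmetic cited above:
# 120 million households, each generating ~100,000 data points per month.
households = 120_000_000
points_per_household_per_month = 100_000

points_per_month = households * points_per_household_per_month
print(f"{points_per_month:.2e} data points per month")  # 1.20e+13, i.e., 12 trillion
```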
As utilities start to roll out smart meters, most companies lack the infrastructure
to store the data. That’s one place where Opower comes in on big energy data. This
software-as-a-service company provides the infrastructure and the analytical exper-
tise. “We operate a managed service where utilities securely transfer data to us, and
we do the analytics,” Babu explains.
To provide this service, Opower uses several tools. Babu says, “We are the biggest
user of Hadoop in our industry, operating at the scale of nearly 50 million homes.” In
other cases, different tools work better. “For some things, we still use MySQL,” Babu
explains. “For example, to manage how we communicate with certain segments of
the customer base, that can be done more efficiently using a traditional database.”
“The exciting thing here is that customers see value on day one,” Babu says, “and
they can start to take action right away.”
In “Different Perspectives on Big Data: A Broad-Based, End-User View” from Intersect360 Research,
people polled rated a series of big data challenges on a scale of 1–5, ranging from ‘not a challenge’ to an
‘insurmountable challenge,’ respectively. As the results show, users run into the most trouble managing
many files and performing computations and analyses fast. (Image courtesy of Intersect360 Research.)
Challenging dimension of Big Data                          Business      Technical     All Data
                                                           Avg.  %4/5    Avg.  %4/5    Avg.  %4/5
Management of very large files                             2.7   25%     3.5   56%     3.3   46%
Management of very large number of files (many files)      3.5   58%     3.9   70%     3.7   66%
Providing concurrent access to data for multiple users     3.1   40%     3.5   52%     3.4   48%
Completing computation or analysis within a short
  timeframe (short lifespan of data)                       3.8   65%     3.7   61%     3.7   62%
Long-term access to data (long lifespan of data)           3.6   61%     3.5   56%     3.5   57%
Number of respondents                                      102           203           305

Source: Intersect360 Research, 2012
value if you don’t get it right now. In other
cases, like weather data, you might want
to look at it over a long period of time.”
(See ‘Crafting Storage Solutions.’)
Addison Snell, CEO of Intersect360
Research in Sunnyvale, California, de-
scribes another example where data
must be stored for a long time: “A pe-
diatric hospital, for example, keeps in-
formation for the life of a patient plus
some additional number of years.”
Depending on the particular ap-
plication and the kind of data that it
includes, end-users must select a
system that works for them. No sin-
gle solution handles every big data
problem. That said, most users of big
data face one consistent problem: in-
put/output scalability. “From a
technical perspective,” says
Snell, “this is the number one
big data problem.”
Moreover, I/O scalability im-
pacts both technical and en-
terprise computing. Technical
computing often involves a
user’s top-line mission, such
as car design for an automotive
manufacturer or drug discovery
for a pharmaceutical company.
Enterprise computing, on the
other hand, consists of the steps
that keep a company running,
such as handling the mechanics
of an e-commerce website.
In the past, technical comput-
ing focused almost entirely on
unsolved problems. As Snell ex-
plains, “You don’t need to design
the same bridge twice.” Since
this area of computing moves
to harder and harder problems,
the users quickly adopt new
technologies and algorithms. In
enterprise computing, by com-
parison, once users find solu-
tions, they stick with them. “In
enterprise computing, there are
people using approaches that
are 30 years old because they
still work,” says Snell.
In many ways, the big data
era changes this distinction. By
definition, big data scenarios
push an end-user to a point
where previous methods won’t
work. This applies to technical
and enterprise computing. “So,
for the first time, you have enter-
prise-class customers viewing
performance specifications of
infrastructure as the dominant
purchase criteria,” Snell says.
The purchase criterion used the
most is I/O performance.
In addition to I/O performance
scalability, users must ask other
questions about their big-data
Computing Family Connections
At Ancestry.com, data come in many forms. This family-history company stores data from
historical records, such as census data and military and immigration records. This historical
information also includes birth and death certificates. Overall, the historical data at Ancestry.
com consists of around 11 billion records. In addition, this web-based company stores user-
generated data.
“When users come to this site, their goal is to build a family tree,” explains Leonid Zhukov,
the company’s director of machine learning and development. In this process, a user creates
a profile of themselves, plus their parents, grandparents and so on. “We have more than 40
million family trees and 4 billion people represented here,” Zhukov says. Users can also upload
images, such as photographs or even handwritten stories, and Ancestry.com stores about 185
million of these records. On top of all of that, this site keeps track of user activity. “On aver-
age, Ancestry.com servers handle 40 million searches a day,” says Zhukov, “and we log user
searches and use it in machine learning.”
The kinds of data being stored at Ancestry.com also evolve over time. As an example of that,
the company can now collect a user’s DNA data. The company provides the DNA test and pro-
cesses the information. As the company explains: “AncestryDNA uses recent advances in DNA
technology to look into your past to the people and places that matter to you most. Your test
results will reach back hundreds—maybe even a thousand years—to tell you things that aren’t
always in historical records, but are recent enough to be important parts of your family story.”
This range of data—a variety of types from various sources—can only be stored and analyzed
with powerful tools. “There’s no black-box solution for our data challenges,” Zhukov explains. “We
use different pipelines for processing different types of data. Plus, we deal with big and versatile
data, so we need to do lots of hand work on it.” He also points out that moving so much data
creates challenges. “We must customize servers and how they are connected—where to collect
data and where to process it,” Zhukov says. “The logistics of the data are quite important.”
Some of the most interesting data analysis at Ancestry.com involves machine learning. This
works on two types of objects: a person or a record, which is any document with information
about that person. Then, users can find information about ancestors by searching for names,
places of birth and so on. These focused searches provide precise results. On the other hand,
users can also turn to exploratory analysis. “In this case, a user is looking for suggestions,”
says Zhukov. “It’s similar to when you log on Amazon, and they show you the books that you
might be interested in.” So instead of a search, the exploratory analysis provides suggestions
of records or family-tree information based on what the user has done in the past, and that
depends on machine learning.
The system at Ancestry.com creates the suggestions for exploratory analysis offline. “We
train the system to make suggestions based on the feedback of previous suggestions that were
accepted or rejected,” Zhukov explains. “The system looks for similarity. It asks: How similar is
a person to a particular record?” The software uses a range of features to make these compari-
sons. It can start with a first and last name, but maybe a document includes a misspelling, and
then the system relies on a phonetic comparison. Also, a record could have been handwritten,
scanned and turned into data with optical character recognition. “You can imagine how many
mistakes there are in hand-written historical records,” Zhukov says, “so we do fuzzy matching
instead of exact matches.” The features of a record serve as the input and the training data for
machine learning come from the users rejecting or accepting the suggestions.
For many parts of this processing, Ancestry.com uses Hadoop. As Zhukov says, “Hadoop
is really great technology, but it’s not at the point where you can open the box and things
miraculously work.” Consequently, this company relies on Hadoop experts and also experts
who understand how to search huge databases for similarities between disparate sources of
information. As Zhukov adds, “Hadoop is not only for data storage, but it also provides a data-
processing pipeline. So you need to feed it with the right things.”
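To make the fuzzy-matching idea concrete, the sketch below blends edit-distance similarity with a simplified phonetic code, using only the Python standard library. It is illustrative only; the weights and the simplified Soundex are assumptions, not Ancestry.com’s actual pipeline.

```python
# A minimal sketch of fuzzy name matching of the sort described above:
# blend edit-distance similarity with a simplified Soundex phonetic code.
# Illustrative only; the 0.7/0.3 weights are arbitrary assumptions.
from difflib import SequenceMatcher

def soundex(name: str) -> str:
    """Simplified Soundex: similar-sounding names get the same 4-char code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded, last = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        last = code
    return (encoded + "000")[:4]

def name_similarity(a: str, b: str) -> float:
    """Score two names: string similarity plus a phonetic-match bonus."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic = 1.0 if soundex(a) == soundex(b) else 0.0
    return 0.7 * ratio + 0.3 * phonetic

print(name_similarity("Smith", "Smyth"))  # high: a misspelling, same sound
print(name_similarity("Smith", "Jones"))  # low: different name entirely
```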
infrastructure needs. For example, Ken
Claffey, senior vice president and gen-
eral manager, ClusterStor at Xyratex,
asks: “Once you have all of your data
in a common environment you need to
think about the reliability, availability and
serviceability of that infrastructure as you
want that data to be consistently avail-
able.” He adds, “You need to think about
the constant with big data—multi-faceted growth in capacity, data sources, data
types, number of clients, applications,
workloads and the new ways to use the
data. It is key to ask yourself how are you
going to efficiently manage such multi-faceted exponential growth over the next
number of years against the constraints
of available data scientists and data cen-
ter space, power, cooling and budget.
Your big data infrastructure needs to be
scalable to meet performance and ca-
pacity demands, flexible to respond to
change and extremely efficient.” Delving even deeper, Norris noted that when looking at new architecture and what’s possible, the real focus is on what types of applications will be used: “What data sources are we going to exploit? What is the desired outcome?” In short, a
user must answer questions like these
to know where to start. After that, a user
decides how to integrate new infrastruc-
ture into an existing computing environ-
ment. “How does the new computing fit
my current security?” Norris asks. “How
do I make it enterprise grade?” As Nor-
ris points out, a user might not start ask-
ing these questions until working on the
second or third big data application and
sharing it across a group of users. “You
can generalize about the best platform,
but you have to have it working well
with the existing environment and ap-
plications,” Norris says. “So you’ll end
up with a little bit of a hybrid solution.
You’ll find yourself asking: How do I take
advantage of this new technology, but
on a platform that delivers what I need
in terms of reliability, ease of integration
and so forth?”
Among the frameworks emerging in the key sectors we’ve discussed are Hadoop and MapReduce, which, again, were born out of the largest-scale data operations the world has seen to date—the search engines and web companies.
Mapping the MapReduce
Paradigm
For a wide range of industries, MapRe-
duce makes up one of the most com-
mon big data application frameworks.
This often gets implemented as Hadoop
MapReduce, which makes it easier for a
user to develop applications that handle
large data sets—even in multiple tera-
bytes—on computer clusters.
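To ground the paradigm, here is a minimal, self-contained sketch of the MapReduce pattern, the classic word count, in plain Python. On a real Hadoop cluster the map and reduce steps would run as distributed tasks across many nodes (for example, via Hadoop Streaming); running them locally here simply illustrates the programming model.

```python
# A minimal, self-contained sketch of the MapReduce pattern: word count.
# On a real cluster, map and reduce would run as distributed tasks;
# running them locally here only illustrates the programming model.
from collections import defaultdict

def map_phase(document: str):
    """Emit (key, value) pairs: one ('word', 1) per occurrence."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Combine all values for one key into a final result."""
    return key, sum(values)

documents = ["big data is big", "hadoop handles big data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```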
The complete range of big data ap-
plications could always remain in flux.
As Boyd Wilson, executive at Omnibo-
nd in Pendleton, South Carolina, says,
“As you iterate on various methods of
teasing out the information from the
data, that learning can lead to new ar-
eas of big data.” As an example, Wilson
mentions cross-disciplinary analytics.
Overall, he says, “The big data
sweet spot is lots of—potentially di-
verse—data that you want to com-
bine to learn new correlations, and
you need high-performance medium,
something like OrangeFS, to aggre-
gate and analyze in large volume and
with high throughput.” (See ‘A Big-Data File System for the Masses.’)
As this schematic indicates, a wide range of information plays a part in big energy data. Providing benefits to both utilities and customers requires ad-
vanced storage and analytics. (Image courtesy of Opower.)
That “sweet spot” covers a lot of
ground: consumer-behavior analysis,
dynamic pricing, inventory and logis-
tics, scientific analysis, searching and
pattern matching, and more. More-
over, these applications of big data
arise in many vertical markets, includ-
ing financial services, retail, transpor-
tation, research and development, in-
telligence and the web.
In some cases, the use of big data
will come from some unexpected plac-
es. For instance, Leonid Zhukov, direc-
tor of machine learning and develop-
ment at Ancestry.com, says, “We’re a
technology company, and we do fam-
ily history, but we are a big data com-
pany.” He adds, “We use data science
to drive family-history advances. So
we’re hiring data people.” (See ‘Com-
puting Family Connections.’)
As MapReduce and Hadoop con-
tinue to mature, these use cases will
continue to extend further into main-
stream enterprise contexts, but for
now, there are some bigger problems
to address.
Hadoop’s High Points
Although Hadoop only made up
slightly more than 10 percent of the
applications mentioned in Inter-
sect360 Research’s poll (see below
for details), many experts endorse
this approach. When asked about the
benefits of applying a Hadoop solu-
tion, Norris of MapR says, “The way
to summarize the benefits of Hadoop
is: It’s about flexibility; it’s about pow-
er; it’s about new insights.” (See ‘M’s from MapR.’)
In terms of flexibility, according to
Norris, Hadoop does not require that
you understand ahead of time what
you’re going to ask. “That seems sim-
ple,” he says, “but it’s a really dramatic
difference in terms of how you set up
and establish an application.” He adds,
“Setting up a typical data warehouse is
initially a large process, because you
are asking how you will use the data,
and that requires a whole set of pro-
cesses.” With Hadoop, on the other
hand, a user can just start putting in
data, according to Norris. “It’s really
flexible in the analysis you can perform
and the data you can perform it on,
and that really surprises people.”
Norris points out several fea-
tures of the power behind Hadoop.
“You can use different data sets, and
the volume that you can use is quite
broad.” He adds, “This eliminates the
sampling. For example, you can store
a whole year of transactions instead of
just the last 10 days. Instead of sam-
pling, you just analyze everything.”
By using such large volumes of data
in analytical procedures, users can
apply new kinds of analysis and also
more easily identify outliers. Because
of that ability to work with all of the
data, says Norris, “Relatively simple
approaches can produce some dra-
matic results.” For instance, just the
added ability to explore outliers could
reveal some pleasant surprises, may-
be some tactic that generated an un-
expectedly high amount of revenue.
On the other hand, analyzing outli-
ers can reveal failures. As an exam-
ple, Norris says, “You might find an
anomaly like an expensive part of the
supply chain that you want to elimi-
nate. Overall, analyzing these outliers
makes dramatic returns possible.”
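As a toy illustration of the full-dataset outlier scan Norris describes, the sketch below scores every record rather than a sample and flags the extremes. The revenue figures and the 1.5-standard-deviation threshold are invented for the example.

```python
# A toy full-dataset outlier scan: score every record, no sampling,
# and flag anything more than 1.5 standard deviations from the mean.
# All figures below are made-up illustration data.
from statistics import mean, stdev

revenue_by_tactic = {
    "email": 1200, "search": 1350, "display": 1100,
    "referral": 1280, "partner_promo": 9800,  # the surprise outlier
}

values = list(revenue_by_tactic.values())
mu, sigma = mean(values), stdev(values)

outliers = {tactic: rev for tactic, rev in revenue_by_tactic.items()
            if abs(rev - mu) > 1.5 * sigma}
print(outliers)  # {'partner_promo': 9800}
```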
In terms of new insights, says Norris,
“With Hadoop, you can recognize pat-
terns much more easily and drive new
applications.” He also adds that Hadoop
M’s from MapR
MapR in San Jose, California, developed a series of Apache Hadoop solutions,
including M3, M5 and, most recently, M7. The M7 version adds capabilities for HBase, the Hadoop database. “This provides real-time database operations side by side
with analytics,” says Jack Norris, the company’s vice president of marketing. “This
provides lots of flexibility for the developers, because you can run applications from
deep analytics to real-time database operations.”
All of the MapR tools make it possible to treat a Hadoop cluster like standard stor-
age. “Consequently, you can use industry-standard protocols and existing tools to
access and manipulate data in the Hadoop clusters,” Norris explains. “This makes
it dramatically simpler and easier to support and use.” This tool can also be used as
an enterprise-grade solution.
For all of the MapR editions, the developers aimed at making Hadoop easy, fast
and reliable to use. For example, these tools provide full data protection. Norris adds,
“In terms of hardware systems, the beauty of MapR is that it supports standard,
off-the-shelf hardware. So you don’t need a specialized server.” He continues, “You get
better hardware utilization with MapR, because, for example, you can use a greater
density of disk drives and faster drives.”
To get an idea of how MapR can be applied, consider these case studies with
M5. “One of the largest credit-card issuers is using M5 to create new products and
services that better detect fraud,” says Norris. “This includes some broad uses of
analytics that benefit the cardholders.” Some large online retailers also use M5 to
better analyze shopping behavior and then use that to recommend items or simplify
the user’s experience. Other websites also use M5 in digital advertising and digital-
media platforms. For example, says Norris, “Ancestry.com is using it as part of their
service.” Norris also points out: “Manufacturing customers use M5 for basically ana-
lyzing failures. They collect sensor data on a global basis and detect precursors to
problems and schedule maintenance to avoid or prevent downtime.” In summary,
Norris calls the MapR users “really, really broad.”
A user just starting with big data might start with Apache Hadoop. “When you
really put this into production use,” Norris explains, “having enterprise grade with
advanced features is a real benefit.” Some of the advanced features of MapR tools
include a reduced need for service. For instance, the MapR platforms are self-heal-
ing.
To ensure that MapR platforms include the features that users really want and
need, the company works with an extensive network of partners. “In some cases,
these partners provide vertical expertise within a market or even do consulting work
to identify likely use cases to help us select the best implementations of MapR,”
Norris says. In addition, this partner network also helps develop some of the training
available from the MapR Academy, which provides free training. “We want to ensure
that users can easily share information and applications on a cluster,” Norris says.
“That is basically a unique value proposition of MapR.”
can be an economical solution. “A high-
end data warehouse can cost $1 million
per terabyte, whereas Hadoop is only
hundreds of dollars per terabyte.”
At Ancestry.com, Zhukov and his
colleagues use Hadoop to solve
many problems. “For different data
types and different business units, we
use Hadoop in different ways,” Zhu-
kov explains. For example, the team
that provides optical character rec-
ognition for documents scanned into
Ancestry.com uses Hadoop for that.
Also, when this company acquires
newspaper data that are structured
and in a text format, facts—such as
dates, names and places—must be
extracted, and Hadoop provides the
technology for the parallel processing
that achieves this. Zhukov adds, “In
the DNA pipeline, where we compare
customer DNAs and do matching for
relatives, we look at the percentage
of overlap and ethnicity. That’s a big
data job.” Hadoop also gets used at
this company for the machine learn-
ing that generates the possible family
connections for users. As Zhukov con-
cludes: “We do analytics in more of a
traditional setting, but we are turning
to Hadoop for all business units in the
company.”
Emerging Options
Hadoop does not solve every big
data problem or make the best fit for
every user. In some cases, the actual
implementation of Hadoop creates
the biggest hurdle. As Zhukov says,
“If you compare Hadoop to tradition-
al database systems, it’s in the early
stages of its technology. So there’s
lots of tweaking and twisting for it to
run smoothly.”
In fact, the Hadoop challenges vary
by user. “The biggest challenge really
depends on the use case the customer
has,” says Hu of IBM. For example, he
says, “If they are just analyzing data
in some standalone system and they
have Hadoop clusters that operate in
isolation, then the challenge is just to
keep those clusters up and running.”
Hu also points out some general
concerns with Hadoop. He says, “It
has challenges in single points of
failure, and massive amounts of sys-
tem must be stitched together, which
makes system provisioning and man-
agement a challenge.” Beyond that,
he lists several other issues, including
the latency of execution in an applica-
tion and the inability to execute work-
loads outside the Hadoop paradigm.
To explore the general concerns even
more, Hu directs readers to “Top 5
Challenges for Hadoop MapReduce
in the Enterprise.”
Beyond concerns of implementa-
tion and execution, Zhukov points
out that it’s not easy to hire Hadoop
experts. As he says, “Here in Silicon
Valley, it’s proved to be quite difficult
to hire people with Hadoop skills. So
we’re trying to grow and train our own
people to work with Hadoop.”
Sometimes, users assume that Ha-
doop is the only big data option. As
Snell says, “In many cases, we find
that Hadoop is a technology in search
of solutions.” To explain this, he says,
“Imagine that I’ve designed a new
screwdriver, and I’m trying to find how
many things I can do with it. In prac-
tice, you might be able to drive in a
Phillips screw with a flathead screw-
driver, but the flathead screwdriver
might not have been the best tool.”
Likewise, Hadoop is not the best tool
for every big data situation.
Other options keep emerging for us-
ers. In some cases, companies even
develop new approaches to creating
the new tools. For example, Omni-
bond makes all of its advanced tools
available to the open-source commu-
nity. As Ray Keys, the company’s mar-
keting manager, explains, “We have a
very unique business model. It’s pure
open source, and pushing everything
out to the open-source community
is really powerful.” He adds, “There
is goodwill involved, but it’s goodwill
from people we do development for,
and they all realize that it’s goodwill
for the entire community. As a result,
the community contributes, as well.”
Beyond just goodwill, this company’s
OrangeFS provides a file system that
can handle millions and millions of files
in a single directory without reducing
the performance of the solution.
End-User Applications
To get a quantitative grasp of how
users apply big data, Intersect360 Re-
search conducted a survey and pro-
duced this report: “Different Perspec-
tives on Big Data: A Broad-Based,
End-User View.”
This study arose from surveying—in
partnership with the Gabriel Consult-
ing Group in Beaverton, Oregon—
more than 300 users of big data
applications, which included both
high-performance computing (HPC)
and non-HPC. The survey sampled a
wide range of users, including com-
mercial, government and academic
(or not-for-profit) users. For example,
the commercial users ranged from
consulting and computing to energy
and engineering, as well as manu-
facturing and medical care. The gov-
ernment users worked in national re-
search laboratories, defense, security
and other areas. The users also came
from a range of group sizes, cover-
ing a range from a few to more than
100,000 employees. For instance,
nearly 20 percent of the groups had
only 1–49 employees, and almost 10
percent had 20,000–99,999.
Such a diverse sample should pro-
vide a broad overview of big data ap-
plications, and the survey results con-
firm that. Among those polled, the big
data applications included enterprise,
research and real-time analytics, plus
MapReduce, complex-event process-
ing, data mining and visualization. For
example, more than two-thirds of those
We have a very
unique business
model. It’s pure
open source, and
pushing everything
out to the open-
source community
is really powerful.
“
”
polled said that they work with big data
applications in research analytics, and
more than half apply data mining to
big data situations. A small portion of
those surveyed—6 percent—reported
no current applications, although they
were planning future ones.
Those surveyed also rated a series of
big data challenges on a scale of 1–5,
ranging from ‘not a challenge’ to an ‘in-
surmountable challenge,’ respectively.
Overall, the users ranked the most signif-
icant challenges as managing many files
and performing computations and anal-
yses fast, where the average scores for
both were 3.7 on the 1–5 scale. Among
the technical users, 70 percent of them
ranked managing many files as a 4 or 5
on the scale—making it an insurmountable, or nearly insurmountable, challenge for 7 out of 10 of those polled. Among
business users, 65 percent of them
rated performing computations and
analyses fast as a 4 or 5 on the scale of
big data challenges. In general, though,
most users rated most of the challenges
as significant. For example, users also
reported serious challenges—average
scores above 3—for managing large
files, giving multiple users simultaneous
access to data and providing long-term
access to data.
This survey also explored where
users get their software for big data
applications. The results show some
unexpected, or at least inefficient, re-
sults: Half of those polled developed
the applications internally; one-quar-
ter of them purchased the applica-
tions; and about another quarter used
open-source applications. “For those
who developed the applications inter-
nally,” says Snell, “those are all one-
offs by definition.” With challenges as
complicated as big data, developing
a unique solution to every problem
does not make for efficient comput-
ing. Moreover, the wide use of one-
offs limits these applications to groups
that maintain the personnel to develop
these advanced applications.
This survey also examined the spe-
cific tools the users apply to big data
scenarios. Out of the 300-plus re-
sponses to this survey, 225 of them
provided details on their applica-
tions, and Hadoop came out on top.
Nonetheless, only 11 percent of the
users reported that they make use of
Hadoop. Even so, says Snell, “If
people have a big data problem, they
often think they must use Hadoop.”
Future Fits
By definition, big data problems
outstrip the capabilities of a user’s
current system. So some of the tools
typically used only in HPC will work
their way into non-HPC environments.
A key example of that could arise from
the use of parallel file systems.
The Intersect360 Research survey
asked users about the file systems
that they employ. The replies included
a diverse set of approaches. Across all
types of users—academic, commer-
cial and government—most selected
the Network File System (NFS), which
was developed by Sun Microsystems
in 1984. Overall, nearly 70 percent of
the users employ NFS. Not many of
the users implement parallel file systems at this point. For example, less than 20 percent of the people polled
used the Hadoop Distributed File
System (HDFS), IBM’s General Paral-
lel File System (GPFS) or Lustre.
As Snell says, “Parallel file systems
are still poorly penetrated into com-
mercial markets.” He adds, “Com-
mercial users will need to find solu-
tions for parallel I/O.”
Beyond learning to build the infra-
structure for big data, much remains
to be learned about where to use it.
Gavin Nichols, vice president of strat-
egy and R&D IT at Quintiles in Re-
search Triangle Park, North Carolina,
provides a recent example. During the
2012 influenza season, big data ex-
perts in Scotland sequenced the gen-
otype of every person who contracted
the flu over a two-week period. From
that data, the analysts extracted infor-
mation that revealed who was most
susceptible to the illness, and who
might require hospitalization. Scot-
land’s government then used those
data to adjust its healthcare policy.
“They are using big data in how
they deal with influenza,” Nichols ex-
plains. “This is a good example that we
haven’t even seen the starting point for
what we can do with this technology.”
He adds, “If big data solutions grow in
the right way, then there’s the opportu-
nity to make meaningful changes in the
way drugs are developed, put through
to market and utilized.”
The variety of infrastructures be-
ing employed and the questions being
considered reflect the diversity of ap-
proaches that today’s users take to big
data challenges. This reveals that many
technologies in computing can be modi-
fied to handle big data problems in many
enterprises. Moreover, many technolo-
gies besides Hadoop can be explored.
For any particular user, the best technol-
ogy depends on the big data problem at
hand. Even for the same user, different
questions and applications will often
require different approaches to mak-
ing use of big data. As Snell concludes,
“Hadoop is not all things to all people.
Plus, there is not one perfect answer for
all big data problems.”
About the Author:
Mike May, Ph.D., is a science and technology writer, editor and book author. Mike has published hundreds of articles in prestigious publications during his 20-year professional writing career, and is currently editorial direc-
tor for Scientific American Worldview.
He holds a Ph.D. from Cornell Univer-
sity in neurobiology and behavior.
When a customer runs into a big-data problem—whether struggling with data velocity, volume or variety—IBM offers a solution. Such
solutions, however, must supply flex-
ible options, because of the relativity of
the term big in big data. “The ‘big’ in big
data is like the ‘high’ in high performance
computing,” says Yonggang Hu, chief
architect for Platform Computing, IBM
Systems and Technology Group. “It’s dif-
ferent to different people.”
To provide flexibility, IBM orchestrates
a very broad big-data initiative. It
ranges from complete systems—like
the xSeries and pSeries architectures
that handle big-data workflows—to
various aspects of computing solu-
tions. For example, IBM offers vari-
ous forms of storage technology that
benefit big-data situations. The com-
pany also creates networking com-
ponents and interconnects needed
for extremely high-throughput sys-
tems. The data in any system does not
live in isolation. Big data come from a
location and the analysis goes to a des-
tination. So IBM solutions can provide
connectivity between big-data systems
and the databases and data warehous-
es in the enterprise.
The Power of Parallel Files
To provide an overview of big-data
challenges, Hu divides the computer
processing of big-data problems into dif-
ferent patterns. The first data workload
might require only a monthly assessment.
This does not require any immediate anal-
ysis. Instead, data can be collected and
analyzed over time, and the results get
used for decision-making as needed. At
the other end of the spectrum, some ap-
plications—such as a traffic-monitoring
system or software that maintains power
across a utility grid—require real-time de-
cisions. This requires a system with interactive task latencies on the order of milliseconds. “The current Hadoop system
doesn’t have this kind of near real-time
response capability,” Hu says.
Working effectively with this variety of
big-data applications requires a power-
ful file system. IBM’s new General Paral-
lel File System File Placement Optimizer
(GPFS FPO), developed specifically for
big data, allows for a model very similar to the Hadoop Distributed File System (HDFS).
This file system provides a range of
necessary features. For example, it in-
cludes enterprise capability. GPFS FPO
is also extremely scalable, so it can
grow with a user’s data needs. Fur-
thermore, this file system includes no
single points of failure, which provides
the robust computing needed to create
sophisticated data solutions. Moreover,
its POSIX support allows applications to
A to Z for Big-Data
Challenges
IBM provides every tool—from applications to complete
systems—for cost-effective solutions
IBM Platform Symphony brings mature multi-tenancy and low-latency scheduling to Hadoop. Symphony
enables multiple MapReduce workloads to run on the same cluster at the same time, dynamically sharing
resources based on priorities that can change at run-time.
IBM
A globally integrated technology and
consulting company
Founded: 1911 (as Computing
Tabulating Recording Company)
Number of Employees: 433,362
Headquarters: Armonk, NY
Top Executive: Virginia M. Rometty
(chairman, president and CEO)
get the benefits of a UNIX-com-
patible file system.
Orchestrating a Solution
In many cases, big-data applica-
tions require advanced analytics.
For those situations, IBM created Platform Symphony. “This
software implements a Hadoop
MapReduce that is very fast,”
says Hu. Moreover, this software
provides very low latency. For in-
stance, the overhead provisioning
for an executed task takes less
than 1 millisecond. “Open-source
tool sets take an order of magni-
tude longer,” Hu says. On average,
performance tests reveal that Plat-
form Symphony is 7.3 times faster
than open-source Hadoop for jobs
comprised of a representative mix
of social media workloads.[1]
For someone who has been run-
ning Hadoop MapReduce on a
cluster, one challenge involves
efficient utilization. In many cas-
es, the user lacks sufficient jobs
to keep the cluster occupied. In
fact, utilization often runs as low as
20 percent. Platform Symphony helps
users extract more from a cluster by
getting workloads there faster and not
keeping them waiting to run. In ad-
dition, users can run heterogeneous
workloads, such as Hadoop MapRe-
duce mixed with C++ jobs. Moreover,
a business can divide a cluster among
multiple units with Platform Sympho-
ny, giving each a slice of the structure.
Then, the units can operate virtually
independently of each other.
In a recent return-on-investment
study, IBM explored the cost of run-
ning a big-data system, covering ev-
erything from the cost of the hardware
and administration to cooling the data
center. By using Platform Symphony,
instead of a standard Hadoop system,
the results showed a potential savings
of almost $1 million depending on the
customer’s use case.
For Real-Time Needs
IBM’s InfoSphere offers further options
in big-data analytics. This approach han-
dles big-data applications that need to
run in real or almost real time. It includes
ready-made components for people
who already use Hadoop analytics.
Furthermore, InfoSphere’s accelera-
tors and frameworks speed up the de-
velopment of big-data analytics. One
ready-made tool, for example, is BigSheets. This provides a spreadsheet-
like interface to the Hadoop stack. It
extracts and manipulates data from
the spreadsheet interface.
Infosphere also lets users com-
bine BigSheets with other software.
It supports all of the components of
the open-source Hadoop ecosystem.
So a user could select open-source
Hadoop layers in any of the stacks.
Hu adds, “Customers can choose the
layers where they want enterprise-ca-
pable technology.”
Building Relationships
In tailoring big-data solutions to
customers, IBM supplies a wide
range of customer services. This
includes assessing a customer’s
existing infrastructure and pro-
viding business-level needs. This
combines IBM’s technical- and
business-service arms.
Getting the right solution often
requires both levels of analysis.
For example, an efficient big-
data solution requires the right
computing infrastructure and
software systems, as well as
the business knowledge to de-
termine what reports should be
generated from the analysis.
The very basis of big data often
implies an analytical challenge. For
example, Vestas—an energy sup-
plier in Denmark—turned to IBM
to determine the best locations
for electricity-generating wind tur-
bines. The resulting analysis includ-
ed weather information, such as
wind and climate, as well as the lo-
cal geography. This company used
IBM technology to put together a
system that could produce accurate
answers in a timely fashion. The results
generated the best locations for the
wind turbines and how to arrange them
to harness the most energy possible.
As big-data problems grow and evolve, IBM’s solutions keep adapting and advancing. It takes the teamwork of hardware and software infrastructure plus powerful analytics to perform the needed computation. In addition, technical and business solutions must work together to generate the most useful result. From the xSeries and pSeries systems to powerful interconnects and advanced analytical algorithms, IBM supplies big-data tools and solutions that extend from high-performance infrastructure to sophisticated business intelligence and planning. Only an approach this well orchestrated can turn big-data opportunities into business growth and success.
Reference
1. STAC Report: A Comparison of IBM Platform Symphony and Apache Hadoop MapReduce Using Berkeley SWIM, 2012.
IBM provides a variety of System x and Power-based solutions optimized to run big-data tools, like IBM InfoSphere BigInsights, IBM Platform Symphony and IBM GPFS.
Sponsor Profile
A Big-Data File System for the Masses
The computing community and Omnibond developed open-source infrastructure software for storage clusters

To break down big-data problems, many data scientists are looking to the Hadoop MapReduce model. Currently those systems run on dedicated Hadoop clusters. Many companies and researchers have existing clusters that run all sorts of code but cannot yet leverage Hadoop in their shared environments. Dedicated clusters are required because Hadoop MapReduce is closely tied to the Hadoop Distributed File System (HDFS). To solve that problem, Omnibond has extended the open-source Orange File System (OrangeFS) to natively support Hadoop MapReduce without the HDFS requirement.
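One concrete way to picture a MapReduce job that is not welded to any particular storage layer is Hadoop Streaming, a standard Hadoop facility that accepts any executable as mapper and reducer. The word count below is a minimal sketch; the ofs:// URI in the comments is an illustrative assumption, since the exact OrangeFS connector configuration is not spelled out here.

#!/usr/bin/env python
# Minimal word-count mapper and reducer for Hadoop Streaming.
# A hypothetical invocation over a non-HDFS file system might look like:
#   hadoop jar hadoop-streaming.jar \
#       -D fs.defaultFS=ofs://namehost/        # assumed OrangeFS URI scheme
#       -input /logs -output /counts \
#       -mapper 'wordcount.py map' -reducer 'wordcount.py reduce'
import sys

def map_phase():
    # Emit one "word<TAB>1" record per word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Hadoop sorts mapper output by key, so identical words arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()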
According to Boyd Wilson, Omnibond executive, “Our main business model for OrangeFS is all open-source, so you can download and start running today.” In addition, Omnibond provides support and development services. For example, if a customer needs a specific feature added to OrangeFS, Omnibond will price out the cost and create it; the customer must agree that the resulting code goes back into open source. Furthermore, OrangeFS works with a wide range of high-performance and enterprise computing applications.
The need to make parallel file systems easier to use, including leveraging MapReduce on existing clusters, ignited the original work on OrangeFS. To bring such systems to more users, the parallel file system development work needs to continue. And OrangeFS is not limited to traditional data science alone: its broadening list of use cases is expanding from research to finance, entertainment and analytics.
Enhancing the open-source business model, Omnibond provides everything that a customer would expect from a company that develops closed-source code. “If you have a problem, just call. If you need a feature, we can add it, and in addition, you get the peace of mind in having the source code, too,” Wilson says. Customers can purchase this service and directed development on an annual subscription basis, giving them the assurance of stable development and support. This model is a win for the support customer, for Omnibond and for the entire open-source community, assuring the ongoing development of OrangeFS.
Inside OrangeFS
In general, OrangeFS is the next-generation development of the Parallel Virtual File System, version 2 (PVFS2). From the start, PVFS2 delivered desirable features, including a modular design and distributed metadata. It also supports MPI-IO, the parallel I/O layer of the Message Passing Interface (MPI). Other features include PVFS2’s high scalability, which makes it a great choice for large data sets or files. Beyond working with such large files, PVFS2 is fast: accessed through the MPI-IO interface, for example, it dramatically outruns a network file system (NFS).
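To show what the MPI-IO interface looks like from application code, here is a minimal sketch using the mpi4py bindings. The mount point is hypothetical; in a typical deployment, the MPI library’s ROMIO layer routes these calls to the parallel file system.

# Each MPI rank writes its own block of a shared file in parallel.
# Run with, e.g.: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One megabyte of data per rank (matching the small-file sizes
# Omnibond tests against).
block = np.full(1024 * 1024, rank, dtype=np.uint8)

fh = MPI.File.Open(comm, "/mnt/orangefs/demo.dat",   # hypothetical mount point
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Collective write: all ranks participate, each at its own offset,
# letting the MPI-IO layer coalesce requests for the file system.
fh.Write_at_all(rank * block.nbytes, block)
fh.Close()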
Omnibond Systems develops open-source tools that can be used in a wide range of big-data applications. (Image courtesy of Omnibond Systems.)
Omnibond Systems, LLC
A software engineering and support company focused on infrastructure software.
Founded: 1999
Employees: 45
Headquarters: Clemson, SC
Top Executive: Boyd Wilson
Omnibond started with those features and generated OrangeFS by adding even more. The file system now benefits from a focused development team, and it includes improvements in working with smaller files; Omnibond does all of its testing on 1-megabyte files. The 2.8.6 version, released on June 18, 2012, provided users with advanced client libraries, Linux kernel support and Direct Interface libraries for POSIX and low-level system calls. Additionally, users can get support for WebDAV and S3 via the new OrangeFS Web Pack.
Future releases promise even more features, such as distributed directories, which will speed access to large directories, and capability-based security, which will make the entire file system more secure.
Added Incentives
For many big-data users, the list of OrangeFS features already mentioned sounds enticing enough, especially since it combines the best of the open- and closed-source worlds. Nonetheless, even more incentives exist.
The object-based design of OrangeFS appeals to many users. In addition, this file system separates data and metadata, and it provides optimized MPI-IO requests. Furthermore, the software can be installed on a range of architectures: users can run OrangeFS over multiple network architectures and on many commodity hardware platforms. To protect files against server failures, OrangeFS will also provide cross-server redundancy in the near future.
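The payoff of separating data from metadata can be pictured with a small striping calculation: a file’s bytes are spread round-robin across data servers in fixed-size stripe units, while a metadata service records the layout. The stripe size and server count below are illustrative assumptions, not OrangeFS defaults.

# Map a byte offset in a striped file to (server, offset on that server).
STRIPE_SIZE = 64 * 1024   # assumed stripe unit
N_SERVERS = 4             # assumed number of data servers

def locate(offset):
    stripe_index = offset // STRIPE_SIZE
    server = stripe_index % N_SERVERS            # round-robin placement
    local_stripe = stripe_index // N_SERVERS     # stripes already on that server
    return server, local_stripe * STRIPE_SIZE + offset % STRIPE_SIZE

# The first four stripes land on four different servers,
# so a large read proceeds from all servers in parallel.
for off in range(0, 4 * STRIPE_SIZE, STRIPE_SIZE):
    print(off, locate(off))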
For an open-source project, thorough documentation can sound like a rarity, but Omnibond goes even further: the recent OrangeFS Installation Instructions supply complete documentation, all available online.
Many applications of big data require aggregation. For example, a system might gather real-time data and store it for later access; the stored data can then be served to the cluster’s Hadoop nodes, queried and analyzed with MapReduce.
Likewise, a big-data system might require other forms of data aggregation. For instance, real-time data could be collected, combined with private data and then processed. Alternatively, a user might take ‘snapshots’ while processing real-time data to collect a subset of the data, and then analyze it. OrangeFS can provide storage for all of these big-data scenarios.
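A compact sketch of that snapshot-then-analyze pattern, with an invented record format: records stream in, land in a store, and a subset is later counted with a map/reduce-style pass. In practice the store would be a file system such as OrangeFS and the analysis a Hadoop job.

from collections import Counter

# Simulated real-time feed; each record is (user, action). Invented data.
stream = [("ana", "click"), ("bo", "view"), ("ana", "view"),
          ("cy", "click"), ("bo", "click"), ("ana", "click")]

store = []                       # stands in for files on a parallel file system
for record in stream:
    store.append(record)         # 1) aggregate incoming data into storage

snapshot = [r for r in store if r[1] == "click"]   # 2) snapshot a subset

# 3) analyze later: a map/reduce-style count of clicks per user.
mapped = ((user, 1) for user, _ in snapshot)
reduced = Counter()
for user, n in mapped:
    reduced[user] += n
print(reduced)                   # Counter({'ana': 2, 'cy': 1, 'bo': 1})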
Advanced Applications
The open-source community stays at the cutting edge of computation and data analysis. Consequently, experts at Omnibond track the trends in high-performance and big-data environments, and the information they glean continues to shape future versions of OrangeFS.
As an example, ongoing work in data analysis keeps spawning previously unimagined approaches, and new ways of even thinking about data keep emerging. Social-media analysis and other large data streams, for instance, can trigger entirely new ways of analyzing information, such as log, customer and other real and near-real-time data. In other cases, new perspectives on existing data will generate novel approaches to data integration that were never considered before.
As Wilson explains, “The new concept of data analysis covers a wide range. In some cases, it involves huge amounts of data, and other times not so much.” He adds, “You can bring together different data points that you wouldn’t think would correlate, but sometimes they do.”
This kind of thinking, seeing today’s challenges and turning them into tomorrow’s opportunities, lies at the heart of Omnibond’s philosophy. “We are adding exciting new features on our development path to OrangeFS 3.0,” says Wilson. “These should be real game-changing enhancements.”
As a company, Omnibond focuses on building the OrangeFS that customers want, and then sharing those advances with the wider community.
Omnibond’s open-source Orange File System (OrangeFS) supports Hadoop MapReduce and provides a simplified parallel file system. (Image courtesy of Omnibond Systems.)
Sponsor Profile
Crafting Storage Solutions
Xyratex builds scalable capacity and performance into big data systems

Big data varies from one customer to the next. As Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, said, “What’s big data to one guy is tiny to another.” No matter how you define big data, achieving analytic results rapidly requires one thing: fast, reliable data storage. And with more than 25 years of experience as a leading provider of data-storage technology, this is exactly where Xyratex leads the market.
Xyratex’s innovative technology solutions help make data storage faster, more reliable and more cost-effective. As an organization, Xyratex provides a range of products, from disk-drive heads and disk-media test equipment to enterprise storage platforms and scale-out storage for high-performance computing (HPC). For big data applications, where users need storage that scales in both capacity and performance, Xyratex solutions support everything from high-performance processing of relatively small data quantities to some of the largest supercomputers in the world.
“Performance is often overlooked,” Claffey said, “but it’s the key driver. Once you have your data together, you need performance for analytics to transform that data into information you can use to make more-informed decisions.”
Combining Features
Xyratex achieves its big data storage success by providing solutions, not just parts. The company has developed a fresh, groundbreaking scale-out storage architecture that meets user performance and scalability needs while also providing the industry’s highest levels of integrated solution reliability, availability and serviceability.
“You can go from a single pair of scalable storage nodes with less than 100 terabytes of capacity to solutions that support more than 30 petabytes of capacity and deliver terabyte-per-second performance,” Claffey said. “Then we embed the analytics engine right into the storage solution, helping you make better use of all the data you are storing and maximize operational efficiency within the solution.”
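A back-of-envelope sketch of what scaling in both dimensions means when sizing such a system. The per-pair figures below are invented placeholders, not ClusterStor specifications; a system needs enough node pairs to satisfy whichever target, capacity or bandwidth, is harder to reach.

# Back-of-envelope sizing for a scale-out storage system.
# Per-pair figures are illustrative assumptions, not product specs.
CAPACITY_PER_PAIR_TB = 100       # assumed usable capacity per node pair
THROUGHPUT_PER_PAIR_GBS = 10     # assumed aggregate bandwidth per pair

def pairs_needed(capacity_tb, throughput_gbs):
    """Node pairs required to meet both capacity and bandwidth targets."""
    by_capacity = -(-capacity_tb // CAPACITY_PER_PAIR_TB)         # ceiling division
    by_throughput = -(-throughput_gbs // THROUGHPUT_PER_PAIR_GBS)
    return max(by_capacity, by_throughput)

# A 30 PB system that must sustain 1 TB/s:
print(pairs_needed(30_000, 1_000))  # -> 300 pairs

Sizing on the maximum of the two requirements is the design point the quote above describes: capacity and performance grow together as pairs are added.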
Adding Engineering
Traditional HPC users often tried to build their own data storage infrastructure by stitching together different pieces of off-the-shelf hardware, an approach that often ran into trouble. Xyratex saw an opportunity to engineer a high-performance, singly managed, appliance-like data storage solution called ClusterStor. ClusterStor is a tightly integrated scale-out solution that is easy to deploy, scale and manage.
What is now termed big data has been used in practice for many years within compute-intensive environments and is evolving toward an enterprise class of requirements. Enterprise users, however, do not have the time to struggle with the level of complexity these environments bring with them, and supporting exponential growth in such environments requires data storage solutions that keep costs in check. As with general-purpose storage acquisitions, enterprise users should consider the total cost of ownership, including the cost of upgrades and time to productivity, as well as operational efficiency. ClusterStor allows enterprise-class users to move into an engineered high-performance data storage solution with enterprise reliability, serviceability and availability, without the hidden costs associated with stitching components together.
XYRATEX
A leading provider of data-storage technology.
Founded: 1994
Employees: 1,900
Headquarters: Havant, United Kingdom

To help big-data users, Xyratex provides its ultra-dense 5U84 enclosure. (Image courtesy of Xyratex.)
Breadth of Benefits
With a Xyratex ClusterStor storage solution, a user gets an integrated product. While Xyratex leverages open-source and standards-based hardware platforms, it also adds its own blend of ease of deployment and reliability. It all adds up to the fastest and most reliable data storage on the market today. In addition, all ClusterStor systems come pre-configured from the factory, enabling turnkey deployment, which significantly reduces installation time.
Many sectors, such as digital manufacturing, government and commercial industry, are finding that big-data initiatives can provide new understanding of the data they already have, but many organizations in these sectors lack the in-house expertise to create and manage big-data solutions.
“Coming to users with a big-data solution that they can easily deploy and manage removes a barrier for them to get access to the technology they need,” said Claffey. “The complexity of the solution is completely hidden from the user, and you don’t see all of the thousands of man-hours that it took to create that experience. We help you reduce complexity, get a great out-of-the-box experience and maximize the uptime of the system.”
“Instead of building and tweaking to get a system up and running, our turnkey solutions reduce the time it takes to start computing. We even have architects who can come in and design how the solution will fit into your environment,” added Steve Paulhus, senior director, strategic business development at Xyratex. “We are known for high-quality storage products—in the way we engineer, test, validate and deliver preconfigured ClusterStor solutions. Through all of this, we offer the quickest path to the performance and results users seek.”
As big data continues to evolve quickly, users’ storage needs will as well, and data storage systems and vendors will have to adapt to stay relevant in the market.
“We continue to leverage our knowledge and experience-based understanding of end-user data storage needs to relentlessly push beyond traditional boundaries,” said Paulhus. “That’s how we will continue to advance in big data and provide solutions ideally suited to serving the needs of the market.”
The ClusterStor 6000 provides the ultimate integrated HPC data-storage solution. (Image courtesy of Xyratex.)
  • 1. Big Data Implementation: Hadoopand Beyond CONDUCTED BY: Tabor Communications Custom Publishing Group IN CONJUNCTION WITH:
  • 2. Dear Colleague, Welcome to the first in a series of professionally written reports and live online events around topics related to big data, high-performance computing, cloud computing, green computing, and digital manu- facturing that comprise the media portfolio produced by Tabor Communications, Inc. Most upcoming Tabor Communications reports and events will be offered at no charge to viewers through underwriting by a limited group of sponsors. Our first report titled Big Data Implementation: Hadoop and Be- yond, is being graciously underwritten by IBM, Omnibond and Xyratex. More about how these organizations tackle challenges associated with big data can be found in the sponsor section. Tabor Communications has been writing about the most progressive technol- ogies driving science, manufacturing, finance, healthcare, energy, government research, and more for nearly 25 years. The extended depth of our new report series extends beyond the limited space we can devote for coverage on specific topics through our regular media resources. The writers of these reports are typically contributing editors for Tabor Communications, and/or content part- ners like Intersect360 Research from whom research cited within this report was developed. The best topics to cover typically come from the user community. We en- courage you to offer any suggestions for upcoming reports that you feel will be helpful to you in better reaching your business goals. Please send your ideas to our SVP of Digital Demand Services, Alan El Faye (alan@taborcommunications. com). Sponsorship inquiries for future reports would also go to Alan. Thank you for reading Big Data Implementation: Hadoop and Beyond. Please visit our corporate website to learn more about the suite of Tabor pub- lications: http://www.taborcommunications.com/publications/index.htm Cordially, Nicole Hemsoth Editorial Director-Tabor Communications, Inc. Big Data Implementation: Hadoop and Beyond Conducted by: Tabor Communications Custom Publishing Group A Tabor Communications, Inc. (TCI) publication © 2013. This report cannot be duplicated without prior permission from TCI. While every effort is made to assure accuracy of the contents we do not assume liability for that accuracy, its presentation, or for any opinions presented. Study Sponsors OOMMUN NI ICC AT S Media and Research Partners
  • 3. By Mike May T he big data phenomenon is far more than a set of buzzwords and new technologies that fit into a growing niche of application areas. Further, all the hype around big data goes far beyond what many think is the key issue, which is data size. While data volume is a critical issue, the related matters of fast movement and opera- tions on that data, not to mention the incredible diversity of that data demand a new movement—and a new set of complementary technologies. But we’ve heard all of this before, of course. Without context, it’s hard to communicate just how important this movement is. Consider, for example, President Obama’s recent campaign, which many touted as by far the most comprehensive Web and data-driven campaign in history. The team backing the big data efforts for Obama utilized vast, diverse wells of information ranging from GIS, social me- dia, demographic and other open and proprietary data to seek out the most likely voters to lure from challenger Mitt Romney. This approach to wrangling big data for a big win worked, although it’s still a matter of speculation just how much these efforts tipped the scale. So let’s take a more concrete set of Big Data Implementation: Hadoop and Beyond In this new era of data-intensive computing, the next generation of tools to power everything from research to business operations is emerging. Keeping up with the influx of emerging big data technologies presents its own challenges—but the rewards for adoption can be of incredible value. High-performance computing and machine learning help users track their family tree at Ancestry.com. This screen shot features the Shaky Leaf, which employs predictive analytics to suggest possible new information in a user’s family tree. (Image courtesy of Ancestry.com.)
  • 4. measurable examples. Healthcare data are being processed at incredible speed, pulling in data from a number of sources to turn over rapid insights. Insurance, retail and Web companies are also tapping into big, fast data based on transactions. And life sci- ences companies and research enti- ties alike are turning to new big data technologies to enhance the results from their high performance systems by allowing more complex data inges- tion and processing. There is little doubt that the next wave of big data technologies will rev- olutionize computing in business and research environments, but we should back up for a moment and take in the 50,000-foot view to put this all in clearer context. The roots of big data are not in the traditional places one used to look for the cutting edge of computational tech- nologies—they are entrenched in the daily needs of the world’s largest, most complex and earliest web companies. As Jack Norris, VP of Marketing at Hadoop distribution startup, MapR, reminds, “Google is the pioneer of big data and showed the dramatic power and ability to change industries through better analysis.” He adds, “Not only did Google do better analysis, but they did it on a larger dataset and much cheap- er than traditional alternatives.” Mak- ing use of big data requires powerful new computing tools, and perhaps the best-known one is called Hadoop. As the case of Google calls to mind, even though big data are often big, size alone does not adequately define the space. Likewise, the field of big data extends beyond any particular applica- tion, as indicated by the examples giv- en above. As Norris explains, “Big data is not just about the kind or volume of data. It’s a paradigm shift.” He adds, “It’s a new way to process and analyze existing data and new data sources.” The point at which a computing chal- lenge turns into big data depends on the specific situation. For some research and enterprise end users, existing ap- plications are being strained due to the data complexity and volume overload, for others performance is lagging due to these factors. Another group of users simply needs to abandon old applica- tions, systems or even methodologies completely and build again to create a big data-ready environment. As Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group says, “One of the simplest ways to define big data is when an organization outgrows their current data-management systems, and they are experiencing problems with the scalability, speed and reliabil- ity of current systems in the data they are seeing or expect to get.” When faced with such a situation, says Hu, “They need to find a cost-effective way to move forward.” (See ‘A to Z for Big Data Challenges.’) When viewed in total, the scalability, performance, storage, network and other concerns continue to mount— which has opened the door for the new breed of big data technologies. This includes everything from fresh approaches to systems (new appli- ances purpose-built for high-volume and -velocity data ingestion and pro- cessing) to new applications, data- bases and even frameworks (Hadoop being a prime example). The Deluge of Data to Decode There is no one definition of a big data application, but it’s safe to say that many existing applications need to be retooled, reworked or even thrown overboard in the pursuit to keep pace with data volume, variety and velocity needs. 
Throw in the tricky element of real- time applications—which require high performance on top of complex in- gestion, storage and processing—and Benefits of BigQuery To explore big data, people want to ask questions about the information. “Google’s BigQuery is a tool that allows you to run ad hoc queries,” says Michael Manoochehri, one of Mountain View, California-based Google’s developer programs engineers. “It lets you ask questions on large data sets, like terabytes.” To make this tool accessible to more users, Google implemented a SQL-style syntax. Moreover, the company has used this tool—in a version called Dremel— internally for years before releasing a form of it in their generally available service, BigQuery, in 2012. “It’s the only tool that lets you get results in seconds from tera- bytes,” Manoochehri says. This tool brings big data to many users because it is hosted on Google’s infrastruc- ture. “You take your data, put it in our cloud and use this application,” Manoochehri ex- plains. Because of that simplicity, a diverse group of users already employs BigQuery. As examples, Manoochehri mentions Claritics—another Mountain View–based company, which describes its services as providing “social gaming and social com- merce analytics”—and Blue Box—a company in Brazil that provides mobile adver- tisements. Regarding Blue Box, Manoochehri says, “They started with MySQL, and when the data got too big they moved to Hadoop.” He adds, “Hadoop is cool, but it takes set up and administration.” Then, Blue Box tried BigQuery. “It fit,” Manoocheh- ri says, “because Blue Box is a small shop with lots of data to process. They wanted to run queries on data and integrate it into their application.” Various other industries also use BigQuery. As an example, Manoochehri points out the hospitality industry, which includes hotel and travel data. Overall, he says, “Almost all of the case studies for BigQuery are small shops, like two or three engi- neers, that have no time to set up hardware.” In general, BigQuery’s biggest benefits arise from speed. “It’s optimized for speed,” Manoochehri explains. It’s fast to set up and it brings back results fast, even from very big datasets. They are often looking at complexity, trying to manage complexity with limited resources. “ ”
  • 5. users are left with a lot of questions about how to meet their goals. While there is no shortage of new technolo- gies that are aimed at meeting those targets, the challenges can be over- whelming to many users. When asked about the hurdles that users face today in working with big data, Michael Manoochehri, developer programs engineer at Google head- quarters says, “They are often looking at complexity, trying to manage com- plexity with limited resources.” He adds, “That’s a big problem for lots of people.” In the past, the lack of resources kept many potential users out of big data. As Manoochehri says, “Google and Amazon and Microsoft could handle big data,” but not everyone. Today, big data tools exist for many more users. For example, Manoochehri says, “Our products make these seemingly impos- sible big data challenges more acces- sible.” (See ‘Benefits of BigQuery.’) A prime example of this can be found in large-scale data mining, which can quickly turn into a big data problem, even if it’s a process that used to fit neatly in a relational data- base and commodity system. For life sciences companies, data mining has always been a key area of computa- tional R&D, but the addition of new, diverse, large data sets continues to push them toward new solutions. The data mining, analytics and pro- cessing challenges for life sciences are great. Genomics explores the chromo- somes of a wide range of organisms— everything from seeds to trees and hu- mans. Chromosomes consist of genes, whichcodeforproteins,andnucleotides make up the genes. The nucleotides are arranged, more or less, like beads on a necklace, and the order of the nucleotides and the types—they come in four forms—are like letters spelling out words. In the case of genomics, the words translate into proteins, which control all of the biochemical processes that fuel life. Hidden in those words are complex programming and process- ing challenges as life sciences users attempt to piece these bits together to find “needle in the hackstack” discov- eries amidst millions (and more often, billions) of pieces at a time. Sequencing determines the order and identity of the nucleotides, and that information can be used in basic research, biotechnology, medicine and more. Today’s sequencing machines produce data so fast that this quickly generates big data. In 2010, for ex- ample, BGI—a sequencing center in Beijing, China—bought 128 Illumina HiSeq 2000 sequencing platforms. Ac- cording to specifications from Illumina, one of these machines can produce up to 600 gigabytes of data in 8.5 days. So 128 of these sequencing machines could produce about one-quarter mil- lion gigabytes of data a month, and that’s big data. Today’s sophisticated life science tools and big data applica- tions also impact the healthcare indus- try. (See ‘Boosting Big Pharma.’) The challenges across the com- putational spectrum are not hard to point to in such a case, especially when it comes to large scale genomic sequence analysis. With the terabytes of data rolling off sequencers, the vol- ume side of the big data equation is incredible, necessitating innovative approaches to storage, ingestion and data movement. In terms of the nee- dle in a haystack side of the analytics Boosting Big Pharma “The big data question hasn’t really been fully exploited in big pharma at the mo- ment,” says Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina. Nonetheless, many big data opportunities exist in healthcare. 
“If you took the big data around genomics, the ability to master investigator and doctor information, the information from clinical trials and what goes on drug labels and much more, you get a picture of where you could influence the field,” says Nichols. In short, big data could impact the pharmaceutical industry from discovering new drugs to commercializing them. “There is the ability to influence the whole lifecycle,” Nichols says, “but that has not been fully realized.” Quintiles, on the other hand, started working with big data some time age. “For the past seven or eight years,” Nichols says, “we’ve been very strategic about the data that we consider.” This includes both enterprise and technical computing. For ex- ample, Quintiles analyzes large datasets in an effort to improve efficiencies across its various organizations. Likewise, the company takes advantage of electronic health records (EHRs) and registries in an effort to better understand the populations that need new pharmaceuticals. As one crucial example, Quintiles uses big data in an effort to make clinical trials more efficient. In a late-phase trial, for example, better use of EHRs could reduce time and cost. “Instead of going to a trial site and asking 100s of questions, like what drugs the people take,” Nichols explains, “we could ask the EHRs those questions, and maybe we can answer 40 percent of the questions.” Beyond increasing efficien- cy, a similar approach could also enhance the information gained from a clinical trial. “Information from EHRs could help us design more-targeted studies on a smaller population,” Nichols says, “and that’s a win-win for everyone.” Although big data offers a gold mine to the pharmaceutical industry, companies must use this resource wisely. “Big data in the life sciences without good clinical interpretations could be a hazard,” Nichols says. “So we bring data together and enrich it.” Quintiles combines data from, say, clinical trials with the experience of clinical experts. To make the most of big data opportunities, the experts at Quintiles consider vari- ous options. “We look long and hard at things,” Nichols says. “We’re looking at Ha- doop, and we’ve also used Solar on a SQL frontend.” He adds, “We’re also looking at techniques developed by companies that are building on top of Hadoop.” For patient privacy, however, Quintiles segregates analytics from the raw data. In many cases, though, each pharmaceutical question requires its own big data implementation. For efficiency, Quintiles relies heavily on data scientists. “By using research analysts—individuals who sit between the knowledge of what the data are and the analytical approaches—we can see if we’ve examined similar questions in the past,” Nichols says. “Then, we might be able to go back to some application that you’ve used before.” Nonetheless, Nichols adds, “Clinical research has more and more one-off questions.”
problem, standard software that fits into neat boxes is no longer able to keep pace with the incredible speed and complexity of the puzzle-piecing. Hence life sciences is one of the key use cases representing the larger challenges of big data operations.

Big data goes beyond our bodies; another example of data-intensive computing in action could soon change our homes. The "smart grid" is another hot application area for a wide range of big data technologies across the entire stack. The US Department of Energy describes the smart grid as "a class of technology people are using to bring utility electricity delivery systems into the 21st century, using computer-based remote control and automation." Of course, without ample processing, frameworks and applications designed from the ground up to handle the many diverse data feeds that power the smart grid, the project couldn't be a success.

Making the smart grid work requires real innovation on the data front, according to Ogi Kavozovic, vice president of strategy and product at Opower in Arlington, Virginia. "Smart grid deployment will create a three-orders-of-magnitude jump in the amount of energy-usage data that the utility industry will have," he says. The energy exec went on to note that when this data spike happens, "Then there will be a several-orders-of-magnitude jump above that from energy-management devices that go inside people's homes, like Wi-Fi thermostats that read data by the second." Opower is already working with Honeywell, the global leader in thermostats, to make such devices. "This is all just emerging now, and the utility industry is preparing for it," Kavozovic says. "The data presents an unprecedented opportunity to bring energy management and efficiency into the mainstream." (See 'Evolution in Energy Efficiency.')

For the smart grid and energy sector, moving to big data technologies is a priority because it will enable new, critical efficiencies that can create a new foundation for energy consumption and distribution. But of course, the frameworks, both on the hardware and programming side, need to be in place.

Data-Based Decisions

In the field of big data, the wide range of applications mirrors the sorts of computing situations that generate big data challenges. Some users run into big data because of handling many files. In other cases, the challenge comes from working with very large files. Big data issues also emerge from extensive data sharing, allowing multiple users to explore or analyze the same data set.

"Time plays a role in big data," says Steve Paulhus, senior director, strategic business development at Xyratex. "Depending on how the data is used, it may be valuable today or it may become valuable tomorrow. For instance, with financial information, the data can lose

Evolution in Energy Efficiency

Utility companies that want to use big energy data to enhance efficiency can turn to Opower in Arlington, Virginia. "We partner with about 80 utilities around the world to help their customers understand and control their energy usage through our tools," says Anand Babu, lead on all data and platform marketing efforts at Opower. With these tools, customers can compare their energy usage to that of their neighbors and learn to improve their energy efficiency. Likewise, this use of big energy data can help utilities reduce peak energy demands, which can reduce the need to develop or purchase more energy.
To suggest some of the volume of big energy data, Babu says, "In the United States alone, consider what happens once every one of our 120 million households has a smart meter and WiFi thermostat. That's 120 million times 100,000 data points per month." That's 12 trillion data points acquired every month. Keep in mind, too, that most households today only get one monthly meter read. So big energy data includes high volume and fast growth.

As utilities start to roll out smart meters, most companies lack the infrastructure to store the data. That's one place where Opower comes in on big energy data. This software-as-a-service company provides the infrastructure and the analytical expertise. "We operate a managed service where utilities securely transfer data to us, and we do the analytics," Babu explains.

To provide this service, Opower uses several tools. Babu says, "We are the biggest user of Hadoop in our industry, operating at the scale of nearly 50 million homes." In other cases, different tools work better. "For some things, we still use MySQL," Babu explains. "For example, to manage how we communicate with certain segments of the customer base, that can be done more efficiently using a traditional database." "The exciting thing here is that customers see value on day one," Babu says, "and they can start to take action right away."

In "Different Perspectives on Big Data: A Broad-Based, End-User View" from Intersect360 Research, people polled rated a series of big data challenges on a scale of 1–5, ranging from 'not a challenge' to an 'insurmountable challenge,' respectively. As the results show, users run into the most trouble managing many files and performing computations and analyses fast. (Image courtesy of Intersect360 Research.)

Challenging dimension of Big Data | Business (Avg. / % rating 4–5) | Technical (Avg. / % rating 4–5) | All Data (Avg. / % rating 4–5)
Management of very large files | 2.7 / 25% | 3.5 / 56% | 3.3 / 46%
Management of very large number of files (many files) | 3.5 / 58% | 3.9 / 70% | 3.7 / 66%
Providing concurrent access to data for multiple users | 3.1 / 40% | 3.5 / 52% | 3.4 / 48%
Completing computation or analysis within a short timeframe (short lifespan of data) | 3.8 / 65% | 3.7 / 61% | 3.7 / 62%
Long-term access to data (long lifespan of data) | 3.6 / 61% | 3.5 / 56% | 3.5 / 57%
Number of respondents | 102 | 203 | 305
Source: Intersect360 Research, 2012
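The data-volume claims in the sidebar above can be sanity-checked in a few lines. In the sketch below, the 15-minute meter-read interval is our assumption; the 120 million households and 100,000 points per home per month are the figures Babu cites:

```python
# Rough sanity check of the smart-grid data-volume claims above.

MONTHLY_READS = 1                    # traditional meter: one read per month
SMART_METER_READS = 30 * 24 * 4      # assumed: one read every 15 minutes
HOUSEHOLDS = 120_000_000             # per the text
POINTS_PER_HOME_PER_MONTH = 100_000  # meter plus thermostat, per the text

jump = SMART_METER_READS / MONTHLY_READS
print(f"Smart metering alone: ~{jump:,.0f}x more readings "
      "(about three orders of magnitude, as Kavozovic says)")

total = HOUSEHOLDS * POINTS_PER_HOME_PER_MONTH
print(f"US-wide: {total:,} points/month")   # 12,000,000,000,000 = 12 trillion
```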
value if you don't get it right now. In other cases, like weather data, you might want to look at it over a long period of time." (See 'Crafting Storage Solutions.')

Addison Snell, CEO of Intersect360 Research in Sunnyvale, California, describes another example where data must be stored for a long time: "A pediatric hospital, for example, keeps information for the life of a patient plus some additional number of years."

Depending on the particular application and the kind of data that it includes, end-users must select a system that works for them. No single solution handles every big data problem. That said, most users of big data face one consistent problem: input/output scalability. "From a technical perspective," says Snell, "this is the number one big data problem." Moreover, I/O scalability impacts both technical and enterprise computing. Technical computing often involves a user's top-line mission, such as car design for an automotive manufacturer or drug discovery for a pharmaceutical company. Enterprise computing, on the other hand, consists of the steps that keep a company running, such as handling the mechanics of an e-commerce website.

In the past, technical computing focused almost entirely on unsolved problems. As Snell explains, "You don't need to design the same bridge twice." Since this area of computing moves to harder and harder problems, the users quickly adopt new technologies and algorithms. In enterprise computing, by comparison, once users find solutions, they stick with them. "In enterprise computing, there are people using approaches that are 30 years old because they still work," says Snell.

In many ways, the big data era changes this distinction. By definition, big data scenarios push an end-user to a point where previous methods won't work. This applies to technical and enterprise computing. "So, for the first time, you have enterprise-class customers viewing performance specifications of infrastructure as the dominant purchase criteria," Snell says. The purchase criterion used the most is I/O performance. In addition to I/O performance scalability, users must ask other questions about their big-data

Computing Family Connections

At Ancestry.com, data come in many forms. This family-history company stores data from historical records, such as census data and military and immigration records. This historical information also includes birth and death certificates. Overall, the historical data at Ancestry.com consists of around 11 billion records. In addition, this web-based company stores user-generated data. "When users come to this site, their goal is to build a family tree," explains Leonid Zhukov, the company's director of machine learning and development. In this process, a user creates a profile of themselves, plus their parents, grandparents and so on. "We have more than 40 million family trees and 4 billion people represented here," Zhukov says. Users can also upload images, such as photographs or even handwritten stories, and Ancestry.com stores about 185 million of these records. On top of all of that, this site keeps track of user activity. "On average, Ancestry.com servers handle 40 million searches a day," says Zhukov, "and we log user searches and use it in machine learning."

The kinds of data being stored at Ancestry.com also evolve over time. As an example of that, the company can now collect a user's DNA data. The company provides the DNA test and processes the information.
As the company explains: "AncestryDNA uses recent advances in DNA technology to look into your past to the people and places that matter to you most. Your test results will reach back hundreds—maybe even a thousand years—to tell you things that aren't always in historical records, but are recent enough to be important parts of your family story."

This range of data—a variety of types from various sources—can only be stored and analyzed with powerful tools. "There's no black-box solution for our data challenges," Zhukov explains. "We use different pipelines for processing different types of data. Plus, we deal with big and versatile data, so we need to do lots of hand work on it." He also points out that moving so much data creates challenges. "We must customize servers and how they are connected—where to collect data and where to process it," Zhukov says. "The logistics of the data are quite important."

Some of the most interesting data analysis at Ancestry.com involves machine learning. This works on two types of objects: a person or a record, which is any document with information about that person. Then, users can find information about ancestors by searching for names, places of birth and so on. These focused searches provide precise results. On the other hand, users can also turn to exploratory analysis. "In this case, a user is looking for suggestions," says Zhukov. "It's similar to when you log on to Amazon, and they show you the books that you might be interested in." So instead of a search, the exploratory analysis provides suggestions of records or family-tree information based on what the user has done in the past, and that depends on machine learning.

The system at Ancestry.com creates the suggestions for exploratory analysis offline. "We train the system to make suggestions based on the feedback of previous suggestions that were accepted or rejected," Zhukov explains. "The system looks for similarity. It asks: How similar is a person to a particular record?" The software uses a range of features to make these comparisons. It can start with a first and last name, but maybe a document includes a misspelling, and then the system relies on a phonetic comparison. Also, a record could have been handwritten, scanned and turned into data with optical character recognition. "You can imagine how many mistakes there are in hand-written historical records," Zhukov says, "so we do fuzzy matching instead of exact matches." The features of a record serve as the input, and the training data for machine learning come from users rejecting or accepting the suggestions.

For many parts of this processing, Ancestry.com uses Hadoop. As Zhukov says, "Hadoop is really great technology, but it's not at the point where you can open the box and things miraculously work." Consequently, this company relies on Hadoop experts and also experts who understand how to search huge databases for similarities between disparate sources of information. As Zhukov adds, "Hadoop is not only for data storage, but it also provides a data-processing pipeline. So you need to feed it with the right things."
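The phonetic comparison Zhukov describes maps onto standard record-linkage techniques. The sketch below is a simplified Soundex encoder, a textbook phonetic algorithm rather than anything from Ancestry.com's production pipeline, showing how spelling variants of a surname collapse to one comparable key:

```python
# A minimal illustration of phonetic matching for record linkage, in the
# spirit of the fuzzy matching Zhukov describes. Simplified Soundex only;
# not Ancestry.com's code.

CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
    for c in letters}

def soundex(name: str) -> str:
    """Encode a name (assumed to start with a letter) as a 4-char key."""
    name = name.lower()
    # 0 marks vowels and other letters that act as separators.
    digits = [CODES.get(c, 0) for c in name if c.isalpha()]
    code = name[0].upper()
    prev = CODES.get(name[0], 0)
    for d in digits[1:]:
        if d != 0 and d != prev:   # skip separators and adjacent repeats
            code += str(d)
        prev = d
    return (code + "000")[:4]

# Misspellings of the same surname collapse to one code:
for surname in ("Smith", "Smyth", "Smithe"):
    print(surname, "->", soundex(surname))   # all map to S530
```

In a linkage pipeline, keys like these would be one feature among many; the accept/reject feedback Zhukov mentions would then train a model over such features.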
infrastructure needs. For example, Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, asks: "Once you have all of your data in a common environment, you need to think about the reliability, availability and serviceability of that infrastructure, as you want that data to be consistently available." He adds, "You need to think about the constant with big data—multifaceted growth in capacity, data sources, data types, number of clients, applications, workloads and the new ways to use the data. It is key to ask yourself how you are going to efficiently manage such multifaceted exponential growth over the next number of years against the constraints of available data scientists and data center space, power, cooling and budget. Your big data infrastructure needs to be scalable to meet performance and capacity demands, flexible to respond to change and extremely efficient."

Delving even deeper, Norris noted that when looking at new architecture and what's possible, the real focus is on what types of applications will be used. "What data sources are we going to exploit? What is the desired outcome?" In short, a user must answer questions like these to know where to start. After that, a user decides how to integrate new infrastructure into an existing computing environment. "How does the new computing fit my current security?" Norris asks. "How do I make it enterprise grade?" As Norris points out, a user might not start asking these questions until working on the second or third big data application and sharing it across a group of users. "You can generalize about the best platform, but you have to have it working well with the existing environment and applications," Norris says. "So you'll end up with a little bit of a hybrid solution. You'll find yourself asking: How do I take advantage of this new technology, but on a platform that delivers what I need in terms of reliability, ease of integration and so forth?"

Among the frameworks emerging in the key sectors we've discussed are Hadoop and MapReduce, which were, again, born out of the largest-scale data operations the world has seen to date—the search engines and web companies.

As this schematic indicates, a wide range of information plays a part in big energy data. Providing benefits to both utilities and customers requires advanced storage and analytics. (Image courtesy of Opower.)

Mapping the MapReduce Paradigm

For a wide range of industries, MapReduce makes up one of the most common big data application frameworks. This often gets implemented as Hadoop MapReduce, which makes it easier for a user to develop applications that handle large data sets—even in multiple terabytes—on computer clusters. (A minimal sketch of the programming model appears below.)

The complete range of big data applications could always remain in flux. As Boyd Wilson, executive at Omnibond in Pendleton, South Carolina, says, "As you iterate on various methods of teasing out the information from the data, that learning can lead to new areas of big data." As an example, Wilson mentions cross-disciplinary analytics. Overall, he says, "The big data sweet spot is lots of—potentially diverse—data that you want to combine to learn new correlations, and you need a high-performance medium, something like OrangeFS, to aggregate and analyze in large volume and with high throughput." (See 'A Big-Data File System for the Masses.')
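To make the programming model concrete before returning to its use cases, here is a minimal, self-contained sketch of MapReduce in plain Python. On a real Hadoop cluster, the same map and reduce functions would run distributed across many nodes (for example, via Hadoop Streaming); the local sort-and-group step below simply stands in for the framework's shuffle:

```python
# A minimal sketch of the MapReduce pattern, run locally in plain Python.

from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # Map: emit (key, value) pairs; here, one (word, 1) per word.
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    # Reduce: combine all values seen for one key.
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
pairs = sorted(kv for rec in records for kv in map_phase(rec))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_phase(word, (count for _, count in group)))
# ('brown', 1) ... ('fox', 2) ... ('the', 3)
```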
That "sweet spot" covers a lot of ground: consumer-behavior analysis, dynamic pricing, inventory and logistics, scientific analysis, searching and pattern matching, and more. Moreover, these applications of big data arise in many vertical markets, including financial services, retail, transportation, research and development, intelligence and the web.

In some cases, the use of big data will come from unexpected places. For instance, Leonid Zhukov, director of machine learning and development at Ancestry.com, says, "We're a technology company, and we do family history, but we are a big data company." He adds, "We use data science to drive family-history advances. So we're hiring data people." (See 'Computing Family Connections.')

As MapReduce and Hadoop continue to mature, these use cases will continue to extend further into mainstream enterprise contexts, but for now, there are some bigger problems to address.

Hadoop's High Points

Although Hadoop only made up slightly more than 10 percent of the applications mentioned in Intersect360 Research's poll (see below for details), many experts endorse this approach. When asked about the benefits of applying a Hadoop solution, Norris of MapR says, "The way to summarize the benefits of Hadoop is: It's about flexibility; it's about power; it's about new insights." (See 'M's from MapR.')

In terms of flexibility, according to Norris, Hadoop does not require that you understand ahead of time what you're going to ask. "That seems simple," he says, "but it's a really dramatic difference in terms of how you set up and establish an application." He adds, "Setting up a typical data warehouse is initially a large process, because you are asking how you will use the data, and that requires a whole set of processes." With Hadoop, on the other hand, a user can just start putting in data, according to Norris. "It's really flexible in the analysis you can perform and the data you can perform it on, and that really surprises people."

Norris points out several features of the power behind Hadoop. "You can use different data sets, and the volume that you can use is quite broad." He adds, "This eliminates the sampling. For example, you can store a whole year of transactions instead of just the last 10 days. Instead of sampling, you just analyze everything." By using such large volumes of data in analytical procedures, users can apply new kinds of analysis and also more easily identify outliers. Because of that ability to work with all of the data, says Norris, "Relatively simple approaches can produce some dramatic results." For instance, just the added ability to explore outliers could reveal some pleasant surprises, maybe some tactic that generated an unexpectedly high amount of revenue. On the other hand, analyzing outliers can reveal failures. As an example, Norris says, "You might find an anomaly like an expensive part of the supply chain that you want to eliminate. Overall, analyzing these outliers makes dramatic returns possible."

In terms of new insights, says Norris, "With Hadoop, you can recognize patterns much more easily and drive new applications." He also adds that Hadoop

M's from MapR

MapR in San Jose, California, developed a series of Apache Hadoop solutions, including M3, M5 and the latest M7. The M7 version adds HBase capabilities, which is the Hadoop database.
"This provides real-time database operations side by side with analytics," says Jack Norris, the company's vice president of marketing. "This provides lots of flexibility for the developers, because you can run applications from deep analytics to real-time database operations."

All of the MapR tools make it possible to treat a Hadoop cluster like standard storage. "Consequently, you can use industry-standard protocols and existing tools to access and manipulate data in the Hadoop clusters," Norris explains. "This makes it dramatically simpler and easier to support and use." This tool can also be used as an enterprise-grade solution.

For all of the MapR editions, the developers aimed at making Hadoop easy, fast and reliable to use. For example, these tools provide full data protection. Norris adds, "In terms of hardware systems, the beauty of MapR is that it supports standard, off-the-shelf hardware. So you don't need a specialized server." He adds, "You get better hardware utilization with MapR, because, for example, you can use a greater density of disk drives and faster drives."

To get an idea of how MapR can be applied, consider these case studies with M5. "One of the largest credit-card issuers is using M5 to create new products and services that better detect fraud," says Norris. "This includes some broad uses of analytics that benefit the cardholders." Some large online retailers also use M5 to better analyze shopping behavior and then use that to recommend items or simplify the user's experience. Other websites also use M5 in digital advertising and digital-media platforms. For example, says Norris, "Ancestry.com is using it as part of their service." Norris also points out: "Manufacturing customers use M5 for basically analyzing failures. They collect sensor data on a global basis and detect precursors to problems and schedule maintenance to avoid or prevent downtime." In summary, Norris calls the MapR users "really, really broad."

A user just starting with big data might start with Apache Hadoop. "When you really put this into production use," Norris explains, "having enterprise grade with advanced features is a real benefit." Some of the advanced features of MapR tools include a reduced need for service. For instance, the MapR platforms are self-healing.

To ensure that MapR platforms include the features that users really want and need, the company works with an extensive network of partners. "In some cases, these partners provide vertical expertise within a market or even do consulting work to identify likely use cases to help us select the best implementations of MapR," Norris says. In addition, this partner network also helps develop some of the training available from the MapR Academy, which provides free training. "We want to ensure that users can easily share information and applications on a cluster," Norris says. "That is basically a unique value proposition of MapR."
can be an economical solution. "A high-end data warehouse can cost $1 million per terabyte, whereas Hadoop is only hundreds of dollars per terabyte."

At Ancestry.com, Zhukov and his colleagues use Hadoop to solve many problems. "For different data types and different business units, we use Hadoop in different ways," Zhukov explains. For example, the team that provides optical character recognition for documents scanned into Ancestry.com uses Hadoop for that. Also, when this company acquires newspaper data that are structured and in a text format, facts—such as dates, names and places—must be extracted, and Hadoop provides the technology for the parallel processing that achieves this. Zhukov adds, "In the DNA pipeline, where we compare customer DNAs and do matching for relatives, we look at the percentage of overlap and ethnicity. That's a big data job." Hadoop also gets used at this company for the machine learning that generates the possible family connections for users. As Zhukov concludes: "We do analytics in more of a traditional setting, but we are turning to Hadoop for all business units in the company."

Emerging Options

Hadoop does not solve every big data problem or make the best fit for every user. In some cases, the actual implementation of Hadoop creates the biggest hurdle. As Zhukov says, "If you compare Hadoop to traditional database systems, it's in the early stages of its technology. So there's lots of tweaking and twisting for it to run smoothly."

In fact, the Hadoop challenges vary by user. "The biggest challenge really depends on the use case the customer has," says Hu of IBM. For example, he says, "If they are just analyzing data in some standalone system and they have Hadoop clusters that operate in isolation, then the challenge is just to keep those clusters up and running." Hu also points out some general concerns with Hadoop. He says, "It has challenges in single points of failure, and massive numbers of systems must be stitched together, which makes system provisioning and management a challenge." Beyond that, he lists several other issues, including the latency of execution in an application and the inability to execute workloads outside the Hadoop paradigm. To explore the general concerns even more, Hu directs readers to "Top 5 Challenges for Hadoop MapReduce in the Enterprise."

Beyond concerns of implementation and execution, Zhukov points out that it's not easy to hire Hadoop experts. As he says, "Here in Silicon Valley, it's proved to be quite difficult to hire people with Hadoop skills. So we're trying to grow and train our own people to work with Hadoop."

Sometimes, users assume that Hadoop is the only big data option. As Snell says, "In many cases, we find that Hadoop is a technology in search of solutions." To explain this, he says, "Imagine that I've designed a new screwdriver, and I'm trying to find how many things I can do with it. In practice, you might be able to drive in a Phillips screw with a flathead screwdriver, but the flathead screwdriver might not have been the best tool." Likewise, Hadoop is not the best tool for every big data situation.

Other options keep emerging for users. In some cases, companies even develop new approaches to creating the new tools. For example, Omnibond makes all of its advanced tools available to the open-source community. As Ray Keys, the company's marketing manager, explains, "We have a very unique business model.
It's pure open source, and pushing everything out to the open-source community is really powerful." He adds, "There is goodwill involved, but it's goodwill from people we do development for, and they all realize that it's goodwill for the entire community. As a result, the community contributes, as well." Beyond just goodwill, this company's OrangeFS provides a file system that can handle millions and millions of files in a single directory without reducing the performance of the solution.

End-User Applications

To get a quantitative grasp of how users apply big data, Intersect360 Research conducted a survey and produced this report: "Different Perspectives on Big Data: A Broad-Based, End-User View." This study arose from surveying—in partnership with the Gabriel Consulting Group in Beaverton, Oregon—more than 300 users of big data applications, which included both high-performance computing (HPC) and non-HPC. The survey sampled a wide range of users, including commercial, government and academic (or not-for-profit) users. For example, the commercial users ranged from consulting and computing to energy and engineering, as well as manufacturing and medical care. The government users worked in national research laboratories, defense, security and other areas. The users also came from a range of group sizes, from a few to more than 100,000 employees. For instance, nearly 20 percent of the groups had only 1–49 employees, and almost 10 percent had 20,000–99,999.

Such a diverse sample should provide a broad overview of big data applications, and the survey results confirm that. Among those polled, the big data applications included enterprise, research and real-time analytics, plus MapReduce, complex-event processing, data mining and visualization. For example, more than two-thirds of those
polled said that they work with big data applications in research analytics, and more than half apply data mining to big data situations. A small portion of those surveyed—6 percent—reported no current applications, although they were planning future ones.

Those surveyed also rated a series of big data challenges on a scale of 1–5, ranging from 'not a challenge' to an 'insurmountable challenge,' respectively. Overall, the users ranked the most significant challenges as managing many files and performing computations and analyses fast, where the average scores for both were 3.7 on the 1–5 scale. Among the technical users, 70 percent ranked managing many files as a 4 or 5 on the scale—making it feel like an insurmountable, or nearly insurmountable, challenge for 7 out of 10 of those polled. Among business users, 65 percent rated performing computations and analyses fast as a 4 or 5 on the scale of big data challenges. In general, though, most users rated most of the challenges as significant. For example, users also reported serious challenges—average scores above 3—for managing large files, giving multiple users simultaneous access to data and providing long-term access to data.

This survey also explored where users get their software for big data applications. The results show some unexpected, or at least inefficient, results: Half of those polled developed the applications internally; one-quarter of them purchased the applications; and about another quarter used open-source applications. "For those who developed the applications internally," says Snell, "those are all one-offs by definition." With challenges as complicated as big data, developing a unique solution to every problem does not make for efficient computing. Moreover, the wide use of one-offs limits these applications to groups that maintain the personnel to develop these advanced applications.

This survey also examined the specific tools the users apply to big data scenarios. Out of the 300-plus responses to this survey, 225 provided details on their applications, and Hadoop came out on top. Nonetheless, only 11 percent of the users reported that they make use of Hadoop. Still, says Snell, "If people have a big data problem, they often think they must use Hadoop."

Future Fits

By definition, big data problems outstrip the capabilities of a user's current system. So some of the tools typically used only in HPC will work their way into non-HPC environments. A key example of that could arise from the use of parallel file systems. The Intersect360 Research survey asked users about the file systems that they employ. The replies included a diverse set of approaches. Across all types of users—academic, commercial and government—most selected the Network File System (NFS), which was developed by Sun Microsystems in 1984. Overall, nearly 70 percent of the users employ NFS. Not many of the users implement parallel file systems at this point. For example, less than 20 percent of the people polled used the Hadoop Distributed File System (HDFS), IBM's General Parallel File System (GPFS) or Lustre. As Snell says, "Parallel file systems are still poorly penetrated into commercial markets." He adds, "Commercial users will need to find solutions for parallel I/O."

Beyond learning to build the infrastructure for big data, much remains to be learned about where to use it.
Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina, provides a recent example. During the 2012 influenza season, big data experts in Scotland sequenced the genotype of every person who contracted the flu over a two-week period. From that data, the analysts extracted information that revealed who was most susceptible to the illness, and who might require hospitalization. Scotland's government then used those data to adjust its healthcare policy. "They are using big data in how they deal with influenza," Nichols explains. "This is a good example that we haven't even seen the starting point for what we can do with this technology." He adds, "If big data solutions grow in the right way, then there's the opportunity to make meaningful changes in the way drugs are developed, put through to market and utilized."

The variety of infrastructures being employed and the questions being considered reflect the diversity of approaches that today's users take to big data challenges. This reveals that many technologies in computing can be modified to handle big data problems in many enterprises. Moreover, many technologies besides Hadoop can be explored. For any particular user, the best technology depends on the big data problem at hand. Even for the same user, different questions and applications will often require different approaches to making use of big data. As Snell concludes, "Hadoop is not all things to all people. Plus, there is not one perfect answer for all big data problems."

About the Author: Mike May, Ph.D. is a science and technology writer, editor and book author. Mike has published hundreds of articles in prestigious publications during his 20-year professional writing career, and is currently editorial director for Scientific American Worldview. He holds a Ph.D. from Cornell University in neurobiology and behavior.
Sponsor Profile

A to Z for Big-Data Challenges
IBM provides every tool—from applications to complete systems—for cost-effective solutions

When a customer runs into a big-data problem—from struggling with data velocity, volume or variety—IBM offers a solution. Such solutions, however, must supply flexible options, because of the relativity of the term big in big data. "The 'big' in big data is like the 'high' in high performance computing," says Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group. "It's different to different people."

To provide flexibility, IBM orchestrates a very broad big-data initiative. It ranges from complete systems—like the xSeries and pSeries architectures that handle big-data workflows—to various aspects of computing solutions. For example, IBM offers various forms of storage technology that benefit big-data situations. The company also creates networking components and interconnects needed for extremely high-throughput systems. The data in any system does not live in isolation. Big data come from a location, and the analysis goes to a destination. So IBM solutions can provide connectivity between big-data systems and the databases and data warehouses in the enterprise.

IBM: A globally integrated technology and consulting company
Founded: 1911 (as Computing Tabulating Recording Company)
Number of Employees: 433,362
Headquarters: Armonk, NY
Top Executive: Virginia M. Rometty (chairman, president and CEO)

IBM Platform Symphony brings mature multi-tenancy and low-latency scheduling to Hadoop. Symphony enables multiple MapReduce workloads to run on the same cluster at the same time, dynamically sharing resources based on priorities that can change at run-time.

The Power of Parallel Files

To provide an overview of big-data challenges, Hu divides the computer processing of big-data problems into different patterns. The first data workload might require only a monthly assessment. This does not require any immediate analysis. Instead, data can be collected and analyzed over time, and the results get used for decision-making as needed. At the other end of the spectrum, some applications—such as a traffic-monitoring system or software that maintains power across a utility grid—require real-time decisions. This requires a system with interactive task latencies on the order of milliseconds. "The current Hadoop system doesn't have this kind of near real-time response capability," Hu says.

Working effectively with this variety of big-data applications requires a powerful file system. IBM's new General Parallel File System File Placement Optimizer (GPFS FPO), developed specifically for big data, allows for a model very similar to the Hadoop Distributed File System (HDFS). This file system provides a range of necessary features. For example, it includes enterprise capability. GPFS FPO is also extremely scalable, so it can grow with a user's data needs. Furthermore, this file system includes no single points of failure, which provides the robust computing needed to create sophisticated data solutions. Moreover, its POSIX support allows applications to
get the benefits of a UNIX-compatible file system.

Orchestrating a Solution

In many cases, big-data applications require advanced analytics. For those situations, IBM created IBM Platform Symphony. "This software implements a Hadoop MapReduce that is very fast," says Hu. Moreover, this software provides very low latency. For instance, the overhead provisioning for an executed task takes less than 1 millisecond. "Open-source tool sets take an order of magnitude longer," Hu says. On average, performance tests reveal that Platform Symphony is 7.3 times faster than open-source Hadoop for jobs comprised of a representative mix of social media workloads.[1]

For someone who has been running Hadoop MapReduce on a cluster, one challenge involves efficient utilization. In many cases, the user lacks sufficient jobs to keep the cluster occupied. In fact, utilization often runs as low as 20 percent. Platform Symphony helps users extract more from a cluster by getting workloads there faster and not keeping them waiting to run. In addition, users can run heterogeneous workloads, such as Hadoop MapReduce mixed with C++ jobs. Moreover, a business can divide a cluster among multiple units with Platform Symphony, giving each a slice of the structure. Then, the units can operate virtually independently of each other.

In a recent return-on-investment study, IBM explored the cost of running a big-data system, covering everything from the cost of the hardware and administration to cooling the data center. By using Platform Symphony instead of a standard Hadoop system, the results showed a potential savings of almost $1 million, depending on the customer's use case.

For Real-Time Needs

IBM's InfoSphere offers further options in big-data analytics. This approach handles big-data applications that need to run in real or almost real time. It includes ready-made components for people who already use Hadoop analytics. Furthermore, InfoSphere's accelerators and frameworks speed up the development of big-data analytics. One ready-made tool, for example, is BigSheets. This provides a spreadsheet-like interface to the Hadoop stack. It extracts and manipulates data from the spreadsheet interface. InfoSphere also lets users combine BigSheets with other software. It supports all of the components of the open-source Hadoop ecosystem. So a user could select open-source Hadoop layers in any of the stacks. Hu adds, "Customers can choose the layers where they want enterprise-capable technology."

Building Relationships

In tailoring big-data solutions to customers, IBM supplies a wide range of customer services. This includes assessing a customer's existing infrastructure and providing business-level needs. This combines IBM's technical- and business-service arms. Getting the right solution often requires both levels of analysis. For example, an efficient big-data solution requires the right computing infrastructure and software systems, as well as the business knowledge to determine what reports should be generated from the analysis.

The very basis of big data often implies an analytical challenge. For example, Vestas—a wind-turbine maker in Denmark—turned to IBM to determine the best locations for electricity-generating wind turbines. The resulting analysis included weather information, such as wind and climate, as well as the local geography. This company used IBM technology to put together a system that could produce accurate answers in a timely fashion.
The results generated the best locations for the wind turbines and how to arrange them to harness the most energy possible.

As big-data problems grow and evolve, IBM's solutions keep adapting and advancing. It takes the teamwork of hardware and software infrastructure plus powerful analytics to perform the needed computation. In addition, technical and business solutions must work together to generate the most useful solution. From the xSeries and pSeries systems to powerful interconnects and advanced analytical algorithms, IBM supplies big-data tools and solutions that extend from high-performance infrastructure to sophisticated business intelligence and planning. Only approaches as orchestrated as this can turn big-data opportunities into business growth and success.

Reference
1. STAC Report: A Comparison of IBM Platform Symphony and Apache Hadoop MapReduce using Berkeley SWIM, 2012.

IBM provides a variety of System x and Power-based solutions optimized to run big-data tools, like IBM InfoSphere BigInsights, IBM Platform Symphony and IBM GPFS.
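One practical consequence of the POSIX support noted under 'The Power of Parallel Files' above is that ordinary file-handling code needs no changes to run against a POSIX-compatible parallel file system. A minimal sketch follows; the mount point and file name are hypothetical, and nothing in the code is GPFS-specific, which is exactly the point:

```python
# Plain POSIX I/O against a parallel file system mount. Because GPFS FPO
# presents a UNIX-compatible file system, unmodified code like this works;
# the path below is a hypothetical mount point.

import os

MOUNT = "/gpfs/bigdata"                      # hypothetical GPFS mount
path = os.path.join(MOUNT, "events.log")

with open(path, "a") as f:                   # ordinary open/append
    f.write("sensor-42,2013-01-07T12:00:00,ok\n")

with open(path) as f:                        # ordinary read
    print(sum(1 for _ in f), "records on the parallel file system")
```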
Sponsor Profile

A Big-Data File System for the Masses
The computing community and Omnibond developed open-source infrastructure software for storage clusters

To break down big-data problems, many data scientists are looking to the Hadoop MapReduce model. Currently those systems run on dedicated Hadoop clusters. Many companies and researchers have existing clusters that run all sorts of code but currently cannot leverage Hadoop in their shared environments. Dedicated clusters are required due to the close ties of Hadoop MapReduce and the Hadoop Distributed File System (HDFS). To solve that problem, Omnibond has extended the open-source Orange File System (OrangeFS) to natively support Hadoop MapReduce without the HDFS requirement.

According to Boyd Wilson, Omnibond executive, "Our main business model for OrangeFS is all open-source, so you can download and start running today." In addition, Omnibond provides support and development services. For example, if a customer needs a specific feature added to OrangeFS, Omnibond will price out the cost and create it. The customer must agree that the resulting code goes back into open source. Furthermore, OrangeFS offers a wide range of applications that can be used with high-performance and enterprise computing.

The need to make parallel file systems easier to use, including leveraging MapReduce on existing clusters, ignited the original work on OrangeFS. To bring such systems to more users, the parallel file system development work needs to continue. OrangeFS is not limited to traditional data science alone, as the broadening list of use cases is expanding from research to finance, entertainment and analytics.

Enhancing the open-source business model, Omnibond provides everything that a customer would expect from a company that develops closed-source code. "If you have a problem, just call. If you need a feature, we can add it, and in addition, you get the peace of mind in having the source code, too," Wilson says. Customers can purchase this service and directed development on an annual subscription basis, giving them the assurance of stable development and support. Moreover, this model provides a triple-win situation that's great for the support customer, Omnibond and the entire open-source community, thus assuring the ongoing development of OrangeFS.

Omnibond Systems, LLC: A software engineering and support company focused on infrastructure software.
Founded: 1999
Employees: 45
Headquarters: Clemson, SC
Top Executive: Boyd Wilson

Omnibond Systems develops open-source tools that can be used in a wide range of big-data applications. (Image courtesy of Omnibond Systems.)

Inside OrangeFS

In general, OrangeFS provides the next-generation approach to the Parallel Virtual File System, version 2 (PVFS2). From the start, PVFS2 delivered desirable features, including a modular design and distributed metadata. It also supports the Message Passing Interface (MPI). Other features include PVFS2's high scalability, which makes it a great choice for large data sets or files. Beyond working with such large files, PVFS2
works fast. For example, it outruns—really outruns—a network file system (NFS) using the MPI-IO interface.

Omnibond started with those features and generated OrangeFS by adding even more. For example, this file system comes from a focused development team. It also includes improvements in working with smaller files; Omnibond does all of its testing on 1-megabyte files. In the 2.8.6 version, released on June 18, 2012, users were provided with advanced client libraries, Linux kernel support and Direct Interface libraries for POSIX calls and low-level system calls. Additionally, users can receive support for WebDAV and S3 via the new OrangeFS Web Pack. Future releases promise even more features, such as distributed directories and capability-based security, which will make the access time for large directories even faster and the entire file system more secure.

Added Incentives

For many big-data users, the list of OrangeFS features already mentioned sounds enticing enough, especially with the best of open- and closed-source worlds. Nonetheless, even more incentives exist. The object-based design of OrangeFS appeals to many users. In addition, this file system separates data and metadata. OrangeFS also provides optimized MPI-IO requests. Furthermore, this software can be installed on a range of architectures. For example, users can apply OrangeFS to multiple network architectures. In addition, OrangeFS works on many commodity hardware platforms. To ensure file safety, OrangeFS in the near future will also provide cross-server redundancy.

Given the open-source nature of OrangeFS, any documentation might sound like a rarity, but Omnibond goes even further. For instance, the recent OrangeFS Installation Instructions supply complete documentation—all available online.

Many applications of big data require aggregation. For example, a system might gather real-time data and store it for access. That stored data might have access to clustering nodes for Hadoop. Then, the collected data can be queried and analyzed later with MapReduce. Likewise, a big-data system might require other forms of data aggregation. For instance, real-time data could be collected, combined with private data and then processed. On the other hand, a user might take 'snapshots' while processing real-time data to collect a subset of data, and then analyze it. OrangeFS can provide storage for all of these big-data scenarios.

Advanced Applications

The open-source community stays at the cutting edge of computation and data analysis. Consequently, experts at Omnibond track all of the trends in high-performance and big-data environments. The information gleaned by Omnibond continues to impact future versions of OrangeFS. As an example, ongoing work in data analysis spawns unimagined approaches. New ways of even thinking about data keep emerging. Social-media data analysis, for instance, and other large streams of data can trigger entirely new ways of analyzing information, such as log, customer and other real and near real-time data. In other cases, new perspectives on existing data will generate novel approaches to data integration that were never considered.

As Wilson explains, "The new concept of data analysis covers a wide range.
In some cases, it involves huge amounts of data, and other times not so much." He adds, "You can bring together different data points that you wouldn't think would correlate, but sometimes they do."

This kind of thinking—seeing today's challenges and turning them into tomorrow's opportunities—lies at the heart of Omnibond's philosophy. "We are adding exciting new features on our development path to OrangeFS 3.0," says Wilson. "These should be real game-changing enhancements." As a company, Omnibond focuses on building the OrangeFS that customers want, and then sharing those advances with the wider community.

Omnibond's open-source Orange File System (OrangeFS) supports Hadoop MapReduce and provides a simplified parallel file system. (Image courtesy of Omnibond Systems.)
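To illustrate the MPI-IO access pattern this profile mentions, here is a minimal mpi4py sketch in which every rank writes its own slice of a shared file in one collective call, the kind of concurrent access a parallel file system such as OrangeFS is built to service. The file path is illustrative, and the example is a generic MPI-IO sketch rather than anything OrangeFS-specific:

```python
# Collective MPI-IO write: each rank lands its slice at a disjoint offset
# of one shared file, so a parallel file system can serve all ranks at once.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

CHUNK = 1024                                  # values per rank
data = np.full(CHUNK, rank, dtype=np.int32)   # this rank's slice

fh = MPI.File.Open(comm, "/mnt/orangefs/demo.dat",   # illustrative path
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
# Rank i writes at byte offset i * CHUNK * 4 in a single collective call.
fh.Write_at_all(rank * data.nbytes, data)
fh.Close()
# Run with, e.g.: mpiexec -n 8 python mpiio_demo.py
```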
Sponsor Profile

Crafting Storage Solutions
Xyratex builds scalable capacity and performance into big data systems

Big data varies from one customer to the next. As Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, said, "What's big data to one guy is tiny to another." No matter how you define big data, achieving analytic results rapidly needs one thing: fast, reliable data storage. And with more than 25 years of experience as a leading provider of data storage technology, this is exactly where Xyratex leads the market. Xyratex' innovative technology solutions help make data storage faster, more reliable and more cost-effective.

XYRATEX: A leading provider of data-storage technology.
Founded: 1994
Employees: 1,900
Headquarters: Havant, United Kingdom

As an organization, Xyratex provides a range of products, from disk-drive heads and disk media test equipment to enterprise storage platforms and scale-out storage for high-performance computing (HPC). For big data applications, where users need storage that scales both in capacity and performance, Xyratex solutions support everything from high-performance data processing of relatively small data quantities to some of the largest supercomputers in the world. "Performance is often overlooked," Claffey said, "but it's the key driver. Once you have your data together, you need performance for analytics to transform that data into information you can use to make more-informed decisions."

Combining Features

Xyratex achieves its big data storage success by providing solutions, not just parts. The company has developed a fresh, groundbreaking scale-out storage architecture that meets user performance and scalability needs while also providing the industry's highest levels of integrated solution reliability, availability and serviceability. "You can go from a single pair of scalable storage nodes with less than 100 terabytes of capacity to solutions that support more than 30 petabytes of capacity and deliver terabyte-per-second performance," Claffey said. "Then we embed the analytics engine right into the storage solution, helping you make better use of all the data you are storing and maximize operational efficiency within the solution."

To help big-data users, Xyratex provides its ultra-dense 5U84 enclosure. (Image courtesy of Xyratex.)

Adding Engineering

Traditional HPC users often tried to build their own data storage infrastructure by stitching together different pieces of off-the-shelf hardware. This approach often encountered challenges. Xyratex saw an opportunity to engineer a high-performance, singly managed, appliance-like data storage solution called ClusterStor. ClusterStor is a tightly integrated scale-out solution that is easy to deploy, scale and manage.

What is now termed big data has been used in practice for many years within compute-intensive environments and is evolving toward an enterprise class of requirements. However, enterprise users do not have the time to struggle with the level of complexity these environments bring with them. To support exponential growth in these types of environments requires data storage solutions that can keep costs in check. As with general-purpose storage acquisitions, enterprise users should consider the total cost of ownership, including the cost of upgrades and time to productivity, as well as operational efficiency. ClusterStor allows enterprise-class users to move into an engineered high-performance data storage solution
with enterprise reliability, serviceability and availability, without the hidden costs associated with stitching together components.

Breadth of Benefits

With a Xyratex ClusterStor storage solution, a user gets an integrated product. While Xyratex leverages open-source and standards-based hardware platforms, it also adds its own blend of ease of deployment and reliability. It all adds up to the fastest and most reliable data storage on the market today. In addition, all ClusterStor systems come pre-configured from the factory, enabling turnkey deployment, which significantly reduces installation time.

Many sectors—such as digital manufacturing, government, commercial industry and others—are finding that big-data initiatives can provide new understanding of the data they already have, but many organizations in these sectors lack the in-house expertise to create and manage big-data solutions. "Coming to users with a big-data solution that they can easily deploy and manage removes a barrier for them to get access to the technology they need," said Claffey. "The complexity of the solution is completely hidden from the user, and you don't see all of the thousands of man-hours that it took to create that experience. We help you reduce complexity, get a great out-of-the-box experience and maximize the uptime of the system."

"Instead of building and tweaking to get a system up and running, our turnkey solutions reduce the time it takes to start computing. We even have architects who can come in and design how the solution will fit into your environment," added Steve Paulhus, senior director, strategic business development at Xyratex. "We are known for high-quality storage products—in the way we engineer, test, validate and deliver preconfigured ClusterStor solutions. Through all of this, we offer the quickest path to the performance and results users seek."

As big data continues to quickly evolve, users' storage needs will as well, and data storage systems and vendors will have to adapt to stay relevant in the market. "We continue to leverage our knowledge and experience-based understanding of end-user data storage needs to relentlessly push beyond traditional boundaries," said Paulhus. "That's how we will continue to advance in big data and provide solutions ideally suited to serving the needs of the market."

The ClusterStor 6000 provides the ultimate integrated HPC data-storage solution. (Image courtesy of Xyratex.)