1. Ironfan
Your Foundation for Flexible Big Data Infrastructure
Infochimps brings the power of Big Data infrastructure to your fingertips.
Traditional systems configuration is a time-consuming process,
vulnerable to human error. Infochimps leverages the power and
simplicity of Ironfan as its provisioning and deployment layer,
allowing users to easily launch and orchestrate repeatable
infrastructure.
The Infochimps Platform reduces cycle time to provision a server
from days or weeks to minutes, enabling simple scaling and rapid
system evolution, dramatically lowering the cost of starting new
data analysis jobs. Infochimps even enables continual monitoring
of your system through automated machine provisioning. Spend
your time finding insights, not building infrastructure.
Benefits
With Ironfan, you can expect:
• Reduced cycle time. Provision servers in minutes not days.
• Improved visibility. Increased transparency means faster problem solving and sharing.
• Lower support costs. Experience fewer reactive support issues.
• Lower network costs. Only use the nodes you need for the job you are running.
• Lower risk, more agility. Deploy and manage a big data stack with minimal resources.
© 2012 Infochimps, Inc. All rights reserved. 1
2. Why Infochimps?
Specialized. Ironfan, Infochimps’ systems configuration tool, leverages
three years of internal development and external
contributions to its code base. This specialized experience helps
organizations reduce the initial adoption cost and experimentation
necessary to produce well-tuned clusters.
Integrated. Infochimps’ tool development and Big Data expertise
mean our team understands, and is equipped with the tools to
successfully navigate and troubleshoot, the entire Big Data
ecosystem of an organization.
Flexible Cost. Infochimps’ Ironfan lets you take advantage of
IaaS (Infrastructure as a Service) providers such as Amazon Web
Services. This allows for all infrastructure costs to be treated as
operating expenses (use what you need) and not capital
expenditures (pay whether you need it or not). Switching from
CapEx to OpEx can dramatically lower the funding barrier to
adopting Big Data internally in an enterprise.
Context. Perhaps best of all, the Infochimps Platform, enabled by
Ironfan, can be used to provide context to an enterprise’s
internal data, whether through public opinion mining (via social
networks), geo-located information, word corpus training for
machine learning, or other commonly useful (but difficult to
accumulate) data. All of these capabilities combine to make
Infochimps a great choice for providing Big Data services to the
budget- and process-conscious enterprise customer.
3. Understanding the Tools
What is Chef? Chef is a configuration management system,
designed to be a general purpose tool for building repeatable
infrastructure. It uses a Ruby DSL (Domain Specific Language)
allowing you to write out specifications (as cookbooks, roles, etc.)
for infrastructure that is fully composable.
Chef can be used in a number of ways, allowing it to fit into a
variety of existing architectures. That flexibility, however, means
Chef assumes little about the architecture around it, which makes it
harder to build higher-level abstractions on top of what it provides.
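To make "fully composable" concrete, here is a minimal Python sketch of how roles might compose into a flat list of recipes. It is not Chef's Ruby DSL or its implementation: the role and recipe names are hypothetical, and the expansion rule (depth-first, first occurrence wins) is an assumption made for illustration.

```python
# Illustrative only: models composable roles, not Chef's real API.
# All role and recipe names below are hypothetical.
ROLES = {
    "base": ["recipe[ntp]", "recipe[users]"],
    "hadoop_worker": ["role[base]", "recipe[java]", "recipe[hadoop_datanode]"],
    "monitored": ["role[base]", "recipe[monitoring_agent]"],
}


def expand(run_list, roles=ROLES, _seen=None):
    """Flatten a run list of role[...] / recipe[...] entries into recipes.

    Roles are expanded depth-first; each recipe is kept only the first
    time it appears, so shared roles can be composed without duplication.
    Circular role references are not handled in this sketch.
    """
    seen = _seen if _seen is not None else []
    for entry in run_list:
        if entry.startswith("role[") and entry.endswith("]"):
            expand(roles[entry[5:-1]], roles, seen)
        elif entry not in seen:
            seen.append(entry)
    return seen


if __name__ == "__main__":
    for recipe in expand(["role[hadoop_worker]", "role[monitored]"]):
        print(recipe)
```

Because both hypothetical roles include `role[base]`, the shared recipes appear once in the result even though the base role is referenced twice.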
What is Ironfan? Ironfan, the foundation of The Infochimps
Platform, is a systems provisioning and deployment tool. Ironfan
automates not only machine configuration, but entire systems
configuration to enable the entire Big Data stack, including tools
for data ingestion, scraping, storage, computation, and
monitoring.
Ironfan builds on Chef, but is opinionated about its
architecture, which allows broader integration between
components. It assumes a source repository, a central Chef
Server, and a modern POSIX-compliant operating system for a
base image. Currently, it works best with Git, Amazon Web
Services and Ubuntu 11.04, with exploration into other
virtualization platforms (Vagrant, etc.) and operating systems
(CentOS, FreeBSD, etc.) ongoing, both inside and outside of
Infochimps.
4. Benefits for the Entire Team
For Systems Administrators, Ironfan removes the guesswork
from building systems, because it reduces the cycle time to build
a server from days or weeks to minutes. Instead of
following long lists of manual processes, a system administrator
makes changes to their Ironfan homebase, and then ushers those
changes into the appropriate systems with the Chef knife and
client programs. This enables rapid iterative development, a
practice of Agile programming shops for years. Up until recently,
this kind of fast-paced development was unavailable to the
average systems administrator. Ironfan also enables repeatable
architecture, another powerful tool. Now, replacing malfunctioning
components with completely new ones, built from scratch and
loaded with data from live exports or backups, is a simple, reliable,
and rapid process, instead of a last-ditch solution. Finally, Ironfan
allows you to make infrastructure inevitable: you can write
definitions that automatically attach new servers to your
existing architecture, instead of manually wiring them into central
services like monitoring, log ingestion, or orchestration, with the
attendant risk of human error.
For Data Scientists or Business Intelligence Teams,
Ironfan can currently build a Hadoop cluster from scratch in less
than an hour with just a small handful of commands, and expand
it in minutes with a single command. Other large scale cluster
technologies (HBase, ElasticSearch, Redis, Flume, etc.) are just
as simple to build. This dramatically reduces the cost of starting
new data analysis jobs, allowing for greater experimentation.
Because the underlying architecture is rented by machine-hour,
jobs with predictable costs in machine-hours can be optimized for
rapid execution without large increases in cost. Should the
underlying processing time prove greater than anticipated,
clusters can be scaled up while in use, to improve the chances of
hitting deadlines.
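The machine-hour reasoning above can be sketched in a few lines of Python. This is an idealized model, not a pricing tool: it assumes the work divides evenly across nodes, and the job size and hourly rate are hypothetical values chosen only for illustration.

```python
def machine_hours(nodes, elapsed_hours):
    """Total rented machine-hours for a cluster of `nodes` machines."""
    return nodes * elapsed_hours


def wall_clock_hours(total_machine_hours, nodes):
    """Elapsed time if the work divides evenly across `nodes` (idealized)."""
    return total_machine_hours / nodes


def job_cost(total_machine_hours, hourly_rate):
    """Rental cost; it depends on machine-hours, not on how they are arranged."""
    return total_machine_hours * hourly_rate


if __name__ == "__main__":
    work = 240   # hypothetical job size, in machine-hours
    rate = 0.50  # hypothetical price per machine-hour
    for nodes in (10, 20, 40):
        hours = wall_clock_hours(work, nodes)
        print(nodes, "nodes:", hours, "hours, cost", job_cost(work, rate))
```

Under these assumptions, doubling the node count halves the elapsed time while the bill stays the same. Real jobs rarely scale perfectly, and billing granularity can add cost, so treat this as an approximation.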
5. Benefits for the Entire Team (continued)
For Systems Architects or Core Infrastructure Team,
Ironfan allows you to build the repeatable architecture
recommended by ITIL (Information Technology Infrastructure
Library) for reliable IT infrastructure. It becomes simpler to scale
or evolve systems rapidly. Ironfan takes the grunt-work out of
distributing those changes, allowing architects to spend more of
their focus on design details, instead of implementation details.
Since everything is stored in source control, both architects and
administrators can make changes to the infrastructure, confident
that they are not obliterating important history. Also, because the
same code can be used to create development, staging, and
production environments, the usual barriers to deployment
caused by differences in the underlying architectures and
deployment mechanisms are significantly reduced.
Because starting new instances with Ironfan is trivial, and paid for
by the hour, capacity can be managed as OpEx rather than
CapEx. This also means that problems with huge capacity spikes
can be considered; turning up a thousand nodes for three days,
then turning them off again, is no longer a laughable fantasy.
Migrations also become significantly easier, as new infrastructure
can be spun up in parallel with the old, without a long term
increase in expense.
6. Case Study
How Infochimps Uses Ironfan to Create TrstRank
Since the launch of Twitter, people have clamored for ways to
access and “slice and dice” its data. One of the most common
ways people use the Twitter data corpus is to measure a person’s
importance and influence. Klout is an example of one product that
specializes in this kind of “influencer” data.
A few years ago, we created our own special version of Klout,
one that took advantage of our vast historical record of the
relationships to create an accurate number describing how
influential a Twitter user is. It’s called TrstRank and it ranks a user
on a scale of 1-10, with 10 being the most influential you
can get.
Coming up with a number like TrstRank is no small task.
Setting aside the issues of getting the data, there are some very
real Big Data problems surrounding the product that require
special tools for getting it done efficiently. And when you’re a
bootstrapped startup, like we were at the time, you have to be
resourceful if you are going to get by.
The biggest issue with pursuing a new data product like TrstRank
is the same one any company faces when they decide to venture
into new territory: the high risk of wasting time and money.
What is TrstRank?
TrstRank is an Infochimps-developed dataset and API
that provides Twitter influence metrics with
the click of a button! TrstRank measures Twitter user
reputation, importance and influence in a far more
robust way than counting the number of followers. It is a
sophisticated measure of a user’s relative importance
within the entire Twitter network.
Wasting Time
One of the first problems you run into as a small team trying your
hand at data science is the excess time spent on server and machine
configuration, instead of focusing on modeling, algorithms,
and manipulating the data.
Ramp-up time for even the first phase of a project like TrstRank
can be a whole day or more of engineering time.
7. Case Study (continued)
Wasting Money
From our earliest days Infochimps has been based on Amazon
Web Services’ (AWS) cloud, taking advantage of the flexibility
and scalability it provides. With AWS, you pay for what you use,
so you are always inclined to eliminate waste. In our early days
we even created decision trees for whether to shut down a cluster,
depending on how many hours it would be up but unused.
This can set conflicting goals for the data scientist who would
prefer to leave a cluster up overnight, even if it’s unused, so they
don’t have to deal with setting everything up again the next day!
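A shutdown decision of this kind can be sketched as a simple cost comparison. The Python below is a hypothetical reconstruction of the trade-off, not the decision tree Infochimps actually used: the function, its parameters, and every number in the example are assumptions.

```python
def should_shut_down(idle_hours, nodes, hourly_rate, rebuild_hours, engineer_rate):
    """Return True when idling costs more than rebuilding the cluster later.

    idle_cost    = rental paid for machines that sit unused
    rebuild_cost = engineering time spent setting the cluster up again
    All inputs are hypothetical; plug in your own figures.
    """
    idle_cost = idle_hours * nodes * hourly_rate
    rebuild_cost = rebuild_hours * engineer_rate
    return idle_cost > rebuild_cost


if __name__ == "__main__":
    # 30 nodes idle overnight, with a full day of manual setup to rebuild.
    print(should_shut_down(12, 30, 0.50, 8, 50))
    # Same cluster, but rebuilding takes half an hour.
    print(should_shut_down(12, 30, 0.50, 0.5, 50))
```

With these made-up figures, a slow manual rebuild makes leaving the cluster up overnight the cheaper choice, while a fast rebuild makes shutting it down the cheaper choice. That is the tension between cost control and convenience described above.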
Enter Ironfan
We created Ironfan to solve our own problems of how to save
time and money during our data science operations in the cloud.
When we came up with the idea for TrstRank, it was a simple
operation to spin up a cluster for early analysis and experimentation.
We could validate some of our algorithms and ideas on a
simple cluster before moving to something more heavyweight.
Ironfan and TrstRank, Now
Ironfan has continued as a key tool for our monthly TrstRank
operation. We continue to scrape Twitter for follower information,
and with the updated data every month we crunch the TrstRank
numbers again.
With Ironfan, we’re able to run a multi-step operation on
8 billion tweets on clusters of 30 m1.xlarge EC2 machines,
while only running the resources we need when they’re needed.
TrstRank takes 72 hours to complete, with resources being paid
for commensurately. Without Ironfan, we’d be looking at 2-3x the
costs in time and money!
8. About Infochimps
Our mission is to make the world’s data more accessible.
Infochimps helps companies understand their data. We provide
tools and services that connect their internal data, leverage the
power of cloud computing and new technologies such as Hadoop,
and provide a wealth of external datasets, which organizations
can connect to their own data.
Contact Us
Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703
1-855-DATA-FUN (1-855-328-2386)
www.infochimps.com
info@infochimps.com
Twitter: @infochimps
Get a free Big Data consultation
Let’s talk Big Data in the enterprise!
Get a free conference with leading big data experts about your enterprise big data
project. Meet with data scientists Flip Kromer and/or Dhruv Bansal to talk shop
about your project objectives, design, infrastructure, tools, etc. Find out how other
companies are solving similar problems. Learn best practices and get recommendations for free.