1. Hadoop and its Ecosystem
Components in Action
Andrew J. Brust
CEO and Founder
Blue Badge Insights
Level: Intermediate
2. Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair of VSLive! and a speaker for 17 years
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer
News
• brustblog.com, Twitter: @andrewbrust
7. A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per-platform totals to the lobby
• Sort totals by platform
• Send two platform packets to the 10th, 20th, and 30th floors
• Tally up each platform
• Collect the tallies
• Merge tallies into one spreadsheet
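Concretely, the floor-and-lobby analogy maps onto the map, combine, shuffle, and reduce steps. Below is a minimal Java sketch of such a tally as a Hadoop MapReduce job, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API; the PlatformTally class name, the input layout (whitespace-separated platform names), and the paths are illustrative assumptions.

// A minimal sketch of the "tally by platform" analogy as a Hadoop
// MapReduce job. Input format and class names are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PlatformTally {

  // Map: each "suite" emits (platform, 1) for every record it sees.
  public static class TallyMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text platform = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        platform.set(token);
        context.write(platform, ONE);  // per-suite, per-platform count
      }
    }
  }

  // Reduce: each "floor" receives all counts for its platforms and sums them.
  public static class TallyReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));  // merged tally
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "platform tally");
    job.setJarByClass(PlatformTally.class);
    job.setMapperClass(TallyMapper.class);
    job.setCombinerClass(TallyReducer.class);  // "lobby"-side pre-aggregation
    job.setReducerClass(TallyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}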
8. What’s a Distributed File System?
• One where data gets distributed over
commodity drives on commodity servers
• Data is replicated
• If one box goes down, no data lost
– Unless the box is the name node, which is a single point of failure (SPOF)!
• BUT: HDFS is immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
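As a hedged illustration of that write-once model, here is a minimal Java sketch using Hadoop's FileSystem API; the hdfs://namenode:8020 URI and the file path are made-up placeholders. An "update" amounts to a delete plus a fresh write.

// Minimal sketch: HDFS files are written once; updating means
// drop + re-write. URI and path below are hypothetical.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical name node
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/events.log");
    try (FSDataOutputStream out = fs.create(file)) {  // write once...
      out.write("first version\n".getBytes(StandardCharsets.UTF_8));
    }

    // ...an "update" is a drop + re-write: delete, then create again.
    fs.delete(file, false);
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("second version\n".getBytes(StandardCharsets.UTF_8));
    }
    fs.close();
  }
}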
9. Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to the cluster.
– Mappers execute in parallel
– Hardware is commodity
– “Scaling out”
• Use of HDFS means data may well be local
to mapper processing
• So, not just parallel, but minimal data
movement, which avoids network
bottlenecks
10. The Hadoop Stack
• Hadoop
– MapReduce, HDFS
• HBase
– NoSQL Database
– Lesser extent: Cassandra, HyperTable
• Hive, Pig
– Hive: SQL-like “data warehouse” system
– Pig: data transformation language
• Sqoop
– Import/export between HDFS, HBase,
Hive and relational data warehouses
• Flume
– Log file integration
• Mahout
– Data Mining
11. Ways to work
• Microsoft Hadoop on Azure
– Visit www.hadooponazure.com
– Request invite
– Free, for now
• Amazon Web Services Elastic MapReduce
– Create AWS account
– Select Elastic MapReduce in Dashboard
– Cheap for experimenting, but not free
• Cloudera CDH VM image
– Download as .tar.gz file
– “Un-tar” it (WinRAR and 7-Zip can do this)
– Run it via VMware Player or VirtualBox
– Everything’s free
12. Microsoft Hadoop on Azure
• Much simpler than the others
• Browser-based portal
– Provisioning cluster, managing ports, MapReduce jobs
– Gathering external data (configure Azure Storage or Amazon S3; import from DataMarket to Hive)
• Interactive JavaScript console
– HDFS, Pig, light data visualization
• Interactive Hive console
– Hive commands and metadata discovery
• From the Portal page you can RDP directly to the Hadoop head node
– Double-click the desktop shortcut for CLI access
– Certain environment variables may need to be set
14. Amazon Elastic MapReduce
• Lots of steps!
• At a high level:
– Set up an AWS account and S3 “buckets”
– Generate Key Pair and PEM file
– Install Ruby and EMR Command Line Interface
– Provision the cluster using the CLI (a batch file can work very well here)
– Set up and run SSH/PuTTY
– Work interactively at command line
15. Amazon EMR – Prep Steps
• Create an AWS account
• Create an S3 bucket for log storage
– with list permissions for authenticated users
• Create a Key Pair and save PEM file
• Install Ruby
• Install Amazon Web Services Elastic
MapReduce Command Line Interface
– aka AWS EMR CLI
• Create credentials.json in EMR CLI folder
– Associate it with the same region where the key pair was created
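For reference, here is a minimal credentials.json sketch, assuming the field names of the classic Ruby EMR CLI; every value below is a placeholder you must replace with your own:

{
  "access_id": "<your AWS access key ID>",
  "private_key": "<your AWS secret access key>",
  "keypair": "<your key pair name>",
  "key-pair-file": "<path to your PEM file>",
  "log_uri": "s3n://<your-log-bucket>/",
  "region": "us-east-1"
}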
16. Amazon – Security and Startup
• Security
– Download PuTTYgen and run it
– Click Load and browse to PEM file
– Save it in PPK format
– Exit PuTTYgen
• In a command window, navigate to EMR CLI
folder and enter command:
– ruby elastic-mapreduce --create --alive [--num-instances xx] [--pig-interactive] [--hive-interactive] [--hbase --instance-type m1.large]
• In AWS Console, go to EC2 Dashboard and
click Instances on left nav bar
• Wait until instance is running and get its
Public DNS name
– Use Compatibility View in IE, or copying it may not work
17. Connect!
• Download and run PuTTY
• Paste DNS name of EC2 instance into hostname
field
• In the treeview, drill down to Connection > SSH > Auth, and browse to the PPK file
• Once the EC2 instance(s) are running, click Open
• Click Yes to “The server’s host key is not cached
in the registry…” PuTTY Security Alert
• When prompted for user name, type “hadoop” and
hit Enter
• cd bin, then run hive, pig, or hbase shell
• Right-click to paste from clipboard; option to go
full-screen
• (Kill EC2 instance(s) from Dashboard when done)
19. Cloudera CDH4 Virtual Machine
• Get it for free, in VMware and VirtualBox versions
– VMware Player and VirtualBox are free too
• Run it, and configure it to have its own IP on
your network.
• Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to:
– http://192.168.1.59:8888
• Can also use browser in VM and hit:
– http://localhost:8888
• Work in “Hue”…
20. Hue
• Browser-based UI, with front ends for:
– HDFS (with upload & download)
– MapReduce job creation and monitoring
– Hive (“Beeswax”)
• And in-browser command-line shells for:
– HBase
– Pig (“Grunt”)
27. Hive
• Used by most BI products that connect to Hadoop
• Provides a SQL-like abstraction over
Hadoop
– Officially HiveQL, or HQL
• Works on its own tables, but also on HBase
• A query generates a MapReduce job, whose output becomes the result set
• Microsoft has a Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot, and Analysis Services (Tabular mode only)
28. Hive, Continued
• Load data from flat HDFS files
– LOAD DATA [LOCAL] INPATH 'myfile'
INTO TABLE mytable;
• SQL Queries
– CREATE, ALTER, DROP
– INSERT OVERWRITE (creates whole tables)
– SELECT, JOIN, WHERE, GROUP BY
– SORT BY, but ordering data is tricky!
– MAP/REDUCE/TRANSFORM…USING allows for custom
map, reduce steps utilizing Java or streaming code
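Tying those commands together, here is a minimal HiveQL sketch; the page_views table, its columns, and the input path are hypothetical:

-- Create a table over tab-delimited data and load a file from HDFS
CREATE TABLE page_views (platform STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH 'page_views.tsv' INTO TABLE page_views;

-- Each query compiles to one or more MapReduce jobs;
-- INSERT OVERWRITE materializes the result set as a table's contents
CREATE TABLE platform_totals (platform STRING, total BIGINT);
INSERT OVERWRITE TABLE platform_totals
SELECT platform, SUM(hits) FROM page_views GROUP BY platform;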
30. Pig
• Instead of SQL, employs a language (“Pig
Latin”) that accommodates data flow
expressions
– Does a combination of query and ETL
• “10 lines of Pig Latin ≈ 200 lines of Java.”
• Works with structured or unstructured data
• Operations
– As with Hive, a MapReduce job is generated
– Unlike Hive, output goes only to a flat file in HDFS or to text at the command-line console
– With MS Hadoop, can easily convert to JavaScript array,
then manipulate
• Use command line (“Grunt”) or build scripts
31. Example
• A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
32. Pig Latin Examples
• Imperative, file system commands
– LOAD, STORE
– Schema is specified on LOAD
• Declarative, query commands (SQL-like)
– xxx = file or data set
– FOREACH xxx GENERATE (SELECT…FROM xxx)
– JOIN (WHERE/INNER JOIN)
– FILTER xxx BY (WHERE)
– ORDER xxx BY (ORDER BY)
– GROUP xxx BY / GENERATE COUNT(xxx)
(SELECT COUNT(*) GROUP BY)
– DISTINCT (SELECT DISTINCT)
• Syntax is assignment statement-based:
– MyCusts = FILTER Custs BY SalesPerson eq 15;
• Access HBase
– CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
36. Flume NG
• Source
– Avro (data serialization system; can read JSON-encoded data files, and can work over RPC)
– Exec (reads from the stdout of a long-running process)
• Sinks
– HDFS, HBase, Avro
• Channels
– Memory, JDBC, file
37. Flume NG (next generation)
• Set up conf/flume.conf
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
• From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
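To push test events into the Avro source above, Flume's avro-client mode can send a file's contents as events; the host and port match the config above, and the /tmp/test.log path is a hypothetical example:
flume-ng avro-client --conf ./conf/ -H localhost -p 41414 -F /tmp/test.log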
38. Mahout Algorithms
• Recommendation
– Your info + community info
– Give users/items/ratings; get user-user/item-item
– itemsimilarity
• Classification/Categorization
– Drop into buckets
– Naïve Bayes, Complementary Naïve Bayes, Decision
Forests
• Clustering
– Like classification, but with categories unknown
– K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
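As a sketch of the recommendation case, here is a minimal user-based recommender built on Mahout's Taste API in Java; the ratings.csv input (one userID,itemID,rating triple per line) and the neighborhood size are hypothetical choices:

// Minimal user-user recommender sketch with Mahout's Taste API.
// Input file and parameters are illustrative assumptions.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RatingsRecommender {
  public static void main(String[] args) throws Exception {
    // "Your info + community info": each line is userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // User-user similarity, restricted to the 10 nearest neighbors
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}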
41. The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data scientists
• You need staff or a product to visualize the output, or to turn it into a usable prediction model
• Investigate Predixion Software
– Its CTO, Jamie MacLennan, used to lead the SQL Server Data Mining team
– Excel add-in can use Mahout remotely, visualize its output,
run predictive analyses
– Also integrates with SQL Server, Greenplum, MapReduce
– http://www.predixionsoftware.com
42. Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• Cloudera CDH 4 download
– https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata
43. Thank you
• andrew.brust@bluebadgeinsights.com
• @andrewbrust on Twitter
• Get Blue Badge’s free briefings
– Text “bluebadge” to 22828