Hadoop for carrier
- 3. 600-800 GB of CDR (call detail record) data per day
◦ GPRS Signaling 50GB/day
◦ 3G Signaling 300GB/day
◦ Voice 100GB/day
◦ SMS 200GB/day
100 - 200 GB/day of Web Data
Mammoth Data
Data Analysis
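As a quick sanity check on the figures above, the four CDR streams can be summed and compared with the quoted 600-800 GB/day range. This is only arithmetic over the numbers on the slide; the variable names and the decimal TB conversion are assumptions made for illustration.

```python
# Daily CDR volumes in GB, as quoted on the slide.
cdr_gb_per_day = {
    "GPRS signaling": 50,
    "3G signaling": 300,
    "Voice": 100,
    "SMS": 200,
}

total_cdr = sum(cdr_gb_per_day.values())

# Web data range quoted on the slide, in GB/day.
web_low, web_high = 100, 200

print(f"CDR total: {total_cdr} GB/day")
print(f"CDR + web: {total_cdr + web_low}-{total_cdr + web_high} GB/day")

# Rough yearly CDR volume, assuming decimal units (1 TB = 1000 GB).
cdr_tb_per_year = total_cdr * 365 / 1000
print(f"CDR per year: about {cdr_tb_per_year:.0f} TB")
```

The per-stream figures add up to a value inside the quoted range, which suggests the range reflects day-to-day variation rather than additional streams.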
Copyright © 2011 Flytxt B.V. All rights reserved. 1/17/2012 3
- 6. Framework for distributed processing of large data sets
across clusters
Consists of
◦ Hadoop Distributed File System aka HDFS (File system)
◦ Hadoop MapReduce (programming model)
Characteristics
◦ Performance shall scale linearly
◦ Compute should move to data
◦ Simple core, Modular and Extensible
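The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not the Hadoop API; the function names and the sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    subscriber, _kind = record
    yield subscriber, 1

def reduce_phase(key, values):
    """Combine all values that share a key into one result."""
    return key, sum(values)

def run_mapreduce(records):
    # Shuffle step: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_phase(k, v) for k, v in groups.items())

# Hypothetical CDR-like records: (subscriber, record type).
records = [("alice", "sms"), ("bob", "voice"), ("alice", "voice")]
counts = run_mapreduce(records)
print(counts)
```

In a real cluster the map calls run on the nodes that hold the input blocks, which is what "compute should move to data" refers to; here everything runs in one process for clarity.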
- 7. Current Bottleneck
◦ Data resides across multiple nodes, zones, and VM instances, with no
elegant, reliable, and efficient way to extract it
◦ Loading terabytes of data into a database is slow
◦ Parallel computing is not possible in conventional BI ETL
◦ User profile and application data reside in a DB that can scale
only vertically
- 8. Structured Data
sqoop import --connect jdbc:mysql://db.example.com/website --table USERS --as-sequencefile
Unstructured Data
- 9. A distributed data collection server
◦ Scalable
◦ Configurable
◦ Extensible
◦ Manageable
Built around the concept of flows
◦ A single flow corresponds to a type of data source
◦ Supports compression, batching & reliability setups per flow
Data comes in through a source
◦ Is optionally processed by one or more decorators
◦ And is transmitted out via a sink
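The source, decorator, and sink flow described above can be sketched with Python generators. This is an illustration of the concept only, not the collection server's actual API or configuration syntax; every function name and the sample events are hypothetical.

```python
import gzip

def source(lines):
    """Source: yield raw events one at a time."""
    for line in lines:
        yield line

def batch(events, size):
    """Decorator: group events into lists of up to `size` items."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == size:
            yield buffer
            buffer = []
    if buffer:
        # Flush the final, possibly smaller, batch.
        yield buffer

def compress(batches):
    """Decorator: gzip each batch into a single payload."""
    for items in batches:
        yield gzip.compress("\n".join(items).encode("utf-8"))

def sink(payloads, out):
    """Sink: deliver payloads to their destination (a list here)."""
    for payload in payloads:
        out.append(payload)

collected = []
events = ["cdr-1", "cdr-2", "cdr-3"]
sink(compress(batch(source(events), size=2)), collected)
print(len(collected), "payloads delivered")
```

Because each stage only consumes an iterator and produces another, batching and compression can be switched on or off per flow by changing how the stages are chained.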
- 12. MapReduce is very powerful, but:
◦ It requires a Java programmer
◦ The user has to reinvent common functionality (join, filter, etc.)
Execution engine atop Hadoop
Pig provides a higher-level language, Pig Latin
Opens the system to non-Java programmers
Provides common operations like join, group, filter, sort
- 13. Web log processing.
Data processing for web search platforms.
Ad hoc queries across large data sets.
Rapid prototyping of algorithms for processing large data
sets.
Pig runs on the local machine, and the job is executed on the Hadoop
cluster (in local mode, -x local, everything runs on the local machine)
$ cd /usr/share/cloudera/pig/
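$ bin/pig -x local
grunt>
-- Load the query log; fields are tab-separated by default
log = LOAD 'excite-small.log' AS (user, timestamp, query);
-- Group the records by user and count the queries in each group
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
-- Write the (user, count) pairs to the 'output' directory
STORE cntd INTO 'output';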
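For readers who do not know Pig Latin, the group-and-count logic of the script above corresponds roughly to the following plain Python. This is a single-machine sketch of the same logic, not what Pig generates; the sample rows are invented and only mimic the (user, timestamp, query) layout.

```python
from collections import Counter

# Hypothetical tab-separated sample rows: user, timestamp, query.
sample = (
    "u1\t970916105432\tfoo\n"
    "u2\t970916105433\tbar\n"
    "u1\t970916105500\tbaz\n"
)

def count_queries_per_user(text):
    """Equivalent of GROUP ... BY user followed by COUNT per group."""
    counts = Counter()
    for line in text.splitlines():
        user, _timestamp, _query = line.split("\t", 2)
        counts[user] += 1
    return counts

cntd = count_queries_per_user(sample)
for user, n in sorted(cntd.items()):
    print(user, n)
```

The Pig version expresses the same steps declaratively, which lets the engine turn them into MapReduce jobs when the input is too large for one machine.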
- 14. System for querying and managing structured data
Built on top of Hadoop
Uses MapReduce for execution
SQL-like syntax; supports
◦ FROM-clause subqueries
◦ ANSI join (equi-join only)
◦ Multi-table insert
◦ Multi group-by
◦ Sampling
◦ Object traversal
Engagement
◦ Summarization
◦ Ad hoc analysis
◦ Spam detection
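The equi-join listed among the supported features matches rows only where the key columns are equal. A minimal hash-join sketch in Python shows the idea; it illustrates the join semantics and is not how Hive executes joins. The table contents and function name are hypothetical.

```python
from collections import defaultdict

def equi_join(left, right, key):
    """Inner join: pair rows whose `key` values are equal."""
    # Build a hash index over the right-hand rows.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Probe the index with each left-hand row.
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            yield {**rrow, **lrow}

# Hypothetical tables.
users = [
    {"user": "u1", "plan": "prepaid"},
    {"user": "u2", "plan": "postpaid"},
]
usage = [
    {"user": "u1", "sms": 12},
    {"user": "u3", "sms": 4},
]

joined = list(equi_join(users, usage, "user"))
print(joined)
```

Equality conditions like this partition cleanly by key, which is why they fit a MapReduce execution model: all rows with the same key can be sent to the same reducer.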
- 16. Feature | Hive | Pig
Language | SQL-like | Pig Latin
Schemas/Types | Yes (explicit) | Yes (implicit)
Partitions | Yes | No
Server | Optional (Thrift) | No
User-Defined Functions | Yes | Yes
Custom Serializer/Deserializer | Yes | Yes
DFS Direct Access | Yes (implicit) | Yes (explicit)
Join/Order/Sort | Yes | Yes
Shell | Yes | Yes
Streaming | Yes | No
Web Interface | Yes | No
JDBC/ODBC | Yes (limited) | No