"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
2. IOUG SIG Meetings at OpenWorld
All meetings located in Moscone South - Room 208
Monday, September 29
Exadata SIG: 2:00 p.m. - 3:00 p.m.
BIWA SIG: 5:00 p.m. – 6:00 p.m.
Tuesday, September 30
Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
Storage SIG: 4:00 p.m. - 5:00 p.m.
SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.
Wednesday, October 1
Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
Big Data SIG: 10:30 a.m. - 11:30 a.m.
Oracle 12c SIG: 2:00 p.m. – 3:00 p.m.
Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)
3. The IOUG Forum Advantage
• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more
COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino
Las Vegas, NV
www.collaborate.ioug.org
Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!
17. But Wait! There’s More!
• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – Memory, file
• Many sinks – HDFS, HBase, Solr
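To make the roles of these components concrete, here is a toy in-memory model in Python. It is not Flume's API (Flume agents are Java processes configured through properties files); every class, function, and header name below is hypothetical and only illustrates how a source hands events to interceptors, a selector, channels, and sinks.

```python
from collections import deque


def add_host_interceptor(event):
    """Interceptor: enrich the event in flight (hypothetical 'host' header)."""
    event["headers"]["host"] = "web01"
    return event


def select_channel(event):
    """Selector: pick a channel from a header value, defaulting to 'main'."""
    return "errors" if event["headers"].get("level") == "ERROR" else "main"


class Pipeline:
    """Source side puts events in; sinks drain them from named channels."""

    def __init__(self, interceptors, selector, channel_names):
        self.interceptors = interceptors
        self.selector = selector
        # An in-memory channel per name; a file channel would persist instead.
        self.channels = {name: deque() for name in channel_names}

    def put(self, body, headers=None):
        event = {"headers": dict(headers or {}), "body": body}
        for intercept in self.interceptors:
            event = intercept(event)
        self.channels[self.selector(event)].append(event)

    def drain(self, channel_name, sink):
        channel = self.channels[channel_name]
        while channel:
            sink(channel.popleft())


pipeline = Pipeline([add_host_interceptor], select_channel, ["main", "errors"])
pipeline.put("GET /index.html", {"level": "INFO"})
pipeline.put("NullPointerException", {"level": "ERROR"})

# Two list-backed "sinks" standing in for HDFS, HBase, or Solr.
main_sink, error_sink = [], []
pipeline.drain("main", main_sink.append)
pipeline.drain("errors", error_sink.append)
```

Keeping the selector separate from the sinks is what lets one stream of events fan out to different destinations without the source knowing about them.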
18. High Level Pipeline Architecture
[Diagram: event flow from web apps through Flume to storage and reporting. Labels recovered from the figure:]
• Web Apps, each with a Flume Avro Client (the client provides multi-threading, compression, encryption, and batching)
• Flume Agents in a fan-in pattern (multiple agents for failover and rolling restarts)
• Spark Streaming, HBase, HDFS (the Spark Streaming data is a subset of the whole events)
• ML Map/Reduce jobs; batch report updates
• Report App: pulls near-real-time results (query with the HBase API or Impala)
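The client-side batching and multi-agent failover named in the diagram can be sketched roughly as follows. This is an illustration in plain Python, not the Flume client SDK: `FakeAgent`, `BatchingClient`, and `AgentDown` are hypothetical names, and a real client would also handle compression, encryption, and threading.

```python
class AgentDown(Exception):
    """Raised by an agent that cannot accept a batch."""


class FakeAgent:
    """Stand-in for a Flume agent endpoint (hypothetical)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.received = []

    def send_batch(self, batch):
        if not self.up:
            raise AgentDown(self.name)
        self.received.extend(batch)


class BatchingClient:
    """Buffers events and sends them in batches, trying agents in order."""

    def __init__(self, agents, batch_size):
        self.agents = agents
        self.batch_size = batch_size
        self.buffer = []

    def append(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        for agent in self.agents:
            try:
                agent.send_batch(self.buffer)
            except AgentDown:
                continue  # fail over to the next agent
            self.buffer = []
            return
        raise RuntimeError("no agent available; events stay buffered")


# agent-1 is down (for example, mid rolling restart), so agent-2 takes over.
agents = [FakeAgent("agent-1", up=False), FakeAgent("agent-2")]
client = BatchingClient(agents, batch_size=3)
for i in range(7):
    client.append(f"event-{i}")
client.flush()  # send the final partial batch
```

With more than one agent behind each client, any single agent can be restarted without the web apps losing events.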
23. Hive Details
• Metastore contains table definitions
• Stored in a relational database
• Basically a data dictionary
• SerDes parse data
• and convert it to table/column structure
• SerDe:
• CSV, XML, JSON, Avro, Parquet, ORC files
• Or write your own (We created one for CopyBook)
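As a rough picture of what the deserialize half of a SerDe does, the sketch below maps one raw JSON line onto a fixed list of columns. Real Hive SerDes are Java classes; this Python version only mirrors the idea, and the column names and `deserialize` helper are assumptions made for the example.

```python
import json

# Hypothetical table definition, as a metastore might describe it:
# (column name, column type) in table order.
COLUMNS = [
    ("id", int),
    ("user_name", str),
    ("text", str),
    ("retweet_count", int),
]


def deserialize(line):
    """Turn one raw JSON line into a row tuple matching COLUMNS.

    Fields missing from the record become None (NULL); fields that are
    not in the table definition are ignored.
    """
    record = json.loads(line)
    row = []
    for name, col_type in COLUMNS:
        value = record.get(name)
        row.append(col_type(value) if value is not None else None)
    return tuple(row)


raw = '{"id": "42", "user_name": "example_user", "text": "hello", "extra": true}'
row = deserialize(raw)
print(row)
```

The table definition lives apart from the data, which is why the same files can be exposed through different tables just by changing the metastore entry.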
29. I don’t like our data
• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes
30. I’d rather use Avro
• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages
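Schema evolution is easiest to see with a small example. The sketch below is a simplified, stdlib-only imitation of one Avro resolution rule: a field that exists only in the reader's schema is filled from its default, and a field that exists only in the writer's schema is dropped. It does not use the Avro library, and the field names and the `resolve` helper are hypothetical.

```python
# Writer schema: what the old files were written with.
writer_fields = [{"name": "id"}, {"name": "text"}, {"name": "legacy_flag"}]

# Reader schema: a later version that drops one field and adds another
# with a default value.
reader_fields = [
    {"name": "id"},
    {"name": "text"},
    {"name": "lang", "default": "en"},
]


def resolve(record, writer_fields, reader_fields):
    """Project a record written with one schema onto a newer reader schema."""
    writer_names = {f["name"] for f in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"id": 1, "text": "hello", "legacy_flag": True}
new_view = resolve(old_record, writer_fields, reader_fields)
print(new_view)
```

Because the writer's schema travels in the file, old data stays readable after the table definition changes, which is the "sensitive to changes" problem from the JSON slide.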
31. Let's convert
• CREATE TABLE avro_tweets
• INSERT INTO avro_tweets
SELECT …. FROM tweets
37. Oracle Connectors for Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL
38. Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression
39. Oracle SQL Connector for Hadoop
• Run a Java app
• Creates an external table
• Runs MapReduce when the external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression
40. Big Data SQL
• Also external table
• Can also use Hive metastore for schema
• But …. NO MapReduce
• Instead – an agent will do SMART SCANS
• Bloom filters
• Storage indexes
• Filters
• Supports any Hadoop data format
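The two pruning ideas named above can be illustrated generically. The sketch below shows a small Bloom filter (a compact set that may report false positives but never false negatives) and a min/max "storage index" used to skip blocks. It is a teaching example only: the slides do not describe how Big Data SQL implements either, and all names and sizes here are assumptions.

```python
import hashlib


class BloomFilter:
    """Compact membership test: 'definitely not present' or 'maybe present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def blocks_to_scan(blocks, low, high):
    """Storage-index idea: skip blocks whose min/max range cannot match."""
    return [b["name"] for b in blocks if b["max"] >= low and b["min"] <= high]


# Keys from the small side of a join, shipped to the scan as a filter.
join_keys = BloomFilter()
for user_id in (101, 202, 303):
    join_keys.add(user_id)

# Per-block min/max values for the filtered column.
blocks = [
    {"name": "block-0", "min": 1, "max": 100},
    {"name": "block-1", "min": 101, "max": 200},
    {"name": "block-2", "min": 201, "max": 300},
]
selected = blocks_to_scan(blocks, low=150, high=250)
```

Both techniques discard data close to where it is stored, so fewer rows have to travel to the database.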