Efficient processing of large and complex XML documents in Hadoop

Sujoe Bose
Senior Principal, Sabre Holdings
June, 2013
  
Presentation Outline

§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and Storage considerations
§ Other types of storage/processing formats

Confidential
  
You will learn about …

§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing
  
Motivation

§ Prevalence of XML and its derivatives
  – Spurred by WebServices and SOA
  – Preferred communication format until newer formats entered
  – Data and logs represented in XML
§ XML – metadata combined with data
  – Flexibility vs. Complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data
  
Challenges

§ Parsing XML is CPU intensive
§ Certain parsers/parsing methods result in more memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XML makes the problem worse
§ Presence of tags in data results in high I/O due to storage size
§ Special handling of optional fields
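The point about parsing methods and memory can be illustrated with a minimal Python sketch. It is not from the deck: the `Orders`/`Order` sample document is made up, and the standard-library `xml.etree.ElementTree` stands in for whatever parser a real job would use. The streaming variant handles one element at a time, while the tree variant builds the whole document in memory before querying it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document: 1000 small Order elements under one root.
xml_bytes = b"<Orders>" + b"".join(
    b'<Order ID="%d" Acct="A%d"/>' % (i, i % 3) for i in range(1000)
) + b"</Orders>"


def count_orders_streaming(stream):
    """Count Order elements as they are parsed, one at a time."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "Order":
            count += 1
            elem.clear()  # release the element's attributes and text once handled
    return count


def count_orders_dom(data):
    """Parse the whole document into a tree first, then query it."""
    return len(ET.fromstring(data).findall("Order"))


streamed = count_orders_streaming(io.BytesIO(xml_bytes))
in_memory = count_orders_dom(xml_bytes)
print(streamed, in_memory)
```

Both functions give the same count; the difference is how much of the document is held in memory while counting.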
  
ETL vs. ELT

§ Hadoop generally built for EL – T
  – aka Schema-on-Read
  – Load as-is
  – Transform on Access/Query
§ Compare with Data Warehouse ETL
  – aka Schema-on-Write
  – Transform and Load
  – Queries are a lot simpler
  – Transformation and cleansing done a priori

Mix of ETL and ELT

ELT
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
  
Approaches

[Figure: the two approaches, ELT and ETL, shown in three layers. Data: XML Files, Avro Files. Processing: On-demand Parsing (ELT); ETL Pre-parsing with an Avro Schema. Interfaces: Pig UDF, Hive SerDe and MapReduce for each path.]
  
XML Pre-parsing

§ Nested Elements and Attributes
§ Representation of parsed XML Structure
§ Enter Avro!
  
Avro

§ Data serialization system
§ Specifically designed for Hadoop, but used in other environments also
§ Rich data structures: Arrays, Records, Maps etc.
§ Compact, fast, binary data format
§ Metadata stored at file level – not record level
§ Split-able – Ideal for Map-Reduce
  
Avro APIs

§ Generic Objects and Pre-generated Objects
  – Easy API including simple gets and puts
§ APIs in several languages
  – Java
  – C#
  – C/C++
  – Python
  – Ruby
  
Use-case

§ FIXML – Financial Information eXchange
  – http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
  – http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data Generator for generating large and predictable datasets
  
FIXML

§ XML Data Generator
  – http://tpox.sourceforge.net/tpoxdata.htm
§ Order: Buy and sell order of securities
  
Simple mapping

XML                                           Avro      Pig
Elements with repeated nested elements        Array     Bag
Elements with attributes and text elements    Record    Tuple
Attributes and Text Elements                  Field     Field
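
The mapping rules above can be sketched in a few lines of Python. This is an illustrative assumption, not the converter used in the talk: the sample XML (including the `Pty` element) is made up, the `text` field name is hypothetical, and plain dicts and lists stand in for Avro records and arrays.

```python
import xml.etree.ElementTree as ET


def element_to_record(elem):
    """Map one XML element to a dict (standing in for an Avro record / Pig tuple)."""
    record = dict(elem.attrib)               # attributes become fields
    text = (elem.text or "").strip()
    if text:
        record["text"] = text                # hypothetical field name for element text
    children = {}
    for child in elem:
        children.setdefault(child.tag, []).append(element_to_record(child))
    for tag, items in children.items():
        # Repeated nested elements become a list (Avro array / Pig bag);
        # a single nested element stays a nested record.
        record[tag] = items if len(items) > 1 else items[0]
    return record


# Made-up sample in the spirit of a FIXML order.
sample = """
<FIXML v="5.0">
  <Order ID="1" Acct="A100">
    <Pty ID="P1" R="1"/>
    <Pty ID="P2" R="2"/>
    <OrdQty Qty="500"/>
  </Order>
</FIXML>
"""

fix_order = element_to_record(ET.fromstring(sample))
print(fix_order["Order"]["Acct"])
print([p["ID"] for p in fix_order["Order"]["Pty"]])
```

Deciding "array or record" from the number of children seen is a simplification; with a declared schema the type of each field is fixed up front, so an array stays an array even when it holds one item.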
  
Avro Schema

{
  "type": "record",
  "name": "FIXOrder",
  "namespace": "com.sabre.fixml",
  "doc": "Definition and mapping for FIX Orders",
  "mapping": "/FIXML",
  "fields": [
    { "name": "v", "type": "string", "mapping": "@v" },
    { "name": "r", "type": "string", "mapping": "@r" },
    { "name": "s", "type": "string", "mapping": "@s" },
    { "name": "Order", "mapping": "Order", "type":
      {
        "name": "OrderRecord", "mapping": "Order", "type": "record", "fields": [
          { "name": "ID", "type": "string", "mapping": "@ID" },
          { "name": "ID2", "type": "string", "mapping": "@ID2" },
          { "name": "OrignDt", "type": "string", "mapping": "@OrignDt" },
          { "name": "TrdDt", "type": "string", "mapping": "@TrdDt" },
          { "name": "Acct", "type": "string", "mapping": "@Acct" },
          { "name": "AcctTyp", "type": "string", "mapping": "@AcctTyp" },
          { "name": "DayBkngInst", "type": "string", "mapping": "@DayBkngInst" },
          { "name": "BkngUnit", "type": "string", "mapping": "@BkngUnit" },
          { "name": "PreallocMeth", "type": "string", "mapping": "@PreallocMeth" },
          { "name": "AllocID", "type": "string", "mapping": "@AllocID" },
          { "name": "CshMgn", "type": "string", "mapping": "@CshMgn" },
          { "name": "ClrFeeInd", "type": "string", "mapping": "@ClrFeeInd" },
          ...
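
The deck does not show how the `mapping` attributes are evaluated, so the following Python sketch is only one plausible reading: a mapping starting with `@` is taken as an attribute of the current element, and any other mapping as a child element holding a nested record. The schema is an abbreviated subset of the one on the slide, and the `extract` helper and sample document are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated subset of the slide's schema; "mapping" is a custom attribute.
schema = json.loads("""
{
  "type": "record", "name": "FIXOrder", "mapping": "/FIXML",
  "fields": [
    {"name": "v", "type": "string", "mapping": "@v"},
    {"name": "Order", "mapping": "Order", "type": {
      "name": "OrderRecord", "type": "record", "mapping": "Order",
      "fields": [
        {"name": "ID", "type": "string", "mapping": "@ID"},
        {"name": "Acct", "type": "string", "mapping": "@Acct"}
      ]}}
  ]
}
""")


def extract(record_schema, elem):
    """Build a dict for one record schema from the XML element it maps to."""
    out = {}
    for field in record_schema["fields"]:
        mapping = field["mapping"]
        if mapping.startswith("@"):
            # Assumption: "@name" selects an attribute of the current element.
            out[field["name"]] = elem.get(mapping[1:])
        else:
            # Assumption: anything else names a child element with a nested record.
            child = elem.find(mapping)
            out[field["name"]] = None if child is None else extract(field["type"], child)
    return out


doc = ET.fromstring('<FIXML v="5.0" r="x"><Order ID="42" Acct="A100" Extra="ignored"/></FIXML>')
row = extract(schema, doc)
print(row)
```

Only fields named in the schema are extracted, which is why the `r` and `Extra` attributes of the sample do not appear in the result.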
  
Pig Schema

FIXOrder: tuple (
    v: chararray,
    r: chararray,
    s: chararray,
    Order: tuple (
        ID: chararray,
        ID2: chararray,
        OrignDt: chararray,
        TrdDt: chararray,
        Acct: chararray,
        AcctTyp: chararray,
        DayBkngInst: chararray,
        BkngUnit: chararray,
        PreallocMeth: chararray,
        AllocID: chararray,
        CshMgn: chararray,
        ClrFeeInd: chararray,
        ...
  
Avro – Access Methods

§ Direct support for access from Hive (using SerDe)

CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig - AvroStorage
§ Avro API - Java MapReduce
  
Test Data

§ Base Securities Order file: 500,000 records
§ Replicated for volume
  – 15x - 7.5 million records
  – 30x - 15 million records
  – 45x - 22.5 million records
  – 60x - 30 million records
  – 75x - 37.5 million records
  
Comparison

[Figure: the Approaches diagram, comparing on-demand parsing of XML Files with ETL pre-parsing into Avro Files, each accessed through Pig UDF, Hive SerDe and MapReduce.]
  
File sizes: Orders

§ Base Data
  – XML file size as is: 749,337,916 (750MB)
  – Gzip Compressed: 182,687,654 (183MB)
§ Applied Avro conversion
  – Avro Snappy: 151,647,926 (152MB)
  – Avro Gzip: 107,898,177 (108MB)
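
To make the sizes above easier to compare, the following Python snippet expresses each one as a fraction of the raw XML size. The byte counts are the ones on the slide; the dictionary labels are only illustrative names.

```python
# Byte counts taken from the slide; the labels are illustrative.
sizes = {
    "XML (raw)": 749_337_916,
    "XML gzip": 182_687_654,
    "Avro Snappy": 151_647_926,
    "Avro gzip": 107_898_177,
}

raw = sizes["XML (raw)"]
ratios = {name: size / raw for name, size in sizes.items()}

for name, size in sizes.items():
    print(f"{name:12s} {size / 1e6:7.1f} MB  {100 * ratios[name]:5.1f}% of raw XML")
```

Gzipped XML is roughly 24% of the raw size, Avro with Snappy roughly 20%, and Avro with gzip roughly 14%.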
  
Storage Size Comparison

[Chart: storage size comparison.]
  
Test Environment

§ 18 Nodes
§ Node configuration:
  – 12 cores per node
  – 48GB memory
  – 36 TB with 12 disks of 3TB each
§ CDH 4.1.2
  
Sample Query

§ Security Orders per Account

order_records = LOAD '$AVRO_INPUT' using AVRO_LOAD AS (
------- Pig Schema goes here ----------
);

order_projection = FOREACH order_records GENERATE Order.Acct as Account, Order.OrdQty.Qty as Quantity;

order_group = GROUP order_projection BY Account;

order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);

STORE order_count INTO '$PIG_OUTPUT' Using PigStorage(',');
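
For readers less familiar with Pig, the same project, group and sum steps can be mirrored in plain Python. This is a sketch of the query's semantics only: the three input records are made up, and nothing here runs on Hadoop.

```python
from collections import defaultdict

# Made-up records shaped like the nested Order tuple in the Pig schema.
order_records = [
    {"Order": {"Acct": "A100", "OrdQty": {"Qty": 500}}},
    {"Order": {"Acct": "A200", "OrdQty": {"Qty": 200}}},
    {"Order": {"Acct": "A100", "OrdQty": {"Qty": 300}}},
]

# FOREACH ... GENERATE Order.Acct as Account, Order.OrdQty.Qty as Quantity
order_projection = [
    (r["Order"]["Acct"], r["Order"]["OrdQty"]["Qty"]) for r in order_records
]

# GROUP ... BY Account, then SUM(order_projection.Quantity)
order_count = defaultdict(int)
for account, quantity in order_projection:
    order_count[account] += quantity

# STORE ... Using PigStorage(','): one comma-separated line per account
for account, total in sorted(order_count.items()):
    print(f"{account},{total}")
```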
  
Run Types

§ Pre-parsed approach:
  – XML to Avro materialization: xml-to-avro
    • XML to Avro is run only once on the data
  – Avro to Pig via UDF: avro-to-pig
§ Parse on demand
  – XML parsing using Pig UDF: xml-to-pig
  
Run time in Seconds

[Chart: run time in seconds for three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]

CPU Usage Comparison

[Chart: CPU usage for three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]

Memory Usage Comparison: Total Memused (GB)

[Chart: total memory used in GB for three run types: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), and analysis on parsed XML (Avro to Pig).]
  
Results

§ Analysis on pre-parsed data compared with raw XML
  – Runtime reduction by more than 50%
  – Memory and CPU consumption reduced by about 50%
§ Pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries will benefit from one-time pre-parsing
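
The trade-off in these results can be written as a break-even calculation: pre-parsing pays off once n queries on raw XML cost more than one pre-parsing run plus n queries on Avro. The Python sketch below uses hypothetical timings, since the measured values from the charts are not reproduced in the text; `break_even_queries` is an illustrative helper, not something from the talk.

```python
import math


def break_even_queries(t_preparse, t_raw_query, t_avro_query):
    """Smallest number of queries for which pre-parsing is cheaper overall.

    Cost without pre-parsing: n * t_raw_query
    Cost with pre-parsing:    t_preparse + n * t_avro_query
    """
    saving = t_raw_query - t_avro_query
    if saving <= 0:
        return None  # pre-parsing never pays off
    return math.floor(t_preparse / saving) + 1


# Hypothetical timings in seconds, chosen only to match the pattern on the slide:
# pre-parsing costs more than one on-demand run, and an Avro query takes
# less than half as long as a raw-XML query.
n = break_even_queries(t_preparse=900, t_raw_query=600, t_avro_query=250)
print(n)
```

With these made-up numbers the one-time conversion is recovered by the third query; actual break-even points depend on the data and the mapping used.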
  
Caveats

§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of XML
§ Performance numbers can depend on the type of data and the mapping used
  
Alternatives

§ Formats other than Avro may be more suitable
§ Record Columnar formats (RC Files & ORC Files)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage for Hadoop
  
Motivation for Columnar Format

§ Map Reduce capability
§ Column Projections reduce I/O
§ Column Compression due to similarity of data further reduces I/O
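
The projection point can be illustrated with a toy Python example. It is a conceptual sketch only: the three records and the `Sym` field are made up, and in-memory lists stand in for on-disk row and column layouts.

```python
# Made-up order records in a row-oriented layout (one dict per record).
rows = [
    {"Acct": "A100", "Sym": "IBM", "Qty": 500},
    {"Acct": "A200", "Sym": "IBM", "Qty": 200},
    {"Acct": "A100", "Sym": "HPQ", "Qty": 300},
]

# Column-oriented layout: one list of values per field.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A query that only needs Qty touches a single column ...
total_qty = sum(columns["Qty"])
values_read_col = len(columns["Qty"])

# ... whereas a row layout has to read every field of every record.
values_read_row = len(rows) * len(rows[0])

print(total_qty, values_read_col, values_read_row)
```

Values within one column also tend to resemble each other (for example the repeated `Sym` values here), which is what makes per-column compression effective.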
  
Summary

§ Materialized version well-suited for repeated queries
§ For ad-hoc/experimental queries parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig and MapReduce Interfaces to access Avro Files
§ Relative trade-offs between flexibility and performance/storage
  
Questions & Comments

Thanks for Listening

sujoe.bose@sabre.com
Más contenido relacionado

La actualidad más candente

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Redis persistence in practice
Redis persistence in practiceRedis persistence in practice
Redis persistence in practiceEugene Fidelin
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowDataWorks Summit/Hadoop Summit
 
Sql vs NoSQL-Presentation
 Sql vs NoSQL-Presentation Sql vs NoSQL-Presentation
Sql vs NoSQL-PresentationShubham Tomar
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
Entity Relationship Model
Entity Relationship ModelEntity Relationship Model
Entity Relationship ModelNeil Neelesh
 
Pegasus In Depth (2018/10)
Pegasus In Depth (2018/10)Pegasus In Depth (2018/10)
Pegasus In Depth (2018/10)涛 吴
 
Advanced MySQL Query Tuning
Advanced MySQL Query TuningAdvanced MySQL Query Tuning
Advanced MySQL Query TuningAlexander Rubin
 
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...Beat Signer
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance TuningPuneet Behl
 

La actualidad más candente (20)

Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Erd examples
Erd examplesErd examples
Erd examples
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Dbms
DbmsDbms
Dbms
 
Redis persistence in practice
Redis persistence in practiceRedis persistence in practice
Redis persistence in practice
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Sql vs NoSQL-Presentation
 Sql vs NoSQL-Presentation Sql vs NoSQL-Presentation
Sql vs NoSQL-Presentation
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
D word kurs_other_2186
D word kurs_other_2186D word kurs_other_2186
D word kurs_other_2186
 
Entity Relationship Model
Entity Relationship ModelEntity Relationship Model
Entity Relationship Model
 
Pegasus In Depth (2018/10)
Pegasus In Depth (2018/10)Pegasus In Depth (2018/10)
Pegasus In Depth (2018/10)
 
Advanced MySQL Query Tuning
Advanced MySQL Query TuningAdvanced MySQL Query Tuning
Advanced MySQL Query Tuning
 
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...
Structured Query Language (SQL) - Lecture 5 - Introduction to Databases (1007...
 
MongoDB Performance Tuning
MongoDB Performance TuningMongoDB Performance Tuning
MongoDB Performance Tuning
 

Destacado

XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map ReduceEdureka!
 
Pig - Processing XML data
Pig - Processing XML dataPig - Processing XML data
Pig - Processing XML dataRam Kedem
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoopskaluska
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXMLKyong-Ha Lee
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestionVinod Nayal
 
How to Profit from Factoring 2015
How to Profit from Factoring 2015How to Profit from Factoring 2015
How to Profit from Factoring 2015Michael Ponomarew
 
Fish Sticks by Stephen C Lundin, John Christensen and Harry Paul
Fish Sticks by Stephen C Lundin, John Christensen and Harry PaulFish Sticks by Stephen C Lundin, John Christensen and Harry Paul
Fish Sticks by Stephen C Lundin, John Christensen and Harry Paulvandananicky
 
Rate zonal centrifugation and Its applications
Rate zonal centrifugation and Its applicationsRate zonal centrifugation and Its applications
Rate zonal centrifugation and Its applicationsPaul singh
 
What is system level analysis
What is system level analysisWhat is system level analysis
What is system level analysisCAST
 
Top 10 team coordinator interview questions and answers
Top 10 team coordinator interview questions and answersTop 10 team coordinator interview questions and answers
Top 10 team coordinator interview questions and answersjanritari
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesDataWorks Summit
 
Financial aspects of marketing management
Financial aspects of marketing managementFinancial aspects of marketing management
Financial aspects of marketing managementBabasab Patil
 
Moving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life StoryMoving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life StorySauce Labs
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsCloudera, Inc.
 
IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)Nurhazman Abdul Aziz
 

Destacado (20)

XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
Pig - Processing XML data
Pig - Processing XML dataPig - Processing XML data
Pig - Processing XML data
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
 
A poster version of HadoopXML
A poster version of HadoopXMLA poster version of HadoopXML
A poster version of HadoopXML
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
How to Profit from Factoring 2015
How to Profit from Factoring 2015How to Profit from Factoring 2015
How to Profit from Factoring 2015
 
Fish Sticks by Stephen C Lundin, John Christensen and Harry Paul
Fish Sticks by Stephen C Lundin, John Christensen and Harry PaulFish Sticks by Stephen C Lundin, John Christensen and Harry Paul
Fish Sticks by Stephen C Lundin, John Christensen and Harry Paul
 
Rate zonal centrifugation and Its applications
Rate zonal centrifugation and Its applicationsRate zonal centrifugation and Its applications
Rate zonal centrifugation and Its applications
 
What is system level analysis
What is system level analysisWhat is system level analysis
What is system level analysis
 
Top 10 team coordinator interview questions and answers
Top 10 team coordinator interview questions and answersTop 10 team coordinator interview questions and answers
Top 10 team coordinator interview questions and answers
 
HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Apache Hadoop on Virtual Machines
Apache Hadoop on Virtual MachinesApache Hadoop on Virtual Machines
Apache Hadoop on Virtual Machines
 
Financial aspects of marketing management
Financial aspects of marketing managementFinancial aspects of marketing management
Financial aspects of marketing management
 
Moving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life StoryMoving From a Selenium Grid to the Cloud - A Real Life Story
Moving From a Selenium Grid to the Cloud - A Real Life Story
 
Progeny LIMS
Progeny LIMSProgeny LIMS
Progeny LIMS
 
Introduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data ApplicationsIntroduction to Designing and Building Big Data Applications
Introduction to Designing and Building Big Data Applications
 
Getting Past No
Getting Past NoGetting Past No
Getting Past No
 
IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)IT Strategic Planning (Case Studies)
IT Strategic Planning (Case Studies)
 

Similar a Efficient processing of large and complex XML documents in Hadoop

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Gavin Lin
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Gavin Lin
 
Shannon Holgate: Bending non-splittable data to harness distributed performance
Shannon Holgate: Bending non-splittable data to harness distributed performanceShannon Holgate: Bending non-splittable data to harness distributed performance
Shannon Holgate: Bending non-splittable data to harness distributed performanceAnalyticsConf
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Enhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyEnhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyChunhua Liao
 
Replicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsReplicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsContinuent
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Erlang - Concurrent Language for Concurrent World
Erlang - Concurrent Language for Concurrent WorldErlang - Concurrent Language for Concurrent World
Erlang - Concurrent Language for Concurrent WorldZvi Avraham
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High CostsJonathan Long
 
DEVNET-1152 OpenDaylight YANG Model Overview and Tools
DEVNET-1152	OpenDaylight YANG Model Overview and ToolsDEVNET-1152	OpenDaylight YANG Model Overview and Tools
DEVNET-1152 OpenDaylight YANG Model Overview and ToolsCisco DevNet
 
ArangoDB – A different approach to NoSQL
ArangoDB – A different approach to NoSQLArangoDB – A different approach to NoSQL
ArangoDB – A different approach to NoSQLArangoDB Database
 
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Bowen Li
 
Entity Framework
Entity FrameworkEntity Framework
Entity Frameworkvrluckyin
 
Entity framework
Entity frameworkEntity framework
Entity frameworkicubesystem
 
3TU Datacentrum Tech Overview
3TU Datacentrum Tech Overview3TU Datacentrum Tech Overview
3TU Datacentrum Tech Overvieweposthumus
 

Similar a Efficient processing of large and complex XML documents in Hadoop (20)

Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018Dataflow in 104corp - AWS UserGroup TW 2018
Dataflow in 104corp - AWS UserGroup TW 2018
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018Dataflow in 104corp - DataConTW2018
Dataflow in 104corp - DataConTW2018
 
Shannon Holgate: Bending non-splittable data to harness distributed performance
Shannon Holgate: Bending non-splittable data to harness distributed performanceShannon Holgate: Bending non-splittable data to harness distributed performance
Shannon Holgate: Bending non-splittable data to harness distributed performance
 
Hadoop Vectored IO
Hadoop Vectored IOHadoop Vectored IO
Hadoop Vectored IO
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Enhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through OntologyEnhancing Domain Specific Language Implementations Through Ontology
Enhancing Domain Specific Language Implementations Through Ontology
 
Replicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analyticsReplicate from Oracle to data warehouses and analytics
Replicate from Oracle to data warehouses and analytics
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Erlang - Concurrent Language for Concurrent World
Erlang - Concurrent Language for Concurrent WorldErlang - Concurrent Language for Concurrent World
Erlang - Concurrent Language for Concurrent World
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High Costs
 
DEVNET-1152 OpenDaylight YANG Model Overview and Tools
DEVNET-1152	OpenDaylight YANG Model Overview and ToolsDEVNET-1152	OpenDaylight YANG Model Overview and Tools
DEVNET-1152 OpenDaylight YANG Model Overview and Tools
 
ArangoDB – A different approach to NoSQL
ArangoDB – A different approach to NoSQLArangoDB – A different approach to NoSQL
ArangoDB – A different approach to NoSQL
 
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
 
Entity Framework
Entity FrameworkEntity Framework
Entity Framework
 
Entity framework
Entity frameworkEntity framework
Entity framework
 
3TU Datacentrum Tech Overview
3TU Datacentrum Tech Overview3TU Datacentrum Tech Overview
3TU Datacentrum Tech Overview
 





Efficient processing of large and complex XML documents in Hadoop

  • 1. Efficient processing of large and complex XML documents in Hadoop
       Sujoe Bose, Senior Principal, Sabre Holdings
       June 2013
  • 2. Presentation Outline
       § Motivation
       § ETL vs. ELT
       § Avro Format
       § Mapping from XML to Avro
       § Interfaces to access Avro
       § Performance and Storage considerations
       § Other types of storage/processing formats
  • 3. You will learn about …
       § A method to store and process complex XML data in Hadoop as Avro files
       § Interfaces to access and analyze data in Avro from Hive, Java and Pig
       § Variations of the method and their relative trade-offs in storage and processing
  • 4. Motivation
       § Prevalence of XML and its derivatives
         – Spurred by Web Services and SOA
         – Preferred communication format until newer formats entered
         – Data and logs represented in XML
       § XML: metadata combined with data
         – Flexibility vs. Complexity
       § Could be arbitrarily nested and large
       § Volumes of documents: Big Data
  • 5. Challenges
       § Parsing XML is CPU intensive
       § Certain parsers/parsing methods result in more memory consumption
       § Repeated parsing for each query
       § Large and deeply nested XML documents make the problem worse
       § Presence of tags in the data results in high I/O due to storage size
       § Special handling of optional fields
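The memory point above can be illustrated with a minimal Python sketch, assuming the standard-library `xml.etree.ElementTree` parser rather than whatever parser the talk used. It streams through a document and discards each element after inspecting it, instead of building the full tree first. The function name, the `Order` tag default and the sample bytes are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

def count_orders_streaming(xml_source, tag="Order"):
    """Count elements named `tag` while streaming through the document."""
    count = 0
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's attributes, text and children. The emptied
            # element itself stays attached to its parent.
            elem.clear()
    return count

# Made-up sample input, not data from the talk.
sample = b"<Batch><Order ID='1'/><Order ID='2'/><Order ID='3'/></Batch>"
print(count_orders_streaming(io.BytesIO(sample)))
```

A tree-building call such as `ET.parse` would hold every order in memory at once, which is the behavior the slide warns about for large documents.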
  • 6. ETL vs. ELT
       § Hadoop is generally built for ELT
         – Aka Schema-on-Read
         – Load as-is
         – Transform on Access/Query
       § Compare with Data Warehouse ETL
         – Aka Schema-on-Write
         – Transform and Load
         – Queries are a lot simpler
         – Transformation and cleansing done a priori
  • 7. Mix of ETL and ELT
       ELT
       § Generally better in Flexibility
       § More suitable for simpler and well-defined formats
       § More applicable for experimentation
       § XML data parsed on demand for every query
       ETL
       § Generally better in Performance
       § More suitable when substantial cleansing and reformatting is needed
       § Repetitive queries and production workloads
       § XML data pre-parsed to minimize resource usage
  • 8. Approaches
       [Diagram. Data: XML files, Avro files. Processing: on-demand parsing; ETL pre-parsing with an Avro schema. Interfaces (for each path): Pig UDF, Hive SerDe, MapReduce.]
  • 9. ELT
       [Diagram repeated from slide 8]
  • 10. ETL
       [Diagram repeated from slide 8]
  • 11. XML Pre-parsing
       § Nested Elements and Attributes
       § Representation of parsed XML Structure
       § Enter Avro!
  • 12. Avro
       § Data serialization system
       § Specifically designed for Hadoop, but used in other environments also
       § Rich data structures: Arrays, Records, Maps etc.
       § Compact, fast, binary data format
       § Metadata stored at file level, not record level
       § Splittable: ideal for MapReduce
  • 13. Avro APIs
       § Generic Objects and Pre-generated Objects
         – Easy API including simple gets and puts
       § APIs in several languages
         – Java
         – C#
         – C/C++
         – Python
         – Ruby
  • 14. Use-case
       § FIXML: Financial Information eXchange
         – http://www.fixprotocol.org/specifications/
       § XML Database Benchmark
         – http://tpox.sourceforge.net/
       § Provides sample data for benchmarking
       § Data Generator for generating large and predictable datasets
  • 15. FIXML
       § XML Data Generator
         – http://tpox.sourceforge.net/tpoxdata.htm
       § Order: Buy and sell order of securities
  • 16. Simple mapping
       XML                                          Avro     Pig
       Elements with repeated nested elements       Array    Bag
       Elements with attributes and text elements   Record   Tuple
       Attributes and text elements                 Field    Field
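The mapping table can be sketched in Python, assuming the standard-library `xml.etree.ElementTree` parser and plain dicts and lists standing in for Avro records and arrays. The function name, the `text` field name and the sample document are illustrative assumptions, not part of the talk.

```python
import xml.etree.ElementTree as ET

def element_to_record(elem):
    """Map an element to a dict (record).

    Attributes and text become fields, child elements become nested
    records, and child elements that repeat become lists (arrays).
    """
    record = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        record["text"] = text  # hypothetical field name for text content
    children = {}
    for child in elem:
        children.setdefault(child.tag, []).append(element_to_record(child))
    for tag, items in children.items():
        # A child tag overrides an attribute of the same name in this sketch.
        record[tag] = items if len(items) > 1 else items[0]
    return record

# Made-up FIXML-like sample, not data from the talk.
sample = (
    '<FIXML v="4.4"><Order ID="103" Acct="26">'
    '<Pty ID="a"/><Pty ID="b"/><OrdQty Qty="1000"/>'
    "</Order></FIXML>"
)
record = element_to_record(ET.fromstring(sample))
print(record["Order"]["Pty"])
```

Because this sketch has no schema, an element becomes a list only when it happens to repeat in a given document. A fixed Avro schema would instead declare up front which fields are arrays.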
  • 17. Avro Schema
       {
         "type": "record",
         "name": "FIXOrder",
         "namespace": "com.sabre.fixml",
         "doc": "Definition and mapping for FIX Orders",
         "mapping": "/FIXML",
         "fields": [
           { "name":"v", "type":"string", "mapping":"@v"},
           { "name":"r", "type":"string", "mapping":"@r"},
           { "name":"s", "type":"string", "mapping":"@s"},
           { "name":"Order", "mapping":"Order", "type":
             { "name":"OrderRecord", "mapping":"Order", "type": "record", "fields": [
               { "name":"ID", "type":"string", "mapping":"@ID"},
               { "name":"ID2", "type":"string", "mapping":"@ID2"},
               { "name":"OrignDt", "type":"string", "mapping":"@OrignDt"},
               { "name":"TrdDt", "type":"string", "mapping":"@TrdDt"},
               { "name":"Acct", "type":"string", "mapping":"@Acct"},
               { "name":"AcctTyp", "type":"string", "mapping":"@AcctTyp"},
               { "name":"DayBkngInst", "type":"string", "mapping":"@DayBkngInst"},
               { "name":"BkngUnit", "type":"string", "mapping":"@BkngUnit"},
               { "name":"PreallocMeth", "type":"string", "mapping":"@PreallocMeth"},
               { "name":"AllocID", "type":"string", "mapping":"@AllocID"},
               { "name":"CshMgn", "type":"string", "mapping":"@CshMgn"},
               { "name":"ClrFeeInd", "type":"string", "mapping":"@ClrFeeInd"},
               ...
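One way such a schema could drive extraction is sketched below in Python. It assumes that a `mapping` value starting with `@` names an attribute of the current element and that any other value names a child element. That reading of the `mapping` annotation is an assumption based on the slide, and the trimmed schema, the `extract` function and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's schema, kept to a few fields for brevity.
SCHEMA = {
    "type": "record", "name": "FIXOrder", "mapping": "/FIXML",
    "fields": [
        {"name": "v", "type": "string", "mapping": "@v"},
        {"name": "Order", "mapping": "Order", "type": {
            "name": "OrderRecord", "type": "record", "mapping": "Order",
            "fields": [
                {"name": "ID", "type": "string", "mapping": "@ID"},
                {"name": "Acct", "type": "string", "mapping": "@Acct"},
            ]}},
    ],
}

def extract(record_schema, elem):
    """Build a dict for `elem` containing only the fields the schema names."""
    out = {}
    for field in record_schema["fields"]:
        mapping = field["mapping"]
        if mapping.startswith("@"):
            # Assumed meaning: "@x" is attribute x of the current element.
            out[field["name"]] = elem.get(mapping[1:])
        else:
            # Assumed meaning: a bare name is a child element (nested record).
            child = elem.find(mapping)
            out[field["name"]] = (
                None if child is None else extract(field["type"], child)
            )
    return out

# Made-up sample document; the Extra attribute is not in the schema.
doc = ET.fromstring('<FIXML v="4.4"><Order ID="103" Acct="26" Extra="x"/></FIXML>')
row = extract(SCHEMA, doc)
print(row)
```

Attributes missing from the XML come back as `None` here, which a plain `"string"` Avro type would not accept. Optional fields would need a nullable union or a default value in a real schema.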
  • 18. Pig Schema
       FIXOrder: tuple (
         v: chararray,
         r: chararray,
         s: chararray,
         Order: tuple (
           ID: chararray,
           ID2: chararray,
           OrignDt: chararray,
           TrdDt: chararray,
           Acct: chararray,
           AcctTyp: chararray,
           DayBkngInst: chararray,
           BkngUnit: chararray,
           PreallocMeth: chararray,
           AllocID: chararray,
           CshMgn: chararray,
           ClrFeeInd: chararray,
  • 19. Avro – Access Methods
       § Direct support for access from Hive (using SerDe)
           CREATE EXTERNAL TABLE <TableName>
           ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
           STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
           OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
           LOCATION 'location-of-avro-files'
           TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')
       § Access via Pig: AvroStorage
       § Avro API: Java MapReduce
  • 20. Test Data
       § Base Securities Order file: 500,000 records
       § Replicated for volume
         – 15x: 7.5 million records
         – 30x: 15 million records
         – 45x: 22.5 million records
         – 60x: 30 million records
         – 75x: 37.5 million records
  • 21. Comparison
       [Diagram repeated from slide 8]
  • 22. File sizes: Orders
       § Base Data
         – XML file size as is: 749,337,916 bytes (750 MB)
         – Gzip compressed: 182,687,654 bytes (183 MB)
       § Applied Avro conversion
         – Avro Snappy: 151,647,926 bytes (152 MB)
         – Avro Gzip: 107,898,177 bytes (108 MB)
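The relative sizes follow directly from the byte counts on this slide. A short Python sketch of the arithmetic is given here. The dictionary keys and the function name are illustrative, while the numbers are the ones reported above.

```python
# Byte counts as reported on the slide.
SIZES_BYTES = {
    "XML (as is)": 749_337_916,
    "XML gzip": 182_687_654,
    "Avro Snappy": 151_647_926,
    "Avro gzip": 107_898_177,
}

def relative_sizes(sizes, baseline="XML (as is)"):
    """Return each size as a fraction of the baseline entry."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}

ratios = relative_sizes(SIZES_BYTES)
for name, ratio in ratios.items():
    print(f"{name:12s} {ratio:6.1%}")
```

By this calculation gzip-compressed XML is roughly 24% of the raw XML size, Avro with Snappy roughly 20%, and Avro with gzip roughly 14%.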
  • 23. Storage Size Comparison
  • 24. Test Environment
       § 18 Nodes
       § Node configuration:
         – 12 cores per node
         – 48 GB memory
         – 36 TB with 12 disks of 3 TB each
       § CDH 4.1.2
  • 25. Sample Query
       § Security Orders per Account
           order_records = LOAD '$AVRO_INPUT' using AVRO_LOAD AS (
               ------- Pig Schema goes here ----------
           );
           order_projection = FOREACH order_records GENERATE Order.Acct as Account, Order.OrdQty.Qty as Quantity;
           order_group = GROUP order_projection BY Account;
           order_count = FOREACH order_group GENERATE group, SUM(order_projection.Quantity);
           STORE order_count INTO '$PIG_OUTPUT' Using PigStorage(',');
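For readers less familiar with Pig, the same project, group and sum steps can be sketched in plain Python over already-parsed records. This is an illustration of the query logic only, not how the job runs on Hadoop. The function name and the sample records are made up, and the `int` conversion is an assumption needed because quantities arrive as strings.

```python
from collections import defaultdict

def quantity_per_account(records):
    """Sum Order.OrdQty.Qty per Order.Acct, mirroring the Pig script's steps."""
    totals = defaultdict(int)
    for rec in records:
        order = rec["Order"]
        # Projection (Acct, Qty), grouping by account and SUM in one pass.
        totals[order["Acct"]] += int(order["OrdQty"]["Qty"])
    return dict(totals)

# Made-up records shaped like the nested schema; not data from the talk.
records = [
    {"Order": {"Acct": "26", "OrdQty": {"Qty": "1000"}}},
    {"Order": {"Acct": "26", "OrdQty": {"Qty": "250"}}},
    {"Order": {"Acct": "31", "OrdQty": {"Qty": "40"}}},
]
print(quantity_per_account(records))
```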
  • 26. Run Types
       § Pre-parsed approach:
         – XML to Avro materialization: xml-to-avro
           • XML to Avro is run only once on the data
         – Avro to Pig via UDF: avro-to-pig
       § Parse on demand
         – XML parsing using Pig UDF: xml-to-pig
  • 27. Run time in Seconds
       [Chart series: analysis on raw XML (XML to Pig); pre-parsing XML (XML to Avro); analysis on parsed XML (Avro to Pig)]
  • 28. CPU Usage Comparison
       [Chart series: analysis on raw XML (XML to Pig); pre-parsing XML (XML to Avro); analysis on parsed XML (Avro to Pig)]
  • 29. Memory Usage Comparison: Total Memused (GB)
       [Chart series: analysis on raw XML (XML to Pig); pre-parsing XML (XML to Avro); analysis on parsed XML (Avro to Pig)]
  • 30. Results
       § Analysis on pre-parsed data compared with raw XML
         – Runtime reduction by more than 50%
         – Memory and CPU consumption reduced by about 50%
       § Pre-parsing stage takes more resources and time than on-demand parsing
       § Repetitive queries will benefit from one-time pre-parsing
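The trade-off between a one-time pre-parsing cost and cheaper repeated queries can be sketched as a break-even calculation in Python. The function and the example costs are hypothetical: the numbers are made up for illustration and are not measurements from the talk.

```python
import math

def break_even_queries(preparse_cost, raw_query_cost, parsed_query_cost):
    """Smallest query count at which pre-parsing costs no more in total.

    Pre-parsed total: preparse_cost + n * parsed_query_cost
    On-demand total:  n * raw_query_cost
    Returns None when a pre-parsed query is not cheaper than a raw one.
    """
    saving = raw_query_cost - parsed_query_cost
    if saving <= 0:
        return None
    return math.ceil(preparse_cost / saving)

# Hypothetical costs in arbitrary units (for example, seconds of runtime).
print(break_even_queries(preparse_cost=150, raw_query_cost=100, parsed_query_cost=45))
```

With these made-up inputs each pre-parsed query saves 55 units, so the one-time cost of 150 is recovered by the third query.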
  • 31. Caveats
       § Not all fields were extracted from the XML input (optional elements)
       § Challenge in keeping up with versions/changes of XML
       § Performance numbers can depend on the type of data and the mapping used
  • 32. Alternatives
       § Formats other than Avro may be more suitable
       § Record Columnar formats (RC Files & ORC Files)
       § Trevni: a column file format supporting Avro
       § Parquet: another columnar storage for Hadoop
  • 33. Motivation for Columnar Format
       § MapReduce capability
       § Column projections reduce I/O
       § Column compression due to similarity of data further reduces I/O
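The projection argument can be illustrated with a toy Python comparison. It uses JSON text as a stand-in for the on-disk encoding, which is an assumption made for simplicity, since real columnar formats such as those named in the talk use their own binary layouts. The function names and sample rows are made up.

```python
import json

# Made-up rows; not data from the talk.
rows = [
    {"ID": "1", "Acct": "26", "Sym": "IBM", "Qty": "1000"},
    {"ID": "2", "Acct": "26", "Sym": "HPQ", "Qty": "250"},
    {"ID": "3", "Acct": "31", "Sym": "IBM", "Qty": "40"},
]

def row_layout_bytes_read(rows, wanted):
    """Row layout: every serialized row is read, whatever columns are wanted."""
    return sum(len(json.dumps(row).encode()) for row in rows)

def column_layout_bytes_read(rows, wanted):
    """Column layout: only the serialized columns named in `wanted` are read."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return sum(len(json.dumps(columns[name]).encode()) for name in wanted)

wanted = ["Acct", "Qty"]
print(row_layout_bytes_read(rows, wanted), column_layout_bytes_read(rows, wanted))
```

Storing each column contiguously also places similar values next to each other, which is the property the slide credits for better compression.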
  • 34. Summary
       § Materialized version well-suited for repeated queries
       § For ad-hoc/experimental queries parse-on-demand is better
       § Mapping from XML to Avro can be automated
       § Hive, Pig and MapReduce interfaces to access Avro files
       § Relative trade-offs between flexibility and performance/storage
  • 35. Questions & Comments
       Thanks for listening
       sujoe.bose@sabre.com