Big Data Presentation at SCQAA-SF on June 12 2013


  1. Welcome to Our June Meeting, June 13, 2013
  2. About SCQAA-SF, a Not-for-Profit Organization
     • The SCQAA-SF chapter (www.scqaa.net) sponsors the sharing of information to promote and encourage improvement in information technology quality practices and principles through networking, training, and professional development.
     • Networking: We meet once every two months in the San Fernando Valley.
     • Check us out on LinkedIn (SCQAA-SF).
     • Contact Sujit at sujit58@gmail.com or call 818-878-0834.
  3. Membership Benefits
     • Excellent speaker presentations on advancements in technology and methodology
     • Networking opportunities
     • PDU, CSTE, and CSQA credits
     • Regular meetings are free for members and include dinner
  4. Membership Policy
     • We recently revised our membership dues policy to better accommodate member needs and current economic conditions.
     • Annual membership is $50, or $35 for those who are between jobs.
     • Please check your renewal status with Cheryl Leoni. If you have recently joined or renewed, please check before renewing again.
  5. Sunil Sabat (Data Practitioner, Scientist, Architect): Insights to Big Data and Quality. Ref: Jan 2012, for SoCalCodeCamp
  6. Agenda
     • Big Data and modern data management
     • Old BI and New BI
     • Hadoop Frameworks
     • Big Data Quality – Hybrid Approach
     • Big Data Processing - ETL
     • Examples of Hadoop ETL/QA
     • Big Data QA To-Do
     • Q/A
  7. Big Data
     • Today, useful data is 80% unstructured and 20% structured
     • Old-style warehouses are not easy to build, and are very expensive to build and maintain
     • Today, business needs are real-time and driven by actionable insight
     • Big Data features volume, variety, velocity, and veracity
     • Fact - Businesses need actionable intelligence to succeed
  8. Modern Data Management Hub
  9. Obama Election and Big Data
     • “The Obama campaign found a way to integrate social media, technology, email databases, fundraising databases and consumer market data,” said GOP digital strategist Vincent Harris, who did digital work for Newt Gingrich and Rick Perry in 2012. “That does not exist on the Republican side to that degree,” to the detriment of Mitt Romney’s campaign. (Quoted in Politico, “GOP seeks to up its online game,” December 8, 2012.)
     • For more on how the Obama campaign used big data, see BusinessWeek’s November 29, 2012 article “The Science Behind Those Obama Campaign Emails.”
  10. BI = ‘Current State’ Questions (collecting transactional data)
     • What did we sell?
     • When did we sell it?
     • Where did we sell it?
     • What did we sell with it?
  11. Big Data = ‘Next State’ BI Questions (collecting behavioral temporal data)
     • What could happen?
     • Why didn’t this happen?
     • When will the next new thing happen?
     • What will the next new thing be?
     • What should happen?
  12. Comparing old and new BI data
                  Old BI data                New BI data
      Data Size   Gigabytes (Terabytes)      Petabytes (Exabytes)
      Access      Interactive and Batch      Batch
      Updates     Read / Write many times    Write once, Read many times
      Structure   Static Schema              Dynamic Schema
      Integrity   High (ACID)                Low
      Scaling     Nonlinear                  Linear
      DBA Ratio   1:40                       1:3000
      Reference: Tom White’s Hadoop: The Definitive Guide
  13. Deeper Comparison Chart
  14. Is Data Science your next Career?
  15. R-Language
  16. Hadoop – MapR, Hortonworks, Cloudera, IBM, Apache…
  17. Oracle Loader for Hadoop
  18. SQL Server Connector for Hadoop
  19. Hadoop on Azure
  20. Amazon AWS
  21. Google App Engine Data
  22. Google – MySQL & Cloud Storage
  23. Big Data QA Process
     • Hybrid approach - can use traditional Perl-like scripting, tools, and JUnit tests on the destination side
     • Use Hadoop jobs to refine and do ETL for unstructured data on the source side
     • Improve the upstream QA process to do most of the ETL/QA at the source
     • Leverage Hadoop infrastructure to do mining
     • Fact – The Big Data QA window is getting smaller
  24. Microsoft SSIS - Hadoop ETL
     • Use an ODBC driver to extract data from any Hadoop HDFS
     • Use HDInsight (Microsoft Hadoop) as the data store
     • Use SSIS for ETL
     • Source lookups from Melissa Data and others
     • Load to SQL Server
     Reference URL: http://sqlmag.com/blog/use-ssis-etl-hadoop
  25. Amazon EMR - Hadoop ETL
     • Design and code a job on Amazon AWS using EMR (Elastic MapReduce)
     • Source lookups from Melissa Data and others
     • Run the job to do ETL
     • Read and write to S3 buckets
     • Use open source Pig Latin and Java UDFs for ETL
     Reference URL: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-etl.html
  26. Google – Freebase & Refine
  27. Karmasphere Studio for Amazon Elastic MapReduce
  28. Hadoop Connector to Excel
  29. BI > Big Data QA ‘To Do’ List
     • Get trained and store some (more) data on the cloud
        • Relational and non-relational
     • Process some data in the cloud
        • Do ETL, QA
        • Try data mining
        • Learn about Data Science
     • Update your client tools
        • New UI (touch, gestures)
        • Click to Query
        • New form factors (phone, tablet)
  30. Keep Up With Big Data QA
     • Learn Big Data now (NRIT is a bootcamp training provider): learn to write ETL/QA jobs and query HDFS using ODBC
     • Assume source data is not clean; do upstream ETL and QA using lookups and reference data sets
     • Fact - Hadoop is now being used by most Fortune 500 companies for fast analytics and insights
     • Fact - Investment in Hadoop ultimately depends on BI/analytics (example: the Obama election)
     • Fact - QA matters; garbage in, garbage out is still true!
  31. Questions? Please contact NRIT at www.nritinc.com or sunil.sabat@gmail.com. Available on LinkedIn and Twitter (@ssabat).
  32. NRIT Big Data Architecture
  33. NRIT and BIG DATA BI
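
The hybrid QA flow described in slides 23 to 25 and 30 (clean and validate at the source using reference-data lookups, then run unit-test-style checks on the destination side) could be sketched roughly as follows. This is a minimal illustration in plain Python, not the speaker's implementation: the record layout (id, ZIP, amount), the `ZIP_REFERENCE` table (a stand-in for a commercial reference set such as Melissa Data), the sample rows, and all function names are hypothetical. On a real cluster the map and reduce steps would run as Hadoop jobs rather than in-process.

```python
from collections import defaultdict

# Hypothetical reference data set (ZIP code -> city); a real job would load
# this from a licensed or curated lookup source.
ZIP_REFERENCE = {"91301": "Agoura Hills", "91364": "Woodland Hills"}

# Hypothetical raw input: "id,zip,amount", deliberately including dirty rows.
RAW_LINES = [
    "1,91301,19.99",
    "2,91364,5.00",
    "3,00000,7.50",   # ZIP missing from the reference set
    "4,91301,abc",    # amount is not numeric
    "bad line",       # wrong number of fields
    "5,91364,10.01",
]


def map_record(line):
    """Source-side map step: parse, validate and enrich one raw line.

    Returns ("ok", (city, amount)) or ("reject", (line, reason)).
    """
    fields = line.strip().split(",")
    if len(fields) != 3:
        return ("reject", (line, "wrong field count"))
    _rec_id, zip_code, amount = fields
    city = ZIP_REFERENCE.get(zip_code)
    if city is None:
        return ("reject", (line, "unknown zip"))
    try:
        value = float(amount)
    except ValueError:
        return ("reject", (line, "bad amount"))
    return ("ok", (city, value))


def reduce_by_city(pairs):
    """Reduce step: total the amounts per city."""
    totals = defaultdict(float)
    for city, value in pairs:
        totals[city] += value
    return dict(totals)


def run_etl(lines):
    """Run the map step on every line, then reduce only the clean records."""
    clean, rejects = [], []
    for line in lines:
        status, payload = map_record(line)
        (clean if status == "ok" else rejects).append(payload)
    return reduce_by_city(clean), clean, rejects


def destination_checks(lines, totals, clean, rejects, max_reject_rate=0.5):
    """Destination-side QA: reconciliation checks; returns a list of problems."""
    problems = []
    if len(clean) + len(rejects) != len(lines):
        problems.append("row count mismatch")
    if abs(sum(totals.values()) - sum(v for _, v in clean)) > 1e-9:
        problems.append("totals do not reconcile")
    if lines and len(rejects) / len(lines) > max_reject_rate:
        problems.append("reject rate too high")
    return problems


totals, clean, rejects = run_etl(RAW_LINES)
problems = destination_checks(RAW_LINES, totals, clean, rejects)
print("totals:", totals)
print("rejects:", rejects)
print("problems:", problems)
```

Keeping rejected rows together with a reason, instead of silently dropping them, is what lets the destination-side checks reconcile row counts and flag a batch whose reject rate is too high, in line with the "garbage in, garbage out" point on slide 30.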

Editor's notes

  • Presentation: BI/Big Data Futures - Is it really all about the Cloud? In this survey session, SKS will bring you up to date on what's happening in the world of enterprise Business Intelligence. Big Data, NoSQL, Hadoop, Big Analytics, Cloud Storage: what does all of this mean to you as a data professional? Which products and technologies are mature enough for enterprise adoption and which ones are not? Which vendors should you be trying out and why? What is the reality of hosting enterprise data on the cloud? What are the business reasons to explore these new technologies? How do you learn to implement them? SKS frames this talk with the three major trends that she sees in the Enterprise BI space, highlighting products and technologies that warrant a deeper look.
  • From the blog - http://www.thisisthegreenroom.com/2011/data-science-vs-business-intelligence/
  • http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
  • http://www.r-project.org/
  • http://hortonworks.com/technology/hortonworksdataplatform/ http://www.cloudera.com/
  • http://www.oracle.com/technetwork/bdc/hadoop-loader/overview/index.html
  • http://www.microsoft.com/download/en/details.aspx?id=27584
  • https://www.hadooponazure.com/Account
  • http://aws.amazon.com/
  • http://code.google.com/appengine/ http://code.google.com/appengine/articles/datastore/overview.html
  • http://code.google.com
  • http://www.freebase.com/ http://code.google.com/p/google-refine/
  • http://www.youtube.com/watch?v=gjsMDAcI1Mo
  • http://dennyglee.com/
