Big Data Analytics & Trends Presentation discusses what big data is, why it's important, definitions of big data, data types and landscape, characteristics of big data like volume, velocity and variety. It covers data generation points, big data analytics, example scenarios, challenges of big data like storage and processing speed, and Hadoop as a framework to solve these challenges. The presentation differentiates between big data and data science, discusses salary trends in Hadoop/big data, and future growth of the big data market.
1. Big Data Analytics & Trends
Presentation by Dr. K. Sreenivasa Rao
Dept. of CSE, VBIT
2. Content
1. What is Big Data?
2. Why Big Data?
3. Some Definitions
4. Types of Data: Structured, Unstructured & Semi-structured
5. The Data Landscape
6. Some Other Definitions
7. Characteristics of Big Data
8. Data Generation Points
9. Big Data Analytics
10. Example Scenario
11. Challenges of Big Data
12. Hadoop, History & Complementary Packages
13. Difference between Big Data & Data Science
14. Salary Trends in Hadoop/Big Data
3. What is Big data?
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in just the last two years.
4. Why Big Data ?
• Big data has grown because of
– Increased storage capacities
– Increased processing power
– Availability of data (different data types)
– Every day we create 2.5 quintillion bytes of data, i.e. 2.5 exabytes or 2.5 million TB (1 quintillion bytes = 1 exabyte = 1000 petabytes, where 1 petabyte = 1000 TB); 90% of the data in the world today has been created in the last two years alone.
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in
just the last two years.
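The unit conversion on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, as the slide does (1 PB = 1000 TB); the function name `bytes_to_units` is illustrative only.

```python
# Decimal (SI) storage units: 1 PB = 1000 TB, 1 EB = 1000 PB.
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15
BYTES_PER_EB = 10**18  # one quintillion bytes


def bytes_to_units(n_bytes):
    """Return the same quantity expressed in TB, PB and EB."""
    return {
        "TB": n_bytes / BYTES_PER_TB,
        "PB": n_bytes / BYTES_PER_PB,
        "EB": n_bytes / BYTES_PER_EB,
    }


# 2.5 quintillion bytes per day, the figure quoted on the slide.
daily_bytes = 25 * 10**17
daily = bytes_to_units(daily_bytes)
print(daily)
```

Under these assumptions, 2.5 quintillion bytes works out to 2.5 EB, 2,500 PB, or 2.5 million TB per day.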
5. Some Definitions
• Big data is a "catch-all" term for the power of using a lot of data to solve problems. It is data that is large and complex enough that it becomes difficult to process using a single computer.
• Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. It can include many different kinds of data in many different formats.
6. Some Definitions
• Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
[Ref: Strata + Hadoop World 2016: Hadoop and Spark in spotlight]
12. Some Other Definitions
• Gartner defines big data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Big data is often characterized by the 3Vs: the extreme volume of data, the wide variety of data types and the velocity at which the data must be processed. Although big data doesn't equate to any specific volume of data, the term is often used to describe terabytes, petabytes and even exabytes of data captured over time.
13. Characteristics of Big data
Volume: (Data Quantity)
• Twitter generates about 80 MB of data per second.
• Facebook generates 10 TB of data per day.
• Black box data: a single flight generates nearly 10 TB of data every half hour.
Velocity: (Data Speed)
• eBay analyzes 5 million transactions per day.
• Velocity refers to the speed at which big data must be analyzed. It matters all the more as big data analysis expands into fields like machine learning and artificial intelligence, where analytical processes mimic perception by finding and using patterns in the collected data.
Variety: (Data Types)
• Big data includes data from e-commerce sites, health care, education, stock exchanges, banking, etc.
Varying in Time:
• [http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data]
16. Data generation Points Examples
Mobile Devices
Readers/Scanners
Science facilities
Microphones
Cameras
Social Media
Programs/ Software
17. Big Data Analytics
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: Strategic and Operational
• Effective marketing, customer satisfaction, increased
revenue
18. Example Scenario
You need to read articles, look at pictures & videos, follow links to Facebook & Twitter, etc.
21. Such big data has to be sorted, filtered & analyzed to produce useful information for decision making.
22. Perhaps Facebook may help you better identify the best gym equipment for your office.
Finally, analytics gives us useful insight or information from big data.
23. Challenges of big data:
• Problem: read 1 TB of data from a hard drive.
• Sol1: 1 machine with 4 I/O channels of 100 MBps each
• 1 TB = 1024 × 1024 MB
• = 1,048,576 MB
• = 10,485 seconds at 100 MBps
• = 174.75 minutes with 1 I/O channel
• = 174.75/4
• ≈ 43.7 minutes with 4 I/O channels
• Sol2: If 10 such machines read in parallel, it takes about 43.7/10 = 4.37 minutes to read 1 TB of data.
• That is, to analyze big data we first need to read it; today the challenge is I/O speed, not storage capacity.
• The challenge is reading and writing data, not storing it.
• Hadoop is a framework designed to address these challenges.
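The read-time arithmetic on this slide can be reproduced with a short Python sketch. It assumes an idealized model in which throughput scales perfectly with the number of I/O channels and machines, which real clusters only approximate; the function name and its parameters are illustrative, not part of any library.

```python
MB_PER_TB = 1024 * 1024  # binary convention used on the slide


def read_time_minutes(size_tb, channel_mbps=100, channels=1, machines=1):
    """Idealized time in minutes to read size_tb terabytes.

    Assumes every channel on every machine reads a disjoint part of
    the data at full speed, with no coordination overhead.
    """
    total_mb = size_tb * MB_PER_TB
    aggregate_mbps = channel_mbps * channels * machines
    return total_mb / aggregate_mbps / 60


one_channel = read_time_minutes(1)
four_channels = read_time_minutes(1, channels=4)
ten_machines = read_time_minutes(1, channels=4, machines=10)
print(f"{one_channel:.1f} min, {four_channels:.1f} min, {ten_machines:.2f} min")
```

The results agree with the slide's figures to within rounding: roughly 175 minutes on one channel, about 44 minutes on four, and a little over 4 minutes across ten machines.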
24. Hadoop
• Hadoop is an open-source, Java-based programming framework that supports the processing of large datasets in a distributed computing environment. It is an Apache project sponsored by the Apache Software Foundation.
• It is designed to answer the question “How can big data be processed at reasonable cost and in reasonable time?”
• Definition 2:
• Apache Hadoop is a framework for distributed processing of large datasets across clusters of commodity computers/hardware using a simple programming model (MapReduce).
• Commodity hardware means a large number of cheap machines, rather than a small number of high-cost, high-end servers or super/micro computers.
• Who uses Hadoop?
• India's Aadhaar scheme uses Hadoop.
• Google: its published Google File System and MapReduce papers are the basis of Hadoop's design.
• Yahoo
• Facebook, etc.
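25. History:
• Hadoop was created by Doug Cutting and Mike Cafarella around 2005 as part of the Apache Nutch project.
• In 2006 it was split out as its own project, with Yahoo as a major early contributor.
• Now it is Apache Hadoop.
• Some public cloud services that offer Hadoop:
• AWS Elastic MapReduce
• Amazon EC2/S3
• Google Cloud Dataproc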
26. Hadoop Components:
• 1. HDFS (Hadoop Distributed File System): stores data across thousands of servers to achieve high aggregate bandwidth.
• 2. MapReduce: provides a programming model for large-scale distributed processing, mapping data and reducing it to a result.
• Hadoop is a popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
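The map-then-reduce idea can be illustrated with a word count in plain Python. This is a single-process sketch of the programming model only, not the Hadoop API: on a real cluster the map and reduce steps run in parallel on many machines over data blocks stored in HDFS. The function names `map_phase`, `shuffle` and `reduce_phase` are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


documents = ["big data needs big storage", "data needs processing"]

# Each document could be mapped on a different machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]

# Each key could be reduced on a different machine.
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only its own record and each reduce call sees only one key, the work splits naturally across machines, which is what makes the model suitable for commodity clusters.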
27. Complementary software packages:
• The term Hadoop has come to refer not just to the base modules above, but also to a collection of additional software packages that can be installed on top of or alongside Hadoop, such as
• Apache Pig,
• Apache Hive,
• Apache HBase,
• Apache Phoenix,
• Apache Spark,
• Apache ZooKeeper,
• Cloudera Impala,
• Apache Flume,
• Apache Sqoop,
• Apache Oozie,
• Apache Storm.
• HBase: An open-source, non-relational distributed database.
• Hive: A data warehouse that provides data summarization.
• Pig: A high-level platform for creating programs that run on Hadoop.
• Apache Spark: A fast engine for big data processing, capable of streaming and supporting SQL, machine learning and graph processing. One survey says 80% of Hadoop projects are going to mature in 2016, and people are looking towards Apache Spark for their next projects.
28. Types of tools used in Big Data
• Where is processing hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed Storage (e.g. Amazon S3)
• What is the programming model?
– Distributed Processing (e.g. MapReduce)
• How is data stored & indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
– Analytic / Semantic Processing
29. Difference between Big data & Data Science.
• [http://www.kdnuggets.com/2015/07/data-science-big-data-different-beasts.html]
• Creating an artifact from ore requires tools, craftsmanship and science. The same is true of big data and data science; here we present the distinguishing factors between the ore and the artifact.
• Data Science looks to create models that capture the
underlying patterns of complex systems, and codify those models into
working applications. Big Data looks to collect and manage large
amounts of varied data to serve large-scale web applications and vast
sensor networks.
Although both offer the potential to produce value from data, the fundamental difference between Data Science and Big Data can be summarized in one statement:
Collecting Does Not Mean Discovering
30. Investments in data-focused activities center around
tools instead of approaches. The engineering cart
gets put before the scientific horse, leaving an
organization with a big set of tools, and a small
amount of knowledge on how to convert data into
something useful.
So, Data Science is expertise in converting data into useful information/products that answer the always-changing demands of the market.
31. Salary Trends for Bigdata/hadoop
• Big Data Hadoop Salary Trends
• 1. Average Big Data salaries have increased by 9.3% in the last 12 months. The current salary range is between $119,250 and $168,250.
• 2. A Hadoop developer making $120,000 will be valued by competitor companies at $155,000. That's a 29% hike.
• 3. On average, a new Big Data/Hadoop technology is released every 6 weeks, so make sure you stay updated.
• 4. The average salary for a Hadoop developer in San Francisco, CA, is $139,000.
• 5. A senior Hadoop developer in San Francisco, CA can earn over $178,000 on average.
• 6. Hortonworks, Paxata and Bloomberg LP are hiring top Big Data Hadoop talent for the highest pay packages.
• 7. The states with the most Hadoop/Big Data jobs are California, New York, New Jersey and Texas - duh, that was obvious :)
38. Future of Big Data
• $15 billion has been spent on software firms specializing only in data management and analytics.
• This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.
• In February 2012, the open source analyst firm Wikibon released the first market forecast for Big Data, listing $5.1B revenue in 2012 with growth to $53.4B in 2017.
• The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020.
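The growth figures on this slide can be sanity-checked with compound-growth arithmetic. The sketch below assumes simple annual compounding at a constant rate; the helper names `compound_growth` and `cagr` are illustrative.

```python
def compound_growth(rate, years):
    """Total growth multiple after compounding `rate` annually for `years`."""
    return (1 + rate) ** years


def cagr(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


# Data volume: 40% per year from 2009 to 2020 (11 years).
volume_multiple = compound_growth(0.40, 2020 - 2009)

# Wikibon forecast: $5.1B in 2012 growing to $53.4B in 2017 (5 years).
market_cagr = cagr(5.1, 53.4, 2017 - 2012)

print(f"Data volume multiple: {volume_multiple:.1f}x")
print(f"Implied market growth rate: {market_cagr:.0%} per year")
```

Under this assumption, 40% annual growth over 11 years gives a multiple of roughly 40x, the same order as the quoted 44x, and the Wikibon forecast implies revenue growing by about 60% per year.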
39. • So, data science as a career goal will improve graduates' employability in the future market.
• Big data Market Forecast
40. References
• www.Slideshare.com
• www.wikipedia.com
• www.computereducation.org
• Strata + Hadoop World 2016: Hadoop and Spark in spotlight
• http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
• http://www.information-management.com/news/big-data-analytics/the-top-5-trends-in-big-data-for-2017-10029956-1.html
• Books: Big Data by Viktor Mayer-Schonberger