2. Big data: Concept & Applications
Big data is the term for a collection of datasets so large
and complex that it becomes difficult to process them using
on-hand database management tools or traditional
data processing applications.
When the amount of data exceeds the storage
and processing capabilities of a single physical
machine, it is called big data.
Big data?
Large volumes of data
Existing tools were not designed to handle such huge data
Gigabyte Terabyte Petabyte Exabyte Zettabyte
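The unit ladder above can be made concrete in bytes. A small sketch, assuming the decimal (SI) convention in which each step is a factor of 1,000 (binary prefixes, with factors of 1,024, give slightly larger units):

```python
# The unit ladder above, expressed in bytes using decimal (SI) prefixes.
UNITS = {
    "gigabyte":  10**9,
    "terabyte":  10**12,
    "petabyte":  10**15,
    "exabyte":   10**18,
    "zettabyte": 10**21,
}

for name, size in UNITS.items():
    print(f"1 {name} = {size:,} bytes")
```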
3. Big data: Concept & Applications
Amazon collects social data, log data, and many other flavors of data.
Walmart handles more than 1 million customer transactions every hour.
Twitter: 300,000 tweets per minute
Instagram: 250,000 new pictures uploaded per minute
Email (Gmail): 5 million messages per minute
WhatsApp: 400,000 pictures per minute
Google: 5 million search requests per minute
Facebook: 2.5 million content items per minute,
500 TB per day
Bigger data requires different approaches:
techniques, tools, and architecture,
with the aim of solving new problems, or old problems in a better way.
Big data generates value from the storage and processing of
very large quantities of digital information that cannot be
analyzed with traditional computing techniques.
4. Big data: 3V
• Variety
data coming from various sources
• Velocity
real-time, live streaming data
• Volume
on the order of terabytes and petabytes
5. Big data: Concept & Applications
Big data is everywhere:
network analysis,
social networks, web graphs.
6. Big data: Volume
The volume of data is increasing every second.
Data is now measured in TB up to ZB.
The amount of data doubles roughly every two
years.
100 terabytes of data are uploaded
daily to Facebook.
100 hours of video are uploaded
every minute.
Research estimates 65% annual
growth in digital content, mainly
unstructured data.
7. Big data: Velocity
Data is created in real time.
The Internet of Things (IoT) and
social media are major
contributors to the
speed at which
data is generated.
Every minute:
25 million queries on Google
20 million photos are viewed on Flickr
over 200 million emails are sent
8. Big data: Variety
Data comes in all shapes:
structured,
semi-structured, unstructured, and
even complex structures.
90% of generated data is
'unstructured',
ranging from text to audio,
image, or video data.
9. Big Data Life Cycle
[Figure: storage capacity grows from MB (2000) to PB (2018-2025), and processing speed must scale to match.]
11. Hadoop
Apache Hadoop is a framework for storing, processing, and
analyzing big data.
• Distributed
• Scalable
• Open source
12. Why Hadoop?
CASE 1
• 1 TB of data is processed
by 1 computer.
• The computer has
4 I/O channels
of 100 MB/s each.
• Total time required:
about 44 minutes
CASE 2
• 1 TB of data is processed by 10
computers (same configuration)
in parallel.
• Total time required: about 4.4 minutes
13. HDFS (Hadoop Distributed File System)
- Stores data on the cluster
HDFS is a file system written in Java.
It provides storage for massive amounts of data:
- Scalable
- Fault tolerant
- Supports efficient processing with MapReduce
14. Hadoop: how are files stored?
- Data files are split into blocks and distributed to data nodes.
- Each block is replicated on multiple nodes (default 3x).
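The split-and-replicate idea can be sketched in a few lines. This is a toy model, not the actual HDFS placement policy (real HDFS is rack-aware); the block size matches a common HDFS default, and the node names are hypothetical:

```python
# A minimal sketch of HDFS-style storage: split a file into fixed-size
# blocks and assign each block to several data nodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, nodes):
    """Return a list of (block_index, [replica nodes]) assignments."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = []
    for i in range(n_blocks):
        # simple round-robin replica placement (real HDFS is rack-aware)
        replicas = [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
        placement.append((i, replicas))
    return placement

layout = place_blocks(400 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
for block, replicas in layout:
    print(f"block {block} -> {replicas}")
```

A 400 MB file yields 4 blocks of up to 128 MB, each stored on 3 distinct nodes, so losing any single node loses no data.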
21. Hadoop
Hadoop = HDFS + MapReduce
Hadoop HDFS commands are similar
to Unix commands.
MapReduce is a programming model.
Hive: data manipulation (SQL-like)
Pig: data manipulation using scripts
Sqoop: import and export on HDFS
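The MapReduce programming model can be illustrated with the classic word count, written as explicit map, shuffle (group by key), and reduce phases. This runs locally in plain Python; it only mimics the phases that Hadoop distributes across a cluster:

```python
# Toy word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups values by key, reduce sums each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

data = ["big data big tools", "data everywhere"]
print(reduce_phase(shuffle(map_phase(data))))
# {'big': 2, 'data': 2, 'tools': 1, 'everywhere': 1}
```

In real Hadoop the mappers and reducers run on different nodes and the shuffle moves data over the network, but the programmer writes only the map and reduce functions.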
22. Import/Export using Sqoop and Flume
Sqoop: transfers data between an RDBMS and HDFS.
Flume: a service to move large amounts of data in real time.
23. Applications
E-commerce (Amazon):
- Recommendation engines
- User buying patterns
- Digital marketing analysis
Telecommunications:
- Call-drop analysis
- Network problem optimization
Entertainment:
- Content analytics (Netflix)
Sports:
- Fitness management (Fitbit)
Health care:
- Early disease detection (Pfizer)
24. Applications
Technology: big data underpins major websites such as eBay,
Amazon, Facebook, and Google.
Private sector: applications of big data in the private sector include
retail, retail banking, and real estate.
Government: big data is also used by governments, for example the Indian government.
International development: advances in big data analysis
provide cost-effective opportunities to improve decision-making in critical
development areas such as health care, employment, crime,
security, and natural disasters. In this way, big data supports
international development.
Bar-Noy, Basu, Johnson, Ramanathan, “Minimum-cost Broadcast through Varying-size Neighborcast”, Algosensors 2011, Germany, Sept 2011
Johnson, Phelan, Bar-Noy, Basu, Ramanathan, “Minimum-cost Broadcast through Varying-size Neighborcast”, Draft for submission to IEEE ToN (ToN paper has some more hardness results, simulation study and comparisons)
The problem of interest is to broadcast a message originating at a source node to all nodes in the network.
Source and relay nodes can multicast to a subset of their neighbors (and they may also perform multiple multicasts to disjoint sets of neighbors). If a node multicasts to a subset of k of its neighbors, the incurred cost is 1 + A k^b, where A and b are non-negative constants; the '1' represents the normalized cost of the (first) transmission, and the second term the cost of ACKs and re-transmissions. The work also considers the case where the second term is either a sub-linear or a super-linear function of k. The minimum-cost problem is formulated as an integer programming problem, and is NP-hard for a range of b expressed as a function of A.
The top line in the table is, in fact, a very important result: if b > g(A) := log2( 2 + 1/A), then multicast cannot outperform unicast; thus, the spanning tree is optimal.
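The threshold can be sanity-checked numerically. Under the stated cost model, multicasting to k neighbors costs 1 + A k^b, while reaching them with k separate unicasts costs k(1 + A); the sketch below, a check of the formulas rather than anything from the paper itself, shows that at b = g(A) = log2(2 + 1/A) a multicast to two neighbors costs exactly the same as two unicasts, and above the threshold unicast wins:

```python
# Sanity check of the cost model: multicast to k neighbors costs
# 1 + A*k**b; k separate unicasts cost k*(1 + A). At b = g(A) the
# two options tie for k = 2.
import math

def multicast_cost(k, A, b):
    return 1 + A * k**b

def unicast_cost(k, A):
    return k * (1 + A)  # k transmissions, each reaching one neighbor

def g(A):
    return math.log2(2 + 1 / A)

A = 1.0
b = g(A)                                   # log2(3), about 1.585
print(multicast_cost(2, A, b))             # ~4.0
print(unicast_cost(2, A))                  # exactly 4.0
print(multicast_cost(2, A, b + 0.1) > unicast_cost(2, A))  # True
```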
If b = 0, the problem reduces to the connected dominating set (CDS) problem, for which approximability results are known; the approximation ratio is H_Δ + 2.
If b = 1, the problem reduces to minimizing the number of transmitters (equivalently, the maximum-leaf spanning tree); a polynomial-time algorithm with approximation ratio 2 is known; the paper improves the approximation ratio by using a pruned-CDS approach.
For b > g(A), spanning tree is optimal
For b < g(A), the problem is shown to be NP-hard
For 1 < b < g(A), the paper shows that a spanning tree has very good approximation ratio (less than 2)
For 0 < b < 1, a greedy algorithm is proposed and its approximation ratio derived. Note that the approximation ratio improves with larger b and smaller Δ
Overall note that the approximation ratio becomes worse for smaller b
Note: the network size 'n' plays a part in the 'inapproximability' results.
The model assumes a known cost function, but the exponent 'b' depends both upon the actual protocol and upon the operating environment (e.g., congestion). Thus 'b' may vary and may be hard to estimate. How sensitive is the proposed algorithm to errors in estimating 'b'? The figure on the right shows cost (as incurred by the proposed algorithm) vs. the actual 'b' of the underlying cost function; the black curve is the 'optimal' one, which uses the true value of 'b'; the performance of the algorithm when 'b' is assumed fixed at some value is also shown.
Here, Δ is the maximum node degree in the graph
H_n = n-th harmonic number = 1 + 1/2 + 1/3 + 1/4 + ... + 1/n ~= ln(n) + \gamma + a small correction,
where \gamma is the Euler-Mascheroni constant, approximately 0.5772.
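The harmonic-number approximation used in the H_Δ + 2 ratio above is easy to verify numerically; this small check (not from the paper) also shows the error shrinking roughly like 1/(2n):

```python
# Numerical check of H_n ~ ln(n) + gamma, where gamma is the
# Euler-Mascheroni constant (~0.5772).
import math

GAMMA = 0.5772156649

def harmonic(n):
    return sum(1 / k for k in range(1, n + 1))

for n in (10, 100, 1000):
    approx = math.log(n) + GAMMA
    print(n, harmonic(n), approx, harmonic(n) - approx)
```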