An overview of the Big Data field: interesting patterns in how data is used for data mining, predictive analytics and machine learning, plus an overview of the jobs generated by Big Data demand.
Interesting ways Big Data is used today
1. Interesting ways Big Data is used today
Daniel Sarbe
May 2015, Big Data Romanian Tour - Timisoara
2. Agenda
1. Source of (Big)Data
2. Why now?
3. Interesting patterns of using BigData
4. BigData – Big Opportunities
3. “There is a big data revolution.
But it is not the quantity of data that is revolutionary.
The big data revolution is that now we can do something with the data.”
Gary King, professor at Harvard University
4. “In God we Trust, all others bring data”
William Edwards Deming - American statistician
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.”
Jim Barksdale, former Netscape CEO
15. Data Mining & Machine Learning
• Data Mining - The process of discovering meaningful correlations, patterns and
trends by sifting through large amounts of data
• Machine Learning is the study of computer algorithms that improve automatically
through experience
▫ Supervised machine learning - The program is “trained” on a pre-defined set of
“training examples”, which then facilitate its ability to reach an accurate conclusion when
given new data.
▫ Unsupervised machine learning - The program is given a bunch of data and must find
patterns and relationships therein.
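The two learning styles above can be contrasted with a minimal sketch. It assumes a hypothetical one-feature data set (suspicious words per email); the function names `train_centroids`, `predict` and `two_means` are illustrative and not taken from any library. The supervised part learns from labelled examples, while the unsupervised part is a simple 1-D k-means with two clusters that receives no labels.

```python
from statistics import mean

# Supervised: labelled training examples -> nearest-centroid classifier.
def train_centroids(examples):
    """examples: list of (feature_value, label). Returns {label: mean feature}."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict(centroids, x):
    """Assign x the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Unsupervised: no labels; 1-D k-means with k=2 finds the groups itself.
# Assumes the values are not all identical (otherwise one cluster is empty).
def two_means(values, iterations=10):
    lo, hi = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(left), mean(right)
    return sorted(left), sorted(right)

# Hypothetical toy data: number of suspicious words per email.
training = [(0, "ham"), (1, "ham"), (2, "ham"),
            (8, "spam"), (9, "spam"), (10, "spam")]
model = train_centroids(training)
print(predict(model, 7))               # label for a new, unseen email
print(two_means([0, 1, 2, 8, 9, 10]))  # groups found without any labels
```

The same numbers are used in both halves on purpose: the supervised model is told which emails are spam, whereas the clustering step only sees the raw values and has to separate them on its own.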
18. Predictive Analysis
• Predictive analytics is an area of data mining that deals with extracting
information from data and using it to predict trends and behavior patterns.
• The accuracy and usability of results will depend greatly on the level of data
analysis and the quality of assumptions
19. BigData used for predictions – 2012 US Election
The 2012 Election: A Big Win for Big Data
• Statistician Nate Silver gave Barack Obama over a 90 percent chance of victory in the Electoral College.
• The 538 in the name of his model and site (FiveThirtyEight) is the number of electors in the US Electoral College.
• In 2008 (John McCain vs Barack Obama) his mathematical model correctly called 49 out of 50 states, missing only Indiana, which went to Obama by about 1 percentage point.
• In 2012 Silver's model correctly predicted all 50 states.
• He incorporated hundreds of state-level polls into his analysis, along with economic variables, demographics, past electoral outcomes, historical polls and party registration figures.
• While some analysts might cherry-pick data sources according to whether they were qualitatively "reliable" or "unbiased", Silver incorporated them all and looked at trends over time.
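The idea of using every available poll rather than a hand-picked few can be illustrated with a small sketch. This is not Silver's actual model, which is far more elaborate; the states, poll margins and electoral vote counts below are invented, and `average_margins` and `tally` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical poll margins (candidate A minus candidate B, in points) and
# electoral votes for three made-up states; none of this is real polling data.
polls = {
    "State X": [2.0, -1.0, 3.0, 1.5],
    "State Y": [-4.0, -2.5, -3.0],
    "State Z": [0.5, 1.0, -0.5, 0.8],
}
electoral_votes = {"State X": 20, "State Y": 15, "State Z": 10}

def average_margins(polls):
    """Use every poll for a state instead of cherry-picking one."""
    return {state: mean(margins) for state, margins in polls.items()}

def tally(avg_margins, electoral_votes):
    """Award each state's electors to whoever leads its poll average."""
    a = sum(ev for s, ev in electoral_votes.items() if avg_margins[s] > 0)
    b = sum(ev for s, ev in electoral_votes.items() if avg_margins[s] < 0)
    return a, b

margins = average_margins(polls)
print(tally(margins, electoral_votes))  # electoral votes for A and B
```

Note that a single outlier poll (such as the -1.0 in State X) does not flip the state once it is averaged with the others, which is the practical benefit of aggregation.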
20. BigData used for predictions - 2014 Sochi Winter Olympics
• “Canada will enjoy their best Olympics ever,
while the U.S. and host Russia will struggle."
21. BigData used for predictions - 2014 Sochi Winter Olympics
• The analysts used publicly available data on all Winter Olympic Games from 1924 forward
• The model's inputs are Gross Domestic Product (GDP), year, whether the country is
communist or not, whether the country is a host or not, the population of that country, and
its historical performances and medal counts in previous Olympics.
• All variables are given the same weight in the model
• The medal count prediction is based on a linear regression model
• The algorithm is based on historical data, and doesn’t necessarily reflect more current
information such as emerging stars, recent funding boosts, and an unexpectedly large addition of
new events to the program.
• “Based on the above mentioned data and analysis, the analysts predict that Canadian athletes
will grab the most medals and the United States will finish seventh. Germany, Norway,
Austria, China and Russia will rank second to sixth respectively.”
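A linear regression of the kind mentioned above can be sketched with ordinary least squares. This is a simplified, single-input illustration, not the analysts' model: it uses only one hypothetical predictor (medals at the previous Games), the numbers are invented, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: medals at the previous Games vs medals at the next Games.
previous = [5, 10, 15, 20]
following = [7, 12, 17, 22]

intercept, slope = fit_line(previous, following)
prediction = intercept + slope * 25  # forecast for a country with 25 medals last time
print(intercept, slope, prediction)
```

A model with several inputs (GDP, population, host status and so on) follows the same principle but fits one coefficient per input, which is why such a forecast only reflects historical patterns and not recent changes.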
23. Big Data used in other sports
Germany Uses Big Data to Crush Brazil in World Cup Semifinal
• Forget about Moneyball - Germany has now used serious Big Data to win a World Cup match.
• Soccer, a more fluid game, was thought to be less amenable than baseball to Big Data's wiles.
• According to assistant coach Hansi Flick, team managers combed through years of research about
the Brazilian team compiled by students at Cologne's Sports University, looking for any advantage to
be gained over the Brazilian team.
• The compiled information included a detailed analysis of all Brazil's players--their favorite moves,
how they deal with high pressure scenarios, their reactions when fouled, and even how they sprint
for the ball.
24. • 3) Cost of cloud/hardware and maturity of software solutions (Hadoop ecosystem)
28. Hadoop myths debunked
• Hadoop isn’t enterprise ready
• Hadoop isn’t stable, clusters go down
• You lose data on HDFS
• Data cannot be shared across the organization
• Hadoop is not secure
• NameNode does not scale
• Software upgrades are rare
• Hadoop use cases are limited
• I need expensive servers to get more
• Hadoop is so dead
Source: Sumeet Singh - Yahoo
32. Open Data Platform
The Open Data Platform Initiative (ODP) is a shared
industry effort focused on promoting and advancing the
state of Apache Hadoop® and Big Data technologies for the
enterprise.
35. Netflix
Netflix collects a lot of data to understand how its users behave and what their
preferences are
• It collects metrics including what people watch, when they watch, where they watch,
what devices they use, ratings, searches, when users pause or stop watching, etc.
• Netflix made the House of Cards decision by identifying that subscribers who
watched the original British version of House of Cards were very likely to watch
movies starring Kevin Spacey or directed by David Fincher
• Netflix made ten different versions of the trailer for House of Cards geared towards
different audiences
▫ Fans of Kevin Spacey watched trailers that were focused on him while people who liked
female-oriented movies saw trailers that highlighted the women in the show.
36. Verizon
• 103.3 million wireless customers, 6.2 million Internet users and 5.3 million TV subscribers.
• Data collected:
▫ Calls (e.g. ordering flowers) or pages accessed
▫ Locations in city + roaming
▫ Home + mobile web pages + television
• Formed a Precision Marketing division – e.g. event attendance information
▫ Migration from iPhone 5 to iPhone 6: did it result in a data plan increase or not?
▫ Some customers migrated from Android to iPhone, with data plan consumption 3x-5x higher
Notes:
• Customers can choose not to participate in the program by going to their privacy choices page on MyVerizon or by
calling 866-211-0874
• Verizon’s business and government customers are not part of the Precision program
37. The Perfect Milk - Digital Cow - The internet of cows
• Embedded sensors in cow stomachs
• If a cow is sick, the sensor will let a veterinarian know while there is still time to treat
the disease
• Sensor to detect the presence of E. coli bacteria
• Vital Herd, a Texas-based start-up - e-Pill - collects information about the
animal: breathing rate, heart rate, temperature, rumination time, rumen
acidity and estrogen levels
38. The City of Las Vegas
• archaic records and inaccurate information
• took advantage of smart data to develop a living
model of its utilities network
• aggregate data from various sources into a single
real-time 3D model created with Autodesk
technology for both above and below ground
utilities
40. BigData – Big Opportunities
• Big data means big IT job opportunities -- for the right people
41. Big Opportunities
• Gartner predicted in 2013 that by 2015, Big Data demand will generate 4.4
million jobs in the IT Industry all around the world.
• 1.9 million IT jobs will be created just in the U.S. That is how Big Data
directly affects the IT Industry.
• Only one third of these jobs will be filled, due to a lack of people with the
required skills
What is needed?
• A Curious Mind Is Key - The most important qualifications for these positions
aren't academic degrees, certifications, job experience or titles. Rather, they seem to
be soft skills: a curious mind, the ability to communicate with nontechnical people, a
persistent -- even stubborn -- character and a strong creative bent.
• The CIA is hiring data scientists: “We are looking for curious, creative
individuals interested in serving their country through the field of data
science.”
47. “I keep saying that the sexy job in the next 10 years will be
statisticians, and I’m not kidding.”
Hal Varian, chief economist at Google
“Without big data, you are blind and deaf and in the
middle of a freeway.”
Geoffrey Moore, author and consultant
In 60 seconds:
- Google receives over 4,000,000 search queries
- YouTube users upload 71 hours of new video
- Twitter users share 277,000 tweets
- Apple users download 48,000 apps
- 200 M emails are sent
Facebook generates 10 PB of data per day
Apple Watch: 30 M units predicted in the first year, compared with 29.2 M units for all Swiss watches sold
The first documented use of the term “big data” appeared in a 1997 paper by scientists at NASA, describing the problem they had with visualization (i.e. computer graphics) which “provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”
Cloud Market Size: $16 B, 30% Amazon, 10% Microsoft
temperature, humidity, air pressure, etc.
http://www.theguardian.com/technology/2015/apr/29/apple-ipad-fail-grounds-few-dozen-american-airline-flights - $1.2 M saved per year from paper and fuel
“Extracting knowledge from data”
“Torturing the data until it confesses”
Example:
Machine Translation
Spam filters
Face recognition
Car/housing price predictor
- From batch processing to Data Operating System
- YARN (Yet Another Resource Negotiator)
- separating the processing engine and resource management capabilities
More like an operating system, to support multiple users, multiple applications
In Hadoop 1.0, everything was batch-oriented. In 2.0, you will now have multiple apps hitting the data inside all at once.
Streaming, online, in-memory
Due to archaic records and inaccurate information, most utilities have no idea where all of their underground assets are located, resulting in those all-too-common service interruptions for residents when a power line is accidentally cut or a water line bursts. To avoid these problems, the City of Las Vegas took advantage of smart data to develop a living model of its utilities network.
VTN Consulting helped the city aggregate data from various sources into a single real-time 3D model created with Autodesk technology. The model includes both above and below ground utilities, and is being used to visualize the location and performance of critical assets located under the city.