1) The document discusses big data problems like volume, velocity, and variety of data and how tools like Hadoop and Word2Vec can help address them.
2) Hadoop is an open-source software framework that allows massive data storage and processing across multiple machines using MapReduce.
3) Word2Vec represents words as vectors to model relationships between words, allowing uses like finding similar words, analogies, and sentiment analysis.
4) Examples are given of analyzing over 10 million tweets from UK election debates within minutes using these big data tools.
2. Discussion items
• Who are Digital Contact?
• Problems with Big Data
• Hadoop
• Word2Vec
• (More) Problems with Big Data
• Election debates
3. Who are Digital Contact?
• We are a big data product company
• Focus on developing products and services for business-to-business and business-to-consumer markets
• Currently developing trading.co.uk
4. Problems with big data
• Often described as the three V’s:
1. Volume – Huge quantities of data available
2. Velocity – Data constantly produced by both people and
3. Variety – Data can be both structured and unstructured
• How can we tackle some of these problems?
6. Hadoop
• Hadoop is an open-source software framework
• Developed at Yahoo to deal with ever-increasing
amounts of content
• It allows you to store and process data in a distributed
fashion (i.e. over a number of machines)
• This allows for 2 key things: massive data storage and
faster processing
• It’s an incredibly powerful system but, as it’s relatively
new, there is little documentation on it
• Used by Amazon, eBay, Facebook, LinkedIn and many
more
7. Hadoop – Data Storage
• Hadoop allows for huge data files to be stored across
multiple machines
• Takes files and breaks them into blocks (normally
64 or 128 MB)
• Blocks are stored in data nodes and are typically
replicated across 3 nodes per block
• A master node maintains the location of the blocks and
which file they belong to – however, it doesn’t store the
blocks itself
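The storage scheme above can be sketched in a few lines. This is a minimal illustration only, assuming a simple round-robin placement and made-up node names; real Hadoop placement is more sophisticated (for example, it takes rack layout into account). The function name `plan_blocks` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # one of the typical block sizes mentioned above
REPLICATION = 3       # typical number of copies per block


def plan_blocks(file_size_mb, data_nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a master-node-style index: block id -> data nodes holding a copy.

    The index records only locations; the block contents live on the data nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    index = {}
    for block_id in range(n_blocks):
        # Round-robin placement (an assumption made for this sketch).
        index[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return index


# Hypothetical cluster of four data nodes storing a 500 MB file.
index = plan_blocks(500, ["node-a", "node-b", "node-c", "node-d"])
for block_id, nodes in index.items():
    print(block_id, nodes)
```

A 500 MB file needs four 128 MB blocks, and each block ends up on three different nodes.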
9. Hadoop – data storage
• Allows for complete redundancy – data nodes are easily replaceable
• Allows for faster access to the data – system can request data from 3 places and use the fastest response
• Storage is reduced to 1/3 capacity but:
• Files can be read in a compressed format
• Redundancy is worth the cost
• Higher failure rates permissible for data nodes
• Storage is cheap!
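The trade-off above is simple arithmetic. The sketch below assumes three copies of every block and uses hypothetical helper names; it shows why usable capacity drops to a third, and why a block stays readable as long as one of its copies survives.

```python
def usable_capacity_tb(raw_tb, replication=3):
    """Raw disk space divided by the number of copies kept of each block."""
    return raw_tb / replication


def block_readable(replica_nodes, failed_nodes):
    """A block can still be read if at least one node holding it is alive."""
    return any(node not in failed_nodes for node in replica_nodes)


print(usable_capacity_tb(30))                                     # 30 TB raw
print(block_readable(["a", "b", "c"], failed_nodes={"a", "b"}))   # one copy left
print(block_readable(["a", "b", "c"], failed_nodes={"a", "b", "c"}))
```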
10. Hadoop – data processing
• Once the data’s in, how is it processed?
• One major component of Hadoop is MapReduce
• Doesn’t try to process everything all at once
• Instead, processes chunks of data and tallies up results
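The "process chunks, then tally" idea can be illustrated with a word count run in a single Python process. This is a simplified sketch, not how Hadoop itself is implemented: a real MapReduce job emits key/value pairs, shuffles them by key across machines, and runs many mappers and reducers in parallel. The sample lines and function names are made up for illustration.

```python
from collections import Counter
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_chunk(lines):
    """Map step: count the words in one chunk only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial tallies."""
    return left + right


lines = ["big data is big", "data is everywhere", "hadoop processes big data"]
partials = [map_chunk(chunk) for chunk in chunked(lines, 2)]
totals = reduce(reduce_counts, partials, Counter())
print(totals)
```

Each chunk is counted independently, so the map step could run on a different machine per chunk; only the small partial tallies need to be combined.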
12. Hadoop – data processing
• Designed for massive data sets
• Not suitable for processing small sets quickly (although other tools on Hadoop can do this
in real time)
• Allows users to stream data through other programming languages
• During the most recent debate, we were able to extract named entities and sentiment from 10,000,000
tweets in 3 minutes 30 seconds! (more on this later)
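Streaming works by passing lines of text to a program on standard input and reading tab-separated key/value lines back from its standard output, with the reducer receiving its input sorted by key. The sketch below simulates that flow locally. The tracked terms and sample lines are hypothetical, and the functions take plain iterables so they can be run without a cluster; in a real streaming job each would be a separate script reading `sys.stdin`.

```python
from itertools import groupby

TRACKED_TERMS = ("hadoop", "word2vec")  # hypothetical terms to count


def mapper(lines):
    """Emit 'term<TAB>1' for every tracked term found in a line."""
    for line in lines:
        text = line.lower()
        for term in TRACKED_TERMS:
            if term in text:
                yield f"{term}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


sample = ["Hadoop is fast", "word2vec and hadoop", "nothing relevant here"]
# Local stand-in for the map -> sort -> reduce pipeline.
results = list(reducer(sorted(mapper(sample))))
for line in results:
    print(line)
```

For scale, 10,000,000 tweets in 3 minutes 30 seconds (210 seconds) works out to roughly 47,600 tweets per second.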
13. Working with data
• Hadoop can help with volume and velocity of data – what about
variety?
• Need methods to add structure to unstructured data
• For working with text, we’ve been looking at Word2Vec
14. Word2Vec
• Developed and released as an open-source project by Google
• Described as a ‘really, really big deal’ by the head of Kaggle (a data science
competition website)
• Works by representing every word as a vector (a series of numbers for each word
showing how likely it is to be found in relation to other words)
• Trains by taking a word and working out how likely other words are to come
before and after it
• It’s maths with words
• Allows you to do some really interesting stuff…
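The "words that come before and after" idea can be made concrete by listing the (word, neighbour) pairs a model would learn from. This sketch covers only that pair-extraction step, assuming a fixed window size; it does not train any vectors, and the function name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window=2):
    """List (target, neighbour) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = context_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
```

Training then adjusts each word's vector so that words appearing in similar contexts end up with similar vectors.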
16. Word2Vec uses
• Works well as a thesaurus
• Able to look for similar words and find odd ones out
• Useful to overcome issues around synonymy
• Even more helpful is that it models relationships between words
• We can see this when we model the words in a 2D space
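Similar-word lookup and odd-one-out both come down to comparing vectors, commonly with cosine similarity. The sketch below uses tiny hand-made vectors chosen purely for illustration; a trained model would have hundreds of dimensions and values learned from text.

```python
import math

# Hand-made toy vectors (an assumption for this sketch, not trained output).
VECTORS = {
    "happy":  [0.90, 0.10, 0.00],
    "joyful": [0.80, 0.20, 0.10],
    "glad":   [0.85, 0.15, 0.05],
    "table":  [0.00, 0.10, 0.90],
}


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors=VECTORS):
    """Thesaurus-style lookup: the other word with the closest vector."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


def odd_one_out(words, vectors=VECTORS):
    """The word whose average similarity to the rest is lowest."""
    def average_similarity(word):
        rest = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in rest) / len(rest)
    return min(words, key=average_similarity)


print(most_similar("happy"))
print(odd_one_out(["happy", "joyful", "glad", "table"]))
```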
20. Word2Vec uses
We can also add and subtract words for more information:
• King + Woman – Man = Queen
• London + France – England = Paris
• Bigger – Big + Cold = Colder
• Sushi – Japan + Germany = Bratwurst
• Cu – Copper + Gold = Au
• Windows – Microsoft + Google = Android
• Tim Cook – Apple + Microsoft = Satya Nadella
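Word arithmetic like this is vector addition and subtraction followed by a nearest-neighbour search. The sketch below uses two-dimensional toy vectors invented for illustration (one axis loosely standing for "royalty", the other for "gender"); real Word2Vec vectors are learned and their dimensions have no such neat labels.

```python
import math

# Toy vectors: [royalty, gender]. Values are made up for this sketch.
TOY = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors):
    """Return the word nearest to a + b - c, ignoring the three inputs."""
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# King + Woman - Man
answer = analogy("king", "woman", "man", TOY)
print(answer)
```

Excluding the input words from the candidates matters in practice, because the nearest vector to the result is often one of the inputs themselves.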
23. Word2Vec uses
Wide range of applications for this model:
• Answering queries
• Understanding meaning of new words
• Easy to understand results
• Good for finding similar documents in a large corpus
• Intelligent localised searches
• Machine Translation
• Detecting sarcasm
• Sentiment analysis
• Pub quizzes…
24. (More) Problems with big data
• More V’s for data science to deal with:
1. Veracity – Data contains noise – need to keep data ‘clean’
2. Validity – Data needs to be correct and fit for purpose
3. Volatility – Data needs to be relevant to the analysis
4. Viewership – Results need to be appropriate to the audience
• Quick case study
25. Leaders’ Debates
• Over 10,000,000 election tweets
• Looked for mentions of parties or leaders
• Analysed tweets for sentiment
• Gave interesting insights into debates
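The steps above (find mentions, then score sentiment) can be sketched with a simple word-list approach. This is an illustrative stand-in only, not the method actually used for the debates: the entity aliases, the word lists, and the function names are all hypothetical, and real sentiment analysis needs far more than keyword matching.

```python
from collections import defaultdict

# Hypothetical entities and word lists, for illustration only.
ENTITIES = {
    "party_a": ["party a", "leader a"],
    "party_b": ["party b", "leader b"],
}
POSITIVE = {"great", "strong", "good"}
NEGATIVE = {"weak", "bad", "awful"}


def score(text):
    """Positive words minus negative words in one tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def tally(tweets):
    """Count mentions and sum sentiment for each entity."""
    totals = defaultdict(lambda: {"mentions": 0, "sentiment": 0})
    for tweet in tweets:
        text = tweet.lower()
        for entity, aliases in ENTITIES.items():
            if any(alias in text for alias in aliases):
                totals[entity]["mentions"] += 1
                totals[entity]["sentiment"] += score(text)
    return dict(totals)


tweets = [
    "Leader A was great tonight",
    "Party B looked weak and bad",
    "Party A strong on policy",
    "no politics here",
]
summary = tally(tweets)
print(summary)
```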
29. Leaders’ Debates
• Data was processed with Hadoop within 5 minutes of debate being finished
• Analysed 10,000,000 tweets and extracted relevant information
• Able to provide a clear picture of social media
• Interesting result in second debate…
31. Final Points
• Huge number of tools and methods for dealing with Big Data
• Good idea to work out what you want to find
• Is your data big? Can it be made bigger?
• Are your results useful? Can they be improved?
• Have fun!