2. Social Media - Twitter
• What can we learn from Twitter?
• 400 million tweets per day
source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter
• 218 million users
source: http://techcrunch.com/2013/10/03/bweeting/
• Excellent source of sentiment
• Excellent source of big data
• Prototyping
• Modeling natural language
• Resume padding
3. Social Media - Twitter
• How do we get at the data?
• Twitter provided APIs:
• https://dev.twitter.com/docs
• Streaming
• Set up a real time data stream (json) based on keywords
• REST (v1.1)
• Make REST requests, and get results
• Possible parameters:
• Geospatial bounding box
• By time
• By user, hashtag, retweets etc
• Fire hose
• Big $$$. Big data
4. Social Media - Twitter
• Information & Obstacles
• Who
• What
• At best: Plain English (!)
• Worse: (Spanish or Arabic or Portuguese...)
• Worst: “Textspeak” symbols :-0, UTF8 chars, etc.
• Absolute Worst: combination of all of them
• Where
• 1-2% with latitude / longitude
• Geocode
• When
6. Cloud Computing
• What does cloud computing bring to the table?
• Amazon’s EC2:
• Commoditized hardware
• Low cost
• Only charged for resources you use
• No long term commitments
• Scalable
• "Throwaway" mentality
**IF** you play by their rules!
7. Cloud Computing – AWS
• Tools
• Virtual Machines
• # of Processors, RAM, OS, disk capacity and I/O – all configurable
• Price range: $.02/hr - $4.60/hr
• Licensed OSes cost 50% more than Linux OSes
• Archive Storage
• S3 / Glacier
• Work Queues
• SQS
• Data Stores
• Dynamo (key value store), Red Shift (analysis store)
• Virtual Networking
• Routers, VPN gateways, access control lists, etc
• APIs
• Command line
• HTTPS REST
• Native programming languages (Python, bash, PHP, Java etc.)
Ideal for rapid prototyping / proof of concepts
8. Cloud Computing – AWS
• APIs
• Basic
• Start an instance (and start billing)
• Stop an instance (stop billing)
• Insert item into queue
• Remove item from queue
• Write to backup store
• Ultra advanced
• Reserved vs. on demand vs. spot instances
• Price can drop as much as 80% due to market demand
• Instance can disappear at any time
9. Big Data Analytics
• Can we skirt the “big data” problem by distilling the tweets
down from millions and millions “noise” tweets into a more
desirable data set?
• Enrich in real time, rather than on archived data, and avoid the
overhead of map/reduce?
• Possible Enrichment of raw data:
• Classification – separate tweets into “relevant” and “irrelevant”
• Geocoding – improve on the 1-2% ?
• Aggregation –> map reduce
• Mapping -> Reduce Function -> Output
• AWS – Elastic Map Reduce
• Clustering
10. Machine Learning
• Classification: relevant, or irrelevant?
• Human trained model
• Once model is established, bounce new data off it for
classification
• Validation of model
• Accuracy =
(Total # of classifications – Mismatches between machine / human)
Total # of classifications
• Crowdsourcing – AWS Mechanical Turk
• Improve model by feeding disagreements back into the model
• Our best text classification model to date: low 90%
11. Open Source
• Friendly to the commoditized computing paradigm
• Don’t have to worry about licensing issues
• Contributes to the “throwaway” discipline
• Don’t have to re-invent the wheel (collaboration)
• Solutions applicable to all parts of the architecture
• Acquire data: Node.js – non blocking
• Analyze data: R – statistical engine
• Store and query data: MongoDB (document store) or Riak (key-
value database)
12. Architecture
• We know Twitter is providing a mountain of data from all parts
of the world
• We know Amazon is providing a framework of low cost, on-
demand, no commitment computing
• Open source is providing a rich tool set
• Goals:
• Architect with cost in mind!
• Enrichment - Real time and after-the-fact enrichment (open data)
• Scalable
• Decoupled
• Service based
• Rapid development
• Prove the concepts
13. Architecture - Acquire
• Acquire the data from Twitter
• If classifying in real time:
• Store then classify?
• Classify then store?
• Tools
• Twitter streaming API
• Keywords
• Node.js
• Several different packages to interface with Twitter APIs
• Amazon
• EC2
• SQS (?) Extremely useful, but drives the cost up
14. Architecture - Analyze
• Classification interface
• Service based – HTTP REST
• Push or pull?
• Push – classifiers listen on port 80
• Pull – classifier starts pulling from an established work queue
• Both highly scalable and flexible with respect to cost.
• Stateless
• R
• Human trained machine learning packages available
• Cloud friendly – no licenses
• Automatable – from install, configuration, execution
15. Architecture - Store
• Store JSON as an object (document store) or normalize (relational
database)?
• Relational databases
• disk I/O intensive – not cloud friendly
• allow complex indexing
• Easy to get a business intelligence front end on them
• Requires a schema / ETL
• Key-value document stores
• Designed to be scalable – doesn’t need fast disks
• Indexing is not nearly as flexible as RDBMS
• More difficult to front a UI – no “drag and drop” tools
• No schema / ETL needed.
• Not as mature
• MongoDB / Riak
16. Architecture – Presentation
• Least need for cloud friendly scalability here?
• Options
• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho
• Open source BI software – SpagoBI
• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc
• Connect to an existing system instead?
17. Costs – Real Time Classification
• Number of tweets collected per day: 1,000,000 (comfortable - .25%)
• Machine used on EC2 to acquire (node.js): micro
• $.02/hr * 24 hrs = .48/day
• Machine used on EC2 to classify (R): small (x2)
• $.06/hr * 24 hrs = $1.44/day*2 = $2.88/day
• Machine used on EC2 to store (MongoDB): large
• $.24/hr * 24 hrs = $5.76 /day
• Machine used on EC2 for GUI (Apache): small
• $.06/hr * 24 = $1.44
•
$0.48+$2.88+$5.76+$1.44 = $10.56 / 1,000,000 =
.00001056 cents/tweet
Can add more zeros if you relax real-time classification (spot instances)
18. Costs - Archive
• Size of average tweet: 2.5 KB
• Cost to archive:
• s3 : .095 GB/month
• 0.0000002 per tweet per month
• Glacier: .01 GB/month
• 0.00000002 per tweet per month
• Compression will add even more zeros, but will require more
computing power, and mean more latency for post collection
data analysis. Can be automated.
19. Use Cases
• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)
• Public-private partnership with City of Chicago Dept. of Public Health
and Smart Chicago Collaborative
• Reach out to city residents on Twitter tweeting about food poisoning
symptoms, in an attempt to get them to log information in the City’s
311 database (via the Open311 API)
• Once in the 311 database, it follows established City workflows, and
becomes actionable
• Numbers (1 year):
• 2,390 tweets classified as related to food poisoning
• 282 tweets responded to
• 205 reports submitted
• 145 inspections
• Real time classification examples:
• “Ugh! I got food poisoning from the McDonalds’s on Halstead!”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead
• “U of Chicago releases a new paper on the effects of food poisoning”
http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning
• Video:http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
20. Use Cases
• Disease Tracker
• Large scale attempt to track disease occurrences in the United
States.
• Sponsored by the Dept. of HHS
• Approximately 1 million tweets a day (cold, flu) classified in real
time
• EC2 scalable instances
• Geolocation
• Cost to run for 6 months: $850
21. Future Directions
• Turnkey service
• Can all this functionality be abstracted down to a pushbutton
service?
• Open data
• Can you advertise the data collected, how you enriched it, and
allow others to come along an enrich it as well?
• General purpose bridge between Twitter and issue tracking
databases
• Big industry problem