Coalmine spie 2012 presentation - jsw -d3
1. Coalmine:
An Experience in Building a System for Social Media Analytics
Joshua S. White
Jeanna N. Matthews, PhD
2. Outline
• Problem
• Method Overview
• Data Collection
• Analysis
• Case Studies
• Conclusion / Future Work
3. Problem
• Social Media Networks
– A communications means for good and bad
• Proven cases of malware / botnet use
• SPAM medium
• Our Goal
– To provide a generalized tool for analysis of
potential threats that use these networks for
communications.
5. Data Collection
• Initially (Spring 2011)
– Twitter-approved OAuth application
• Firehose Subscription with white-listing
– ~20% of all Tweets
– (No longer available)
» Twitter no longer allows researchers to share datasets
» We needed to develop a new collection method
» Cannot violate the terms of use
6. Data Collection
• Current
– Distributed Data Collection Infrastructure
– Geographically dispersed IP addresses to simulate multiple users
– Registered Application with Non-authenticated API access
• ~80–100% of all Tweets (1 billion+ / week)
7. Data Collection
• Storage
– Collection in Streaming Gzip Python Dict Format (10:1 Compression Ratio)
• Converted to JSON on the fly when needed
– Initially Stored in HDFS (Had Issues)
» Recent work uses DDFS
– Indexed using Lucene
• New methods are being explored
– Discodex w/ BSON Store
– Storing 1.5 TB a Week
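The storage format on this slide can be sketched roughly as follows. This is a minimal illustration, assuming one Python dict literal per line inside a gzip stream; the function names and the record layout are hypothetical and are not taken from Coalmine itself.

```python
import ast
import gzip
import io
import json


def write_records(stream, records):
    """Write each record as one Python dict literal per line into a gzip stream."""
    with gzip.GzipFile(fileobj=stream, mode="wb") as gz:
        for record in records:
            # repr() escapes embedded newlines, so each record stays on one line.
            gz.write((repr(record) + "\n").encode("utf-8"))


def read_as_json(stream):
    """Yield each stored dict literal as a JSON string, converting on the fly."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        for line in gz:
            # literal_eval parses the dict literal without executing code.
            record = ast.literal_eval(line.decode("utf-8"))
            yield json.dumps(record, sort_keys=True)
```

Reading line by line keeps memory use flat regardless of file size, which is one plausible reason to prefer a streaming format at this data volume.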
8. Analysis
• Two Part Method
– Manual Inspection
• Query Panel Front-end
– Automated Inspection
9. Example Analysis
Field Name | Description | Example Data
name | User's REAL Name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"
Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter (could be any supported language)
source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | Number of status that user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"
'geo' flag specific:
georss:point | Lat. & Long. Location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further GEO Info. | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
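As a small illustration of how the fields listed on this slide might be used, the sketch below parses created_at, applies utc_offset, and builds a status URL from id. The nesting of the user fields under a "user" key and the URL pattern are assumptions made for this example, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

# Example values taken from the field listing on this slide.
tweet = {
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "id": 80703603437875201,
    "user": {"screen_name": "scobleizer", "utc_offset": -28800},
}


def parse_created_at(value):
    """Parse the created_at text into a timezone-aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def local_time(tweet):
    """Shift the UTC timestamp by the user's utc_offset (given in seconds)."""
    utc = parse_created_at(tweet["created_at"])
    return utc.replace(tzinfo=None) + timedelta(seconds=tweet["user"]["utc_offset"])


def tweet_url(tweet):
    """Build a status URL from screen_name and id (assumed URL pattern)."""
    return "https://twitter.com/{}/status/{}".format(
        tweet["user"]["screen_name"], tweet["id"]
    )
```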
10. Case Study: Botnet C2
• One well-known case:
– Arbor Networks detected first known incident in 2009
• Base64-encoded control signals
– Soon After:
• A number of tools released to do the same:
– ControlMyPC, KreosC2, etc.
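A screening step for Base64-encoded control signals might look something like the sketch below. It is only an illustration: the minimum token length of 16 and the requirement that the decoded payload be printable ASCII are arbitrary assumptions, not thresholds from the talk, and the function name is hypothetical.

```python
import base64
import binascii
import re
import string

# Runs of Base64 alphabet characters, optionally padded. The minimum length
# of 16 is an arbitrary illustrative threshold.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
PRINTABLE = set(string.printable.encode("ascii"))


def decode_candidates(text):
    """Return decoded payloads for tokens that are valid, printable Base64."""
    payloads = []
    for token in BASE64_TOKEN.findall(text):
        if len(token) % 4 != 0:
            continue  # padded Base64 always has a length divisible by 4
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue
        # Keep only payloads that decode to readable text, e.g. a URL or command.
        if raw and all(byte in PRINTABLE for byte in raw):
            payloads.append(raw.decode("ascii"))
    return payloads
```

A heuristic like this will produce false positives on long alphanumeric strings, so it would most plausibly serve as a filter that narrows what goes to manual inspection.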
11. Case Study: Botnet C2
• Sample Manual Detection:
12. Case Study: SPAM
• Twitter's number one problem: it artificially increases traffic and bothers legitimate users
• Easily detected during manual analysis
• Automated detection based on wording and
rates at which messages are posted
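Detection based on wording and posting rate could be sketched along these lines. The keyword list, both thresholds, and the function names are hypothetical values chosen for illustration; the talk does not specify the actual rules.

```python
from collections import defaultdict

# Hypothetical keyword list and thresholds, chosen only for illustration.
SPAM_WORDS = {"free", "winner", "click", "followers"}
MAX_POSTS_PER_MINUTE = 5
MIN_SPAM_WORDS = 2


def spam_word_count(text):
    """Count how many distinct spam keywords appear in the message text."""
    words = {word.strip(".,!?:;\"'").lower() for word in text.split()}
    return len(words & SPAM_WORDS)


def flag_accounts(messages):
    """Flag accounts on wording or posting rate.

    messages: iterable of (screen_name, unix_timestamp, text) tuples.
    Returns the set of screen names that exceed either threshold.
    """
    per_minute = defaultdict(int)
    flagged = set()
    for screen_name, timestamp, text in messages:
        if spam_word_count(text) >= MIN_SPAM_WORDS:
            flagged.add(screen_name)
        # Count posts per account in fixed one-minute buckets.
        bucket = (screen_name, int(timestamp) // 60)
        per_minute[bucket] += 1
        if per_minute[bucket] > MAX_POSTS_PER_MINUTE:
            flagged.add(screen_name)
    return flagged
```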
13. Conclusion / Future Work
• Coalmine - A tool for Social Media Analysis
– Scales well based on initial tests
– Useful for both manual and automated detection
• Future (Current) Work
– Rebuild of the tool to fix scaling limitations
• More extensible Map/Reduce method
• Inclusion of native multi-threading capability
• New storage and distribution method
• New algorithms for automated opinion leader detection