SlideShare una empresa de Scribd logo
1 de 17
Coalmine:
An E xperience in B uilding a S ystem for S ocial
Media Analytics


Joshua S. White
Jeanna N. Matthews, PhD
Outline

 •   Problem
 •   Method Overview
 •   Data Collection
 •   Analysis
 •   Case Studies
 •   Conclusion / Future Work
P roblem

 • Social Media Networks
   – A communications means for good and bad
      • Proven cases of malware / botnets use
      • SPAM medium
 • Our Goal
   – To provide a generalized tool for analysis of
     potential threats that use these networks for
     communications.
Method Overview
D ata Collection
 • Initially (Spring 2011)
    – Twitter approved oAuth application
       • Firehose Subscription with white-listing
           – ~20% of all Tweets
           – (No longer available)
               » Twitter no longer allows researchers to share
                 datasets
               » We needed to develop a new collection method
               » Can not violate terms of use
• Current
  – Distributed Data Collection Infrastructure
  – Geographically dissimilar IP's to simulate multiple users
  – Registered Application with Non-authenticated API access
      • ~80 – 100% of all Tweets (1 billion+ / week)
D ata Collection
 • Storage
    – Collection in Streaming Gzip Python Dict.
      Format (10:1 Compression Ratio)
       • Converted to JSON on the fly when needed
          – Initially Stored in HDFS (Had Issues)
              » Recent work uses DDFS
    – Indexed using Luceen
       • New methods are being explored
           – Discodex w/ BSON Store
    – Storing 1.5 TB a Week
Analysis
 • Two Part Method
   – Manual Inspection
     • Query Panel Front-end




   – Automated Inspection
E xample Analysis
  Field Name            Description                             Example Data
  name                  User's REAL Name                        Text: "Robert Scoble"
  screen_name           User's Twitter username                 Text: "scobleizer"

                                                                Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-
  profile_image_url     Link to users profile image             fanatiguy_normal.jpg"
  url                   Link to user's non-Twitter site         Link: "http://www.google.com/profiles/scobleizer"
  followers_count       Number of followers user has            Number: "185496"
  friends_count         Number of people user follows           Number: "31971"
  utc_offset            Offset from GMT (in seconds)            Number: "-28800"

  geo_enabled           Whether user has enabled location       Boolean: "True"

  statuses_count        Number of statuses user has posted Number: "53522"

  Tweet Specific Fields                                          
  created_at            Tweet timestamp                         Text: "Tue Jun 14 18:30:13 +0000 2011"

  id                    Tweet id (useful for URL creation)      Number: "80703603437875201"
                        Contains the actual text + any
  text                  embedded URLs                           Whatever text the person chooses to enter. <- Could be any language supported.
                        Links to Twitter client URL <- not
  source                important                               HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"

  in_reply_to_status_id Number of status that user replied to   Number: "80671170374025220"
  in_reply_to_screen_na Screen name of user the current
  me                    status replies to                       Text: "danharmon"
                        Number of times this status is
  retweet_count         retweeted                               Number: "0"
                        Whether or not the status has been
  retweeted             retweeted                               Boolean: "false"
  'geo' flag specific:                                           
  georss:point          Lat. & Long. Location                   Number: "43.21227199 -75.39866939"
                        Points to a JSON or XML file with
  url                   further GEO Info.                       Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
Case S tudy: B otnet C2
 • One well known case:
    – Arbor Networks detected first known incident
      in 2009
      • Base 64 encoded control signals
    – Soon After:
      • A number of tools released to do the same:
         – ControlMyPC, KreosC2, etc.
Case S tudy: B otnet C2
 • Sample Manual Detection:
Case S tudy: S P AM
 • Twitter's number one problem, artificially
   increases traffic and bothers legitimate users
 • Easily detected during manual analysis




 • Automated detection based on wording and
   rates at which messages are posted
Conclusion / Future Work
 • Coalmine - A tool for Social Media Analysis
   – Scales well based on initial tests
   – Useful for both manual and automated detection
 • Future (Current) Work
   – Rebuild of the tool to fix scaling limitations
      •   More extensible Map/Reduce method
      •   Inclusion of native multi-threading capability
      •   New storage and distribution method
      •   New algorithms for automated opinion leader detection
Questions




            ?
R eferences
R eferences
R eferences

Más contenido relacionado

Similar a Coalmine spie 2012 presentation - jsw -d3

474 Password Not Found
474 Password Not Found474 Password Not Found
474 Password Not FoundCodemotion
 
Data encoding and Metadata for Streams
Data encoding and Metadata for StreamsData encoding and Metadata for Streams
Data encoding and Metadata for Streamsunivalence
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insightDigital Reasoning
 
HATEOAS 101 - Opinionated Introduction to a REST API Style
HATEOAS 101 - Opinionated Introduction to a REST API StyleHATEOAS 101 - Opinionated Introduction to a REST API Style
HATEOAS 101 - Opinionated Introduction to a REST API StyleApigee | Google Cloud
 
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)Giles Greenway
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingCloud Elements
 
Apache Metron Meetup May 4, 2016 - Big data cybersecurity
Apache Metron Meetup May 4, 2016 - Big data cybersecurityApache Metron Meetup May 4, 2016 - Big data cybersecurity
Apache Metron Meetup May 4, 2016 - Big data cybersecurityHortonworks
 
Apache metron meetup presentation at capital one
Apache metron meetup presentation at capital oneApache metron meetup presentation at capital one
Apache metron meetup presentation at capital onegvetticaden
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsthelabdude
 
Extensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopExtensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopVarun Ganesh
 
API-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FutureAPI-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FuturePat Patterson
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Toward a Mobile Data Commons
Toward a Mobile Data CommonsToward a Mobile Data Commons
Toward a Mobile Data CommonskingsBSD
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software DatasetsTao Xie
 
Klout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIsKlout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIsTyler Singletary
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPChristian Morbidoni
 

Similar a Coalmine spie 2012 presentation - jsw -d3 (20)

474 Password Not Found
474 Password Not Found474 Password Not Found
474 Password Not Found
 
Data encoding and Metadata for Streams
Data encoding and Metadata for StreamsData encoding and Metadata for Streams
Data encoding and Metadata for Streams
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insight
 
HATEOAS 101 - Opinionated Introduction to a REST API Style
HATEOAS 101 - Opinionated Introduction to a REST API StyleHATEOAS 101 - Opinionated Introduction to a REST API Style
HATEOAS 101 - Opinionated Introduction to a REST API Style
 
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)
Our Data, Ourselves: The Data Democracy Deficit (EMF CAmp 2014)
 
Filtering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media StreamingFiltering From the Firehose: Real Time Social Media Streaming
Filtering From the Firehose: Real Time Social Media Streaming
 
Apache Metron Meetup May 4, 2016 - Big data cybersecurity
Apache Metron Meetup May 4, 2016 - Big data cybersecurityApache Metron Meetup May 4, 2016 - Big data cybersecurity
Apache Metron Meetup May 4, 2016 - Big data cybersecurity
 
Apache metron meetup presentation at capital one
Apache metron meetup presentation at capital oneApache metron meetup presentation at capital one
Apache metron meetup presentation at capital one
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Extensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPopExtensible RESTful Applications with Apache TinkerPop
Extensible RESTful Applications with Apache TinkerPop
 
API-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the FutureAPI-Driven Relationships: Building The Trans-Internet Express of the Future
API-Driven Relationships: Building The Trans-Internet Express of the Future
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & Analysis
 
Toward a Mobile Data Commons
Toward a Mobile Data CommonsToward a Mobile Data Commons
Toward a Mobile Data Commons
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
Klout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIsKlout as an Example Application of Topics-oriented NLP APIs
Klout as an Example Application of Topics-oriented NLP APIs
 
Big Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLPBig Data Analytics course: Named Entities and Deep Learning for NLP
Big Data Analytics course: Named Entities and Deep Learning for NLP
 

Más de Joshua S. White, PhD josh@securemind.org

Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...
Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...
Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...Joshua S. White, PhD josh@securemind.org
 
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...Presentation - Social Relevance Toward Understanding the Impact of the Indivi...
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...Joshua S. White, PhD josh@securemind.org
 
Presentation - Application of Actor Level Social Characteristic Indicator Sel...
Presentation - Application of Actor Level Social Characteristic Indicator Sel...Presentation - Application of Actor Level Social Characteristic Indicator Sel...
Presentation - Application of Actor Level Social Characteristic Indicator Sel...Joshua S. White, PhD josh@securemind.org
 
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...Joshua S. White, PhD josh@securemind.org
 

Más de Joshua S. White, PhD josh@securemind.org (12)

Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...
Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...
Presentation - Hybrid Sentiment Analysis Utilizing Multiple Indicators To Det...
 
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...Presentation - Social Relevance Toward Understanding the Impact of the Indivi...
Presentation - Social Relevance Toward Understanding the Impact of the Indivi...
 
Presentation - Application of Actor Level Social Characteristic Indicator Sel...
Presentation - Application of Actor Level Social Characteristic Indicator Sel...Presentation - Application of Actor Level Social Characteristic Indicator Sel...
Presentation - Application of Actor Level Social Characteristic Indicator Sel...
 
Supraja_SMS_presentation
Supraja_SMS_presentationSupraja_SMS_presentation
Supraja_SMS_presentation
 
ase-social-informatics (6)
ase-social-informatics (6)ase-social-informatics (6)
ase-social-informatics (6)
 
Social Network Analysis Applications and Approach
Social Network Analysis Applications and ApproachSocial Network Analysis Applications and Approach
Social Network Analysis Applications and Approach
 
Clarkson joshua white - ids testing - spie 2013 presentation - jsw - d1
Clarkson   joshua white - ids testing - spie 2013 presentation - jsw - d1Clarkson   joshua white - ids testing - spie 2013 presentation - jsw - d1
Clarkson joshua white - ids testing - spie 2013 presentation - jsw - d1
 
Malware bek slides 20131023 final
Malware bek slides 20131023 finalMalware bek slides 20131023 final
Malware bek slides 20131023 final
 
CSIAC - Social Media Analysis and Privacy
CSIAC - Social Media Analysis and PrivacyCSIAC - Social Media Analysis and Privacy
CSIAC - Social Media Analysis and Privacy
 
Clarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal PresentationClarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal Presentation
 
Phishing spie 2012 presentation - jsw - d2
Phishing   spie 2012 presentation - jsw - d2Phishing   spie 2012 presentation - jsw - d2
Phishing spie 2012 presentation - jsw - d2
 
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...
Physical Layer Optical Network Security Thesis Presentation To The CNY ISSA C...
 

Último

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Último (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

Coalmine spie 2012 presentation - jsw -d3

  • 1. Coalmine: An E xperience in B uilding a S ystem for S ocial Media Analytics Joshua S. White Jeanna N. Matthews, PhD
  • 2. Outline • Problem • Method Overview • Data Collection • Analysis • Case Studies • Conclusion / Future Work
  • 3. P roblem • Social Media Networks – A communications means for good and bad • Proven cases of malware / botnets use • SPAM medium • Our Goal – To provide a generalized tool for analysis of potential threats that use these networks for communications.
  • 5. D ata Collection • Initially (Spring 2011) – Twitter approved oAuth application • Firehose Subscription with white-listing – ~20% of all Tweets – (No longer available) » Twitter no longer allows researchers to share datasets » We needed to develop a new collection method » Can not violate terms of use
  • 6. • Current – Distributed Data Collection Infrastructure – Geographically dissimilar IP's to simulate multiple users – Registered Application with Non-authenticated API access • ~80 – 100% of all Tweets (1 billion+ / week)
  • 7. D ata Collection • Storage – Collection in Streaming Gzip Python Dict. Format (10:1 Compression Ratio) • Converted to JSON on the fly when needed – Initially Stored in HDFS (Had Issues) » Recent work uses DDFS – Indexed using Luceen • New methods are being explored – Discodex w/ BSON Store – Storing 1.5 TB a Week
  • 8. Analysis • Two Part Method – Manual Inspection • Query Panel Front-end – Automated Inspection
  • 9. E xample Analysis Field Name Description Example Data name User's REAL Name Text: "Robert Scoble" screen_name User's Twitter username Text: "scobleizer" Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop- profile_image_url Link to users profile image fanatiguy_normal.jpg" url Link to user's non-Twitter site Link: "http://www.google.com/profiles/scobleizer" followers_count Number of followers user has Number: "185496" friends_count Number of people user follows Number: "31971" utc_offset Offset from GMT (in seconds) Number: "-28800" geo_enabled Whether user has enabled location Boolean: "True" statuses_count Number of statuses user has posted Number: "53522" Tweet Specific Fields     created_at Tweet timestamp Text: "Tue Jun 14 18:30:13 +0000 2011" id Tweet id (useful for URL creation) Number: "80703603437875201" Contains the actual text + any text embedded URLs Whatever text the person chooses to enter. <- Could be any language supported. Links to Twitter client URL <- not source important HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>" in_reply_to_status_id Number of status that user replied to Number: "80671170374025220" in_reply_to_screen_na Screen name of user the current me status replies to Text: "danharmon" Number of times this status is retweet_count retweeted Number: "0" Whether or not the status has been retweeted retweeted Boolean: "false" 'geo' flag specific:     georss:point Lat. & Long. Location Number: "43.21227199 -75.39866939" Points to a JSON or XML file with url further GEO Info. Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
  • 10. Case S tudy: B otnet C2 • One well known case: – Arbor Networks detected first known incident in 2009 • Base 64 encoded control signals – Soon After: • A number of tools released to do the same: – ControlMyPC, KreosC2, etc.
  • 11. Case S tudy: B otnet C2 • Sample Manual Detection:
  • 12. Case S tudy: S P AM • Twitter's number one problem, artificially increases traffic and bothers legitimate users • Easily detected during manual analysis • Automated detection based on wording and rates at which messages are posted
  • 13. Conclusion / Future Work • Coalmine - A tool for Social Media Analysis – Scales well based on initial tests – Useful for both manual and automated detection • Future (Current) Work – Rebuild of the tool to fix scaling limitations • More extensible Map/Reduce method • Inclusion of native multi-threading capability • New storage and distribution method • New algorithms for automated opinion leader detection