musweet.com
Handling Humongous Data Sets
from the Social Web

Grischa Andreew & Nader Cserny, compuccino
Agenda

• Introduction

• Technology

• Data in Detail

• Queries

• Tools & Debugging

• Questions
Introduction
What is musweet?
Media Stream
Topics
Analytics
Profile
Statistics

7,002        ARTISTS
20,848       SOCIAL MEDIA PROFILES
3,914,259    STREAM ITEMS
309,855      MEDIA ITEMS (AUDIO, VIDEO, PHOTO)
~3,450       READ QUERIES / SEC
~15,000      INSERTS / DAY
3.75 GB      DATA SIZE
Technology
What do we need?

ARTIST              Name, city, genre, picture
SOCIAL PROFILE      Platform (e.g. Twitter), link
MEDIA POSTS         Images, videos, audio, status updates
PROFILE INFO        Friends, followers, date, websites,
                    profile picture, biography, label, etc.
MySQL Schema

Artist
  Fields:  id, name
  Indexes: name

Numbers
  Fields:  artist, socialprofile, outgoing, incoming, feedback, push
  Indexes: artist_outgoing, artist_incoming, artist_feedback, artist_push, artist

Socialprofile
  Fields:  id, artist, url, service
  Indexes: artists_service

artist_genres
  Fields:  artist, genre
  Indexes: artist_genre

Genres
  Fields:  id, name
  Indexes: name

Stream
  Fields:  artist, socialprofile, message, created_at
  Indexes: message, artist_created_at, created_at

Service Information: Twitter
  Fields:  artist, lang, verified, location, id, url, created_at, description,
           time_zone, profile_image_url, screen_name
  Indexes: artist

Service Information: Facebook
  Fields:  artist, category, name, fan_count, bio, url, username, record_label,
           location, profile_image_url, band_members, website, link,
           pinnwand_posts, genre, friends, id
  Indexes: artist

Service Information: Myspace
  Fields:  artist, website, genre, location, art_des_labels, headline,
           created_at, id, profile_image_url, label
  Indexes: artist

Stream Information: Facebook
  Fields:  stream, name, caption, link, likes, type, icon
  Indexes: stream

Stream Information: Twitter
  Fields:  stream, source, in_reply_to_status_id, in_reply_to_user_id,
           truncated, deleted
  Indexes: stream

Stream Information: Myspace
  Fields:  stream, category, image, link, source
  Indexes: stream
MongoDB Schema

Artist
  id (ObjectId)
  name (str)
  genres (strict array)
  socialprofiles (strict array)
     service (dbref)
     url (str)
     numbers (strict array)
       incoming
       outgoing
       push
       feedback
       date
     meta (array)
       (different fields depending on the platform)
  Indexes
    name,
    genres,
    socialprofiles.service,
    socialprofiles.numbers

Stream
  id (hash of the Facebook / MySpace / Twitter id)
  socialprofile (dbref)
  genres (strict array) (redundant copy of the artist's genres,
    so that the stream can be queried by genre directly)
  data (array)
     (data from the platforms;
      the field "message" is required)
  created_at (datetime)
  Indexes
    socialprofile,
    genres,
    data.message
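
To make the embedded layout concrete, here is a minimal sketch of one artist document as a plain JavaScript object. The figures come from the Twitter profile example shown later in the deck. The exact nesting (`numbers` as an array of dated snapshots), the plain strings standing in for DBRefs, and the helper `latestIncoming` are assumptions for illustration, not the authors' production data.

```javascript
// Hypothetical artist document following the schema above.
// DBRefs are simplified to plain strings in this sketch.
const artist = {
  name: "Snoop Dogg",
  genres: ["hiphop"],
  socialprofiles: [
    {
      service: "twitter",
      url: "http://twitter.com/snoopdogg",
      // One entry per measurement date (assumed layout).
      numbers: [
        { outgoing: 1204, incoming: 2030350, push: 3145, feedback: 22750,
          date: "2010-09-29" }
      ],
      // Platform-specific fields live in "meta".
      meta: { lang: "en", verified: true, screen_name: "SnoopDogg" }
    }
  ]
};

// Return the most recent follower count ("incoming") of one profile,
// or 0 when no measurement has been stored yet.
function latestIncoming(profile) {
  const last = profile.numbers[profile.numbers.length - 1];
  return last ? last.incoming : 0;
}

console.log(latestIncoming(artist.socialprofiles[0])); // 2030350
```

Everything that the MySQL schema spreads over several tables sits inside a single document here, which is why a lookup by artist name needs no join.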
How do we get the data? (Simplified)

Crawler
  • Processing of links
  • Extraction of media
  • Preparation of the content

musweet
  • Display of the content
  • Mapping to artist / service

Exchanged between the two: links and data
Data in Detail
Artist profile on MySpace

"numbers" : {
     "outgoing" : 221665,
     "incoming" : 770355,
     "feedback" : 36862603,
     "push" : 0,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "website" : "http://www.snoopdogg.com",
     "genre" : "Hip Hop / Rap / R&B",
     "location" : "Long Beach, California Vereinigte Staaten von Amerika",
     "art_des_labels" : "Major",
     "headline" : "",
     "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)",
     "id" : 6344278,
     "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/m_9857dcca155247b69e1260e6e34cce3c.jpg",
     "label" : "Doggystyle / Priority"
}
Artist profile on Twitter

"numbers" : {
     "outgoing" : 1204,
     "incoming" : 2030350,
     "feedback" : 22750,
     "push" : 3145,
     "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)"
},
"meta" : {
     "lang" : "en",
     "verified" : true,
     "location" : "LBC",
     "id" : 3004231,
     "url" : "http://www.snoopdogg.com",
     "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)",
     "description" : "More Malice CD + DVD IN STORES NOW",
     "time_zone" : "Pacific Time (US & Canada)",
     "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg",
     "screen_name" : "SnoopDogg"
}
Artist profile on Facebook

"numbers" : {
     "outgoing" : 0,
     "incoming" : 2930860,
     "feedback" : 0,
     "push" : 0
},
"meta" : {
     "category" : "Musicians",
     "name" : "Snoop Dogg",
     "fan_count" : 2930860,
     "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....",
     "url" : "http://www.facebook.com/snoopdogg?v=info",
     "username" : "snoopdogg",
     "record_label" : "Priority/Doggystyle ",
     "location" : "Long Beach, CA",
     "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/hs622.snc3/27524_11455644806_1192_s.jpg",
     "band_members" : "Snoop Dogg",
     "website" : [
          "http://www.snoopdogg.com",
          "http://www.myspace.com/snoopdogg",
          "http://twitter.com/snoopdogg"
     ],
     "link" : "http://www.facebook.com/snoopdogg",
     "pinnwand_posts" : 0,
     "genre" : "Hip Hop / Rap / R&B",
     "friends" : 0,
     "id" : "11455644806"
}
Queries
MySQL vs. MongoDB (1)

All social media profiles of one artist, with follower counts

MySQL

SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
FROM
    numbers AS n
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
WHERE
    a.name = "Snoop Dogg"
ORDER BY
    n.incoming DESC

Duration: 0.0288 sec.

MongoDB

db.artist.find( { "name": "Snoop Dogg" } )

Duration: 0.0001 sec.
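
The MongoDB query above returns a single document that already embeds every social profile, so the work done by the two JOINs moves into the application. The following is a rough sketch of that step in plain JavaScript. The document shape (with `numbers` flattened to one object per profile) and the helper `profilesByFollowers` are assumptions for illustration; the follower counts and URLs are the ones shown in the profile slides.

```javascript
// Hypothetical result of db.artist.find({ "name": "Snoop Dogg" }),
// reduced to the fields needed here ("numbers" flattened for brevity).
const doc = {
  name: "Snoop Dogg",
  socialprofiles: [
    { service: "myspace",  url: "http://www.myspace.com/snoopdogg",  numbers: { incoming: 770355 } },
    { service: "twitter",  url: "http://twitter.com/snoopdogg",      numbers: { incoming: 2030350 } },
    { service: "facebook", url: "http://www.facebook.com/snoopdogg", numbers: { incoming: 2930860 } }
  ]
};

// Rough equivalent of the SQL projection plus ORDER BY n.incoming DESC.
function profilesByFollowers(artistDoc) {
  return artistDoc.socialprofiles
    .map(p => ({ artist_name: artistDoc.name, url: p.url, incoming: p.numbers.incoming }))
    .sort((a, b) => b.incoming - a.incoming);
}

const rows = profilesByFollowers(doc);
console.log(rows.map(r => r.url));
```

Because `map` builds a new array before sorting, the fetched document itself is left unchanged.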
MySQL vs. MongoDB (2)

The 10 hip-hop musicians with the most followers

MySQL

SELECT
    n.incoming,
    a.id AS artist,
    a.name AS artist_name,
    s.id AS socialprofile,
    s.url AS socialprofile_url
FROM
    numbers AS n
    JOIN artist_genres AS ag ON ag.artist = n.artist
    JOIN genres AS g ON g.id = ag.genre
    JOIN socialprofile AS s ON s.id = n.socialprofile
    JOIN artist AS a ON a.id = n.artist
WHERE
    g.name = "Hip/Hop"
ORDER BY
    n.incoming DESC
LIMIT 10

Duration: 0.8741 sec.

MongoDB

db.artist.find( {
  "genre": DBRef("genre","hiphop")
} ).sort( {
  "socialprofiles.numbers.incoming": -1
} ).limit(10)

Duration: 0.0230 sec.
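
As a way to read the MongoDB side of this comparison, the filter, sort, and limit chain can be sketched over an in-memory array. This is only an illustration of the query's logic, not of how the server executes it: the sample artists and their follower counts are made up, `maxIncoming` and `topByGenre` are hypothetical helpers, and ranking an artist by the largest count across profiles is an assumption about how a descending sort on an array field behaves.

```javascript
// Hypothetical in-memory collection; names and follower counts are made up.
const artists = [
  { name: "A", genres: ["hiphop"], socialprofiles: [{ incoming: 500 }, { incoming: 900 }] },
  { name: "B", genres: ["rock"],   socialprofiles: [{ incoming: 5000 }] },
  { name: "C", genres: ["hiphop"], socialprofiles: [{ incoming: 1200 }] }
];

// Largest follower count across all profiles of one artist.
function maxIncoming(artist) {
  return Math.max(0, ...artist.socialprofiles.map(p => p.incoming));
}

// Filter by genre, sort descending by follower count, keep the first `limit`.
function topByGenre(collection, genre, limit) {
  return collection
    .filter(a => a.genres.includes(genre))
    .sort((a, b) => maxIncoming(b) - maxIncoming(a))
    .slice(0, limit);
}

console.log(topByGenre(artists, "hiphop", 10).map(a => a.name)); // [ 'C', 'A' ]
```

Storing the genres redundantly on each document is what lets the filter run without consulting a second collection.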
MySQL Index

• A composite index is read from left to right

  The order of the fields matters

  Index fields: "artist", "incoming", "push", "date"

  SELECT * FROM numbers WHERE artist = 1
    works
  SELECT * FROM numbers WHERE incoming = 1
    does not work (the leading field "artist" is missing)
  SELECT * FROM numbers WHERE artist = 1 AND push < 10
    does not work for "push" (only "artist" is used, because "incoming" is skipped)
  SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0
    works ("artist" and "incoming" are used; the order of the conditions in the WHERE clause is irrelevant)

• Index debugging

  EXPLAIN SELECT * FROM numbers WHERE artist = 1
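
The left-to-right rule can be expressed as a small function that, given the index field order and the conditions of a WHERE clause, returns the leading fields a B-tree lookup can use. This is a simplified model for illustration only: `usableKeyParts` and the `"eq"` / `"range"` labels are hypothetical, and a real optimizer has further strategies, so EXPLAIN remains the authority.

```javascript
// Fields of the composite index, in index order (from the slide).
const indexFields = ["artist", "incoming", "push", "date"];

// conditions: { field: "eq" | "range" } describing the WHERE clause.
// Returns the index fields usable as key parts: consecutive leading
// fields, stopping at the first field without a condition or right
// after the first range condition.
function usableKeyParts(index, conditions) {
  const used = [];
  for (const field of index) {
    const kind = conditions[field];
    if (kind === undefined) break;  // gap: later fields are unusable
    used.push(field);
    if (kind === "range") break;    // a range ends the usable prefix
  }
  return used;
}

// The four queries from the slide:
console.log(usableKeyParts(indexFields, { artist: "eq" }));
// [ 'artist' ]
console.log(usableKeyParts(indexFields, { incoming: "eq" }));
// []
console.log(usableKeyParts(indexFields, { artist: "eq", push: "range" }));
// [ 'artist' ]
console.log(usableKeyParts(indexFields, { artist: "eq", push: "range", incoming: "range" }));
// [ 'artist', 'incoming' ]
```

An empty result corresponds to a query that cannot seek in this index at all, which is the case for the lookup on `incoming` alone.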
MongoDB Index

• Index order mattered less for these queries

  a field from the middle of the index can appear in a query; note that compound indexes are still ordered, so a query that skips the leading field may not use the index efficiently (check with explain())

  db.artist.ensureIndex( {"name":1, "numbers": -1 } );

  db.artist.find( { "name": "Snoop Dogg" } )
    works
  db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } )
    works
  db.artist.find( {
    "name": "Snoop Dogg",
    "socialprofiles.numbers.incoming": { "$gte": 0 }
  } )
    works

• Index debugging

  db.artist.find( { "name": "Snoop Dogg" } ).explain()
Tools & Debugging
MongoDB Error Messages

• Sorted query without a limit:
  Error: "too much data for sort() with no index. add an index or
  specify a smaller limit"

  Fix: add the sort field to an index

• Duplicate key error:
  Problem: in older versions (< 1.6.0) the database crashes when too many
  duplicate key errors occur

  Fix: use an upsert
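
An upsert updates the document if its key already exists and inserts it otherwise, so re-crawling the same item never raises a duplicate key error. In MongoDB this is requested through the upsert option of `update`; the sketch below only models the semantics with an in-memory `Map`. The class `Collection`, its methods, and the key `"tw-1"` are hypothetical names for illustration.

```javascript
// Minimal in-memory stand-in for a collection with a unique _id.
class Collection {
  constructor() {
    this.docs = new Map();
  }

  // Plain insert: fails when the key already exists.
  insert(doc) {
    if (this.docs.has(doc._id)) {
      throw new Error("duplicate key: " + doc._id);
    }
    this.docs.set(doc._id, doc);
  }

  // Upsert: merge into the existing document or create a new one.
  upsert(id, fields) {
    const existing = this.docs.get(id) || { _id: id };
    this.docs.set(id, { ...existing, ...fields });
  }
}

const stream = new Collection();
stream.upsert("tw-1", { message: "first crawl" });
stream.upsert("tw-1", { message: "second crawl" }); // no duplicate key error
console.log(stream.docs.size, stream.docs.get("tw-1").message); // 1 second crawl
```

Since the stream id is derived from the platform's own id, crawling the same post twice maps to the same key, which is exactly the case where a plain insert would collide.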
db.serverStatus()

How much memory is in use, how many connections, ...

globalLock           How long collections were locked, ...

connections          How many connections are open / available, ...

backgroundFlushing   When the last flush to disk took place, ...

...                  More information in the documentation:
                     http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
Profiling

db.setProfilingLevel(0)   off

db.setProfilingLevel(1)   log slow operations (> 100 ms); the "slow" threshold can
                          optionally be set with db.setProfilingLevel(1, 10)

db.setProfilingLevel(2)   log all operations

system.profile
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
{ "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
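
Entries in this shape carry most of their detail inside the free-text `info` string. A small parser can pull out the operation, the namespace, and the `nscanned` counter for further analysis. This is a sketch based only on the four sample lines above: `parseInfo` and `slowOps` are hypothetical helpers, and the regular expressions are assumptions about the format rather than a documented grammar.

```javascript
// Two entries in the shape shown above.
const entries = [
  { ts: "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)",
    info: "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ",
    millis: 0 },
  { ts: "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)",
    info: "update musweet.media  query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1",
    millis: 0 }
];

// Extract operation, namespace, nscanned and the "moved" marker.
function parseInfo(info) {
  const head = /^(\w+)\s+(\S+)/.exec(info);
  const scanned = /nscanned:(\d+)/.exec(info);
  return {
    op: head ? head[1] : null,
    ns: head ? head[2] : null,
    nscanned: scanned ? Number(scanned[1]) : null,
    moved: /\bmoved\b/.test(info)
  };
}

// Keep only entries that took at least `minMillis` milliseconds.
function slowOps(list, minMillis) {
  return list.filter(e => e.millis >= minMillis);
}

console.log(entries.map(e => parseInfo(e.info)));
console.log(slowOps(entries, 100).length); // 0
```

A high `nscanned` relative to the number of documents returned is the usual hint that a query is missing an index.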
Analyzing collection objects

Download: http://github.com/compuccino/mongodb-ac
In closing...

• Questions?

• More about us:

  http://compuccino.com

  http://facebook.com/compuccino

• People:

  Grischa Andreew, @grischaandreew

  Nader Cserny, @nadr

Más contenido relacionado

Último

VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Último (20)

VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 

musweet.com: Handling Humongous Data Sets from the Social Web

  • 1. musweet.com Handling Humongous Data Sets from the Social Web Grischa Andreew & Nader Cserny, compuccino
  • 2. Agenda • Einführung • Technik • Daten im Detail • Abfragen • Tools & Debugging • Fragen
  • 10. Statistik 7002 KÜNSTLER 20.848 3.914.259 SOCIAL MEDIA PROFILE STREAM ITEMS ~3450 READ QUERIES / SEC 309.855 MEDIEN (AUDIO, VIDEO, FOTO) ~15.000 INSERTS / DAY 3.75 GB DATA SIZE
  • 13. Was brauchen wir? KÜNSTLER Name, City, Genre, Bild SOCIAL PROFILE Plattform (z.B. twitter), Link MEDIA POSTS Bilder, Videos, Audios, Statusmeldungen PROFIL INFO Freunde, Follower, Datum, Webseiten, Pro lbild, Biographie, Label, etc.
  • 14. MySQL Schema Artist id name Indexs name Numbers artist socialprofile outgoing Socialprofile artist_genres incoming id artist feedback artist genre push url Indexes Service Informations Twitter artist Indexes service artist_genre lang, artist_outgoing Indexes verified artist_incoming artists_service location, artist_feedback id, artist_push url, artist created_at, description, Stream time_zone, artist Genres profile_image_url, socialprofile id screen_name Service Informations Facebook message name Indexes artist created_at Indexes artist category, Indexes name name, message fan_count, artist_created_at Service Informations Myspace bio, created_at artist url, website, username, genre, record_label, location, location, art_des_labels, profile_image_url, Stream Informations Facebook Stream Informations Twitter headline, band_members, stream Stream Informations Myspace stream created_at website, name, stream source, id, ink, caption, category, in_reply_to_status_id, profile_image_url, pinnwand_posts, link, image, in_reply_to_user_id, label genre, likes, link, truncated, Indexes friends, type, source deleted artist id icon Indexes Indexes Indexes Indexes stream stream artist stream
  • 15. MongoDB Schema Artist id (Object Id) name (str) genres (strict array) socialprofiles (strict array) service (dbref) url (str) numbers (strict array) incoming outgoing push feedback date meta (array) (unterschiedliche Felder, ja nach Plattform) Indexes name, genres, socialprofiles.service, socialprofiles.numbers Stream id (Hash aus facebook / myspace / twitter id) socialprofile (dbref) genres (strict array) (redundanz der genres vom artists um den stream direkt über genres abzufragen) data (array) ( data from plattforms, field message is a must have) created_at (datetime) Indexes socialprofile, genres, data.message
  • 16. Wie kommen wir an die Daten? (Einfach) Crawler musweet • Verarbeitung von Links Links • Darstellung der Inhalte • Extraktion von Medien • Zuordnung Artist / Service • Aufbereitung der Inhalte Daten
  • 19. Künstler Pro l bei MySpace "numbers" : { "outgoing" : 221665, "incoming" : 770355, "feedback" : 36862603, "push" : 0, "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)", }, "meta" : { "website" : "http://www.snoopdogg.com", "genre" : "Hip Hop / Rap / R&B", "location" : "Long Beach, California Vereinigte Staaten von Amerika", "art_des_labels" : "Major", "headline" : "", "created_at" : "Sat Dec 11 2004 01:00:00 GMT+0100 (CET)", "id" : 6344278, "profile_image_url" : "http://c1.ac-images.myspacecdn.com/images02/130/ m_9857dcca155247b69e1260e6e34cce3c.jpg", "label" : "Doggystyle / Priority" }
  • 20. Künstler Pro l bei twitter "numbers" : { "outgoing" : 1204, "incoming" : 2030350, "feedback" : 22750, "push" : 3145, "date" : "Wed Sep 29 2010 02:00:00 GMT+0200 (CEST)" }, "meta" : { "lang" : "en", "verified" : true, "location" : "LBC", "id" : 3004231, "url" : "http://www.snoopdogg.com", "created_at" : "Fri Mar 30 2007 21:05:42 GMT+0200 (CEST)", "description" : "More Malice CD + DVD IN STORES NOW", "time_zone" : "Pacific Time (US & Canada)", "profile_image_url" : "http://a3.twimg.com/profile_images/1096549203/snoop_normal.jpg", "screen_name" : "SnoopDogg" }
  • 21. Künstler Pro l bei facebook "numbers" : { "outgoing" : 0, "incoming" : 2930860, "feedback" : 0, "push" : 0, }, "meta" : { "category" : "Musicians", "name" : "Snoop Dogg", "fan_count" : 2930860, "bio" : "The offices at the top of the Capitol Records building in Hollywood are home to some of Southern California’s most awe-inspiring views. ....", "url" : "http://www.facebook.com/snoopdogg?v=info", "username" : "snoopdogg", "record_label" : "Priority/Doggystyle ", "location" : "Long Beach, CA", "profile_image_url" : "http://profile.ak.fbcdn.net/hprofile-ak-snc4/ hs622.snc3/27524_11455644806_1192_s.jpg", "band_members" : "Snoop Dogg", "website" : [ "http://www.snoopdogg.com", "http://www.myspace.com/snoopdogg", "http://twitter.com/snoopdogg" ], "link" : "http://www.facebook.com/snoopdogg", "pinnwand_posts" : 0, "genre" : "Hip Hop / Rap / R&B", "friends" : 0, "id" : "11455644806" }
  • 24. MySQL vs. MongoDB (1) Alle Social Media Pro le mit Follower-Zahlen von einem Artist MySQL MongoDB SELECT db.artist.find( { "name": "Snoop Dogg" } ) n.incoming, a.id as artist, a.name as artist_name, Dauer: 0.0001 Sek. s.id as socialprofile, s.url as socialprofile_url, FROM numbers as n JOIN socialprofile as s on s.id = n.socialprofile JOIN artist as a on a.id = n.artist WHERE a.name = "Snoop Dogg" ORDER n.incoming DESC Dauer: 0.0288 Sek.
  • 25. MySQL vs. MongoDB (2) 10 HipHop Musiker mit den meisten Followern MySQL MongoDB SELECT db.artist.find( { n.incoming, "genre": DBRef("genre","hiphop") a.id as artist, } ).sort( { a.name as artist_name, "socialprofiles.numbers.incoming": -1 s.id as socialprofile, } ).limit(10) s.url as socialprofile_url, FROM Dauer: 0.0230 Sek. numbers as n JOIN artist_genres as ag on ag.artists = n.artist JOIN genres as g on g.id = ag.genre JOIN socialprofile as s on s.id = n.socialprofile JOIN artist as a on a.id = n.artist WHERE g.name = "Hip/Hop" ORDER BY n.incoming DESC LIMIT 10 Dauer: 0.8741 Sek.
  • 26. MySQL Index • Index wird von links nach rechts gelesen Reihenfolge wichtig Felder: „artist“, „incoming“, „push“, „date“ SELECT * FROM numbers WHERE artist = 1 Funktioniert SELECT * FROM numbers WHERE incoming = 1 Funktioniert nicht SELECT * FROM numbers WHERE artist = 1 AND push < 10 Funktioniert nicht SELECT * FROM numbers WHERE artist = 1 AND push < 10 AND incoming > 0 Funktioniert • Index Debugging EXPLAIN SELECT * FROM numbers WHERE artist = 1
  • 27. MongoDB Index • Index Reihenfolge ist egal kann ein Feld mitten im Index verwenden db.artist.ensureIndex( {"name":1, "numbers": -1 } ); db.artist.find( { "name": "Snoop Dog" } ) Funktioniert db.artist.find( { "socialprofiles.numbers.incoming": { "$gte": 10 } } ) Funktioniert db.artist.find( { "name": "Snoop Dogg", "socialprofiles.numbers.incoming": { "$gte": 0 } } ) Funktioniert • Index Debugging db.artist.find( { "name": "Snoop Dogg" } ).explain()
  • 30. MongoDB Fehlermeldungen • Sortierte Abfrage ohne Limit: Fehler: „too much data for sort() with no index. add an index or specify a smaller limit“ Lösung: Feld in den Index aufnehmen • Duplicate Key Error: Fehler: in älteren Versionen (< 1.6.0) schmiert DB bei zu vielen Duplicate Key Errors ab Lösung: Upsert verwenden
  • 31. db.serverStatus() Wieviel memory-Verbrauch, wieviele Connections, ... globalLock Wie lange Collections gesperrt waren, ... connections Wieviel Verbindungen offen / verfügbar, ... backgroundFlushing Wann war der letzte Flush auf die Festplatte, ... ... Mehr Info in der Dokumentation: http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics
  • 32. Profiling
      db.setProfilingLevel(0)    off
      db.setProfilingLevel(1)    log slow operations (> 100 ms); "slow" can optionally be defined with db.setProfilingLevel(1, 10)
      db.setProfilingLevel(2)    log all operations
      Sample entries from system.profile:
      { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b5ea) } nscanned:1 moved ", "millis" : 0 }
      { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b468) } nscanned:1 moved ", "millis" : 0 }
      { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607fd9a68299079400c067) } nscanned:1", "millis" : 0 }
      { "ts" : "Tue Aug 10 2010 00:25:33 GMT+0200 (CEST)", "info" : "update musweet.media  query: { _id: ObjId(4c607ef2a68299079400b6a6) } nscanned:1 moved ", "millis" : 0 }
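The effect of the three profiling levels can be modelled in a few lines. This is a simplified sketch, not the server's logic: the function `shouldLog`, the helper `countLogged`, and the sample operations are hypothetical, and the strict "greater than" comparison follows the "> 100 ms" wording on the slide.

```javascript
// Simplified model of the three profiling levels.
function shouldLog(level, slowMs, op) {
  if (level === 0) return false; // profiling off
  if (level === 2) return true;  // log every operation
  return op.millis > slowMs;     // level 1: slow operations only
}

// Hypothetical operations with their durations in milliseconds.
const ops = [
  { info: "update musweet.media", millis: 0 },
  { info: "query musweet.artist", millis: 12 },
  { info: "query musweet.media", millis: 250 },
];

// Number of operations that would end up in the profile log.
function countLogged(level, slowMs) {
  return ops.filter((op) => shouldLog(level, slowMs, op)).length;
}

console.log(countLogged(0, 100), countLogged(1, 100), countLogged(1, 10), countLogged(2, 100));
```

Lowering the threshold from 100 ms to 10 ms at level 1 captures the 12 ms query as well, which is the purpose of the optional second argument.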
  • 33. Analyzing collection objects
      Download: http://github.com/compuccino/mongodb-ac
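The idea behind such an analysis, counting how many distinct object structures occur in a collection, can be sketched as follows. This is not the code of the linked tool. The functions `signature` and `countStructures` and the sample documents are hypothetical, and arrays are treated as leaf values for simplicity.

```javascript
// List the sorted key paths of a document, ignoring the stored values.
function signature(value, prefix = "") {
  const isLeaf = value === null || typeof value !== "object" || Array.isArray(value);
  if (isLeaf) return [prefix];
  const keys = Object.keys(value).sort();
  if (keys.length === 0) return prefix ? [prefix] : [];
  return keys.flatMap((key) => signature(value[key], prefix ? prefix + "." + key : key));
}

// Count how often each distinct structure occurs in a list of documents.
function countStructures(docs) {
  const counts = new Map();
  for (const doc of docs) {
    const key = signature(doc).join(",");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Hypothetical documents with platform-specific "meta" objects.
const sample = [
  { name: "A", meta: { profile_image_url: "x" } },
  { name: "B", meta: { profile_image_url: "y" } },
  { name: "C", meta: { fans: 10, link: "z" } },
];

const structures = countStructures(sample);
for (const [shape, count] of structures) {
  console.log(count + " x " + shape);
}
```

With a schemaless store, a report like this shows at a glance which document layouts the application code has to handle.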
  • 36. In closing...
      • Questions?
      • More about us: http://compuccino.com http://facebook.com/compuccino
      • People: Grischa Andreew, @grischaandreew; Nader Cserny, @nadr

Editor's notes

  1. Scalability and handling of large data sets in our project musweet.com
  2. What is musweet? Why we chose MongoDB. Comparison between the old MySQL-based system and MongoDB. Interesting queries. Which tools and debugging methods we use.
  3. A website about music and its players in the social web: it measures and rates online activity in real time, analyzes data sources and presents them, and shows photos, music, and videos of bands and musicians. Experience gained with MySQL at wahl.de, now implemented with MongoDB at musweet.com.
  4. Media stream with link expander (= embedded media are displayed directly on the page). We currently crawl MySpace, Facebook, and Twitter; blogs and YouTube will follow later. The stream can be filtered by genre.
  5. Most discussed topics of the last 7 days
  6. Who gained the most friends (Big Mover), who wrote the most messages (Big Shaker). Filterable by genre. Updated daily.
  7. Master data of an artist. Social media profiles => where the musician is active on the web. Media stream of the musician. Upcoming concerts. Related artists: similar in genre and with similar contact counts.
  8. Growing data set. Activity from the social web demands high insert performance. We started with well-known artists and will expand later.
  9. We have artists with various social profiles, each of which carries different profile / stream information. The stream and the profile information should be sortable by the artist's attributes (genres, ...).
  10. For every additional service we need two more tables (profile information, stream). For every additional multi-valued attribute on the artist / social profile we need a join table and a data table (artist -> artists_genres -> genres). Because of the many tables, querying the data is not easy, and every change must be implemented in both the backend and the frontend.
  11. A drastically reduced schema is possible. A new attribute only requires a new entry in the object (without touching the DB). Changes are implemented in the backend; the frontend does not need to be changed.
  12. The crawler is a standalone application and manages the crawls for several client apps such as musweet.com. musweet.com registers the social profiles with the crawler and receives a push notification when a profile changes or a new message is posted.
  13. The numbers object is fixed and always has the same structure. The meta object is filled with platform-specific data.
  14. For Twitter we have different information than for MySpace. "profile_image_url" denotes the artist's profile picture on the platform.
  15. For Facebook we usually have more information than for the other platforms, depending on the Facebook account type (fan page / user profile).
  16. MySQL: either a JOIN or 3 SELECTs. MongoDB queries are much simpler and faster.
  17. MySQL: even more JOINs or SELECT statements. MongoDB: a DBRef to the genre.
  18. Many different indexes are required => many GB of data
  19. Indexes are more space-efficient and easier to use. A MongoDB index can contain only one multikey field (a field holding array data).
  20. Error messages we ran into during development. The "too much data for sort()" error only shows up later, once the DB holds a lot of data.
  21. globalLock: how long the lock was held. mem: how much memory is used. indexCounters: how many hits, how many misses. connections: how many open, how many available. opcounters: how many inserts, updates, deletes. backgroundFlushing: when the last flush happened.
  22. Log slow database queries or all queries. Profiling is set at the database level.
  23. A useful tool for finding out how many different object structures a collection contains and for inspecting their layout.