Randomly Sampling YouTube Users:
 An Introduction to Random Prefix
         Sampling Method




             Cheng-Jun Wang

               Web Mining Lab
        City University of Hong Kong
                  20121225
YouTube growth curve




http://singularityhub.com/2012/05/25/now-serving-the-latest-in-exponential-growth-youtube/



https://gdata.youtube.com/feeds/api/standardfeeds/most_recent
Contents
Plan A: Sampling Users

∗ Unfortunately, YouTube’s user identifiers do not follow a
  standard format; they are user-specified strings. We were
  therefore unable to create a random sample of YouTube users.

  Mislove et al. (2007) Measurement and Analysis of Online Social Networks. IMC
Plan B: Sampling Videos

∗ Using the YouTube search API, Zhou et al. developed a random
  prefix sampling method and estimated that there were roughly
  500 million YouTube videos by May 2011.
∗ Sample the videos first, and then find the respective users.




  Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Get proportional users?

∗ Limitation: selection bias towards users who upload more
  videos. Therefore, to get a random sample of YouTube users,
  each sampled user must be weighted inversely to their number
  of videos (weight = max video count / user's video count).
∗ Is it possible?



[Figure: videos crawled and users detected]

Before weighting:

User ID   Video Num   Weight Factor   Active Days
1         10          1               20
2         5           2               15
3         1           10              1

After weighting cases (each user repeated Weight Factor times):

User ID   Video Num   Active Days
1         10          20
2         5           15
2         5           15
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
3         1           1
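The weighting idea in the table can be sketched in a few lines of Python. This is only an illustration under assumptions: the three user records are the ones from the table, and the names `users` and `weight_factors` are hypothetical. The weighted mean is equivalent to repeating each user as many times as its weight factor.

```python
# Hypothetical per-user records taken from the table:
# user id -> (video count, active days).
users = {1: (10, 20), 2: (5, 15), 3: (1, 1)}

def weight_factors(video_counts):
    """Return weight = max(video count) / video count for each user."""
    top = max(video_counts.values())
    return {uid: top / n for uid, n in video_counts.items()}

counts = {uid: rec[0] for uid, rec in users.items()}
weights = weight_factors(counts)

# Weighted mean of active days; same result as duplicating each user
# `weight` times (the "weight cases" step) and taking a plain mean.
total_w = sum(weights.values())
weighted_active_days = sum(weights[u] * users[u][1] for u in users) / total_w
unweighted_active_days = sum(rec[1] for rec in users.values()) / len(users)

print(weights)
print(weighted_active_days, unweighted_active_days)
```

Heavy uploaders are down-weighted, so the weighted mean moves towards the values of users with few videos.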
Strategy




∗   64^10*16 = 2^64 = 1.844674e+19 possible ids (11 characters; the last
    one takes only 16 values)
∗   YouTube video ids are randomly generated from the id space
∗   The sampling space is far too large to probe id by id!
∗   Any good ideas?
∗   http://www.youtube.com/watch?v=1yo0zBFCMxo
∗   http://www.youtube.com/watch?v=_OBlgSz8sSM
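To see why guessing ids at random is hopeless, the size of the id space can be worked out directly. The sketch below assumes a 64-symbol alphabet (A-Z, a-z, 0-9, "-" and "_") for the first ten characters and 16 possible values for the eleventh, as described by Zhou et al.; the constant names are illustrative only.

```python
ALPHABET_SIZE = 64      # assumption: A-Z, a-z, 0-9, "-" and "_"
LAST_CHAR_VALUES = 16   # assumption: the 11th character takes only 16 values
ID_LENGTH = 11

id_space = ALPHABET_SIZE ** (ID_LENGTH - 1) * LAST_CHAR_VALUES

# Rough 2011 video count quoted from Zhou et al.
videos = 500_000_000

# Chance that one uniformly random id points at an existing video.
hit_probability = videos / id_space

print(id_space == 2 ** 64)
print(f"{id_space:.6e}")
print(f"{hit_probability:.2e}")
```

Under these assumptions a random id hits a real video only a few times in every hundred billion guesses, which is the motivation for sampling by prefix instead.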
YouTube Search API
∗ One unique property of YouTube search API we find is that when searching
  using a keyword string of the format “watch?v=xy...z” (including the quotes)
  where “xy...z” is a prefix (of length L, 1 ≤ L ≤ 11) of a possible YouTube video id
  which does not contain the literal “-” in the prefix, YouTube will return a list
  of videos whose id’s begin with this prefix followed by “-”, if they exist.
∗ YouTube limits the number of returned results for any query.
∗ When the prefix is short (e.g., 1 or 2), it is more likely that the returned
  search results contain “noisy” video ids; also, the short prefix may
  match a large number of videos.
∗ In contrast, if the prefix is too long (e.g., 6 or 7), no result may be returned
  by the search engine.
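The query construction described on this slide can be sketched without touching the network. Everything here is illustrative: the function names are hypothetical, the 64-symbol alphabet is an assumption, and `fake_results` is made-up data standing in for an API response.

```python
import random
import string

# Assumption: ids are drawn from this 64-symbol alphabet.
ALPHABET = string.ascii_letters + string.digits + "-_"

def random_prefix(length, rng):
    """Draw a random prefix that contains no "-" (as the query requires)."""
    symbols = ALPHABET.replace("-", "")
    return "".join(rng.choice(symbols) for _ in range(length))

def build_query(prefix):
    """Quoted keyword string; the quotes are part of the search term."""
    return '"watch?v=%s"' % prefix

def keep_valid(video_ids, prefix):
    """Drop noisy results: keep only ids that start with prefix + "-"."""
    return [v for v in video_ids if v.startswith(prefix + "-")]

rng = random.Random(42)
prefix = random_prefix(4, rng)
print(build_query(prefix))

# Made-up ids standing in for an API response.
fake_results = ["1yo0-AbCdEf", "1yo0zBFCMxo", "xx1yo0-aaaa"]
print(keep_valid(fake_results, "1yo0"))
```

Filtering on the prefix followed by "-" discards both ids that merely share the prefix and ids that contain it somewhere in the middle.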
Practice

∗ However, in practice, a prefix of length L < 5 usually matches
  more than one hundred results, and the YouTube API can only
  return at most 30 ids for each prefix query.
∗ On the other hand, based on our experimental results, a prefix
  with length L = 5 always contains fewer than 10 valid ids.
∗ Therefore, a prefix length of 5 is a good choice in practice.
∗ They find that querying prefixes with a prefix length of four
  will return ids having a “-” in the fifth place, which provides
  a result set big enough that each prefix returns some results
  and small enough to never reach the result limit set by the API.
∗ Zhou et al. found that there were about 500 million YouTube
  videos by 2011!




        Zhou et al. (2011) Counting YouTube Videos via Random Prefix Sampling. IMC
Python and gdata


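gdata
∗ gdata is a Python module for connecting to Google data services
  (including YouTube) via their APIs.

Code
import gdata.youtube.service

def SearchAndPrint(search_terms):
    # Build a client and a video search query.
    yt_service = gdata.youtube.service.YouTubeService()
    query = gdata.youtube.service.YouTubeVideoQuery()
    query.vq = search_terms        # search keywords
    query.orderby = 'viewCount'    # sort results by view count
    query.racy = 'include'         # include restricted content
    feed = yt_service.YouTubeQuery(query)
    # PrintVideoFeed is the helper defined in the YouTube developers guide.
    PrintVideoFeed(feed)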
Test Validity

∗ http://www.youtube.com/watch?v=1yo0zBFCMxo
∗ The Secret State - The Biggest Mistake - Official Lyric Music
  Video
∗ searchApi("watch?v=1yo0z")
∗ Can’t find the video!
Restricted query term

∗ searchApi('"watch?v=1yo0"')
Compare two random samples

# summary(da$Freq)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    7.00   25.00   17.15   25.00   75.00

# summary(db$Freq)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    1.00    8.00   25.00   17.57   25.00   50.00
There were about 604 million videos on YouTube by Dec 2012!
∗ length(unique(subset(a[,1], b[,1]%in%a[,1]))) == 26
∗ 34361/x = 125/34361
∗ x = (34361^2/125)*64 ≈ 604507300
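The arithmetic on this slide has the shape of a capture-recapture (Lincoln-Petersen) estimate, and the sketch below reproduces it on that reading. Two points are assumptions rather than statements from the slide: that both samples contained 34361 ids with 125 ids in common, and that the factor 64 scales up from ids with "-" in the fifth place to the full 64-symbol alphabet. The function name is hypothetical.

```python
def lincoln_petersen(n1, n2, overlap):
    """Estimate population size from two samples and their overlap."""
    return n1 * n2 / overlap

n1 = n2 = 34361    # assumption: size of each sample, read from the slide
overlap = 125      # ids appearing in both samples, read from the slide

# Videos in the part of the id space reachable by the prefix query.
covered = lincoln_petersen(n1, n2, overlap)

# Assumption: only 1 in 64 ids has "-" in the fifth place.
total = covered * 64

print(round(total))
```

The estimate is sensitive to the overlap count, since it sits in the denominator.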
Numeric simulation of random prefix sampling
# Use degreenet to simulate a discrete Pareto distribution
library(degreenet)
a <- simdp(n = 100000, v = 3.5, maxdeg = 10000)   # videos per user

# Expand to one row per video: uid, count, vid
b <- data.frame(cbind(c(1:length(a)), a))
c <- b[rep(1:nrow(b), b$a), ]
c$vid <- c(1:length(c$a))
names(c) <- c("uid", "count", "vid")

# Sample 2000 videos at random, then keep each uploader once
id <- sample(c(1:length(c$vid)), 2000, replace = F)
ds <- subset(c, c$vid %in% id)
dat <- subset(ds, !duplicated(ds$uid))

hist(dat$count)

# Video-count distributions: whole population vs. sampled users
da <- as.data.frame(table(a))
ds <- as.data.frame(table(dat$count))

plot(log(da[,2]) ~ log(as.numeric(as.character(da[,1]))), xlab = "Number of Videos (Log)", ylab = "Frequency (Log)")
points(log(ds[,2]) ~ log(as.numeric(as.character(ds[,1]))), pch = 2, col = "red")
# Legend symbols match the plotted ones: default circle (1) and triangle (2)
legend("topright", c("population", "sample"),
       col = c("black", "red"),
       cex = 0.9, pch = c(1, 2))
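The same experiment can be sketched in Python to make the selection bias visible as a single number. This is not a port of the R code: a floored Pareto draw from the standard library stands in for the discrete Pareto distribution, and the shape value 2.5, the user count, and all variable names are arbitrary assumptions.

```python
import random

rng = random.Random(0)

# Assumption: floored Pareto draws approximate a heavy-tailed
# videos-per-user distribution; every user has at least one video.
n_users = 20000
video_counts = [int(rng.paretovariate(2.5)) for _ in range(n_users)]

# One entry per video, holding the index of the uploader.
video_owner = [uid for uid, n in enumerate(video_counts) for _ in range(n)]

# Sample videos uniformly without replacement, then deduplicate uploaders.
sampled_videos = rng.sample(video_owner, 2000)
sampled_users = set(sampled_videos)

population_mean = sum(video_counts) / n_users
sample_mean = sum(video_counts[u] for u in sampled_users) / len(sampled_users)

print(population_mean, sample_mean)
```

Users reached through their videos upload more on average than the population as a whole, which is the bias the weighting step is meant to correct.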
Reference

∗ Zhou et al. (2011) Counting YouTube Videos via Random Prefix
  Sampling. IMC
∗ Mislove et al. (2007) Measurement and Analysis of Online Social
  Networks. IMC
∗ YouTube developers guide for Python
  https://developers.google.com/youtube/1.0/developers_guide_python
∗ Introduction to the library of gdata.youtube
  http://gdata-pythonclient.googlecode.com/svn/trunk/pydocs/gdata.youtube.html#YouTubeVideoEntry