SlideShare una empresa de Scribd logo
1 de 28
Subreddit
Subcultures
Insight Data Engineering Fellowship, Silicon Valley
David Lyon
Find your Reddit Subculture
2007 - Impersonal Web
2017 - Personal Web
Reddit Comment Dataset
2 billion comments
1 million
subreddits
Personalization of Reddit Over Time
Reddit Clustering App
https://youtu.be/XHczo0TM17E
Data Pipeline
Ingestion / Processing User Interface
Challenge 1:
Data Size
Every month on
Reddit:
● Reddit is too big to cluster
directly!
● The raw clustering matrix
has 200 billion elements.
60k Subreddits
3 million unique authors
Solution 1a:
Filtering
Every month on
Reddit:
● Filter for activity: 100
comments/month
● Active clustering matrix has
200 million elements
● Now 1000 times faster to
cluster
6k active
Subreddits
30k active
authors
Solution 1b: PCA
Every month on
Reddit:
● PCA transforms author
space to shared interest
space by finding correlations
● PCA shrinks dimensionality
by another 100 times 300
shared
interests
6k active
Subreddits
Challenge 2: Slow PCA
Even on a cluster, PCA takes too long
on 200 million elements: 100 minutes
on 9 Spark workers.
PCA scales as O(MI)
M is the number of matrix elements
I is the number of interests after PCA
Over 80% of total time!
Solution 2: Random PCA
Use Facebook Research Random PCA
(2014) on a single node
Fbpca is O(M ln(I))
For 250 interests, FBPCA is 45 times
faster! One FBPCA worker is 5x faster
than 9 full PCA workers.
5x faster for an average sized month
Challenge 3: Finding K for K-Means Clustering
Number of clusters is not the same
as number of PCA shared
interests
Clustering can happen on more
than one scale
Football
Baseball
TV
Movies
Solution 3: Silhouette Analysis
Silhouette Analysis reveals
clustering scale at small k
Also reveals a second clustering
scale of around 400 clusters
in this case
A Happy Medium
Too impersonal Too personalized
David Lyon
PhD Physics from the University of Illinois
Doing GPU simulations
I love hiking, table tennis, and astrophysics
Next Steps - Random PCA for Spark.ml
Step 1: Learn Scala!
Step 2: Contribute to Open Source community
Step 3: Streaming Random PCA?
Next Steps - Popular Topics by Cluster
Find the popular topics within each cluster using Term-Frequency Inverse-
Document-Frequency (TF-IDF) or LDA
Terms are 1-grams and 2-grams used in each cluster, and the document
frequency is over all of reddit for that month.
Challenge 2:
Every month on
Reddit:
● Too many individual authors
● Need to cluster by shared
interests, not author 30k active
authors
6k active
Subreddits
Challenge 3: Finding K for K-Means
Number of clusters is not the same
as number of PCA shared
interests
Clustering can happen on more
than one scale
Football
Baseball
TV
Movies
Random PCA
Complexity of PCA is O(mnk) for m rows, n input columns, k output columns
FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR
CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS (Nathan Halko,
2009)
Fast Randomized SVD (Facebook Research, 2014)
Complexity of Random PCA is O(mn ln(k))
For k=100, Random PCA is more than 20x faster!
Before PCA
Football 2 1
Baseball 3 1 15
TV 5 2 22
Movies 1 21 1 2
Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth99
9,999
Auth
1,000,0
00
After PCA
Football 80 2 1
Baseball 90 3 2
TV 6 80 77
Movies 2 80 20
Sub Sporting Fictional Political
Anatomy of a Reddit Comment
BodyAuthorDate Subreddit
Group by Month
Group by Subreddit
Count #comments by author per subreddit
Normalize authors so each author has
mean=0 and variance = 1
Growth in Number of Subreddits
40 subreddits
1 million subreddits
Week 4 Challenges
● Spark for iterative machine learning because Spark can
mapreduce in memory
● By reducing the dimension of data,
● No streaming - clustering requires lots of data & clusters
change slowly, but time window reduced from monthly to
daily
Clustering is Universal
Galaxies cluster into
superclusters of ~100k
members
The red dot is our galaxy
● Human knowledge is clustered - purple for physics, blue for
chemistry, green for biology and medicine.
● The big blob to the upper left is Liberal Arts.
Subreddit Clustering
Monthly graph from 10k subreddits X 2 million authors = 10 billion
matrix entries
Drastically reduce the size of data using Principal Component
Analysis, normalized so that larger subreddits aren’t favored
Cluster in reduced dimensional space using K-means
Topics within Clusters based on relative frequency of 1-grams and
Social media brings us closer
Continual contact with over 1 billion people
We can find people who share our exact interests
...and separates us
● Less tolerance for differences - unfriend
or ban from community!
● Online communities become bubbles
isolated from each other

Más contenido relacionado

Similar a Subreddit Subcultures

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 
Monitoring using Open source technologies
Monitoring using Open source technologiesMonitoring using Open source technologies
Monitoring using Open source technologiesUTKARSH BHATNAGAR
 
SE2016 BigData Denis Reznik "Data driven future"
SE2016 BigData Denis Reznik "Data driven future"SE2016 BigData Denis Reznik "Data driven future"
SE2016 BigData Denis Reznik "Data driven future"Inhacking
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Symeon Papadopoulos
 
Immersive Recommendation
Immersive RecommendationImmersive Recommendation
Immersive Recommendation承剛 謝
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsNatalino Busa
 
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...AIST
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)Kunwoo Park
 
Msr2010 ibrahim
Msr2010 ibrahimMsr2010 ibrahim
Msr2010 ibrahimSAIL_QU
 
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and Graphs
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and GraphsSTING: A Framework for Analyzing Spacio-Temporal Interaction Networks and Graphs
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and GraphsJason Riedy
 
Dynamic Data Community Discovery
Dynamic Data Community DiscoveryDynamic Data Community Discovery
Dynamic Data Community DiscoverySarang Rakhecha
 
Conor Hayes - Topics, tags and trends in the blogosphere
Conor Hayes - Topics, tags and trends in the blogosphereConor Hayes - Topics, tags and trends in the blogosphere
Conor Hayes - Topics, tags and trends in the blogosphereDERIGalway
 
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...Pei Lee
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept Miha Ahronovitz
 

Similar a Subreddit Subcultures (20)

Insight presentation
Insight presentationInsight presentation
Insight presentation
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 
Monitoring using Open source technologies
Monitoring using Open source technologiesMonitoring using Open source technologies
Monitoring using Open source technologies
 
Talk @ GrafanaCon 2016
Talk @ GrafanaCon 2016Talk @ GrafanaCon 2016
Talk @ GrafanaCon 2016
 
Denis Reznik Data driven future
Denis Reznik Data driven futureDenis Reznik Data driven future
Denis Reznik Data driven future
 
SE2016 BigData Denis Reznik "Data driven future"
SE2016 BigData Denis Reznik "Data driven future"SE2016 BigData Denis Reznik "Data driven future"
SE2016 BigData Denis Reznik "Data driven future"
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
 
Immersive Recommendation
Immersive RecommendationImmersive Recommendation
Immersive Recommendation
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
 
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...
Dmitry Bugaychenko - Smart.Data@ОК.ru. How to make the world a bit better usi...
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
 
Msr2010 ibrahim
Msr2010 ibrahimMsr2010 ibrahim
Msr2010 ibrahim
 
Hendrickson data2 2012-gnip
Hendrickson data2 2012-gnipHendrickson data2 2012-gnip
Hendrickson data2 2012-gnip
 
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and Graphs
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and GraphsSTING: A Framework for Analyzing Spacio-Temporal Interaction Networks and Graphs
STING: A Framework for Analyzing Spacio-Temporal Interaction Networks and Graphs
 
Dynamic Data Community Discovery
Dynamic Data Community DiscoveryDynamic Data Community Discovery
Dynamic Data Community Discovery
 
Conor Hayes - Topics, tags and trends in the blogosphere
Conor Hayes - Topics, tags and trends in the blogosphereConor Hayes - Topics, tags and trends in the blogosphere
Conor Hayes - Topics, tags and trends in the blogosphere
 
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...
[ICDE 2014] Incremental Cluster Evolution Tracking from Highly Dynamic Networ...
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...amitlee9823
 

Último (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 

Subreddit Subcultures

  • 1. Subreddit Subcultures Insight Data Engineering Fellowship, Silicon Valley David Lyon Find your Reddit Subculture
  • 4. Reddit Comment Dataset 2 billion comments 1 million subreddits
  • 5. Personalization of Reddit Over Time Reddit Clustering App https://youtu.be/XHczo0TM17E
  • 6. Data Pipeline Ingestion / Processing User Interface
  • 7. Challenge 1: Data Size Every month on Reddit: ● Reddit is too big to cluster directly! ● The raw clustering matrix has 200 billion elements. 60k Subreddits 3 million unique authors
  • 8. Solution 1a: Filtering Every month on Reddit: ● Filter for activity: 100 comments/month ● Active clustering matrix has 200 million elements ● Now 1000 times faster to cluster 6k active Subreddits 30k active authors
  • 9. Solution 1b: PCA Every month on Reddit: ● PCA transforms author space to shared interest space by finding correlations ● PCA shrinks dimensionality by another 100 times 300 shared interests 6k active Subreddits
  • 10. Challenge 2: Slow PCA Even on a cluster, PCA takes too long on 200 million elements: 100 minutes on 9 Spark workers. PCA scales as O(MI) M is the number of matrix elements I is the number of interests after PCA Over 80% of total time!
  • 11. Solution 2: Random PCA Use Facebook Research Random PCA (2014) on a single node Fbpca is O(M ln(I)) For 250 interests, FBPCA is 45 times faster! One FBPCA worker is 5x faster than 9 full PCA workers. 5x faster for an average sized month
  • 12. Challenge 3: Finding K for K-Means Clustering Number of clusters is not the same as number of PCA shared interests Clustering can happen on more than one scale Football Baseball TV Movies
  • 13. Solution 3: Silhouette Analysis Silhouette Analysis reveals clustering scale at small k Also reveals a second clustering scale of around 400 clusters in this case
  • 14. A Happy Medium Too impersonal Too personalized
  • 15. David Lyon PhD Physics from the University of Illinois Doing GPU simulations I love hiking, table tennis, and astrophysics
  • 16. Next Steps - Random PCA for Spark.ml Step 1: Learn Scala! Step 2: Contribute to Open Source community Step 3: Streaming Random PCA?
  • 17. Next Steps - Popular Topics by Cluster Find the popular topics within each cluster using Term-Frequency Inverse- Document-Frequency (TF-IDF) or LDA Terms are 1-grams and 2-grams used in each cluster, and the document frequency is over all of reddit for that month.
  • 18. Challenge 2: Every month on Reddit: ● Too many individual authors ● Need to cluster by shared interests, not author 30k active authors 6k active Subreddits
  • 19. Challenge 3: Finding K for K-Means Number of clusters is not the same as number of PCA shared interests Clustering can happen on more than one scale Football Baseball TV Movies
  • 20. Random PCA Complexity of PCA is O(mnk) for m rows, n input columns, k output columns FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS (Nathan Halko, 2009) Fast Randomized SVD (Facebook Research, 2014) Complexity of Random PCA is O(mn ln(k)) For k=100, Random PCA is more than 20x faster!
  • 21. Before PCA Football 2 1 Baseball 3 1 15 TV 5 2 22 Movies 1 21 1 2 Sub Auth1 Auth2 Auth3 Auth4 Auth5 Auth6 Auth7 Auth99 9,999 Auth 1,000,0 00
  • 22. After PCA Football 80 2 1 Baseball 90 3 2 TV 6 80 77 Movies 2 80 20 Sub Sporting Fictional Political
  • 23. Anatomy of a Reddit Comment BodyAuthorDate Subreddit Group by Month Group by Subreddit Count #comments by author per subreddit Normalize authors so each author has mean=0 and variance = 1
  • 24. Growth in Number of Subreddits 40 subreddits 1 million subreddits
  • 25. Week 4 Challenges ● Spark for iterative machine learning because Spark can mapreduce in memory ● By reducing the dimension of data, ● No streaming - clustering requires lots of data & clusters change slowly, but time window reduced from monthly to daily
  • 26. Clustering is Universal Galaxies cluster into superclusters of ~100k members The red dot is our galaxy ● Human knowledge is clustered - purple for physics, blue for chemistry, green for biology and medicine. ● The big blob to the upper left is Liberal Arts.
  • 27. Subreddit Clustering Monthly graph from 10k subreddits X 2 million authors = 10 billion matrix entries Drastically reduce the size of data using Principal Component Analysis, normalized so that larger subreddits aren’t favored Cluster in reduced dimensional space using K-means Topics within Clusters based on relative frequency of 1-grams and
  • 28. Social media brings us closer Continual contact with over 1 billion people We can find people who share our exact interests ...and separates us ● Less tolerance for differences - unfriend or ban from community! ● Online communities become bubbles isolated from each other