For my final-year project I used data analysis techniques to investigate user behavior pattern recognition, weighing similar interests and culture against offline geographical location. It was an out-of-the-box topic, which I chose because of my love of data analysis, applied here to social network analysis in the Internet era.
User Behaviour Pattern Recognition On Twitter Social Network
1. User Behavior Pattern Recognition
Using
Data Analysis Techniques
On
Twitter Social Network
George Konstantakopoulos
Supervisor: George Siogkas
2. Abstract
• Network analysis - online social networks
• Demonstrate data analysis techniques
• Twitter as a broadcast ‘area’
• Analyze advertised ‘packets’ in the broadcast area
• Investigate affected nodes by description & location
‘Similar interests & culture are more important than geographical location in the context of internet era?’
3. Introduction
Internet Era Facts
• More than 2.4B Internet users – 34.3% of the world population (June ‘12)
• During 2012 the digital world generated almost 2.9 ZB* of data
• Directly connected to and affected by Web 2.0 & 3.0 technologies
*Zettabyte = Gigabyte × 10^12
4. Introduction
Social Web Statistics
• 67% of those users use at least one social networking site
• Facebook, Google+, YouTube and Twitter are currently the leaders
*In millions of users
5. Introduction
• By acquiring a Twitter network dataset
• By creating a Graph based on the dataset
• By clustering based on recognized patterns
What:
• @username / description / location
• Follow (directed graph)
• Update (hashtag/link/photo/video)
• Reply or Mention (@username)
Focus on:
• Updates - hashtags
• Location / description
Why:
• Directed broadcast network topology
• Fastest-growing social platform from Q3 to Q4 2012 (40% growth)
• Reflected in 288m active users
6. Literature Review
Combining Ideas
‘Learning to Discover Social Circles in Ego Networks’
1. Influenced the way an ego-network can be explored & clustered
‘Socio-semantic Query Expansion Using Twitter Hashtags’
2. Influenced the way a hashtag may be used
#tag: the hash symbol followed by a word or concatenated phrase, e.g. #truestory
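The hashtag definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual code: the regular expression is a simplification of Twitter's real hashtag rules, and the function names and sample updates are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern: the hash symbol followed by a word or concatenated phrase.
# Assumption: Twitter's real tokenisation rules are more involved than this.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags found in one update, lower-cased, without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def hashtag_counts(updates):
    """Count how often each hashtag appears across a list of updates."""
    counts = Counter()
    for text in updates:
        counts.update(extract_hashtags(text))
    return counts


# Hypothetical sample updates, for illustration only.
updates = [
    "Long night of data cleaning #truestory #DataAnalysis",
    "Graphs everywhere #dataanalysis",
    "No tags here",
]
counts = hashtag_counts(updates)
print(counts.most_common())
```

Lower-casing the tags means that #DataAnalysis and #dataanalysis fall into the same group, which matters when hashtags are later used as a clustering key.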
7. Literature Review
3. Influenced the research on users’ location
‘Geographic Dissection of the Twitter Network’
‘Does offline geography still matter in online social networks?’
Breakdown of the Twitter research:
• Users’ geolocation
• Their connections to others
• Information they exchange with them
Concluded:
‘Our in-depth analysis reveals that geography crucially impacts all aspects of the Twitter social network’.
8. Methodology
Data Mining & Analysis
• Twitter API*
• Data mining procedure
• All data are publicly shared
• Cluster by hashtags, description & location
In 1854, during the ‘Broad Street cholera outbreak’, physician John Snow recognized patterns in where cases occurred and grouped them into clusters. The clusters pointed to a public water pump as the source of the disease, and he convinced the local authorities to disable it.
*API stands for: Application Programming Interface
11. Design
Data Elaboration
• The ‘Atlanta Problem’
• Data cleaning needed
ATTRIBUTE Location - REGION: Atlanta
55 different inputs of the location Atlanta in this table (entries separated by ‘;’):
ATL ; Atlanta | LA ; Atlanta, GA, USA
ATL? NY ? FL ? WORLDWIDE ; Atlanta and Fort Lauderdale ; Atlanta, Ga.
ATLby way of West Philly ; Atlanta GA ; ATLANTA, GA.
Atl shawty ; Atlanta Ga. ; Atlanta, GA.
Atl. ; Atlanta Georgia ; Atlanta, Ga. ?
ATL/NY/NJ/ ; Atlanta Georgia Area ; Atlanta, Georgia
Atlanta ; Atlanta Headquarters ; Atlanta, Georgia (Gwinnett)
atlanta ; Atlanta Nightlife! ; Atlanta, Georgia, USA
ATLANTA ; Atlanta via Hampton Roads ; Atlanta, Los Angeles
Atlanta -- DC ; Atlanta, DC, & International ; Atlanta, New York
Atlanta - London - Tokyo ; Atlanta, GA ; Atlanta, New York, Los Angeles
Atlanta - sometimes Houston ; Atlanta, Ga ; Atlanta,GA
Atlanta & New York City ; ATLANTA, GA ; Atlanta,Ga
Atlanta (Soufside) ; atlanta, ga ; ATLANTA,GA
ATLanta , ga ; Atlanta, GA and Sarasota, FL ; Atlanta,Georgia
Atlanta , GA ; Atlanta, GA Area ; atlanta. georgia. u.s.a
Atlanta /Global ; Atlanta, GA USA ; Atlanta/Ghana/Africa/Worldwide
Atlanta | Brasil | NYC ; Atlanta, GA, U.S.A. ; Atlanta/New York USA
Atlanta's West Midtown
12. Design
The ‘Atlanta Problem’
Different people have different writing habits.
To tackle the problem, one cluster was created for the region ‘USA’ and all other location data were clustered by country.
The same problem applies to the followers' descriptions.
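One way to picture this cleaning step is a keyword lookup over normalised location strings. The sketch below is an assumption about how such a step could look, not the procedure used in the project; the keyword table, the cluster labels ‘EMPTY’ and ‘UNKNOWN’, and the first-match rule are all hypothetical.

```python
import re

# Hypothetical keyword table: lower-case token -> cluster label.
# A real table would be far larger and would need manual review
# (for example, 'georgia' is also the name of a country).
KEYWORD_TO_CLUSTER = {
    "atlanta": "USA",
    "atl": "USA",
    "usa": "USA",
    "athens": "Greece",
    "greece": "Greece",
    "london": "UK",
    "uk": "UK",
}


def normalize(raw):
    """Lower-case the text and replace every run of non-letters with one space."""
    return re.sub(r"[^a-z]+", " ", raw.lower()).strip()


def cluster_location(raw):
    """Map a free-text location to a cluster label using the first known token."""
    if not raw or not raw.strip():
        return "EMPTY"
    for token in normalize(raw).split():
        if token in KEYWORD_TO_CLUSTER:
            return KEYWORD_TO_CLUSTER[token]
    return "UNKNOWN"


# A few of the spellings from the 'Atlanta Problem' table.
for raw in ["Atlanta,GA", "ATLANTA, GA.", "atlanta. georgia. u.s.a",
            "Atlanta - London - Tokyo", "ATLby way of West Philly", ""]:
    print(repr(raw), "->", cluster_location(raw))
```

The first-match rule sends a multi-city entry such as ‘Atlanta - London - Tokyo’ to the cluster of its first recognised token, and a fused spelling such as ‘ATLby way of West Philly’ stays ‘UNKNOWN’, which shows why some manual cleaning would still be needed.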
13. Implementation
Implementation Cycles
Based on ‘The Spiral Model of Software Development’, the project went through three cycles:
1. The Twitter Project (1.9 billion lines)
2. The Ego-network Project (already cleaned & clustered)
3. The Data Mining Project (author's ego-network analysis)
Data pre-processing
Intel Core 2 Duo 2.66 GHz
CPU load: 100% | Kernel peak: 98%
Data pre-processing
User's IDs relational graph
Total Lines:
Author’s ego-network
18. Results
1. Parent node is based in Greece
2. Most ‘affected’ nodes are in
USA, followed by
UK, Canada, Greece, etc.
3. Empty location values can skew the spread across countries with small counts.
Test, Results & Evaluation
*Dot size reflects the number of incoming edges
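The two quantities behind this slide, incoming edges per node and nodes per country cluster, can be computed from a directed edge list as in the minimal sketch below. The node names, edges and country labels are toy values invented for illustration; they are not the project's data, and the ‘EMPTY’ label for missing locations is an assumption.

```python
from collections import Counter


def in_degrees(edges):
    """Count incoming edges per node; each edge is (follower, followed)."""
    counts = Counter()
    for _follower, followed in edges:
        counts[followed] += 1
    return counts


def nodes_per_cluster(node_clusters):
    """Count nodes per country cluster, labelling empty values as 'EMPTY'."""
    return Counter(label if label else "EMPTY" for label in node_clusters.values())


# Toy directed 'follow' graph (hypothetical).
edges = [("b", "a"), ("c", "a"), ("c", "b")]
# Toy country cluster per node; node 'd' has an empty location.
node_clusters = {"a": "Greece", "b": "USA", "c": "USA", "d": ""}

degrees = in_degrees(edges)
clusters = nodes_per_cluster(node_clusters)
print(degrees.most_common())
print(clusters.most_common())
```

Counting empty values as their own group keeps them visible in the totals instead of silently dropping them.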
20. Evaluation - SEOmoz
Test, Results & Evaluation
Geographical expansion results are verified - May 2013
21. Evaluation – SEOmoz & Tweetstats
Test, Results & Evaluation
From January to May
Days: 119
Tweets: 333
Followers growth: 163%
January 2013 May 2013
23. Conclusion
Research on:
• Network analysis
• Online social networks
• Specific Twitter characteristics
Raised the issue:
‘Similar interests & culture are more important than
geographical location?’
Based on the analysis undertaken:
‘Shared interests & culture play a greater role in connecting people via the Twitter medium than their geographic location.’
! Small dataset
24. THANK YOU FOR YOUR TIME
Conclusion
Personal Reflection
• Network analysis
• Social network analysis
• Project management
• Research skills
• Data analysis
• Visualization skills
Future Work
• Project is ongoing
• User categorization through created metrics
• Evaluate results based on the same analysis but with different accounts
25. Final Year Project in numbers
• 6,286 files
• In 135 folders or…
• 59.5GB of data and counting…
• Explored over 10 different software programs in the data analysis, processing and visualization fields
26. References
Anagnostopoulos, I., Kolias, V. and Mylonas, P. (2012). Socio-semantic Query Expansion Using Twitter Hashtags. In: 2012 Seventh International
Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), 2012. [Online]. Available at: doi:10.1109/SMAP.2012.15.
Bastian, M., Heymann, S. and Jacomy, M. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks. In: Third International
AAAI Conference on Weblogs and Social Media, 19 March 2009. [Online]. Available at:
http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.
Duggan, M. and Brenner, J. (n.d.). The Demographics of Social Media Users - 2012. Pew Internet & American Life Project. [Online]. Available at:
http://www.pewinternet.org/Reports/2013/Social-media-users/The-State-of-Social-Media-Users.aspx [Accessed: 6 March 2013].
GlobalWebIndex. (2012). SOCIAL PLATFORMS GWI.8 UPDATE: Decline of Local Social Media Platforms. GlobalWebIndex. [Online]. Available at:
http://www.globalwebindex.net/social-platforms-gwi-8-update-decline-of-local-social-media-platforms/ [Accessed: 15 March 2013].
Kulshrestha, J., Kooti, F., Nikravesh, A. and Gummadi, K. P. (2012). Geographic Dissection of the Twitter Network. Dublin, Ireland: Max Planck
Institute for Software Systems.
McAuley, J. and Leskovec, J. (2012). Learning to Discover Social Circles in Ego Networks. In: NIPS, 2012.
Miniwatts Marketing Group. (n.d.). World Internet Users Statistics | Usage and World Population Stats. Internet World Stats. [Online]. Available at:
http://www.internetworldstats.com/stats.htm [Accessed: 6 February 2013].
Smith, M. A., Shneiderman, B., Milic-Frayling, N., Mendes Rodrigues, E., Barash, V., Dunne, C., Capone, T., Perer, A. and Gleave, E. (2009). Analyzing
(social media) networks with NodeXL. In: Proceedings of the fourth international conference on Communities and technologies, 2009, p.255–264.
[Online]. Available at: http://dl.acm.org/citation.cfm?id=1556497 [Accessed: 30 March 2013].
Snow, J. (n.d.). Mode of Communication of Cholera (John Snow, 1855). [Online]. Available at: http://www.ph.ucla.edu/epi/snow/snowbook4.html
[Accessed: 1 February 2013].
Twitter Help Center. (n.d.). The Twitter glossary. [Online]. Available at: https://support.twitter.com/articles/166337-the-twitter-glossary#
[Accessed: 2 January 2013].