Introduction to Data Mining for Web Applications
Paul-Alexandru Chirita, Ph.D.
About Me
  • Education:
      • Ph.D., Information Retrieval & Data Mining, Univ. of Hannover, Germany
      • B.Sc., Ecole Polytechnique, Paris, France + “Politehnica” Univ. Bucharest, CS Dept.
  • Roughly 8 years in IT, 7 of them in IR & DM
  • Now at Adobe Romania (L3S, Yahoo!, Schlumberger and others in the past)
Web Mining
  • The application of Data Mining algorithms to discover patterns in the Web.
  • Three dimensions:
      • Usage Mining
          • Analyzes various access logs in order to provide input to business decisions
          • By far the most used, with the highest ROI
      • Content Mining
          • Analyzes Web page content in order to extract useful information (e.g., keywords, topic, content type, sentiment)
      • Structure Mining
          • Also known as “Link Analysis”
          • Investigates the hyperlink structure of the Web to improve existing algorithms (e.g., search result ranking)
Agenda
  • Client side tools
      • Google Analytics
      • Omniture
  • Server side tools
      • AWStats
      • Webalizer / AWFFull
  • Advanced analytics
Client side tools
  • Purpose:
      • Return basic information about traffic on your Web site, SEO
      • Most of them are also (partly) integrated with monetization tools (e.g., AdWords)
  • Pros:
      • Hosted by third-party sites, zero or minimal cost for you
      • Easy to implement and integrate, no maintenance
  • Cons:
      • The client-side tracking code adds overhead to every page load (~200-600 ms additional response time)
      • If your traffic increases “too much” you have to pay
Client-side tools: Google Analytics
  • Free, and well-engineered!
  • Shows statistics about:
      • Basics: visits, pages, etc.
      • Visitor profiles: browser, OS, language/locale
      • Visitor loyalty: how many times each visitor returned to your site, when they last did so, and for how long
      • Trends: is your traffic & popularity growing or decreasing?
      • Traffic sources: entry/exit pages, referring sites & search engines
  • Some customization planned for the near-term future
  • Good for personal or small-scale sites
  • https://www.google.com/analytics
Client-side tools: Google Analytics [2]
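The visitor loyalty figures mentioned for Google Analytics (number of visits per visitor, time since the last visit) can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the `VISITS` records, the `loyalty` function and the reference date are hypothetical, and nothing here reflects how Google Analytics computes its reports internally.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical visit records: (visitor_id, visit_start).
# Made up for illustration; not an export format of any real tool.
VISITS = [
    ("v1", datetime(2010, 9, 1, 10, 0)),
    ("v2", datetime(2010, 9, 2, 9, 30)),
    ("v1", datetime(2010, 9, 15, 18, 45)),
    ("v1", datetime(2010, 9, 28, 6, 49)),
]


def loyalty(visits, now):
    """Return {visitor_id: (visit_count, whole_days_since_last_visit)}."""
    by_visitor = defaultdict(list)
    for visitor_id, started in visits:
        by_visitor[visitor_id].append(started)
    return {
        visitor_id: (len(times), (now - max(times)).days)
        for visitor_id, times in by_visitor.items()
    }


report = loyalty(VISITS, now=datetime(2010, 9, 30))
for visitor_id, (count, days_ago) in sorted(report.items()):
    print(f"{visitor_id}: {count} visits, last seen {days_ago} days ago")
```

Grouping by visitor first keeps the two loyalty measures (frequency and recency) in a single pass over the data.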
Omniture: Site Catalyst
  • Low price per thousand entries, but may become costly if you have a lot of traffic (millions of visits per day) or if you have many dozens of sensors
  • Same statistics as Google Analytics, but you can drill down very deep:
      • Statistics per hour of day, per file type (html, cfm, etc.), per action type (download, view page, etc.)
      • Visitor segmentation down to the level of city
      • Purchases, promotions, and many metrics for e-commerce (e.g., how many products added to the cart have actually been checked out)
  • Most importantly, you can define ANY metric you want! (e.g., how many people click on my survey link, how many of them fill it in, etc.)
  • www.omniture.com
Omniture: Site Catalyst [2]
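The custom metric used as an example for Omniture (how many people click the survey link, and how many of them fill it in) is a two-step funnel. A minimal sketch of that calculation follows; the event names, the `EVENTS` data and the `funnel` function are hypothetical and are not part of any Omniture API.

```python
# Hypothetical event stream: (visitor_id, event_name).
EVENTS = [
    ("v1", "survey_link_click"),
    ("v1", "survey_submitted"),
    ("v2", "survey_link_click"),
    ("v3", "page_view"),
    ("v4", "survey_link_click"),
    ("v4", "survey_submitted"),
]


def funnel(events, first_step, second_step):
    """Count distinct visitors per step.

    A visitor counts for the second step only if they also did the first.
    Returns (first_count, second_count, conversion_rate).
    """
    did_first = {v for v, e in events if e == first_step}
    did_both = {v for v, e in events if e == second_step and v in did_first}
    rate = len(did_both) / len(did_first) if did_first else 0.0
    return len(did_first), len(did_both), rate


clicked, completed, rate = funnel(EVENTS, "survey_link_click", "survey_submitted")
print(f"{clicked} clicked, {completed} completed ({rate:.0%})")
```

Counting distinct visitors rather than raw events avoids inflating the metric when someone clicks the link several times.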
Server side tools
  • Purpose:
      • Return basic information about traffic on your Web site
      • Similar to the client-side tools, but currently more focused on reliability & application improvements
  • Pros:
      • Most importantly, no extra overhead on page loads for your app (every ms counts!)
      • Show a lot of developer-specific information (errors, visitor browsers/OS, etc.)
      • Very easy to install
  • Cons:
      • Usually open source, but hard to extend with your own metrics
Free server side tools
  • Similar statistics to the client-side tools, but…
      • Less business-specific information (no visitor loyalty, trends, etc.)
      • More developer-specific data (errors & error types, HTTP status codes, etc.)
  • Good for medium and large-scale sites
  • http://awstats.sourceforge.net/
  • http://www.stedee.id.au/awffull/
Server side tools: AWStats
Server side tools: Webalizer / AWFFull
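The developer-oriented statistics that the free server-side tools report (HTTP status codes, errors) come straight from the Web server's access log. As a rough sketch of the idea, the Python below counts status codes in Apache Common Log Format lines. The sample lines are invented, the function names are hypothetical, and this is not how AWStats or Webalizer are implemented.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

# Invented sample lines (documentation IP range 192.0.2.0/24).
SAMPLE = [
    '192.0.2.1 - - [28/Sep/2010:06:49:42 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [28/Sep/2010:06:49:43 +0000] "GET /missing.html HTTP/1.1" 404 312',
    '192.0.2.1 - - [28/Sep/2010:06:49:44 +0000] "POST /search HTTP/1.1" 500 0',
    '192.0.2.3 - - [28/Sep/2010:06:49:45 +0000] "GET /index.html HTTP/1.1" 200 5120',
]


def status_counts(lines):
    """Count requests per HTTP status code, ignoring unparseable lines."""
    counts = Counter()
    for line in lines:
        match = CLF.match(line)
        if match:
            counts[match.group(4)] += 1
    return counts


def error_requests(lines):
    """Return (status, request) pairs for 4xx and 5xx responses."""
    errors = []
    for line in lines:
        match = CLF.match(line)
        if match and match.group(4)[0] in "45":
            errors.append((match.group(4), match.group(3)))
    return errors


counts = status_counts(SAMPLE)
errors = error_requests(SAMPLE)
print(dict(counts))
print(errors)
```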
Paid server side tools
  • Overcome most limitations of the free tools
  • Log everything into text files (see next section)
  • Provide some sort of SQL-like query language that lets you define any type of query you want
  • Run reports much faster
  • The most expensive of them all, meant for professional use
  • http://www.splunk.com/
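To give a feel for what an SQL-like query over logged requests buys you, here is a small sketch that uses SQLite (from the Python standard library) as a stand-in. It is not Splunk's query language, and the table layout, column names and rows are all hypothetical.

```python
import sqlite3

# Hypothetical request records: (url, http_status, response_time_ms).
ROWS = [
    ("/index.html", 200, 120),
    ("/search", 200, 340),
    ("/search", 500, 900),
    ("/index.html", 200, 100),
    ("/search", 200, 260),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (url TEXT, status INTEGER, ms INTEGER)")
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", ROWS)

# An ad-hoc report: hits and average response time of successful requests per URL.
report = conn.execute(
    """
    SELECT url, COUNT(*) AS hits, AVG(ms) AS avg_ms
    FROM requests
    WHERE status = 200
    GROUP BY url
    ORDER BY hits DESC, url
    """
).fetchall()

for url, hits, avg_ms in report:
    print(f"{url}: {hits} hits, {avg_ms:.0f} ms on average")
conn.close()
```

The point of the example is that a new metric is one query away, instead of requiring changes to the reporting tool itself.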
How this is done in the heavyweight category ;-)
  • Multiple log files, one for each piece of functionality being tracked
  • Kept as simple as possible (see next slide for an example)
  • The main guideline: be able to parse any log file and generate statistics using only the command line
  • Example format: tab-separated
Sample log
Date & Time	IP (hashed)	User ID (hashed)	Query	Parameters
Sep 28 06:49:42	Ea9hjnc4ufTfU	anonymous	spell checker	:0:10:en_US:en_US:0:0
Sep 28 06:49:42	8NCTsHqR366	anonymous	javascript	:0:10:fr_FR:fr_FR:0:1
Sep 28 06:49:42	K4nD5xy/R5fw	anonymous	text	:0:10:en_US:en_US:0:1
Sep 28 06:49:43	lRqBaIaUWxna	yxDkhBEqC6xxR8z=	module	:0:10:en_US:en_US:0:0
Sep 28 06:49:44	jMjJpy6bHAdb	hPFLKaMNeShD0=	delete spread	:0:10:en_US:en_US:0:0
Sep 28 06:49:44	r3xgRLagX1cQ6	anonymous	_x	:0:10:ru_RU:ru_RU:0:0
Sep 28 06:49:45	b2DLBl3VTT67Q	anonymous	anti a	:0:10:de_DE:de_DE:0:0
Sep 28 06:49:45	KaKiB2ITEdPeM	VcLic9CIy4QxVtJQ=	create a star	:0:10:en_US:en_US:0:0
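A log in this shape is easy to turn into statistics. The sketch below parses three records copied from the sample log and counts queries per locale. Two things are assumptions: the field names in `FIELDS` are my own labels for the five columns, and the meaning of the colon-separated Parameters column is not documented in the slides, so reading its fourth item as the locale is a guess.

```python
from collections import Counter

# Three records copied from the sample log, one tab between fields.
SAMPLE = (
    "Sep 28 06:49:42\tEa9hjnc4ufTfU\tanonymous\tspell checker\t:0:10:en_US:en_US:0:0\n"
    "Sep 28 06:49:42\t8NCTsHqR366\tanonymous\tjavascript\t:0:10:fr_FR:fr_FR:0:1\n"
    "Sep 28 06:49:43\tlRqBaIaUWxna\tyxDkhBEqC6xxR8z=\tmodule\t:0:10:en_US:en_US:0:0\n"
)

# Hypothetical names for the five columns of the log.
FIELDS = ["timestamp", "ip_hash", "user_hash", "query", "params"]


def parse_log(text):
    """Turn tab-separated log text into a list of dicts, one per record."""
    records = []
    for line in text.splitlines():
        # Dropping empty fields tolerates doubled tabs between columns.
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if len(fields) != len(FIELDS):
            continue  # skip malformed lines
        records.append(dict(zip(FIELDS, fields)))
    return records


records = parse_log(SAMPLE)

# Assumption: the fourth colon-separated item of "params" is the locale.
locales = Counter(r["params"].split(":")[3] for r in records)
anonymous = sum(r["user_hash"] == "anonymous" for r in records)

print(locales.most_common())
print(f"{anonymous} of {len(records)} queries came from anonymous users")
```

Because every record is one line with a fixed number of tab-separated fields, the same counts could also be produced with standard command-line text tools, which is the guideline the log format was designed around.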
What can be done using this data
  • You can basically measure everything ;-)
  • Plus you can enable loads of new features:
      • Personalization for search, sold/promoted products, etc.
      • Browsing recommendations
      • Improved site organization (make popular pages more accessible, promote some other pages and track their traffic increase, etc.)
      • Search suggestions
      • Advertising (keyword selection, etc.)
Personalized search and promotions
  • Show different results/ads to different users
Browsing recommendations
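One simple way to derive browsing recommendations from usage logs is page co-occurrence: recommend the pages most often viewed in the same session as the current one. The slides do not say which method was used, so the sketch below is only one possible approach; the `SESSIONS` data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sessions: the set of pages each visitor viewed in one visit.
SESSIONS = [
    {"/home", "/pricing", "/signup"},
    {"/home", "/pricing"},
    {"/home", "/blog"},
    {"/pricing", "/signup"},
]


def build_cooccurrence(sessions):
    """Count, for every page, how often each other page shares a session with it."""
    co = defaultdict(Counter)
    for pages in sessions:
        for a, b in permutations(pages, 2):
            co[a][b] += 1
    return co


def recommend(co, page, k=2):
    """Return up to k pages most often co-viewed with `page`."""
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(co[page].items(), key=lambda item: (-item[1], item[0]))
    return [p for p, _ in ranked[:k]]


co = build_cooccurrence(SESSIONS)
print(recommend(co, "/home"))
print(recommend(co, "/blog"))
```

Raw co-occurrence favors pages that are popular everywhere; a production system would typically normalize the counts, which is left out here to keep the sketch short.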
Search suggestions
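Search suggestions can be built from the Query column of a search log: count how often each query was issued and offer the most frequent ones that start with what the user has typed so far. The sketch below assumes exactly that frequency-by-prefix approach, which the slides do not spell out; the `QUERY_LOG` contents and their counts are hypothetical.

```python
from collections import Counter

# Hypothetical queries as they might appear in the Query column of a search log.
QUERY_LOG = [
    "spell checker", "spell checker", "spelling", "javascript",
    "create a star", "create a star", "create a star", "create table",
]


def suggest(query_log, prefix, k=3):
    """Return up to k logged queries starting with `prefix`, most frequent first."""
    counts = Counter(q.strip().lower() for q in query_log)
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]


print(suggest(QUERY_LOG, "spe"))
print(suggest(QUERY_LOG, "cre"))
```

Scanning all distinct queries per keystroke is fine for a small log; a larger deployment would more likely precompute the top completions per prefix.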
How To Web - Introduction To Data Mining For Web Applications

Editor's Notes

  1. “Secrets” / “secret-mania”: people are not informed. Information is kept at the managers' level. People feel uninformed. Loss of trust in the direct manager. Rumors are encouraged. People do not know what they need to do to advance in their careers because managers do not say: how one advances; what the directions for advancement are; in which direction the team is heading; what the campus strategy is.
  9. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  10. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  11. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  12. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  13. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  14. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  15. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  16. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  17. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  18. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  19. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  20. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului
  21. “secrete” – “secretomanie” – oamenii nu suntinformati. Se tin informatiile la nivelulmanagerilor. Oamenii se simtdezinformati. Pierdereaincrederii in managerul direct. Zvonurilesuntincurajate.Oamenii nu stiu ce trebuie sa facapentru a puteaavansa in cariera deoarecemanagerii nu spun:Cum se avanseazaCare suntdirectiile de avansareIn ce directie se indreaptaechipaCare e strategiacampusului