
Zhen wang demo3

101 views

Published on

Demo slides for my Flask App SpreadHealth.tech

Published in: Data & Analytics


  1. Empower Public Health through Social Media. Zhen Wang, Ph.D., Insight Health Data Science
  2. Natural Language Processing: turn text into numbers for machine learning ("I'm really good with numbers!").
     Text Cleaning, Tokenizing, Convert to Feature Vectors.
     Example documents: "I like food!" / "Food is good!" / "I had some good food."
     Tokenized: (i, like, food) / (food, is, good) / (i, had, some, good, food)
     Feature vectors (e.g., TF-IDF; downweight frequent terms, normalize):

               i   like  food  is   good  had  some
     doc 1:    1    1     1     0    0     0    0
     doc 2:    0    0     1     1    1     0    0
     doc 3:    1    0     1     0    1     1    1
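The bag-of-words-to-TF-IDF step on this slide can be sketched in a few lines of scikit-learn. This is an illustration with the slide's three example sentences, not the app's actual code; the custom `token_pattern` is an assumption so that single-letter tokens like "i" survive, matching the slide.

```python
# Sketch of the slide's feature-vector step (illustrative, not the app's code).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["I like food!", "Food is good!", "I had some good food."]

# Tokenize into a bag-of-words count matrix. The token_pattern keeps
# single-letter tokens such as "i", as in the slide's example.
counts = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# Downweight terms that occur in many documents and L2-normalize each row.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```

With these three documents the vocabulary has seven terms, so the result is a 3x7 matrix whose rows each have unit length.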
  3. Text Classification.
     [Figure: distribution of tweets, number of tweets vs. normalized retweet counts]
     ● Sample imbalance; the majority class is downsampled for training
     ● Classification (0/1: not retweeted / retweeted); threshold: 0.005
     ● Logistic regression; misclassification error: 22%
     Normalized confusion matrix (test set):

                pred 0   pred 1
     true 0:     0.81     0.19
     true 1:     0.26     0.74

     Code: github.com/zweinstein/SpreadHealth_dev
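The train/test setup on this slide (downsample the majority class for training, evaluate on the untouched imbalanced test set with a row-normalized confusion matrix) can be sketched as follows. The synthetic data here is a stand-in assumption; the real app uses TF-IDF tweet features.

```python
# Sketch of the slide's classification setup; synthetic data stands in
# for the real tweet features (an assumption for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0.8).astype(int)  # imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Downsample the majority class (label 0) to the minority class size.
idx0 = np.where(y_tr == 0)[0]
idx1 = np.where(y_tr == 1)[0]
keep = np.concatenate([rng.choice(idx0, size=len(idx1), replace=False), idx1])
clf = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])

# Row-normalized confusion matrix on the untouched, imbalanced test set.
cm = confusion_matrix(y_te, clf.predict(X_te), normalize="true")
print(cm.round(2))
```

Each row of `cm` sums to 1, which is how the slide's 0.81/0.19 and 0.26/0.74 rows should be read.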
  4. Zhen (Jen) Wang. Ph.D. in Physical Chemistry. Beta tester since 2015; editor since 2015. Interests: traditional medicine, science fiction, public speaking, online education.
  5. Thank you!
  6. See the App in Action:
  7. Text Preprocessing Pipeline.
     Text Cleaning:
     ● Convert to lower case
     ● Replace URLs, hashtags (#), and mentions (@)
     ● Remove special characters other than emoticons
     ● Remove stopwords
     Tokenizing:
     ● Split each document into individual elements
     ● Bag-of-Words or N-grams
     ● Stemming: the Porter stemmer was used; the Snowball and Lancaster stemmers are faster but more aggressive; lemmatization is computationally more expensive but has little impact on text-classification performance
     Term Frequency-Inverse Document Frequency (tf-idf):
     ● Term frequency tf(t, d): the number of times a term t occurs in a document d
     ● Document frequency df(d, t): the number of documents d that contain the term t
     ● The inverse document frequency is used to downweight frequently occurring words in the feature vectors; the scikit-learn implementation (TfidfTransformer) was used, which by default also smooths the idf and L2-normalizes each feature vector
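The cleaning steps above can be sketched with only the standard library. The stopword list and emoticon pattern below are illustrative assumptions, not the app's actual lists, and stemming (e.g. NLTK's PorterStemmer) would follow as a separate step on the returned tokens.

```python
# Sketch of the slide's cleaning steps; STOPWORDS and EMOTICONS are
# toy placeholders, not the app's real lists.
import re

STOPWORDS = {"a", "an", "the", "is", "to", "and", "rt"}  # toy list
EMOTICONS = r"(?::|;|=)(?:-)?(?:\)|\(|D|P)"              # e.g. :) ;-( =D

def clean(text):
    text = text.lower()                                # lower case
    text = re.sub(r"https?://\S+", " url ", text)      # replace URLs
    text = re.sub(r"[@#](\w+)", r" \1 ", text)         # strip @ and # markers
    emoticons = re.findall(EMOTICONS, text)            # save emoticons
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # drop other specials
    tokens = text.split() + emoticons                  # re-append emoticons
    return [t for t in tokens if t not in STOPWORDS]   # remove stopwords

print(clean("RT @who: Diabetes tips :) http://t.co/x #health"))
```

On the sample tweet this yields the content tokens plus a `url` placeholder and the preserved `:)` emoticon.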
  8. ● Training dataset: 10,000 tweets on diabetes (4,782 retweeted)
     ● Test-set accuracy (random chance 0.49 on the positive class):
       ○ KNN: 60%
       ○ Naive Bayes: 67%
       ○ Logistic regression: 75% (chosen, and tested on imbalanced test data)
     ● Potential improvements:
       ○ Decision trees with bagging/boosting (e.g., Random Forest, XGBoost)
       ○ Other features: polarity and sentiment; tweet length
     ● Out-of-core incremental learning with stochastic gradient descent (an advantage of logistic regression)
     ● Automatic updates to the SQLite database and to the classifier's predictions
