Wenzhe (Evelyn) Xu
217-417-9270 | wxu23@illinois.edu | Address: 100 24 Ave #3, San Mateo, CA 94403
CORE QUALIFICATIONS
Overall Highlight: Solid data science background with data engineering skills applied in industry
Statistical Skill:
• Data manipulation: ETL (extract, transform, load) techniques on large datasets (Impala 1.x)
• Modeling: machine learning (regression, classification, clustering), categorical data analysis, time series analysis, sampling, ANOVA, A/B testing, dimension reduction, model selection
Computer Skill:
• Programming: expertise in R (4+ yr; dplyr, ggplot2, shiny), Python (1+ yr; SciPy, NumPy, pandas, scikit-learn), SQL (3+ yr)
• Data engineering: Apache Impala 1.x, Apache Hadoop 2.x, Cloudera CDH 5.x, Apache Spark 1.x
• Software and systems: SAS (Advanced Certified), SPSS, Looker, Linux (CentOS 6.5), Microsoft Office, MapReduce
Communication and Presentation:
• Meet with business colleagues and translate commercial goals into data-driven objectives
• Respond quickly to requirements added later by the business side
• Present modeling results and data insights through PPT slides and interactive Shiny dashboards
Project Management: Able to track each stage of an independent project against the scheduled timeline
EDUCATION BACKGROUND
University of Illinois at Urbana-Champaign 8/2013-5/2015
Master of Science in Statistics-Analytics | Major GPA: 3.96/4.0
Tianjin University of Finance and Economics, Tianjin, China 9/2009-6/2013
Bachelor of Science in Statistics | Overall GPA: 3.75/4.0 | Major GPA: 4.0/4.0
PROFESSIONAL EXPERIENCE
Data Scientist Intern, MasterCard Inc., San Carlos, CA 6/2015 - 12/2015 (expected)
Independently designed an automatic anomaly detection system for business-metrics time series data
• Researched anomaly detection algorithms and chose Twitter's S-H-ESD test as the core logic
• Read the R package source code and reorganized its logic and inputs to fit the business-metrics use case, incorporating special business requirements and the engineering staff's execution needs
• Used Impala/SQL to ETL data from production, compiled the R code into mapper and reducer functions in Python with each group ID as the mapper key, and set up the MapReduce data pipeline job on Oozie
• Built PPT slides and an interactive Shiny dashboard to collect the team's feedback on the algorithm, and responded quickly to additional needs from business and engineering colleagues
Collaborative tasks on other projects
• Helped write word-count mapper and reducer functions for merchant classification based on transaction records
• Built a 12-node Hadoop cluster, including tuning critical parameters, and was responsible for monitoring memory usage and installing related packages in a CentOS 6.5 Linux environment
• Participated in Cloudera training on Apache Spark
Tech-Sale Analytics Intern, Anheuser-Busch InBev, Champaign, IL 5/2014 - 5/2015
Customer profiling with social media data sources and the internal wholesaler database
• Learned the original algorithm and created an advanced string-cleaning step for the Foursquare/Yelp database
• Improved the whole workflow by adding lat/lon information and innovative filters to increase mapping accuracy
• Wrote and integrated all steps in pipelined R code for future use and cooperated with a 3rd party to build a dashboard
• Overcame the challenges of handling Asian languages and expanded the customer profiling project to the global market
• Mentored new interns to take over the customer profiling workflow
Quantitative exploration experiment projects (ad hoc)
• Predicted volume for each POC through logistic models and learning algorithms, and identified the influential POCs
• Used a regression tree to predict the number of pipes needed for the target volume in the Belgium on-premise market
• Translated data analysis results into commercial insights and presented them to business managers on a regular basis
OTHER RELATED PROJECT EXPERIENCES
Bad Auction Prediction for Old Car Purchase, Applied Machine Learning, Champaign, IL 12/2014
• Preprocessed the data by checking for missing values and transforming features into an appropriate form for modeling
• Used random forest to obtain variable importance and performed feature selection
• Built a gradient logistic regression model and tuned the threshold parameter for binary prediction
• Applied the KNN algorithm and a cluster-based prediction algorithm and compared the results
Online advertisement click-through-rate prediction (large data set), Champaign, IL 12/2014
• Loaded and manipulated 12 GB of data to check for missing values and produce basic visualizations
• Constructed a logistic regression model and trained a random forest to predict the click probability
• Applied frequent pattern mining for auxiliary prediction and a gradient algorithm to combine the prediction results
Spatial Analysis on the Workload Distribution of Urbana Police, Consulting, Champaign, IL 5/2014
• Preprocessed about 50,000 crime records from the Urbana area, covering January 2011 to May 2013, using R and SAS
• Visualized the data with spatial-temporal bar plots to present the trend in crime frequency
• Applied Kernel Density Estimation and Bernoulli likelihood estimation to detect global and local crime clusters
Time series analysis for the private final consumption of Australia, Champaign, IL 12/2013
• Collected 127 quarterly records and used 121 for modeling; built SARIMA models based on the ACF and PACF plots after first-order differencing, conducted model selection through AIC and BIC, and finally chose SARIMA(1,1,1)×(2,1,2) as the prediction model, which gave the smallest forecast error
• Discovered the seasonality of the raw data by plotting the smoothed periodogram (spectral-domain approach) using R
Modeling for the prediction of house prices in the Urbana-Champaign area, Champaign, IL 12/2013
• Cleaned the original 3727 records in the raw dataset, including checking and correcting record errors and removing duplicated and irrelevant records
• Extracted location, structure, size and age as factors that significantly influenced the price
• Built a linear regression model with an R-squared of 0.728 and a tree model whose most significant variables were house size and number of bathrooms, using R
The correlation between the attributes of grapes and quality of wine, Tianjin, China 9/2012
• Classified grapes into quality levels by their physical and chemical indexes using K-means clustering
• Established a canonical correlation model between the quality of grapes and wines
• Extracted the principal component of the color index as a significant factor in grading wine quality, using R
HONORS, AWARDS, AND ACTIVITIES
Silver prize in the 2012 National Mathematical Modeling Contest, Tianjin, China
Attended the 2014 useR! Conference at the University of California, Los Angeles, CA
Honored as the “Most Advanced Marketing Development” intern for the internship at AB-InBev Budlab, Champaign, IL