Lessons after working as a data scientist for 1 year
1. Lessons After Working as a Data Scientist for 1 Year
Yao Yao
SMU DS Alumni Talks Oct ’19
Graduated SMU MSDS Aug ’18
Hired as Data Scientist Nov ’18
2. Data Science is an Expression of Self in Data
• Given a dataset and an open-ended problem, every solution will be different:
from the way you code, to the libraries and algorithms used, to the different
views, data merges, and visualizations, to the final solution
• A solution is the culmination of all the experiences you went through to
understand the dataset and extract business insights from it
• Transfer learning from unrelated disciplines, certificate courses, code references showing
what is possible, and imagination all feed into the solution that is rewardingly you
• The creative eye of a data scientist is hard to evaluate with live coding puzzles and
technical questions; therefore, data challenges and interviews are needed
• Even then, some are rigid in their solution path and stick to established cookbooks,
while others are nonconformist, eclectic thinkers who tailor their solutions beyond
what the problem and dataset suggested
3. Evaluating Data Scientists
• Fundamentally understand the nature and math of how solutions are derived
• Some data scientists thrive in a certain domain and find stability in a niche,
while others need a variety of datasets and problems to stay challenged.
Evaluate data scientists holistically instead of by one metric
• Data science is sometimes hard to monetize, which makes higher salaries hard to justify
• If the same insights are sold to multiple parties, the information is no longer exclusive and
eventually becomes public knowledge, which drives the value of the information down. The
insights are truly valuable if one party has exclusive access and can use them competitively
• To resist insight-value degradation due to economies of scale, find niches in
DS in which the value increases when there is more participation
• DNA testing of ancestry becomes more accurate as more people participate in the dataset
• Facebook and Google target ads and recommend news by using cookie tracking to find
people with tastes similar to yours, which works better as more people use their systems
4. Evaluating Data Scientists
• Unlike software developers and engineers, most of our work is built on the ability
to produce insights based on our coding and statistics acumen, algorithm
methodology, and creativity factor to extract the necessary business insights to
induce action, build data pipeline products, and/or embed systems in IoT devices
• Because we are so many steps removed from building software or a website that directly
brings in money, we get paid less and companies sometimes cannot gauge the
direct relationship between what we produce and how that ultimately generates money
• Find niches, such as FinTech, where the proprietary insights we produce have
a more direct route towards generating revenue based on decisions from insights
• Build products like recommendation/ad targeting systems where the perceived
value of the information distributed does not degrade due to economies of scale
or knockoffs bypassing R&D and instead increases once more people participate
5. Data Science has Clever Ways to Reflect Life
• Neural networks reflect dendrites in a human brain
• CNNs are filters through which a computer perceives images
• Generative adversarial network (GAN) learning, which is “sword sharpens sword”
• Adversarial: Be challenged, have thought experiments, and cover blind spots
• Reinforcement learning, which is “being an apprentice to a mentor”
• Become a better version of yourself by learning from established others
• Use DS to optimize for life given that life is the ultimate DS problem
• Be creative and have fewer inhibitions that prevent you from being successful
• Apply DS in rare domains such as mergers and acquisitions and portfolio asset management
• Given that life has imperfect information: help people by building products and
gaining insights with the compensation as a byproduct of what you do
6. DS is Relatable, For Better or Worse
• DS has so many algorithmic metaphors to explain everything intuitively
• This helps leadership with tech backgrounds understand intuitively what we do,
and it can improve your communication skills on the way to becoming a manager
• However, every layman thinks that they are a DS until they are proven otherwise
• Bad when devs with only coding backgrounds encroach on DS work by building
PowerBI reports without an intuitive grasp of statistics, using correlation (the
bottom tier of analysis) to infer causation, or applying improper cross-validation
techniques and metrics to evaluate “cookbook methods” pulled from online
• Worse when leadership only values percent change between endpoints for better
marketing PR instead of fitting a curve with a slope for better metrics
• When you explain things too simply so that they understand, it backfires when
leadership cannot see why certain algorithms still take so long to run
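The contrast between an endpoint percent change and a fitted slope can be sketched in a few lines of Python. This is only an illustration: the `monthly_metric` values are made up, and both function names are hypothetical. The endpoint figure depends on just two observations, while the least-squares slope uses every point.

```python
def endpoint_percent_change(values):
    """Percent change using only the first and last observations."""
    first, last = values[0], values[-1]
    return (last - first) / first * 100.0


def least_squares_slope(values):
    """Slope of the ordinary least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical monthly metric: flat for six months, then one spike at the end.
monthly_metric = [100, 101, 99, 100, 102, 100, 130]

print(endpoint_percent_change(monthly_metric))  # headline number from two points
print(least_squares_slope(monthly_metric))      # average change per month over all points
```

With this made-up series the endpoint calculation reports a 30% jump, while the fitted line rises by roughly 3.25 units per month, which is a more cautious summary of the same data.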
7. DS Discretion
• In industry, a lot of software is held together by “rope and duct tape”
• Not all completed code is perfect; to self-review and become a better coder,
use a split screen and retype your code into a final version that drops excess
exploratory functions and uses more efficient methods
• Instead of being aspirational by importing all the data and running the most
heavy-duty algorithm your computer can handle, which takes two weeks or more,
subsample and prototype so that leadership can get a sense of the initial
results and approve running the full model with all the data
• Visualizations always persuade leadership, clients, customers, investors, etc.
better than the same results presented as numbers
• Improvise, adapt, overcome; be a nonconformist, nonconventional, eclectic thinker
because all the standard-practice and common-sense ideas have already been done
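The subsample-and-prototype advice can be sketched as follows. This is a minimal illustration under stated assumptions: `full_data` is a made-up dataset, `subsample` and `mean_of` are hypothetical helpers, and the mean stands in for whatever expensive model you would actually run.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random sample containing `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the prototype is repeatable
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


def mean_of(rows, column):
    """Stand-in for the real model: any cheap statistic you want to preview."""
    return sum(row[column] for row in rows) / len(rows)


# Hypothetical dataset: 100,000 rows with a single numeric column.
full_data = [{"price": (i % 50) + 1} for i in range(100_000)]

prototype_data = subsample(full_data, fraction=0.01)
print(len(prototype_data))               # 1% of the rows
print(mean_of(prototype_data, "price"))  # quick preview to show leadership
print(mean_of(full_data, "price"))       # full run, only after approval
```

Fixing the seed keeps the preview reproducible, so the numbers shown to leadership can be regenerated later when the full run is compared against the prototype.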
8. Sequel to The Job Search Interview Offer Letter Experience for Data Science
• I work for Viral Launch, a 5-year-old start-up in Indianapolis, IN, that does
keyword and category product discovery for Amazon sellers
• Given Amazon’s diverse dataset, soloed 6+ projects spanning data orchestration,
workflow design, and production code for online software release
• Dependent on interest level, company culture, the right mentor,
relevancy with previous projects/capstone, relative income, career
prospect, growth, and the ability to dive deeper into DS
• Previous Presentation: video and slides
• https://www.youtube.com/watch?v=Fiz1Tn7ogP4
• https://www.slideshare.net/YaoYao44/yao-yao-msds-alum-the-job-search-interview-offer-letter-experience
9. My Timeline
• 2012: Bachelor’s in material science engineering
• Job experience in engineering but did mostly data analyst
work and VBA automation of spreadsheets
• Minor DS mentorships and certificates
• 2017: Started MSDS program Jan 9th (Spring)
• 2018: Started DS job search August 2nd
• Still had capstone paper, ML pres, and QTW due
• 2018: Graduated MSDS program August 27th (Summer)
• 5-semester track; NLP in capstone, ML as elective
• 2018: Landed DS offer Nov 8th; Accepted Nov 10th
• Took 99 days (~3 months); 74 days since graduation
• 2019: Landed next DS offer Oct 6th; Accepted Oct 8th
• Took 28 days (1 month) since initial applications
Highlights from Previous Presentation (Slide 2/52): Nov 2018
10. Variations in Job Titles Related to MSDS
• Analytics Engineer – Supply chain, logistics, financial
• Business Intelligence Analyst – Operations; PowerBI; insights
• Decision Scientist – Ability to convert DS findings into business decisions
• Data Analyst – Mostly SAS, R, SQL; Less Python
• Data Engineer – Data architects and databases for DS
• Data Scientist – Variation in prediction/NLP/ML
• Machine Learning Engineer – 2-5 years of previous DS experience; custom
applications; specific focus
• Predictive Analyst – More stats based; Less ML
• Quantitative Analyst – Financial technology, time series, stats
Last word in job title hints at rigor and salary range
Highlights from Previous Presentation (Slide 18/52): Nov 2018
11. Variations in Job Titles Related to MSDS
• Applied Scientist – Catch-all for those who have MSDS or PhD in computationally
rigorous field like stats, physics, math, industrial operations, CS
• NLP Engineer – Mostly NLP: corpus, Stanford NLP, cosine similarity
• Computer Vision Engineer – Mostly CV: CNN filters, edge detection, robotics
• Data Science, Analyst – Rigorous coding requirements in the DS field: SQL, ML
but has sequestered projects due to large company size and many team divisions
• Deep Learning Engineer – Mostly Neural Networks/ML
• Python ML Developer – Ability to write custom ML algorithms/optimizations
from scratch, like ADAMAX, based on specific domain applications
• AI Developer – Catch-all; Most likely imposter syndrome on the part of HR or the person
Analyst: insights; Scientist: build/explore; Engineer: master; Developer: Hardcore
Highlights from Previous Presentation (Addendum): Nov 2018
12. What Semester to Apply for Jobs?
• Streamline the process
• Once you learn QTW/NLP/ML and have enough time allocated to finish the
capstone project, apply for jobs
• Have the confidence, expertise, and credibility to mitigate imposter
syndrome when applying and interviewing
• Common DS questions: How does a neural network “learn”? How do you tune random
forest hyperparameters? What is XGBoost? How do you train a Naïve Bayes text
classifier?
• Questions can be simple yet unravel an encyclopedia of knowledge to answer; if you know the
answers w/o googling, apply now
• Otherwise, wait until you have mastered them well enough to explain them in your own words
• Job applications can be a “full time” endeavor; hard to balance time
• Streamline attempts; otherwise, you waste time/effort on retries
Highlights from Previous Presentation (Slide 13/52): Nov 2018
13. What Job Title Should I Apply For?
• Data Scientist – your MSDS degree
• Machine Learning Engineer if ambitious, more years exp
• Data Engineer if you have prior software engr exp
• You don’t qualify for senior, lead, principal, chief, director DS positions just yet
• Full-fledged title, no junior or entry, no contract
• May be relevant to previous job field
• If changing fields
• Reflect which parts of capstone/projects/topics resonated with you
• Don’t sabotage yourself by recycling the same project themes
• Apply to the DS job fields that interest you
• The future need not be contingent on the past as long as you plan it
• (Ranks to apply to: Intern, Junior, Entry, Contract, Associate, –, PhD, Senior,
Staff, Head, Principal, Director, Chief)
Highlights from Previous Presentation (Slide 20/52): Nov 2018
14. Techniques to Apply To Jobs Fast
• Filter by “data scientist” -senior or “machine learning” -senior and location
• -senior means senior is removed from the results, and quotes make it one phrase
• Apply every day filtered by “most recent” until you see the job you applied to
yesterday or when the timestamp is “1 day ago”
• LinkedIn premium (optional): insights and morale boost for who viewed your profile
Highlights from Previous Presentation (Slide 34/52): Nov 2018
15. Techniques to Apply To Jobs Fast
• Apply to all available “Easy Apply” regardless of location for interview practice
• Do not apply to Senior, Staff, Head, Principal, Director, Chief positions if you do not qualify
• Optimize your resume format and LinkedIn profile so that information is
imported correctly automatically
• Create application logins with the same credentials and use autocomplete
• For each large company with its own job site, search its job board for more
data scientist or machine learning positions and apply to all of them once you
make a login profile (Walmart, Amazon, Apple, etc. have 50+ at any given time)
• Create Indeed, Dice, Glassdoor, Monster, Yahoo, etc profiles for quick apply but
only apply based on LinkedIn search results to not send duplicate applications
or get phished by fraudulent job listings
Highlights from Previous Presentation (Slide 35/52): Nov 2018
16. 2018 DS Job Application Summary
• Systematically applied to ~800 DS job postings (solve conversion rate)
• Too many phone interviews and HR screeners
• Completed 6 data challenges (<1%)
• >15 video conference interviews by company (~2%)
• Columbus x5, Cincinnati x3, Raleigh x3, Seattle x3, St. Louis x3, Dallas x2, Austin x2, Atlanta x2, Tampa x2, Portland,
Indianapolis, New York, Bentonville, Miami, Denver, Hartford, Pittsburgh, San Diego, Singapore, Irvine, D.C., San
Francisco
• 8 on-site interviews (1%)
• 4 local to Chicago
• Others in Columbus, Austin, Charlotte, Indianapolis
Show your best self w/o having to alter the resume by industry (80:20 rule)
The time of acceptance (Saturday Nov 10th)
• 2 DS job offers (Nov 8th and Nov 9th), both matching (0.25%)
• Competitive with the national average for DS (for 0-1 year exp, check Glassdoor)
• +1 verbal offer (Nov 9th), adjusted for location (Total 0.375%)
• 5 canceled phone interviews; 1 canceled video conference interview
Highlights from Previous Presentation (Slide 51/52): Nov 2018
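The conversion rates quoted on this slide can be reproduced with a few lines of Python. This is a rough sketch: the denominator is the approximate “~800” applications from the slide, the “>15” video interviews are counted as exactly 15, and the names `funnel` and `conversion_rate` are made up for the illustration.

```python
applications = 800  # approximate total from the slide ("~800")

# Stage counts as reported on the slide; ">15" video interviews is treated as 15.
funnel = {
    "data challenges": 6,
    "video conference interviews": 15,
    "on-site interviews": 8,
    "written offers": 2,
    "written + verbal offers": 3,
}


def conversion_rate(count, total):
    """Share of applications that reached a stage, as a percentage."""
    return count / total * 100


for stage, count in funnel.items():
    print(f"{stage}: {conversion_rate(count, applications):.3f}%")
```

Under those assumptions the two written offers work out to 0.25% of applications and the three offers in total to 0.375%, matching the figures above.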
17. 2019 DS Job Application Summary
Started Applications
• Applied to 400+ jobs, mostly in California; PST 3 hours behind EST yet 5+ years ahead in tech
• 2 Offers, 1 site interview, 2 coding challenges, 4 timed aptitude coding and multiple choice
tests, 3 real-time interface coding challenges
• Verbal Offer Oct 4th; Official Offer on Oct 6th; Took 28 days since initial applications
18. Why Change Companies?
• Year’s lease for apartment is up; renewal is risky if DS not prominent nearby
• Salary increase, title promotion, new dataset opportunities elsewhere
• New DS grads can do current work; running out of challenging projects
• Need to develop towards ML/DL/NN/managerial to distinguish self
• Previous projects automated or solved – you have automated your own position
• Current dataset cannot solve aspirational problems given restrictions
• Company leaders do not give free rein or funding for DS to thrive
• Company pivots away from DS, and DS is no longer the core value/business model
• Company does not know how to monetize or see benefits of DS despite
completed projects and explanations; DS has no promotions in title or wage
19. Benefits of Changing Companies
• Similar to the current SMU football coach, you join different
companies and carry out different roles to become more well-rounded
in experience and eventually level up to become CTO
• In some cases, a step back is needed to reevaluate, or to claim
parental leave/sabbatical benefits for positions with lower stakes
• Another 4 bullets on the resume to show versatility
• May be quicker to get promoted by job hopping than internally
• Optimize location/position while increasing salary/perks
• Have more references and opportunities to be inspired by
great coworkers, projects, and business opportunities
• Ride the wave of upcoming industries, new tech, and IPOs
20. Paths to get Promoted
• It is not in your best interest to stay a DS: as master’s programs get better, over time
you will know less than recent graduates if you stay in the same position
• There is always going to be another coder better than you but it is still in your
interest to get better at coding and dive deeper into ML/managerial
• Given that DS is new and positions are available, be reliable and go into
leadership roles so that more industries can better understand and utilize DS
• Technical:
• DS > Sr. DS > Staff DS > Lead DS > Principal DS > CDS/CIO/CTO **
• Harder: you have to oversee CS people who are ML engineers and enhance/correct their code
• Managerial: <- Choose this
• DS > Sr. DS > DS Manager > Head DS > Director DS > CPO/CTO **
• Easier: you manage your team and become less hands-on with code over time
** Path depends on company and some titles are interchangeable
21. What to Look for in Great Companies
• Variety of datasets to prove yourself on multiple projects and not be sequestered
with specific project/dataset jurisdictions due to large company size and many
team divisions; ability to be autonomous and be trusted for your work
• Ability to access/merge datasets without much bureaucracy or clearance
• Company leadership understands DS to be part of the business model/technology
• Opportunity to get promoted once you prove yourself with projects and patents
• Not be shackled by “golden handcuffs” where everything you do for the company is top
secret to the point that you cannot even disclose generically in interviews what you did
• Not get promoted so soon that your experience cannot be easily transferred to another
company because of its niche, or forgo certain IPO benefits if you leave early
• You can envision yourself working there and can explain to a layman what you
are doing for the company and for the betterment of your career
22. What to Watch Out for in Companies
• Ownership of first-hand data collection
• Ownership of data has always been key towards company sustainability
• When purchasing data (to keep hands clean), you do not know the credibility or reliability of
the data; the distributors can get a cease and desist letter to shut down (if black/grey hat)
• APIs change and access can be revoked; free methods suddenly become paid, which alters the
bottom-line budget, changes established data pipelines, or makes past projects obsolete
• DS as a Service is still iffy compared to Software as a Service
• DSaS masquerades as a consulting firm for data science automation
• You end up being the data scientist on-call for technical troubleshooting for their
automation system helping other data scientists rather than doing data science yourself
• Consulting is not ideal for data scientists because firms are more concerned about saving
money than going with the optimal solution
• Whether DS is treated as a novelty or part of the business model
See Slide 33 of 2018 presentation
23. Yellow and Red Flags in the Interview Process
• Check Glassdoor for insights from interviews and current employees alike!
• HR, who are buying plane tickets and hotels for site interviews, are more
concerned about saving money by having layovers and lesser accommodations
than making the candidate feel they are wanted by the company
• HR or company culture has a superiority complex, as if they are doing you a favor rather
than entering a mutually beneficial relationship once you are hired onto the team
• The company’s financials are being restructured by a bank, and the company is more
concerned about paying back the loan or making money for investors with short-run
frantic “flashes in the pan” patches than long-term calculated solutions
• You are going to be the 19th DS there and they are basically hiring to “pad” their
employee count, diversity metrics, and marketing photoshoots of the team
• The existing data science team is already claiming who has seniority/leadership
roles over others simply because they were hired first, yet they are not proven to
be better data scientists or leaders than the people hired later
24. Yellow and Red Flags in the Interview Process
• Your base salary is under market value for your location and you forgo stock
options if you are terminated before completing one year of employment
• There are some “funky” termination clauses/loopholes in your contract that you
must sign prior to accepting the job offer
• There is animosity among the data science teams yet the company views this as
“healthy competition” driven by company culture or peer pressure to get
projects delivered faster yet not as accurate/polished
• Multiple teams can get assigned the same projects without knowing what the other
teams are doing, and certain teams can be on the “chopping block” if they fail to
consistently deliver those projects faster than the other teams
• There are subtle hints during the office tour that the team members do not respect one
another, consider themselves mercenaries, and do not show any team camaraderie
• Historically, team members were restructured due to abandoned projects, layoffs, etc
25. Yellow and Red Flags in the Interview Process
• Somehow, all the data scientists for a company are internally trained to do DS
without a master’s program but instead with some company sanctioned ‘lower
tier’ courses, where you have to correct for their misconceptions
• For example, correcting their misconception that collinearity affects random forests
• The person who wanted to interview you had an emergency and never came back to
reschedule, so another person from an elite team had to interview you, and
nobody knows which team you originally qualified for
• Companies that are hiring DS to “keep up with the Joneses” or are getting a free
crash course on what data science is through your interviews
• Having HR write notes on VERY technical questions about NN and NLP when
it is supposed to be the job of a technically adept person to evaluate you
• Companies that are hiring 1500 DSs for the sake of publicity but they have 2 DS
openings on their job portal and half their hires are 3rd party contracts
26. Yellow and Red Flags in the Interview Process
• They claim that they are using anomaly detection but they are using linear
regression instead of NN/DL or unsupervised clustering
• The whole dashboard of their demo data is based on estimates and barely on
any real numbers
• Their CEO is in the news for controversial activity or has articles written
about his ego and extravagant flaunting of wealth
• Their website is written by a copywriting team to hit every jargon possible to
attract SEO/customers with cheeky one liners and also mentions “big data”
• Evaluate all the yellow and red flags you see
27. Interview Gaffes and More Flags
• Having someone they hired 2 weeks ago interview you for the technical interview
• Asking hard questions that they themselves did not have to go through
• Instead of one-on-one interviews, holding a two-on-one interview session to save
time and tag-team hard questions at the candidate; intimidation factor
• Asking if I knew of Andrew Ng’s course but not going further into that question
• Having everyone interview you one by one without consulting each other to eliminate
redundancies, and then holding a group interview with more redundancies
• Asking about the pros and cons of type 1 and type 2 errors, but then getting offended for personal
reasons when “cancer” was brought up as an example where false positives are benign
• The DS manager going through my data challenge slowly and redundantly not
because he was evaluating my code but because he wanted to learn from me
something he did not know how to do
• Turning in 2 data challenges and getting neither a scheduled phone call nor a rejection letter
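For the type 1 versus type 2 error question mentioned above, a small sketch can make the distinction concrete. The screening results below are invented for illustration, and `error_counts` is a hypothetical helper, not something from the interview.

```python
def error_counts(actual, predicted):
    """Count type 1 (false positive) and type 2 (false negative) errors."""
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_positives, false_negatives


# Hypothetical screening results: True means "has the condition".
actual    = [True, False, False, True, False, False]
predicted = [True, True,  False, False, False, True]

type_1, type_2 = error_counts(actual, predicted)
print(type_1)  # healthy cases flagged as positive (follow-up test needed)
print(type_2)  # real cases that were missed
```

In a screening setting a false positive usually costs a follow-up test, while a false negative means a missed case, which is why the two error types are weighed differently.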
28. Bad Evaluators
• Companies can ask you questions and test you in an exorbitant number of ways
but they still cannot seem to consistently detect talent using the same metrics
• People come from different programs with different levels of difficulty and
cannot be easily compared; somewhat like the NFL draft
• You cannot measure heart, that is, how dedicated certain people are compared to others
• Some positions, regardless of coding aptitude, are considered ‘clerical’ with most of the
constraints defined while others require more creativity in the problem solving process
• Real-time coding during interviews is often treated like an IQ or SAT test
• Problems usually involve dictionaries and lists and do not reflect what you do as a DS
• Solve certain puzzles without the use of libraries within 30 minutes; list comprehension
• Practice using LeetCode and www.geeksforgeeks.org
• Technical questions of stats/NN require you to talk higher-end jargon even
though at work you are required to dumb it down for the layman to understand
29. Coding Puzzle Examples
Find the minimum number of coins for given cents
def num_coin_return(cents):
    if cents < 1:
        return 0
    coins = 0
    drawer = [25, 10, 5, 1]  # quarter, dime, nickel, penny
    for coin_type in drawer:
        coins += cents // coin_type  # how many of this coin fit
        cents = cents % coin_type    # remainder carried to the next coin
    return coins
num_coin_return(32)
4 ## 1 quarter, 1 nickel, 2 pennies
## find remainder of cents given coin denomination and then add to counter
Find the length of the shortest path from start S to end E
in maze given that X’s are walls and o’s are paths.
https://www.cs.bu.edu/teaching/alg/maze/ (recursion, least
steps; edge detection)
Example Input:
o o o X X o o o o o o o
o o E o X o o X o o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o o o o o
Output: 11
Example Input:
o o o X X o o o o X o o
o o E o X o o X X o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o X o o o
Output: None
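The slide gives no solution for the maze puzzle, so here is one possible sketch. It uses breadth-first search rather than the recursion the slide hints at, and it assumes moves are allowed only up, down, left, and right. The function name `shortest_path_length` and the variable names are made up for this example.

```python
from collections import deque


def shortest_path_length(maze_text):
    """Breadth-first search over the grid; returns the step count or None."""
    grid = [line.split() for line in maze_text.strip().splitlines()]
    rows, cols = len(grid), len(grid[0])

    # Locate the start cell.
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if grid[r][c] == "S")

    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if grid[r][c] == "E":
            return steps
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "X" and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None  # E cannot be reached from S


open_maze = """
o o o X X o o o o o o o
o o E o X o o X o o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o o o o o
"""

blocked_maze = """
o o o X X o o o o X o o
o o E o X o o X X o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o X o o o
"""

print(shortest_path_length(open_maze))     # 11
print(shortest_path_length(blocked_maze))  # None
```

Breadth-first search visits cells in order of distance from S, so the first time it reaches E the step count is already the minimum, and an exhausted queue means E is walled off.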
30. Coding Puzzle Examples
Given nums = [2, 7, 11, 15], target = 9,
Because nums[0] + nums[1] = 2 + 7 = 9,
return [0, 1].
def twoSum(nums, target):
    h = {}  # maps each number seen so far to its index
    for i, num in enumerate(nums):
        n = target - num  # complement needed to reach target
        if n not in h:
            h[num] = i
        else:
            return [h[n], i]
## Build a dictionary of the numbers seen so far; for each number, look up the
## difference between the target and that number in the dictionary to find the
## pair of indices summing up to target
Find the largest distance between two consecutive 1's in the binary representation of N
Input: 22
Output: 2
Explanation:
22 in binary is 0b10110.
In the binary representation of 22, there are three ones, and
two consecutive pairs of 1's.
The first consecutive pair of 1's have distance 2.
The second consecutive pair of 1's have distance 1.
The answer is the largest of these two distances, which is 2.
def binary_gap(N):
    ones = [i for i, bit in enumerate(bin(N)[2:]) if bit == '1']
    return max((b - a for a, b in zip(ones, ones[1:])), default=0)
## convert into binary, remove the first two characters ("0b"), record the
## position of every 1, and return the largest difference between neighboring
## positions (0 if there are fewer than two 1's)
31. Coding Puzzle Examples
Find day of week given initial day of week and days after
def solution(S, K):
    day_dict = {
        0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed',
        4: 'Thu', 5: 'Fri', 6: 'Sat'}
    day_dict2 = {v: k for k, v in day_dict.items()}
    a = K % 7
    return day_dict[(day_dict2[S] + a) % 7]
solution('Mon', 33)
Sat
## hard code the dictionary, reverse the dictionary, solve for remainder of
## given days and add to reference, then output day of week
Find the smallest positive number missing from an array
Input: { 2, 3, -7, 6, 8, 1, -10, 15 }
Output: 4
def solution(A):
    m = max(A)
    if m < 1:
        return 1
    if len(A) == 1:
        return 2 if A[0] == 1 else 1
    l = [0] * m
    for i in range(len(A)):
        if A[i] > 0:
            if l[A[i] - 1] != 1:
                l[A[i] - 1] = 1
    for i in range(len(l)):
        if l[i] == 0:
            return i + 1
    return i + 2
## if all are negative, return 1; return 2 or 1 given special case; create a
## new array of 0s given the maximum value as length and set it as 1 if the
## value exists. Loop through new array until 0 is found and return the counter
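The solution above allocates a list as long as the largest value in the input. As an alternative sketch (not the approach from the slide), a set gives the same answer without that allocation; the name `smallest_missing_positive` is hypothetical.

```python
def smallest_missing_positive(values):
    """Smallest positive integer that does not appear in `values`."""
    seen = set(values)  # constant-time membership checks
    candidate = 1
    while candidate in seen:
        candidate += 1
    return candidate


print(smallest_missing_positive([2, 3, -7, 6, 8, 1, -10, 15]))  # 4
```

The loop runs at most len(values) + 1 times, so this version also handles an empty list or a single very large value without special cases.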
32. Once you get the Offer
• Reflect and evaluate them with the keen eye they used to evaluate you
• Did you like your interviewers and team? Is there work life balance? Culture and personality match?
• Were the questions and circumstances fair? Were there any red or yellow flags?
• Do the people and projects there inspire you to do better? Will the team get new hires?
• Did they brush off certain questions with a polished answer or did they expose some weaknesses by
going out of their way to tell you the truth? Was there braggadocio in awards and numbers?
• Are there substantial contributions that you can see yourself making towards company development
and growth? Do you understand the business model and how DS contributes to it?
• Is there opportunity for promotion if you develop enough projects? Leverage this position for exp?
• See that you are able to envision yourself spending at least a year on the datasets given
and projects discussed; sign a year lease if not final destination.
• Negotiate for salary depending on how well you did in the interview process and how
much demand they have for your talent
• Based on market value, taxes, housing, commute, relocation, perks, standard of living
See Slide 38 and 46 of 2018 presentation
33. Things that are Unfortunately Found Out Later
• Leadership is more concerned or very single-minded about saving money and
shoots down any attempts for the DS team to expand the budget needed for:
• Necessary hires, such as a data engineer, to consolidate all the data and upgrade databases
to not crash or hit above the allocated threshold while finding better database systems
more suitable for the purposes of how the data is queried
• Having them realize that a data architect or dev ops cannot substitute for a data engineer sometimes
• Buying and allocating local servers to save money over cloud servers
• Allocating certain local machines for parallel computing for machine learning purposes
• Leadership copies its entire business model/strategy from a
different company that may not even be in the same domain
• Hiring gaps, layoffs, and unsustainable initial growth metrics for start ups
• Business strategy and data pipeline changes because of API changes, new
scraping policies, audits, pivots, and cease and desist letters
34. Work Computers
Specs | Microsoft Surface Book 2 + Docking Station | Dell Precision Workstation T7600
Year Released | 2017 | 2012
Processor Speed | 1.90 GHz | 2.60 GHz
RAM | 16 GB LPDDR3 | 128 GB (16x8 GB) DDR3 ECC RDIMM
Graphics card | NVIDIA GeForce GTX 1060, 6GB | Not included, self install*
Storage | 256 GB SSD | Not included, self install*
CPU | Intel Core i7-8650U quad-core | E5-2600 family 64-bit Xeon, 16 core
Display output | USB-C to HDMI (or 2x mini display ports from dock) | 2x DVI
USB ports | 2x USB 3.0 + 1x USB-C (+ 4x USB 3.0 from dock) | 8x USB 2.0 + 2x USB 3.0
eBay cost | ~$1900 (New) + ~$90 Docking station (New) | ~$920 (Manufacture Refurbished)
Surface Book 2 + docking station: https://www.ebay.com/itm/312250907713, https://www.ebay.com/itm/173753646668
Precision Workstation (Rig): https://www.ebay.com/itm/183161180290, also need hard drive and graphics card*
*May need hardware-literate person to help you install
If you need recommendations for work computers
35. Archetypes at Work
• Bootcamp guy – Certificate but superficially learned everything DS in 12 weeks.
Rests on laurels from undergrad, which is a field change from current job. You
have to carry him to the finish line and actively ask what he does not understand
• Antisocial coders – stay on topic with the task and keep conversations light
• DS who are rigid thinkers – Given that most insights, data merges, and feature
engineering require a certain amount of creativity to be valuable, it does not
benefit the company if you remain ‘clerical’ in your attempts; fewer cookbooks
• People who are territorial about their data or projects – the whole department
looks bad if this person is selfish and takes too long because the project gets
canceled or reassigned. We get penalized for wasting time on incompletes
• People who compete for your projects from another department – corporate
culture sees this as ‘healthy competition’ to get the work done so beat them in
time and accuracy and unfortunately sacrifice nights and weekends
36. Archetypes at Work
• The head honcho who does not disclose his proprietary company contributions
even for the sake of downstream collaborations – he does this for job security,
to justify his high salary, and so that no one can understand and take over his work
• Other departments who vote against new hires joining your team –
they see your DS field as a ‘novelty’ job title and do not fully understand
why certain high-paying positions are needed when they can substitute
a close job title from another department or get by without one
• The business data analyst who thinks he can do DS – tell him to stop padding his
resume with buzzwords and algorithms he does not fully understand because
everybody thinks they are DS until they realize they are not; go get a master’s
• The political bureaucratic head – uses gambits to get more benefits, and a blind
eye from leadership, while underperforming
37. Archetypes at Work
• The phone-a-friend lifeline guy – The guy who passed the coding tests and
interview but actively seeks outside help to do his work because he is underqualified
• The domain experts who refuse new insights – you present new findings to
them but, either because they are getting outclassed or because they see more work
in the inevitable future, they dismiss your findings and refuse to share them with their team
• The domain experts who think the world of you but simply do not understand
– they give you the time of day and are patient with how you see your findings
affecting their work, but they do not want to fix something that is not broken
• Non-technical leadership people who think they can lead a tech department
meeting and direct methods – imposter syndrome is too strong for them to
realize that their ideas are good but the methodology should be left to us
38. Archetypes at Work
• Coders with no DS background who think they can do DS – findings they
report to leadership need to be nullified if proven wrong
• Marketing PR people who leverage DS without knowing how to write up findings –
it looks bad if they unknowingly broadcast to the general public, on our behalf,
that correlation is causation or other common statistical misconceptions
• Domain experts and tech allies who help you on the job – Given that work
emails and communications are transitory, get their contact information for later projects
• People who learn unrelated software on the job – ignore; develop yourself
• Leadership who do not understand the budget and hiring needs of the DS
department and also do not understand why models take so long to run – This
is why we need DS leaders to educate and take their positions in the near future
39. Projects
• It is not a coincidence that Microsoft bought GitHub and LinkedIn and hosts
Azure: in principle they can find your code, see your data and your
messages, and hire you. Use different platforms operated by multiple companies
• Take full or more ownership of your projects
• Overdeliver based on what they require from you and add DS insights not considered
• Be resilient on projects and find ways to complete them to increase your project count
• Continue to take Udacity and Udemy courses to better yourself; read research papers
• Complete projects based on your standards and not (lesser) standards of others
• Develop and grow without shortchanging yourself; have intrinsic motivation to complete
projects based on your capabilities and do not settle for consolation extrinsic motivation
• Use DS on pet projects outside of work and attend DS social gatherings
• A regular DS day looks like a 5th semester workload but you get to go home at 5pm
40. Generic Project
• To reduce overfitting, do not ensemble; instead run 6 algorithms in parallel and
keep the model with the best (lowest) RMSE for regression. Then, in series, sum
the daily values into monthly values and run the 6 algorithms plus ‘no model’ to
find the best RMSE for the monthly estimates
[Diagram: two-stage model selection]
Daily Level: Feature Engr -> XGBoost Reg, Extra Trees Reg, Random Forest Reg,
XGBoost Reg w/GS, Extra Trees Reg w/GS, Random Forest Reg w/GS
-> Find best RMSE and Sum from best model
Monthly Level: XGBoost Reg, Extra Trees Reg, Random Forest Reg,
XGBoost Reg w/GS, Extra Trees Reg w/GS, Random Forest Reg w/GS, No Model
-> Find best RMSE and Use best model
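The two-stage selection on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the actual project: MeanModel and LineModel are hypothetical stand-ins for the six tree-based regressors, the data is synthetic, and "no model" is assumed to mean using the summed daily predictions directly as the monthly estimate.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def best_by_rmse(predictions, y_true):
    """Return (name, rmse) of the prediction list with the lowest RMSE."""
    scores = {name: rmse(y_true, pred) for name, pred in predictions.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


def sum_by_month(months, values):
    """Sum daily values into monthly totals, returned in sorted month order."""
    totals = defaultdict(float)
    for month, value in zip(months, values):
        totals[month] += value
    return [totals[m] for m in sorted(totals)]


class MeanModel:
    """Hypothetical stand-in: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


class LineModel:
    """Hypothetical stand-in: least-squares line on the first feature."""
    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        var = sum((x - mx) ** 2 for x in xs)
        self.slope_ = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / var if var else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


def select_daily(candidates, X_train, y_train, X_val, y_val):
    """Fit each candidate on the training days and score it on the validation days."""
    preds = {name: make().fit(X_train, y_train).predict(X_val)
             for name, make in candidates.items()}
    best, score = best_by_rmse(preds, y_val)
    return best, score, preds[best]


if __name__ == "__main__":
    # Synthetic daily series (made up for illustration): y = 2 * day + 1
    days = list(range(60))
    X = [[d] for d in days]
    y = [2.0 * d + 1.0 for d in days]
    candidates = {"mean": MeanModel, "line": LineModel}
    best, score, daily_pred = select_daily(candidates, X[:40], y[:40], X[40:], y[40:])
    print("best daily model:", best, "RMSE:", round(score, 6))

    # Monthly level, using pretend 10-day "months"
    val_months = [d // 10 for d in days[40:]]
    y_month = sum_by_month(val_months, y[40:])
    train_month = sum_by_month([d // 10 for d in days[:40]], y[:40])
    month_model = MeanModel().fit([[i] for i in range(len(train_month))], train_month)
    monthly = {
        "no model": sum_by_month(val_months, daily_pred),  # summed daily predictions
        "monthly mean": month_model.predict([[m] for m in sorted(set(val_months))]),
    }
    print("best monthly estimate:", best_by_rmse(monthly, y_month))
```

Any object with fit and predict methods could be dropped into the candidates dictionary in place of the stand-ins, which is why the selection step only ever looks at prediction lists.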
41. Multiprocessing
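from glob import glob
from multiprocessing import Pool, current_process
import gc
import pandas as pd

def doWork(work):
    # Runs in a worker process; `work` is one file path
    process_id = str(current_process().pid)
    file_ = work
    print(process_id, ':', file_)
    ## code here
    v = pd.DataFrame({'file': [file_]})  # placeholder row (assumption); replace with the real result
    return v  ## flattened one row

if __name__ == '__main__':
    allFiles = glob('C://Users//Yao//Desktop//all//**//*-large.jpg', recursive=True)
    work = list(allFiles)  # one work item per file
    if not work:
        raise SystemExit('no files matched')  # Pool(0) and an empty concat would both fail
    num_procs = 8 if len(work) > 8 else len(work)
    with Pool(num_procs) as p:
        results = p.map(doWork, work)  # one single-row DataFrame per file
    df = pd.concat(results, axis=0, ignore_index=True, sort=False).reset_index(drop=True)
    df.to_csv('df.csv', index=False)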
Given that a workstation may have
as many as 16 cores, you can use
this generic multiprocessing code
to write a function that pulls 8
different rows or files and runs
them simultaneously: very good
for running models simultaneously
or downloading files; I taught my
company how to run code this way
PM me on LinkedIn to learn this
www.linkedin.com/in/yaoya0/
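Since the number of cores varies by machine, the worker count could be derived from the machine instead of being hard-coded. The sketch below is one possible way to do that, not part of the original slide: pick_num_procs, run_parallel, and square are hypothetical helper names, and the cap of 8 simply mirrors the slide's choice.

```python
import os
from multiprocessing import Pool


def pick_num_procs(n_tasks, cap=8):
    """Use at most `cap` workers, never more than the tasks or the reported cores."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cap, n_tasks, cores))


def square(x):
    """Toy task standing in for the per-file work."""
    return x * x


def run_parallel(func, work, cap=8):
    """Map `func` over `work` with a process pool sized by pick_num_procs."""
    if not work:
        return []  # nothing to do, so no pool is started
    with Pool(pick_num_procs(len(work), cap)) as pool:
        return pool.map(func, work)


if __name__ == "__main__":
    print(run_parallel(square, [1, 2, 3, 4]))
```

The function passed to run_parallel has to be defined at module level so the worker processes can import it, which is the same reason doWork sits outside the main guard on the slide.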
42. Kanopy / Your Job Hunt
• I will be a DS at Kanopy starting 10/28/19 in Irvine, CA
• Kanopy licenses films for library patrons to stream digitally
• I will be working on their recommendation engine for films
• Recommendation systems were my master’s project: https://youtu.be/uavbPKiUg9M
• Think like an opportunist; mass apply to everything DS
• Mass apply to retool, upgrade, and adapt
• Keep your interview pipeline full
• Accept the DS job offer that meets your conditions for interests, career
growth, and the support to be successful
Happy Halloween, Good Luck!