Lessons After Working as a
Data Scientist for 1 Year
Yao Yao
SMU DS Alumni Talks Oct ’19
Graduated SMU MSDS Aug ’18
Hired as Data Scientist Nov ’18
Data Science is an Expression of Self in Data
• Given a dataset and an open-ended problem, every solution will differ: from the
way you code, to the libraries and algorithms used, to the different views, to the
data merges, to the visualization, to the final solution
• It is a culmination of all the experiences you went through toward understanding
and extracting business insights from the dataset
• Transfer learning from unrelated disciplines, certificate courses, code references of what is
possible, and imagination of possibilities all feed into a solution that is rewardingly you
• The creative eye of a data scientist is hard to evaluate with live coding puzzles and
technical questions; therefore, data challenges and interviews are needed
• Even then, some are rigid in their solution path with established cookbooks while
others are nonconformist, eclectic thinkers who tailor their solution beyond
what the problem and dataset suggested
Evaluating Data Scientists
• Fundamentally understand the nature and math of how solutions are derived
• Some data scientists thrive in a certain domain and can find stability in a
niche while others require multiple variations of different datasets and problems
to challenge them. Evaluate data scientists holistically instead of by one metric
• Data science is sometimes hard to monetize, which makes higher salaries hard to justify
• If the same insights are sold to multiple parties, the information is no longer exclusive and
eventually becomes public knowledge, which drives the value of the information down. The
insights are truly valuable if one party has exclusive access and can use them competitively
• To resist insight-value degradation due to economies of scale, find niches in
DS in which the value increases when there is more participation
• DNA testing of ancestry becomes more accurate as more people participate in the dataset
• Facebook and Google target ads and recommend news by finding people with tastes
similar to yours through cookie tracking, which improves as more people use their systems
Evaluating Data Scientists
• Unlike software developers and engineers, most of our work is built on the ability
to produce insights based on our coding and statistics acumen, algorithm
methodology, and creativity to extract the business insights needed to
induce action, build data pipeline products, and/or embed systems in IoT devices
• Because we are so many steps removed from building software or a website that directly
earns money, we get paid less, and companies sometimes cannot gauge the
direct relationship between what we produce and how that ultimately generates money
• Find niches, such as FinTech, where the proprietary insights we produce have
a more direct route to generating revenue based on decisions from insights
• Build products like recommendation/ad targeting systems where the perceived
value of the information distributed does not degrade due to economies of scale
or knockoffs bypassing R&D and instead increases once more people participate
Data Science has Clever Ways to Reflect Life
• Neural networks reflect dendrites in a human brain
• CNNs are filters through which a computer perceives images
• Generative adversarial network (GAN) learning, which is “sword sharpens sword”
• Adversarial: Be challenged, have thought experiments, and cover blind spots
• Reinforcement learning, which is “being an apprentice to a mentor”
• Become a better version of yourself by learning from established others
• Use DS to optimize for life given that life is the ultimate DS problem
• Be creative and have fewer inhibitions that prevent you from being successful
• Apply DS in rare domains such as mergers and acquisitions and portfolio asset management
• Given that life has imperfect information: help people by building products and
gaining insights with the compensation as a byproduct of what you do
DS is Relatable, For Better or Worse
• DS has so many algorithmic metaphors to explain everything intuitively
• This helps leadership with tech backgrounds understand intuitively
what we do, and it can improve your communication skills on the way to becoming a manager
• However, every layman thinks that they are a DS until they are proven otherwise
• Bad when devs with only coding backgrounds try to encroach on DS work by
doing PowerBI without intuitively knowing statistics, or using correlation, which
is the bottom tier of analysis, to infer causation, or using improper
cross-validation techniques and metrics to evaluate “cookbook methods” pulled online
• Worse when leadership only values percent change between endpoints for better
marketing PR instead of fitting a curve with a slope for better metrics
• When you explain things too simply so that they will understand, it backfires when
leadership does not see why certain algorithms still take a long time to run
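The endpoint-versus-slope point above can be sketched in plain Python. This is only an illustration: the monthly figures are made up, and `percent_change` and `fitted_slope` are hypothetical helper names, not anything from the talk.

```python
def percent_change(values):
    """Percent change using only the first and last observations."""
    return (values[-1] - values[0]) / values[0] * 100


def fitted_slope(values):
    """Ordinary least-squares slope of values against their index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical monthly metric: steady decline, then a spike in the last month.
metric = [100, 90, 80, 70, 60, 105]

endpoint_view = percent_change(metric)  # looks like growth
trend_view = fitted_slope(metric)       # the fitted trend is negative
print(endpoint_view, trend_view)
```

With these made-up numbers the endpoints report +5% while the fitted slope is about -2.1 units per month, which is the kind of disagreement the bullet warns about.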
DS Discretion
• In industry, a lot of software is held together by “rope and duct tape”
• Not all completed code is perfect, but to self-review or become a better coder,
use split screen and retype your code into the final version, dropping excessive
exploratory functions and using more efficient methodology
• Instead of being aspirational by importing all the data and using the heaviest-duty
algorithm your computer can run, which takes two weeks or more, subsample
and prototype so that leadership can get a sense of the initial
results and approve running the full model with all the data
• Visualizations always do better at persuading leadership, clients, customers,
investors, etc. than the same results presented as numbers
• Improvise, adapt, overcome; be a nonconformist, nonconventional, eclectic
thinker because all standard practice and common sense ideas have been done
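The subsample-and-prototype advice above might look like the following minimal sketch. The `subsample` helper, the 1% fraction, and the stand-in dataset are all assumptions for illustration; a real project would sample from its own data store.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset holding roughly `fraction` of rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(rows) * fraction))
    rng = random.Random(seed)  # fixed seed so the prototype can be re-run
    return rng.sample(rows, k)


# Hypothetical stand-in for a full dataset of 1,000,000 records.
full_data = list(range(1_000_000))
prototype_data = subsample(full_data, fraction=0.01)
print(len(prototype_data))
```

Fixing the seed keeps the prototype results repeatable when leadership asks for the same numbers again before approving the full run.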
Sequel to The Job Search Interview Offer
Letter Experience for Data Science
• I work for Viral Launch (5-year-old start-up in Indianapolis, IN): they do
keyword and category product discovery for Amazon sellers
• Given Amazon’s diverse dataset
• Soloed 6+ projects, from data orchestration and workflow design to production
code for online software release
• Dependent on interest level, company culture, the right mentor,
relevancy with previous projects/capstone, relative income, career
prospect, growth, and the ability to dive deeper into DS
• Previous Presentation: video and slides
• https://www.youtube.com/watch?v=Fiz1Tn7ogP4
• https://www.slideshare.net/YaoYao44/yao-yao-msds-alum-the-job-search-interview-offer-letter-experience
My Timeline
• 2012: Bachelor’s in material science engineering
• Job experience in engineering but did mostly data analyst
work and VBA automation of spreadsheets
• Minor DS mentorships and certificates
• 2017: Started MSDS program Jan 9th (Spring)
• 2018: Started DS job search August 2nd
• Still had capstone paper, ML pres, and QTW due
• 2018: Graduated MSDS program August 27th (Summer)
• 5-semester track; NLP in capstone, ML as elective
• 2018: Landed DS offer Nov 8th; Accepted Nov 10th
• Took 99 days (~3 months); 74 days since graduation
• 2019: Landed next DS offer Oct 6th; Accepted Oct 8th
• Took 28 days (1 month) since initial applications
Highlights from Previous Presentation (Slide 2/52): Nov 2018
Variations in Job Titles Related to MSDS
• Analytics Engineer – Supply chain, logistics, financial
• Business Intelligence Analyst – Operations; PowerBI; insights
• Decision Scientist – Ability to convert DS findings into business decisions
• Data Analyst – Mostly SAS, R, SQL; less Python
• Data Engineer – Data architects and databases for DS
• Data Scientist – Variation in prediction/NLP/ML
• Machine Learning Engineer – 2-5 years of previous DS experience; custom
applications; specific focus
• Predictive Analyst – More stats based; Less ML
• Quantitative Analyst – Financial technology, time series, stats
Last word in job title hints at rigor and salary range
Highlights from Previous Presentation (Slide 18/52): Nov 2018
Variations in Job Titles Related to MSDS
• Applied Scientist – Catch-all for those who have MSDS or PhD in computationally
rigorous field like stats, physics, math, industrial operations, CS
• NLP Engineer – Mostly NLP: corpus, Stanford NLP, cosine similarity
• Computer Vision Engineer – Mostly CV: CNN filters, edge detection, robotics
• Data Science, Analyst – Rigorous coding requirements in the DS field: SQL, ML
but has sequestered projects due to large company size and many team divisions
• Deep Learning Engineer – Mostly Neural Networks/ML
• Python ML Developer – Ability to write custom ML algorithms/optimizations
from scratch, like ADAMAX, based on specific domain applications
• AI Developer – Catch-all; the title most likely reflects imposter syndrome on the part of HR or the person
Analyst: insights; Scientist: build/explore; Engineer: master; Developer: Hardcore
Highlights from Previous Presentation (Addendum): Nov 2018
What Semester to Apply for Jobs?
• Streamline the process
• Once you learn QTW/NLP/ML and have enough time allocated to finish the
capstone project, apply for jobs
• Have the confidence, expertise, and credibility to mitigate imposter
syndrome when applying and interviewing
• Common DS questions: How does a neural network “learn”? How do you tune random
forest hyperparameters? What is XGBoost? How do you train Naïve Bayes text
classifiers?
• Questions can be simple yet unravel an encyclopedia of knowledge to answer; if you know the
answers w/o googling, apply now
• Otherwise, wait until you have mastered them well enough to explain them satisfactorily in your own words
• Job applications can be a “full time” endeavor; hard to balance time
• Streamline attempts; otherwise, you waste time/effort on retries
Highlights from Previous Presentation (Slide 13/52): Nov 2018
What Job Title Should I Apply For?
• Data Scientist – your MSDS degree
• Machine Learning Engineer if ambitious, more years exp
• Data Engineer if you have prior software engr exp
• You don’t qualify for senior, lead, principal, chief, director DS positions just yet
• Full-fledged title, no junior or entry, no contract
• May be relevant to previous job field
• If changing fields
• Reflect on which parts of capstone/projects/topics resonated with you
• Don’t sabotage yourself by recycling the same project themes
• Apply to the DS job fields that interest you
• The future need not be contingent on the past as long as you plan it
• (Ranks to apply to: Intern, Junior, Entry, Contract, Associate, –, PhD, Senior,
Staff, Head, Principal, Director, Chief)
Highlights from Previous Presentation (Slide 20/52): Nov 2018
Techniques to Apply To Jobs Fast
• Filter by “data scientist” -senior or “machine learning” -senior and location
• -senior means senior is removed from the results, and quotes make it one phrase
• Apply every day, filtered by “most recent”, until you see the job you applied to
yesterday or the timestamp reads “1 day ago”
• LinkedIn premium (optional): insights and morale boost for who viewed your profile
Highlights from Previous Presentation (Slide 34/52): Nov 2018
Techniques to Apply To Jobs Fast
• Apply to all available “Easy Apply” regardless of location for interview practice
• Do not apply to Senior, Staff, Head, Principal, Director, Chief positions if you do not qualify
• Optimize your resume format and LinkedIn profile so that information is
imported correctly automatically
• Create application logins with the same credentials and use autocomplete
• On large companies’ own job boards, search for more
data scientist or machine learning positions and apply to all of them once you
make a login profile (Walmart, Amazon, Apple, etc. have 50+ at any given time)
• Create Indeed, Dice, Glassdoor, Monster, Yahoo, etc. profiles for quick apply but
only apply based on LinkedIn search results so you do not send duplicate applications
or get phished by fraudulent job listings
Highlights from Previous Presentation (Slide 35/52): Nov 2018
2018 DS Job Application Summary
• Systematically applied to ~800 DS job postings (solve conversion rate)
• Too many phone interviews and HR screeners
• Completed 6 data challenges (<1%)
• >15 video conference interviews by company (~2%)
• Columbus x5, Cincinnati x3, Raleigh x3, Seattle x3, St. Louis x3, Dallas x2, Austin x2, Atlanta x2, Tampa x2, Portland,
Indianapolis, New York, Bentonville, Miami, Denver, Hartford, Pittsburgh, San Diego, Singapore, Irvine, D.C., San
Francisco
• 8 on-site interviews (1%)
• 4 local to Chicago
• Others in Columbus, Austin, Charlotte, Indianapolis
Show your best self w/o having to alter the resume by industry (80:20 rule)
The time of acceptance (Saturday Nov 10th)
• 2 DS job offers (Nov 8th and Nov 9th), both matching (0.25%)
• Competitive with the national average for DS (for 0-1 year exp, check Glassdoor)
• +1 verbal offer (Nov 9th), adjusted for location (Total 0.375%)
• 5 canceled phone interviews; 1 canceled video conference interview
Highlights from Previous Presentation (Slide 51/52): Nov 2018
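The percentages on the 2018 summary can be reproduced with a few lines of Python. The counts are taken from the slide; treating "~800" as exactly 800 and ">15" as 15 is an assumption made here so the arithmetic has fixed inputs.

```python
applications = 800  # "~800" on the slide, taken as exactly 800 (assumption)

# Stage counts as reported on the slide (">15" video interviews taken as 15).
stages = {
    "data challenges": 6,
    "video interviews": 15,
    "on-site interviews": 8,
    "written offers": 2,
    "offers incl. verbal": 3,
}

# Conversion rate of each stage relative to total applications, in percent.
rates = {name: count / applications * 100 for name, count in stages.items()}
for name, pct in rates.items():
    print(f"{name}: {pct:.3f}%")
```

Under those assumptions the rates come out to 0.75%, 1.875%, 1%, 0.25%, and 0.375%, matching the rounded figures quoted in the bullets.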
2019 DS Job Application Summary
Started Applications
• Applied to 400+ jobs, mostly in California; PST 3 hours behind EST yet 5+ years ahead in tech
• 2 Offers, 1 site interview, 2 coding challenges, 4 timed aptitude coding and multiple choice
tests, 3 real-time interface coding challenges
• Verbal Offer Oct 4th; Official Offer on Oct 6th; Took 28 days since initial applications
Why Change Companies?
• Year’s lease for apartment is up; renewal is risky if DS not prominent nearby
• Salary increase, title promotion, new dataset opportunities elsewhere
• New DS grads can do current work; running out of challenging projects
• Need to develop towards ML/DL/NN/managerial to distinguish self
• Previous projects automated or solved – automated your own position
• Current dataset cannot solve aspirational problems given restrictions
• Company leaders do not give free rein or funding for DS to thrive
• Company pivots away from DS, which is no longer the core value/business model
• Company does not know how to monetize or see benefits of DS despite
completed projects and explanations; DS has no promotions in title or wage
Benefits of Changing Companies
• Similar to the current SMU football coach, you join different
companies carrying out different roles to be more rounded
in experience to eventually level up to become CTO
• In some cases, a step back is needed to reevaluate, or to claim
parental leave/sabbatical benefits for positions with lower stakes
• Another 4 bullets on the resume to show versatility
• May be quicker to be promoted by job hopping than internally
• Optimize location/position while increasing salary/perks
• Have more references and opportunities to be inspired by
great coworkers, projects, and business opportunities
• Ride the wave of upcoming industries, new tech, and IPOs
Paths to get Promoted
• It is not in your best interest to stay a DS: as master’s programs get better, you
know less than the recent graduate over time if you stay in the same position
• There is always going to be another coder better than you but it is still in your
interest to get better at coding and dive deeper into ML/managerial
• Given that DS is new and positions are available, be reliable and go into
leadership roles so that more industries can better understand and utilize DS
• Technical:
• DS > Sr. DS > Staff DS > Lead DS > Principal DS > CDS/CIO/CTO **
• Harder: you have to oversee CS who are ML engineers to enhance/correct their code
• Managerial: <- Choose this
• DS > Sr. DS > DS Manager > Head DS > Director DS > CPO/CTO **
• Easier: you are just managing your team over time and less hands on with code over time
** Path depends on company and some titles are interchangeable
What to Look for in Great Companies
• Variety of datasets to prove yourself on multiple projects and not be sequestered
with specific project/dataset jurisdictions due to large company size and many
team divisions; ability to be autonomous and be trusted for your work
• Ability to access/merge datasets without much bureaucracy or clearance
• Company leadership understands DS to be in business model/technology
• Opportunity to get promoted once you prove yourself with projects and patents
• Not be shackled by “golden handcuffs” where everything you do for the company is top
secret to the point that you cannot even disclose generically in interviews what you did
• Not get promoted so soon that your experience cannot be easily transferred to another
company because it is too niche, or that you forgo certain IPO benefits if you leave early
• You can envision yourself working there and can explain to a layman what you
are doing for the company and for the betterment of your career
What to Watch Out for in Companies
• Ownership of first-hand data collection
• Ownership of data has always been key to company sustainability
• When purchasing data (to keep hands clean), you do not know the credibility or reliability of
the data; the distributors can get a cease and desist letter to shut down (if black/grey hat)
• APIs change and access can be revoked; free methods suddenly become paid, which alters the
bottom-line budget, changes established data pipelines, or makes past projects obsolete
• DS as a Service is still iffy compared to Software as a Service
• DSaS masquerades as a consulting firm for data science automation
• You end up being the data scientist on-call for technical troubleshooting for their
automation system helping other data scientists rather than doing data science yourself
• Consulting is not ideal for data scientists because firms are more concerned about saving
money than going with the optimal solution
• Whether DS is treated as a novelty or part of the business model
See Slide 33 of 2018 presentation
Yellow and Red Flags in the Interview Process
• Check Glassdoor for insights from interviews and current employees alike!
• HR, who are buying plane tickets and hotels for site interviews, are more
concerned about saving money by having layovers and lesser accommodations
than making the candidate feel they are wanted by the company
• HR or company culture has a superiority complex, acting as if they are doing you a favor
rather than entering a mutually beneficial arrangement once you are hired on to the team
• The company’s financials are being restructured by a bank and the company is more concerned
about paying back the loan or making money for investors with short-run,
frantic “flash in the pan” patches than long-term calculated solutions
• You are going to be the 19th DS there and they are basically hiring to “pad” their
employee count, diversity metrics, and marketing photoshoots of the team
• The existing data science team is already claiming who has seniority/leadership
roles over others simply because they were hired first, yet they are not proven to be better
data scientists or leaders than the people hired later
Yellow and Red Flags in the Interview Process
• Your base salary is under market value for your location and you forgo stock
options if you are terminated before completing one year of employment
• There are some “funky” termination clauses/loopholes in your contract that you
must sign prior to accepting the job offer
• There is animosity among the data science teams yet the company views this as
“healthy competition” driven by company culture or peer pressure to get
projects delivered faster yet not as accurate/polished
• Multiple teams can get assigned the same projects without knowledge of the other
teams’ whereabouts, where certain teams can be on the “chopping block” if they fail to
consistently deliver those projects faster than the other teams
• There are subtle hints during the office tour that the team members do not respect one another,
consider themselves mercenaries, and do not showcase any team camaraderie
• Historically, team members were restructured due to abandoned projects, layoffs, etc
Yellow and Red Flags in the Interview Process
• Somehow, all the data scientists for a company are internally trained to do DS
without a master’s program but instead with some company sanctioned ‘lower
tier’ courses, where you have to correct for their misconceptions
• e.g., correcting their misconception that collinearity affects random forest
• The person who wanted to interview you had an emergency but never came
back to reschedule for the interview and another person from an elite team had
to interview you and nobody knows the original team that you qualified for
• Companies that are hiring DS to “keep up with the Joneses” or are getting a free
crash course on what data science is through your interviews
• Having HR take notes on VERY technical questions about NN and NLP when
it is supposed to be the job of a technically adept person to evaluate you
• Companies that are hiring 1500 DSs for the sake of publicity but they have 2 DS
openings on their job portal and half their hires are 3rd party contracts
Yellow and Red Flags in the Interview Process
• They claim that they are using anomaly detection but they are using linear
regression instead of NN/DL or unsupervised clustering
• The whole dashboard of the demo data that they have is based on estimates
and barely on any real numbers
• Their CEO is in the news for controversial activity or has articles written about
his ego and extravagant flaunting of wealth
• Their website is written by a copywriting team to hit every jargon possible to
attract SEO/customers with cheeky one liners and also mentions “big data”
• Evaluate all the yellow and red flags you see
Interview Gaffes and More Flags
• Having someone they hired 2 weeks ago interview you for the technical interview
• Ask some hard questions that they themselves did not have to go through
• Instead of one-on-one interviews, have a two-on-one interview session to save
time and tag-team hard questions towards the candidate; intimidation factor
• Asking if I knew of Andrew Ng’s course but not go further into that question
• Have everyone interview you one-by-one but not consult each other to eliminate
redundancies and then have a consortium interview with more redundancies
• Asking what the pros and cons are of type I and type II errors but then getting offended, for personal
reasons, when “cancer” was brought up as an example where false positives are benign
• The DS manager going through my data challenge slowly and redundantly not
because he was evaluating my code but instead he wanted to learn from me
something he did not know how to do
• Turning in 2 data challenges and not getting a scheduled phone call nor rejection letter
Bad Evaluators
• Companies can ask you questions and test you in an exorbitant number of ways
but they still cannot seem to consistently detect talent using the same metrics
• People come from different programs with different levels of difficulty and
cannot be easily compared; somewhat like the NFL draft
• You cannot measure heart on how dedicated certain people are over others
• Some positions, regardless of coding aptitude, are considered ‘clerical’ with most of the
constraints defined while others require more creativity in the problem solving process
• Coding in real time during interviews is often treated as an IQ or SAT test
• Problems are usually dictionaries and lists and have no reflection of what you do as a DS
• Solve certain puzzles without the use of libraries within 30 minutes; list comprehension
• Practice using Leetcode and www.geeksforgeeks.org
• Technical questions of stats/NN require you to talk higher-end jargon even
though at work you are required to dumb it down for the layman to understand
Coding Puzzle Examples
Find the minimum number of coins for given cents
def num_coin_return(cents):
    if cents < 1:
        return 0
    coins = 0
    drawer = [25, 10, 5, 1]
    for coin_type in drawer:
        coins += cents // coin_type
        cents = cents % coin_type
    return coins
num_coin_return(32)
4 ## 1 quarter, 1 nickel, 2 pennies
## find remainder of cents given coin denomination and then add to counter
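The greedy loop above takes the largest coin first, which gives the minimum for US denominations but not for every coin set. As a hedged alternative (not the author's method), a dynamic-programming sketch handles arbitrary denominations; `min_coins` and the `[4, 3, 1]` coin set are hypothetical names and values chosen for illustration.

```python
def min_coins(cents, denominations):
    """Fewest coins summing to `cents`, or None if no combination exists."""
    if cents < 0:
        return None
    inf = float("inf")
    best = [0] + [inf] * cents  # best[a] = fewest coins that make amount a
    for amount in range(1, cents + 1):
        for coin in denominations:
            if coin <= amount and best[amount - coin] + 1 < best[amount]:
                best[amount] = best[amount - coin] + 1
    return None if best[cents] == inf else best[cents]


print(min_coins(32, [25, 10, 5, 1]))  # same answer as the greedy version
print(min_coins(6, [4, 3, 1]))        # greedy would pick 4 + 1 + 1 here
```

The trade-off is time and memory proportional to the amount, so in an interview the greedy version is the faster answer when the coin set is known to be the US one.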
Find the length of the shortest path from start S to end E
in maze given that X’s are walls and o’s are paths.
https://www.cs.bu.edu/teaching/alg/maze/ (recursion, least
steps; edge detection)
Example Input:
o o o X X o o o o o o o
o o E o X o o X o o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o o o o o
Output: 11
Example Input:
o o o X X o o o o X o o
o o E o X o o X X o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o X o o o
Output: None
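One way to solve the maze puzzle above is breadth-first search, sketched below. This swaps in BFS rather than the recursion the linked page describes, and it assumes moves are only up, down, left, and right. The function name `shortest_path_length` and the `MAZE_1`/`MAZE_2` constants are hypothetical; the grids are the two example inputs.

```python
from collections import deque

MAZE_1 = """
o o o X X o o o o o o o
o o E o X o o X o o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o o o o o
"""

MAZE_2 = """
o o o X X o o o o X o o
o o E o X o o X X o X o
o o o o X o o X o o X o
X o o o o o o X o S o o
o o X X X o o o X o o o
"""


def shortest_path_length(maze_text):
    """Steps from S to E through non-wall cells, or None if E is unreachable."""
    grid = [row.split() for row in maze_text.strip().splitlines()]
    start = next((r, c) for r, row in enumerate(grid)
                 for c, cell in enumerate(row) if cell == "S")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if grid[r][c] == "E":
            return steps  # BFS reaches each cell by a shortest route first
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] != "X" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None


print(shortest_path_length(MAZE_1))
print(shortest_path_length(MAZE_2))
```

BFS explores cells in order of distance from S, so the first time it pops E the step count is already minimal, and an exhausted queue means the walls enclose S.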
Coding Puzzle Examples
Given nums = [2, 7, 11, 15], target = 9,
Because nums[0] + nums[1] = 2 + 7 = 9,
return [0, 1].
def twoSum(self, nums, target):
    h = {}  ## maps each value seen so far to its index
    for i, num in enumerate(nums):
        n = target - num
        if n not in h:
            h[num] = i
        else:
            return [h[n], i]
## Store each number's index in a dictionary; for each new number, look up the
difference (target - number) in the dictionary to find the pair summing to target
Input: 22
Output: 2
Explanation:
22 in binary is 0b10110.
## find inner 0s distance
In the binary representation of 22, there are three ones, and
two consecutive pairs of 1’s.
The first consecutive pair of 1's have distance 2.
The second consecutive pair of 1's have distance 1.
The answer is the largest of these two distances, which is 2.
max((len(gap) + 1 for gap in bin(N)[2:].strip('0').split('1')[1:-1]), default=0)
## convert into binary, remove first two characters, strip the outer 0s, split the
result by 1s, drop the empty ends, and take the longest run of 0s plus one
(0 if there are fewer than two 1s)
Coding Puzzle Examples
Find day of week given initial day of week and
days after
def solution(S, K):
    day_dict = {
        0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed',
        4: 'Thu', 5: 'Fri', 6: 'Sat'}
    day_dict2 = {v: k for k, v in day_dict.items()}
    a = K % 7
    return day_dict[(day_dict2[S] + a) % 7]
solution('Mon', 33)
Sat
## hard code the dictionary, reverse the dictionary, solve for remainder of given days
and add to reference, then output day of week
Find the smallest positive number missing from an array
Input: { 2, 3, -7, 6, 8, 1, -10, 15 }
Output: 4
def solution(A):
    m = max(A)
    if m < 1:
        return 1
    if len(A) == 1:
        return 2 if A[0] == 1 else 1
    l = [0] * m
    for i in range(len(A)):
        if A[i] > 0:
            if l[A[i] - 1] != 1:
                l[A[i] - 1] = 1
    for i in range(len(l)):
        if l[i] == 0:
            return i + 1
    return i + 2
## if all are negative, return 1; return 2 or 1 given special case; create a new
array of 0s given the maximum value as length and set it as 1 if the value exists.
Loop through new array until 0 is found and return the counter
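The flag-array solution above allocates a list as long as the largest value in the input. A set-based variant, sketched here as an alternative rather than the author's approach, avoids that and also copes with an empty list; `smallest_missing_positive` is a hypothetical name.

```python
def smallest_missing_positive(values):
    """Smallest integer >= 1 that does not appear in `values`."""
    present = set(values)   # O(1) membership checks
    candidate = 1
    while candidate in present:
        candidate += 1
    return candidate


print(smallest_missing_positive([2, 3, -7, 6, 8, 1, -10, 15]))
```

The loop runs at most len(values) + 1 times, so the work scales with the number of elements rather than with the size of the largest one.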
Once you get the Offer
• Reflect and evaluate them with the keen eye they used to evaluate you
• Did you like your interviewers and team? Is there work life balance? Culture and personality match?
• Were the questions and circumstances fair? Were there any red or yellow flags?
• Do the people and projects there inspire you to do better? Will the team get new hires?
• Did they brush off certain questions with a polished answer or did they expose some weaknesses by
going out of their way to tell you the truth? Was there braggadocio in awards and numbers?
• Are there substantial contributions that you can see yourself making towards company development
and growth? Do you understand the business model and how DS contributes to it?
• Is there opportunity for promotion if you develop enough projects? Leverage this position for exp?
• See that you are able to envision yourself spending at least a year on the datasets given
and projects discussed; sign a year lease if not final destination.
• Negotiate for salary depending on how well you did in the interview process and how
much demand they have for your talent
• Based on market value, taxes, housing, commute, relocation, perks, standard of living
See Slides 38 and 46 of 2018 presentation
Things that are Unfortunately Found Out Later
• Leadership is more concerned or very single-minded about saving money and
shoots down any attempts for the DS team to expand the budget needed for:
• Necessary hires, such as a data engineer, to consolidate all the data and upgrade databases
to not crash or hit above the allocated threshold while finding better database systems
more suitable for the purposes of how the data is queried
• Having them realize that a data architect or dev ops cannot substitute for a data engineer sometimes
• Buying and allocating local servers to save money over cloud servers
• Allocating certain local machines for parallel computing for machine learning purposes
• Leadership copies its entire business model/strategy from a
different company that may not even be in the same domain
• Hiring gaps, layoffs, and unsustainable initial growth metrics for start-ups
• Business strategy and data pipeline changes because of API changes, new
scraping policies, audits, pivots, and cease and desist letters
Work Computers
Specs | Microsoft Surface Book 2 + Docking Station | Dell Precision Workstation T7600
Year Released | 2017 | 2012
Processor Speed | 1.90 GHz | 2.60 GHz
RAM | 16 GB LPDDR3 | 128 GB (16x8 GB) DDR3 ECC RDIMM
Graphics card | NVIDIA GeForce GTX 1060, 6GB | Not included, self install*
Storage | 256 GB SSD | Not included, self install*
CPU | Intel Core i7-8650U quad-core | E5-2600 family 64-bit Xeon, 16 core
Display output | USB-C to HDMI (or 2x mini display ports from dock) | 2x DVI
USB ports | 2x USB 3.0 + 1x USB-C (+ 4x USB 3.0 from dock) | 8x USB 2.0 + 2x USB 3.0
eBay cost | ~$1900 (New) + ~$90 Docking station (New) | ~$920 (Manufacture Refurbished)
Surface Book 2 + docking station: https://www.ebay.com/itm/312250907713, https://www.ebay.com/itm/173753646668
Precision Workstation (Rig): https://www.ebay.com/itm/183161180290, also need hard drive and graphics card*
*May need hardware-literate person to help you install
If you need recommendations for work computers
(Not to scale)
Archetypes at Work
• Bootcamp guy – Certificate but superficially learned everything DS in 12 weeks.
Rests on laurels from undergrad, which is a field change from current job. You
have to carry him to the finish line and actively ask what he does not understand
• Antisocial coders – stay on topic with the task and keep conversations light
• DS who are rigid thinkers – Given that most insights, data merges, and feature
engineering require a certain amount of creativity to add value, it does not
benefit the company if you remain ‘clerical’ in your attempts; use fewer cookbooks
• People who are territorial about their data or projects – the whole department
looks bad if this person is selfish and takes too long because the project gets
canceled or reassigned. We get penalized for wasting time on incompletes
• People who compete for your projects from another department – corporate
culture sees this as ‘healthy competition’ to get the work done so beat them in
time and accuracy and unfortunately sacrifice nights and weekends
Archetypes at Work
• The Head honcho who does not disclose his proprietary company contributions
even for the sake of downstream collaborations – he does this for job security
and to justify his high salary and for no one to take over and understand his work
• Other departments who vote against new hires joining your team –
they see your DS field as a ‘novelty’ job title and do not fully understand
why certain high-paying positions are needed when they can substitute
a close job title from another department or get by without
• The business data analyst who thinks he can do DS – tell him to stop padding his
resume with buzzwords and algorithms he does not fully understand because
everybody thinks they are DS until they realize they are not; go get a master’s
• The political bureaucratic head – uses gambits to receive more benefits and to have
leadership turn a blind eye when they are underperforming
Archetypes at Work
• The phone-a-friend lifeline guy – The guy who passed the coding tests and
interview but actively seeks outside help to do his work because he is underqualified
• The domain experts who refuse new insights – you present new findings to
them but, either because they are getting outclassed or see more work in the
inevitable future, they dismiss your findings and refuse to share them with their team
• The domain experts who think the world of you but simply do not understand
– they give you the time of day and are patient with how you see your findings
affecting their work, but they do not want to fix something that is not broken
• Non-technical leadership people who think they can lead a tech department
meeting and direct methods – imposter syndrome is too strong for them to
realize that their ideas are good but the methodology should be left to us
Archetypes at Work
• Coders with no DS background who think they can do DS – their findings
reported to leadership need to be nullified if proven wrong
• Marketing PR people who leverage DS without knowing how to write up findings –
it looks bad if they unknowingly broadcast ‘correlation is causation’ or other common
statistical misconceptions to the general public on our behalf
• Domain experts and tech allies who help you on the job – Given that work
emails and communications are transitory, get their contact information for later projects
• People who learn unrelated software on the job – ignore; develop yourself
• Leadership who do not understand the budget and hiring needs of the DS
department and also do not understand why models take so long to run – This
is why we need DS leaders to educate and take their positions in the near future
Projects
• It is not a coincidence that Microsoft bought GitHub and LinkedIn and hosts
Azure: not only can they find your code and see your data, they can also see your
messages and hire you. Use different platforms operated by multiple companies
• Take full or more ownership of your projects
• Overdeliver based on what they require from you and add DS insights not considered
• Have resilience on projects and find ways to complete them to increase your project count
• Continue to do Udacity and Udemy courses to better yourself; read research papers
• Complete projects based on your standards and not the (lesser) standards of others
• Develop and grow while not shortchanging yourself; have intrinsic motivation to complete
projects based on your capabilities and do not settle for consolation extrinsic motivation
• Use DS on pet projects outside of work and attend DS social gatherings
• A regular DS day looks like a fifth-semester workload, but you get to go home at 5pm
Generic Project
• To reduce overfitting, do not ensemble; instead run 6 algorithms in parallel and
pick the best regression model by RMSE. Then, in series, sum the daily values
into monthly values and run 6 algorithms plus a ‘no model’ option to find the best
RMSE for monthly estimates
[Diagram: Feature Engr → Daily Level → Find best RMSE and sum from best model → Monthly Level → Find best RMSE and use best model]
Daily Level models: XGBoost Reg, Extra Trees Reg, Random Forest Reg, XGBoost Reg w/GS, Extra Trees Reg w/GS, Random Forest Reg w/GS
Monthly Level models: the same six, plus No Model
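The selection step in this slide can be sketched in a few lines of plain Python. This is only an illustration, not the code from the project: the helper names (`rmse`, `best_by_rmse`, `monthly_sums`), the candidate names, and all numbers are made up, and it assumes that "No Model" means using the summed daily predictions of the best daily model directly as the monthly estimate.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def best_by_rmse(candidates, actual):
    """Return (name, score) of the candidate predictions with the lowest RMSE."""
    scores = {name: rmse(actual, preds) for name, preds in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


def monthly_sums(daily_values, days_per_month):
    """Sum consecutive daily values into monthly totals."""
    totals, start = [], 0
    for n in days_per_month:
        totals.append(sum(daily_values[start:start + n]))
        start += n
    return totals


# Toy held-out data: two "months" of two days each (hypothetical values).
days_per_month = [2, 2]
actual_daily = [10.0, 12.0, 11.0, 13.0]
daily_candidates = {
    "model_a": [10.0, 12.0, 12.0, 13.0],
    "model_b": [9.0, 13.0, 10.0, 14.0],
}

# Daily level: pick the single best model, no ensembling.
best_daily, best_daily_rmse = best_by_rmse(daily_candidates, actual_daily)

# Monthly level: compare monthly models against the summed daily predictions.
actual_monthly = monthly_sums(actual_daily, days_per_month)
monthly_candidates = {
    "no_model": monthly_sums(daily_candidates[best_daily], days_per_month),
    "monthly_model": [21.0, 24.5],  # stand-in for a model fit on monthly data
}
best_monthly, best_monthly_rmse = best_by_rmse(monthly_candidates, actual_monthly)

print(best_daily, round(best_daily_rmse, 3))
print(best_monthly, round(best_monthly_rmse, 3))
```

In a real project the prediction lists would come from fitted regressors scored on held-out data; only the compare-and-pick logic is shown here.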
Multiprocessing
import gc
from glob import glob
from multiprocessing import Pool, current_process

import pandas as pd


def doWork(work):
    # Runs in a worker process; receives one file path per call
    process_id = str(current_process().pid)
    file_ = work
    print(process_id, ':', file_)
    ## code here
    return v  ## flattened one row, built by the code above


if __name__ == '__main__':
    allFiles = glob('C://Users//Yao//Desktop//all//**//*-large.jpg', recursive=True)
    work = list(allFiles)
    # Use at most 8 worker processes, fewer if there are fewer files
    num_procs = 8 if len(work) > 8 else len(work)
    with Pool(num_procs) as p:
        results = p.map(doWork, work)
    # Stack the one-row results into a single table and save it
    df = pd.concat(results, axis=0, ignore_index=True, sort=False).reset_index(drop=True)
    df.to_csv('df.csv', index=False)
Given that a modern workstation can have 16 cores, you can use this
generic multiprocessing code to write a function that pulls 8
different rows or files and runs them simultaneously: very good for
running models simultaneously or downloading files; I taught my
company how to run code this way
PM me on LinkedIn to learn this
www.linkedin.com/in/yaoya0/
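The worker-count choice in the multiprocessing slide can also be made explicit. The sketch below is not from the talk: `pick_num_procs`, `run_in_parallel`, and `square` are hypothetical names, and the cap of 8 simply mirrors the slide. It additionally guards against an empty work list, since `Pool(0)` raises `ValueError`.

```python
import os
from multiprocessing import Pool


def pick_num_procs(n_tasks, max_procs=8, n_cores=None):
    """Return a worker count no larger than the task count, the cap, or the core count."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(n_tasks, max_procs, n_cores))


def square(x):
    # Stand-in for real per-file or per-row work (hypothetical).
    return x * x


def run_in_parallel(func, items, max_procs=8):
    """Map func over items with a process pool sized by pick_num_procs."""
    items = list(items)
    if not items:
        return []  # nothing to do; avoids creating an empty pool
    with Pool(pick_num_procs(len(items), max_procs)) as pool:
        return pool.map(func, items)
```

As in the slide, the call itself, for example `run_in_parallel(square, range(10))`, belongs under an `if __name__ == '__main__':` guard so that worker processes do not re-run it on import.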
Kanopy / Your Job Hunt
• I will be a DS at Kanopy starting 10/28/19 in Irvine, CA
• Kanopy licenses films for library patrons to stream digitally
• I will be working on their recommendation engine for films
• Recommendation systems were my master’s project: https://youtu.be/uavbPKiUg9M
• Think like an opportunist; mass apply to everything DS
• Mass apply to retool, upgrade, and adapt
• Keep your interview pipeline full
• Accept the DS job offer that meets your conditions for interests, career
growth, and the support to be successful
Happy Halloween, Good Luck!

Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 

Lessons after working as a data scientist for 1 year

  • 1. Lessons After Working as a Data Scientist for 1 Year Yao Yao SMU DS Alumni Talks Oct ‘19 Graduated SMU MSDS Aug ’18 Hired as Data Scientist Nov ‘18
  • 2. Data Science is an Expression of Self in Data • Given a dataset and an open-ended problem, every solution would be different: from the way you code, to the libraries and algorithms used, to the different views, to the data merges, to the visualization, to the final solution • It is a culmination of all the experiences you underwent towards understanding and extracting business insights from the dataset • Transfer learning from unrelated disciplines, certificate courses, code references of what is possible, and imagination of possibilities all feed into the solution that is rewardingly you • The creative eye in data scientists is hard to evaluate with live coding puzzles and technical questions; therefore, data challenges and interviews are needed • Even then, some are rigid in their solution path with established cookbooks while others are nonconformist, eclectic thinkers who tailor their solution beyond what the problem and dataset suggested
  • 3. Evaluating Data Scientists • Fundamentally understand the nature and math of how solutions arrive • Some data scientists thrive in a certain domain and can find stability in a niche while others require multiple variations of different datasets and problems to challenge them. Evaluate data scientists holistically instead of by one metric • Data science is sometimes hard to monetize, which makes higher salaries hard to justify • If the same insights are sold to multiple parties, the information is no longer exclusive and eventually becomes public knowledge, which drives the value of the information down. The insights are truly valuable if one party has exclusive access and can use them competitively • To resist insight-value degradation due to economies of scale, find niche ways in DS in which the value increases when there is more participation • DNA testing of ancestry becomes more accurate as more people participate in the dataset • Facebook and Google target ads and recommend news by finding people with tastes similar to yours through cookie tracking, which improves as more people use their systems
  • 4. Evaluating Data Scientists • Unlike software developers and engineers, most of our work is built on the ability to produce insights based on our coding and statistics acumen, algorithm methodology, and creativity to extract the necessary business insights to induce action, build data pipeline products, and/or embed systems in IoT devices • Because we are so many steps removed from building software or a website that directly brings in money, we get paid less and companies sometimes cannot gauge the direct relationship between what we produce and how that ultimately generates money • Find niches, such as FinTech, where the proprietary insights we produce have a more direct route towards generating revenue based on decisions from insights • Build products like recommendation/ad targeting systems where the perceived value of the information distributed does not degrade due to economies of scale or knockoffs bypassing R&D and instead increases once more people participate
  • 5. Data Science has Clever Ways to Reflect Life • Neural networks reflect dendrites in a human brain • CNNs are filters through which a computer perceives images • Generative Adversarial (GAN) learning, which is “sword sharpens sword” • Adversarial: Be challenged, have thought experiments, and cover blind spots • Reinforcement learning, which is “being an apprentice to a mentor” • Become a better version of yourself by learning from established others • Use DS to optimize for life given that life is the ultimate DS problem • Be creative and have fewer inhibitions that prevent you from being successful • Apply DS in rare domains such as mergers and acquisitions and portfolio asset management • Given that life has imperfect information: help people by building products and gaining insights, with the compensation as a byproduct of what you do
  • 6. DS is Relatable, For Better or Worse • DS has so many algorithmic metaphors to explain everything intuitively • It is beneficial for leadership with tech backgrounds to understand intuitively what we do, and explaining it can improve your communication skills to become a manager • However, every layman thinks that they are a DS until they are proven otherwise • Bad when devs with only coding backgrounds try to encroach on DS work by doing PowerBI without intuitively knowing statistics, or using correlation, which is the bottom tier of analysis, for inferring causation, or using improper cross validation techniques and metrics to evaluate “cookbook methods” pulled online • Worse when leadership only values percent change between endpoints for better marketing PR instead of fitting a curve with a slope for better metrics • When you explain things too simply so they would understand, it backfires when leadership does not see the justification for why certain algorithms still take a long time
  • 7. DS Discretion • In industry, a lot of software is held together by “rope and duct tape” • Not all completed code is perfect, but to self-review or become a better coder, use split screen and retype your code into the final version without excessive exploratory functions and with more efficient methodology • Instead of being aspirational by importing all the data and using the most heavy-duty algorithm your computer can run, which takes two weeks+, subsample and prototype your results so that leadership can get a sense of initial results and approve running the full model with all the data • Visualizations always do better to persuade leadership, clients, customers, investors, etc. than the same tangible/numeric results • Improvise, adapt, overcome; be a nonconformist, nonconventional, eclectic thinker because all standard practice and common sense ideas have been done
  • 8. Sequel to The Job Search Interview Offer Letter Experience for Data Science • I work for Viral Launch (5-year start-up): they do keyword and category product discovery for Amazon sellers in Indianapolis, IN • Given Amazon’s diverse dataset, soloed 6+ projects from data orchestration and workflow design to production code for online software release • Dependent on interest level, company culture, the right mentor, relevancy with previous projects/capstone, relative income, career prospect, growth, and the ability to dive deeper into DS • Previous Presentation: video and slides https://www.youtube.com/watch?v=Fiz1Tn7ogP4 • https://www.slideshare.net/YaoYao44/yao-yao-msds-alum-the-job-search-interview-offer-letter-experience
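The contrast in slide 6 between percent change with endpoints and fitting a curve with a slope can be sketched in a few lines. This is a minimal illustration, not the speaker's method: the function names and the monthly numbers are hypothetical, and the fit is an ordinary least-squares line over evenly spaced time steps.

```python
def percent_change(values):
    """Percent change computed from the first and last points only."""
    return (values[-1] - values[0]) / values[0] * 100.0


def least_squares_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical monthly metric: flat for five months, then a spike at the end.
monthly = [100, 101, 99, 100, 100, 130]
print(percent_change(monthly))       # depends only on the two endpoints
print(least_squares_slope(monthly))  # uses all six points
```

The endpoint figure swings with whichever two months are chosen, while the slope changes less when a single point moves because every observation contributes to it.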
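The "subsample and prototype" advice in slide 7 might look like the sketch below. It assumes the data fits in a Python list; the helper name `subsample`, the 1% fraction, and the fixed seed are all illustrative choices rather than anything prescribed in the talk.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset with about `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    k = max(1, round(len(rows) * fraction))
    rng = random.Random(seed)  # fixed seed so the prototype can be rerun
    return rng.sample(rows, k)


# Hypothetical full dataset standing in for the real one.
full_data = list(range(1_000_000))
prototype = subsample(full_data, 0.01)
print(len(prototype))  # 10000
```

Running the expensive model on `prototype` first gives leadership an early result to react to before the full run is approved.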
  • 9. My Timeline • 2012: Bachelor’s in material science engineering • Job experience in engineering but did mostly data analyst work and VBA automation of spreadsheets • Minor DS mentorships and certificates • 2017: Started MSDS program Jan 9th (Spring) • 2018: Started DS job search August 2nd • Still had capstone paper, ML pres, and QTW due • 2018: Graduated MSDS program August 27th (Summer) • 5-semester track; NLP in capstone, ML as elective • 2018: Landed DS offer Nov 8th; Accepted Nov 10th • Took 99 days (~3 months); 74 days since graduation • 2019: Landed next DS offer Oct 6th; Accepted Oct 8th • Took 28 days (1 month) since initial applications Highlights from Previous Presentation (Slide 2/52): Nov 2018
  • 10. Variations in Job Titles Related to MSDS • Analytics Engineer – Supply chain, logistics, financial • Business Intelligence Analyst – Operations; PowerBI; insights • Decision Scientist – Ability to convert DS findings into business decisions • Data Analyst – Mostly SAS, R, SQL; less Python • Data Engineer – Data architects and databases for DS • Data Scientist – Variation in prediction/NLP/ML • Machine Learning Engineer – 2-5 years of previous DS experience; custom applications; specific focus • Predictive Analyst – More stats based; less ML • Quantitative Analyst – Financial technology, time series, stats Last word in job title hints at rigor and salary range Highlights from Previous Presentation (Slide 18/52): Nov 2018
  • 11. Variations in Job Titles Related to MSDS • Applied Scientist – Catch-all for those who have MSDS or PhD in computationally rigorous field like stats, physics, math, industrial operations, CS • NLP Engineer – Mostly NLP: corpus, Stanford NLP, cosine similarity • Computer Vision Engineer – Mostly CV: CNN filters, edge detection, robotics • Data Science, Analyst – Rigorous coding requirements in the DS field: SQL, ML but has sequestered projects due to large company size and many team divisions • Deep Learning Engineer – Mostly Neural Networks/ML • Python ML Developer – Ability to write custom ML algorithms/optimizations from scratch, like ADAMAX, based on specific domain applications • AI Developer – Catch-all; Most likely has imposter syndrome by HR or by person Analyst: insights; Scientist: build/explore; Engineer: master; Developer: Hardcore Highlights from Previous Presentation (Addendum): Nov 2018
  • 12. What Semester to Apply for Jobs? • Streamline the process • Once you learn QTW/NLP/ML and have enough time allocated to finish the capstone project, apply for jobs • Have the confidence, expertise, and credibility to mitigate imposter syndrome when applying and interviewing • Common DS questions: How does a neural network “learn”? How to tune random forest hyperparameters? What is xgboost? How to train Naïve Bayes text classifiers? • Questions can be simple yet unravel an encyclopedia of knowledge to answer; if you know the answers w/o googling, apply now • Otherwise, wait until you have mastered them enough to explain them satisfactorily in your own words • Job applications can be a “full time” endeavor; hard to balance time • Streamline attempts; otherwise, you waste time/effort in retries Highlights from Previous Presentation (Slide 13/52): Nov 2018
  • 13. What Job Title Should I Apply For? • Data Scientist – your MSDS degree • Machine Learning Engineer if ambitious, more years exp • Data Engineer if you have prior software engr exp • You don’t qualify for senior, lead, principal, chief, director DS positions just yet • Full-fledged title, no junior or entry, no contract • May be relevant to previous job field • If field change • Reflect which parts of capstone/projects/topics resonated with you • Don’t sabotage yourself by recycling the same project themes • Apply to the DS job fields that interest you • The future can be non-contingent to the past as long as you plan it • (Ranks to apply to: Intern, Junior, Entry, Contract, Associate, –, PhD, Senior, Staff, Head, Principal, Director, Chief) Highlights from Previous Presentation (Slide 20/52): Nov 2018
  • 14. Techniques to Apply To Jobs Fast • Filter by “data scientist” -senior or “machine learning” -senior and location • -senior means senior is removed from the results, and quotes make it one phrase • Apply every day, filtered by “most recent”, until you see the job you applied to yesterday or the timestamp is “1 day ago” • LinkedIn premium (optional): insights and morale boost from seeing who viewed your profile Highlights from Previous Presentation (Slide 34/52): Nov 2018
  • 15. Techniques to Apply To Jobs Fast • Apply to all available “Easy Apply” regardless of location for interview practice • Do not apply to Senior, Staff, Head, Principal, Director, Chief positions if you do not qualify • Optimize your resume format and LinkedIn profile so that information is imported correctly automatically • Make logins to applications with the same credentials and use autocomplete • For large companies with an external job site, search their job board for more data scientist or machine learning positions and apply to all of them once you make a login profile (Walmart, Amazon, Apple, etc. have 50+ at any given time) • Create Indeed, Dice, Glassdoor, Monster, Yahoo, etc. profiles for quick apply but only apply based on LinkedIn search results to avoid sending duplicate applications or getting phished by fraudulent job listings Highlights from Previous Presentation (Slide 35/52): Nov 2018
  • 16. 2018 DS Job Application Summary • Systematically applied to ~800 DS job postings (solve conversion rate) • Too many phone interviews and HR screeners • Completed 6 data challenges (<1%) • >15 video conference interviews by company (~2%) • Columbus x5, Cincinnati x3, Raleigh x3, Seattle x3, St. Louis x3, Dallas x2, Austin x2, Atlanta x2, Tampa x2, Portland, Indianapolis, New York, Bentonville, Miami, Denver, Hartford, Pittsburgh, San Diego, Singapore, Irvine, D.C., San Francisco • 8 on-site interviews (1%) • 4 local to Chicago • Others in Columbus, Austin, Charlotte, Indianapolis • Show your best self w/o having to alter the resume by industry (80:20 rule) • The time of acceptance (Saturday Nov 10th) • 2 DS job offers (Nov 8th and Nov 9th), both matching (0.25%) • Competitive with the national average for DS (for 0-1 year exp, check Glassdoor) • +1 verbal offer (Nov 9th), adjusted for location (Total 0.375%) • 5 canceled phone interviews; 1 canceled video conference interview Highlights from Previous Presentation (Slide 51/52): Nov 2018
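The percentages in slide 16 follow from dividing each stage count by the roughly 800 applications. A small sketch of that arithmetic is below; the dictionary keys are made up for illustration, and 15 is used as a lower bound for the ">15" video interviews.

```python
applications = 800  # "~800" on the slide, treated as exact here

# Stage counts taken from the slide; labels are illustrative.
funnel = {
    "data challenges": 6,
    "video interviews": 15,   # slide says ">15", so this is a lower bound
    "on-site interviews": 8,
    "written offers": 2,
    "offers incl. verbal": 3,
}


def conversion_rates(counts, total):
    """Percent of all applications that reached each stage."""
    return {stage: 100.0 * n / total for stage, n in counts.items()}


for stage, pct in conversion_rates(funnel, applications).items():
    print(f"{stage}: {pct:.3f}%")
```

This reproduces the slide's figures: 0.75% (<1%), 1.875% (~2%), 1%, 0.25%, and 0.375%.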
  • 17. 2019 DS Job Application Summary Started Applications • Applied to 400+ jobs, mostly in California; PST 3 hours behind EST yet 5+ years ahead in tech • 2 Offers, 1 site interview, 2 coding challenges, 4 timed aptitude coding and multiple choice tests, 3 real-time interface coding challenges • Verbal Offer Oct 4th; Official Offer on Oct 6th; Took 28 days since initial applications
  • 18. Why Change Companies? • Year’s lease for apartment is up; renewal is risky if DS is not prominent nearby • Salary increase, title promotion, new dataset opportunities elsewhere • New DS grads can do current work; running out of challenging projects • Need to develop towards ML/DL/NN/managerial to distinguish self • Previous projects automated or solved – automated your own position • Current dataset cannot solve aspirational problems given restrictions • Company leaders do not give free rein or funding for DS to thrive • Company pivots away from DS and it is no longer the core value/business model • Company does not know how to monetize or see benefits of DS despite completed projects and explanations; DS has no promotions in title or wage
  • 19. Benefits of Changing Companies • Similar to the current SMU football coach, you join different companies carrying out different roles to be more rounded in experience to eventually level up to become CTO • In some cases, a step back is needed to reevaluate, or to claim parental leave/sabbatical benefits for positions with lower stakes • Another 4 bullets on the resume to show versatility • May be quicker to be promoted by job hopping than internal • Optimize location/position while increasing salary/perks • Have more references and opportunities to be inspired by great coworkers, projects, and business opportunities • Ride the wave of upcoming industries, new tech, and IPOs
  • 20. Paths to get Promoted • It is not in your best interest to stay a DS: as master’s programs get better, you know less than the recent graduate over time if you stay in the same position • There is always going to be another coder better than you but it is still in your interest to get better at coding and dive deeper into ML/managerial • Given that DS is new and positions are available, be reliable and go into leadership roles so that more industries can better understand and utilize DS • Technical: • DS > Sr. DS > Staff DS > Lead DS > Principal DS > CDS/CIO/CTO ** • Harder: you have to oversee CS who are ML engineers to enhance/correct their code • Managerial: <- Choose this • DS > Sr. DS > DS Manager > Head DS > Director DS > CPO/CTO ** • Easier: you are just managing your team and become less hands-on with code over time ** Path depends on company and some titles are interchangeable
  • 21. What to Look for in Great Companies • Variety of datasets to prove yourself on multiple projects and not be sequestered with specific project/dataset jurisdictions due to large company size and many team divisions; ability to be autonomous and be trusted for your work • Ability to access/merge datasets without much bureaucracy or clearance • Company leadership understands DS to be in business model/technology • Opportunity to get promoted once you prove yourself with projects and patents • Not be shackled by “golden handcuffs” where everything you do for the company is top secret to the point that you cannot even disclose generically in interviews what you did • Not get promoted too soon such that your experience cannot be easily transferred to another company because of niche, or such that you forfeit certain IPO benefits if you leave early • You can envision yourself working there and can explain to a layman what you are doing for the company and for the betterment of your career
  • 22. What to Watch Out for in Companies • Ownership of first hand data collection • Ownership of data has always been key towards company sustainability • When purchasing data (to keep hands clean), you do not know the credibility or reliability of the data; the distributors can get a cease and desist letter to shut down (if black/grey hat) • APIs change and access can be revoked; free methods suddenly become paid, which alters the bottom line budget, changes established data pipelines, or makes past projects obsolete • DS as a Service is still iffy compared to Software as a Service • DSaS masquerades as a consulting firm for data science automation • You end up being the data scientist on-call for technical troubleshooting for their automation system, helping other data scientists rather than doing data science yourself • Consulting is not ideal for data scientists because firms are more concerned about saving money than going with the optimal solution • Whether DS is treated as a novelty or part of the business model See Slide 33 of 2018 presentation
  • 23. Yellow and Red Flags in the Interview Process • Check Glassdoor for insights from interviews and current employees alike! • HR, who are buying plane tickets and hotels for site interviews, are more concerned about saving money by having layovers and lesser accommodations than making the candidate feel they are wanted by the company • HR or company culture has a superiority complex that they are doing you a favor rather than seeing a mutual benefit once you are hired on to the team • The company’s financials are restructured by a bank and the company is more concerned about paying back the loan or making money for investors with short-run, frantic “flashes in the pan” patches than long-term calculated solutions • You are going to be the 19th DS there and they are basically hiring to “pad” their employee count, diversity metrics, and marketing photoshoots of the team • The existing data science team is already claiming who has seniority/leadership roles over others simply because they were hired first, yet they are not proven to be better data scientists or leaders than the people hired later
  • 24. Yellow and Red Flags in the Interview Process • Your base salary is under market value for your location and you forgo stock options if you were to be terminated prior to one year of employment • There are some “funky” termination clauses/loopholes in your contract that you must sign prior to accepting the job offer • There is animosity among the data science teams yet the company views this as “healthy competition” driven by company culture or peer pressure to get projects delivered faster yet not as accurate/polished • Multiple teams can get assigned the same projects without knowledge of the other teams’ whereabouts, where certain teams can be on the “chopping block” if they fail to consistently deliver those projects faster than the other teams • There are subtle hints during the office tour that the team members do not respect one another, consider themselves mercenaries, and do not showcase any team camaraderie • Historically, team members were restructured due to abandoned projects, layoffs, etc.
  • 25. Yellow and Red Flags in the Interview Process • Somehow, all the data scientists for a company are internally trained to do DS without a master’s program but instead with some company sanctioned ‘lower tier’ courses, where you have to correct for their misconceptions • Correcting for their misconception that collinearity affected random forest • The person who wanted to interview you had an emergency but never came back to reschedule the interview, another person from an elite team had to interview you, and nobody knows the original team that you qualified for • Companies that are hiring DS to “keep up with the Joneses” or are getting a free crash course on what data science is through your interviews • Having HR write notes on VERY technical questions about NN and NLP when it is supposed to be the job of a technically adept person to evaluate you • Companies that claim to be hiring 1500 DSs for the sake of publicity but have 2 DS openings on their job portal and half their hires are 3rd party contracts
  • 26. Yellow and Red Flags in the Interview Process • They claim that they are using anomaly detection but they are using linear regression instead of NN/DL or unsupervised clustering • The whole dashboard of demo data that they have is based off of estimates and barely off of any real numbers • Their CEO is in the news for controversial activity or has articles written about his ego and extravagant flaunting of wealth • Their website is written by a copywriting team to hit every jargon term possible to attract SEO/customers with cheeky one-liners and also mentions “big data” • Evaluate all the yellow and red flags you see
  • 27. Interview Gaffes and More Flags • Having someone they hired 2 weeks ago interview you for the technical interview • Asking some hard questions that they themselves did not have to go through • Instead of one-on-one interviews, having a two-on-one interview session to save time and tag-team hard questions towards the candidate; intimidation factor • Asking if I knew of Andrew Ng’s course but not going further into that question • Having everyone interview you one-by-one but not consulting each other to eliminate redundancies and then having a consortium interview with more redundancies • Asking what the pros and cons are with type 1 and 2 errors but then getting offended, for personal reasons, that “cancer” was brought up as an example where false positives are benign • The DS manager going through my data challenge slowly and redundantly not because he was evaluating my code but because he wanted to learn from me something he did not know how to do • Turning in 2 data challenges and getting neither a scheduled phone call nor a rejection letter
  • 28. Bad Evaluators • Companies can ask you questions and test you in an exorbitant number of ways but they still cannot seem to consistently detect talent using the same metrics • People come from different programs with different levels of difficulty and cannot be easily compared; somewhat like the NFL draft • You cannot measure heart, or how dedicated certain people are over others • Some positions, regardless of coding aptitude, are considered ‘clerical’ with most of the constraints defined while others require more creativity in the problem solving process • Coding in real time during interviews is often treated like an IQ or SAT test • Problems are usually about dictionaries and lists and have no reflection of what you do as a DS • Solve certain puzzles without the use of libraries within 30 minutes; list comprehension • Practice using Leetcode and www.geeksforgeeks.org • Technical questions on stats/NN require you to talk higher-end jargon even though at work you are required to dumb it down for the layman to understand
  • 29. Coding Puzzle Examples
    Find the minimum number of coins for given cents
        def num_coin_return(cents):
            if cents < 1:
                return 0
            coins = 0
            drawer = [25, 10, 5, 1]
            for coin_type in drawer:
                coins += cents // coin_type  # whole coins of this denomination
                cents = cents % coin_type    # remainder left to cover
            return coins

        num_coin_return(32)  # returns 4: 1 quarter, 1 nickel, 2 pennies
    ## find remainder of cents given coin denomination and then add to counter
    Find the length of the shortest path from start S to end E in maze given that X’s are walls and o’s are paths. https://www.cs.bu.edu/teaching/alg/maze/ (recursion, least steps; edge detection)
    Example Input: o o o X X o o o o o o o o o E o X o o X o o X o o o o o X o o X o o X o X o o o o o o X o S o o o o X X X o o o o o o o Output: 11
    Example Input: o o o X X o o o o X o o o o E o X o o X X o X o o o o o X o o X o o X o X o o o o o o X o S o o o o X X X o o o X o o o Output: None
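Slide 29 states the maze puzzle without a solution. One common way to get the least number of steps is breadth-first search, shown here as a swapped-in alternative to the recursion the slide mentions. The sketch assumes moves are up/down/left/right, that the answer counts moves between cells, and that each row is a space-separated string; the function name and the small grids are hypothetical, since the row width of the slide's flattened examples cannot be recovered.

```python
from collections import deque


def shortest_path_length(grid):
    """Fewest 4-directional moves from 'S' to 'E', or None if unreachable."""
    rows = [row.split() for row in grid]
    start = end = None
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell == "S":
                start = (r, c)
            elif cell == "E":
                end = (r, c)
    if start is None or end is None:
        return None

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if (r, c) == end:
            return steps  # BFS reaches each cell by a shortest route first
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(rows) and 0 <= nc < len(rows[nr])
            if inside and rows[nr][nc] != "X" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None


print(shortest_path_length(["S o X", "X o X", "o o E"]))  # 4
print(shortest_path_length(["S X E"]))                    # None
```

If the interviewer counts visited cells rather than moves, the result would be one larger; that convention is worth confirming before coding.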
  • 30. Coding Puzzle Examples
    Given nums = [2, 7, 11, 15], target = 9: because nums[0] + nums[1] = 2 + 7 = 9, return [0, 1].
        def twoSum(nums, target):
            h = {}  # maps each number seen so far to its index
            for i, num in enumerate(nums):
                n = target - num
                if n not in h:
                    h[num] = i
                else:
                    return [h[n], i]
    ## Store each number’s index in a dictionary; for every new number, look up (target - number) in the dictionary to find the pair summing up to target
    Find the largest distance between consecutive 1’s in the binary representation
    Input: 22 Output: 2
    Explanation: 22 in binary is 0b10110. In the binary representation of 22, there are three ones, and two consecutive pairs of 1’s. The first consecutive pair of 1’s have distance 2. The second consecutive pair of 1’s have distance 1. The answer is the largest of these two distances, which is 2.
        ones = [i for i, bit in enumerate(bin(N)[2:]) if bit == '1']
        max((b - a for a, b in zip(ones, ones[1:])), default=0)
    ## convert into binary, remove first two characters, record the positions of the 1s, and then take the largest gap between neighboring positions (0 if there are fewer than two 1s)
  • 31. Coding Puzzle Examples
    Find day of week given initial day of week and days after
        def solution(S, K):
            day_dict = {0: 'Sun', 1: 'Mon', 2: 'Tue', 3: 'Wed', 4: 'Thu', 5: 'Fri', 6: 'Sat'}
            day_dict2 = {v: k for k, v in day_dict.items()}
            a = K % 7
            return day_dict[(day_dict2[S] + a) % 7]

        solution('Mon', 33)  # returns 'Sat'
    ## hard code the dictionary, reverse the dictionary, solve for remainder of given days and add to reference, then output day of week
    Find the smallest positive number missing from an array
    Input: { 2, 3, -7, 6, 8, 1, -10, 15 } Output: 4
        def solution(A):
            m = max(A)
            if m < 1:
                return 1
            if len(A) == 1:
                return 2 if A[0] == 1 else 1
            l = [0] * m
            for i in range(len(A)):
                if A[i] > 0:
                    if l[A[i] - 1] != 1:
                        l[A[i] - 1] = 1
            for i in range(len(l)):
                if l[i] == 0:
                    return i + 1
            return i + 2
    ## if no value is positive, return 1; return 2 or 1 given the single-element special case; create a new array of 0s with the maximum value as length and set an entry to 1 if the value exists. Loop through new array until 0 is found and return the counter
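The marking-array solution in slide 31 allocates a list as long as the largest value, which gets expensive when one value is huge. A set-based variant, offered here only as an alternative sketch (the function name is made up), keeps memory proportional to the number of inputs and also handles an empty list.

```python
def smallest_missing_positive(values):
    """Smallest positive integer that does not appear in `values`."""
    present = {v for v in values if v > 0}  # negatives and zero are irrelevant
    candidate = 1
    while candidate in present:
        candidate += 1
    return candidate


print(smallest_missing_positive([2, 3, -7, 6, 8, 1, -10, 15]))  # 4
```

The loop runs at most `len(present) + 1` times, so the whole function is linear in the input size on average.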
  • 32. Once you get the Offer • Reflect and evaluate them with the keen eye they used to evaluate you • Did you like your interviewers and team? Is there work life balance? Culture and personality match? • Were the questions and circumstances fair? Were there any red or yellow flags? • Do the people and projects there inspire you to do better? Will the team get new hires? • Did they brush off certain questions with a polished answer or did they expose some weaknesses by going out of their way to tell you the truth? Was there braggadocio in awards and numbers? • Are there substantial contributions that you can see yourself making towards company development and growth? Do you understand the business model and how DS contributes to it? • Is there opportunity for promotion if you develop enough projects? Leverage this position for exp? • See that you are able to envision yourself spending at least a year on the datasets given and projects discussed; sign a year lease if not final destination. • Negotiate for salary depending on how well you did in the interview process and how much demand they have for your talent • Based on market value, taxes, housing, commute, relocation, perks, standard of living See Slide 38 and 46 of 2018 presentation
  • 33. Things that are Unfortunately Found Out Later • Leadership is more concerned with, or very single-minded about, saving money and shoots down any attempt by the DS team to expand the budget needed for: • Necessary hires, such as a data engineer, to consolidate all the data and upgrade databases so they do not crash or exceed the allocated threshold, while finding database systems better suited to how the data is queried • Getting them to realize that a data architect or DevOps engineer sometimes cannot substitute for a data engineer • Buying and allocating local servers to save money over cloud servers • Allocating certain local machines for parallel computing for machine learning purposes • Leadership copies its entire business model/strategy from a different company that may not even be in the same domain • Hiring gaps, layoffs, and unsustainable initial growth metrics for startups • Business strategy and data pipeline changes because of API changes, new scraping policies, audits, pivots, and cease and desist letters
  • 34. Work Computers: If you need recommendations for work computers
    Specs | Microsoft Surface Book 2 + Docking Station | Dell Precision Workstation T7600
    Year Released | 2017 | 2012
    Processor Speed | 1.90 GHz | 2.60 GHz
    RAM | 16 GB LPDDR3 | 128 GB (16x8 GB) DDR3 ECC RDIMM
    Graphics card | NVIDIA GeForce GTX 1060, 6 GB | Not included, self install*
    Storage | 256 GB SSD | Not included, self install*
    CPU | Intel Core i7-8650U quad-core | E5-2600 family 64-bit Xeon, 16 core
    Display output | USB-C to HDMI (or 2x mini display ports from dock) | 2x DVI
    USB ports | 2x USB 3.0 + 1x USB-C (+ 4x USB 3.0 from dock) | 8x USB 2.0 + 2x USB 3.0
    eBay cost | ~$1900 (New) + ~$90 Docking station (New) | ~$920 (Manufacturer Refurbished)
    Surface Book 2 + docking station: https://www.ebay.com/itm/312250907713, https://www.ebay.com/itm/173753646668
    Precision Workstation (Rig): https://www.ebay.com/itm/183161180290, also need hard drive and graphics card*
    *May need hardware-literate person to help you install
    (Not to scale)
  • 35. Archetypes at Work • Bootcamp guy – Has a certificate but superficially learned everything DS in 12 weeks. Rests on laurels from undergrad, which was in a different field from the current job. You have to carry him to the finish line and actively ask what he does not understand • Antisocial coders – stay on topic with the task and keep conversations light • DS who are rigid thinkers – Given that most insights, data merges, and feature engineering need a certain amount of creativity to have value, it does not benefit the company if you remain ‘clerical’ in your attempts; fewer cookbooks • People who are territorial about their data or projects – the whole department looks bad if this person is selfish and takes too long, because the project gets canceled or reassigned. We get penalized for wasting time on incomplete projects • People who compete for your projects from another department – corporate culture sees this as ‘healthy competition’ to get the work done, so beat them in time and accuracy and, unfortunately, sacrifice nights and weekends
  • 36. Archetypes at Work • The head honcho who does not disclose his proprietary company contributions even for the sake of downstream collaborations – he does this for job security, to justify his high salary, and so that no one can understand and take over his work • Other departments who vote against new hires joining your team – they see your DS field as a ‘novelty’ job title and do not fully understand why certain high-paying positions are needed when they can substitute a close job title from another department or sustain themselves without one • The business data analyst who thinks he can do DS – tell him to stop padding his resume with buzzwords and algorithms he does not fully understand, because everybody thinks they are DS until they realize they are not; go get a master’s • The political bureaucratic head – uses gambits to receive more benefits and blind eyes from leadership while underperforming
  • 37. Archetypes at Work • The phone-a-friend lifeline guy – The guy who passed the coding tests and interview but actively seeks outside help to do his work because he is underqualified • The domain experts who refuse new insights – you present new findings to them but, either because they are getting outclassed or because they see more work in the inevitable future, they dismiss your findings and refuse to share them with their team • The domain experts who think the world of you but simply do not understand – they give you the time of day and are patient with how you see your findings affecting their work, but they do not want to fix something that is not broken • Non-technical leadership people who think they can lead a tech department meeting and direct methods – imposter syndrome is too strong for them to realize that their ideas are good but that they should leave the methodology to us
  • 38. Archetypes at Work • Coders with no DS background who think they can do DS – their findings reported to leadership need to be nullified if proven wrong • Marketing PR people who leverage DS without knowing how to write up findings – it looks bad if they unknowingly broadcast to the general public, on our behalf, that correlation is causation or other common statistical misconceptions • Domain experts and tech allies who help you on the job – Given that work emails and communications are transitory, get their contact information for later projects • People who learn unrelated software on the job – ignore; develop yourself • Leadership who do not understand the budget and hiring needs of the DS department and also do not understand why models take so long to run – This is why we need DS leaders to educate and take their positions in the near future
  • 39. Projects • It is not a coincidence that Microsoft bought GitHub and LinkedIn and hosts Azure: not only can they find your code and see your data, they can see your messages and hire you. Use different platforms operated by multiple companies • Take full or more ownership of your projects • Overdeliver on what they require from you and add DS insights not considered • Have resilience on projects and find methods to completion to increase project count • Continue to do Udacity and Udemy courses to better yourself; read research papers • Complete projects based on your standards and not the (lesser) standards of others • Develop and grow while not shortchanging yourself; have intrinsic motivation to complete projects based on your capabilities and do not settle for consolation extrinsic motivation • Use DS on pet projects outside of work and attend DS social gatherings • A regular DS day looks like a 5th semester workload but you get to go home at 5pm
  • 40. Generic Project • To reduce overfitting, do not ensemble; instead run 6 algorithms in parallel and pick the best model by RMSE for regression. Then, in series, sum the daily values into monthly values and run the 6 algorithms plus a no-model option to find the best RMSE for the monthly estimates
    Pipeline diagram: Feature Engr -> Daily Level (XGBoost Reg, Extra Trees Reg, Random Forest Reg, each with and without GS) -> Find best RMSE and Sum from best model -> Monthly Level (the same 6 algorithms plus No Model) -> Find best RMSE and Use best model
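The two-stage selection on this slide can be sketched in plain Python. This is only an illustration of the selection logic under stated assumptions: the predictions are made-up toy numbers standing in for the outputs of the slide's XGBoost, Extra Trees, and Random Forest regressors, the names `rmse`, `pick_best`, `to_monthly`, `model_a`, `model_b`, and `monthly_model_a` are hypothetical, and "no model" is assumed to mean using the summed daily predictions directly as the monthly estimate.

```python
import math
from collections import defaultdict


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pick_best(candidates, actual):
    """Return (name, predictions, score) for the lowest-RMSE candidate."""
    scored = [(name, preds, rmse(actual, preds)) for name, preds in candidates.items()]
    return min(scored, key=lambda item: item[2])


def to_monthly(months, values):
    """Sum daily values into one total per month label."""
    totals = defaultdict(float)
    for month, value in zip(months, values):
        totals[month] += value
    return dict(totals)


# Toy held-out data (assumption): one month label per day.
months = ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"]
actual_daily = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]

# Stage 1: compare daily-level candidates and keep the best one.
daily_candidates = {
    "model_a": [9.0, 12.0, 12.0, 19.0, 23.0, 21.0],
    "model_b": [14.0, 8.0, 15.0, 25.0, 17.0, 26.0],
}
best_daily_name, best_daily_preds, best_daily_rmse = pick_best(daily_candidates, actual_daily)

# Stage 2: sum to monthly, then compare monthly candidates against "no model".
actual_monthly = to_monthly(months, actual_daily)
summed_best = to_monthly(months, best_daily_preds)
month_order = list(actual_monthly)
monthly_candidates = {
    "no_model": [summed_best[m] for m in month_order],
    "monthly_model_a": [30.0, 66.0],
}
best_monthly_name, _, best_monthly_rmse = pick_best(
    monthly_candidates, [actual_monthly[m] for m in month_order]
)

print("best daily model:", best_daily_name)
print("best monthly option:", best_monthly_name)
```

In a real project each candidate's predictions would come from a fitted regressor evaluated on held-out data; the selection step itself stays the same.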
  • 41. Multiprocessing
        import gc
        from glob import glob
        from multiprocessing import Pool, current_process
        import pandas as pd

        def doWork(work):
            process_id = str(current_process().pid)
            file_ = work
            print(process_id, ':', file_)
            ## code here
            ## placeholder (assumption): replace with the real per-file result
            v = pd.DataFrame({'file': [file_]})
            return v  ## flattened one row

        if __name__ == '__main__':
            allFiles = glob('C://Users//Yao//Desktop//all//**//*-large.jpg', recursive=True)
            work = list(allFiles)  ## one work item per file
            num_procs = min(8, len(work))  ## at most 8 worker processes
            with Pool(num_procs) as p:
                results = p.map(doWork, work)
            df = pd.concat(results, axis=0, ignore_index=True, sort=False).reset_index(drop=True)
            df.to_csv('df.csv', index=False)
    Given that a workstation may have 16 cores, you can use this generic multiprocessing code to write a function that pulls 8 different rows or files and runs them simultaneously: very good for running models simultaneously or downloading files; teach my company how to run code
    PM me on LinkedIn to learn this: www.linkedin.com/in/yaoya0/
  • 42. Kanopy / Your Job Hunt • I will be a DS at Kanopy starting 10/28/19 in Irvine, CA • Kanopy licenses films for library patrons to stream digitally • I will be working on their recommendation engine for films • Recommendation systems were the topic of my master’s project: https://youtu.be/uavbPKiUg9M • Think like an opportunist; mass apply to everything DS • Mass apply to retool, upgrade, and adapt • Keep your interview pipeline full • Accept the DS job offer in which conditions are met for your interests, career growth, and the support to be successful Happy Halloween, Good Luck!