So you want to be a Data Scientist?
1. So You Want To Be A Data Scientist?
What It Means To Be A Data Scientist
2. About:Me
Mohd Izhar Firdaus Ismail
- Current: Solution Architect @ ABYRES Enterprise
Technologies Sdn Bhd
- Open Source Activist & (self-proclaimed) Hacker, Open Data
Advocate, Fedora Ambassador, Data Architect, Data Engineer,
Consultant, Python Programmer, Analyst, Trainer, and a bunch of
other hats ;-)
- Contributing to Open Source projects for over 8 years
- Over 6 years building systems related to data, content,
information and knowledge management
- http://linkedin.com/in/kagesenshi
- izhar@abyres.net / kagesenshi.87@gmail.com
3. The People I Work For
● Open Source Technology
Company
– Specializes in Cloud, Big Data &
Enterprise Application
Development
– Red Hat & Hortonworks Partner
● IT Consulting & Professional
Services around Open Source
Software
– Design, development,
implementation and training
services
– Consulting practice around
leveraging Open Source
technologies and implementing
Big Data projects
● The largest organized mafia of
pure play open source geeks in
Malaysia ;-)
4. Before I Start
Some people call me a data scientist,
But I don't consider myself one (yet)
(( it's a personal integrity thing – Machine Learning & Stats are not (yet) my strong points ))
But I do work quite a bit with data: designing applications,
infrastructure, algorithms, processes and pipelines for big data
workloads – from data acquisition to visualization
6. "Data scientists are involved with gathering data,
massaging it into a tractable form, making it tell its
story, and presenting that story to others."
- Mike Loukides, VP, O’Reilly Media.
"A data scientist is someone who can obtain, scrub,
explore, model and interpret data, blending hacking,
statistics and machine learning. Data scientists not only are
adept at working with data, but appreciate data itself as a
first-class product."
- Hilary Mason, Data Scientist, Accel, Scientist
Emeritus, bitly, co-founder, HackNY.
12. Domain Knowledge & Soft Skills
● Knowledge to find what matters
– Knowing the statistics does not mean knowing
how significant the results are to a
business
– Business rules, terminology, problem solving
techniques, scientific theories & formulas
– Identifying actionable information
● Problem solving & Hacker mindset
– New & creative ways to find, acquire,
transform, manipulate, mash up, and use
data
– Possibly unconventional uses of the same
result
– Knowing what data is needed, and how to get
it, to solve a particular business problem
13. Math & Statistics
● People use your output for
decision making – wrong numbers
can lead to bad decisions
– Lies, damned lies, and statistics
● Machine Learning
– Predict future values
– Analyze patterns in structured and
unstructured data
– Automated decision support
systems
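As a rough illustration of "predict future values", here is a minimal sketch of the simplest possible predictive model: a straight line fitted by ordinary least squares. The sales figures, the names `fit_line` and `predict`, and the choice of a linear model are all assumptions made for this example; real forecasting usually needs more data and a more careful model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a new x value."""
    return slope * x + intercept


# Hypothetical monthly sales (made-up numbers, for illustration only).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = predict(6, slope, intercept)  # extrapolate one month ahead
print(f"trend: {slope:.1f} per month, forecast for month 6: {forecast:.1f}")
```

The made-up data lies on a perfect line, so the fit is exact here; with real data the same code gives the best straight-line approximation, and a wrong model would give confidently wrong numbers.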
14. Programming & Database
● Programming
– Calculating a few thousand rows in Excel might be
okay, but dealing with distributed processing needs
some skill
● Querying distributed data – you don't want a query that
is stuck on a single core of a cluster with hundreds of nodes
– Simple visualizations can be done with drag-and-drop
builders; complex visualizations will require you to get
your hands dirty
– Advanced decision system capabilities can only be
implemented through some sort of rule programming
– Developing batch and stream data pipelines
– Developing data collection, scraping, machine learning &
artificial intelligence software
● Database
– Ingesting data from various types of sources,
managing data formats, data storage, governance
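The point about not leaving a query stuck on a single core can be sketched with a tiny map/reduce-style aggregation. This is only a stand-in: threads on one machine play the role of cluster nodes, and `partial_stats`, `combine` and `distributed_mean` are hypothetical names for this example, not the API of any Big Data framework.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(partition):
    """Map step: each worker summarizes only its own partition."""
    return sum(partition), len(partition)


def combine(partials):
    """Reduce step: merge the small per-partition summaries."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def distributed_mean(values, n_partitions=4):
    # Split the data round-robin; on a real cluster the data would
    # already live in partitions spread across nodes.
    partitions = [values[i::n_partitions] for i in range(n_partitions)]
    partitions = [p for p in partitions if p]  # drop empty partitions
    # Threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_stats, partitions))
    return combine(partials)


readings = list(range(1, 101))
mean_value = distributed_mean(readings)
print(mean_value)
```

The design choice worth noticing is that only the tiny `(sum, count)` pairs travel to the combining step, never the raw rows, which is what lets this style of computation spread across many workers.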
15. Communication & Visualization
● Spreading information and discoveries
– Presenting data in a form that non-scientists
can understand
– Knowing how to explain to business users
why a result matters and how it can be
used to benefit the business,
organization, society
● Identifying patterns through visual
analysis
– Some insights might not be obvious when
presented in columns and rows
– Knowing how to visualize information so
as to make hidden patterns more obvious
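To make the "hidden patterns" point concrete, here is a minimal sketch that turns a column of numbers into a text histogram. The customer ages are invented for the example and `text_histogram` is a hypothetical helper; a real project would more likely use a plotting library.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line per bin, with one '#' per observation."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        bar = "#" * counts.get(start, 0)
        lines.append(f"{start:>3}-{start + bin_width - 1:<3} {bar}")
    return lines


# Hypothetical customer ages (made-up data).
ages = [21, 22, 23, 24, 25, 26, 41, 58, 59, 61, 62, 63, 64]
chart = text_histogram(ages, bin_width=10)
print("\n".join(chart))
```

Read as a column, these ages look unremarkable and average out to about 42. Drawn as bars, they show two separate groups, with almost nobody near the average.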
19. The Key Differences
● Data Science
– Problem solving through
strategies around data
– Hindsight, Insight,
Foresight
– Understanding of patterns,
behaviors, etc
– Automated Data Driven
Decision Making
● Data Engineering
– Ingestion pipelines
– Data integration
– Data enrichment
– Data cleansing
– Data preparation
– Data pipeline
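The data engineering steps listed above can be sketched as a chain of small generator functions. Everything here is illustrative: the CSV content, the `COUNTRY_NAMES` lookup table and the function names are assumptions for this example, and a production pipeline would add logging, schema checks and error reporting.

```python
import csv
import io

# Hypothetical raw input with typical problems: inconsistent case,
# stray whitespace, a missing amount and a non-numeric amount.
RAW = """customer,country,amount
alice,my,120.50
BOB ,MY,80
carol,sg,
dave,SG,not-a-number
"""

# Hypothetical lookup table used for enrichment.
COUNTRY_NAMES = {"MY": "Malaysia", "SG": "Singapore"}


def ingest(text):
    """Ingestion: parse raw CSV text into dict records."""
    yield from csv.DictReader(io.StringIO(text))


def cleanse(records):
    """Cleansing: normalize fields, drop rows with unusable amounts."""
    for rec in records:
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            continue  # skip rows whose amount cannot be parsed
        yield {
            "customer": rec["customer"].strip().lower(),
            "country": rec["country"].strip().upper(),
            "amount": amount,
        }


def enrich(records):
    """Enrichment: attach a readable country name to each record."""
    for rec in records:
        name = COUNTRY_NAMES.get(rec["country"], "Unknown")
        yield {**rec, "country_name": name}


prepared = list(enrich(cleanse(ingest(RAW))))
for row in prepared:
    print(row)
```

Because each stage is a generator, records flow through one at a time, so the same structure works for inputs that do not fit in memory.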
21. Hadoop is for Big Data
● Core of "Big Data"
– Techniques, technologies &
strategies, to handle ingestion,
storage, and processing of high
velocity, high volume, high
variety datasets
– Historical data, and not just
current state
– Transaction + interaction +
observation = Big Data
23. Data Science Needs Big Data
"The reaction of one man could be forecast by no known mathematics;
the reaction of a billion is something else again"
– Asimov
● Without rich historical data, analysis and development become
more challenging
– Patterns start to show themselves in rich historical data
– Models that are accurate on small data might start to fall apart when
more parameters/data are introduced
● Start collecting data today! You never know when you will need it,
and when you do, the historical data is there for you to mine
25. Attn.
● Courses, training, documents, tools, etc. will definitely
help you to establish your foundations and basics in
data science
– but, like any technical field, what is important is your ability to
mash everything up and apply it to solve problems
● Anybody can learn how to draw, anybody can draw, but
not everybody can be an artist.
26. Domain & Business
● Learn more about your industry (or your target industry)
● Learn what makes it tick, which numbers matter, and
what scientific knowledge surrounds the domain
● Businesses exist for the key purpose of making profit,
which usually translates to: increase sales & reduce
cost
– Find out how to help your organization's business by collecting
data and analyzing it to produce visualizations that will help the
organization make more profit
27. Math & Statistics
● Find that old textbook you had from university, and
study it again ;-)
● Learn, understand and start to apply how statistics can
be used for estimation and prediction.
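As one small example of statistics used for estimation, the sketch below estimates a mean and attaches an approximate 95% confidence interval using only Python's standard library. The sample values are invented, `mean_with_interval` is a hypothetical helper, and the z-value of 1.96 is a normal approximation; for a sample this small a t-distribution would be the more appropriate choice.

```python
import math
import statistics


def mean_with_interval(sample, z=1.96):
    """Estimate the mean with an approximate 95% confidence interval.

    Assumes the normal approximation (z = 1.96), which is only
    reasonable for larger samples.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


# Hypothetical measurements (made-up numbers).
sample = [48, 52, 50, 47, 53]
mean, low, high = mean_with_interval(sample)
print(f"estimated mean: {mean}, interval: {low:.2f} to {high:.2f}")
```

Reporting the interval alongside the estimate tells decision makers how much to trust the number, not just what the number is.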
28. Programming & Information System
● If you don't know programming yet, start by picking up a language
– I suggest Python as it has a strong background in scientific computing
communities, and was designed by a mathematician – Guido van Rossum
– Though I'm a biased parseltongue :P
– Books:
● Packt's Practical Data Analysis
● How to Think Like a Computer Scientist
● SQL is important
– Pretty much the most mature method for declaring data queries
● Pick up Big Data technologies to help you handle massive datasets
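To show what "declaring data queries" looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table and its rows are made up for illustration; the same `GROUP BY` query style carries over to most SQL engines, including the SQL layers found on Big Data platforms.

```python
import sqlite3

# In-memory database, so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "widget", 100.0),
        ("north", "gadget", 50.0),
        ("south", "widget", 70.0),
        ("south", "widget", 30.0),
    ],
)

# Declare WHAT you want (total per region, largest first);
# the database engine decides HOW to compute it.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for region, total in totals:
    print(region, total)
```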