
Course 8 : How to start your big data project by Eric Rodriguez


For more info about our Big Data courses, check out our website ➡️ https://www.betacowork.com/big-data/
---------
"Data is the new oil" - Many companies and professionals do not know how to use their data or are not aware of the added value they could gain from it.

It is in response to these problems that the project “Brussels: The Beating Heart of Big Data” was born.

This project, financed by the Region of Brussels Capital and organised by Betacowork, offers 3 training cycles of 10 courses on big data, at both beginner and advanced levels. These 3 cycles will be followed by a Hackathon weekend.

No prerequisites are required to start these courses. The aim of these courses is to familiarize participants with the principles of Big Data.

  1. 1) BIG DATA FAILURE: Why most Big Data projects fail
     2) METHODOLOGY & LIFECYCLE: How to approach Big Data projects? Key steps to a successful project.
     3) DOs & DON'Ts: Typical pitfalls and some tips to make it work ;)
     4) BUILD YOUR BIG DATA STACK: Elements to build your Big Data environment
     5) STEP-BY-STEP EXAMPLE: Use case: your first Big Data project
  2. Stop thinking about experiments and get back to identifying classic business problems and using data to find solutions.
     Companies focus on collecting data, but they are not able to answer questions from the beginning.
     Before thinking about technology, we must be clear about what our business needs are. We must start by asking the “why” and then move on to the “how”.
     Also, data is not being valued as a strategic asset of the company.
  3. The Data Lake Fallacy: All Water and Little Substance
  4. Start with a small dataset, become familiar with the business need, and then go into production to get value out.
     Big Data workloads tend to be bursty, making it difficult to allocate capacity for resources.
     Many companies fail to take into account how quickly a big data project can grow and evolve.
     To achieve scalability, you need to build your application in a certain way, and therefore to understand how the technology scales.
  5. PROCESSING TIME; HARD TO TEST LARGE SYSTEMS; TECHNOLOGY CAN/WILL FAIL
  6. Challenging and fast-evolving tools.
     57% of organizations cite the skill gap as a major inhibitor to Hadoop adoption.
     Businesses need data experts with domain knowledge and people skills.
     Currently, it is difficult to hire good data analysts, since they are expensive and scarce.
     Many Big Data vendors seek to overcome this challenge by providing educational resources or by automating more of the platform management.
  8. Specific challenges include:
     ✓ User authentication for every team and team member accessing the data
     ✓ Restricting access based on a user’s need
     ✓ Recording data access histories and meeting other compliance regulations
     ✓ Proper use of encryption on data in transit and at rest
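Two of the challenges above, need-based access and recorded access histories, can be illustrated with a minimal sketch. The role names, dataset names, and the `PERMISSIONS` table are hypothetical, and authentication and encryption are deliberately left out; a real deployment would rely on the platform's own security features.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role-to-dataset permissions, invented for illustration.
PERMISSIONS = {
    "analyst": {"sales", "web_logs"},
    "support": {"tickets"},
}


@dataclass
class AuditLog:
    """Keeps an in-memory history of every access attempt."""
    entries: list = field(default_factory=list)

    def record(self, user, dataset, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })


def can_read(user, role, dataset, log):
    """Grant access only if the role needs the dataset; log the attempt either way."""
    allowed = dataset in PERMISSIONS.get(role, set())
    log.record(user, dataset, allowed)
    return allowed


log = AuditLog()
granted = can_read("alice", "analyst", "sales", log)
denied = can_read("bob", "support", "sales", log)
```

Logging denied attempts as well as granted ones is a design choice here: the access history then shows who tried to reach data they do not need.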
  10. 1) UNDERSTAND INDUSTRY POINT-OF-VIEW ON BIG DATA
      2) EVALUATE CURRENT TOOLS AND TECHNOLOGY
      3) IDENTIFY BUSINESS CASE PROOF OF CONCEPT (POC)
      4) DEVELOP BIG DATA IMPLEMENTATION FRAMEWORK & PROCESS STEPS
      5) FINALIZE ARCHITECTURE FOR POC/PILOT PROJECT
      6) CAPTURE BUSINESS MEASURES OF SUCCESSFUL POCS
      7) ENVISION BIG DATA ROADMAP
  11. 1) INGEST the data sources to allow ease of exploration
      2) INDEX content to make it accessible and queryable
      3) INTEGRATE and LINK data elements
      4) INVESTIGATE by exploring through data models
      5) Discover INSIGHT: Minimum Viable Insight (MVI), the minimum hurdle that validates a new approach to problem solving by delivering new insight
      6) INVEST discovered insights by implementing and deploying into the organization
      7) ITERATE
  17. DETERMINE DELIVERABLES (THE OUTPUTS OF THE PROJECT)
      EXAMINE THE OVERALL SCOPE OF THE WORK
      IDENTIFY THE KEY BUSINESS OBJECTIVES
      TYPICAL QUESTIONS
      1. How much or how many? (regression)
      2. Which category? (classification)
      3. Which group? (clustering)
      4. Is this weird? (anomaly detection)
      5. Which option should be taken? (recommendation)
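As an illustration of one of the typical questions, "Is this weird?" can be approached with a very simple z-score rule. This is a minimal sketch, not a method from the slides: the `daily_orders` figures and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def flag_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical daily order counts with one unusual day.
daily_orders = [10, 11, 9, 10, 12, 10, 50]
weird = flag_anomalies(daily_orders)
```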
  18. GATHER AND SCRAPE THE NECESSARY DATA FOR YOUR PROJECT
      Here are a few ways to get yourself some data:
      Connect to a database: Get the data that is available, or open up your private database, start digging through it, and understand what information your company has been collecting.
      Use APIs: Think of the APIs of all the tools your company has been using and the data they have been collecting. You have to get these set up so you can use the email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support tickets somebody submitted, etc.
      Look for open data: The Internet is full of datasets to enrich what you have with extra information. Census data will help you add the average revenue for the district where your user lives, and OpenStreetMap can show you how many coffee shops are on their street.
      Use more APIs: Another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends.
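Connecting to a database and taking a first look at what has been collected might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database so that it runs anywhere; the `tickets` table and its rows are invented for illustration, and the helper names are hypothetical.

```python
import sqlite3

# In-memory database standing in for a company database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, customer TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (customer, status) VALUES (?, ?)",
    [("acme", "open"), ("acme", "closed"), ("globex", "open")],
)


def list_tables(connection):
    """First step of digging: find out which tables exist."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_by(connection, table, column):
    """Count rows per distinct value of a column."""
    # Identifiers cannot be bound as SQL parameters, so only names that
    # really exist in the schema are accepted before building the query.
    if table not in list_tables(connection):
        raise ValueError(f"unknown table: {table}")
    columns = [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]
    if column not in columns:
        raise ValueError(f"unknown column: {column}")
    query = f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}"
    return dict(connection.execute(query).fetchall())


tables = list_tables(conn)
status_counts = count_by(conn, "tickets", "status")
```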
  19. Fix the inconsistencies and handle the missing values.
      Start digging and try to link everything together to answer your original goal.
      Analyze and ask questions to business people or IT, to understand what all your variables mean.
  20. ⚠️ Warning: This is probably the longest, most annoying step of your data project. Data scientists report that data cleaning takes about 80% of the time spent on a project.
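Fixing inconsistencies and handling missing values can be sketched in a few lines. The records below are invented, and filling missing rents with the median is only one possible choice (dropping the rows or using a model-based estimate are alternatives).

```python
from statistics import median

# Hypothetical raw records with inconsistent spelling and a missing value.
raw_rows = [
    {"city": " Brussels ", "rent": "950"},
    {"city": "brussels", "rent": ""},
    {"city": "BRUSSELS", "rent": "1,250"},
    {"city": "Ghent", "rent": "800"},
]


def parse_rent(text):
    """Turn '1,250' into 1250.0 and an empty string into None."""
    text = text.replace(",", "").strip()
    return float(text) if text else None


def clean(rows):
    """Normalise city names, then fill missing rents with the median of known rents."""
    cleaned = [
        {"city": row["city"].strip().title(), "rent": parse_rent(row["rent"])}
        for row in rows
    ]
    known = [row["rent"] for row in cleaned if row["rent"] is not None]
    fill_value = median(known)
    for row in cleaned:
        if row["rent"] is None:
            row["rent"] = fill_value
    return cleaned


cleaned_rows = clean(raw_rows)
```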
  21. Data exploration is typically conducted using a combination of automated and manual activities.
      Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through …
      http://adilyalcin.me/
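The automated part of data exploration often starts with summary statistics and a rough picture of the distribution. The sketch below uses only the standard library; the `values` list is made up, and the text histogram is a stand-in for the charts a visual exploration tool would draw.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical measurements with one value far from the rest.
values = [12, 15, 14, 13, 15, 90, 14, 16]


def summarize(xs):
    """Basic descriptive statistics for a numeric column."""
    return {
        "count": len(xs),
        "min": min(xs),
        "max": max(xs),
        "mean": mean(xs),
        "median": median(xs),
        "stdev": stdev(xs),
    }


def text_histogram(xs, bin_width):
    """One line per non-empty bin, with one '#' per value in that bin."""
    bins = Counter((x // bin_width) * bin_width for x in xs)
    return [
        f"{start:>4}-{start + bin_width - 1:<4} {'#' * bins[start]}"
        for start in sorted(bins)
    ]


summary = summarize(values)
histogram = text_histogram(values, 10)
```

Comparing the mean with the median, or simply printing the histogram lines, makes the isolated high value visible before any modelling starts.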
  22. http://www.jeannjoroge.com/significance-of-exploratory-data-anaysis/
  23. Select important features and construct more meaningful ones using the raw data that you have.
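Constructing more meaningful features from raw fields can be as simple as the sketch below. The record, its field names, and the chosen features are all hypothetical; which features are actually useful depends on the business question.

```python
from datetime import date

# Hypothetical raw record for one listing.
raw = {
    "rent": 1200.0,
    "area_m2": 60.0,
    "built": 1985,
    "listed_on": date(2018, 5, 12),
    "description": "Bright flat with balcony near the park",
}


def build_features(record, reference_year=2018):
    """Derive ratio, age, calendar and text-based features from raw fields."""
    return {
        "rent_per_m2": record["rent"] / record["area_m2"],
        "building_age": reference_year - record["built"],
        # weekday() is 5 for Saturday and 6 for Sunday
        "listed_on_weekend": record["listed_on"].weekday() >= 5,
        "has_balcony": "balcony" in record["description"].lower(),
    }


features = build_features(raw)
```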
  24. By working with clustering algorithms (unsupervised), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express which feature is decisive in these results.
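To make the idea of clustering concrete, here is a minimal k-means sketch in plain Python. The slides do not name a specific algorithm, so k-means is an assumption, as are the sample points and the starting centroids; in practice a library implementation would normally be used.

```python
from math import dist


def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and re-centering them."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        centroids = [
            # New centroid is the per-coordinate mean; keep the old one if empty.
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    labels = [nearest_index(point, centroids) for point in points]
    return centroids, labels


# Hypothetical two-feature observations forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, centroids=[(1, 1), (8, 8)])
```

The resulting centroids show which feature values characterise each group, which is one way the clusters "express" what drives the grouping.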
  26. Remember:
      Not all data is clean or usable
      Understand the computational limits
      Don't obsess over tools. Ignore the trends; worry about what's cost-effective for you
      Create an analytics plan and process
      Start with a small, low-risk project
      Allow for a learning curve
      Don't expect to find a data science unicorn when hiring
  27. Lack of clarity: To gain the maximum benefit, you need to point your Big Data effort at a specific need or problem of your business. To justify your investments in Big Data projects, you need to showcase your results continuously.
      A huge hurdle in terms of ROI: Many enterprises can’t cope with the heavy investment needed to bring their existing data setup in sync with new challenges.
      The way we think of Big Data is wrong: Big Data gets treated as if it had a known beginning and a known end, rather than as an agile journey of constant exploration.
  28. BIG DATA STACK
  31. DATA ARCHITECTURE OVERVIEW
  32. PARADOX OF CHOICE
  34. Things you must consider before you decide to adopt a NoSQL database:
      • Community strength and commercial support
      • APIs
      • Model based on data
      • Model based on queries
      • Model based on consistency
  36. DATA INGESTION - SIMPLE
  37. DATA INGESTION - ADVANCED
  38. • Apache Hadoop (free): a leading tool for big data analysis.
      • Microsoft HDInsight (paid): provides low-cost infrastructure for Hadoop storage.
      • NoSQL databases [MongoDB, HBase, and Cassandra] (free): no particular schema is needed, and each row can have its own set of column values. Another benefit is better performance when storing massive amounts of data.
      • Apache Hive (free): mainly used for data mining; works on top of Hadoop.
      • Apache Pig (free): you don’t need to define a schema before storing a file, so you can start working directly. Hive and Pig cover almost the same use cases.
      • Talend: offers many products, such as Big Data Integration and Master Data Management (MDM), which combines real-time data, applications, and process integration with embedded data quality and stewardship.
      • OpenRefine (free): a user-friendly tool that can easily manage data even if it is somewhat unstructured. You can use it to explore, clean, transform, reconcile, and match data.
      • DataCleaner (paid): mainly a pre-stage to data visualization, where only structured and clean data can be used.
      • Tableau: a data visualization tool for structured data. You can connect to Hive directly and start visualizing the data.
      • Import.io: a data extraction tool that lets you convert any website into structured, machine-readable data with no coding required.
      • Apache Sqoop (free): a data transfer tool for importing data from an RDBMS into Hadoop and exporting Hadoop data back to an RDBMS.
  39. Here is a list of 24 Data Science Projects (free access) to practice: https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
  40. 1. FINDING A TOPIC
      2. EXTRACTING DATA FROM THE WEB AND CLEANING IT
      3. GAINING DEEPER INSIGHTS
      4. ENGINEERING OF FEATURES USING EXTERNAL APIS
  41. Move up the information ladder by asking users for input
      Combine, correlate and improve quality of data sets
      Bring new value from raw (open) data sets
      EXAMPLE: What are the main drivers of rental prices in Berlin?
  42. GETTING THE DATA
      There are tons of amazing data repositories, such as:
      • Kaggle, UCI ML Repository
      • dataset search engines
      • websites containing academic papers with datasets
      Alternatively, you could use web scraping.
      CLEANING THE DATA
      Once you start getting the data, it is very important to have a look at it as early as possible in order to find any possible issues.
      EXAMPLE: Possible issues with the data gathered in our example:
      • Duplicated apartments, because they had been online for a while
      • Agencies made input errors and then published a completely new ad with corrected values and additional description modifications
      • Some prices were changed after a month for the same apartment
      • …
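The duplicate and re-published listings described above could be handled by keeping only the most recent version of each apartment. This is a sketch under assumptions: the listings are invented, and identifying an apartment by normalised address, area, and room count is a hypothetical rule, not the one used in the original project.

```python
from datetime import date

# Hypothetical scraped listings; the first two describe the same apartment,
# re-published a month later with a changed price.
listings = [
    {"address": "Example Str. 1", "area_m2": 55, "rooms": 2, "rent": 900,
     "seen": date(2018, 3, 1)},
    {"address": "example str. 1 ", "area_m2": 55, "rooms": 2, "rent": 950,
     "seen": date(2018, 4, 2)},
    {"address": "Sample Weg 7", "area_m2": 80, "rooms": 3, "rent": 1400,
     "seen": date(2018, 3, 15)},
]


def apartment_key(listing):
    """Fields assumed to identify one physical apartment across ads."""
    return (listing["address"].strip().lower(), listing["area_m2"], listing["rooms"])


def deduplicate(rows):
    """Keep the most recently seen listing for each apartment key."""
    latest = {}
    for row in rows:
        key = apartment_key(row)
        if key not in latest or row["seen"] > latest[key]["seen"]:
            latest[key] = row
    return list(latest.values())


unique_listings = deduplicate(listings)
```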
  43. EXAMPLE: Interactive dashboard of Berlin rental prices: one can select all the possible configurations and see the corresponding price distribution.
  44. Visualization helps you to identify important attributes, or “features,” that could be used by machine learning algorithms. If the features you use are very uninformative, any algorithm will produce bad predictions. With very strong features, even a very simple algorithm can produce pretty decent results.
      EXAMPLE: In the rental price project, price is a continuous variable, so it is a typical regression problem. Taking all the extracted information, we collected features in order to be able to predict a rental price.
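Because price is a continuous target, even a one-feature least-squares line illustrates the regression setup. The sketch below is not the model from the project: the area and rent figures are invented so that the fit is easy to check by hand, and a real model would use many features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(area, slope, intercept):
    return slope * area + intercept


# Hypothetical training data: living area in square metres and monthly rent.
areas = [30, 45, 60, 80, 100]
rents = [500, 650, 800, 1000, 1200]

slope, intercept = fit_line(areas, rents)
estimated_rent = predict(70, slope, intercept)
```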
  45. EXAMPLE: PROBLEM
      One feature that was problematic was the address. There were 6.6K apartments and around 4.4K unique addresses of different granularity. There were around 200 unique postcodes, which could be converted into dummy variables, but then very precious information about a particular location would be lost.
      EXAMPLE: SOLUTION
      By using an external API, the following four additional features could be calculated from the apartment’s address:
      • duration of a train trip to the S-Bahn Friedrichstrasse (central station)
      • distance to U-Bahn Stadtmitte (city center) by car
      • duration of a walking trip to the nearest metro station
      • number of metro stations within one kilometer from the apartment
      These four features boosted the performance significantly.
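The project obtained trip durations and distances from an external API, which is not reproduced here. As a rough offline stand-in, location features of the same flavour can be approximated with straight-line (great-circle) distances. All coordinates below are made up, and straight-line distance is an assumption that ignores streets and timetables.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def location_features(apartment, stations, center):
    """Distance-based features replacing a raw address."""
    distances = [haversine_km(apartment, station) for station in stations]
    return {
        "km_to_center": haversine_km(apartment, center),
        "km_to_nearest_station": min(distances),
        "stations_within_1km": sum(d <= 1.0 for d in distances),
    }


# Hypothetical coordinates, not real station locations.
apartment = (52.5200, 13.4050)
stations = [(52.5200, 13.4100), (52.5260, 13.4050), (52.5500, 13.4050)]
center = (52.5100, 13.3900)

features = location_features(apartment, stations, center)
```

Unlike roughly 200 postcode dummy variables, a handful of numeric features like these keeps the information about how well connected a specific address is.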
