
Spark tutorial PyCon 2016 part 2

Discover insights about car manufacturers from Twitter data using a Python notebook connected to Apache Spark



  1. David Taieb, STSM - IBM Cloud Data Services, Developer Advocate, david_taieb@us.ibm.com HANDS-ON SESSION: DEVELOPING ANALYTIC APPLICATIONS USING APACHE SPARK™ AND PYTHON Part 2: Analyzing car Twitter data with Spark and dashDB PyCon 2016, Portland
  2. ©2016 IBM Corporation Agenda • Provision the application services on Bluemix: Spark, dashDB, IBM Insight for Twitter • Load car-related tweets from IBM Insight for Twitter into the dashDB warehouse • Run analytics in a Python notebook and discover new insights
  3. Sign up for Bluemix • Access the IBM Bluemix website at https://console.ng.bluemix.net • Click Get Started for Free • Complete the form and click Create account • Look for the confirmation email and click the confirm your account link
  4. Create a new space on Bluemix In preparation for running the project, we create a new space on Bluemix
  5. Create a Spark instance Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse
  6. Create the new Spark instance Optional: you can skip this step if you already have a space with a Spark instance that you would like to reuse
  7. Acquiring the data • In the next section, we show how to acquire the Twitter data and store it in dashDB • We use the Twitter loading connector available as a menu in the dashDB console
  8. Create an instance of IBM dashDB on Bluemix
  9. Create an instance of IBM Insight for Twitter on Bluemix
  10. Agenda • Provision the application services on Bluemix: Spark, dashDB, IBM Insight for Twitter • Load car-related tweets from IBM Insight for Twitter into the dashDB warehouse • Run analytics in a Python notebook and discover new insights
  11. Launch the dashDB console Click the dashDB service tile to open this dashboard, then click the Launch button
  12. Load Twitter data The dashDB console offers multiple data connectors, including a Twitter connector that automatically connects to IBM Insight for Twitter
  13. Connect to Twitter Reuse the Twitter service instance created in the previous step
  14. Select the data to be loaded Twitter query being used: posted:2015-01-01,2015-12-31 followers_count:2000 listed_count:1000 (volkswagen OR vw OR toyota OR daimler OR mercedes OR bmw OR gm OR "general motors" OR tesla) Specify the Twitter query Provide a preview count of the output data
  15. Select the dashDB table Name of the schema under which the tables will be created Prefix (namespace) for the created tables List of tables that will be created
  16. Loading data monitoring page Warning: loading time may vary based on bandwidth. It may take between 15 minutes and 1 hour
  17. Complete the load: statistics
  18. Complete the load: explore the data
  19. Get connection information Copy the user ID, password, and JDBC URL; you'll need this information later
  20. Agenda • Provision the application services on Bluemix: Spark, dashDB, IBM Insight for Twitter • Load car-related tweets from IBM Insight for Twitter into the dashDB warehouse • Run analytics in a Python notebook and discover new insights
  21. Create a new notebook from URL • Create notebook from URL • Use https://github.com/ibm-cds-labs/spark.samples/raw/master/notebook/DashDB%20Twitter%20Car%202015%20Python%20Notebook.ipynb
  22. Step 1: Import Python packages • Install the nltk package (Natural Language Toolkit) • We will use it to filter stop words later in the tutorial
  23. Import Python modules and set up the SQLContext
  24. Step 2: Define global variables Set up the various data structures we'll need throughout the notebook This is the SCHEMA and PREFIX you used in step 3 of the Twitter connector wizard
  25. Set up some global helper functions JavaScript Google map visualization Misc helper that fills in missing dates
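The "fill in missing dates" helper matters because a timeline chart needs a point for every day, including days with zero tweets. The notebook's actual implementation isn't reproduced in the slides; this is a minimal stdlib sketch of what such a helper typically does (the function name and dict-based input are assumptions):

```python
from datetime import date, timedelta

def fill_missing_dates(counts, start, end):
    # Hypothetical sketch of the notebook's date-filling helper:
    # given a dict mapping date -> tweet count, return a dense list
    # covering every day in [start, end], with 0 for days without tweets.
    filled = []
    d = start
    while d <= end:
        filled.append((d, counts.get(d, 0)))
        d += timedelta(days=1)
    return filled

# Example: two missing days get filled with zeros.
dense = fill_missing_dates({date(2015, 9, 1): 5},
                           date(2015, 9, 1), date(2015, 9, 3))
```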
  26. Step 3: Acquire the data from dashDB User ID and password from the connection page
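For reference, the connection details copied from the dashDB console typically have the following shape (dashDB exposes a DB2 JDBC endpoint; the exact host and credentials come from your own service instance, so the values below are placeholders):

```
jdbc:db2://<host from connection page>:50000/BLUDB
user=<user ID from connection page>
password=<password from connection page>
```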
  27. Join the tweets and sentiment tables In this step, we want to add a sentiment score to each tweet record: • Join the Tweets and Sentiments tables • Encode the sentiment into a number, e.g. POSITIVE=+1, NEGATIVE=-1, AMBIVALENT=0 • Compute an average for each sentiment associated with a tweet • %time instruments the code to provide execution profiling stats
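The encoding and averaging described above can be sketched in plain Python (in the notebook this logic runs inside a Spark job; the function names here are illustrative, and mapping unknown labels to 0 is an assumption of this sketch):

```python
def encode_sentiment(label):
    # Encode a sentiment label as a number, as described on the slide.
    # Labels other than these three map to 0 (an assumption of this sketch).
    return {"POSITIVE": 1, "NEGATIVE": -1, "AMBIVALENT": 0}.get(label, 0)

def average_sentiment(labels):
    # Average the encoded sentiments attached to one tweet.
    if not labels:
        return 0.0
    return sum(encode_sentiment(l) for l in labels) / len(labels)
```

A tweet tagged ["POSITIVE", "POSITIVE", "NEGATIVE"] would thus score (1 + 1 - 1) / 3 ≈ 0.33.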
  28. Step 4: Transform the data Create a clean working DataFrame that will be easier to use in our analytics
  29. Step 5: Geographic distribution of tweets Group by country and aggregate the tweet counts Convert the Spark SQL DataFrame to a pandas data structure for visualization
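The shape of that aggregation can be mimicked with pandas on toy data (in the notebook the groupBy/count runs on the Spark cluster and only the small aggregated result is converted with toPandas(); the column name "country" is an assumption):

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame; in the notebook this is a
# Spark DataFrame and the aggregation runs on the cluster.
tweets = pd.DataFrame({"country": ["US", "DE", "US", "FR", "DE", "US"]})

# Group by country, count tweets, sort for charting.
by_country = (tweets.groupby("country").size()
              .rename("count").reset_index()
              .sort_values("count", ascending=False))
```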
  30. Bar chart visualization of tweet distribution by geo
  31. Google map visualization of tweet distribution by geo Call the GeoChart helper that sets up the JavaScript code
  32. Clean up memory before the next analytics Resources, including memory on the Spark driver machine, are not infinite. It is good practice to clean up when data is no longer needed
  33. Step 6: Analyzing tweet sentiment Group by sentiment and aggregate the tweet counts Convert the Spark SQL DataFrame to a pandas data structure for visualization
  34. Sentiment visualization Use a matplotlib pie chart
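A pie chart over the sentiment counts can be sketched as follows (the counts here are made-up placeholders; in the notebook they come from the sentiment groupBy on the previous slide):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Illustrative counts only; the real values come from the Spark groupBy.
labels = ["POSITIVE", "NEGATIVE", "AMBIVALENT"]
counts = [5200, 1800, 3000]

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(counts, labels=labels, autopct="%1.1f%%")
ax.set_title("Tweet sentiment distribution")
```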
  35. Step 7: Analyze the tweet timeline Group by (posting time, sentiment) tuples Aggregate the sum of the tweet counts Convert the Spark SQL DataFrame to a pandas data structure for visualization
  36. Prepare the timeline data structures
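Preparing the timeline amounts to reshaping (date, sentiment, count) rows into one series per sentiment. A pandas sketch on toy data (the notebook computes the rows with a Spark groupBy first; column names are assumptions):

```python
import pandas as pd

# Toy (posting date, sentiment, count) rows, as produced by the
# Spark groupBy in the previous step.
rows = pd.DataFrame({
    "posted":    ["2015-09-01", "2015-09-01", "2015-09-02"],
    "sentiment": ["POSITIVE", "NEGATIVE", "POSITIVE"],
    "count":     [10, 4, 7],
})

# Pivot to one column per sentiment, one row per day; missing
# (day, sentiment) combinations become 0 so every series is dense.
timeline = rows.pivot_table(index="posted", columns="sentiment",
                            values="count", fill_value=0)
```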
  37. Time series visualization for all tweets
  38. Deep dive into car manufacturers Create a new DataFrame that enriches tweets with extra metadata: - Boolean for each car manufacturer - Boolean for electric car - Boolean for self-driving car
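The boolean enrichment can be sketched with pandas keyword matching (the notebook does this on the Spark DataFrame; the keyword patterns below are hypothetical and simpler than whatever matching rules the notebook actually uses):

```python
import pandas as pd

tweets = pd.DataFrame({"text": [
    "The new VW Golf looks great",
    "Tesla's electric cars keep improving",
    "BMW tests a self driving prototype",
]})

# Hypothetical keyword patterns; one boolean column per topic.
flags = {
    "vw": "vw|volkswagen",
    "tesla": "tesla",
    "bmw": "bmw",
    "electric": "electric",
    "self_driving": "self driving|self-driving",
}
for col, pattern in flags.items():
    tweets[col] = tweets["text"].str.contains(pattern, case=False, regex=True)
```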
  39. Re-analyze the tweet timeline for each car manufacturer Create a new DataFrame for each car manufacturer Aggregate the tweet counts, ordered by posting time
  40. Timeline series visualization Notice the peak of tweets for VW between September and October 2015
  41. Explain the peak of tweets for VW between September and October 2015 Filter for all VW tweets between Sept and Oct 2015 Use the NLTK stopwords module to filter out stop words Create a map of counts of all non-stop words used in the tweets Pie chart visualization of the top 10 words used in these tweets
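The word-counting step can be sketched with the stdlib (a small hardcoded set stands in for NLTK's stopwords.words("english") list that the notebook uses; the function name is illustrative):

```python
import re
from collections import Counter

# A few hardcoded English stop words stand in for NLTK's
# stopwords.words("english") list used in the notebook.
STOP_WORDS = {"the", "a", "an", "in", "on", "is", "for", "of", "to", "and"}

def top_words(tweet_texts, n=10):
    # Count non-stop words across all tweets and return the n most common.
    counts = Counter()
    for text in tweet_texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)

top = top_words(["VW emissions scandal", "the VW emissions cheating"], n=2)
```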
  42. Peak explained We can clearly see from the list of most-used words that the peak corresponds to the VW scandal around fraudulent emissions testing
  43. Follow the notebook for many more interesting analytics
  44. Resources • https://developer.ibm.com/clouddataservices/ • https://github.com/ibm-cds-labs/simple-data-pipe • https://github.com/ibm-cds-labs/pipes-connector-flightstats • http://spark.apache.org/docs/latest/mllib-guide.html • https://console.ng.bluemix.net/data/analytics/
  45. Thank You
