Streaming Data,
Concurrency And R

     Rory Winston

   rory@theresearchkitchen.com
About Me




      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and
      machine learning
      Relatively recent convert to R ( ≈ 2 years)
R - Pros and Cons


   Pro
         Designed by statisticians
         Can be extremely elegant
         Comprehensive extension library
         Open-source
         Huge parallelization effort
         Fantastic reporting capabilities
         Incredibly Popular

   Con
         Designed by statisticians
         Can be clunky (S4)
         Bewildering array of overlapping extensions
         Inherently single-threaded
         Incredibly Popular
Parallelization vs. Concurrency



        R interpreter is single threaded
        Some historical context for this (BLAS implementations)
        Not necessarily a limitation in the general context
        Multithreading can be complex and problematic
        Instead a focus on parallelization:
             Distributed computation: gridR, nws, snow
             Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
             Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
        Parallelization suits cpu-bound large data processing
        applications
Other Scalability and Performance Work




        JIT/bytecode compilation (Ra)
        Implicit vectorization a la Matlab (code analysis)
        Large (≥ RAM) dataset handling (bigmemory,ff)
        Many incremental performance improvements (e.g. less
        internal copying)
        Next: GPU/massive multicore...?
What Benefit Concurrency?




       Real-time (streaming to be more precise) data analysis
       Growing interest in using R for streaming data, not just offline
       analysis
       GUI toolkit integration
       Fine-grained control over independent task execution
       "I believe that explicit concurrency management tools (i.e. a
       threads toolkit) are what we really need in R at this point." -
       Luke Tierney, 2001
Will There Be A Multithreaded R?


        Short answer is: probably not
        At least not in its current incarnation
        Internal workings of the interpreter not particularly amenable
        to concurrency:
            Functions can manipulate caller state (<<- vs. <-)
            Lazy evaluation machinery (promises)
            Dynamic State, garbage collection, etc.
            Scoping: global environments
            Management of resources: streams, I/O, connections, sinks
        Implications for current code
        Possibly in the next language evolution (cf. Ihaka?)
        Large amount of work (but potentially do-able)
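
A minimal sketch of two of the interpreter features listed above, in plain base R. The names make_counter, lazy_demo and log_env are invented for illustration. The first part shows how <<- reaches into an enclosing environment where <- would only create a local variable; the second shows that an argument is a promise, evaluated only when the function body first touches it. Both behaviours would need careful locking if several threads shared one interpreter.

```r
# A closure whose inner functions see the same enclosing variable `count`.
make_counter <- function() {
  count <- 0
  list(
    # `<-` creates a new local `count`; the enclosing one is untouched.
    local_only = function() { count <- count + 1; count },
    # `<<-` rebinds `count` in the enclosing environment.
    shared     = function() { count <<- count + 1; count }
  )
}

counter <- make_counter()
local_results  <- c(counter$local_only(), counter$local_only())
shared_results <- c(counter$shared(), counter$shared())

# Lazy evaluation: the argument expression runs only when `x` is forced.
log_env <- new.env()
log_env$events <- character(0)

lazy_demo <- function(x) {
  log_env$events <- c(log_env$events, "body entered")
  x  # forces the promise here, not at call time
  invisible(NULL)
}

lazy_demo({
  log_env$events <- c(log_env$events, "argument evaluated")
  1
})

print(local_results)
print(shared_results)
print(log_env$events)
```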
Example Application




        Based on work I did last year and presented at UseR! 2008
        Wrote a real-time and historical market data service from
        Reuters/R
        The real-time interface used the Reuters C++ API
        R extension in C++ that spawned listening thread and
        handled updates
Simplified Architecture




                                R


                         extension (C++)



                           realtime bus
Example Usage



          rsub <- function(duration, items, callback)


   The call rsub will subscribe to the specified rate(s) for the duration
   of time specified by duration (ms). When a tick arrives, the
   callback function callback is invoked, with a data frame
   containing the fields specified in items.

   Multiple market data items may be subscribed to, and any
   combination of fields may be specified.

   Uses the underlying RFA API, which provides a C++ interface to
   real-time market updates.
Real-Time Example


   # Specify field names to retrieve
   fields <- c("BID","ASK","TIMCOR")

   # Subscribe to EUR/USD and GBP/USD ticks
   items <- list()
   items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
   items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)

   # Simple Callback Function
   callback <- function(df) { print(paste("Received",df)) }

   # Subscribe for 1 hour
   ONE_HOUR <- 1000*(60)^2
   rsub(ONE_HOUR, items, callback)
Issues With This Approach




        As R interpreter is single threaded, cannot spawn thread for
        callbacks
        Thus, interpreter thread is locked for the duration of
        subscription
        Not a great user experience
        Need to find alternative mechanism
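
The blocking behaviour can be reproduced without any market data connection. The sketch below is a hypothetical stand-in for rsub (rsub_mock, fake_feed and the tick values are invented, and nothing here touches the Reuters API): it polls a tick source and invokes the callback until the duration expires. The call does not return until then, which is exactly why the R prompt is unavailable during a subscription.

```r
# Hypothetical stand-in for rsub(): polls `next_tick` until the duration
# (in milliseconds) has elapsed, calling `callback` for each tick.
rsub_mock <- function(duration_ms, next_tick, callback) {
  deadline  <- Sys.time() + duration_ms / 1000
  delivered <- 0
  while (Sys.time() < deadline) {
    tick <- next_tick()
    if (!is.null(tick)) {
      callback(tick)
      delivered <- delivered + 1
    }
    Sys.sleep(0.01)  # avoid a busy loop
  }
  delivered
}

# Invented tick source: always returns the same one-row data frame.
fake_feed <- function() data.frame(BID = 1.3010, ASK = 1.3012)

started <- Sys.time()
n_ticks <- rsub_mock(200, fake_feed, function(df) invisible(df))
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))

# Control only comes back here after the full duration has passed.
cat("ticks delivered:", n_ticks, "\n")
```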
Alternative Approach



        If we cannot run subscriber threads in-process, need to
        decouple
        Standard approach: add an extra layer and use some form of
        IPC
        For instance, we could:
            Subscribe in a dedicated R process (A)
            Push incoming data onto a socket
            R process (B) reads from a listening socket
        Sockets could also be another IPC primitive, e.g. pipes
        Also note that R supports asynchronous I/O (?isIncomplete)
        Look at the ibrokers package for examples of this
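
A rough sketch of the socket variant, using only base R connections. The function names, the comma-separated line format and the polling interval are assumptions made for illustration, not part of the original service. run_producer would be called in process A and run_consumer in process B; the consumer opens a non-blocking connection, so readLines returns only the complete lines that have arrived so far and the session can do other work between polls.

```r
# Hypothetical wire format: one comma-separated line per tick.
format_tick <- function(item, bid, ask) {
  sprintf("%s,%.5f,%.5f", item, bid, ask)
}

parse_tick <- function(line) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  data.frame(item = parts[1],
             bid  = as.numeric(parts[2]),
             ask  = as.numeric(parts[3]),
             stringsAsFactors = FALSE)
}

# Process A: wait for one consumer to connect, then push tick lines to it.
run_producer <- function(port, ticks) {
  con <- socketConnection(port = port, server = TRUE,
                          blocking = TRUE, open = "w")
  on.exit(close(con))
  for (tick in ticks) writeLines(tick, con)
}

# Process B: connect and poll; a non-blocking readLines() returns
# character(0) when nothing complete has arrived yet.
run_consumer <- function(port, callback, polls = 100, host = "localhost") {
  con <- socketConnection(host = host, port = port,
                          blocking = FALSE, open = "r")
  on.exit(close(con))
  for (i in seq_len(polls)) {
    for (line in readLines(con)) callback(parse_tick(line))
    Sys.sleep(0.1)
  }
}

# The wire format can be checked without opening any socket.
example_tick <- parse_tick(format_tick("EUR=", 1.3010, 1.3012))
print(example_tick)
```

To try it, start run_producer(6011, c(format_tick("EUR=", 1.3010, 1.3012))) in one R session and run_consumer(6011, print) in another; the port number is arbitrary.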
The bigmemoRy package



       From the description: "Use C++ to create, store,
       access, and manipulate massive matrices"
       Allows creation of large matrices
       These matrices can be mapped to files/shared memory
       It is the shared memory functionality that we will use
       The next version (3.0) will be unveiled at UseR! 2009

   big.matrix(nrow, ncol, type = "integer", ....)
   shared.big.matrix(nrow, ncol, type = "integer", ...)
   filebacked.big.matrix(nrow, ncol, type = "integer", ...)
Sample Usage




   > library(bigmemory) # Note: I'm using pre-release
   > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
   > X
   An object of class “big.matrix”
   Slot "address":
   <pointer: 0x7378a0>
Create Shared Memory Descriptor

   > desc <- describe(X)
   > desc
   $sharedType
   [1] "SharedMemory"

   $sharedName
   [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

   $nrow
   [1] 1000

   $ncol
   [1] 1000

   $rowNames
   NULL
Export the Descriptor




    In R session 1:

    > dput(desc, file="~/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("~/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions 1 and 2 share the same big.matrix instance
Share Data Between Sessions




   R session 1:

   > X[1,1] <- 1.2345

   R session 2:

   > X[1,1]
   [1] 1.2345

   Thus, streaming data can be continuously fed into session 1
   And concurrently processed in session 2
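
One possible way to lay out a shared matrix so that the reading session can tell which ticks are new is a small ring buffer with a write counter in a header row. The sketch below is an assumption about how this could be organised, not something taken from the talk. To stay runnable without bigmemory it uses an ordinary matrix held in an environment as a stand-in for the shared big.matrix, and make_buffer, push_tick and read_new are invented names. With a real shared matrix the counter update and the data write are not atomic, so some form of locking would still be needed.

```r
# Stand-in for the shared big.matrix: an environment gives the reference
# semantics that a plain R matrix lacks. Row 1 is a header; X[1, 1] holds
# the total number of ticks written so far.
make_buffer <- function(capacity) {
  buf <- new.env()
  buf$X <- matrix(0, nrow = capacity + 1, ncol = 2)
  buf
}

# Writer session: store the tick, then publish it by bumping the counter.
push_tick <- function(buf, bid, ask) {
  capacity <- nrow(buf$X) - 1
  n <- buf$X[1, 1]
  buf$X[2 + (n %% capacity), ] <- c(bid, ask)
  buf$X[1, 1] <- n + 1
  invisible(n + 1)
}

# Reader session: return the ticks written since `seen`, oldest first.
# Ticks already overwritten by the ring are silently lost.
read_new <- function(buf, seen) {
  capacity <- nrow(buf$X) - 1
  n <- buf$X[1, 1]
  if (n == seen) return(matrix(0, nrow = 0, ncol = 2))
  first <- max(seen, n - capacity)
  idx <- 2 + ((first:(n - 1)) %% capacity)
  buf$X[idx, , drop = FALSE]
}

buf <- make_buffer(capacity = 3)
for (bid in 1:4) push_tick(buf, bid, bid + 0.5)

fresh <- read_new(buf, seen = 0)   # tick 1 has been overwritten
none  <- read_new(buf, seen = 4)   # nothing new since the last read
print(fresh)
```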
Summary




      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via
      IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
References




        http://cran.r-project.org/web/packages/bigmemory/
        http://www.cs.uiowa.edu/~luke/R/thrgui/
        http://www.milbo.users.sonic.net/ra/index.html
        http://www.cs.kent.ac.uk/projects/cxxr/
        http://www.theresearchkitchen.com/blog

Más contenido relacionado

Destacado

Kat01 2012
Kat01 2012Kat01 2012
Kat01 2012hekama
 
conroling slides by sohar bakhsh
conroling slides by sohar bakhshconroling slides by sohar bakhsh
conroling slides by sohar bakhshSohar Bakhsh
 
Equine Emergencies Part 4
Equine Emergencies Part 4Equine Emergencies Part 4
Equine Emergencies Part 4Ernie Martinez
 
7. susret 17.11.2011. konkretno lice boga oca
7. susret 17.11.2011.   konkretno lice boga oca7. susret 17.11.2011.   konkretno lice boga oca
7. susret 17.11.2011. konkretno lice boga ocaMeri-Lucijeta
 
Figurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoFigurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoMakala (D)
 
Best kitchen knives
Best kitchen knivesBest kitchen knives
Best kitchen knivesbestkit3
 
5 Worst States for Identity Theft
5 Worst States for Identity Theft5 Worst States for Identity Theft
5 Worst States for Identity TheftIDT911
 
Food Combining For Beginners.
Food Combining For Beginners.Food Combining For Beginners.
Food Combining For Beginners.mikefouse
 
Market advertizing
Market advertizingMarket advertizing
Market advertizingSohar Bakhsh
 
Unit 1-vocab jarod f
Unit 1-vocab jarod fUnit 1-vocab jarod f
Unit 1-vocab jarod fjarodf2238
 
Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012safestrat
 
Unit fourteen will future
Unit fourteen will futureUnit fourteen will future
Unit fourteen will futurewedaa23
 

Destacado (17)

Pti finish
Pti finishPti finish
Pti finish
 
Yoleo
YoleoYoleo
Yoleo
 
Kat01 2012
Kat01 2012Kat01 2012
Kat01 2012
 
Inventario
InventarioInventario
Inventario
 
conroling slides by sohar bakhsh
conroling slides by sohar bakhshconroling slides by sohar bakhsh
conroling slides by sohar bakhsh
 
Equine Emergencies Part 4
Equine Emergencies Part 4Equine Emergencies Part 4
Equine Emergencies Part 4
 
7. susret 17.11.2011. konkretno lice boga oca
7. susret 17.11.2011.   konkretno lice boga oca7. susret 17.11.2011.   konkretno lice boga oca
7. susret 17.11.2011. konkretno lice boga oca
 
Figurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero RedondoFigurative Painter - Vicente Romero Redondo
Figurative Painter - Vicente Romero Redondo
 
slideshow_funerals
slideshow_funeralsslideshow_funerals
slideshow_funerals
 
Best kitchen knives
Best kitchen knivesBest kitchen knives
Best kitchen knives
 
5 Worst States for Identity Theft
5 Worst States for Identity Theft5 Worst States for Identity Theft
5 Worst States for Identity Theft
 
Food Combining For Beginners.
Food Combining For Beginners.Food Combining For Beginners.
Food Combining For Beginners.
 
Market advertizing
Market advertizingMarket advertizing
Market advertizing
 
Unit 1-vocab jarod f
Unit 1-vocab jarod fUnit 1-vocab jarod f
Unit 1-vocab jarod f
 
Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012Safety Meeting Starter (SMS) Aug 2012
Safety Meeting Starter (SMS) Aug 2012
 
Why Is Tympanometry Performed?
Why Is Tympanometry Performed?Why Is Tympanometry Performed?
Why Is Tympanometry Performed?
 
Unit fourteen will future
Unit fourteen will futureUnit fourteen will future
Unit fourteen will future
 

Similar a Streaming Data and Concurrency in R

Building Europeana - The Rivers
Building Europeana - The RiversBuilding Europeana - The Rivers
Building Europeana - The RiversEuropeana
 
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Flexsin
 
IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012Tom-Cramer
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual ObservatoryJose Enrique Ruiz
 
Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?hemayadav41
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Gautier Poupeau
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFiHortonworks
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
End-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooEnd-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooJason Dai
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
 
Open Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data CenterOpen Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data CenterCole Crawford
 
Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)lennartkats
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution Analytics
 

Streaming Data and Concurrency in R

  • 1. Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com
  • 2. About Me
      • Independent Software Consultant
      • M.Sc. Applied Computing, 2000
      • M.Sc. Finance, 2008
      • Apache Committer
      • Working in the financial sector for the last 7 years or so
      • Interested in practical applications of functional languages and machine learning
      • Relatively recent convert to R (≈ 2 years)
  • 3. R - Pros and Cons
      • Pro: Designed by statisticians; Can be extremely elegant; Comprehensive extension library; Open-source; Huge parallelization effort; Fantastic reporting capabilities; Incredibly popular
      • Con: Designed by statisticians; Can be clunky (S4); Bewildering array of overlapping extensions; Inherently single-threaded; Incredibly popular
  • 18. Parallelization vs. Concurrency
      • R interpreter is single threaded
      • Some historical context for this (BLAS implementations)
      • Not necessarily a limitation in the general context
      • Multithreading can be complex and problematic
      • Instead a focus on parallelization:
          • Distributed computation: gridR, nws, snow
          • Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
          • Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
      • Parallelization suits cpu-bound large data processing applications
  • 19. Other Scalability and Performance Work
      • JIT/bytecode compilation (Ra)
      • Implicit vectorization a la Matlab (code analysis)
      • Large (≥ RAM) dataset handling (bigmemory, ff)
      • Many incremental performance improvements (e.g. less internal copying)
      • Next: GPU/massive multicore...?
  • 20. What Benefit Concurrency?
      • Real-time (streaming, to be more precise) data analysis
      • Growing interest in using R for streaming data, not just offline analysis
      • GUI toolkit integration
      • Fine-grained control over independent task execution
      • "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 21. Will There Be A Multithreaded R?
      • Short answer is: probably not
      • At least not in its current incarnation
      • Internal workings of the interpreter not particularly amenable to concurrency:
          • Functions can manipulate caller state (<<- vs. <-)
          • Lazy evaluation machinery (promises)
          • Dynamic state, garbage collection, etc.
          • Scoping: global environments
          • Management of resources: streams, I/O, connections, sinks
      • Implications for current code
      • Possibly in the next language evolution (cf. Ihaka?)
      • Large amount of work (but potentially do-able)
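The two interpreter features named on this slide can be seen in a few lines of base R. This is only an illustrative sketch; `make_counter` and `lazy` are hypothetical names chosen for the example, not part of the talk.

```r
# `<<-` rebinds a variable in an enclosing environment, so a function call can
# change state owned by its creator. `<-` only creates a local binding.
make_counter <- function() {
  count <- 0
  increment <- function() {
    count <<- count + 1    # modifies `count` in make_counter's environment
    count
  }
  shadow <- function() {
    count <- count + 100   # local copy only; the enclosing `count` is untouched
    count
  }
  list(increment = increment, shadow = shadow, value = function() count)
}

counter <- make_counter()
counter$increment()
counter$shadow()
final <- counter$value()   # only increment() changed the shared state

# Lazy evaluation: an argument is a promise that is evaluated only when used.
lazy <- function(x) "argument never evaluated"
untouched <- lazy(stop("never raised"))
```

Shared mutable environments and deferred evaluation of this kind are part of what would have to be made thread-safe in a multithreaded interpreter.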
  • 33. Example Application
      • Based on work I did last year and presented at UseR! 2008
      • Wrote a real-time and historical market data service from Reuters/R
      • The real-time interface used the Reuters C++ API
      • R extension in C++ that spawned listening thread and handled updates
  • 34. Simplified Architecture (diagram): R extension (C++), realtime bus
  • 35. Example Usage
      rsub <- function(duration, items, callback)
      • The call rsub will subscribe to the specified rate(s) for the duration of time specified by duration (ms).
      • When a tick arrives, the callback function callback is invoked with a data frame containing the fields specified in items.
      • Multiple market data items may be subscribed to, and any combination of fields may be specified.
      • Uses the underlying RFA API, which provides a C++ interface to real-time market updates.
  • 36. Real-Time Example
      # Specify field names to retrieve
      fields <- c("BID","ASK","TIMCOR")
      # Subscribe to EUR/USD and GBP/USD ticks
      items <- list()
      items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
      items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)
      # Simple Callback Function
      callback <- function(df) {
        print(paste("Received",df))
      }
      # Subscribe for 1 hour
      ONE_HOUR <- 1000*(60)^2
      rsub(ONE_HOUR, items, callback)
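A callback will often need to keep state between ticks rather than just print them. The sketch below shows one way to do that with a closure. It assumes each tick arrives as a one-row data frame with numeric BID and ASK columns, which the slides do not specify; `make_collector` and the ITEM and MID columns are hypothetical. The ticks are simulated by calling the callback directly, since rsub needs a live Reuters feed.

```r
# Build a callback that derives a mid price and accumulates every tick it sees.
make_collector <- function() {
  ticks <- NULL
  list(
    callback = function(df) {
      df$MID <- (df$BID + df$ASK) / 2   # assumed numeric BID/ASK columns
      ticks <<- rbind(ticks, df)        # keep state across invocations
      invisible(NULL)
    },
    result = function() ticks
  )
}

collector <- make_collector()

# Simulated ticks standing in for what a subscription would deliver.
collector$callback(data.frame(ITEM = "EUR=", BID = 1.4012, ASK = 1.4016))
collector$callback(data.frame(ITEM = "GBP=", BID = 1.6520, ASK = 1.6524))

collected <- collector$result()
```

With a live subscription, `collector$callback` would be passed as the third argument to rsub in place of the simple printing callback.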
  • 37. Issues With This Approach
      • As R interpreter is single threaded, cannot spawn thread for callbacks
      • Thus, interpreter thread is locked for the duration of subscription
      • Not a great user experience
      • Need to find alternative mechanism
  • 38. Alternative Approach
      • If we cannot run subscriber threads in-process, need to decouple
      • Standard approach: add an extra layer and use some form of IPC
      • For instance, we could:
          • Subscribe in a dedicated R process (A)
          • Push incoming data onto a socket
          • R process (B) reads from a listening socket
      • Sockets could also be another IPC primitive, e.g. pipes
      • Also note that R supports asynchronous I/O (?isIncomplete)
      • Look at the IBrokers package for examples of this
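The reading side of this arrangement (process B) might look roughly like the sketch below. The wire format of one `item,bid,ask` line per tick and the name `read_ticks` are assumptions made for illustration. A `textConnection` stands in for the socket so the example runs in a single session; in a real setup the connection would come from base R's `socketConnection(..., blocking = FALSE)`.

```r
# Read whatever complete lines are available on `con` and parse them into a
# data frame. On a non-blocking text connection, an incomplete final line is
# pushed back by readLines, and isIncomplete(con) then returns TRUE.
read_ticks <- function(con) {
  lines <- readLines(con, warn = FALSE)
  if (length(lines) == 0) return(NULL)
  parts <- strsplit(lines, ",", fixed = TRUE)
  data.frame(
    item = vapply(parts, `[`, character(1), 1),
    bid  = as.numeric(vapply(parts, `[`, character(1), 2)),
    ask  = as.numeric(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the socket that process A would be writing to.
con <- textConnection(c("EUR=,1.4012,1.4015", "GBP=,1.6520,1.6524"))
ticks <- read_ticks(con)
close(con)
```

Process B would call `read_ticks` in a polling loop, doing its analysis between reads, while process A holds the blocking subscription.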
  • 39. The bigmemoRy package
      • From the description: "Use C++ to create, store, access, and manipulate massive matrices"
      • Allows creation of large matrices
      • These matrices can be mapped to files/shared memory
      • It is the shared memory functionality that we will use
      • The next version (3.0) will be unveiled at UseR! 2009
      big.matrix(nrow, ncol, type = "integer", ...)
      shared.big.matrix(nrow, ncol, type = "integer", ...)
      filebacked.big.matrix(nrow, ncol, type = "integer", ...)
  • 40. Sample Usage
      > library(bigmemory)  # Note: I'm using pre-release
      > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
      > X
      An object of class “big.matrix”
      Slot "address":
      <pointer: 0x7378a0>
  • 41. Create Shared Memory Descriptor
      > desc <- describe(X)
      > desc
      $sharedType
      [1] "SharedMemory"
      $sharedName
      [1] "53f14925-dca1-42a8-a547-e1bccae999ce"
      $nrow
      [1] 1000
      $ncol
      [1] 1000
      $rowNames
      NULL
  • 42. Export the Descriptor
      In R session 1:
      > dput(desc, file="~/matrix.desc")
      In R session 2:
      > library(bigmemory)
      > desc <- dget("~/matrix.desc")
      > X <- attach.big.matrix(desc)
      Now R sessions 1 and 2 share the same big.matrix instance
  • 43. Share Data Between Sessions
      R session 1:
      > X[1,1] <- 1.2345
      R session 2:
      > X[1,1]
      [1] 1.2345
      Thus, streaming data can be continuously fed into session 1 and concurrently processed in session 2
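For the reading session to know which rows are new, the two sessions need some convention on top of the shared matrix. One possible convention, sketched below, reserves row 1 as a header whose first cell counts the ticks written, and treats the remaining rows as a ring buffer. This is an assumption for illustration, not something the slides prescribe. To keep the example runnable without bigmemory, a plain matrix held in an environment stands in for the shared big.matrix, and all function names are hypothetical. There is no locking, so a real two-session version would need to consider partially written rows.

```r
# Stand-in for the shared matrix: an environment gives reference semantics in
# one session. Row 1 is a header; X[1, 1] holds the number of ticks written.
make_buffer <- function(capacity, nfields) {
  buf <- new.env()
  buf$X <- matrix(0, nrow = capacity + 1, ncol = nfields)
  buf
}

# Writer (the subscribing session): store the tick, then publish the new count.
push_tick <- function(buf, tick) {
  n <- buf$X[1, 1]
  capacity <- nrow(buf$X) - 1
  row <- (n %% capacity) + 2          # wrap around, skipping the header row
  buf$X[row, ] <- tick
  buf$X[1, 1] <- n + 1
  invisible(n + 1)
}

# Reader (the analysis session): return the ticks written since `seen`.
poll_ticks <- function(buf, seen) {
  n <- buf$X[1, 1]
  capacity <- nrow(buf$X) - 1
  if (n == seen) return(list(ticks = buf$X[0, , drop = FALSE], seen = n))
  first <- max(seen, n - capacity)    # older ticks have been overwritten
  idx <- (first:(n - 1)) %% capacity + 2
  list(ticks = buf$X[idx, , drop = FALSE], seen = n)
}

buf <- make_buffer(capacity = 4, nfields = 2)
push_tick(buf, c(1.4012, 1.4015))     # bid, ask
push_tick(buf, c(1.4013, 1.4016))
res <- poll_ticks(buf, seen = 0)
```

The reader passes `res$seen` back into the next `poll_ticks` call, so each poll returns only ticks it has not yet processed.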
  • 44. Summary
      • Lack of threads not a barrier to concurrent analysis
      • Packages like bigmemory, nws, etc. facilitate decoupling via IPC
      • nws goes a step further, with a distributed workspace
      • Many applications for streaming data:
          • Data collection/monitoring
          • Development of pricing/risk algorithms
          • Low-frequency execution (??)
          • ...
  • 45. References
      • http://cran.r-project.org/web/packages/bigmemory/
      • http://www.cs.uiowa.edu/~luke/R/thrgui/
      • http://www.milbo.users.sonic.net/ra/index.html
      • http://www.cs.kent.ac.uk/projects/cxxr/
      • http://www.theresearchkitchen.com/blog