SlideShare una empresa de Scribd logo
1 de 33
Astronomical Image
Processing with Hadoop
Keith Wiley*
Survey Science Group
Dept. of Astronomy, Univ. of Washington
                                 *Keith Wiley, Andrew Connolly,
                                 YongChul Kwon, Magdalena
                                 Balazinska, Bill Howe, Jeffrey
                                 Garder, Simon Krughoff, Yingyi Bu,
                                 Sarah Loebman and Matthew Kraus




                       1
Session Agenda
   Astronomical Survey Science
   Image Coaddition
   Implementing Coaddition within MapReduce
   Optimizing the Coaddition Process
   Conclusions
   Future Work




                              2
Astronomical Topics of Study
   Dark energy
   Large scale structure of universe
   Gravitational lensing
   Asteroid detection/tracking




                                 3
What is Astronomical Survey Science?
     Dedicated sky surveys,
      usually from a single calibrated telescope/camera pair.

     Run for years at a time.

     Gather millions of images and TBs of storage*.

     Require high-throughput data reduction pipelines.

     Require sophisticated off-line data analysis tools.



*   Next generation surveys will gather PBs of 4image data.
Sky Surveys: Today and Tomorrow
   SDSS* (1999-2005)                LSST† (2015-2025)
   Founded in part by UW            8.4m mirror, 3.2 gigapixel camera
   1/4 of the sky                   Half sky every three nights
   80TBs total data                 30TB per night...
                                      ...one SDSS every three nights
                                     60PBs total (nonstop ten years)
                                     1000s of exposures of each location




    * Sloan Digital Sky Survey                 That’s a person!
                                           5
    † Large Synoptic Survey Telescope
FITS (Flexible Image Transport System)
     Common astronomical image representation file format
     Metadata tags (like EXIF):
     › Most importantly: Precise astrometry*
     › Other:
       • Geolocation (telescope location)
       • Sky conditions, image quality, etc.

     Bottom line:
     › An image format that
       knows where it is looking.



*   Position on sky                 6
Image Coaddition
 Give multiple partially overlapping images
  and a query (color and sky bounds):
 › Find images’ intersections with the query bounds.
 › Project bitmaps to the bounds.
 › Stack and mosaic into a final product.




                               7
Image Coaddition              Expensive
 Give multiple partially overlapping images
  and a query (color and sky bounds):
 › Find images’ intersections with the query bounds.
 › Project bitmaps to the bounds.
 › Stack and mosaic into a final product.


                          Cheap




                               8
Image Stacking (Signal Averaging)
                  Stacking improves SNR:
                  › Makes fainter objects visible.


                  Example (SDSS, Stripe 82):
                  › Top: Single image, R-band
                  › Bottom: 79-deep stack
                    • (~9x SNR improvement)
                    • Numerous additional detections

                  Variable conditions (e.g.,
                   atmosphere, PSF, haze) mean
                   stacking algorithm complexity can
                   exceed a mere sum.

                      9
Coaddition in Hadoop
Input FITS image        Mapper

                        Detect                       Projected
                   intersection with                intersection
                    query bounds.

                   Project bitmap to
                   query’s coord sys.
                                                Reducer
                                                                   Final coadd
                                             Stack and mosaic
  HDFS                  Mapper                   projected
                                               intersections.



                        Mapper




                        Mapper



                     Parallel                  Serial
                                        10
Driver Prefiltering
 To assist the process we prefilter
  the FITS files in the driver.

 SDSS camera has 30 CCDs:
  › 5 colors
  › 6 abutting strips of sky
                                        Query bounds
                                        Relevant FITS
 Prefilter (path glob) by color        Prefilter-excluded FITS
                                        False positives           1 FITS
  and sky coverage (single axis):
 › Exclude many irrelevant FITS files.
 › Sky coverage filter is only single axis:
   • Thus, false positives slip through...
     ...to be discarded in the mappers.
                                11
Performance Analysis
          45
                                                                                           Running time:
          40
                                                                                           › 2 query sizes
          35
                                                                                           › Run against
          30
                                                                                             1/10th of SDSS
          25                                                                                 (100,058 FITS)
Minutes




          20


          15                                                                               Conclusion:
          10
                                                                                           › Considering the

          5
                                                                                             small dataset, this
                                                                                             is too slow!
          0
                       1° sq (3885 FITS)                               ¼° sq (465 FITS)
                                           Query Sky Bounds Extent
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet
                                                                                           ›   Remember 42
                                                                                               minutes for the
                                                                                               next slide.
                                                                             12
Performance Analysis
                Driver                 MapReduce                   runJob() main()      Breakdown of large
          40
                                                                                         query running time

          30
                                                                                        main() is sum of:
                                                                                        › Driver
Minutes




          20                                                                            › runJob()


          10


                                                                                        runJob() is sum of
          0
                   Degl       Cons         Mapp        R          runJo       main
                                                                                         MapReduce parts.
                       ob In
                            put P truct File     er Do educer D         b()       ()
                                 aths        Splits   ne        one
                                          Hadoop Stage




                                                                        13
Performance Analysis
                                                         Total job time                  Breakdown of large
                      RPCs from
          40
                    client to HDFS                                                        query running time

          30
                                                                                         Observation:
                                                                                         › Running time
Minutes




          20                                                                               dominated by
                                                                                           RPCs from client
                                                                                           to HDFS to
                                                                                           process 1000s of
          10


                                                                                           FITS file paths.
          0
                  Degl       Cons         Mapp        R          runJo         main
                      ob In
                           put P truct File     er Do educer D         b()         ()
                                aths        Splits   ne        one
                                         Hadoop Stage                                    Conclusion:
                                                                                         › Need to reduce
                                                                                           number of files.
                                                                          14
Sequence Files
 Sequence files group many small files into a few large files.
 Just what we need!
 Real-time images may not be amenable to logical grouping.
 › Therefore, sequence files filled in an arbitrary manner:


      FITS                              FITS
               filename              assignment
                                       to Seq
                                                  Sequence
      FITS
                           Hash
                          Function
      FITS
                                                  Sequence

      FITS


                                15
Performance Analysis
          45
                                                                              FITS
                                                                              Seq hashed
                                                                                            Comparison:
          40
                                                                                            › FITS input vs.
          35
                                                                                              unstructured
          30                                                                                  sequence file
          25
                                                                                              input*
Minutes




          20


          15
                                                                                            Conclusion:
                                                                                            › 5x speedup!
          10


          5


          0
                                                                                            Hmmm...
                       1° sq (3885 FITS)
                                           Query Sky Bounds Extent
                                                                       ¼° sq (465 FITS)
                                                                                             Can we do better?
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet




           *   360 seq files in hashed seq DB.                               16
Performance Analysis
          45

                              Processed a subset
                                                                              FITS
                                                                              Seq hashed
                                                                                            Comparison:
          40
                                of the database                                             › FITS input vs.
                                after prefiltering
          35
                                                                                              unstructured
          30                                                                                  sequence file
          25
                                                                                              input*
Minutes




          20

                                                                                            Conclusion:
          15
                                     Processed
                                      the entire                                            › 5x speedup!
          10
                                    database, but
          5                         still ran faster
          0
                                                                                            Hmmm...
                       1° sq (3885 FITS)
                                           Query Sky Bounds Extent
                                                                       ¼° sq (465 FITS)
                                                                                             Can we do better?
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet




           *   360 seq files in hashed seq DB.                               17
Structured Sequence Files
 Similar to the way we prefiltered
  FITS files...

 SDSS camera has 30 CCDs:
  › 5 colors
  › 6 abutting strips of sky
 › Thus, 30 sequence file types       Query bounds
                                      Relevant FITS
                                      Prefilter-excluded FITS
                                      False positives           1 FITS
 Prefilter by color and
  sky coverage (single axis):
 › Exclude irrelevant sequence files.
 › Still have false positives.
 › Catch them in the mappers as before.
                                18
Performance Analysis
          45
                                                                              FITS
                                                                              Seq hashed
                                                                                                         Comparison:
          40                                                                  Seq grouped, prefiltered
                                                                                                         › FITS vs.
          35
                                                                                                           unstructured
          30                                                                                               sequence* vs.
          25
                                                                                                           structured
Minutes




                                                                                                           sequence files†
          20


          15
                                                                                                         Conclusion:
          10
                                                                                                         › Another 2x
          5
                                                                                                           speedup for the
          0
                       1° sq (3885 FITS)                               ¼° sq (465 FITS)
                                                                                                           large query, 1.5x
                                           Query Sky Bounds Extent
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet
                                                                                                           speedup for the
                                                                                                           small query.

           * 360 seq files in hashed seq DB.
           † 1080 seq files in structured DB.
                                                                             19
Performance Analysis
          10
               Seq hashed
               Seq grouped, prefiltered
                                                                                                    Breakdown of large
          9
                                                                                                     query running time
          8

          7
                                                                                                    Prediction:
          6
                                                                                                    › Prefiltering should
Minutes




          5
                                                                                                      gain performance
                                                                                                      in the mapper.
          4

          3

          2
                                                                                                    Does it?
          1

          0
                           Degl             Cons        Mapp        R         runJo       main
                                ob In
                                         put P truct File     er Do educer D        b()       ()
                                              aths        Splits   ne       one
                                                     Hadoop Stage




                                                                                  20
Performance Analysis
          10
               Seq hashed
               Seq grouped, prefiltered
                                                                                                    Breakdown of large
          9
                                                                                                     query running time
          8

          7
                                                                                                    Prediction:
          6
                                                                                                    › Prefiltering should
Minutes




          5
                                                                                                      gain performance
                                                                                                      in the mapper.
          4

          3

          2
                                                                                                    Conclusion:
          1
                                                                                                    › Yep, just as
          0
                           Degl
                                ob In       Cons        Mapp
                                         put P truct File
                                              aths        Splits
                                                                    R
                                                              er Do educer D
                                                                   ne
                                                                              runJo
                                                                            one
                                                                                    b()
                                                                                          main
                                                                                              ()      expected.
                                                     Hadoop Stage




                                                                                  21
Performance Analysis
          45
                                                                              FITS
                                                                              Seq hashed
                                                                                                         Experiments were
          40                                                                  Seq grouped, prefiltered
                                                                                                          performed on a
          35                                                                                              100,058 FITS
          30
                                                                                                          database
                                                                                                          (1/10th SDSS).
          25
Minutes




          20
                                                                                                         How much of this
          15
                                                                                                          database is
          10
                                                                                                          Hadoop churning
          5                                                                                               through?
          0
                       1° sq (3885 FITS)                               ¼° sq (465 FITS)
                                           Query Sky Bounds Extent
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet




                                                                             22
Performance Analysis
          45
                                                                              FITS
                                                                              Seq hashed
                                                                                                         Comparison:
          40                                                                  Seq grouped, prefiltered
                                                                                                         › Number of FITS
                                     13,415 FITS files
          35
                                     499 mappers‡                                                          considered in
          30                                                                                               mappers vs.
          25                                    100,058 FITS files                                         number
Minutes




                                                360 sequence files*                                        contributing to
          20
                                                2077 mappers‡                                              coadd
          15

                                                           13,335 FITS files
                                                                                                         Conclusion:
          10
                                                           144 sequence files†
          5                                                341 mappers‡                                  › Mappers must
          0
                             3885 FITS Sky Bounds Extent
                       1° sq (3885 FITS)                               ¼° sq (465 FITS)
                                                                                                           discard many
                                     Query
               Error bars show 95% confidence intervals — Outliers removed via Chauvenet
                                                                                                           FITS files due to
                                                                                                           nonoverlap of
           * 360 seq files in hashed seq DB.
                                                                                                           query bounds.
           † 1080 seq files in structured DB.
           ‡ 800 mapper slots on cluster.
                                                                             23
Using SQL to Find Intersections
 Store all image colors and sky bounds in a database:
 › First, query color and intersections via SQL.
 › Second, send only relevant images to MapReduce.
 Consequence:
  All images processed by mappers contribute to coadd.
  No time wasted considering irrelevant images.
                                         SQL
                   (1) Retrieve        Database
                 FITS filenames
                   that overlap
                  query bounds

      Driver

                  (2) Only load relevant          MapReduce
                 images into MapReduce

                                  24
Performance Analysis
          10
                                                                     Seq hashed
                                                                     Seq grouped, prefiltered
                                                                                                Comparison:
          9                                                          SQL ! Seq hashed
                                                                     SQL ! Seq grouped          › nonSQL vs. SQL
          8

          7

          6
                                                                                                Conclusion:
                                                                                                › Sigh, no major
Minutes




          5

          4
                                                                                                  improvement
                                                                                                  (SQL is not
          3
                                                                                                  remarkably
          2
                                                                                                  superior to
          1
                                                                                                  nonSQL for given
          0
                 1° sq (3885 FITS)                             ¼° sq (465 FITS)
                                                                                                  pairs of bars).
                                     Query Sky Bounds Extent




                                                                    25
Performance Analysis
          10
                                                                           Seq hashed
                                                                           Seq grouped, prefiltered
                                                                                                      Comparable
          9                                                                SQL ! Seq hashed
                                                                           SQL ! Seq grouped           performance here
          8
                                                                                                       makes sense:
                                                                                                      › In essence,
          7

          6
                                                                                                        prefiltering and
Minutes




          5                                                                                             SQL performed
          4                                                                                             similar tasks,
          3                                                                                             albeit with 3.5x
          2      100058    3885                Num mapper          100058    465
                                                                                                        different mapper
          1
                    13335 3885
                      13335     3885        input records (FITS)        6674     465                    inputs (FITS).
                  2077         1714                                 2149         499
                         341          338   Num launched mappers           144         128
          0
                 1° sq (3885 FITS)                                 ¼° sq (465 FITS)
                                        Query Sky Bounds Extent                                       Conclusion:
                                                                                                      › Cost of discarding
                                                                                                        many images in
                                                                                                        the nonSQL case
                                                                           26                           was negligible.
Performance Analysis
          10
                                                                            Seq hashed
                                                                            Seq grouped, prefiltered
                                                                                                       Low improvement
          9                                                                 SQL ! Seq hashed
                                                                            SQL ! Seq grouped           for SQL in the
          8
                                                                                                        hashed case is
          7                                                                                             surprising at first
          6
                                                                                                       › ...especially
Minutes




          5                                                                                               considering 26x
          4                                                                                               different mapper
          3                                                                                               inputs (FITS).
          2
                100058 3885
                 100058 3885                    Num mapper
                                             input records (FITS)
                                                                    100058    465
                      13335           3885                               6674     465
          1
                  2077         1714                                  2149         499
                         341          338    Num launched mappers           144         128
          0
                 1° sq (3885 FITS)                                  ¼° sq (465 FITS)
                                         Query Sky Bounds Extent




                                                                            27
Performance Analysis
          10
                                                                        Seq hashed
                                                                        Seq grouped, prefiltered
                                                                                                   Low improvement
          9                                                             SQL ! Seq hashed
                                                                        SQL ! Seq grouped           for SQL in the
          8
                                                                                                    hashed case is
          7                                                                                         surprising at first
          6
                                                                                                   › ...especially
Minutes




          5                                                                                           considering 26x
          4                                                                                           different mapper
          3                                                                                           inputs (FITS).
          2
                100058 3885
                 100058 3885                Num mapper
                                         input records (FITS)
                                                                100058    465
                      13335      3885                                6674     465
          1                                                                       Theory:
                   2077    1714 338
                              1714
                                     Num launched mappers
                                                                 2149         499
          0
                          341      338                          144        128
                                                                                   › Scattered
                1° sq (3885 FITS)                         ¼° sq (465 FITS)
                                  Query Sky Bounds Extent
                                                                                     distribution of
                                                                                     relevant FITS
                                                                               ently
          Mapper requirements exceeded Consequ                                       prevented
          cluster capacity(~800). We lost                                            efficient mapper
          parallelism (mappers were queued)!                            28           reuse.
Results
          10
                                                                           Seq hashed
                                                                           Seq grouped, prefiltered
                                                                                                      Just to be clear:
          9                                                                SQL ! Seq hashed
                                                                           SQL ! Seq grouped

          8

          7
                                                                                                      ›   Prefiltering
                                                                                                          improved due to
          6
                                                                                                          reduction of
Minutes




          5
                                                                                                          mapper load.
          4

          3
                                                                                                      ›   SQL improved
          2      100058
                      13335
                           3885
                                3885
                                               Num mapper
                                            input records (FITS)
                                                                   100058
                                                                        6674
                                                                             465
                                                                                 465
                                                                                                          due to data
          1
                  2077         1714
                                            Num launched mappers
                                                                    2149         499                      locality and more
          0
                         341
                 1° sq (3885 FITS)
                                      338                                  144
                                                                   ¼° sq (465 FITS)
                                                                                       128
                                                                                                          efficient mapper
                                        Query Sky Bounds Extent
                                                                                                          allocation – the
                                                                                                          required work
                                                                                                          was unaffected
                                                                                                          (3885 FITS).
                                                                           29
Utility of SQL Method
     Despite our results
      (which show SQL to be equivalent to prefiltering)...

     ...we predict that SQL should outperform prefiltering on
      larger databases.

     Why?
     › Prefiltering would contend with an increasing number of
       false positives in the mappers*.
     › SQL would incur little additional overhead.


     No experiments on this yet.

*   A spacing-filling curve for grouping the data might help.
                                                30
Conclusions
 Packing many small files into a few large files is essential.
 Structured packing and associated prefiltering offers
  significant gains (reduces mapper load).
 SQL prefiltering of unstructured sequence files yields little
  improvement (failure to combine scattered HDFS file-splits
  leads to mapper bloat).
 SQL prefiltering of structured sequence files performs
  comparably to driver prefiltering, but we anticipate superior
  performance on larger databases.

 On a shared cluster (e.g. the cloud), performance variance
  is high – doesn’t bode well for online applications. Also
  makes precise performance profiling difficult.


                                31
Future Work
 Parallelize the reducer.
 Less conservative CombineFileSplit builder.
 Conversion to C++, usage of existing C++ libraries.
 Query by time-range.
 Increase complexity of projection/interpolation:
 › PSF matching
 Increase complexity of stacking algorithm:
 › Convert straight sum to weighted sum by image quality.
 Work toward the larger science goals:
 › Study the evolution of galaxies.
 › Look for moving objects (asteroids, comets).
 › Implement fast parallel machine learning algorithms for
    detection/classification of anomalies.
                               32
Questions?


   kbwiley@astro.washington.edu




                  33

Más contenido relacionado

La actualidad más candente

A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
Beniamino Murgante
 
05_글로벌일루미네이션
05_글로벌일루미네이션05_글로벌일루미네이션
05_글로벌일루미네이션
noerror
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
ozlael ozlael
 
Single photon 3D Imaging with Deep Sensor Fusion
Single photon 3D Imaging with Deep Sensor FusionSingle photon 3D Imaging with Deep Sensor Fusion
Single photon 3D Imaging with Deep Sensor Fusion
David Lindell
 
Angel cunado_The Terrain Of KUF2
Angel cunado_The Terrain Of KUF2Angel cunado_The Terrain Of KUF2
Angel cunado_The Terrain Of KUF2
drandom
 

La actualidad más candente (20)

Volumetric Lighting for Many Lights in Lords of the Fallen
Volumetric Lighting for Many Lights in Lords of the FallenVolumetric Lighting for Many Lights in Lords of the Fallen
Volumetric Lighting for Many Lights in Lords of the Fallen
 
Reyes
ReyesReyes
Reyes
 
Parallelization Techniques for the 2D Fourier Matched Filtering and Interpola...
Parallelization Techniques for the 2D Fourier Matched Filtering and Interpola...Parallelization Techniques for the 2D Fourier Matched Filtering and Interpola...
Parallelization Techniques for the 2D Fourier Matched Filtering and Interpola...
 
SSII2021 [OS3-01] 設備や環境の高品質計測点群取得と自動モデル化技術
SSII2021 [OS3-01] 設備や環境の高品質計測点群取得と自動モデル化技術SSII2021 [OS3-01] 設備や環境の高品質計測点群取得と自動モデル化技術
SSII2021 [OS3-01] 設備や環境の高品質計測点群取得と自動モデル化技術
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
 
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
Dealing with multiple source spatio-temporal data in urban dynamics analysis ...
 
GDC 2014 - Deformable Snow Rendering in Batman: Arkham Origins
GDC 2014 - Deformable Snow Rendering in Batman: Arkham OriginsGDC 2014 - Deformable Snow Rendering in Batman: Arkham Origins
GDC 2014 - Deformable Snow Rendering in Batman: Arkham Origins
 
Cloud Science
Cloud ScienceCloud Science
Cloud Science
 
05_글로벌일루미네이션
05_글로벌일루미네이션05_글로벌일루미네이션
05_글로벌일루미네이션
 
Computer Science Thesis Defense
Computer Science Thesis DefenseComputer Science Thesis Defense
Computer Science Thesis Defense
 
Around the World in 80 Shaders
Around the World in 80 ShadersAround the World in 80 Shaders
Around the World in 80 Shaders
 
Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2Deferred Rendering in Killzone 2
Deferred Rendering in Killzone 2
 
Single photon 3D Imaging with Deep Sensor Fusion
Single photon 3D Imaging with Deep Sensor FusionSingle photon 3D Imaging with Deep Sensor Fusion
Single photon 3D Imaging with Deep Sensor Fusion
 
The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2
 
Practical Occlusion Culling in Killzone 3
Practical Occlusion Culling in Killzone 3Practical Occlusion Culling in Killzone 3
Practical Occlusion Culling in Killzone 3
 
Siggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan WakeSiggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan Wake
 
SIGGRAPH 2018 - PICA PICA and NVIDIA Turing
SIGGRAPH 2018 - PICA PICA and NVIDIA TuringSIGGRAPH 2018 - PICA PICA and NVIDIA Turing
SIGGRAPH 2018 - PICA PICA and NVIDIA Turing
 
Killzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo PostmortemKillzone Shadow Fall Demo Postmortem
Killzone Shadow Fall Demo Postmortem
 
SSII2018企画: センシングデバイスの多様化と空間モデリングの未来
SSII2018企画: センシングデバイスの多様化と空間モデリングの未来SSII2018企画: センシングデバイスの多様化と空間モデリングの未来
SSII2018企画: センシングデバイスの多様化と空間モデリングの未来
 
Angel cunado_The Terrain Of KUF2
Angel cunado_The Terrain Of KUF2Angel cunado_The Terrain Of KUF2
Angel cunado_The Terrain Of KUF2
 

Destacado

A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
A Survey on Medical Image Retrieval Based on Hadoop
A Survey on Medical Image Retrieval Based on HadoopA Survey on Medical Image Retrieval Based on Hadoop
A Survey on Medical Image Retrieval Based on Hadoop
Akshay Mamulwar
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large Scale
Liu Liu
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about Thesis
Sven Meys
 

Destacado (20)

A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Virtualizing Hadoop
Virtualizing HadoopVirtualizing Hadoop
Virtualizing Hadoop
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
Big Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must KnowBig Data - The 5 Vs Everyone Must Know
Big Data - The 5 Vs Everyone Must Know
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
A Survey on Medical Image Retrieval Based on Hadoop
A Survey on Medical Image Retrieval Based on HadoopA Survey on Medical Image Retrieval Based on Hadoop
A Survey on Medical Image Retrieval Based on Hadoop
 
Introducing Big Data
Introducing Big DataIntroducing Big Data
Introducing Big Data
 
Mild reminder
Mild reminderMild reminder
Mild reminder
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large Scale
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about Thesis
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014The Evolution of Data Analysis with Hadoop - StampedeCon 2014
The Evolution of Data Analysis with Hadoop - StampedeCon 2014
 
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageWebinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
 

Similar a Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010

Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal StructuresLarge Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
ayubimoak
 
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
inside-BigData.com
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
Johan Andersson
 
Efficient LDI Representation (TPCG 2008)
Efficient LDI Representation (TPCG 2008)Efficient LDI Representation (TPCG 2008)
Efficient LDI Representation (TPCG 2008)
Matthias Trapp
 

Similar a Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010 (20)

Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal StructuresLarge Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
Large Scale Parallel FDTD Simulation of Full 3D Photonic Crystal Structures
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
 
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time RaytracingSIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
SIGGRAPH 2018 - Full Rays Ahead! From Raster to Real-Time Raytracing
 
Image ORB feature
Image ORB featureImage ORB feature
Image ORB feature
 
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
Scratch to Supercomputers: Bottoms-up Build of Large-scale Computational Lens...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache Spark
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
GoogleSky Status at Google
GoogleSky Status at GoogleGoogleSky Status at Google
GoogleSky Status at Google
 
Tomoya Sato Master Thesis
Tomoya Sato Master ThesisTomoya Sato Master Thesis
Tomoya Sato Master Thesis
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 
Efficient architecture to condensate visual information driven by attention ...
Efficient architecture to condensate visual information driven by attention ...Efficient architecture to condensate visual information driven by attention ...
Efficient architecture to condensate visual information driven by attention ...
 
Efficient LDI Representation (TPCG 2008)
Efficient LDI Representation (TPCG 2008)Efficient LDI Representation (TPCG 2008)
Efficient LDI Representation (TPCG 2008)
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
Data Processing Using THEOS Satellite Imagery for Disaster Monitoring (Case S...
 
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
Big Science, Big Data: Simon Metson at Eduserv Symposium 2012
 
9553477.ppt
9553477.ppt9553477.ppt
9553477.ppt
 
GeoCAPE Strategies
GeoCAPE StrategiesGeoCAPE Strategies
GeoCAPE Strategies
 
RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018RasterFrames - FOSS4G NA 2018
RasterFrames - FOSS4G NA 2018
 

Más de Yahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
Yahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Yahoo Developer Network
 

Más de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 

Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010

  • 1. Astronomical Image Processing with Hadoop Keith Wiley* Survey Science Group Dept. of Astronomy, Univ. of Washington *Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Garder, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus 1
  • 2. Session Agenda  Astronomical Survey Science  Image Coaddition  Implementing Coaddition within MapReduce  Optimizing the Coaddition Process  Conclusions  Future Work 2
  • 3. Astronomical Topics of Study  Dark energy  Large scale structure of universe  Gravitational lensing  Asteroid detection/tracking 3
  • 4. What is Astronomical Survey Science?  Dedicated sky surveys, usually from a single calibrated telescope/camera pair.  Run for years at a time.  Gather millions of images and TBs of storage*.  Require high-throughput data reduction pipelines.  Require sophisticated off-line data analysis tools. * Next generation surveys will gather PBs of 4image data.
  • 5. Sky Surveys: Today and Tomorrow  SDSS* (1999-2005)  LSST† (2015-2025)  Founded in part by UW  8.4m mirror, 3.2 gigapixel camera  1/4 of the sky  Half sky every three nights  80TBs total data  30TB per night... ...one SDSS every three nights  60PBs total (nonstop ten years)  1000s of exposures of each location * Sloan Digital Sky Survey That’s a person! 5 † Large Synoptic Survey Telescope
  • 6. FITS (Flexible Image Transport System)  Common astronomical image representation file format  Metadata tags (like EXIF): › Most importantly: Precise astrometry* › Other: • Geolocation (telescope location) • Sky conditions, image quality, etc.  Bottom line: › An image format that knows where it is looking. * Position on sky 6
  • 7. Image Coaddition  Give multiple partially overlapping images and a query (color and sky bounds): › Find images’ intersections with the query bounds. › Project bitmaps to the bounds. › Stack and mosaic into a final product. 7
  • 8. Image Coaddition Expensive  Give multiple partially overlapping images and a query (color and sky bounds): › Find images’ intersections with the query bounds. › Project bitmaps to the bounds. › Stack and mosaic into a final product. Cheap 8
  • 9. Image Stacking (Signal Averaging)  Stacking improves SNR: › Makes fainter objects visible.  Example (SDSS, Stripe 82): › Top: Single image, R-band › Bottom: 79-deep stack • (~9x SNR improvement) • Numerous additional detections  Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum. 9
  • 10. Coaddition in Hadoop Input FITS image Mapper Detect Projected intersection with intersection query bounds. Project bitmap to query’s coord sys. Reducer Final coadd Stack and mosaic HDFS Mapper projected intersections. Mapper Mapper Parallel Serial 10
  • 11. Driver Prefiltering  To assist the process we prefilter the FITS files in the driver.  SDSS camera has 30 CCDs: › 5 colors › 6 abutting strips of sky Query bounds Relevant FITS  Prefilter (path glob) by color Prefilter-excluded FITS False positives 1 FITS and sky coverage (single axis): › Exclude many irrelevant FITS files. › Sky coverage filter is only single axis: • Thus, false positives slip through... ...to be discarded in the mappers. 11
  • 12. Performance Analysis 45  Running time: 40 › 2 query sizes 35 › Run against 30 1/10th of SDSS 25 (100,058 FITS) Minutes 20 15  Conclusion: 10 › Considering the 5 small dataset, this is too slow! 0 1° sq (3885 FITS) ¼° sq (465 FITS) Query Sky Bounds Extent Error bars show 95% confidence intervals — Outliers removed via Chauvenet › Remember 42 minutes for the next slide. 12
  • 13. Performance Analysis Driver MapReduce runJob() main()  Breakdown of large 40 query running time 30  main() is sum of: › Driver Minutes 20 › runJob() 10  runJob() is sum of 0 Degl Cons Mapp R runJo main MapReduce parts. ob In put P truct File er Do educer D b() () aths Splits ne one Hadoop Stage 13
  • 14. Performance Analysis Total job time  Breakdown of large RPCs from 40 client to HDFS query running time 30  Observation: › Running time Minutes 20 dominated by RPCs from client to HDFS to process 1000s of 10 FITS file paths. 0 Degl Cons Mapp R runJo main ob In put P truct File er Do educer D b() () aths Splits ne one Hadoop Stage  Conclusion: › Need to reduce number of files. 14
  • 15. Sequence Files  Sequence files group many small files into a few large files.  Just what we need!  Real-time images may not be amenable to logical grouping. › Therefore, sequence files filled in an arbitrary manner: FITS FITS filename assignment to Seq Sequence FITS Hash Function FITS Sequence FITS 15
  • 16. Performance Analysis 45 FITS Seq hashed  Comparison: 40 › FITS input vs. 35 unstructured 30 sequence file 25 input* Minutes 20 15  Conclusion: › 5x speedup! 10 5 0  Hmmm... 1° sq (3885 FITS) Query Sky Bounds Extent ¼° sq (465 FITS) Can we do better? Error bars show 95% confidence intervals — Outliers removed via Chauvenet * 360 seq files in hashed seq DB. 16
  • 17. Performance Analysis 45 Processed a subset FITS Seq hashed  Comparison: 40 of the database › FITS input vs. after prefiltering 35 unstructured 30 sequence file 25 input* Minutes 20  Conclusion: 15 Processed the entire › 5x speedup! 10 database, but 5 still ran faster 0  Hmmm... 1° sq (3885 FITS) Query Sky Bounds Extent ¼° sq (465 FITS) Can we do better? Error bars show 95% confidence intervals — Outliers removed via Chauvenet * 360 seq files in hashed seq DB. 17
  • 18. Structured Sequence Files  Similar to the way we prefiltered FITS files...  SDSS camera has 30 CCDs: › 5 colors › 6 abutting strips of sky › Thus, 30 sequence file types Query bounds Relevant FITS Prefilter-excluded FITS False positives 1 FITS  Prefilter by color and sky coverage (single axis): › Exclude irrelevant sequence files. › Still have false positives. › Catch them in the mappers as before. 18
  • 19. Performance Analysis 45 FITS Seq hashed  Comparison: 40 Seq grouped, prefiltered › FITS vs. 35 unstructured 30 sequence* vs. 25 structured Minutes sequence files† 20 15  Conclusion: 10 › Another 2x 5 speedup for the 0 1° sq (3885 FITS) ¼° sq (465 FITS) large query, 1.5x Query Sky Bounds Extent Error bars show 95% confidence intervals — Outliers removed via Chauvenet speedup for the small query. * 360 seq files in hashed seq DB. † 1080 seq files in structured DB. 19
  • 20. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Breakdown of large 9 query running time 8 7  Prediction: 6 › Prefiltering should Minutes 5 gain performance in the mapper. 4 3 2  Does it? 1 0 Degl Cons Mapp R runJo main ob In put P truct File er Do educer D b() () aths Splits ne one Hadoop Stage 20
  • 21. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Breakdown of large 9 query running time 8 7  Prediction: 6 › Prefiltering should Minutes 5 gain performance in the mapper. 4 3 2  Conclusion: 1 › Yep, just as 0 Degl ob In Cons Mapp put P truct File aths Splits R er Do educer D ne runJo one b() main () expected. Hadoop Stage 21
  • 22. Performance Analysis 45 FITS Seq hashed  Experiments were 40 Seq grouped, prefiltered performed on a 35 100,058 FITS 30 database (1/10th SDSS). 25 Minutes 20  How much of this 15 database is 10 Hadoop churning 5 through? 0 1° sq (3885 FITS) ¼° sq (465 FITS) Query Sky Bounds Extent Error bars show 95% confidence intervals — Outliers removed via Chauvenet 22
  • 23. Performance Analysis 45 FITS Seq hashed  Comparison: 40 Seq grouped, prefiltered › Number of FITS 13,415 FITS files 35 499 mappers‡ considered in 30 mappers vs. 25 100,058 FITS files number Minutes 360 sequence files* contributing to 20 2077 mappers‡ coadd 15 13,335 FITS files  Conclusion: 10 144 sequence files† 5 341 mappers‡ › Mappers must 0 3885 FITS Sky Bounds Extent 1° sq (3885 FITS) ¼° sq (465 FITS) discard many Query Error bars show 95% confidence intervals — Outliers removed via Chauvenet FITS files due to nonoverlap of * 360 seq files in hashed seq DB. query bounds. † 1080 seq files in structured DB. ‡ 800 mapper slots on cluster. 23
  • 24. Using SQL to Find Intersections  Store all image colors and sky bounds in a database: › First, query color and intersections via SQL. › Second, send only relevant images to MapReduce.  Consequence: All images processed by mappers contribute to coadd. No time wasted considering irrelevant images. SQL (1) Retrieve Database FITS filenames that overlap query bounds Driver (2) Only load relevant MapReduce images into MapReduce 24
  • 25. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Comparison: 9 SQL ! Seq hashed SQL ! Seq grouped › nonSQL vs. SQL 8 7 6  Conclusion: › Sigh, no major Minutes 5 4 improvement (SQL is not 3 remarkably 2 superior to 1 nonSQL for given 0 1° sq (3885 FITS) ¼° sq (465 FITS) pairs of bars). Query Sky Bounds Extent 25
  • 26. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Comparable 9 SQL ! Seq hashed SQL ! Seq grouped performance here 8 makes sense: › In essence, 7 6 prefiltering and Minutes 5 SQL performed 4 similar tasks, 3 albeit with 3.5x 2 100058 3885 Num mapper 100058 465 different mapper 1 13335 3885 13335 3885 input records (FITS) 6674 465 inputs (FITS). 2077 1714 2149 499 341 338 Num launched mappers 144 128 0 1° sq (3885 FITS) ¼° sq (465 FITS) Query Sky Bounds Extent  Conclusion: › Cost of discarding many images in the nonSQL case 26 was negligible.
  • 27. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Low improvement 9 SQL ! Seq hashed SQL ! Seq grouped for SQL in the 8 hashed case is 7 surprising at first 6 › ...especially Minutes 5 considering 26x 4 different mapper 3 inputs (FITS). 2 100058 3885 100058 3885 Num mapper input records (FITS) 100058 465 13335 3885 6674 465 1 2077 1714 2149 499 341 338 Num launched mappers 144 128 0 1° sq (3885 FITS) ¼° sq (465 FITS) Query Sky Bounds Extent 27
  • 28. Performance Analysis 10 Seq hashed Seq grouped, prefiltered  Low improvement 9 SQL ! Seq hashed SQL ! Seq grouped for SQL in the 8 hashed case is 7 surprising at first 6 › ...especially Minutes 5 considering 26x 4 different mapper 3 inputs (FITS). 2 100058 3885 100058 3885 Num mapper input records (FITS) 100058 465 13335 3885 6674 465 1  Theory: 2077 1714 338 1714 Num launched mappers 2149 499 0 341 338 144 128 › Scattered 1° sq (3885 FITS) ¼° sq (465 FITS) Query Sky Bounds Extent distribution of relevant FITS ently Mapper requirements exceeded Consequ prevented cluster capacity(~800). We lost efficient mapper parallelism (mappers were queued)! 28 reuse.
  • 29. Results 10 Seq hashed Seq grouped, prefiltered  Just to be clear: 9 SQL ! Seq hashed SQL ! Seq grouped 8 7 › Prefiltering improved due to 6 reduction of Minutes 5 mapper load. 4 3 › SQL improved 2 100058 13335 3885 3885 Num mapper input records (FITS) 100058 6674 465 465 due to data 1 2077 1714 Num launched mappers 2149 499 locality and more 0 341 1° sq (3885 FITS) 338 144 ¼° sq (465 FITS) 128 efficient mapper Query Sky Bounds Extent allocation – the required work was unaffected (3885 FITS). 29
  • 30. Utility of SQL Method  Despite our results (which show SQL to be equivalent to prefiltering)...  ...we predict that SQL should outperform prefiltering on larger databases.  Why? › Prefiltering would contend with an increasing number of false positives in the mappers*. › SQL would incur little additional overhead.  No experiments on this yet. * A spacing-filling curve for grouping the data might help. 30
  • 31. Conclusions  Packing many small files into a few large files is essential.  Structured packing and associated prefiltering offers significant gains (reduces mapper load).  SQL prefiltering of unstructured sequence files yields little improvement (failure to combine scattered HDFS file-splits leads to mapper bloat).  SQL prefiltering of structured sequence files performs comparably to driver prefiltering, but we anticipate superior performance on larger databases.  On a shared cluster (e.g. the cloud), performance variance is high – doesn’t bode well for online applications. Also makes precise performance profiling difficult. 31
  • 32. Future Work  Parallelize the reducer.  Less conservative CombineFileSplit builder.  Conversion to C++, usage of existing C++ libraries.  Query by time-range.  Increase complexity of projection/interpolation: › PSF matching  Increase complexity of stacking algorithm: › Convert straight sum to weighted sum by image quality.  Work toward the larger science goals: › Study the evolution of galaxies. › Look for moving objects (asteroids, comets). › Implement fast parallel machine learning algorithms for detection/classification of anomalies. 32
  • 33. Questions? kbwiley@astro.washington.edu 33