pandas 0.23 introduced a new extension interface through which you can extend pandas beyond NumPy's type system. This talk explains what that means in more detail and provides practical examples of how the new interface can be used to drastically improve your reporting.
2. Motivation
• Pandas historically bound to NumPy type system and its limitations:
• Integer and bool types cannot store missing data
• Non-numeric types (e.g. Categorical, Datetime w/ TZ) not natively supported
• Custom types required extensive updates to pandas internals
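The first two limitations are easy to reproduce. A minimal sketch, assuming a standard pandas installation and relying only on default type inference (the variable names are illustrative):

```python
import pandas as pd

# An integer column with a missing value is upcast to float64,
# because NumPy integers cannot represent "missing".
ints = pd.Series([1, 2, None])
print(ints.dtype)   # float64

# A bool column with a missing value falls back to generic Python objects.
bools = pd.Series([True, False, None])
print(bools.dtype)  # object
```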
3. Important Concepts
• Extension Type
• Description of the data type; similar to a NumPy dtype
• Can be registered for creation via string (e.g. …astype(“foo”))
• Extension Array
• Class which does the actual “heavy lifting”
• No restrictions on construction, though must be convertible to NumPy array
• Limited to one dimension, though may be backed by 0..n arrays
• A Series is a container for an “array-like” thing - Tom Augspurger
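The split between Extension Type and Extension Array can be inspected on a type that ships with pandas. The sketch below is not from the talk: it assumes pandas 1.0 or later and uses the built-in nullable integer type ("Int64") purely as a convenient example of an extension type that is registered under a string name.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionDtype

# The string "Int64" is resolved to an extension dtype by name.
s = pd.Series([1, 2, None]).astype("Int64")

ext_dtype = s.dtype   # the Extension Type: describes the data
ext_array = s.array   # the Extension Array: holds the data

print(isinstance(ext_dtype, ExtensionDtype))  # True
print(isinstance(ext_array, ExtensionArray))  # True
print(ext_array.ndim)                         # 1

# The array must be convertible to NumPy; here missing values become NaN.
as_numpy = ext_array.to_numpy(dtype="float64", na_value=np.nan)
print(type(as_numpy).__name__)                # ndarray
```

The Series only acts as the container here; the array object does the work, and in this particular case it is backed by a values array plus a boolean mask.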
13. Behind the Scenes
• Instead of one addressable unit (byte) per value, bitarray packs bool values into bit sequences
• Serialization between bitarray and NumPy allows EA to fit into pandas paradigm
* Under Development - actual implementation may differ
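The pack-and-restore round trip can be sketched without the bitarray library. The example below uses NumPy's own `packbits` and `unpackbits` as a stand-in; it is not the implementation from the talk, and the sample values are made up.

```python
import numpy as np

values = np.array([True, False, True, True, False,
                   False, True, False, True, True])

# Pack: 10 bools (one byte each) become 2 bytes of bits;
# the last byte is zero-padded.
packed = np.packbits(values)

# Unpack: restore one value per byte and trim the padding bits.
restored = np.unpackbits(packed)[: len(values)].astype(bool)

print(values.nbytes)                     # 10
print(packed.nbytes)                     # 2
print(np.array_equal(values, restored))  # True
```

Because the packed form loses the original length, a real array class would need to store it alongside the bits.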
14. Potential Benefits
• Memory footprint of mask potentially reduced by up to a factor of 8 (using bits instead of bytes)
• Theoretical bitarray-backed BoolNA implementation could reduce footprint by a factor of 16 (both values and mask go from bytes to bits)
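The factor of 8 for the mask can be checked with a quick measurement. This sketch covers only the mask, again uses NumPy bit packing as a stand-in for bitarray, and picks an arbitrary length of one million elements.

```python
import numpy as np

n = 1_000_000
mask = np.zeros(n, dtype=bool)    # one byte per element
packed_mask = np.packbits(mask)   # one bit per element

print(mask.nbytes)                       # 1000000
print(packed_mask.nbytes)                # 125000
print(mask.nbytes / packed_mask.nbytes)  # 8.0
```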
16. Further Research
• Extending Pandas in the pandas documentation
• Extension Arrays for Pandas by Tom Augspurger
• Extending Pandas using Apache Arrow and Numba by Uwe Korn