Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
5. Barclays Africa Group Limited
• Pan-African financial services provider
– Providing services to South Africa (ABSA) and 11 other African countries (Barclays)
• Subject to strict regulatory compliance
– Basel Committee on Banking Supervision (BCBS)
• Accuracy
• Comprehensiveness
• Clarity
• Usefulness
#EUent3
6. Spline
• SPark LINEage
• Open source project
• Goals
– Satisfy initial interpretations of regulatory requirements, specifically on “Clarity”
• Data lineage from Spark’s execution plans
• Visualize in an “explorable” user-friendly format
7. Spline – How it works
[Diagram labels: Spark job, Spark library, Spark Session, Transformations, Action. Caption: Generate execution plans]
8. Spline – How it works
• 1 line initialization
• Use SQLContext listeners
• Generate execution plans
[Diagram labels: Spark Job & Spline, Spark library, Spark Session, Transformations, Action, Spline]
9. Spline – How it works
[Diagram labels: Spark Job & Spline, Spark library, Spark Session, Transformations, Action, Spline, …, Spline UI]
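The listener idea from the "How it works" slides can be illustrated with a small toy model. This is only a sketch: `ToySession`, `ToyFrame` and `PlanNode` are hypothetical stand-ins written for this example, not Spark or Spline APIs, and the file names are made up. In Spark itself the comparable hook is a `QueryExecutionListener` registered on the session; how Spline uses it internally is not shown on the slides.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class PlanNode:
    """One step of a (toy) execution plan, linked to the step it reads from."""
    op: str
    detail: str
    parent: Optional["PlanNode"] = None

    def lineage(self) -> List[str]:
        # Walk from the final step back to the source, then reverse the order.
        steps, node = [], self
        while node is not None:
            steps.append(f"{node.op}({node.detail})")
            node = node.parent
        return list(reversed(steps))


class ToySession:
    """Hypothetical session: hands out frames and holds the registered listeners."""
    def __init__(self) -> None:
        self.listeners: List[Callable[[str, PlanNode], None]] = []

    def register(self, listener: Callable[[str, PlanNode], None]) -> None:
        self.listeners.append(listener)

    def read(self, source: str) -> "ToyFrame":
        return ToyFrame(self, PlanNode("read", source))


class ToyFrame:
    """Hypothetical lazy frame: transformations only extend the plan."""
    def __init__(self, session: ToySession, plan: PlanNode) -> None:
        self.session = session
        self.plan = plan

    def transform(self, op: str, detail: str) -> "ToyFrame":
        return ToyFrame(self.session, PlanNode(op, detail, self.plan))

    def write(self, target: str) -> None:
        # The action completes the plan and notifies every listener with it.
        final_plan = PlanNode("write", target, self.plan)
        for listener in self.session.listeners:
            listener("write", final_plan)


captured: List[List[str]] = []
session = ToySession()
# The "1 line initialization": register a listener that records lineage.
session.register(lambda action, plan: captured.append(plan.lineage()))

(session.read("beer.csv")
        .transform("filter", "year = 2011")
        .transform("select", "country, consumption")
        .write("out.parquet"))
```

The job code itself stays unchanged apart from the registration line; lineage is collected only when the action runs, because that is when the full plan exists.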
10. Demo use case
• Find the countries with the highest annual beer consumption per person
– Correlation with GDP??
11. Data
Beer consumption per country
Country          2011         2010         2009
Czech Republic   15,583,000   15,549,000   16,190,000
Ireland          4,721,000    4,814,000    4,832,000
Development indicators from the World Bank
Country          Metric       2011         2010         2009
Czech Republic   GDP          $21,717      $19,764      $19,698
Czech Republic   Population   10,496,088   10,474,410   10,443,936
Ireland          GDP          $52,567      $48,538      $51,983
Ireland          Population   4,576,794    4,560,155    4,535,375
12. Analysis
• Marek’s job
– Data prep
– Analyze the correlation between beer consumption and GDP growth
• Jan’s beer job
– Calculate the consumption of beer per country per capita
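A minimal sketch of the per-capita calculation, using only the figures shown on the "Data" slide. The variable and function names are assumptions made for this example, the demo itself ran as Spark jobs rather than plain Python, and the unit of the consumption figures is not stated on the slide.

```python
# Beer consumption per country and year, copied from the "Data" slide.
# The unit is not given there, so the ratios below are in "units per person".
beer_consumption = {
    "Czech Republic": {2011: 15_583_000, 2010: 15_549_000, 2009: 16_190_000},
    "Ireland": {2011: 4_721_000, 2010: 4_814_000, 2009: 4_832_000},
}

# World Bank indicators in the slide's long form: (country, metric, values per year).
indicators = [
    ("Czech Republic", "GDP", {2011: 21_717, 2010: 19_764, 2009: 19_698}),
    ("Czech Republic", "Population", {2011: 10_496_088, 2010: 10_474_410, 2009: 10_443_936}),
    ("Ireland", "GDP", {2011: 52_567, 2010: 48_538, 2009: 51_983}),
    ("Ireland", "Population", {2011: 4_576_794, 2010: 4_560_155, 2009: 4_535_375}),
]


def select_metric(rows, metric):
    """Data prep: keep one metric and key it by country."""
    return {country: values for country, name, values in rows if name == metric}


def per_capita(consumption, population):
    """Divide each country's yearly consumption by its population in that year."""
    return {
        country: {year: amount / population[country][year]
                  for year, amount in by_year.items()}
        for country, by_year in consumption.items()
    }


population = select_metric(indicators, "Population")
consumption_per_capita = per_capita(beer_consumption, population)

# Countries ordered by 2011 consumption per person, highest first.
ranking_2011 = sorted(
    consumption_per_capita,
    key=lambda country: consumption_per_capita[country][2011],
    reverse=True,
)
```

With only two countries in the sample, a correlation against GDP would not be meaningful, so the sketch stops at the ranking.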
16. Next steps
• Enterprise features
– Authentication (Kerberos, SSO)
– Authorization
– User management
• Interoperability with other tools
– Cloudera Manager, Informatica, Apache Atlas
• Support other Spark data sources and actions
– Streaming, ML
17. The bigger picture
• Develop open source conformance & ingestion engine on Spark
– BCBS compliant (lineage, dataflow controls, error tracking)
– Transfer & transform data from different source systems
• To strongly typed datasets
• On Hadoop
• Conforming to enterprise level data dictionaries & data quality
• In development – stay tuned :)
18. We’re open source!
• Contributions are most welcome
• Released versions mirrored on
– https://github.com/AbsaOSS/spline
• Wiki and docs on
– https://absaoss.github.io/spline/
19. Questions
• Now is a good time
• Or feel free to contact us
– Jan Scherbaum
• jan.scherbaum@barclays.com
– Marek Novotny
• marek.x.novotny@barclays.com
– Oleksandr Vayda
• oleksandr.vayda@barclays.com
• Acknowledgements:
– Dennis Chu, Aaisha Bibi Osman, Adam Smyczek, Andrew Baker