Hydra - Content Processing Framework for Search Driven Solutions

•Descargar como PPTX, PDF•

1 recomendación•3,654 vistas

Presented at Lucene Revolution, 7-8 May in Boston and Berlin Buzzwords 4-5 June, 2012. When working with free text search, the quality of the data in the index is a key factor on the quality of the results delivered and has a major impact on the information consumption experience. Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. Providing a scalable and efficient pipeline which the documents pass through before being indexed into the search engine does this.

Tecnología Educación

Introducing Hydra
An Open Source Document Processing Framework

Joel Westberg

© FINDWISE 2012

About Findwise

• Founded in 2005

• Offices in Sweden, Denmark,
Norway and Poland

• 80 employees (April 2012)

• Our objective is to be a leading provider of Findability solutions utilising
the full potential of search technology to create customer business value

Technology independent
Creating search-driven Findability solutions based on market-leading
commercial and open source search technology platforms:

 Autonomy IDOL
 Microsoft (SharePoint and FAST Search products)
 Google GSA
 IBM ICA/OmniFind
 LucidWorks
 Apache Lucene/Solr
 Elastic Search
 and more…

Connecting source to search
Garbage in, garbage out. But what about unstructured data in?
• Flat data is richer than it appears
• Don’t discard information too soon!

The unstructured structured data paradox
Example: News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph

Enrichment and structuring possibilities

• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments

Main Design Objectives
Scalability
• Horizontally scalable central repository
• Independent processing nodes
Failiure tolerant
• Failiure of a stage affects only a single document
• Failiure of a node affects at most n documents
• Failiures can be automaticly detected
Robustness
• Independent stages
Development ease
• Debug stages from IDE against actual data
• Allow test driven pipeline development

Hadoop/Big Data integration

Usecases for document enrichment
• Pagerank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index

Hadoop/Big Data integration

Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents

Open Source initiative

• Other committers
• The role of Findwise

For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra

• Email: joel.westberg@findwise.com

Thank Joel Westberg
joel.westberg@findwise.com

you!

Más contenido relacionado

La actualidad más candente

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)

DataPad Inc.

Introduction to basic data analytics tools

Nascenia IT

We’ve all seen Google results that feature detailed contact and location information, recipe details and reviews, browsable discographies and more when we’re looking for information. These kind of rich search results give users instant access to the most actionable and shareable content on your website, creating a great user experience before they even get to your site. What kind of dark arts are site owners using to create this kind of detailed, rich information that search engines gobble up and use to create meaningful, easy to digest search results for their audiences? The answer lies in structured data formats. Attendees will leave this talk with a basic understanding of what structured data is, the formats available, and what types of structured data benefit a site the most when it comes to SEO. We’ll also look at what tools are available to most efficiently use these principles within Drupal.

The SEO Magic of Structured Data

Katherine White (McCann)

Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement. Narasimhan and Avinash highlight the architecture, lessons learned, and the challenges that were overcome on both the business and technology fronts. The analytics platform is designed as a framework to enable self-service data intake, data processing, and report/model generation by the business users. The data-driven framework consists of a distributed hybrid-cloud data ingestor for data intake and a Cloudera CDH cluster with Spark as the distributed compute engine. The solution is built in such a way that storage and compute have been decoupled and encourages the concept of BYOC (bring your own compute). The platform uses EC2 instances to run CDH and leverages Amazon S3 as a data warehouse storage layer (data lake), Spark as an ETL engine, and Spark SQL as a distributed query engine. Results (computations/derived tables) are exposed to the end users via Spark SQL and are discovered via Tableau. The platform supports both batch and streaming use cases and is built on the following technology stack: AWS (S3, EC2, SQS, SNS), Cloudera CDH (YARN, Navigator, Sentry), Spark, Kafka, Spark SQL, and Spark Streaming.

Strata+Hadoop World NY 2016 - Avinash Ramineni

Avinash Ramineni

The Kasabi Information Marketplace

Knud Möller

Big Data, Big Dream

Wayne Weixin

Big Data Architecture For enterprise

Wei Zhang

Machine Learning on the Microsoft Stack

Lynn Langit

AuditBucket - an introduction and overview

AuditBucket

As data sets continue to grow, search remains a key technology for many applications. But what is the current state of the enterprise search market? Which providers are gaining market share, and what are the latest developments and innovations? Based on experience from dozens of recent search projects using a range of technologies, this presentation will summarize market conditions, discuss current best practices for creating great search systems, and suggest some future trends to watch out for.

The Enterprise Search Market in a Nutshell

Dr. Haxel Consult

Data mining tools overall

Mohamed Sharique Vellikan

Pentaho Analytics on MongoDB

Mark Kromer

Дмитрий Лавриненко "Big & Fast Data for Identity & Telemetry services"

Fwdays

All about documents in o365 - aOS Brussels 2016/12/05

Sébastien Paulet

Chezkie Kasnett, Digital Projects Manager, IT Division, The National Library of Israel The National Library of Israel is leading a national project to deploy an online centralized network of historical archives with cultural heritage material. The objectives of the project are to provide free, on-line access to hundreds of historical archives and their holdings through cataloging and digitization, and secondly, the long-term digital preservation of these holdings.

Implementing an AMS on a National Scale – A Model

Axiell ALM

Lju Lazarevic

Connected Data World

AzureDay - Introduction Big Data Analytics.

Łukasz Grala

Navigating the World of User Data Management and Data Discovery

DataWorks Summit/Hadoop Summit

II-SDV 2015, 20 - 21 April, in Nice

Dr. Haxel Consult

Дмитрий Попович "How to build a data warehouse?"

Fwdays

La actualidad más candente (20)

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)

Introduction to basic data analytics tools

The SEO Magic of Structured Data

Strata+Hadoop World NY 2016 - Avinash Ramineni

The Kasabi Information Marketplace

Big Data, Big Dream

Big Data Architecture For enterprise

Machine Learning on the Microsoft Stack

AuditBucket - an introduction and overview

The Enterprise Search Market in a Nutshell

Data mining tools overall

Pentaho Analytics on MongoDB

Дмитрий Лавриненко "Big & Fast Data for Identity & Telemetry services"

All about documents in o365 - aOS Brussels 2016/12/05

Implementing an AMS on a National Scale – A Model

Lju Lazarevic

AzureDay - Introduction Big Data Analytics.

Navigating the World of User Data Management and Data Discovery

II-SDV 2015, 20 - 21 April, in Nice

Дмитрий Попович "How to build a data warehouse?"

Similar a Hydra - Content Processing Framework for Search Driven Solutions

Presented by Joel Westberg, Findwise AB - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 This presentation will detail the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, outlining that there is a missing link between the open source chain to bridge the gap between source system and the search engine, then will move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands on flexibility, robustness and ease of use. This session will end by discussing some of the possibilities that this new pipeline framework can offer, such as freely seamlessly scaling up the solution during peak loads, metadata enrichment as well as proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.

Introducing Hydra – An Open Source Document Processing Framework

lucenerevolution

Cassandra eu

Jeremy Hanna

Architecting Your First Big Data Implementation

Adaryl "Bob" Wakefield, MBA

Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.

How to use Big Data and Data Lake concept in business using Hadoop and Spark...

Institute of Contemporary Sciences

Slides for the talk at AI in Production meetup: https://www.meetup.com/LearnDataScience/events/255723555/ Abstract: Demystifying Data Engineering With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood. In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.

Demystifying data engineering

Thang Bui (Bob)

Big Data Strategy for the Relational World

Andrew Brust

Big data and hadoop

Prashanth Yennampelli

Designing and Implementing Search Solutions

Findwise

Nashville analytics summit aug9 no sql mike king dell v1.5

Mike King

Apache drill

MapR Technologies

In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages. Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications. Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.

Hadoop and the Data Warehouse: When to Use Which

DataWorks Summit

Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...

Flink Forward

Get A Head on Your Repository

eosadler

When to Use MongoDB

MongoDB

Choosing the Right Big Data Tools for the Job - A Polyglot Approach

DATAVERSITY

Modern big data solutions often incorporate Hadoop as one of the components and require the integration of Hadoop with other components including Oracle Database. This presentation explains how Hadoop integrates with Oracle products focusing specifically on the Oracle Database products. Explore various methods and tools available to move data between Oracle Database and Hadoop, how to transparently access data in Hadoop from Oracle Database, and review how other products, such as Oracle Business Intelligence Enterprise Edition and Oracle Data Integrator integrate with Hadoop.

Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...

Alex Gorbachev

Hadoop as data refinery

Steve Loughran

Apache Hadoop is often described as a "Big Data Platform" but what does that mean? One way to better understand Hadoop is to talk about how Hadoop is used. This talk discusses using Hadoop as a "Data Refinery", which is a common use case. The concept is very much like a traditional oil refinery except with data, pulling in large quantities of "crude data" over pipelines, refining some into useful business intelligence; refining other pieces into slightly less crude data that stays in the cluster until needed later. This metaphor proves useful when considering how Hadoop could be adopted in an organisation that already has data warehousing and business intelligence systems -and when contemplating how to hook up a Hadoop cluster to the sources of data inside and outside that organisation. A key point to remember is that storing data in Hadoop is not a means to an end any more than storing data in a database is: it is extracting information from that data. Using Hadoop as a front end "data refinery" means that it can integrate with existing Business Intelligence systems, while providing the platform for new applications.

Hadoop as Data Refinery - Steve Loughran

JAX London

The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data has also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems. This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R. Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.

Big Data Technologies and Why They Matter To R Users

Adaryl "Bob" Wakefield, MBA

Foxvalley bigdata

Tom Rogers

Más de Findwise

White Arkitekter - Findability Day Roadshow 2017

Findwise

AI och maskininlärning - Findability Day Roadshow 2017

Findwise

De kognitiva eran med IBM Watson - Findability Day Roadshow 2017

Findwise

Findwise and IBM Watson

Findwise

Findability Day 2016 - Enterprise Search and Findability Survey 2016

Findwise

Findability Day 2016 - Enterprise Search and Findability Survey 2016

Findwise

Findability Day 2016 - Big data analytics and machine learning

Findwise

Findability Day 2016 - Enterprise social collaboration

Findwise

Findability Day 2016 - SKF case study

Findwise

Findability Day 2016 - Structuring content for user experience

Findwise

Findability Day 2016 - Augmented intelligence

Findwise

Findability Day 2016 - What is GDPR?

Findwise

Findability Day 2016 - Get started with GDPR

Findwise

Digital workplace och informationshantering i office 365

Findwise

Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...

Findwise

Findability Day 2015 - Abby Covert - Keynote - How to make sense of any mess

Findwise

Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...

Findwise

Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...

Findwise

Findability Day 2015 - Martin White - The future is search!

Findwise

Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...

Findwise

Más de Findwise (20)

White Arkitekter - Findability Day Roadshow 2017

AI och maskininlärning - Findability Day Roadshow 2017

De kognitiva eran med IBM Watson - Findability Day Roadshow 2017

Findwise and IBM Watson

Findability Day 2016 - Enterprise Search and Findability Survey 2016

Findability Day 2016 - Big data analytics and machine learning

Findability Day 2016 - Enterprise social collaboration

Findability Day 2016 - SKF case study

Findability Day 2016 - Structuring content for user experience

Findability Day 2016 - Augmented intelligence

Findability Day 2016 - What is GDPR?

Findability Day 2016 - Get started with GDPR

Digital workplace och informationshantering i office 365

Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...

Findability Day 2015 - Abby Covert - Keynote - How to make sense of any mess

Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...

Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...

Findability Day 2015 - Martin White - The future is search!

Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...

Último

Angeliki Cooney has spent over twenty years at the forefront of the life sciences industry, working out of Wynantskill, NY. She is highly regarded for her dedication to advancing the development and accessibility of innovative treatments for chronic diseases, rare disorders, and cancer. Her professional journey has centered on strategic consulting for biopharmaceutical companies, facilitating digital transformation, enhancing omnichannel engagement, and refining strategic commercial practices. Angeliki's innovative contributions include pioneering several software-as-a-service (SaaS) products for the life sciences sector, earning her three patents. As the Senior Vice President of Life Sciences at Avenga, Angeliki orchestrated the firm's strategic entry into the U.S. market. Avenga, a renowned digital engineering and consulting firm, partners with significant entities in the pharmaceutical and biotechnology fields. Her leadership was instrumental in expanding Avenga's client base and establishing its presence in the competitive U.S. market.

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...

Angeliki Cooney

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Product Anonymous

Dubai, known for its towering skyscrapers, luxurious lifestyle, and relentless pursuit of innovation, often finds itself in the global spotlight. However, amidst the glitz and glamour, the emirate faces its own set of challenges, including the occasional threat of flooding. In recent years, Dubai has experienced sporadic but significant floods, disrupting normalcy and posing unique challenges to its infrastructure. Among the critical nodes in this bustling metropolis is the Dubai International Airport, a vital hub connecting the world. This article delves into the intersection of Dubai flood events and the resilience demonstrated by the Dubai International Airport in the face of such challenges.

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf

Orbitshub

Exploring Multimodal Embeddings with Milvus

Zilliz

💉💊+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI}}+971581248768 +971581248768 Mtp-Kit (500MG) Prices » Dubai [(+971581248768**)] Abortion Pills For Sale In Dubai, UAE, Mifepristone and Misoprostol Tablets Available In Dubai, UAE CONTACT DR.Maya Whatsapp +971581248768 We Have Abortion Pills / Cytotec Tablets /Mifegest Kit Available in Dubai, Sharjah, Abudhabi, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE, Buy cytotec in Dubai +971581248768''''Abortion Pills near me DUBAI | ABU DHABI|UAE. Price of Misoprostol, Cytotec” +971581248768' Dr.DEEM ''BUY ABORTION PILLS MIFEGEST KIT, MISOPROTONE, CYTOTEC PILLS IN DUBAI, ABU DHABI,UAE'' Contact me now via What's App…… abortion Pills Cytotec also available Oman Qatar Doha Saudi Arabia Bahrain Above all, Cytotec Abortion Pills are Available In Dubai / UAE, you will be very happy to do abortion in Dubai we are providing cytotec 200mg abortion pill in Dubai, UAE. Medication abortion offers an alternative to Surgical Abortion for women in the early weeks of pregnancy. We only offer abortion pills from 1 week-6 Months. We then advise you to use surgery if its beyond 6 months. Our Abu Dhabi, Ajman, Al Ain, Dubai, Fujairah, Ras Al Khaimah (RAK), Sharjah, Umm Al Quwain (UAQ) United Arab Emirates Abortion Clinic provides the safest and most advanced techniques for providing non-surgical, medical and surgical abortion methods for early through late second trimester, including the Abortion By Pill Procedure (RU 486, Mifeprex, Mifepristone, early options French Abortion Pill), Tamoxifen, Methotrexate and Cytotec (Misoprostol). The Abu Dhabi, United Arab Emirates Abortion Clinic performs Same Day Abortion Procedure using medications that are taken on the first day of the office visit and will cause the abortion to occur generally within 4 to 6 hours (as early as 30 minutes) for patients who are 3 to 12 weeks pregnant. When Mifepristone and Misoprostol are used, 50% of patients complete in 4 to 6 hours; 75% to 80% in 12 hours; and 90% in 24 hours. We use a regimen that allows for completion without the need for surgery 99% of the time. All advanced second trimester and late term pregnancies at our Tampa clinic (17 to 24 weeks or greater) can be completed within 24 hours or less 99% of the time without the need surgery. The procedure is completed with minimal to no complications. Our Women's Health Center located in Abu Dhabi, United Arab Emirates, uses the latest medications for medical abortions (RU-486, Mifeprex, Mifegyne, Mifepristone, early options French abortion pill), Methotrexate and Cytotec (Misoprostol). The safety standards of our Abu Dhabi, United Arab Emirates Abortion Doctors remain unparalleled. They consistently maintain the lowest complication rates throughout the nation. Our Physicians and staff are always available to answer questions and care for women in one of the most difficult times in their lives. The decision to have an abortion at the Abortion Cl

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@

Artificial Intelligence Chap.5 : Uncertainty

Khushali Kathiriya

The microservices honeymoon is over. When starting a new project or revamping a legacy monolith, teams started looking for alternatives to microservices. The Modular Monolith, or 'Modulith', is an architecture that reaps the benefits of (vertical) functional decoupling without the high costs associated with separate deployments. This talk will delve into the advantages and challenges of this progressive architecture, beginning with exploring the concept of a 'module', its internal structure, public API, and inter-module communication patterns. Supported by spring-modulith, the talk provides practical guidance on addressing the main challenges of a Modultith Architecture: finding and guarding module boundaries, data decoupling, and integration module-testing. You should not miss this talk if you are a software architect or tech lead seeking practical, scalable solutions. About the author With two decades of experience, Victor is a Java Champion working as a trainer for top companies in Europe. Five thousands developers in 120 companies attended his workshops, so he gets to debate every week the challenges that various projects struggle with. In return, Victor summarizes key points from these workshops in conference talks and online meetups for the European Software Crafters, the world’s largest developer community around architecture, refactoring, and testing. Discover how Victor can help you on victorrentea.ro : company training catalog, consultancy and YouTube playlists.

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024

Victor Rentea

MS Copilot expands with MS Graph connectors

Nanddeep Nachan

Scaling API-first – The story of a global engineering organization Ian Reasor, Senior Computer Scientist - Adobe Radu Cotescu, Senior Computer Scientist - Adobe Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

apidays

Accelerating FinTech Innovation: Unleashing API Economy and GenAI Vasa Krishnan, Chief Technology Officer - FinResults Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

apidays

[BuildWithAI] Introduction to Gemini.pdf

Sandro Moreira

Elevate Developer Efficiency & build GenAI Application with Amazon Q

Bhuvaneswari Subramani

FWD Group - Insurer Innovation Award 2024

The Digital Insurer

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

MadyBayot

Retrieval augmented generation (RAG) is the most popular style of large language model application to emerge from 2023. The most basic style of RAG works by vectorizing your data and injecting it into a vector database like Milvus for retrieval to augment the text output generated by an LLM. This is just the beginning. One of the ways that we can extend RAG, and extend AI, is through multilingual use cases. Typical RAG is done in English using embedding models that are trained in English. In this talk, we’ll explore how RAG could work in languages other than English. We’ll explore French, Chinese, and Polish.

Introduction to Multilingual Retrieval Augmented Generation (RAG)

Zilliz

Discover the innovative features and strategic vision that keep WSO2 an industry leader. Explore the exciting 2024 roadmap of WSO2 API management, showcasing innovations, unified APIM/APK control plane, natural language API interaction, and cloud native agility. Discover how open source solutions, microservices architecture, and cloud native technologies unlock seamless API management in today's dynamic landscapes. Leave with a clear blueprint to revolutionize your API journey and achieve industry success!

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2

presentation ICT roal in 21st century education

jfdjdjcjdnsjd

Understanding the FAA Part 107 License ..

Christopher Logan Kennedy

Dubai, often portrayed as a shimmering oasis in the desert, faces its own set of challenges, including the occasional threat of flooding. Despite its reputation for opulence and modernity, the emirate is not immune to the forces of nature. In recent years, Dubai has experienced sporadic but significant floods, testing the resilience of its infrastructure and communities. Among the critical lifelines in this bustling metropolis is the Dubai International Airport, a bustling hub that connects the city to the world. This article explores the intersection of Dubai flood events and the resilience demonstrated by the Dubai International Airport in the face of such challenges.

Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...

Orbitshub

Hydra - Content Processing Framework for Search Driven Solutions

2. About Findwise • Founded in 2005 • Offices in Sweden, Denmark, Norway and Poland • 80 employees (April 2012) • Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value

4. Technology independent Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:  Autonomy IDOL  Microsoft (SharePoint and FAST Search products)  Google GSA  IBM ICA/OmniFind  LucidWorks  Apache Lucene/Solr  Elastic Search  and more…

5. Generic Search Architecture

6. Connecting source to search Garbage in, garbage out. But what about unstructured data in? • Flat data is richer than it appears • Don’t discard information too soon! The unstructured structured data paradox Example: News articles Plain text that contains invaluable metadata for search, such as: • Title • Author byline • Lead paragraph

7. Enrichment and structuring possibilities • Enrich your documents with metadata, to power your search • Language detection • Sentiment analysis • Headline extraction • Regular expression matching and extraction • Filter out unwanted documents • Collect statistics • Export to Staging environments

8. Classic Pipeline

9. Classic Architecture

10. The Hydra Architecture

11. Main Design Objectives Scalability • Horizontally scalable central repository • Independent processing nodes Failiure tolerant • Failiure of a stage affects only a single document • Failiure of a node affects at most n documents • Failiures can be automaticly detected Robustness • Independent stages Development ease • Debug stages from IDE against actual data • Allow test driven pipeline development

12. The Hydra Architecture

13. Hadoop/Big Data integration Usecases for document enrichment • Pagerank • Analytics Hadoop & Map/Reduce advantages • Huge scalability • Ability to work on entire document set at once Hadoop & Map/Reduce drawbacks • Batch processing • Time-to-index

14. Hadoop/Big Data integration Blue – First round of indexing only Red – Second round of indexing Purple – All documents

15. Future Configuration UI

16. Open Source initiative • Other committers • The role of Findwise For more information: • http://www.findwise.com/hydra • http://findwise.github.com/Hydra • Email: joel.westberg@findwise.com

17. Questions?

18. Thank Joel Westberg joel.westberg@findwise.com you!

Notas del editor

Findwise is a consultingcompany, focused on deliveringFindability solutions toourcustomers.
That’s where we come from, now let’s talk about whatLet’s start off with a view of how Apache Solr in the middle, Manifold CF on the far left
In examples: “Post the XML file to the DataImportHandler, and now you’ve indexed wikipedia!”In reality: “Here’s a jumbled heap of files, and in these documents when it says “product id: XYZ”, that’s the most important thing to index”
Today we use OpenPipeline.One stage might extract references from the text and store them in separate fieldsOne might extract entities from the textOne might be parsing the pdf itself.
Also forking is a problem etcetc
Core is the framework itself, e.g. “java –jar hydra-jar-with-dependencies.jar” to launch Hydra.
Light blue – all documents prior to being exported to hbasePurple – All documentsRed – all documents that have been reinserted from hbase
Give special thanks to Johannes

Hydra - Content Processing Framework for Search Driven Solutions

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Hydra - Content Processing Framework for Search Driven Solutions

Similar a Hydra - Content Processing Framework for Search Driven Solutions (20)

Más de Findwise

Más de Findwise (20)

Último

Último (20)

Hydra - Content Processing Framework for Search Driven Solutions

Notas del editor