Hello everyone, and welcome to today's webinar: an Introduction to Amazon CloudSearch. Before I get started, I want to mention that you have joined the webinar muted, but throughout the webinar you can submit questions at any time using the Question Panel of your GoToMeeting control panel. We'll be answering as many questions as we can at the end of the presentation. GO TO NEXT SLIDE
So what is Amazon CloudSearch? Here’s a summary. It’s a fully managed, full-featured search service that runs in the AWS cloud. It scales automatically as your data and traffic fluctuate. It can handle both structured and unstructured data and supports near-real-time indexing of your documents. It’s designed to be up and running in less than an hour. It’s built by Amazon A9.com and leverages 10+ years of R&D in search.
For the audience members who are not already familiar with search: search engines automate finding a particular item (a document) in a collection of documents. What goes on under the covers?
Search engines let you retrieve information from a large collection by matching the terms you’re curious about against each of the items in the collection. An encyclopedia is a good example of a large collection of items: the articles. Let’s say you’re looking for references to US presidents in an encyclopedia. Each article in the encyclopedia is a possible match. You could read from the beginning and examine each one, but you would examine many articles that are not related to what you are looking for. Chapters or volumes are a way to limit what you look at, but there’s an even faster way.
Another way is to create an index, like the one in a book. An index contains the list of all the important terms that appear in all the articles. For each term, the index lists all of the articles that contain that term. For instance, the index entry for “president” would contain the page numbers of the articles for Washington, Jefferson, Adams, and so on. Now you can look up presidents much more efficiently: just go to the entry for “president” and get the list of all the page numbers. Let’s say that you wanted to find presidents who had lived in Virginia. You could use the index to find “president” and then read all those pages to find the word “Virginia”. A better way would be to look up “president” and write down the list of pages, then look up “Virginia” and write down that list of pages. If a page number appears on both lists (“president” AND “Virginia”), then that is the place to look. When you send a query to a search engine, it does much the same thing.
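To make the two-list lookup concrete, here is a minimal Python sketch of a book-style index. It is an illustration only, not how CloudSearch is implemented; the page numbers, article text, and the function names build_index and search_all are invented for this example.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lowercase term to the set of page numbers whose text contains it."""
    index = defaultdict(set)
    for page, text in articles.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "Virginia." is indexed as "virginia".
            index[word.strip(".,;:!?\"'()")].add(page)
    return index


def search_all(index, *terms):
    """Return the sorted pages that appear on the list for every term (an AND query)."""
    page_lists = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*page_lists)) if page_lists else []


# Hypothetical pages and text, invented for illustration.
articles = {
    12: "George Washington, first president, lived in Virginia.",
    48: "John Adams, second president, lived in Massachusetts.",
    77: "Thomas Jefferson, third president, lived in Virginia.",
    90: "Virginia is a state on the Atlantic coast.",
}

index = build_index(articles)
print(search_all(index, "president", "virginia"))  # pages on both lists: [12, 77]
```

The index is built once, and each query then only touches the short page lists for its terms instead of reading every article.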
Of course, we’ve all come to expect more from our search experience than strict matching. Let’s have a look at Amazon.com to see the kind of features that we expect.
We need to order the match set so that the most useful results are at the top. We need to sort by relevance.
Each document has metadata attributes that help users navigate to that document. On Amazon.com, you’ll find counts for these facet values in the left rail of the page. For instance, searching for “us president” gives us a range of genres we can use to narrow the results.
In addition, it’s useful to be able to search ranges of integers, like prices or user ratings.
Fielded searching allows us to narrow our results based on the values of particular document attributes.
We are able to combine different field values and use Boolean logic to build complex queries.
Finally, documents are sorted based on more complex aspects of the query and user behavior. Values for popularity or clicks are added in to the relevance computation. All of these features provide users with a rich search experience that gets them to the items they want quickly.
At this point we’ll be diving into Amazon CloudSearch. We’ll be walking through the service and how to use it. What you’ll see is a full-featured search engine in action that’s easy to set up and use. I’ve listed out some of the search features we’ll cover. We’ll also cover how to interact with CloudSearch, including your HTTP endpoints, the console, command-line tools, and the search and document service APIs.
When you sign up for CloudSearch, you set up a search domain. This wraps a search engine in three RESTful endpoints, shown at the bottom of the diagram. You send document batches to the document service. You send search requests to the search service. You use the configuration service to configure the indexing of your documents, access to your domain, text processing options, and more. The endpoints are stable DNS entries that stay the same as your domain scales. As a search developer, you will interact with all of these endpoints through different modes. CloudSearch has a console within the AWS console. You can also interact with CloudSearch from the command line. Finally, you can work directly with the API. Your application will primarily interact with the search service. It may also send updates directly to the document service.
We’re going to walk through an example in some depth. We’ll create a search domain, upload documents, configure the domain, and run some searches.
As we work through the process, we’ll primarily be showcasing the console. The CloudSearch console is designed to simplify the process of interacting with your domain. It hides complexity behind a convenient UI. It is also the main portal for managing and monitoring the status of your domain. You’ll find your endpoints here, as well as your domain’s sizing information. The first time you log into the console, you’ll be greeted with a welcome screen with an invitation to create your domain. Selecting create domain will bring up a wizard that allows you to name your domain, preconfigure it with sample documents from disk or S3, and upload those documents.
CloudSearch will take several minutes to create the domain. We can see here a snapshot of the CloudSearch dashboard while it’s processing the domain creation. The dashboard shows you the current number of searchable documents, how many fields you have configured, the number of instances and partitions you are currently using, and the endpoints for the document and search services.
When your domain finishes initializing, you will upload documents for searching. CloudSearch uses a standard syntax for encoding your documents, but more on that in a second. There are 3 main methods for sending documents to CloudSearch. You can upload a small batch of documents through the console. You can use the cs-post-sdf command-line tool, specifying a source location. The source can be an S3 object (an SDF file) or a local file on disk. You can also post your documents directly to your endpoint, either with a utility like curl or, from your application, with standard libraries for HTTP transport.
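As a rough sketch of the last method, here is how an application might prepare an HTTP POST of a document batch using only Python's standard library. The hostname is a placeholder, and the /2011-02-01/documents/batch path, the function name build_batch_request, and the sample field values are assumptions made for this illustration; check the endpoint and path shown for your own domain before relying on them.

```python
import json
import urllib.request


def build_batch_request(doc_endpoint, batch, api_version="2011-02-01"):
    """Prepare (but do not send) a POST of a JSON document batch.

    The URL layout is an assumption for this sketch, not taken from the talk.
    """
    url = f"http://{doc_endpoint}/{api_version}/documents/batch"
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# A one-document batch with invented values.
batch = [
    {
        "type": "add",
        "id": "doc_0001",
        "version": 1,
        "lang": "en",
        "fields": {"title": "An Example Title"},
    }
]

# Placeholder hostname; substitute your domain's document service endpoint.
request = build_batch_request("doc-example.invalid", batch)
print(request.get_method(), request.full_url)

# Sending is left commented out because the placeholder host does not exist:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```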
Search documents are the heart of search. You send documents to your document service, and you retrieve documents from your search service. SDF (Search Data Format) is the proprietary format for representing search documents, in either XML or JSON. Here’s a JSON example. If you don’t know JSON syntax: the square brackets specify an array (we’ll come back to that in a minute), and the curly braces specify an object. The outermost braces enclose a single search document. The object contains a set of properties, each with a name and a value separated by a colon. Let’s look at the properties. The id and version identify this search document. The type specifies whether to add or delete the document. The lang specifies the language of the document (English only). The fields property has a value which is also an object, with a set of named properties. The fields contain the data from your search problem. We have a title, an author, a year… SDF files that you send to CloudSearch are batches, signified by the enclosing array markers. A guide to SDF is coming.
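For readers without the slide in front of them, here is a small Python sketch that assembles a JSON batch with the properties just described (type, id, version, lang, fields). The helper name make_add_operation and every field value are invented for illustration, and the exact property spellings should be checked against the SDF guide.

```python
import json


def make_add_operation(doc_id, version, fields, lang="en"):
    """Build one search document: a JSON object holding the properties described above."""
    return {
        "type": "add",       # add (rather than delete) the document
        "id": doc_id,        # id and version together identify the document
        "version": version,
        "lang": lang,        # English only, per the talk
        "fields": fields,    # the data you want to search
    }


# A batch is an array of documents, hence the enclosing list.
batch = [
    make_add_operation(
        "book_0001",
        1,
        {"title": "A Short History of US Presidents", "author": "Jane Example", "year": 2011},
    ),
]

sdf_text = json.dumps(batch, indent=2)
print(sdf_text)
```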
As you send data, it’s reflected in the search results in near real time with no additional effort. As you send data, CloudSearch will automatically scale to handle that data. Search instances are individual hosts, each with the capacity to store a certain number of documents for retrieval. CloudSearch is RAM-based, which keeps query latency low. CloudSearch automatically adds search instances as your data grows; you don’t have to do partitioning yourself. CloudSearch will add up to 10 partitions, handling tens of millions of documents. CloudSearch will scale larger than that if you have a bigger need; please contact us.
Once your SDF is defined, you can configure your domain with the indexing options that you want for your fields. Using the console or command-line tools, you can use a sample of (or all of) your documents to auto-configure your domain. You’ll get a proposed configuration. Once you have sent an initial configuration, you can easily update it in the domain’s dashboard on the AWS console. You can also update it with the command-line tools or directly with an API call to the configuration service. CloudSearch supports customization of each field, as we’ll see now.
Here you can see a snapshot of the dashboard’s Indexing Options panel. Down the left side you can see the fields that are configured, and across each row the configuration options set for that field. Each field shows a status. These fields are all Active, which means that the configuration shown is the one that is deployed to your domain. When you make changes to your fields’ configurations, the status will move to “Needs Indexing”. This lets you know that CloudSearch must build your changes into your domain for them to become active. If you have any fields that need indexing, the console will show an additional link to “Run Indexing”.
Each field has a type, and there are 3 types available. Text: text fields are processed to extract tokens from the data for individual matching. In the US presidents example, we would process each article with its contents as a text field, so that terms in the query could match within the body of the article. Text processing also includes applying stemming, stopwords, and synonyms to each token in the data for the field. Literal: literal fields are not processed as text. Instead, the entire content is put in the index as-is for exact matching. UINT: uint fields are processed as 32-bit unsigned integers. You can use uints for range searching as well as a source for custom relevance calculations.
In addition to its type, each field has 3 options you can set, depending on type. Search: literal fields can be search enabled to allow searching them directly. Facet: fields can be facet enabled to allow retrieval of value counts for the field across the entire result set. Result: enabling result on a field means that CloudSearch will store up to 2 KB of the field’s value in the index. Queries can then request that the value be returned in the results. Careful, though: enabling result can dramatically grow your index size.
You can set a default value for each field. Any document that does not contain a value for that field will receive the default value instead. You can also set field sources. This allows you to build fields out of sets of other fields for custom searching.
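To illustrate the difference between text and literal fields, here is a toy Python sketch. It is not CloudSearch's actual text pipeline: the stopword list, the stemming and synonym dictionaries, the order of the steps, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical dictionaries, invented for illustration.
STOPWORDS = {"a", "an", "and", "in", "of", "the"}
STEMS = {"presidents": "president", "lived": "live"}
SYNONYMS = {"potus": "president"}


def text_tokens(value):
    """Text field: split into tokens, drop stopwords, then apply stems and synonyms."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", value.lower()):
        if token in STOPWORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        tokens.append(token)
    return tokens


def literal_tokens(value):
    """Literal field: the entire value is indexed as-is, for exact matching only."""
    return [value]


print(text_tokens("Presidents of the United States"))  # ['president', 'united', 'states']
print(literal_tokens("US History"))                    # ['US History']
```

A query for “president” would match the text field above, while the literal field matches only the exact value “US History”.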
In addition to field configurations, you can set domain-wide indexing options. CloudSearch provides the ability to create a custom relevance function using a simple expression syntax that employs arithmetic operators and lets you pull values from each document’s uint fields. In this way, you can mix fields like popularity with the text relevance for each document. CloudSearch also lets you configure the processing of tokens in your text fields. You can upload custom stemming and synonym dictionaries as well as define your own stopwords.
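The idea of mixing popularity into relevance can be sketched in a few lines of Python. This mimics the arithmetic only; it is not CloudSearch's rank expression syntax, and the weight of 10, the scores, and the document ids are invented for illustration.

```python
def rank_score(text_relevance, popularity, popularity_weight=10):
    """Combine a text relevance score with a uint popularity field using plain arithmetic."""
    return text_relevance + popularity_weight * popularity


# Hypothetical matches with invented scores.
matches = [
    {"id": "a", "text_relevance": 200, "popularity": 3},
    {"id": "b", "text_relevance": 180, "popularity": 9},
    {"id": "c", "text_relevance": 250, "popularity": 0},
]

ranked = sorted(
    matches,
    key=lambda doc: rank_score(doc["text_relevance"], doc["popularity"]),
    reverse=True,
)
print([doc["id"] for doc in ranked])  # ['b', 'c', 'a']
```

Document b has the weakest text match but rises to the top because its popularity outweighs the difference.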
It’s easy to integrate CloudSearch searching into your application. CloudSearch offers its functionality as RESTful services, giving you an easy way to send searches and get results. With the full-featured query language you can perform simple to complex queries, specify ranking and sorting, control pagination, and retrieve facets and result fields. The query itself is specified in the URL parameters.
The console offers a simplified way to run test searches against your data. You can perform text searches, change your sorting criteria using text relevance or your documents’ fields, and view facet counts for, and filter with, all of the fields you have facet enabled.
We won’t detail the full query API, but here are some examples of common search tasks. The simplest searches to run are full text searches, specified by the q= parameter. Here I’ve written out the full URL including the endpoint and path. For the rest of the examples I won’t show that part. You can perform more complex queries with the bq= syntax. In this example we search for president in the title field and with a genre attribute of history. CloudSearch supports and, or, not, and additional nesting of expressions with parentheses. To retrieve counts for a facet enabled field, you specify the facet= parameter with a comma-separated list of fields. CloudSearch also lets you control which facet values are returned with counts and the sorting of the facets that are returned. You can specify the ranking function to use with the rank= parameter. You can use the text_relevance function to sort by relevance, a rank expression you’ve defined, or field values on your documents to get alphabetic or numeric sorting.
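Since the slide URLs are not reproduced in these notes, here is a hedged Python sketch of how such search URLs could be assembled. The q, bq, facet, and rank parameter names come from the talk; the hostname is a placeholder, and the /2011-02-01/search path and the exact bq expression syntax are assumptions to verify against the query API documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(search_endpoint, params, api_version="2011-02-01"):
    """Return a search URL with the whole query carried in URL parameters."""
    return f"http://{search_endpoint}/{api_version}/search?{urlencode(params)}"


ENDPOINT = "search-example.invalid"  # placeholder for your search service endpoint

# Full text search.
full_text = build_search_url(ENDPOINT, {"q": "us president"})

# Boolean query: president in the title field AND a genre of history (assumed syntax).
fielded = build_search_url(ENDPOINT, {"bq": "(and title:'president' genre:'history')"})

# Facet counts for two fields, sorted by the value of a (hypothetical) year field.
faceted = build_search_url(
    ENDPOINT, {"q": "president", "facet": "genre,year", "rank": "year"}
)

for url in (full_text, fielded, faceted):
    print(url)

# urlencode percent-escapes the expression; decoding the query string recovers it.
decoded = parse_qs(urlsplit(fielded).query)
```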
To retrieve the source data from your documents’ fields, you specify the return-fields= parameter. In this example, CloudSearch will include the values for title, actor, and director for each document returned. You can paginate your results by specifying a start and a size parameter. In this case we will get 20 results starting at the 200th. You can search for an integer range, either open-ended or within a specific set of values, using the .. syntax. There are many more features that we haven’t covered, but this should give you a feel for some of the most common uses.
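A short Python sketch of the pagination and range pieces follows. The return-fields, start, and size names and the .. range notation come from the talk; the helper names, the year field, the use of bq to carry the range, and the choice to count pages from zero are assumptions for this illustration.

```python
from urllib.parse import urlencode


def page_params(page_number, page_size):
    """Translate a page number (counted from 0 in this sketch) into start and size."""
    return {"start": page_number * page_size, "size": page_size}


def uint_range(field, low=None, high=None):
    """Format an integer range using the .. syntax; omit one bound for an open-ended range."""
    if low is None and high is None:
        raise ValueError("a range needs at least one bound")
    low_text = "" if low is None else str(low)
    high_text = "" if high is None else str(high)
    return f"{field}:{low_text}..{high_text}"


params = {
    "bq": uint_range("year", 2000, 2011),      # hypothetical uint field named year
    "return-fields": "title,actor,director",
    **page_params(10, 20),                      # 20 results, offset 200
}
print(urlencode(params))
print(uint_range("year", low=2000))   # year:2000..
print(uint_range("year", high=2011))  # year:..2011
```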
We have already discussed how CloudSearch scales for data; now let’s look at how CloudSearch scales for traffic. More traffic requires more CPU to handle that traffic. CloudSearch adds search instances to provide that CPU. CloudSearch removes rows to scale back excess capacity. Maximum scale: 50 instances, with a max of 10 wide on XL. There’s a contact us link to go larger, for customers with over 1 billion docs.
This diagram shows how CloudSearch scales in 2 directions, for both traffic and data. However, none of this scaling requires intervention on your part. CloudSearch adds partitions by reindexing on a parallel fleet that is swapped in with no down time. It adds instances as needed, again with no down time. This concludes our walk through creating a search domain. I’ll turn it back over to Puneet to discuss pricing.
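One way to picture the two directions is as a grid: the width is the number of partitions (driven by data) and the depth is the number of rows of instances (driven by traffic). The Python sketch below is only a mental model. The limits of 10 partitions and 50 instances come from the talk, but the per-partition and per-row capacities are invented numbers, and CloudSearch makes these decisions for you.

```python
import math

MAX_PARTITIONS = 10  # from the talk: up to 10 partitions wide
MAX_INSTANCES = 50   # from the talk: maximum scale of 50 instances


def plan_grid(doc_count, queries_per_second, docs_per_partition, qps_per_row):
    """Estimate (partitions, rows, instances) for a given data size and traffic level.

    docs_per_partition and qps_per_row are hypothetical capacities for this sketch.
    """
    partitions = max(1, math.ceil(doc_count / docs_per_partition))
    rows = max(1, math.ceil(queries_per_second / qps_per_row))
    instances = partitions * rows
    if partitions > MAX_PARTITIONS or instances > MAX_INSTANCES:
        raise ValueError("beyond the standard limits; the talk says to contact AWS")
    return partitions, rows, instances


# Invented workload: 25 million documents at 300 queries per second.
print(plan_grid(25_000_000, 300, docs_per_partition=5_000_000, qps_per_row=100))
# (5, 3, 15)
```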
Explain the pricing model and why it is the way it is. CloudSearch makes it easy to try your configuration. Go to the control panel to see your resources.
Who is using Amazon CloudSearch in production now? There’s a wide range of use cases: SmugMug for photographic images; Sage for bioinformatics and medical research; NewsRight for news licensing. Our partner Search Technologies has done a cool integration of CloudSearch for Wikipedia search.