2. Marianne Sweeny
Ascentium
www.ascentium.com
Marianne.sweeny@ascentium.com
Director of Search Services, Web producer
at Microsoft for 7+ years, pointy-head not
propeller-head
3. Agenda
Introduction
MOSS 2007 Search
Configuring MOSS Search
Here There Be Dragons
Resources
Appendix
5. There is No Magic Bullet
Susan Feldman (IDC) Enterprise Search Summit West 2008
– Employees average 3.5 hours/week searching
– Cost = $5000 per employee per year
There can be no “silver bullet” solution for finding information
– Customers don’t know what they don’t know
– “Google experience” is finding what they want/need in the first
few pages and not necessarily Google itself
– Enterprises have different lines of business and different
information types
Search of tomorrow is here today
– Personalized to the device and user
– Contextual
– Flexible
– Secure
– Adaptable
6. Search Index: A Different Kind of Database
Search Engine Index SQL Server Index
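The contrast on this slide can be illustrated with a toy inverted index: rather than rows looked up by key, a search index maps each term to the documents that contain it. This is a minimal Python sketch, not MOSS code; the function names and the sample corpus are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus.
docs = {
    1: "expense report template",
    2: "travel expense policy",
    3: "holiday travel schedule",
}
index = build_inverted_index(docs)
```

A lookup such as search(index, "travel expense") touches only the entries for those two terms, which is why this structure suits free-text queries better than a row-oriented table scan.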
7. Web Search and Enterprise Search
Web Search
– Publishers want their content to be found
– Anarchistic publishing model = “anyone, anywhere, any time”
– Unlimited document set
– No real standards or code, more like guidelines
– No central authority
– Spam
– Commercialization
– Technology is agnostic
– Has to work the same for everyone worldwide
– No shared understanding
Enterprise Search
– Publishers do not think about document discoverability
– Controlled corpus of documents
– Standards and practices in place
– No spam
– Users and authors generally share contextual understanding
– Customized tagging or metadata
– Can customize search technology to enterprise themes and concepts
“Successful enterprise search efforts target corpuses of information
and set search scopes appropriately. I&KM pros are wise to study
information worker context before trying to ‘Google-ize’ their
enterprises.” Forrester Search Wave Q2 2008
8. Advanced Search
Few customers use it and those that do are
disappointed
Boolean or SQL operators work sporadically
Confusing message
What is “regular” search…not as effective
Search has progressed beyond the stages
of Advanced
Filters
Facets
Context
9. MOSS 2007 Search
Query engine breaks the search terms down
Index engine stores the properties
Content index stores the text
10. Better Than Ever
MOSS 2007
– Relevance customizable to the enterprise content
– Automated metadata extraction
– Enhanced text analysis
– Fully integrated admin experience between Windows SharePoint Services v3 and MOSS 2007
– Single search system and index per server farm
– Custom content groups, Best Bets, scheduling are now shared services
– Scopes can be tied to document properties
– Improved control over indexing
SharePoint 2003
– Relevance keyed on numeric values derived solely from document text:
collection frequency, term frequency, document length, term position
– Different systems between Windows SharePoint Services and SharePoint Portal Server
– Multiple indexes
– Custom content groups, Best Bets, scheduling configurations are portal-based
– Scopes tied to content sources
– Index propagated at completion of master crawl only
11. Simplified Administration UI
Search settings page at the SSP level
Managing crawls
• Content sources
• Explicit SharePoint Content Source Type
• Content source for Business Data (Enterprise CAL)
Crawl logs
• Snapshot of crawled content in your index – lists all documents found
in the content source and their status
• Filters by date, site, etc.
• Summary by host name (number of successes, errors, and warnings)
Crawl rules
• Included and excluded rules
• Ability to pre-test crawl rules
• Easy to change order of crawl rules
Managing scopes
• Scopes decoupled from content sources
• Scopes can span multiple content sources
• Scope by Property, Site, Content Source, and URL
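The ordered include and exclude crawl rules on this slide can be sketched as a first-match rule list, which is one way to see why rule order matters and why pre-testing a URL is useful. This is an illustrative Python sketch only: the first-match-wins behavior, the wildcard syntax, and the sample URLs are assumptions, not the documented MOSS evaluation logic.

```python
from fnmatch import fnmatchcase


def should_crawl(url, rules, default=True):
    """Return the decision of the first rule whose pattern matches the URL.

    rules is an ordered list of (pattern, include) pairs; in this sketch
    the first matching rule wins, so order changes the outcome.
    """
    for pattern, include in rules:
        if fnmatchcase(url.lower(), pattern.lower()):
            return include
    return default


# Hypothetical rules: exclude an archive area, include the rest of the site.
rules = [
    ("http://intranet/archive/*", False),
    ("http://intranet/*", True),
]
```

Calling should_crawl with a candidate URL plays the role of the pre-test: it shows which way the current rule order would decide before any crawl runs.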
12. Indexing Performance Improvements
Search is a shared service
– Unified WSS and MOSS search for 1 index per SSP
– Crawls, content sources, crawl rules schema, shared scopes etc are
administered centrally at the shared service level
– Scopes and best bets can also be administered at the consuming sites
Crawl to small indexes that are then consolidated at scheduled times
into a “master merge”
Content index that holds text of pages with Property store that holds
other document values
Propagate data incrementally as it is being indexed to the query
servers
– Propagation starts within 30 seconds of the first shadow index written
– No need to wait till the end of the crawl for information to be available in
queries
– No propagation of properties
Single item add/removal without re-indexing entire corpus with
continuous propagation
– Change Log Crawl: detects what items have changed within a WSS or a
MOSS 2007 site and crawls only those items
– Security Change Only Crawl: no need to fully index all the content of a site
when permissions on this site have changed
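The idea behind a change log crawl can be sketched as follows: remember the last change token processed and apply only the entries that came after it. This is a simplified Python illustration under assumed data shapes (the tuple layout, token numbering, and dict-based index are hypothetical), not the WSS change log API.

```python
def incremental_crawl(index, change_log, last_token):
    """Apply only the changes recorded after last_token.

    change_log is a list of (token, action, item_id, text) tuples ordered by
    token; action is "update" or "delete". Returns the new token.
    """
    for token, action, item_id, text in change_log:
        if token <= last_token:
            continue  # already processed in an earlier crawl
        if action == "delete":
            index.pop(item_id, None)
        else:
            index[item_id] = text
        last_token = token
    return last_token


# Hypothetical index state and change log.
index = {"a": "old text", "b": "to be removed"}
change_log = [
    (1, "update", "a", "old text"),
    (2, "update", "a", "new text"),
    (3, "delete", "b", None),
    (4, "update", "c", "brand new item"),
]
token = incremental_crawl(index, change_log, last_token=1)
```

Only three of the four log entries are applied here, and a second run with the returned token does no work at all, which is the saving over a full re-crawl.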
13. Relevance: Types
Dynamic ranking = relevance impacted by query term
– Frequency
– Location in document
– Appearance in link text
– Appearance in URL
Static ranking = relevance independent of customer query
– URL Depth
– Click Distance
– Authority/Demoted site
– Change property weights
– Language of customer (browser setting)
– Document type: HTML files, PPT, Word docs, emails, XML files,
Excel spreadsheets, plain text, list items
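The split between dynamic and static ranking can be sketched by computing the two parts separately and adding them. The weights and formulas below are invented for illustration and are not the MOSS 2007 ranking function; the field names (url, title, body, click_distance) are assumptions as well.

```python
def dynamic_score(query_terms, doc):
    """Query-dependent part: term matches in body, title, and URL."""
    score = 0.0
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    url = doc["url"].lower()
    for term in query_terms:
        term = term.lower()
        score += body.count(term)           # frequency in the text
        score += 2.0 * title.count(term)    # location: title counts more
        score += 1.5 if term in url else 0  # appearance in the URL
    return score


def static_score(doc):
    """Query-independent part: shallow URLs and short click distance win."""
    path = doc["url"].split("://", 1)[-1]
    url_depth = path.rstrip("/").count("/")
    return 1.0 / (1 + url_depth) + 1.0 / (1 + doc["click_distance"])


def rank(query, docs):
    """Order documents by the sum of the dynamic and static scores."""
    terms = query.split()
    return sorted(
        docs,
        key=lambda d: dynamic_score(terms, d) + static_score(d),
        reverse=True,
    )


# Hypothetical documents with identical matching text.
docs = [
    {"url": "http://intranet/hr/archive/2003/memo", "title": "Old memo",
     "body": "vacation policy", "click_distance": 4},
    {"url": "http://intranet/hr", "title": "HR home",
     "body": "benefits and vacation policy", "click_distance": 1},
]
```

Both sample documents match the query terms equally often, so the static part decides the order: the shallower, better-connected page ranks first.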
14. Relevance: Enhancements
Manually assign synonyms and editorialized results to keywords
– Use search logs to detect popular searches, low click-
through from results or 0 result queries
Search Alerts
– User can subscribe to receive email when results change
File type filtering
– Some file types are deemed more relevant (e.g. HTML,
DOC) than others (XML, txt)
– Supports 220 file types from Microsoft and non-Microsoft applications
Property weights *
– Assign different weights to properties so that important
properties such as ‘Title’ have a bigger influence on
ranking
– Change default property weights through the Schema
Object Model
– Note: The weights used in the product were carefully
tested. Changes to the weights may also have a negative
effect on relevance
* Marcy Tobin wants me to tell you that this is not a trivial undertaking
15. MOSS 2007 Faceted Search
Facets are predetermined
content categories presented
to the customer to narrow
search results
• Can be presented pre- or post-query
• Used for Advanced search
Empowers customer to most
effectively refine their search
Filters results by
predetermined categories
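Faceted refinement as described on this slide comes down to two operations: counting results per category value, and filtering by the value the customer picks. This is a small Python sketch with hypothetical field names and sample results; it does not use the MOSS faceted search web parts.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results fall under each value of a facet field."""
    return Counter(r[field] for r in results if field in r)


def refine(results, field, value):
    """Keep only results whose facet field equals the chosen value."""
    return [r for r in results if r.get(field) == value]


# Hypothetical result set with two predetermined facets.
results = [
    {"title": "Q1 budget", "file_type": "xls", "department": "Finance"},
    {"title": "Budget memo", "file_type": "doc", "department": "Finance"},
    {"title": "Budget training", "file_type": "ppt", "department": "HR"},
]
```

The counts drive the facet list shown next to the results, and refine is what runs when the customer clicks one of the values.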
16. Federated Search
Import or export federated locations using Federated
Location Definition (.FLD) files
Incorporates results from outside content sources that
subscribe to OpenSearch 1.1
Passes the query into the subscribed resource and
returns results into single interface
Relevance calculation done according to originating
resource criteria, not MOSS 2007 criteria
Pre-defined FLD files found at
http://www.microsoft.com/enterprisesearch/connectors/federated.aspx#fscp
Can develop own FLD files if destination subscribes to
OpenSearch 1.1
– Day Software has developed a standard connector for LiveLink
ECM
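Passing the query to a federated location amounts to filling in a URL template. OpenSearch 1.1 templates use placeholders such as {searchTerms}, with a trailing ? marking optional parameters. The Python sketch below shows that substitution only; the example.com location and the function name are hypothetical, and a real FLD file carries more settings than a query template.

```python
import re
from urllib.parse import quote


def fill_template(template, search_terms, **optional):
    """Substitute OpenSearch-style {name} and {name?} placeholders."""
    values = {"searchTerms": search_terms, **optional}

    def replace(match):
        name, is_optional = match.group(1), match.group(2) == "?"
        if name in values:
            return quote(str(values[name]), safe="")
        if is_optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{(\w+)(\??)\}", replace, template)


# Hypothetical federated location; example.com is a placeholder host.
template = "http://example.com/search?q={searchTerms}&start={startIndex?}&format=rss"
url = fill_template(template, "enterprise search")
```

The query text is URL-encoded before it is sent, and the ranking of whatever comes back remains the remote source's own.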
17. People Search
Build and publish rich personal profiles
Customize personal profile attributes
Populate personal profiles using information from Active Directory, other
LDAP directories, or line-of-business systems
Control access to information using security and privacy controls
Generate and display organizational charts based on directory
information
Publish personal profiles using MOSS My Sites
Identify people who can help
Find people based on keyword matches with MOSS personal profiles
Find people in line-of-business systems
Filter results by common attributes such as Job Title or Department
Find “in-common” connections, including managers, site memberships,
distribution lists, and colleagues
Group results by social distance
Subscribe to People Alerts
18. People Search Results Page
Find people by project, expertise or…
Filter by relevant attributes
Contact information & online availability
19. LOB Applications with BDC
Extracts data from line-of-business, CRM, and other third-party data stores
Caches for indexing by search service
Searches any data source accessible through ADO.NET or Web Services
Uses Live Communication Server for connectivity options
Aggregated into a single application
20. FAST ESP Technology
FAST is a sophisticated search engine tailor-made for ecommerce and help
desk
Uses sophisticated linguistic processing
Searches structured and unstructured content
Indexing Process: Conversion-language detection-synonyms-spell check-
external call outs-entity extraction-categorization-vectorization-custom
navigation-normalizer-alerting-indexing
Why is it Unique
Auto Classification
Advanced Linguistics: text mining for
concept and relationship mapping
Recall: Lemmatization, synonym
expansion, wildcards, anti-phrasing,
phonetic search
Precision: Exact word matching,
exact phrase matching, proximity,
tokenization
Location aware results (retail and
news) – excellent for mobile search
Recommendation engine
Increased capacity: 100-200 million
documents on 1 server and 150
million q/second
21. Custom Results
Search Scopes
Allow users to refine search through filtering
Define content resources and map to business rules/key concepts
Focused content = shared understanding = more precise results
Duplicate results filtering
Collapsing duplicates from same directory or site to leave more room for
other relevant results
Less favoritism, more results on desired page 1
Definitions
Automatically extract “definitions” from indexed content and display them
as matches directly on the results page
A web property on the Search Best Bets web part (can turn on/off display
of definition)
Returned in the Query Object Model
Cannot be edited
Best Bets
Editorially assigned results based on these key concepts assigned to
selected query terms
Can be many-to-many
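The duplicate results filtering described on this slide can be sketched as keeping only the highest-ranked result from each group of look-alikes. In this Python illustration the grouping key (same host and same title) is an assumption chosen for simplicity; the sketch does not reproduce how MOSS actually detects duplicates.

```python
from urllib.parse import urlsplit


def collapse_duplicates(results):
    """Keep the first (highest-ranked) result per (host, title) pair."""
    seen = set()
    collapsed = []
    for r in results:
        key = (urlsplit(r["url"]).netloc.lower(), r["title"].strip().lower())
        if key in seen:
            continue  # a higher-ranked copy from the same site is already kept
        seen.add(key)
        collapsed.append(r)
    return collapsed


# Hypothetical ranked results: two copies on one site, one on another.
results = [
    {"url": "http://intranet/hr/policy.doc", "title": "Travel Policy"},
    {"url": "http://intranet/hr/copy/policy.doc", "title": "travel policy"},
    {"url": "http://teams/finance/policy.doc", "title": "Travel Policy"},
]
```

Collapsing the second copy frees a slot on page 1 for a different document.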
22. Scalability
No physical limit for the maximum number of
documents in one index
Recommended limit is 50 million
documents per indexer
A document is anything from a Word or PowerPoint
file, to a web page, an individual SharePoint list
item, one people entry, or an SAP customer record
Large/small documents count the same
The ‘average document size’ depends on the
corpus mix
– e.g., heavy use of WSS 3.0 lists versus limited use
Dependent on supporting hardware
23. Security
Query time stripping – customer only sees those results
that they have permission to view
Support for pluggable authentication for content in
SharePoint Server and WSS 3.0 Sites
Implements ASP.NET 2.0 authentication model
Minimum crawler permission is “Full Read”
Still provides the same security trimming functionality
Automatically configured for new sites
Search visibility options
Prevent sites/lists appearing in search results at a
site/list level
“Security only” crawl for single item add/removal
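Query-time trimming means the result list is filtered against the customer's permissions before it is displayed. The Python sketch below assumes each indexed item carries a list of groups allowed to read it; the acl field and the group names are hypothetical, not the MOSS security model.

```python
def trim_results(results, user_groups):
    """Drop results the user may not read, based on each item's ACL."""
    allowed = set(user_groups)
    return [r for r in results if allowed & set(r["acl"])]


# Hypothetical results with the groups allowed to read each one.
results = [
    {"title": "Company handbook", "acl": ["Everyone"]},
    {"title": "Salary review", "acl": ["HR"]},
    {"title": "Merger plan", "acl": ["Executives"]},
]
```

Because the ACL travels with the index entry, a permission change needs only those entries refreshed, which is the motivation for the “security only” crawl.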
24. Search Analytics
Export search logs to Excel
Query terms
Page views
Number of results returned
Volume trends
Query success: can define success for
certain query terms
Report Center
Access to MOSS 2007 BI features
Filters data for permissions and relevance
Key Performance Indicators [KPI]
Create a KPI list or other measures of
success
Default KPIs exist in OOB deployment
KPI information can be drawn from MOSS
2007 data sources: SharePoint lists, Excel
workbooks, SQL Server 2005 Analysis
Services, manually entered information
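Once the search logs are exported, the measures on this slide (query terms, volume, results returned, query success) reduce to simple counting. This Python sketch assumes each log row has already been reduced to a (query, result_count, clicked) tuple; that layout and the sample rows are invented for illustration and do not match the actual export format.

```python
from collections import Counter


def summarize_log(entries, top_n=3):
    """Summarize (query, result_count, clicked) log entries."""
    volume = Counter()
    zero_results = Counter()
    clicks = Counter()
    for query, result_count, clicked in entries:
        q = query.strip().lower()  # treat case variants as one query
        volume[q] += 1
        if result_count == 0:
            zero_results[q] += 1
        if clicked:
            clicks[q] += 1
    return {
        "top_queries": volume.most_common(top_n),
        "zero_result_queries": sorted(zero_results),
        "click_through": {q: clicks[q] / n for q, n in volume.items()},
    }


# Hypothetical exported log rows.
log = [
    ("vacation policy", 12, True),
    ("Vacation Policy", 12, False),
    ("goat rodeo", 0, False),
    ("expense form", 5, True),
]
summary = summarize_log(log)
```

Zero-result queries and popular queries with low click-through are the natural candidates for Best Bets and synonyms.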
26. Search Roadmap
Useful participants
Content creators
Information Architect/User Experience Architect
Taxonomist
Define key enterprise themes in content
Map existing content to these themes
Create filters and scopes to map for themes
Get as much customer data as possible to find search pain points
Review search logs and customer feedback mechanisms
What are they trying to find
What terms are they using
Assemble a cross functional team to:
Assign relevance weighting that makes sense to the customer behavior and the
corpus
Develop Best Bets for searches with 0 results
Create editorial guidelines and tools that enforce strong metadata standards across
the enterprise
Develop controlled vocabulary that best describes enterprise key concepts and
themes and is used as a foundation for meaningful metadata and facets
Design a structure that leverages the structural elements like URL depth and click
distance
27. Pareto’s Principle
Known as the 80/20 rule
Named after the late-19th-century
economist Vilfredo Pareto
20% of your content is
answering 80% of your
searches
Not an excuse to stop
optimizing at the top 20%
Don’t forget the Long Tail
28. Define Content
Define content scopes
Segment content into logical groups
Create scope rule based on
– Address
– Property query
– Content source
At the SSP level or individual level
SSP level scopes are shared among all sites that use the SSP
Select Authority resources
Define special terms if needed
Terms or language proprietary to the enterprise
– e.g. “goat rodeo”
Provides additional clarification for searcher
Use synonym mapping for term variants
– C# and Csharp
Two information points can be displayed for a special term
– Definition of the term
– Best Bet
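The relationship between special terms, synonyms, definitions, and Best Bets can be sketched as a small lookup table in which every term variant resolves to one keyword entry. This Python illustration uses invented class and method names and a placeholder URL; in MOSS 2007 these entries are maintained through the keyword administration pages, not through code like this.

```python
class KeywordTable:
    """Maps keywords and their synonyms to a definition and Best Bet URLs."""

    def __init__(self):
        self._canonical = {}  # any term variant -> canonical keyword
        self._entries = {}    # canonical keyword -> entry

    def add(self, keyword, synonyms=(), definition=None, best_bets=()):
        key = keyword.lower()
        self._entries[key] = {
            "definition": definition,
            "best_bets": list(best_bets),
        }
        for term in (keyword, *synonyms):
            self._canonical[term.lower()] = key

    def lookup(self, query):
        """Return the entry for a query or one of its synonyms, else None."""
        key = self._canonical.get(query.strip().lower())
        return self._entries.get(key)


# Hypothetical keyword with the synonym pair from the slide.
table = KeywordTable()
table.add(
    "C#",
    synonyms=["Csharp"],
    definition="A programming language for the .NET Framework.",
    best_bets=["http://intranet/dev/csharp"],
)
```

A query for either “C#” or “Csharp” returns the same entry, so both the definition and the Best Bet can be shown above the regular results.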
29. Designate Authority Sites
Hilltop Algorithm
Quality of links more important
than quantity of links
Segmentation of corpus into broad
topics
Selection of authority sources
within these topic areas
Pre-query calculation applied at
query time
Topic Sensitive Page Rank
Consolidation of Hypertext
Induced Topic Selection [HITS]
and PageRank
Pre-query calculation of factors
based on subset of corpus
– Context of term use in document
– Context of term use in history of
queries
– Context of term use by user
submitting query
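Authority sites feed the click distance factor mentioned elsewhere in this deck: a page is ranked in part by how few clicks separate it from a designated authority page. The Python sketch below computes that distance with a breadth-first search over a link graph. It implements neither Hilltop nor Topic Sensitive PageRank, and the link graph and page names are hypothetical.

```python
from collections import deque


def click_distances(links, authorities):
    """Breadth-first search: fewest clicks from any authority page."""
    distance = {page: 0 for page in authorities}
    queue = deque(authorities)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/hr", "/it"],
    "/hr": ["/hr/benefits"],
    "/hr/benefits": ["/hr/benefits/forms"],
    "/orphan": ["/"],
}
distances = click_distances(links, ["/"])
```

Pages that no authority page links to, directly or indirectly, get no distance at all, which is one reason to choose authority pages that sit at the top of the real navigation.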
30. Educate: Structural Influences
File Type Bias
In order of relevancy (highest to lowest)
– HTML Web pages
– PowerPoint presentations
– Word documents
– Emails
– XML files
– Excel spreadsheets
– Plain text files
– List items
Auto Language Detect
Foreign language results are less relevant than results in user’s
language
English language is always considered as relevant as user’s language
URL Depth and Click Distance
Short URLs are like prime real estate.
Items with shorter URLs are considered more relevant than items
placed in longer URLs
– The level is determined by reviewing the number of slash (“/”) characters in
the URL
Keywords separated by hyphens in the URL are good
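The slash-counting rule for URL depth is easy to check against your own site structure. This Python sketch counts slashes in the path portion only, ignoring the two slashes after the scheme and any trailing slash; that interpretation and the sample URLs are assumptions.

```python
from urllib.parse import urlsplit


def url_depth(url):
    """Count the slash characters in the path portion of a URL."""
    path = urlsplit(url).path
    return path.rstrip("/").count("/")


# Hypothetical URLs, sorted from shallowest to deepest.
urls = [
    "http://intranet/hr/benefits/forms/2008/enrollment.aspx",
    "http://intranet/",
    "http://intranet/hr/benefits.aspx",
]
by_depth = sorted(urls, key=url_depth)
```

Sorting a site inventory this way shows quickly which important pages are buried several levels down.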
31. Educate: Content Influences
Anchor Link Text
Search indexes the anchor text from the following elements:
– HTML anchor elements
– SharePoint Services link lists
– SharePoint Portal Server 2003 listings
– Word 2007, Excel 2007, and PowerPoint 2007 hyperlinks
Any file types handled by installed 3rd party iFilter components
which emit hyperlinks
Metadata extraction
Shadow title detection is provided within the body of the item
– Primarily based on text formatting features
– Shadow title is added automatically to the document
– Weighted the same as the original title
– Only for Microsoft Office file types
Auto Description text
Optimized URLs
Enterprise Search checks URL matching at query time:
If query matches to the host name of a page in the index it will
display as the first result
32. Enhanced Search Results
Site Actions >> Site Settings >> Modify All Site Settings >> Site Collection Administration
(Select Keywords) >> Manage Keywords >> “Add Keyword” >>
Synonym Mapping Best Bets
33. Hardware Considerations
Dedicated crawl-target servers for large
sites
Separate SQL Server instance for Search
Fast disk for SQL, fast CPU for Indexer,
more memory
Dedicated Web Front End Server for
crawling
Separate indexer machine
In most cases, your search index is on its own
server
34. Indexing Configuration
Use dedicated web front ends for crawling large
farms/sites
Upgrade WSS 2003 sites to WSS 2007 sites to
index them faster
Define Crawler Impact Rules to avoid site overload
Schedule for off-hours crawling where appropriate
Balance results freshness with load on servers
Consider using single content access account per
region
Regularly clean up and review
Crawl rules
Property and schema
Best Bets / keywords
35. Customizing Results Display
To access the XSL property of the Search Core Results Web Part
1. In your browser, navigate to the results page URL:
http://<ServerName>/SearchCenter/Pages/results.aspx
2. Click the Site Actions link, and then click Edit Page.
3. In the Search Core Results Web Part, click the edit down arrow to
display the Web Part menu, and then click Modify Shared Web
Part. This opens the Search Core Results Web Part tool pane.
4. Click Data Form Web Part to display the XSL Editor node.
5. Click the Source Editor button.
6. This opens the Text Entry window for the Web Part's XSL
property. You can modify the XSLT directly in this window;
however, you may find it easier to copy the code to a file. You
can then edit that file using an application such as Visual Studio
2005.
7. After you have finished editing the file, you can copy the modified
code back into the Text Entry window and save your changes to
the Search Core Results Web Part.
37. Dragons 1
Note the Infrastructure Update, in which Microsoft rolled
Search Server 2008 features into MOSS 2007,
including federated search and a unified
administration dashboard.
Read more here:
http://blogs.msdn.com/sharepoint/archive/2008/07/15/announcing-availability-of-infrastructure-updates.aspx
Also please note that it is *not* an easy installation,
and that users *must* read the entire documentation
for it before upgrading their portal.
More people destroy their portal than upgrade it due to not
reading the documentation and installing the prerequisite
patches
Must ensure a schedule for the incremental crawl to
catch additions to the document set
Must turn on PDF indexer and stemming
38. Dragons 2
Use the Web Part to accommodate wildcard
search
Found here:
http://www.sharepointblogs.com/mirror/archive/2008/06/09/new-web-part-for-wildcard-search-in-enterprise-search.aspx
Use of special characters in the thesaurus can lead to
highly irrelevant results and impact “did you mean”
capabilities
The Expert search capacity is predicated on the My
Sites profile
Employee participation critical to optimal
functionality
Benefits of click-distance are missed if Authority sites
are not configured
39. Dragons 3
The value of statistical ranking can vary from the
partial indexes to the master merge index
Without authoritative sites configured in the relevance
settings, the benefits of click-distance are missed
Results delayed from servers without Internet
connections
Backward compatibility
Custom applications using SharePoint 2003
administrative object model must be rewritten to
use MOSS 2007 object model
Index files, scopes, search alerts, filters, word
breakers, thesaurus files not upgraded
40. Resources
Microsoft Enterprise Search website
http://www.microsoft.com/enterprisesearch/
Webcast: Installing and Configuring Search in MOSS 2007
http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?culture=en-US&EventID=1032325467&CountryCode=US
Tune Search Server 2008
http://www.nonlinear.ca/blog/index.php/2008/02/27/how-to-tune-microsoft-search-server-express-2008-etc/
Configuring MOSS 2007 Search (Cale Hoopes)
http://calehoopes.blogspot.com/2007/11/configuring-moss-as-search-appliance.html
MOSS Developer Center on MSDN
http://msdn.microsoft.com/office/server/moss/default.aspx
MOSS 2007 Software Developers Kit
http://msdn2.microsoft.com/en-us/library/ms550992.aspx
MOSS 2007 on TechNet
http://technet2.microsoft.com/Office/en-us/library/3e3b8737-c6a3-4e2c-a35f-f0095d952b781033.mspx
Search Optimization for a MOSS 2007 Content Management site:
http://msdn.microsoft.com/en-us/library/cc721591.aspx
Faceted Search from the Microsoft SharePoint Team Blog
http://blogs.msdn.com/sharepoint/archive/2008/03/17/open
41. More Resources
Enterprise search blog
http://blogs.msdn.com/enterprisesearch/
MOSS BDC Search
http://blogs.msdn.com/gunterstaes/archive/2007/01/16/putting-it-all-together-moss-2007-business-data-catalog-search-excel-services-sql-analysis-services.aspx
Find it All with SharePoint Enterprise Search
http://technet.microsoft.com/en-us/magazine/cc162512.aspx
Google Enterprise Connector for MOSS 2007
http://code.google.com/apis/searchappliance/documentation/50/connector_admin/sharepoint_connector.html
Ontolica Search for MOSS 2007
http://www.ontolica.com/upload/pdf/factsheets/ontolicasearch_featurelist.pdf
Michael Gannotti on SharePoint
http://sharepoint.microsoft.com/blogs/mikeg/Lists/Categories/Category.aspx?Name=Search%20Technologies
Sitemap.xml Generator:
http://www.thesug.org/blogs/lsuslinky/Lists/Posts/Post.aspx?ID=14
SEO Advice from a Propellerhead for … : http://www.mossseo.com/
42. Even More Resources
MOSS 2007 Administrator Documentation
http://jamorgan.wordpress.com/2006/09/07/administrator-documentation-for-moss-2007-wss-v3/
SharePoint Search links
http://www.virtual-generations.com/2007/01/29/sharepoint-moss-2007-search-links/
All About SharePoint: S.S. Ahmed
http://www.sharepointblogs.com/ssa/archive/2007/01/19/working-with-sharepoint-search-part-1.aspx
Working with MOSS search - creating scopes
http://www.sharepointblogs.com/ssa/archive/2007/01/19/working-with-sharepoint-search-part-2.aspx
MOSS 2007 search customization
http://blogs.technet.com/pavelka/archive/2007/05/24/moss-2007-search-customization.aspx
MOSS 2007 Search & Indexing
http://www.sharepointblogs.com/zimmer/archive/2006/11/16/moss-2007-search-and-indexing.aspx
Create a custom Search Page
http://www.sharepointblogs.com/zimmer/archive/2007/08/25/moss-2007-connect-a-custom-search-page-to-a-custom-search-scope.aspx
44. Auto Classification Products
Concept Searching
Auto-classifies documents for MOSS 2007
Uses established probabilistic methods to distinguish
multiword concepts and weight by importance (relevance)
Extracts concepts and weights their relevance to searcher
query
– Presents for search refinement
http://www.conceptsearching.com/conceptHMSO/ (insider
trading)
Integration with MOSS
Extracts metadata and compound terms
Incorporates with existing taxonomy if one exists
Appends metadata and stores as MOSS property
Part of the main MOSS index
Uses standard MOSS administration features
45. Adjusting Relevance Property weights
Assign different weights to properties so that certain
properties such as ‘Title’ have a bigger influence on
ranking
Change default property weights through the Schema
Object Model
using System;
using Microsoft.Office.Server.Search.Administration;

// appGuid, property, and weight are supplied by the caller
Ranking ranking = new Ranking(SearchContext.GetContext(appGuid));

// Dump parameters, looking each one up by name
foreach (RankingParameter param in ranking.RankingParameters)
{
    RankingParameter lookedup = ranking.RankingParameters[param.Name];
    Console.WriteLine(lookedup.Name + ": " + lookedup.Value);
}

// Look up by index
for (int i = 0; i < ranking.RankingParameters.Count; i++)
{
    RankingParameter param = ranking.RankingParameters[i];
    Console.WriteLine(param.Name + ": " + param.Value);
}

// Set the weight of property 'property' to 'weight'
ranking.RankingParameters[property].Value = float.Parse(weight);
ranking.StartRankingUpdate(RankingUpdateType.ClickDistanceUpdate);
Console.Write("Updating ..");
// Poll once a second until the ranking update has finished
while (ranking.Status != RankingUpdateStatus.Idle)
{
    Console.Write('.');
    System.Threading.Thread.Sleep(1000);
}
Console.WriteLine("Done.");
Remember that Marcy Tobin wants me to let you know that this is not a trivial matter and she knows of what she speaks.
46. Push/Pull Data to Users
Alerts
Same alerting infrastructure for WSS and MOSS
– Timer service is used to handle all alerts notifications
Frequency can be set to Daily/Weekly
– Notifications for search alerts will be sent according to the creation time
‘Alert Me’ link can be added/removed using a web part property
on the Search Action Links web part and on the Search Core
Results web part
A rollup of all of a user’s alerts for a site collection
– http://<sitecollection>/_layouts/MySubs.aspx
Alert “gotchas”
– No “My Alerts Summary” web part
– No upgrade path from SPS2003 alerts to MOSS 2007 alerts except for
WSS alert types
RSS Feeds
Ability to subscribe for an RSS feed on the search results
‘RSS’ link can be added/removed using a web part property on the
Search Action Links web part and on the Search Core Results web part
47. Protocol Handlers
Connects to a content source and
enumerates the documents
Ships with support for
Web Content, NTFS File Shares, Exchange
Public Folders, Lotus Notes Databases,
SharePoint Content, SharePoint profiles, and
Business Data Catalog
Partners providing support for
Documentum, Hummingbird, OpenText, FileNet,
Interwoven, and others
http://msdn.microsoft.com/library/en-us/spssdk/html/_introduction_to_a_protocol_handler.asp?frame=true
48. The Query object model
using Microsoft.Office.Server.Search.Query;

// site and strQuery are supplied by the caller
KeywordQuery request = new KeywordQuery(site);
request.QueryText = strQuery;
request.ResultTypes |= ResultType.RelevantResults;
// If we want to get more than one result table
//request.ResultTypes |= ResultType.SpecialTermResults;
// Setting optional parameters on the Query object
request.RowLimit = 10;
request.StartRow = 0;
request.KeywordInclusion = KeywordInclusion.AllKeywords;
// Executing the query
ResultTableCollection results = request.Execute();
49. Metadata Property Mapping
Crawled properties
Emitted by iFilters and Protocol Handlers
Identified by a property set (GUID) and property
ID (name or numeric ID)
Managed properties
Mapping target for crawled properties (many-to-
many)
Identified by internal ID
Friendly name used in queries
– Can be used in the query with property:value
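The mapping from crawled to managed properties, and the property:value query syntax that relies on it, can be sketched as follows. The crawled property names, the first-non-empty-value rule, and the parsing logic are assumptions made for this Python illustration; they do not describe the MOSS schema API.

```python
def build_managed_values(crawled, mappings):
    """Resolve managed properties from crawled ones.

    mappings maps a managed property name to an ordered list of crawled
    property names; the first one with a value wins in this sketch.
    """
    managed = {}
    for name, sources in mappings.items():
        for source in sources:
            if crawled.get(source):
                managed[name] = crawled[source]
                break
    return managed


def parse_query(query):
    """Split a query into free-text terms and property:value filters."""
    terms, filters = [], {}
    for token in query.split():
        prop, sep, value = token.partition(":")
        if sep and prop and value:
            filters[prop.lower()] = value
        else:
            terms.append(token)
    return terms, filters


# Hypothetical crawled properties emitted for one document.
crawled = {"list_author": "", "office_author": "M. Sweeny", "list_title": "Search Roadmap"}
mappings = {"Author": ["list_author", "office_author"], "Title": ["list_title"]}
managed = build_managed_values(crawled, mappings)
terms, filters = parse_query("roadmap author:Sweeny")
```

Several crawled properties can feed one managed property, which is what lets a single friendly name such as author work across file types in a query.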