The document provides information about the taxonomy project undertaken by the American Society of Civil Engineers (ASCE) to develop a controlled vocabulary or taxonomy of technical terms for indexing their publications. It discusses the initial creation of the taxonomy with over 30,000 terms organized into technical, geographic, and corporate taxonomies. Subject experts then validated the technical topics. The document outlines ongoing efforts to enrich the taxonomy by adding equivalent and non-preferred terms, developing rules to help automate indexing, and handling challenges like disambiguating terms and accounting for variations. It emphasizes that taxonomy enrichment and rule building is an iterative process.
Case Study: Building the ASCE Thesaurus
1. Xi Van Fleet
Senior Manager of Information Services
Publishing Technology Department
Publication Division
American Society of Civil Engineers
2. Publications of the American Society of Civil Engineers
A Brief History
The American Society of Civil Engineers (ASCE) was
founded in 1852. We are the oldest engineering society
in the United States.
Our first publication, Transactions of the American Society
of Civil Engineers, was published in 1872. It is the
predecessor of our journals.
The first monograph was published in 1892.
3. Publications of the American Society of Civil Engineers
Today
Leading publisher in civil engineering
34 Peer-reviewed journals
Books and standards
Conference proceedings
Magazines
4. Online Civil Engineering Knowledge Environment
250+ ASCE e-book titles
65 ASCE Standards
Proceedings volumes with 42,000 papers from 2000 to present
Peer-reviewed journals with 60,000 papers from 1983 to present
More than 220,000 records with complete coverage of ASCE publications
Full-text database
Bibliographic database
5. Content driven
Overlapping with other engineering disciplines
e.g. chemical engineering, mechanical engineering, materials engineering
Strong on core disciplines: e.g. structural engineering,
geotechnical engineering
Weaker on peripheral disciplines: Aerospace engineering,
energy engineering
ASCE Taxonomy
6. The taxonomy project started in 2009
Access Innovations created the first version based on the
existing CEDB subject headings and data mined from the
content
The draft contained over 30,000 terms. We divided it into three
individual taxonomies:
Technical topics
Geographic terms
ASCE corporate
In-house subject experts of different disciplines were invited to
validate the technical topics.
Project History
7. “Final” Version of Taxonomy of Technical Topics
Preferred terms: 2440
Equivalent terms: 3167
Top terms: 22
Terms with "Related Terms": 488
Terms with "Non-Preferred Terms": 1320
8. Prepare ASCE Taxonomy for Machine Aided Indexing (MAI)
• Taxonomy enrichment
• Rule building
9. Taxonomy Enrichment
Add Equivalent/Non-preferred Terms
• Alternative spelling
Analysis – Analyses; Modeling vs. modelling
• Irregular word forms
Curricula vs. Curriculums
• Synonyms
Flood – inundation
Health care facilities – Hospitals, Nursing homes…
• Acronyms
Automated people movers – APM
• Term variation
• Bedforms, Bed-forms, Bed forms
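The enrichment step can be pictured as building a lookup from every equivalent or non-preferred form back to one preferred term. The sketch below is only an illustration in Python, not how MAIStro stores a thesaurus; the `THESAURUS` dictionary and the function names are hypothetical, and it assumes the left-hand term in each example above is the preferred one.

```python
# Hypothetical in-memory thesaurus: preferred term -> equivalent / non-preferred terms.
# Entries come from the examples on this slide; the left-hand term is assumed preferred.
THESAURUS = {
    "Flood": ["inundation"],
    "Health care facilities": ["hospitals", "nursing homes"],
    "Automated people movers": ["APM"],
    "Bedforms": ["bed-forms", "bed forms"],
}


def build_lookup(thesaurus):
    """Map every preferred and non-preferred form (case-folded) to its preferred term."""
    lookup = {}
    for preferred, variants in thesaurus.items():
        for form in [preferred, *variants]:
            lookup[form.casefold()] = preferred
    return lookup


LOOKUP = build_lookup(THESAURUS)


def preferred_term(text):
    """Return the preferred term for a surface form, or None if it is unknown."""
    return LOOKUP.get(text.strip().casefold())
```

Each new synonym, acronym, or spelling variant added during enrichment becomes one more key in the lookup, which is why enrichment directly increases what an automated indexer can recognize.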
10. Rule Building
Rules teach MAIStro to think like humans by
providing it with context, logic, and instructions.
Simple rules
Simple conditional rules
Complex conditional rules
12. Some Synonyms are obvious and easy.
e.g. Preferred term: Driver behavior
Equivalent/Non-Preferred Terms
13. How to find synonyms
Some synonyms are “hidden”, e.g. Agricultural wastes
Equivalent/Non-Preferred Terms
14. Preferred term: Public health and safety
How to find synonyms
Equivalent/Non-Preferred Terms
15. How to find synonyms
Equivalent/Non-Preferred Terms
16. How to find synonyms
Preferred term: Public health and safety:
Note: in our content “health” can also be used for a structure, a river, or environment.
Equivalent/Non-Preferred Terms
18. Preferred term: High-rise buildings
e.g. Spring Temple Buddha
Tokyo Skytree
Preferred term: Developing countries
ASCE taxonomy term: Civil engineering landmarks
ASCE Civil engineering landmarks Award list
How to find synonyms
Equivalent/Non-Preferred Terms
20. Terms made of phrase with variations
Preferred term: Lightweight concrete
Non-Preferred terms: Light-weight concrete, Light weight concrete
Preferred term: Design/Bid/Build
Non-Preferred terms: Design-bid-build, Design bid build, D/B/B, DBB, D-B-B
Equivalent/Non-Preferred Terms
Think about variation
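Separator variations like the ones above are mechanical enough to generate as candidates for an editor to review. The following Python sketch is an assumption about how one might do that outside the taxonomy tool; both function names are hypothetical.

```python
def separator_variants(parts, separators=("", " ", "-", "/")):
    """Join the parts of a compound term with each separator in turn."""
    return [sep.join(parts) for sep in separators]


def initialism_variants(parts, separators=("", "-", "/")):
    """Build initialisms such as DBB, D-B-B and D/B/B from the same parts."""
    initials = [p[0].upper() for p in parts]
    return [sep.join(initials) for sep in separators]


# Candidate non-preferred terms for "Design/Bid/Build"; a human still decides
# which candidates actually occur in the content and belong in the taxonomy.
candidates = (separator_variants(["design", "bid", "build"])
              + initialism_variants(["design", "bid", "build"]))
```

The generator deliberately over-produces (for example the closed form "designbidbuild"), so its output is a checklist for review rather than a list to import as-is.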
21. Equivalent/Non-Preferred Terms
Terms with prefix
Bio+Preferred terms
Biobinders; Biofuels; Biocement; Biokinetics; Biofilters;
Biofouling; Biogrouting; Bioleaching…
Post + Preferred terms
Postearthquakes; Postcombustion; Postcracking
Other prefixes: Pre, Micro, Macro, Super, Multi,
Non, Off...
Think about variation
22. Acronyms
Preferred term: Magnetic levitation trains
Non-preferred term: Maglev
Preferred term: Automated people movers
Non-preferred term: APM
Preferred term: Air traffic control
Acronym: ATC
ATC=apparent tardiness cost; applied technology council …
Need disambiguation
Preferred term: Intelligent transportation systems
Acronym: ITS
Be careful with acronyms
Equivalent/Non-Preferred Terms
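One way to picture acronym disambiguation is to accept an ambiguous acronym only when supporting context appears in the same text. The Python sketch below is illustrative only: the cue words for ATC are hypothetical and not taken from the ASCE rulebase, and a real rule would need a longer, tested list.

```python
import re

# Hypothetical context cues; an empty set means the acronym needs no context.
ACRONYM_RULES = {
    "ATC": ("Air traffic control", {"airport", "aircraft", "runway", "flight"}),
    "APM": ("Automated people movers", set()),
}


def resolve_acronyms(text):
    """Return preferred terms for acronyms whose context cues are satisfied."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    terms = []
    for acronym, (term, cues) in ACRONYM_RULES.items():
        # Match the acronym in upper case only, as a whole word.
        if re.search(r"\b" + acronym + r"\b", text):
            if not cues or cues & words:
                terms.append(term)
    return terms
```

With these assumed cues, "ATC" in a scheduling paper about apparent tardiness cost yields no term, while "ATC" next to "airport" does.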
23. Create Rulebase
MAIStro automatically creates a text-to-match (TTM) rule for
every term, both preferred and non-preferred
TTM works for many terms:
Flash floods – Flash floods
Continuing education – Continuing education
Ridership – Ridership
Hydraulic engineering – Hydraulic engineering
Text that matches
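A text-to-match rule amounts to finding the term itself in the text. The Python sketch below approximates that idea with whole-phrase, case-insensitive regular expressions; it is an assumption about the behavior, not MAIStro's implementation, and `text_to_match` and `index_text` are hypothetical names.

```python
import re


def text_to_match(term):
    """Compile a whole-phrase, case-insensitive matcher for one term."""
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


TERMS = ["Flash floods", "Continuing education", "Ridership", "Hydraulic engineering"]
MATCHERS = {term: text_to_match(term) for term in TERMS}


def index_text(text):
    """Return every term whose exact wording appears in the text."""
    return [term for term, rx in MATCHERS.items() if rx.search(text)]
```

This works only when authors use exactly the wording of the term, which is the limitation the following slides address.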
24. Create Rulebase
Noun vs. verb vs. adjective vs. adverb
Preferred term: Corrosion
Corrosive
Corrosiveness
Corrosivity
Corroding
Corroded
Corrodible
Corrodibility…
Simple rule
Corros* USE Corrosion
Corrod* USE Corrosion
Text that doesn't quite match (variations)
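The two simple rules above rely on right truncation: `Corros*` stands for any word beginning with "corros". A rough Python equivalent, assuming `*` means "zero or more word characters", could look like this (the helper names are hypothetical):

```python
import re


def truncation_pattern(key):
    """Translate a right-truncated rule key such as 'Corros*' into a regex."""
    if not key.endswith("*"):
        raise ValueError("expected a trailing '*'")
    return re.compile(r"\b" + re.escape(key[:-1]) + r"\w*", re.IGNORECASE)


RULES = {"Corros*": "Corrosion", "Corrod*": "Corrosion"}
COMPILED = [(truncation_pattern(key), term) for key, term in RULES.items()]


def apply_simple_rules(text):
    """Return the distinct preferred terms triggered by any truncated key."""
    return sorted({term for rx, term in COMPILED if rx.search(text)})
```

Two stems are needed because "corrosive" and "corroded" share only "corro", and a stem that short would be more likely to pick up unrelated words.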
25. Create Rulebase
Preferred term: Lateral loads
Variations: Lateral loading; Laterally loaded…
Need simple conditional rule:
load*
IF (WITH "lateral*")
USE Lateral loads
ENDIF
Text that doesn't quite match (variations)
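The conditional rule for Lateral loads can be mimicked in Python. The slides do not define the scope of `WITH`, so this sketch assumes it means "in the same sentence"; the sentence splitter is deliberately naive and all names are hypothetical.

```python
import re


def sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def lateral_loads_rule(text):
    """Assign 'Lateral loads' when load* and lateral* share a sentence (assumed WITH scope)."""
    load = re.compile(r"\bload\w*", re.IGNORECASE)
    lateral = re.compile(r"\blateral\w*", re.IGNORECASE)
    for sentence in sentences(text):
        if load.search(sentence) and lateral.search(sentence):
            return ["Lateral loads"]
    return []
```

The condition is what lets "laterally loaded" index correctly without every occurrence of "load" triggering the term.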
26. Create Rulebase
Variations of “Span bridges”
Bridge*
IF (NEAR "span" OR NEAR "short-span" OR NEAR "long-span" OR
NEAR "single-span" OR NEAR "multi-span" OR NEAR "multiple-span"
OR NEAR "four-span" OR NEAR "three-span" OR NEAR "one-span"
OR NEAR "continuous-span" OR NEAR "simple-span" OR NEAR
"large-span")
USE Span bridges
ENDIF
Text that doesn't quite match (variations)
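A proximity condition such as `NEAR` can be sketched as a word window around the trigger word. The window size is not stated in the slides, so `NEAR_WINDOW = 3` below is purely an assumption, as are the tokenizer and function names.

```python
import re

NEAR_WINDOW = 3  # assumption: NEAR means within three words on either side

SPAN_WORDS = {"span", "short-span", "long-span", "single-span", "multi-span",
              "multiple-span", "four-span", "three-span", "one-span",
              "continuous-span", "simple-span", "large-span"}


def tokenize(text):
    """Lowercase word tokens; hyphenated words such as 'long-span' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def span_bridges_rule(text):
    """Assign 'Span bridges' when a span word falls inside the window around bridge*."""
    tokens = tokenize(text)
    for i, token in enumerate(tokens):
        if token.startswith("bridge"):
            window = tokens[max(0, i - NEAR_WINDOW): i + NEAR_WINDOW + 1]
            if SPAN_WORDS.intersection(window):
                return ["Span bridges"]
    return []
```

A tight window keeps "bridge" and "span" from pairing up when they merely appear in the same paragraph.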
28. Preferred term: Structural analysis
Analy*
IF (WITH "structur*" OR WITH "load" OR WITH "loads")
IF (NEAR "arch*" OR WITH "column*" OR NEAR "bar" OR NEAR "bars" OR
NEAR "bar's" OR NEAR "beam" OR NEAR "beams" OR NEAR "strut" OR NEAR
"struts" OR NEAR "compression member*" OR NEAR "tie" OR NEAR "ties" OR
NEAR "tie rod" OR NEAR "tie-rod" OR NEAR "tie rods" OR NEAR "tie-rods" OR
NEAR "eyebar*" OR NEAR "guy-wire*" OR NEAR "guy wire*" OR NEAR
"suspension cable*" OR NEAR "wire rope*" OR NEAR "angle section*" OR
NEAR "connect*" OR NEAR "coupl*" OR NEAR "diaphragm*" OR NEAR
"flange*" OR NEAR "frame*" OR NEAR "bent" OR NEAR "bents" OR NEAR
"girder*" OR NEAR "hollow section*" OR NEAR "hollow structural section*" OR
NEAR "joint*" OR NEAR "joist*" OR NEAR "membrane*" OR NEAR "panel" OR
NEAR "plate" OR NEAR "slab*" OR NEAR "stud" OR NEAR "studs" OR NEAR
"tendon*" OR NEAR "tensile member*" OR NEAR "truss*" OR NEAR "tube*" OR
NEAR "wall*" OR NEAR "gable*" OR NEAR "wall section*" OR MENTIONS
"structural failure*" OR MENTIONS "building failure*")
USE Structural analysis
ENDIF
ENDIF
Create Rulebase
Text that doesn’t quite match (whole vs parts)
29. Bridge the gap
Raising the bar
Foundation
a solid foundation, a firm foundation, research
foundation…
Toll: Toll Brothers, human toll, take a toll…
Using NULL rules
right match that is wrong
Create Rulebase - To Disambiguate
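A NULL rule tells the indexer to assign nothing when a word appears inside a known idiom or proper name. One simple way to imitate this in Python is to blank out those phrases before keyword matching. The keyword-to-term map below is hypothetical and exists only to make the example runnable.

```python
import re

# Idioms and names from the slide that should produce no index term.
NULL_PHRASES = ["bridge the gap", "raising the bar", "solid foundation",
                "firm foundation", "research foundation", "toll brothers",
                "human toll", "take a toll"]
NULL_RX = re.compile("|".join(re.escape(p) for p in NULL_PHRASES), re.IGNORECASE)

# Hypothetical keyword -> term map, for illustration only.
KEYWORDS = {"bridge": "Bridges", "foundation": "Foundations", "toll": "Tolls"}


def index_with_null_rules(text):
    """Remove NULL phrases first, then match the remaining keywords."""
    cleaned = NULL_RX.sub(" ", text)
    found = []
    for keyword, term in KEYWORDS.items():
        if re.search(r"\b" + keyword + r"s?\b", cleaned, re.IGNORECASE):
            found.append(term)
    return found
```

Masking first means the idiom's words can never trigger a match, while a literal use elsewhere in the same text still can.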
30. Create Rulebase
Phrases that contain more than one term
Text: Continuous Multispan Concrete Girder Highway Bridges
Preferred terms:
Continuous bridges
Span bridges
Concrete bridges
Girder bridges
Highway bridges
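The example above shows one noun phrase fanning out into five preferred terms. A minimal Python sketch of that fan-out, assuming each modifier before the head noun "bridges" maps to its own term, might be (the mapping table and function name are hypothetical):

```python
# Modifier -> preferred term, taken from the example phrase on this slide.
MODIFIER_TO_TERM = {
    "continuous": "Continuous bridges",
    "multispan": "Span bridges",
    "concrete": "Concrete bridges",
    "girder": "Girder bridges",
    "highway": "Highway bridges",
}


def split_bridge_phrase(phrase):
    """Return one preferred term per recognized modifier of the head noun 'bridge(s)'."""
    words = phrase.lower().split()
    if not words or not words[-1].startswith("bridge"):
        return []
    return [MODIFIER_TO_TERM[w] for w in words[:-1] if w in MODIFIER_TO_TERM]
```

In a rulebase the same effect comes from several independent rules firing on one phrase, rather than from a single table.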
31. Create Rulebase - To Disambiguate
Preferred term: Wells
(noun vs adverb)
Well*
IF (WITH "hydraul*" OR WITH "Hydro*" OR WITH "Aquifer*" OR WITH "Multiaquifer*" OR WITH
"discharg*" OR WITH "pump*" OR WITH "stilling" OR WITH "flow*" OR WITH "water*" OR WITH
"groundwater" OR WITH "Recirculation" OR WITH "Artesian")
USE Wells
ENDIF
32. Foundation*
IF (NOT (NEAR "success*" OR NEAR "research" OR NEAR "national science" OR
NEAR "grant*" OR NEAR "president*" OR NEAR "ASCE foundation*" OR AROUND
"engineering foundation" OR NEAR "economic" OR NEAR "prize*" OR NEAR
"award*" OR NEAR "education*" OR NEAR "campaign*" OR AROUND "reason
foundation" OR AROUND "national science foundation" OR AROUND "nsf" OR
NEAR "job*" OR NEAR "partner*" OR NEAR "organization*" OR NEAR "scholar*"))
IF (WITH "bridge*" OR AROUND "bridge foundation*")
USE Bridge foundations
ENDIF
IF (WITH "dam" OR WITH "dams" OR AROUND "dam foundation*")
USE Dam foundations
ENDIF
IF (NEAR "deep" OR AROUND "deep foundation*")
USE Deep foundations
…
Create Rulebase - To Disambiguate
33. If it is impossible to write a rule for a term, it may not be a
good term.
Bubbles
Water bubbles, air bubbles, gas bubbles, financial bubbles…
fluid dynamics, waste treatment, material science, soil
mechanics…
Clue: if you have trouble placing a term in the taxonomy, you are likely to have
trouble creating rules for it.
Disambiguation
34. Create Rulebase
Test*
Test, tests, testing, testings, testify, testimony, testosterone
Wave*
Waves, wavelength, wave length, wavelet, wavefront, waverider, waveguide…
Truncate text with care
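Before committing to a truncated stem, it helps to list everything it would match in a sample vocabulary. The Python sketch below does that audit; the sample list comes from the slide, and the function name and the "wanted" set are assumptions.

```python
import re


def truncation_hits(stem, vocabulary):
    """Return every word in the vocabulary that a right-truncated stem would match."""
    rx = re.compile(re.escape(stem) + r"\w*", re.IGNORECASE)
    return [word for word in vocabulary if rx.fullmatch(word)]


SAMPLE = ["test", "tests", "testing", "testify", "testimony", "testosterone", "contest"]
hits = truncation_hits("test", SAMPLE)

# Words the stem matches but the indexer does not want (assumed wanted set).
unwanted = [w for w in hits if w not in {"test", "tests", "testing"}]
```

If the unwanted list is long, listing the intended forms explicitly or adding conditions is safer than truncating.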
35. Preferred Term: Workplace discrimination
Discriminat*
IF (WITH "age" OR WITH "minority" OR WITH "racial" OR WITH "race" OR
WITH "disabilit*" OR WITH "senior" OR WITH "older" OR WITH "old" OR
WITH "women" OR WITH "woman" OR WITH "diversity" OR WITH
"dispute" OR WITH "equal*" OR WITH "female" OR WITH "male" OR WITH
"workplace" OR WITH "African*" OR WITH "Hispanic")
USE Workplace discrimination
ENDIF
Text that hardly matches (need specifics)
Create Rulebase
36. Taxonomy Enrichment and Rule Building
is a Process.
Another opportunity to fine tune the taxonomy
Diffus*
IF (MENTIONS "transport" OR MENTIONS "concentration" OR MENTIONS "gradient" OR MENTIONS "advective" OR MENTIONS "equilibr*" OR MENTIONS "voc" OR
MENTIONS "vocs" OR MENTIONS "volatile organic compound*" OR MENTIONS "water*" OR MENTIONS "moisture" OR MENTIONS "wave*" OR MENTIONS "flow" OR
MENTIONS "chemical*" OR MENTIONS "molecul*" OR MENTIONS "soil*" OR MENTIONS "waste*" OR MENTIONS "filter*" OR MENTIONS "runoff" OR
MENTIONS "run-off" OR MENTIONS "jet" OR MENTIONS "turbulen*" OR MENTIONS "gas" OR MENTIONS "emission*" OR MENTIONS "emit*" OR MENTIONS "air" OR
MENTIONS "oxygen" OR MENTIONS "thermal" OR MENTIONS "solute*" OR MENTIONS "chloride*" OR MENTIONS "contamin*" OR MENTIONS "pollut*" OR MENTIONS "organic"
OR MENTIONS "compound*" OR MENTIONS "nitri*" OR MENTIONS "ion" OR MENTIONS "ions" OR MENTIONS "dye" OR MENTIONS "dyes" OR MENTIONS "fluid*" OR
MENTIONS "channel*" OR MENTIONS "river*" OR MENTIONS "stream*" OR MENTIONS "tidal" OR MENTIONS "hydro*" OR MENTIONS "hydrau*" OR
MENTIONS "lake*" OR MENTIONS "bay" OR MENTIONS "bays" OR MENTIONS "ocean*" OR MENTIONS "coast*" OR MENTIONS "sediment*" OR MENTIONS "sea" OR
MENTIONS "seas" OR MENTIONS "catchment*" OR MENTIONS "reservoir*" OR MENTIONS "estuar*" OR MENTIONS "sewage*" OR MENTIONS "flood*" OR
MENTIONS "porous medi*" OR MENTIONS "concrete*" OR MENTIONS "bentonite" OR MENTIONS "cement*" OR MENTIONS "clay*" OR MENTIONS "advection*" OR
MENTIONS "convection*" OR MENTIONS "eddy" OR MENTIONS "eddies" OR MENTIONS "flux")
IF (AROUND "voc" OR AROUND "vocs" OR AROUND "volatile organic compound*" OR AROUND "chemical*" OR AROUND "molecul*" OR AROUND
"chlorid*" OR AROUND "nitri*" OR AROUND "ion" OR AROUND "ions" OR AROUND "polymer*" OR AROUND "species" OR AROUND "polyaromatic*" OR AROUND
"hydrocarbon*" OR AROUND "aromatic*" OR AROUND "pah" OR AROUND "pahs" OR AROUND "dichloromethane*" OR AROUND "chloromethane*" OR AROUND
"chemox")
USE Diffusion (chemical)
ENDIF
IF (AROUND "thermo*" OR AROUND "thermal" OR AROUND "thermodiffusion")
USE Diffusion (thermal)
ENDIF
IF (AROUND "porous" OR AROUND "porosity" OR AROUND "soil*" OR AROUND "clay*" OR AROUND "pore" OR AROUND "pores" OR AROUND
"cement*" OR AROUND "concrete*" OR AROUND "bentonite")
USE Diffusion (porous media)
ENDIF
IF (AROUND "fluid*")
IF (WITH "turbulen*" OR WITH "eddy" OR WITH "eddies")
USE Turbulent diffusion
ENDIF
ENDIF
IF (NOT (AROUND "voc" OR AROUND "vocs" OR AROUND "volatile organic compound*" OR AROUND "chemical*" OR AROUND "molecul*" OR
AROUND "chlorid*" OR AROUND "nitri*" OR AROUND "ion" OR AROUND "ions" OR AROUND "polymer*" OR AROUND "species" OR AROUND "polyaromatic*" OR
AROUND "hydrocarbon*" OR AROUND "aromatic*" OR AROUND "pah" OR AROUND "pahs" OR AROUND "dichloromethane*" OR AROUND "chloromethane*" OR
AROUND "chemox" OR AROUND "thermo*" OR AROUND "thermal" OR AROUND "thermodiffusion" OR AROUND "porous" OR AROUND "porosity" OR AROUND "soil*"
OR AROUND "clay*" OR AROUND "pore" OR AROUND "pores" OR AROUND "cement*" OR AROUND "concrete*" OR AROUND "bentonite" OR AROUND "fluid*" OR
WITH "wave" OR WITH "waves"))
USE Diffusion
ENDIF
ENDIF
37. • It is impossible to build perfect rules.
• Noise (rules too general) or misses (rules too
granular). Try to strike a balance.
• Be ready for the unexpected. Keep note of possible
equivalent terms when you are not working on the
taxonomy, e.g. “ring of fire”=Earthquakes, “la nina”,
“el nino”, “polar vortex” =Climate change
Taxonomy Enrichment and Rule Building
is a Process
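The balance between noise and misses can be tracked with two standard measures: precision drops when rules are too general (noise), and recall drops when rules are too granular (misses). The Python sketch below compares machine-assigned terms with terms a human indexer chose for the same document; the function name and sample terms are illustrative, not ASCE data.

```python
def precision_recall(assigned, expected):
    """Compare machine-assigned terms with human-chosen terms for one document."""
    assigned, expected = set(assigned), set(expected)
    hits = assigned & expected
    # Precision falls with noise: terms assigned that the indexer would not choose.
    precision = len(hits) / len(assigned) if assigned else 0.0
    # Recall falls with misses: terms the indexer chose that the rules did not assign.
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


# Illustrative document: one noisy term assigned, two expected terms missed.
p, r = precision_recall(["Wells", "Corrosion", "Tolls"],
                        ["Wells", "Corrosion", "Pumps", "Aquifers"])
```

Re-running such a comparison on a fixed test set after each round of rule edits shows whether a change traded noise for misses or improved both.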