Explorations in Automatic Thesaurus Discovery

Springer Science & Business Media, Jul 31, 1994 - Computers - 305 pages
Explorations in Automatic Thesaurus Discovery presents an automated method for creating a first-draft thesaurus from raw text. It describes the natural language processing steps of tokenization, surface syntactic analysis, and syntactic attribute extraction. From these attributes, word and term similarity is calculated, and a thesaurus is created showing important common terms and their relations to each other, common verb-noun pairings, common expressions, and word family members.
The techniques are tested on twenty different corpora, ranging from baseball newsgroups, assassination archives, medical X-ray reports, and abstracts on AIDS to encyclopedia articles on animals, and even the text of the book itself. The corpora range in size from 40,000 to 6 million characters of text, and results are presented for each in the Appendix.
The methods described in the book have undergone extensive evaluation. Their time and space complexity are shown to be modest. The results are shown to converge to a stable state as the corpus grows. The similarities calculated are compared to those produced by psychological testing. A method of evaluation using Artificial Synonyms is tested. Gold Standards evaluations show that the techniques significantly outperform non-linguistic-based techniques for the most important words in the corpora.
Explorations in Automatic Thesaurus Discovery includes applications to information retrieval (using established testbeds), the enrichment of existing thesauri, and semantic analysis. Also included are applications showing how to create, implement, and test a first-draft thesaurus.
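As a rough illustration of the idea in the description (words are compared through the syntactic attributes extracted for them), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy word/attribute pairs, the attribute labels such as "subj-of:bark", the function names, and the use of a plain, unweighted Jaccard measure. It is not taken from the book and may differ from the similarity measure the book actually uses.

```python
from collections import defaultdict

# Hypothetical toy input: (word, attribute) pairs of the kind a syntactic
# attribute extractor might emit. The labels are invented for illustration.
PAIRS = [
    ("dog", "subj-of:bark"), ("dog", "adj:small"), ("dog", "obj-of:feed"),
    ("cat", "adj:small"), ("cat", "obj-of:feed"), ("cat", "subj-of:purr"),
    ("car", "obj-of:drive"), ("car", "adj:small"),
]


def attribute_sets(pairs):
    """Group the attributes observed for each word into a set."""
    sets = defaultdict(set)
    for word, attr in pairs:
        sets[word].add(attr)
    return dict(sets)


def jaccard(a, b):
    """Unweighted Jaccard similarity: shared attributes / all attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(pairs):
    """For each word, list the other words from most to least similar."""
    sets = attribute_sets(pairs)
    result = {}
    for word in sets:
        scored = [(jaccard(sets[word], sets[other]), other)
                  for other in sets if other != word]
        # Highest similarity first; ties broken alphabetically.
        scored.sort(key=lambda t: (-t[0], t[1]))
        result[word] = [(other, round(score, 3)) for score, other in scored]
    return result


if __name__ == "__main__":
    for word, neighbors in nearest_neighbors(PAIRS).items():
        print(word, neighbors)
```

In this toy data, "dog" and "cat" share two of four distinct attributes, so each is the other's closest neighbor. A first-draft thesaurus entry could then be read off as the top-ranked neighbors of each word.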
 

What people are saying

User Review

If you like this, you might be interested in these books:
K. Spärck Jones, Synonymy and Semantic Classification, doctoral dissertation, Univ. of Cambridge, 1968; reprinted, Edinburgh Univ. Press, 1984.
"Aspects of text structure: an investigation of the lexical organization of text" by Martin Phillips. North-Holland 1985.
Ruge, G. 1992. Experiment on linguistically-based term associations. Inf. Process. Manage. 28, 3 (Jan. 1992), 317-332.
Hinrich Schütze, Automatic word sense discrimination, Computational Linguistics, v.24 n.1, March 1998
Dekang Lin, Automatic retrieval and clustering of similar words, Proceedings of the 17th international conference on Computational linguistics, p.768-774, August 10-14, 1998, Montreal, Quebec, Canada
Ido Dagan , Lillian Lee , Fernando C. N. Pereira, Similarity-Based Models of Word Cooccurrence Probabilities, Machine Learning, v.34 n.1-3, p.43-69, Feb. 1999
 

Contents

INTRODUCTION - 1
SEMANTIC EXTRACTION - 7
2.2 COGNITIVE SCIENCE APPROACHES - 8
2.3 RECYCLING APPROACHES - 17
2.4 KNOWLEDGE-POOR APPROACHES - 23
SEXTANT - 33
3.2 METHODOLOGY - 34
3.3 OTHER EXAMPLES - 54
3.4 DISCUSSION - 57
EVALUATION - 69
4.1 DEESE ANTONYMS DISCOVERY - 70
4.2 ARTIFICIAL SYNONYMS - 75
4.3 GOLD STANDARDS EVALUATIONS - 81
4.4 WEBSTER'S 7TH - 89
4.5 SYNTACTIC vs DOCUMENT COOCCURRENCE - 91
4.6 SUMMARY - 100
APPLICATIONS - 101
5.2 THESAURUS ENRICHMENT - 114
5.3 WORD MEANING CLUSTERING - 126
5.4 AUTOMATIC THESAURUS CONSTRUCTION - 131
5.5 DISCUSSION AND SUMMARY - 133
CONCLUSION - 137
6.2 CRITICISMS - 139
6.3 FUTURE DIRECTIONS - 141
6.4 VISION - 147
PREPROCESSORS - 149
WEBSTER STOPWORD LIST - 151
SIMILARITY LIST - 153
SEMANTIC CLUSTERING - 163
AUTOMATIC THESAURUS GENERATION - 171
CORPORA TREATED - 181
6.2 AI - 187
6.3 AIDS - 192
6.4 ANIMALS - 197
6.5 BASEBALL - 202
6.6 BROWN - 207
6.7 CACM - 212
6.8 CISI - 219
6.9 CRAN - 228
6.10 HARVARD - 235
6.11 JFK - 240
6.12 MED - 247
6.13 MERGERS - 252
6.14 MOBYDICK - 257
6.15 NEJM - 261
6.16 NPL - 266
6.17 SPORTS - 273
6.18 TIME - 278
6.19 XRAY - 285
6.20 THESIS - 290
INDEX - 303
Copyright

Popular passages

Page iii - THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE NATURAL LANGUAGE PROCESSING AND MACHINE TRANSLATION Consulting Editor Jaime Carbonell Other books in the series: EFFICIENT PARSING FOR NATURAL LANGUAGE: A FAST ALGORITHM FOR PRACTICAL SYSTEMS, M. Tomita ISBN 0-89838-202-5 A NATURAL LANGUAGE INTERFACE FOR COMPUTER AIDED DESIGN, T. Samad ISBN 0-89838-222-X INTEGRATED NATURAL LANGUAGE DIALOGUE: A COMPUTATIONAL MODEL, RE Frederking ISBN 0-89838-255-6 NAIVE SEMANTICS FOR NATURAL LANGUAGE...
