Indexing and access for digital libraries and the Internet / Bates (1998)

Citation - Bates, M. J. (1998). Indexing and access for digital libraries and the Internet: Human, database, and domain factors. Journal of the American Society for Information Science, 49(13), 1185-1205.

Keyword -

A review of research issues in indexing and accessing digital content, especially on the Internet.

  • Introduction
    • Objective
    • Fully Automated Subject Access (environment)
  • Human Factors
    • Subject Searching vs. Indexing
    • Multiple Terms of Access
  • All Subject Vocabulary Terms Are Not Equal
    • Folk classification
    • Folk access
  • Statistical Indexing Properties of Databases
    • Bradford's Law
    • Vocabulary Scalability
    • Resnikoff-Dolby 30:1 Rule
  • Domain-Specific Indexing
  • Discussion and Conclusions

Why study indexing and access at all, in an era when digital resources on the network are so easy to retrieve?

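Many people assume that subject access to digital resources is a problem already solved, or nearly solved, by full-text retrieval systems. In Bates's words: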
… (Many people) assume that subject access to digital resources is a problem that has been solved, or is about to be solved, with a few more small modifications of current full-text indexing systems.

For two reasons, the issues raised in the article still need to be studied:

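  • First, even in a fully automated indexing environment, human, domain, and other factors still operate, and they must be addressed to keep improving retrieval effectiveness.
    First, the human, domain, and other factors would still operate in a fully automated environment, and need to be dealt with to optimize the effectiveness of information retrieval (IR) systems.
    • Whatever information system we develop, it is still a product of human activity, and such products (databases, for example) keep the same statistical properties.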
      Whatever information systems we develop, human beings still will come in the same basic model; products of human activity, such as databases, still will have the same statistical properties, and so on.
    • As should become evident, failure to work with these factors will almost certainly diminish the resulting product.
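  • IR systems designed from small-scale samples still cannot improve the 30% of results that depend on human language and cognitive processing, although automated computing techniques are expected to solve this sooner or later.
  • The user side of IR systems also needs attention and understanding.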
    The human side of the IR process needs attention too. The really sophisticated use of computers will require designs shaped much more in relation to how human minds and information needs actually function, not to how formal, analytical models might assume they do.

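Attending to these factors will improve the effectiveness of IR systems, however soon (and whether or not) we find a way to develop good, fully automated retrieval.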
Attention to these points will be productive for effective IR, no matter how soon, or whether, we find a way to develop good, fully automated, IR.

Searchers and Indexers

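  • Indexing and searching are commonly thought of as mirror images of each other: the indexer represents content as index terms, the searcher represents a question as search terms, and matching the two representations yields the results. In practice this captures only a superficial symmetry. The user's experience differs phenomenologically from the indexer's.
    It is commonly assumed that indexing and searching are mirror images of each other. In indexing, the contents are described or represented, or, in full-text searching, indicative words or phrases are matched or otherwise identified. On the searching side, the user formulates a statement of the query. Then these two representations, of document and of query, are matched to retrieve the results. But, in fact, this is only superficially a symmetrical relationship. The user's experience is phenomenologically different from the indexer's experience.
  • The user's experience is one of describing something he or she does not know (Belkin, 1982). Not knowing it, the user can only describe the scattered fragments of knowledge at hand, probing for clues that might fill the knowledge gap and lead to something that "looks about right." Alternatively, the user searches across a broader area to see what looks helpful. Usually, though, users have no tool to help them describe what they do not know.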
    The user's task is to describe something that, by definition, he or she does not know (cf. Belkin, 1982). (Knowledge specifically of what is wanted would lead to a “known-item” search.) The user, in effect, describes the fringes of a gap in knowledge, and can only guess what the “filler” for the gap would look like. Or, the user describes a broader, more general topic area than the specific question of interest, and says, in effect, “Get me some stuff that falls in this general area and I'll pick what looks good to me.” Usually, the user has no tools available to help with that problem of describing the fringes of the gap, or the broader subject area.
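    • Other studies: Kuhlthau (1993), on how students search for a topic; Bates, M. (1989).
  • The indexer, by contrast, has the full content in hand. Ideally, the indexer's challenge is to anticipate which terms users who lack that knowledge would use to look for the information, imagining the various situations in order to meet the needs of various users.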
    The indexer, on the other hand, has the record in hand. It is all there in front of him or her. There is no gap. Here, ideally, the challenge for the indexer is to try to anticipate what terms people with information gaps of various descriptions might search for in those cases where the record in hand would, in fact, go part way in satisfying the user's information need.
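    • Harter (1992): topically relevant information is not necessarily psychologically relevant to the user's need. Harter uses his own article as the example: people who study IR system design and evaluation, or bibliometrics, may find its content interesting, even though the article is not actually on those topics.
      • Related discussion: Ellis, 1996; Green & Bean, 1995; O'Connor, 1996; Soergel, 1985; Wilson, 1968
    • In practice, however, indexers and catalogers do not really index according to users' needs. They simply index what is in the record (see also the discussion in Fidel, 1994). In other words, they describe or represent the features of the text itself as carefully and accurately as they can. It is therefore no surprise when indexers and users turn out to use entirely different vocabulary.
    • Indexers command a great deal of knowledge about the indexing vocabulary (for example, the LCSH subject heading manual, or the "scope notes" of thesaurus entries). Searching users have no access to that knowledge.
    • The same gap between indexer and user also arises between users and algorithm designers. The designers may have a great deal of relevant knowledge, but users cannot anticipate or understand the results of these statistical algorithms in every situation.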

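The question should not be "How do we build the most refined, most complete index or classification scheme?" but rather "How do we build a retrieval system whose interface feels natural and suits its users, and that helps them find the information they want by whatever means it takes?"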

Multiple Terms of Access

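Many different studies have found that, whatever the topic, people search with a wide variety of terms, and no single term is used most often. The variation includes morphological (such as singular and plural), syntactic, and semantic differences.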
In study after study, across a wide range of environments, it has been found that for any target topic, people will use a very wide range of different terms, and no one of those terms will occur very frequently. These variants can be morphological (forest, forests), syntactic (forest management, management of forests) and semantic (forest, woods).
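
To make the three kinds of variation concrete, here is a minimal Python sketch of an entry vocabulary that maps searcher variants onto one preferred index term. It is an illustration only, not a method taken from the article: the names `ENTRY_VOCABULARY`, `to_concept`, `search`, and the sample `INDEX` are hypothetical, and the variant list simply reuses the forest examples quoted above.

```python
import re
from typing import Dict, List, Optional

# Hypothetical entry vocabulary: every variant a searcher might type
# points to the single preferred term used by the index.
ENTRY_VOCABULARY: Dict[str, str] = {
    "forest": "forests",                           # morphological variant
    "forests": "forests",
    "woods": "forests",                            # semantic variant
    "forest management": "forest management",
    "management of forests": "forest management",  # syntactic variant
}

# Hypothetical index: preferred term -> document identifiers.
INDEX: Dict[str, List[str]] = {
    "forests": ["doc1", "doc3"],
    "forest management": ["doc2"],
}


def normalize(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())


def to_concept(query: str) -> Optional[str]:
    """Return the preferred term for a query, or None if it is unknown."""
    return ENTRY_VOCABULARY.get(normalize(query))


def search(query: str, index: Dict[str, List[str]]) -> List[str]:
    """Look up documents through the preferred term, not the raw query."""
    concept = to_concept(query)
    if concept is None:
        return []
    return index.get(concept, [])
```

With this table, `search("Woods", INDEX)` and `search("forest", INDEX)` reach the same documents, while a term missing from the table, such as "trees", retrieves nothing. That last case is the point of the finding: since no single term dominates, an access system needs many entry points per concept, and a hand-built table like this one only covers the variants someone thought to list.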

Content

Note