Readings
  • Baeza, Chapter 2, Sections 2.1 - 2.5.2
  • Harter, Chapter 3

Modern Information Retrieval (Baeza) - Chapter 2

Title: Modeling
Summary:

  • Index term - a keyword or group of related words that has some meaning of its own.  In its most general form, an index term is any word that appears in the text of a document in the collection.  An index term is usually a noun because nouns have meaning by themselves.
  • A central problem in IR systems is predicting which documents are relevant and which are not.  Ranking algorithms are at the core of IR systems.
  • 3 classic models in information retrieval:
    1. Boolean model - documents and queries are represented as sets of index terms (set theoretic).
    2. Vector model - documents and queries are represented as vectors in t-dimensional space (algebraic).
    3. Probabilistic model - modeling document and query representations is based on probability theory (probabilistic).
  • Ad-hoc Retrieval - the documents in the collection remain relatively static while new queries are submitted to the system.
  • Filtering - the queries remain relatively static while new documents come into the system.  The task simply indicates to the user which documents might be of interest.
    • User Profile - a description of the user's preferences.
    • Routing - a variation of filtering in which the filtered documents are ranked and the ranking is shown to the user.
    • The crucial step is not the ranking itself but the construction of a user profile which truly reflects the user's preferences.
    • Relevance Feedback Cycle - the user indicates not only which documents are really relevant but also which are non-relevant.  This feedback is used to refine the user's profile.
  • Formal Characterization of an IR Model - a quadruple [D, Q, F, R(qi, dj)]
    • D - a set composed of logical views or representations for the documents in the collection.
    • Q - a set composed of logical views or representations for the user information needs (queries).
    • F - framework for modeling document representations, queries, and their relationships.
    • R(qi, dj) - a ranking function that associates a real number with a query qi in Q and a document representation dj in D.
  • Boolean model
    • Pro: simplicity and neat formalism
    • Con: a document is predicted to be either relevant or non-relevant; it is not simple to translate an information need into a Boolean expression.
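
The contrast between the Boolean and vector models can be sketched in a few lines of Python. This is only an illustrative sketch, not the formulation from the book: the toy collection, the function names, and the use of raw term counts as vector weights (instead of a weighting scheme such as tf-idf) are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are illustrative only.
DOCS = {
    "d1": "information retrieval ranking models",
    "d2": "boolean retrieval with set operations",
    "d3": "vector space ranking of documents",
}

def index_terms(text):
    """Most general view: every whitespace-separated word is an index term."""
    return text.lower().split()

def boolean_and(query_terms, docs):
    """Boolean model: a document matches only if it contains every query term."""
    wanted = set(query_terms)
    return sorted(
        doc_id for doc_id, text in docs.items()
        if wanted <= set(index_terms(text))
    )

def cosine(query_terms, text):
    """Vector model: cosine of the angle between raw term-count vectors."""
    q = Counter(query_terms)
    d = Counter(index_terms(text))
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

def rank(query_terms, docs):
    """R(q, d): order documents by decreasing similarity, ties broken by id."""
    scores = {doc_id: cosine(query_terms, text) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = ["retrieval", "ranking"]
print(boolean_and(query, DOCS))  # only documents containing both terms
print(rank(query, DOCS))         # every document, with a graded score
```

The Boolean function returns a yes/no answer per document, while the ranking function assigns each document a real number, so partially matching documents still appear in the result, lower in the order.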
On Human Communication (Cherry) - Chapter 1

Title: Communication and Organization
Summary:

  • Speech and writing are by no means our only systems of communication.  Social intercourse is greatly strengthened by habits of gesture - little movements of the hands and face.
  • Communication means a sharing of elements of behavior, or modes of life, by the existence of sets of rules.
  • Sign - any physical event used in communication.
  • 3 types of rule operating upon signs:
    1. Syntactic rules - rules of syntax, relations between signs.
    2. Semantic rules - relations between signs and what they designate (things, actions, relationships, quantities), called designata.
    3. Pragmatic rules - relations between signs and their users.
  • A man has remarkable powers of learning.  Every communication, every perception adds to his accumulation of experiences; he is continually becoming a different person, for his every experience is part of a continuing process.
  • Bees are able to discuss one thing only - food and where to find it.
  • Animal signs can relate only to the future; unlike human language, they never refer to the past.


Online Information Retrieval (Harter) - Chapter 3

Title: Database Structure, Organization, and Search
Summary: This chapter is written from the user's perspective.

  • Record - refers to a document surrogate - a representation of the document for storage and subsequent retrieval.
  • Entity - objects about which information will be stored.  Entities are considered in terms of their characteristics, called attributes.
  • Field - a set of characters that represent the value of an attribute for the entity under consideration.
  • Hierarchy of data elements: bit → byte → subfield → field → record → database → library.
  • Linear File - a set of index records in which each record describes one item or entity; the records are arranged in an order based on the values of one or more attributes.
  • Inverted Index - consists of records, typically alphabetically arranged, that are created from a linear file.
  • Document / Term Matrix - rows are made up of documents or records (linear file), while columns are made up of index terms (inverted index).  Example on page 73.
  • Controlled Vocabulary - can be used for searching related terms.
  • Boolean Operators - And, Or, Not: the order of operations is important and can be ambiguous.
  • Word Proximity - e.g. requiring two search terms to be adjacent; present in a particular field or fields such as abstract or title; present together in any field or sentence; or separated by n or fewer words.
  • Truncation - searching on a piece of a longer word or phrase, usually its leftmost portion, using a wildcard (e.g. *).
  • Stop Words - have no value for indexing or retrieval, and receive no entries in the inverted index (e.g. a, an, and, by).
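
The relationship between a linear file, an inverted index, stop words, Boolean operators, and truncation can be sketched as follows. This is a minimal illustration under stated assumptions, not the structure of any particular search service: the three records, the single title field, and the function names are hypothetical, and the stop list contains only the four examples given in the notes.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "by"}  # only the examples listed in the notes

# Hypothetical linear file: record number -> title field.
LINEAR_FILE = {
    1: "Indexing and retrieval by computer",
    2: "A study of online retrieval",
    3: "Computers and indexing",
}

def build_inverted_index(linear_file):
    """Map each non-stop word to the set of record numbers that contain it."""
    index = defaultdict(set)
    for record_id, text in linear_file.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(record_id)
    # Alphabetical order mirrors how inverted indexes are typically arranged.
    return dict(sorted(index.items()))

def truncation_search(index, stem):
    """Right truncation: 'comput*' matches every term starting with 'comput'."""
    hits = set()
    for term, postings in index.items():
        if term.startswith(stem):
            hits |= postings
    return hits

index = build_inverted_index(LINEAR_FILE)

# Boolean operators become set operations on the posting sets.
print(index["retrieval"] & index["indexing"])  # AND
print(index["online"] | index["computers"])    # OR
print(index["retrieval"] - index["online"])    # NOT
print(truncation_search(index, "comput"))      # comput*
```

Reading the index by term gives the columns of the document/term matrix, and reading the linear file by record gives its rows.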
Information Storage and Retrieval (Korfhage)

Chapter 1 - Overview

  • A person uses an information system in two major ways: to store information in anticipation of a future need, and to find information in response to a current need.
  • An information system is composed of two major portions:
    1. Ectosystem - consists of those system factors that are not under the control of the designer (i.e. user, funder, and server).
    2. Endosystem - consists of those factors that the designer can specify and control (i.e. media used to store the information, the devices used to process the information, the algorithms by which the devices work, and the data structures used to organize the information).
  • Signal → Data → Information → Knowledge → Wisdom
  • The concept of information has both personal and time-dependent components that are not present in the concept of data.
  • Information has a higher level of organization imposed by its relationship to a specific information need.
  • Knowledge builds upon information to form a large, coherent view of a portion of reality.
  • Wisdom adds to this knowledge a broader view still, encompassing all of known reality, and governing the use of the information that has been obtained and the knowledge that has been developed.

Chapter 3 - Query Structures

  • The matching process is complicated by the fact that the query and the documents may have quite different forms.
  • Stemming - reduction of a word to its root form.
  • Proximity Operators - within X words of another word.
  • Boolean Query
    • Cons:
      1. There is no good way to weight terms for significance.
      2. Misstated queries - users who are not experts find AND, OR, and NOT hard to understand.
      3. Order of precedence.
      4. The user is free to enter a very complex query.
      5. The result set can be very small or very large.
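
Two of the points above, order of precedence and proximity operators, can be made concrete with a short Python sketch. The posting sets, the sample sentence, and the `within` helper are hypothetical, and "within X words" is interpreted here as a position difference of at most X, which is only one possible reading.

```python
# Hypothetical posting sets: record numbers that contain each term.
cats = {1, 2}
dogs = {2, 3}
birds = {3, 4}

# The query "cats OR dogs AND birds" has two readings.
and_first = cats | (dogs & birds)      # AND binds tighter than OR
left_to_right = (cats | dogs) & birds  # strict left-to-right evaluation

print(and_first)
print(left_to_right)

def within(text, term_a, term_b, max_distance):
    """True if the two terms occur at positions at most max_distance apart."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(
        0 < abs(i - j) <= max_distance
        for i in positions_a
        for j in positions_b
    )

sentence = "information storage and retrieval systems"
print(within(sentence, "information", "retrieval", 3))
print(within(sentence, "information", "retrieval", 2))
```

The two readings of the same query return different result sets, which is why a system's precedence rules matter to the searcher.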

Copyright 2008 by WillWork.Org