OAC Problem Set II

Concordance Problem

When literary researchers wish to study an author's style, or grammarians want to investigate current or past usage, they have traditionally turned to books called concordances. A concordance is an alphabetical listing of all the words in a text; for each occurrence of a word, the neighboring context is given. Being a book, a concordance is a static presentation of data, and it carries various limitations that result from its original design and production considerations.
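To make the idea concrete, here is a minimal sketch of a concordance-style listing. It is written in Python purely for illustration; the function name build_concordance and the word-count window are assumptions made for this example, not part of the assignment.

```python
import re
from collections import defaultdict

def build_concordance(text, window=3):
    """Map each lowercased word to a list of context strings.

    `window` is the number of neighboring words kept on each side
    (an illustrative choice; printed concordances vary).
    """
    # Treat runs of letters and apostrophes as words (a simplification).
    words = re.findall(r"[A-Za-z']+", text)
    entries = defaultdict(list)
    for i, word in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # Uppercase the headword so it stands out in its context line.
        entries[word.lower()].append(" ".join(left + [word.upper()] + right))
    # Alphabetical order, as in a printed concordance.
    return dict(sorted(entries.items()))

sample = "Alice was beginning to get very tired. Alice had nothing to do."
for headword, contexts in build_concordance(sample, window=2).items():
    for line in contexts:
        print(f"{headword:10} {line}")
```

Note that this sketch fixes the context size in advance, which is exactly the kind of design decision a printed concordance cannot change after publication.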

Some concordances are sentence-based. This means that even if the word being cited is the first word in the sentence, no words from the previous sentence will be provided. Can you think of anything that is missed using this method?

You are to provide access to three English-language books in a similar fashion. You can download the data for free from Project Gutenberg. One of the books must be Alice's Adventures in Wonderland by Lewis Carroll (1832-1898). Use the zipped version: alice30h.zip.

Have a look at the beseda text corpus (a body of word data) at the Institute of Slovenian Language. Type in the word visokost to see what is found. You should use this web site as a model.

Your web page program will have the following characteristics:

  1. It will allow the user to select any combination of the three texts to be searched.
  2. It will allow the user to specify separately the number of characters before and after the item found -- but no partial words will be printed.
  3. It will permit the context printed to go beyond sentence and paragraph boundaries.
  4. Punctuation does not count as part of a word.
  5. The search should ignore case differences, i.e., a search for "the" will find both "The" and "the".
  6. The splat (*) is to be used as a wildcard character. Thus "the*" will find not only "The" and "the", but other words as well, such as "There".
  7. It will use a file called files.list, which contains a list of all of the text files that are available. This file can then be used for creating your main page. This design allows you to add more files by simply downloading them and adding their names to files.list.
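The search requirements above can be sketched roughly as follows. This is one possible approach, shown in Python for illustration only; the function names (pattern_for, trim_context, search, read_file_list) are hypothetical, and the one-name-per-line layout of files.list is an assumption, since the assignment does not specify the file's format. Treating the apostrophe as punctuation (so that "alice" matches inside "Alice's") is also a simplification.

```python
import re

def pattern_for(query):
    """Compile a query in which * stands for zero or more word characters."""
    parts = [re.escape(p) for p in query.split("*")]
    # \b keeps punctuation out of the match; IGNORECASE covers The/the.
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)

def trim_context(text, start, end, before, after):
    """Return (left, right) context of at most `before`/`after` characters,
    dropping any word that would be cut off at the outer edge."""
    lo = max(0, start - before)
    hi = min(len(text), end + after)
    # Window begins mid-word: skip forward past the partial word.
    if lo > 0 and text[lo - 1].isalnum():
        while lo < start and text[lo].isalnum():
            lo += 1
    # Window ends mid-word: back up before the partial word.
    if hi < len(text) and text[hi].isalnum():
        while hi > end and text[hi - 1].isalnum():
            hi -= 1
    # Collapse line and paragraph breaks so context can cross them.
    left = " ".join(text[lo:start].split())
    right = " ".join(text[end:hi].split())
    return left, right

def search(text, query, before=30, after=30):
    """Return a list of (left context, matched word, right context)."""
    results = []
    for m in pattern_for(query).finditer(text):
        if not m.group(0):  # a bare "*" can produce empty matches
            continue
        left, right = trim_context(text, m.start(), m.end(), before, after)
        results.append((left, m.group(0), right))
    return results

def read_file_list(path="files.list"):
    """Read available text file names, assuming one name per line."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

sample = "The cat sat.\n\nThere it was, on the mat; then they left."
for left, hit, right in search(sample, "the*", before=10, after=10):
    print(f"{left} [{hit}] {right}")
```

Because the before and after limits are separate parameters, the user can ask for, say, a short left context and a long right context, and the limits are applied in characters while the output is still trimmed to whole words.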

Something to think about

How is a web-based concordance program better than a concordance in a book?

Is the book concordance still useful? Does it fulfill a need for a researcher better than a computer program?


© DFStermole 2002
Created 3 Oct 02