The programs you wrote to investigate the frequencies of numerical digits and then of letters taught you to process a file and learn something about its content. They may have seemed trivial, but they laid the foundation for this project, which will acquaint you with one of the tasks performed by linguists and text analysts: investigating the frequencies of the words used in a text.
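Your previous programs used the data items (digits or letters) being investigated as subscripts for the frequency counters. In a frequency analysis of the words (instead of characters) in a text, this is not possible in Pascal, because array subscripts must be of an ordinal type and strings are not ordinal. Instead, you will initially use parallel arrays to keep track of the words and how often they appear in the text: the word stored at a given position of one array has its count stored at the same position of the other.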
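The contrast can be sketched as follows. This is only an illustration, not part of the required program: the array sizes, the names letterfreq, ch, and i, and the sample words are all made up for the example.

```pascal
program ParallelDemo;
{ Sketch only: sizes and sample data are made up for illustration. }
var
  letterfreq : array ['A'..'Z'] of integer;  { a char can be a subscript }
  words      : array [1..3] of string[20];   { a string cannot, so ...    }
  freq       : array [1..3] of integer;      { ... counts sit in parallel }
  ch         : char;
  i          : integer;
begin
  for ch := 'A' to 'Z' do
    letterfreq[ch] := 0;
  letterfreq['E'] := letterfreq['E'] + 1;    { counter located directly }

  words[1] := 'Alice';  freq[1] := 2;
  words[2] := 'Was';    freq[2] := 1;
  words[3] := 'Tired';  freq[3] := 1;

  { words[i] and freq[i] always describe the same word }
  for i := 1 to 3 do
    writeln(words[i], ' : ', freq[i]);
end.
```

With letters, the data item itself locates its counter. With words, the program has to search the words array to find the position whose counter should change.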
Your programming task will be divided into three stages.
The remainder of this page deals only with the first stage.
For the arrays to be passed as parameters, you are required to declare programmer-defined types. The following const and type sections make this possible:
const
  LO = 1;
  MAX = 2000;
type
  freqarrayType = array [LO..MAX] of integer;
  wordarrayType = array [LO..MAX] of string[20];
These declarations allow you to declare variables using your own special variable types for the arrays freq and words. Among other variables, you will want to declare the following:
var
  freq        : freqarrayType;  {Array of all unique word frequencies}
  words       : wordarrayType;  {Array of all unique words}
  numch       : longint;        {Number of chars in file}
  wordsunique : integer;        {Number of unique words in file}
  wordcount   : integer;        {Number of words in file}
  ratio       : real;           {Ratio for bars in graph}
  biggest     : integer;        {Biggest frequency}
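The following is a minimal sketch of how the named array types let the arrays be passed to subprograms. The parameter lists, the body of WordAlready, the convention that it returns 0 when the word is absent (which works because LO is 1), and the helper procedure Tally are all assumptions made for illustration. They are not the required design.

```pascal
program TallyDemo;
{ Sketch only: WordAlready's body and the Tally helper are assumptions. }
const
  LO = 1;
  MAX = 2000;
type
  freqarrayType = array [LO..MAX] of integer;
  wordarrayType = array [LO..MAX] of string[20];
var
  freq        : freqarrayType;
  words       : wordarrayType;
  wordsunique : integer;
  i           : integer;

{ Returns the subscript of target in words[LO..count], or 0 if absent. }
function WordAlready (var words : wordarrayType; count : integer;
                      target : string) : integer;
var
  i, found : integer;
begin
  found := 0;
  i := LO;
  while (found = 0) and (i <= count) do
  begin
    if words[i] = target then
      found := i;
    i := i + 1;
  end;
  WordAlready := found;
end;

{ Hypothetical helper: records one occurrence of target. }
procedure Tally (var words : wordarrayType; var freq : freqarrayType;
                 var count : integer; target : string);
var
  where : integer;
begin
  where := WordAlready(words, count, target);
  if where > 0 then
    freq[where] := freq[where] + 1     { seen before: bump its counter }
  else if count < MAX then             { guard against overflowing the arrays }
  begin
    count := count + 1;
    words[count] := target;
    freq[count] := 1;
  end;
end;

begin
  wordsunique := 0;
  Tally(words, freq, wordsunique, 'The');
  Tally(words, freq, wordsunique, 'Rabbit');
  Tally(words, freq, wordsunique, 'The');
  for i := LO to wordsunique do
    writeln(words[i], ' : ', freq[i]);
end.
```

The arrays are passed as var parameters so that the large arrays are not copied on every call. Pascal accepts them as parameters only because freqarrayType and wordarrayType are named types.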
The overall structure of your program will be as follows. The subprograms marked with an f are to be functions, with the return type shown after the name, while the others are to be procedures.
main
|Initialize
|Introduction
|GetFilename
|ProcessFile
|f WordAlready : integer
|f LargestFreq : integer
|f BarRatio : real
|PrintStats
|Menu
|PrintOrigOrder
|PrintTable
For this first stage, the menu will only permit two choices: (1) print the frequency bar graph using the order in which the words were originally encountered and (2) quit the program.
Your output should look like the following, which I created by processing Alice's Adventures in Wonderland. I obtained this text at the Project Gutenberg web site. You can use any text of 100K or more. To avoid a lot of needless typing, you may wish to find a text on the internet.
The file being processed is a:\alice30.txt.
There are 155741 characters.
There are 27346 total words.
There are 2704 unique words.
The greatest frequency for any word is 1634.
The ratio to be used for the bar graph is 0.029.

How would you like the data displayed?
1: Order of first occurrence
Q: Quit

This is a frequency chart for the words in a:\alice30.txt,
listed in order of first occurrence.

 1 ALICE      : 3
 2 S          :****** 202
 3 ADVENTURES : 1
 4 IN         : 2
 5 WONDERLAND : 1
 6 Lewis      : 1
 7 Carroll    : 1
 8 THE        : 9
 9 MILLENNIUM : 1
10 FULCRUM    : 1
11 EDITION    : 1
12 3          : 1
13 0          : 1
14 CHAPTER    : 12
15 I          :**************** 543
16 Down       :*** 102
17 The        :*********************************************** 1634
18 Rabbit     :* 50
19 Hole       : 5
20 Alice      :************ 395
21 Was        :********** 353
22 Beginning  : 14
23 To         :********************* 726
24 Get        :* 46
25 Very       :**** 131
26 Tired      : 7
27 Of         :*************** 511
28 Sitting    : 10
29 By         :** 58
30 Her        :******* 246
31 Sister     : 9
32 On         :****** 193
33 Bank       : 3
34 And        :************************** 869
35 Having     : 10
36 Nothing    :* 34
37 Do         :** 81
38 Once       :* 34
39 Or         :** 77
40 Twice      : 5
41 She        :**************** 548
42 Had        :***** 177
43 Peeped     : 3
44 Into       :** 67
45 Book       : 11

Do you want to continue? (Y/N)
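One way the bars in such a chart could be scaled is sketched below. The sample output does not state how wide the longest bar is or how bar lengths are rounded, so the constant BARWIDTH, the use of round, and the parameter lists of LargestFreq and BarRatio are assumptions for illustration only.

```pascal
program BarDemo;
{ Sketch only: BARWIDTH and the rounding rule are assumptions. }
const
  LO = 1;
  MAX = 2000;
  BARWIDTH = 48;          { hypothetical length of the longest bar }
type
  freqarrayType = array [LO..MAX] of integer;
var
  freq    : freqarrayType;
  biggest : integer;
  ratio   : real;
  i, j    : integer;

{ Returns the largest value in freq[LO..count], or 0 if count is 0. }
function LargestFreq (var freq : freqarrayType; count : integer) : integer;
var
  i, big : integer;
begin
  big := 0;
  for i := LO to count do
    if freq[i] > big then
      big := freq[i];
  LargestFreq := big;
end;

{ Returns the number of stars that represents one occurrence. }
function BarRatio (biggest : integer) : real;
begin
  if biggest > 0 then
    BarRatio := BARWIDTH / biggest
  else
    BarRatio := 0.0;      { avoid dividing by zero for an empty file }
end;

begin
  freq[1] := 1634;  freq[2] := 395;  freq[3] := 5;
  biggest := LargestFreq(freq, 3);
  ratio := BarRatio(biggest);
  writeln('The ratio to be used for the bar graph is ', ratio:5:3, '.');
  for i := 1 to 3 do
  begin
    for j := 1 to round(freq[i] * ratio) do
      write('*');         { bar length is the scaled frequency }
    writeln(' ', freq[i]);
  end;
end.
```

Scaling every frequency by the same ratio keeps the longest bar within the screen width, and words with small counts get no stars at all, as in the chart above.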
Hint: You should start this project using the KISS principle by producing and trying to process a 15-word, 2-sentence file. Simply work on reading in individual words and printing them out on separate lines. What does the existence of the word S in the sample output indicate about the algorithm used to break up the text into "words"?
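As a starting point for that first experiment, here is a minimal sketch that breaks one line of text into words and prints each on its own line. It works on a string rather than a file to stay short. The rule that only letters and digits belong to a word, the names IsWordChar, line, and current, and the sample sentence are assumptions; your own program may draw the line between words and delimiters differently.

```pascal
program SplitDemo;
{ Sketch only: the definition of a word character is an assumption. }
var
  line    : string;
  current : string[20];
  i       : integer;

{ True for the characters assumed to belong to a word. }
function IsWordChar (ch : char) : boolean;
begin
  IsWordChar := ch in ['A'..'Z', 'a'..'z', '0'..'9'];
end;

begin
  line := 'The rabbit ran. Alice saw it run!';
  current := '';
  for i := 1 to length(line) do
    if IsWordChar(line[i]) then
      current := current + line[i]     { extend the word being built }
    else if current <> '' then
    begin
      writeln(current);                { a delimiter ends the word }
      current := '';
    end;
  if current <> '' then
    writeln(current);                  { flush a word that ends the line }
end.
```

Once this prints what you expect, the same loop can be driven by characters read from the file, and each completed word can be handed to the counting code instead of being printed.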