CLOC (V00309)
PREFACE
~~~~~~~
This guide describes a computer program for analysing natural language
text. The development work has been done by Birmingham University Computer
Centre in collaboration with the Department of English Language and
Literature. The program is named CLOC and takes the form of a package of
facilities designed for ease of use by people with little or no computer
experience. The package will be extended as new natural language techniques
are evolved. It currently includes the production of sorted vocabulary lists,
word indexes, concordances, and the automatic discovery of collocations. The
documentation in the following sections refers to mark 2M of the CLOC package
released during July 1986. The name CLOC is an acronym taken from the term
"ColLOCation".
ACKNOWLEDGEMENTS
~~~~~~~~~~~~~~~~
The package and its documentation were written at Birmingham by Mr. A.
Reed. The author would like to offer his thanks to friends and colleagues
who, by criticism and advice, have aided the writing of this guide and the
production of the package. To Professor J. McH. Sinclair of the Department
of English for requesting the program and suggesting several of the features;
to Dr. J. L. Schonfelder for his enthusiasm and advice; and to Professors
Greaves, Benson (York, Toronto) and Brainerd (Toronto) whose interest and
support were most welcome.
BIRMINGHAM UNIVERSITY COMPUTER CENTRE
CLOC USERS GUIDE
CONTENTS
~~~~~~~~
1. INTRODUCTION
2. PREPARATION OF TEXT
2.1 The Character Set
2.2 Capital and Small Letters
2.3 Diacritical Marks
2.4 Foreign Languages
2.5 Text References
2.5.1 Reading
2.5.2 Printing
2.5.3 Example
3. STRUCTURE OF A JOB
4. USING THE CLOC PACKAGE
4.1 The command language
4.1.1 The Control Statement Conventions
4.1.2 Rules for Control Statements
4.1.3 The -INSERT feature
4.1.4 The -SEND feature
4.1.5 The -NOSEND feature
4.2 INPUT DETAILS
4.3 Word definition commands
ITEMIZE USING
*LETTERS
*PADDING
*DEFERRED
*SEPARATORS
*READ AS SPACE
*IGNORE
4.4 Saving Text Files
SAVE TEXT
GET TEXT
4.5 OUTPUT DETAILS
4.6 Word selection commands
EVERY WORD
SELECT WORDS
EXCLUDING
INCLUDING
*LIST OF WORDS
*FREQUENCY
*PATTERN
4.7 Task selection commands
WORDLIST
INDEX
CONCORDANCE
CO-OCCURRENCE
*PHRASE
*SERIES
*PATTERN
COLLOCATIONS
*SPAN
*FREQUENCY
EVERY COLLOCATE
SELECTCOLLOCATE
REJECTING
ACCEPTING
NOTE
WRITETEXT
NEWLINE
NEWPAGE
MESSAGE
FINISH
5. EXAMPLES
APPENDIX I Messages Produced by the CLOC package
APPENDIX II References
APPENDIX III Glossary
APPENDIX IV CLOC Global Syntax Rules
CLOC USER GUIDE
~~~~ ~~~~ ~~~~~
1. INTRODUCTION
CLOC is a package which will enable a novice computer user to analyse
natural language text by computer. This guide explains what the package can
do, and shows the reader how he can instruct it to carry out various tasks.
The package can examine the vocabulary used by an author and be told to
print it in several ways. The vocabulary, or a selected portion of it, can be
printed in order, as in a dictionary, or according to how frequently a word is
used.
The primary purpose of the CLOC package is to produce collocations of
selected words. These are frequently occurring patterns of words which appear
regularly within a text. The package is capable of discovering these patterns
and will print the context of each occurrence in a style chosen by the user.
CLOC will also produce a concordance of words selected from the text, to
show how an author uses the selected words. The amount of context that is
printed can be freely chosen by the user, and the style in which the
concordance is printed can be selected in several simple and convenient ways.
The user can guide and control the actions of the package by employing a
few simple commands, which are supplied to the package by way of statements in
a command language. The following sections explain how this language can be
used to carry out the required tasks.
2. PREPARATION OF TEXT
2.1 THE CHARACTER SET
The text to be analysed must be converted from the printed page or spoken
word into a form that a computer can read. A printed page could contain many
differing alphabets, use several type styles, and allow a large number of
special symbols. As CLOC can only deal with (say) 95 different characters, a
consistent scheme must be devised to convert every letter, punctuation mark,
diacritic and significant change of type style into one or more characters in
this small and restricted alphabet.
This can be achieved by dividing the set of possible characters into
several mutually exclusive categories. Use one category of characters to
compose words, use another to separate one word from the next, and so on. For
English, the first category could include the alphabet A to Z, while the
second could include ? , ; "full stop", and "space". For example, if you
are unlucky enough to prepare your text on punched cards you could represent
the phrase "Once upon a time" as follows:-
=ONCE UPON A TIME
Notice that before every capital letter you should place some symbol (say
= ) to indicate that a capital letter comes next. This can also be used when
a capital letter occurs inside a word as in "MacDonald". This you could
represent as :
=MAC=DONALD
The CLOC package could be told that = is a special kind of letter so it could
read "=MAC=DONALD" as one word rather than the two words "MAC" and "DONALD".
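The capital-letter convention above can be sketched in a few lines of Python.
This is an illustration only and is not part of CLOC; the function name
encode_capitals and the choice of = as the marker are assumptions made for
the example.

```python
def encode_capitals(text: str, marker: str = "=") -> str:
    """Prefix every capital letter with `marker` and emit upper case only,
    as one might when preparing text for an upper-case-only device."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(marker)      # flag that a capital letter comes next
        out.append(ch.upper())
    return "".join(out)


print(encode_capitals("Once upon a time"))   # =ONCE UPON A TIME
print(encode_capitals("MacDonald"))          # =MAC=DONALD
```

If = is then declared to CLOC as a letter, "=MAC=DONALD" stays one word.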
In this document it will be assumed that text is prepared using the 95
printable characters of the ASCII alphabet. This includes the upper and lower
case letters together with a large variety of other symbols. All the CLOC
examples will use upper case letters for clarity only; in practice, CLOC
commands can be given in a mixture of upper and lower case. The CLOC package
allows you to choose which characters are letters and which are not. Normally
you would choose abcdefghijklmnopqrstuvwxyz as your alphabet. Additional
special letters called "padding" and "deferred" can be defined to cope with
accents, apostrophes, breathing marks, etc.
2.2 CAPITAL AND SMALL LETTERS
The CLOC package is designed to work with text containing a mixture of
upper and lower case letters. Thus if you ask for a concordance of (say)
"the" you will get all the places where "the" occurs even when "The" is given
in the text. Usually the case of the letters is not relevant, but if required
you can tell CLOC that the case of letters is significant, in which case "the"
and "The" will be counted as different words. (see the ITEMISE USING command
for details).
2.3 DIACRITICAL MARKS
Marks placed above or below letters to indicate stress can be represented
by two characters, one for the letter and one for the particular mark. For
example one could represent the following:
cliché as clich1e and on punched cards as CLICH1E
where the symbol 1 is used to represent the acute accent. This can be
combined with capitalisation as follows:
Écosse as 1Ecosse and on punched cards as 1=ECOSSE
Note that each diacritical mark must be considered as either a padding or a
deferred special "letter".
2.4 FOREIGN LANGUAGES
When the language of the text is not in the Roman alphabet, the letters
in the language must be converted to characters in the computer's alphabet.
Using Greek as an example and a systematic conversion changing α to A, β to B
and γ to G etc. we could write:-
πολεμικος as POLEMIKOS
where each Greek letter is replaced by a Roman letter. When the natural
language being used contains words from another language they should be
carefully distinguished. This can easily be achieved by prefixing each rarely
occurring foreign word with a special character. Thus when using English
containing French words they could be distinguished by a $ symbol. For
example:-
"In French, lard means bacon" could be coded as:
In French, $lard means bacon
or on punched cards as
=IN =FRENCH, $LARD MEANS BACON
2.5 TEXT REFERENCES
2.5.1 Reading
This feature allows you to include in your text data the name of the
author, the section, chapter, and line number, etc. in a manner similar to
the well known COCOA package. The text references feature is only invoked
when you supply an INPUT DETAILS command with the REFS option in the
specification field. Suppose you had supplied REFS<>; this would tell the
package that your text references were enclosed by the two characters <
and >. Your text data could now include say <A DICKENS><P 1><L 1> which to
you would signify author Dickens, Page 1, Line 1. Subsequent lines would
contain the text data for this section. For the next page it is sufficient
for you to punch <P 2><L 1> as the author has not changed. As the package
reads each line of text the line number is increased by 1, so the line number
should be reset to 1 when you start a new page. Note that <L 1> refers to the
next line, hence no text data should occur after it on the same line. Apart
~~~~ ~~~~
from this restriction, text references may be placed anywhere in your text
data. It is, however, recommended that text references be placed on lines
separate from the text data itself. Blank lines and lines containing text
references only are ignored.
The general form of a text reference is:
aletter gap referenceb
~~~~~~ ~~~ ~~~~~~~~~
where letter is A or B or C ...... Z
~~~~~~
reference is any sequence of characters
~~~~~~~~~
gap denotes one or more spaces
~~~
and ab are the two characters specified on the REFSab part of the
INPUT DETAILS command. (See section 4.2.)
Note
~~~~
a) When L is used as a letter, the reference must be a number
~~~~~~
which will normally be 1.
b) An incorrect text reference will be ignored - you can therefore
include a title of the form: a titleb which the package will
~~~~~
ignore.
2.5.2 Printing
Each line of a printed concordance can be prefixed with a detailed text
reference. The CONCORDANCE, COLLOCATIONS, CO-OCCURRENCE, INDEX and WRITETEXT
commands can contain the keyword REFS followed by letter number pairs, e.g.
~~~~~~ ~~~~~~
REFS A4P2L3. This example will cause the first 4 characters of the current
"A" reference, the first 2 of the "P" reference, and a 3 figure line number to
~~~~~~~~~ ~~~~~~~~~
be placed before every printed citation. Each entry will be separated from
the next by one space. The general rule is that for every letter number pair
~~~~~~ ~~~~~~
occurring after the REFS keyword the first number characters of the current
~~~~~~
value of the letter references are printed.
~~~~~~
A minor variation occurs when a line number of say L2 is asked for, and
the given citation occurs at line 100 or greater. On the first occasion that
this happens the package increases the request by 1, and subsequent references
are now considered to have been requested by L3.
Note that in the first instance all reference letters are considered to
contain "spaces": thus a request of say X4 will cause 4 spaces (+1 space) to
be printed.
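The printing rule can be sketched in Python as follows. This is an
illustration only, not CLOC's own code. The function name format_refs is
hypothetical, and two details are assumptions: short references are padded
with spaces on the right, and line numbers are right-justified. The automatic
widening of the line number field at line 100 is not modelled.

```python
import re


def format_refs(spec: str, refs: dict, line_no: int) -> str:
    """Build the citation prefix for a REFS specification such as 'A4P2L3'."""
    parts = []
    for letter, width in re.findall(r"([A-Za-z])(\d+)", spec):
        letter, width = letter.upper(), int(width)
        if letter == "L":
            # Assumption: the line number is right-justified in `width` figures.
            parts.append(str(line_no).rjust(width))
        else:
            # A reference that was never set is treated as spaces.
            parts.append(refs.get(letter, "")[:width].ljust(width))
    # Each entry is followed by one space.
    return "".join(part + " " for part in parts)


prefix = format_refs("A4P2L3", {"A": "DICKENS", "P": "1"}, 12)
print(repr(prefix))                          # 'DICK 1   12 '
print(repr(format_refs("X4", {}, 1)))        # '     '  (4 spaces + 1 space)
```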
2.5.3 Example
< The Small Celandine >
<A WORDSWORTH><V 1><L 1>
There is a Flower, the Lesser Celandine
That Shrinks, like many more, from cold and rain;
And, the first moment that the sun may shine,
Bright as the sun itself, 'tis out again!
<V 2><L 1>
When hailstones have been falling, swarm on swarm,
Or blasts the green field and the trees distress'd,
Oft have I seen it muffled up from harm,
In close self-shelter, like a Thing at rest.
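To show how references of this kind might be followed through a text, here is
a minimal Python sketch. It is not CLOC's implementation. The name
read_with_refs is hypothetical, and it assumes that reference-only lines do
not advance the line count and that anything enclosed by the two REFS
characters is removed from the text data.

```python
import re


def read_with_refs(lines, open_ch="<", close_ch=">"):
    """Yield (references, line_number, text) for each line of text data."""
    o, c = re.escape(open_ch), re.escape(close_ch)
    valid_ref = re.compile(o + r"([A-Z]) +(.*?)" + c)   # aletter gap referenceb
    any_ref = re.compile(o + r".*?" + c)                # includes ignored titles
    refs, line_no = {}, 1
    for raw in lines:
        for letter, value in valid_ref.findall(raw):
            if letter == "L":
                if value.strip().isdigit():
                    line_no = int(value)    # applies to the next line of text
            else:
                refs[letter] = value.strip()
        text = any_ref.sub("", raw).strip()
        if not text:
            continue        # blank and reference-only lines are ignored
        yield dict(refs), line_no, text
        line_no += 1


poem = [
    "< The Small Celandine >",
    "<A WORDSWORTH><V 1><L 1>",
    "There is a Flower, the Lesser Celandine",
    "That Shrinks, like many more, from cold and rain;",
    "<V 2><L 1>",
    "When hailstones have been falling, swarm on swarm,",
]
for refs, number, text in read_with_refs(poem):
    print(refs["A"], refs["V"], number, text)
```

The title line is not a valid reference (no letter follows the <), so it is
skipped without changing any reference.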
3. STRUCTURE OF A JOB
The user must supply instructions to the package to tell it how to read
and analyse the text. These instructions are prepared according to rules
described in Section 4, and must be supplied to the package in the following
order:
(Optional) INPUT DETAILS command
1. Word definition commands
2. Word selection commands
3. Task selection commands, e.g. word listing, index, concordance,
co-occurrence, collocation and writetext commands
FINISH command
The precise order in which the individual commands should be presented is
described in the Appendix. The layout and effect of each command will be
described later.
The input details command can be used to inform the package of any
special characteristics in the way the text has been prepared.
Word definition commands inform the package how to interpret the text
data which is about to be read. They define the constitution of a word in
~~~~
terms of an alphabet of characters. The CLOC package assumes that a file of
~~~~~~~~~~
text is composed of a series of distinguishable words, clearly separated from
each other by spaces, full stops, etc.
Word selection commands are used to instruct the package to choose a
suitable collection of words, termed "nodes", which will be used by the task
commands during the analysis of the text.
The word listing command causes the package to list the selected words
sorted into a chosen order.
The index command will produce for each word a list of references to its
position in the original text.
The concordance command instructs the package to print a
concordance of the selected words. The style in which the concordance is
printed can be chosen by the user.
The co-occurrence command will produce a concordance of given phrases or
series of words.
The collocations command instructs the package to search the
context of the selected words, and to print the context of those which possess
frequently occurring neighbours.
The writetext command will cause the original text to be printed out, but
each line will start with a text reference.
The following example program illustrates how CLOC commands could appear
in a typical task.
                     ITEMIZE USING  CLOC
word definition      *LETTERS       abcdefghijklmnopqrstuvwxyz
                     SELECT WORDS
word selection       *FREQUENCY     (100 TO 500)
                     EXCLUDING
                     *LIST OF WORDS i the but
Concordance command  CONCORDANCE    KWIC, CITE 6 BY 6
FINISH command       FINISH
This example causes CLOC to do the following things:
(a) Read text composed of a series of words in the English alphabet
(b) Select a collection of words each of which has a frequency of occurrence
lying in the range 100 to 500 inclusive, with the exception of the words 'i'
'the' and 'but'
(c) Produce a Key Word In Context concordance of the chosen words, where each
word is surrounded by 12 words of context.
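The three steps (a) to (c) can be imitated in a short Python sketch. This is
an illustration of the idea only and not CLOC's algorithm. The function name
kwic and its parameters are hypothetical, and for brevity the context is
shown in lower case, whereas CLOC prints citations as they appear in the
original text.

```python
import re
from collections import Counter


def kwic(text, lo, hi, excluded=(), before=6, after=6):
    """Return (left context, node, right context) for every selected word."""
    words = re.findall(r"[a-z]+", text.lower())            # (a) read the words
    freq = Counter(words)
    nodes = {w for w, n in freq.items() if lo <= n <= hi}  # (b) frequency range
    nodes -= set(excluded)                                 #     minus exclusions
    citations = []
    for i, word in enumerate(words):                       # (c) key word in context
        if word in nodes:
            left = " ".join(words[max(0, i - before):i])
            right = " ".join(words[i + 1:i + 1 + after])
            citations.append((left, word, right))
    return citations


sample = "the cat sat on the mat and the dog sat"
for left, node, right in kwic(sample, 2, 3, excluded=["the"], before=2, after=2):
    print(f"{left:>12}  {node}  {right}")
```

With before=6 and after=6 each node is surrounded by up to 12 words, which
corresponds to CITE 6 BY 6 in the example program.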
4. USING THE CLOC PACKAGE
The package needs to be told what actions to perform and in which order
they should be done. It is the function of the 'command language' to supply
this information in a clear and unambiguous way. Each of the following
sections contains detailed information on how to instruct the package to carry
out certain actions. The sections are described in the order in which they
should be presented to the package, when they are phrased in terms of the
'command language'.
4.1 THE COMMAND LANGUAGE
The package is controlled by a series of 'control statements' each of
which contains a 'command'. These commands are obeyed by CLOC in the order in
which they are given. Certain commands are optional and can be included when
the user requires more facilities than those provided by default.
4.1.1 The control statement conventions
CLOC commands must be prepared in a standard format. A control statement
is notionally divided into two distinct 'fields', each of which occupies a
certain number of columns on a line. The fields for CLOC control statements
are as follows:-
(a) The control field occupying columns 1 to 15 inclusive. Its function
~~~~~~~ ~~~~~
is to inform the package of an action to be performed, or to present the
package with further information.
(b) The specification field occupying columns 16 to 80 inclusive. This
~~~~~~~~~~~~~ ~~~~~
portion of the card supplies extra information to the instruction in the
control field.
An example of a command is
WORDLIST       ALPHA
control field  specification field
This command instructs the package to sort the words into alphabetic
order, and to print them.
The command WORDLIST must be placed in columns 1 to 15 of the line. The
specification field, columns 16 to 80, of this command contains the sorting
criterion ALPHA, meaning 'alphabetic order'.
4.1.2 Rules for Control Statements
(a) Each control field must be written within columns 1 to 15.
(b) The commands must be spelt correctly. (Thus CONCORDENCE is not a
valid command!)
(c) When columns 1 to 15 contain spaces only, the specification field of
the line is treated as a continuation of the specification field of the
previous command.
(d) Keywords in the specification field must not contain spaces, nor be
split when continuing them onto the next line.
(e) Some commands contain a star symbol (*) in column 1. This indicates
that the command is subsidiary to the last unstarred command. The starred
commands supply extra information to that supplied by the unstarred command
upon which they are dependent. The star symbol is there for your convenience
to remind you of the function of these commands; the symbol itself is
optional.
(f) The right parenthesis ) can be used to terminate a control field.
The specification field is then deemed to start immediately following it.
Hence, one can write WORDLIST)ALPHA. This feature is intended for use when
CLOC control statements are typed at a terminal rather than punched on cards.
Note that you can use a ")" in column 1 to stand for the 15 space continuation
field described above in part (c).
(g) All CLOC control statements and keywords can be written in upper or
lower case (or a mixture of the two). The examples in this guide are written
in upper case for clarity only.
(h) The pseudo CLOC commands -INSERT -SEND -NOSEND do not take any
continuation lines. They can therefore be placed anywhere in a sequence of
CLOC commands.
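Rules (a), (c), (e), (f) and (g) can be summarised in a short Python sketch
that splits control statements into their two fields. It is an illustration
only; the name parse_statements is hypothetical, joining continuation text
with a single space is an assumption, and the pseudo commands of rule (h)
are not handled.

```python
def parse_statements(lines):
    """Return a list of (command, specification) pairs."""
    statements = []
    for line in lines:
        line = line.rstrip("\n")[:80]
        head = line[:15]                       # columns 1 to 15
        if ")" in head:
            # Rule (f): a right parenthesis ends the control field early.
            cut = head.index(")")
            control, spec = line[:cut], line[cut + 1:]
        else:
            control, spec = head, line[15:]    # columns 16 to 80
        # Rules (e) and (g): the star is optional and case is not significant.
        control = control.strip().lstrip("*").upper()
        spec = spec.strip()
        if control == "" and statements:
            # Rule (c): a blank control field continues the previous statement.
            command, earlier = statements[-1]
            statements[-1] = (command, (earlier + " " + spec).strip())
        else:
            statements.append((control, spec))
    return statements


job = [
    "WORDLIST".ljust(15) + "ALPHA",
    "*LIST OF WORDS".ljust(15) + "i the",
    " " * 15 + "but",
    "concordance)KWIC, CITE 6 BY 6",
]
for command, spec in parse_statements(job):
    print(f"{command:<15}{spec}")
```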
4.1.3 The -INSERT feature.
This allows you to include a file of prewritten CLOC commands. You
could, for example, have a set of files each containing a different sequence
of word definition commands. Alternatively you could have files containing
specific lists of words which you could -INSERT when required.
Example of use
~~~~~~~ ~~ ~~~
a) -INSERT WORDDEF1
SELECT WORDS
b) EXCLUDING
-INSERT VERBS
General form
~~~~~~~ ~~~~
-INSERT filename
~~~~~~~~
The contents of filename are used as if they appeared in
~~~~~~~~
place of the -INSERT command.
Points to note
~~~~~~ ~~ ~~~~
1. There are no continuation lines for this (pseudo) CLOC command.
This allows filename to contain lists of words on continuation
~~~~~~~~
lines only. For example, if a file A contains:
_______________is are was were be
_______________up down in out
we could write
*LIST OF WORDS
-INSERT A
_______________EXTRA-WORDS-HERE
which would be interpreted as if you had written
*LIST OF WORDS
_______________is are was were be
_______________up down in out
_______________EXTRA-WORDS-HERE
2. The syntax of filename depends on the computer that CLOC
~~~~~~~~
is implemented on.
3. Lines taken from an -INSERT file will be copied onto the
CLOC information and diagnostic file. An indication will be
given as to where the lines came from.
4. The contents of an -INSERT file may include other -INSERT commands.
The depth of nesting is implementation dependent.
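The textual substitution performed by -INSERT can be pictured with the
following Python sketch. It is an illustration only: the name expand_inserts
is hypothetical, a dictionary stands in for the computer's filing system, and
the nesting limit of 8 is an arbitrary assumption, since the real limit is
implementation dependent.

```python
def expand_inserts(lines, files, depth=0, max_depth=8):
    """Replace each -INSERT line by the contents of the named file."""
    if depth > max_depth:
        raise RecursionError("-INSERT files are nested too deeply")
    expanded = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].upper() == "-INSERT":
            # The inserted lines may themselves contain -INSERT commands.
            expanded.extend(
                expand_inserts(files[fields[1]], files, depth + 1, max_depth)
            )
        else:
            expanded.append(line)
    return expanded


gap = " " * 15                      # the 15 column continuation field
files = {"A": [gap + "is are was were be", gap + "up down in out"]}
job = ["*LIST OF WORDS", "-INSERT A", gap + "EXTRA-WORDS-HERE"]
for line in expand_inserts(job, files):
    print(line)
```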
4.1.4 The -SEND feature
This (pseudo) CLOC command will cause subsequent CLOC commands to be sent
to the CLOC diagnostic and information file.
Example and General form
~~~~~~~ ~~~ ~~~~~~~ ~~~~
-SEND
4.1.5 The -NOSEND feature
This (pseudo) CLOC command will stop the normal process of sending
CLOC commands to the CLOC diagnostic and information file.
Example and General form
~~~~~~~ ~~~ ~~~~~~~ ~~~~
-NOSEND
4.2 The INPUT DETAILS command (optional)
This command allows you to specify the maximum width of lines of text
data; to indicate the presence of text references; to determine which parts
to skip; to define an explicit newline symbol; to include ignorable
comments; and to specify rules about line continuation.
When this command is absent INPUT DETAILS WIDTH80 is assumed.
Examples:
a) INPUT DETAILS WIDTH72
b) INPUT DETAILS NEWLINE/,REFS<>
c) INPUT DETAILS WIDTH128,NEWLINE/,CONTINUE+
d) INPUT DETAILS COMMENT(),NEWLINE/,CONTINUE+,REFS<>
General Form:
INPUT DETAILS WIDTHnumber,SKIPab,COMMENTab,NEWLINEa,
CONTINUEa,RUNOVER,REFSab
Default value:
INPUT DETAILS WIDTH80
Parameters
~~~~~~~~~~
WIDTHnumber default value WIDTH80
At most number characters will be read for each line of text data.
~~~~~~
Trailing spaces will be removed by the package. All characters which occur
after column number will be ignored.
~~~~~~
SKIPab
When present this instruction causes the package to ignore all characters
between a and b inclusive. This option withdraws characters a and b
from the available character set.
COMMENTab
Words which occur between the pair of characters aa and the pair of
~~~~ ~~ ~~~~ ~~
characters bb will not appear in the word count tables, but they will appear
in the context when a citation is printed. This option withdraws the
characters a and b from the available character set.
NEWLINEa
When present the character a represents a logical newline. This option
allows more than one "line of text" to be placed on the same line. Note that
a will also be inserted automatically at the end of each line. This option
withdraws character a from the available character set.
CONTINUEa
When CONTINUEa is present and a is found in the text, all characters
remaining on the line are ignored. The next line is considered to replace the
ignored part, and to be on the same line.
RUNOVER
When RUNOVER is present, and the text reading position is at the
end-of-line (i.e. at the WIDTH number position), the end-of-line will not
~~~
terminate a word. Hence the full width of line can be used to store text, and
~~~~~~~~~
words can run over onto the next line.
REFSab
When present the package extracts text references of the form aletter
~~~~~~
referenceb from the text file. This option withdraws characters a and b
~~~~~~~~~
from the available character set.
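As a way of checking one's understanding of the specification field, the
following Python sketch decodes it into a table of options. It is not the
parser used by CLOC; the function name parse_input_details and the error
messages are hypothetical. Each keyword is read by position, so that the
delimiter characters themselves may be any character.

```python
import re

# Number of characters that follow each keyword (WIDTH is handled separately).
ARGUMENT_LENGTH = {"SKIP": 2, "COMMENT": 2, "REFS": 2,
                   "NEWLINE": 1, "CONTINUE": 1, "RUNOVER": 0}


def parse_input_details(spec: str) -> dict:
    """Decode a specification such as 'WIDTH128,NEWLINE/,CONTINUE+'."""
    options = {"WIDTH": 80}                    # default when WIDTH is absent
    pos = 0
    while pos < len(spec):
        if spec[pos] in ", ":
            pos += 1
            continue
        rest = spec[pos:].upper()
        if rest.startswith("WIDTH"):
            digits = re.match(r"\d+", spec[pos + 5:])
            if digits is None:
                raise ValueError("WIDTH must be followed by a number")
            options["WIDTH"] = int(digits.group())
            pos += 5 + digits.end()
            continue
        for keyword, length in ARGUMENT_LENGTH.items():
            if rest.startswith(keyword):
                start = pos + len(keyword)
                argument = spec[start:start + length]
                if len(argument) != length:
                    raise ValueError(f"{keyword} needs {length} character(s)")
                options[keyword] = argument if length else True
                pos = start + length
                break
        else:
            raise ValueError(f"unrecognised option: {spec[pos:]!r}")
    return options


print(parse_input_details("COMMENT(),NEWLINE/,CONTINUE+,REFS<>"))
```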
4.3 WORD DEFINITION COMMANDS
The CLOC package has been designed to read text punched according to many
differing conventions. No matter how a text has been coded, the package
interprets it as an arbitrary series of words. The composition of words is
~~~~~
left up to you, but CLOC needs to know what rules are used for constructing
words. These rules embody a strategy for extracting words from the characters
in the text data. The process of combining characters in this way is called
itemization and one must first select which itemizing strategy the package is
~~~~~~~~~~~
to use; the rules of the strategy are supplied by subsidiary (starred)
commands.
ITEMIZE USING. (The -ISE ending can be used if desired.)
This command is used to select a strategy for itemizing the text.
Example
~~~~~~~
ITEMIZE USING CLOC
General form
~~~~~~~ ~~~~
ITEMIZE USING strategy name
~~~~~~~~ ~~~~
Two possibilities are available at present. They are
a) CLOC
b) CLOC UNCHANGED
Strategy a) ensures that words which differ only by the case of their
letters, and/or contain *PADDING letters (q.v.) are counted as the same
word. When strategy b) is chosen, words will always be distinguished by
the case of their letters and the presence of *PADDING letters.
For example, consider the sentence:-
The MacDonald Hotel is different from the Mac'donald Motel.
Assuming that the apostrophe ' has been designated a padding letter, then
when the CLOC itemising strategy is in use the word "the" is deemed
to occur twice, as does the word "macdonald". When a CONCORDANCE
or COLLOCATIONS task (etc.) is run, it too will treat the various forms as
if they were the same word. The citations will of course look like the
original text. The effect of the CLOC itemising strategy is that:-
The         is mapped to  "the"
the         is mapped to  "the"
MacDonald   is mapped to  "macdonald"
Mac'donald  is mapped to  "macdonald"
Hotel       is mapped to  "hotel"
Motel       is mapped to  "motel"
is          is mapped to  "is"
different   is mapped to  "different"
from        is mapped to  "from"
When CLOC UNCHANGED is used all the above words are considered distinct.
Other itemization strategies may be introduced in future versions of
the package.
Default
~~~~~~~
When strategy name is absent CLOC is assumed.
~~~~~~~~ ~~~~
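The difference between the two strategies can be sketched in Python. This is
an illustration of the mapping described above and not CLOC's code; the
function name itemize and its parameters are hypothetical.

```python
def itemize(word: str, padding: str = "'", unchanged: bool = False) -> str:
    """Return the form under which `word` is counted in the vocabulary."""
    if unchanged:
        return word                  # CLOC UNCHANGED: every form is distinct
    # CLOC: ignore the case of letters and drop *PADDING letters.
    return "".join(ch for ch in word.lower() if ch not in padding)


sentence = "The MacDonald Hotel is different from the Mac'donald Motel"
for word in sentence.split():
    print(f"{word:<12}is mapped to  \"{itemize(word)}\"")
```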
The ITEMIZE USING CLOC command has a number of subsidiary commands.
These commands tell the package how to interpret the characters it finds on
the lines containing the text data.
*LETTERS
This command is mandatory and must be the first command which follows the
ITEMIZE USING CLOC command. This informs the package of the alphabet of
characters out of which words are composed. A word is defined to be one or
~~~~
more consecutive letters. Every character which could form part of a word
must be specified here. This includes characters used for accents,
apostrophes, hyphenation, changes of type style etc.
Example
~~~~~~~
*LETTERS abcdefghijklmnopqrstuvwxyz
General form
~~~~~~~ ~~~~
*LETTERS letter characters
~~~~~~ ~~~~~~~~~~
The order in which the letter characters appear in the command is
~~~~~ ~~~~~~ ~~~~~~~~~~
significant. This order determines the way in which words will be
alphabetically sorted. In the above example, those words beginning with 'a'
will precede those starting with 'b', and so on. Thus 'alan' will sort before
'fred' which itself precedes 'freda'. Note that this command automatically
caters for upper and lower case letters.
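The effect of the order of the letter characters on sorting can be sketched
in Python. This is an illustration only; the name make_sort_key is
hypothetical, and the sketch assumes every character of a word has been
declared as a letter.

```python
def make_sort_key(letters: str):
    """Build a sorting key from the order of characters on *LETTERS."""
    rank = {ch: position for position, ch in enumerate(letters.lower())}

    def key(word: str):
        # Upper and lower case letters share the same rank.
        return [rank[ch] for ch in word.lower()]

    return key


words = ["freda", "Fred", "alan"]
print(sorted(words, key=make_sort_key("abcdefghijklmnopqrstuvwxyz")))
# ['alan', 'Fred', 'freda']
print(sorted(words, key=make_sort_key("zyxwvutsrqponmlkjihgfedcba")))
# ['Fred', 'freda', 'alan']
```

Reversing the alphabet on the command reverses the order of first letters,
but a word still precedes any longer word that begins with it.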
*PADDING
This command is optional and when present informs the package of those
letter characters which are to be ignored when words are placed in the
~~~~~~ ~~~~~~~~~~
vocabulary table. Usually this command will contain those letter characters
~~~~~~ ~~~~~~~~~~
used as apostrophes or hyphenation, but any characters specified on the above
*LETTERS command could also be used.
When the CLOC itemising strategy is chosen the *PADDING letters cannot appear
in the vocabulary print-outs, because they are absent from the vocabulary
table.
Note that if you were to choose an itemising strategy (e.g. CLOC UNCHANGED)
which allowed padding letters to appear in the vocabulary table, they would be
ignored when sorting took place.
Example
~~~~~~~
(a) *PADDING ' -
Words containing the apostrophe and/or the hyphen will have them removed
before the word is stored in the vocabulary table.
General form
~~~~~~~ ~~~~
*PADDING letter characters
~~~~~~ ~~~~~~~~~~
where every character declared must be a letter character declared on the
~~~~~~ ~~~~~~~~~
*LETTERS command. This is a deliberate design decision to emphasise that CLOC
defines a word to be a sequence of letters.
~~~~~~
*DEFERRED
This command is optional and when present informs the package of those
letter characters which are to be ignored when words are sorted
~~~~~~ ~~~~~~~~~~
alphabetically. Usually this command will contain those letter characters
~~~~~~ ~~~~~~~~~~
used for accents and changes of type style, but any characters specified on
the above *LETTERS command could also be used. Words which contain *DEFERRED
letters will be counted separately.
Examples
~~~~~~~~
(a) *DEFERRED -
Hyphenated words will be separately indexed.
(b) *DEFERRED aeiou
Words will be sorted alphabetically ignoring vowels.
General form
~~~~~~~ ~~~~
*DEFERRED letter characters
~~~~~~ ~~~~~~~~~~
where every character declared must be a letter character declared on the
~~~~~~ ~~~~~~~~~
*LETTERS command. This is a deliberate design decision to emphasise that CLOC
defines a word to be a sequence of letters, and that the deferred feature only
~~~~~~
affects the sorting order.
This command ensures that words which differ only in (say) diacritical
marks are adjacent in an alphabetically ordered dictionary. Words will always
be distinguished by their *DEFERRED letters, each will have a separate entry
in the vocabulary table. Note that whenever two words differ only in deferred
~~~~
letters, their sorting order is determined by the order of the deferred
letters on the *LETTERS command.
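A sorting key with deferred letters might be sketched in Python as follows.
This is an illustration only, not CLOC's sorting routine. The name
make_deferred_key is hypothetical, and comparing only the deferred letters
(without regard to their position in the word) as the tie-break is a
simplifying assumption. The character 1 stands for the acute accent, as in
section 2.3.

```python
def make_deferred_key(letters: str, deferred: str):
    """Sort on the ordinary letters first, then on the deferred letters."""
    rank = {ch: position for position, ch in enumerate(letters)}

    def key(word: str):
        primary = [rank[ch] for ch in word if ch not in deferred]
        tie_break = [rank[ch] for ch in word if ch in deferred]
        return (primary, tie_break)

    return key


letters = "abcdefghijklmnopqrstuvwxyz1"
words = ["click", "clichy", "clich1e", "cliche"]
print(sorted(words, key=make_deferred_key(letters, deferred="1")))
# ['cliche', 'clich1e', 'clichy', 'click']
```

Without the *DEFERRED declaration, "clich1e" would sort after "clichy" and so
be separated from "cliche"; both forms keep their own vocabulary entry either
way.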
*SEPARATORS
This command is optional and when present informs the package of those
characters which separate one word from the next. When this command is absent
every character that is not declared by the *LETTERS command is automatically
assumed to be a separator. The symbols one would use to separate one word
from the next might be the full stop, comma, semicolon, etc. The CLOC package
always takes a 'space' to be a separator.
Example
~~~~~~~
*SEPARATORS ? ! ; .
General form
~~~~~~~ ~~~~
*SEPARATORS separator characters
~~~~~~~~~ ~~~~~~~~~~
The order in which separator characters appear on this command is of no
~~~~~~~~~ ~~~~~~~~~~
significance. Note that a character must not (and cannot) be declared both as a
letter character and as a separator character at one and the same time. Those
~~~~~~ ~~~~~~~~~ ~~~~~~~~~ ~~~~~~~~~
characters which are neither letter characters nor separator characters will
~~~~~~ ~~~~~~~~~~ ~~~~~~~~~ ~~~~~~~~~~
be assumed to signify 'spaces' and will be interpreted as if they were
declared on the following control statement.
*READ AS SPACE
This command is optional and when present informs the package of those
characters which signify a space. These characters although present in the
text data will be assumed to stand for the space character and will be printed
as such when concordances and collocations are produced.
Example
~~~~~~~
*READ AS SPACE %
General form
~~~~~~~ ~~~~
*READ AS SPACE space characters
~~~~~ ~~~~~~~~~~
The order in which the space characters appear on this command is of no
~~~~~ ~~~~~~~~~~
significance. This command can be used to remove punctuation marks from a
text or to cause one word to be read as several. For example, if the text
contained N'EST%PAS, the package would read it as two words N'EST and PAS, and
would print it as N'EST PAS. If the % sign were declared as a (padding)
letter character instead of on the *READ AS SPACE command, N'EST%PAS would be
~~~~~~ ~~~~~~~~~
read as a single word, and printed as N'EST%PAS.
~~~~
*IGNORE
This command is optional and when present informs the package of those
characters which are to be totally ignored when the text is read.
~~~~~~~ ~~~~~~~
Example
~~~~~~~
*IGNORE @ /
General form
~~~~~~~ ~~~~
*IGNORE ignore characters
~~~~~~ ~~~~~~~~~~
The order in which the ignore characters appear on this command is of no
~~~~~~ ~~~~~~~~~~
significance. This command can be used to ignore characters which were placed
in the text for special purposes. As an example one could cause 'house-wife'
to be read as if it were 'housewife' by declaring "-" as an ignore character.
~~~~~~ ~~~~~~~~~
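The way the character classes work together can be sketched in Python. This
is a simplified illustration and not CLOC's reader. The name itemize_text and
its parameters are hypothetical, and dropping ignored characters from the
printed form is an assumption.

```python
def itemize_text(text, letters, separators=None, read_as_space="", ignore=""):
    """Return (words, printed form) for one line of text data."""
    words, current, shown = [], [], []
    for ch in text:
        if ch in ignore:
            continue                         # *IGNORE: as if never present
        if ch.lower() in letters:
            current.append(ch)               # *LETTERS: part of a word
            shown.append(ch)
            continue
        if current:                          # any other character ends a word
            words.append("".join(current))
            current = []
        if ch in read_as_space:
            keep = False                     # *READ AS SPACE
        else:
            # With no *SEPARATORS command every non-letter is a separator;
            # otherwise undeclared characters are read as spaces.
            keep = ch == " " or separators is None or ch in separators
        shown.append(ch if keep else " ")
    if current:
        words.append("".join(current))
    return words, "".join(shown)


alphabet = "abcdefghijklmnopqrstuvwxyz'"
print(itemize_text("N'EST%PAS", alphabet, read_as_space="%"))
# (["N'EST", 'PAS'], "N'EST PAS")
print(itemize_text("N'EST%PAS", alphabet + "%"))
# (["N'EST%PAS"], "N'EST%PAS")
print(itemize_text("house-wife", alphabet, ignore="-"))
# (['housewife'], 'housewife')
```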
4.4 SAVING TEXT FILES
THE ITEMIZATION PROCESS
The package treats the text as a series of words separated from each
~~~~~
other by separators. Thus:
~~~~~~~~~~
text: separator word separator ... word separator
~~~~~~~~~ ~~~~ ~~~~~~~~~ ~~~ ~~~~ ~~~~~~~~~
The composition of words and separators, and the method of extracting
~~~~~ ~~~~~~~~~~
them from a text, are chosen by the ITEMIZE USING command and its subsidiary
commands. Every time a CLOC job is run the text will be read word by word,
~~~~ ~~~~
and carefully saved in a special form which permits rapid production of
concordances and collocations. Whenever the same text is to be examined
several times it is clearly desirable to use this special form and save
computer time by not reading the same text over and over again. The following
commands achieve this aim.
The SAVE TEXT and GET TEXT commands
These commands cause text to be stored in, and returned from, the
computer's filing system. Their function is to bypass the text reading stage
and so allow computer time to be saved when many analyses are performed on one
file of text. Further information on the filing system can be obtained from
your local computer centre.
The SAVE TEXT command causes itemised text to be placed in a permanent
file, named filename.
~~~~~~~~
The GET TEXT command causes itemized text to be retrieved from a
permanent file, named filename, previously created by the SAVE TEXT command.
~~~~~~~~
General form
~~~~~~~ ~~~~
SAVE TEXT filename
~~~~~~~~
GET TEXT filename
~~~~~~~~
Examples
~~~~~~~~
On one run of the package the following commands are sufficient to read
the text and store it in the special form.
ITEMIZE USING CLOC
*LETTERS abcdefghijklmnopqrstuvwxyz
SAVE TEXT MYFILE
FINISH
Page 19
Once the file of text has been saved, the following jobs could be run in
which the GET TEXT command replaces the itemizing instructions in the previous
example.
GET TEXT MYFILE
EVERY WORD
WORDLIST ALPHA
FINISH
and on some other run one could write:
GET TEXT MYFILE
SELECT WORDS
*PATTERN *.ing
CONCORDANCE KWIC, CITE 4 BY 4
FINISH
4.5 THE OUTPUT DETAILS COMMAND (optional)
This command allows you to choose the maximum line width for wordlists
and citations to suit your particular lineprinter or terminal device.
When this command is absent OUTPUT DETAILS WIDTH 120 is assumed.
Example
~~~~~~~
OUTPUT DETAILS WIDTH80
General form
~~~~~~~ ~~~~
OUTPUT DETAILS WIDTHnumber
~~~~~~
Parameters
~~~~~~~~~~
WIDTHnumber
~~~~~~
No more than number character positions will be reserved on the output device.
~~~~~~
All word lists will be packed into a line of this width.
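As a rough sketch of how entries might be packed into a line of the chosen width, consider the following Python fragment. The function name pack_words and the fixed-width cell layout are assumptions made for illustration; the package's actual layout may differ.

```python
def pack_words(entries, width=120):
    """Pack 'frequency word' entries into lines no wider than `width` characters."""
    if not entries:
        return []
    # assumption: every entry occupies a cell as wide as the longest entry plus one space
    cell = max(len(e) for e in entries) + 1
    per_line = max(1, width // cell)
    lines = []
    for i in range(0, len(entries), per_line):
        row = entries[i:i + per_line]
        lines.append("".join(e.ljust(cell) for e in row).rstrip())
    return lines


for line in pack_words(["3 the", "1 cat", "1 dog"], width=12):
    print(line)
```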
4.6 WORD SELECTION COMMANDS
This section describes how one can select, from the vocabulary of the
text, a collection of words for analysis. This collection is used by
subsequent commands when performing sorting, and producing concordances and
collocations. One can choose the entire vocabulary or select a portion of it.
An exclusion facility is provided which operates on the complete vocabulary or
on the portion selected.
The command EVERY WORD specifies the entire vocabulary of the text. The
SELECT WORDS command is used, in conjunction with several subsidiary commands,
to define a given portion of the vocabulary. The EXCLUDING command can be
used to remove unwanted words and reduce the size of the above collection. (A
further command INCLUDING is provided in case your exclusion commands remove
too many words.)
The set of words defined using the SELECT WORDS, EXCLUDING, or INCLUDING
commands is specified by way of several subsidiary commands. These are termed
set description commands and are described later.
~~~ ~~~~~~~~~~~
Page 20
To choose a collection of words one can use either of the following two
constructions, without supplying an exclusion list.
(a) EVERY WORD The entire vocabulary is selected
(b) SELECT WORDS The following set description
~~~ ~~~~~~~~~~~
set description describes the words to be used.
~~~ ~~~~~~~~~~~
The exclusion list can be placed after either of the above constructions
to give the following alternatives.
(c) EVERY WORD The entire vocabulary
EXCLUDING excluding
set description this set description is selected.
~~~ ~~~~~~~~~~~ ~~~ ~~~~~~~~~~~
(d) SELECT WORDS The words specified in
set description1 set description1 are used,
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
EXCLUDING excluding those words in
set description2 set description2.
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
(e) EVERY WORD The entire vocabulary
EXCLUDING excluding
set description1 this set description1,
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
INCLUDING but including
set description2 set description2
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
(f) SELECT WORDS The words specified in
set description1 this set description1
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
EXCLUDING excluding
set description2 this set description2
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
INCLUDING but including
set description3 the set description3
~~~ ~~~~~~~~~~~~ ~~~ ~~~~~~~~~~~~
The commands EVERY WORD and SELECT WORDS set description define
~~~ ~~~~~~~~~~~
a working set of words. You can then use the EXCLUDING commands to
remove words from the working set, and the INCLUDING commands to
add words to the working set. You can repeat the EXCLUDING and
INCLUDING commands as often as you need to get precisely the
collection of words that you are interested in.
Here are a few examples:
1. SELECT WORDS
*PATTERN *ing
Page 21
2. SELECT WORDS
*PATTERN *ing
EXCLUDING
*LIST OF WORDS running jumping
3. EVERY WORD
EXCLUDING
*PATTERN *ing
INCLUDING
*LIST OF WORDS running jumping
Example 1 selects all words that end with 'ing'.
Example 2 selects all 'ing' words apart from 'running' and 'jumping'.
Example 3 chooses the whole vocabulary less the 'ing' words,
but with 'running' and 'jumping' included.
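The three examples can be read as operations on a working set. The Python sketch below models them with ordinary set arithmetic; the small vocabulary is invented for illustration, and it is an assumption that INCLUDING can only restore words which actually occur in the text.

```python
# Hypothetical vocabulary, standing in for the words found in a text.
vocabulary = {"run", "running", "jumping", "singing", "house"}
listed = {"running", "jumping"}            # *LIST OF WORDS running jumping


def ends_ing(words):
    """Stand-in for *PATTERN *ing."""
    return {w for w in words if w.endswith("ing")}


# Example 1: SELECT WORDS with *PATTERN *ing
ex1 = ends_ing(vocabulary)

# Example 2: the same selection, EXCLUDING the listed words
ex2 = ends_ing(vocabulary) - listed

# Example 3: EVERY WORD, EXCLUDING the 'ing' words, INCLUDING the listed words
# (assumption: only words present in the vocabulary can be added back)
ex3 = (vocabulary - ends_ing(vocabulary)) | (listed & vocabulary)

print(sorted(ex1), sorted(ex2), sorted(ex3))
```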
Set description commands
~~~ ~~~~~~~~~~~
These commands all have a star symbol(*) in column 1, showing they are
subsidiary to the previous unstarred command. Three ways of choosing words
are provided. They are, by frequency of occurrence, by an explicit list, or
by a pattern. Each of the following commands may be repeated as often as
required.
frequency of occurrence - The *FREQUENCY command.
~~~~~~~~~ ~~ ~~~~~~~~~~
This command is used to choose a set of words each member of which has a
particular frequency of occurrence or lies in a given frequency range.
Examples
~~~~~~~~
(a) *FREQUENCY (100 TO 500)
This command will select only those words which occur between 100 and 500
times inclusive.
(b) *FREQUENCY 1 OR 4 OR >50
This command will select words which occur exactly once, exactly four
times, or more than 50 times.
General form
~~~~~~~ ~~~~
*FREQUENCY expression
~~~~~~~~~~
where expression is one or more terms connected by OR symbols. A term is
~~~~~~~~~~ ~~~~ ~~~~
one of the following:
(a) integer for example 10
~~~~~~~
only words occurring exactly integer times will be selected.
~~~~~~~
(b) >integer for example >10
~~~~~~~
only words occurring more than integer times will be selected.
~~~~~~~
Page 22
(c) <integer for example <10
~~~~~~~
only words occurring less than integer times will be selected.
~~~~~~~
(d) (integer1 TO integer2) for example (100 TO 500)
~~~~~~~~ ~~~~~~~~
only words lying in the range integer1 to integer2 inclusive
~~~~~~~~ ~~~~~~~~
will be selected. Note that integer1 must be smaller than
~~~~~~~~
integer2.
~~~~~~~~
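The four kinds of term can be summarised by a small evaluator. The Python sketch below is an illustration under stated assumptions, not the package's parser: the names parse_term and frequency_filter are hypothetical, and terms are assumed to be separated by the word OR with spaces on either side.

```python
import re


def parse_term(term):
    """Return a test function for one *FREQUENCY term."""
    term = term.strip()
    m = re.fullmatch(r"\(\s*(\d+)\s+TO\s+(\d+)\s*\)", term)
    if m:                                   # (integer1 TO integer2), inclusive
        lo, hi = int(m.group(1)), int(m.group(2))
        if lo >= hi:
            raise ValueError("integer1 must be smaller than integer2")
        return lambda n: lo <= n <= hi
    if term.startswith(">"):                # more than integer
        k = int(term[1:])
        return lambda n: n > k
    if term.startswith("<"):                # less than integer
        k = int(term[1:])
        return lambda n: n < k
    k = int(term)                           # exactly integer
    return lambda n: n == k


def frequency_filter(expression):
    """Combine the terms of an expression; a word passes if any term accepts it."""
    tests = [parse_term(t) for t in re.split(r"\s+OR\s+", expression.strip())]
    return lambda n: any(t(n) for t in tests)


accept = frequency_filter("1 OR 4 OR >50")
print([n for n in (1, 2, 4, 50, 51) if accept(n)])
```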
An explicit list - The *LIST OF WORDS command.
~~ ~~~~~~~~ ~~~~
This command allows one to specify a set of words of interest by
supplying them explicitly.
Note that when the CLOC itemising strategy is in use, each item in the
explicit list will be mapped to a "word". Thus you do not need to supply the
exact case of the letters nor include padding letters.
Example
~~~~~~~
*LIST OF WORDS this that me you
General form
~~~~~~~ ~~~~
*LIST OF WORDS list
~~~~
where list is one or more words separated from each other by one or more
~~~~
spaces.
A Pattern - The *PATTERN command
~ ~~~~~~~
This command specifies a skeletal form of a word, and causes the package
to select only those words which match the specified pattern. Two reserved
characters are used within a pattern:
(a) a dummy-symbol which is .
(b) a variable-symbol which is *
The dummy-symbol stands for any letter.
~~~ ~~~~~~
The variable-symbol stands for "any sequence of letters, including none
at all".
These reserved characters can be used in combination with the letter
~~~~~~
characters defined by the word definition commands, to construct a pattern.
~~~~~~~~~~
(a) *PATTERN run*
(b) *PATTERN *ing
(c) *PATTERN pre*ed
Page 23
In (a) all words which start with 'run' are selected.
In (b) all words which end with 'ing' are selected.
In (c) all words which start with 'pre' and end in 'ed' are selected.
(d) *PATTERN *ing *ed
(e) *PATTERN a* b* c*
These examples show how more than one pattern can be included on the same
*PATTERN command line. Each is separated from the next by at least one space.
In (d) all words which end in 'ing' or end in 'ed' are selected. This is
~~
equivalent to having a *PATTERN line for '*ing' and another one for '*ed'.
In (e) all words which start with 'a' or 'b' or 'c' are selected. One
~~ ~~
can use this feature to produce a full concordance in sections; first the
'a', 'b', and 'c's then 'd', 'e', 'f's etc.
(f) *PATTERN ....
(g) *PATTERN .h.a...
(h) *PATTERN *...ing
In (f) all four letter words will be chosen.
In (g) all seven letter words with 2nd letter 'h' and 4th letter 'a' will
be selected.
In (h) all words of at least six letters which end in 'ing' will be
~~ ~~~~~
picked out.
NOTE If, within a pattern, "*" and/or "." are being used as letters, the
~~~~
following option can be used to define your own variable and dummy symbols.
The DUMMYaVARIABLEb option
This option must be used whenever a given pattern is to contain "*" or
"." as letters. The revised symbols apply for the current *PATTERN command
only.
Examples:
(a) *PATTERN DUMMY?VARIABLE- *run-
(b) *PATTERN DUMMY.VARIABLE? ?...ing*
In (a) the "?" temporarily replaces "." as the dummy-symbol; the "-"
temporarily replaces "*" as the variable-symbol. All words which start with
'*run' are selected.
In (b) all words at least seven letters long and ending with 'ing*' are
~~ ~~~~~
selected.
General Form
~~~~~~~ ~~~~
either a) *PATTERN pattern1 pattern2 pattern3 etc.
or b) *PATTERN DUMMYaVARIABLEb pattern1 pattern2 pattern3 etc.
Notes:
1. In a) DUMMY.VARIABLE* is implied before the first pattern.
2. At least one pattern must appear on the command.
3. A pattern consists of letters, the dummy-symbol, and the
~~~~~~~
variable-symbol in any combination.
Page 24
4. In (b) the character "a" becomes the new dummy-symbol, overriding ".",
the character "b" becomes the new variable-symbol, overriding "*".
5. When the CLOC itemising strategy is in use each explicit
pattern will be carefully mapped to one which will match
the various "words" in the vocabulary. Thus the pattern 'run*'
will match "run" "running" "Run" "RUNNING" etc. Padding letters will
be ignored since the vocabulary words do not contain them. For example,
a) if ' was a padding letter then .... would match "don't"
b) if ' was a deferred letter then .... would not match "don't"
This is because in case a) padding letters are removed before the
word is stored in the vocabulary table, so it looks like "dont".
In practice it is sufficient for you to examine the vocabulary table
printed using the WORDLIST command (q.v.) to find out what "words"
are in the vocabulary.
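One way to picture pattern matching is as a translation into a regular expression. The Python sketch below makes several assumptions: the names pattern_to_regex and matches are hypothetical, the letter set is given as a regular-expression character class, and the vocabulary words are taken to be already folded to small letters with padding letters removed.

```python
import re


def pattern_to_regex(pattern, dummy=".", variable="*", letters="a-z"):
    """Translate a CLOC-style pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == dummy:
            parts.append(f"[{letters}]")     # exactly one letter
        elif ch == variable:
            parts.append(f"[{letters}]*")    # any sequence of letters, including none
        else:
            parts.append(re.escape(ch))      # an ordinary letter, matched literally
    return re.compile("".join(parts))


def matches(pattern, word, **options):
    """True when the whole word fits the pattern."""
    return pattern_to_regex(pattern, **options).fullmatch(word) is not None


print(matches("pre*ed", "prepared"), matches("....", "dont"))
# DUMMY?VARIABLE- with "*" declared as a letter (assumed letter set "a-z*")
print(matches("*run-", "*running", dummy="?", variable="-", letters="a-z*"))
```

Passing different dummy and variable characters plays the part of the DUMMYaVARIABLEb option, freeing "." and "*" for use as ordinary letters.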
The above set description commands can follow each other. When they do
so, the set of words chosen will be the sum of the words described by each
control statement. Words defined on two or more command lines will be counted
once only.
Example
~~~~~~~
*FREQUENCY (100 TO 500)
*LIST OF WORDS me you we they
*PATTERN ...ing
The above sequence of commands defines a set of words which contains: all
words of frequency 100 to 500 inclusive, the words 'me' 'you' 'we' 'they', and
all six letter words which end in 'ing'.
4.7 TASK SELECTION COMMANDS
The following commands operate on the previously selected collection of
words. Each command specifies an action to be performed. Eleven tasks can be
selected:
(A) sorting the chosen words into alphabetic and/or frequency
order
(B) printing a word-index
(C) producing a concordance of the chosen words
(D)+ finding co-occurrences of words
(E) discovering the collocations within the context of the
chosen words.
(F)+ write out the itemised text
(G)+ output a newline
(H)+ output a newpage
(I)+ output a message
(J)+ include a comment
Page 25
(K)+ FINISH the run of the package
Each command can be included as often as required. Those marked + above
do not need to be preceded by any word selection statements.
(A) WORDLIST
This command causes the package to sort the previously chosen collection
of words into order, and print them. The type of sorted list that is produced
is determined by the keyword in the specification field. This allows one to
produce word counts in alphabetic order, reverse alphabetic order, etc. These
are printed across the page; the number per line is determined by the maximum
word length and output line width. In all word lists each word is preceded by
its frequency of occurrence.
Examples:
~~~~~~~~
a) WORDLIST ALPHA
b) WORDLIST REVALPHA
c) WORDLIST AFREQ
In a) an alphabetic wordlist is produced.
In b) a reverse alphabetic wordlist, i.e. one in rhyming order, is
produced.
In c) a wordlist in ascending frequency order is printed.
General form
~~~~~~~ ~~~~
WORDLIST sorting criterion
~~~~~~~ ~~~~~~~~~
where sorting criterion can be one of the following:
~~~~~~~ ~~~~~~~~~
ALPHA
This causes the package to sort the words into ascending alphabetic
order. The collating order for letters is taken from the word definition
commands.
DALPHA
This causes the package to sort the words into descending alphabetic
order. The collating order for letters is taken from the word definition
commands.
REVALPHA
This causes the package to sort the selected words into reverse
alphabetic order, in which words with similar endings sort together. The
collating order for letters is taken from the word definition commands.
AFREQ
This causes the package to sort the words into ascending frequency order.
Words having the same frequency of occurrence will be sorted in ascending
alphabetic order.
DFREQ
This causes the package to sort the selected words into descending
frequency order. Words having the same frequency of occurrence will be sorted
in ascending alphabetic order.
Page 26
FIRST
This causes the package to sort the selected words in the order in which
they first occur in the text. Note that words are printed across the page.
~~~~~ ~~~~~
LAST
This causes the package to sort the selected words in the order in which
they last occur in the text. Note that words are printed across the page.
~~~~ ~~~~~
ALENGTH
This causes the package to sort the selected words in ascending length
order, which is in order of their length in characters ignoring any deferred
(or padding) letters. Words of equal length will be sorted in ascending
alphabetic order.
DLENGTH
This causes the package to sort the selected words in descending length
order, which is the descending order of their length in characters ignoring
any deferred (or padding) letters. Words of equal length will be sorted in
ascending alphabetic order.
AXLENGTH
This causes the package to sort the selected words in ascending extended
length order, which is the order of their length in characters including any
deferred (or padding) letters. Words of equal length will be sorted in
ascending alphabetic order.
DXLENGTH
This causes the package to sort the selected words in descending extended
length order, which is the descending order of their length in characters
including any deferred (or padding) letters. Words of equal length will be
sorted in ascending alphabetic order.
In each of the above cases, each word printed is preceded by its
frequency of occurrence in the text. By default, the words and their
frequencies are printed across the page, rather than in columns. This is done
so that you can easily write a program (say in SNOBOL or BASIC) to reformat
the output from the CLOC package as suitable data for another package, say for
statistical analysis or graph plotting.
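Several of the sorting criteria can be expressed as sort keys. The Python sketch below covers six of them and is an illustration only: the function name wordlist is hypothetical, ordinary character order stands in for the collating order taken from the word definition commands, and deferred or padding letters are not modelled.

```python
from collections import Counter


def wordlist(words, criterion="ALPHA"):
    """Return (frequency, word) pairs sorted by the named criterion."""
    freq = Counter(words)
    keys = {
        "ALPHA":    lambda w: w,
        "REVALPHA": lambda w: w[::-1],         # compare from the last letter backwards
        "AFREQ":    lambda w: (freq[w], w),    # ties fall into ascending alphabetic order
        "DFREQ":    lambda w: (-freq[w], w),
        "ALENGTH":  lambda w: (len(w), w),
        "DLENGTH":  lambda w: (-len(w), w),
    }
    return [(freq[w], w) for w in sorted(freq, key=keys[criterion])]


text = "the cat sat on the mat the end".split()
print(wordlist(text, "DFREQ"))
print(wordlist(text, "REVALPHA"))
```

Sorting on the reversed spelling is what brings words with similar endings together in the REVALPHA list.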
(B) INDEX
This command instructs the package to print a word index of the selected
words. The parameters in the specification field allow you to presort the
keywords, and optionally supply the form of text reference that will be used.
Examples:
~~~~~~~~
a) INDEX
b) INDEX ALPHA
c) INDEX REVALPHA,REFS P2 L2
d) INDEX NOREFS
Examples a) and b) produce the same word index. The keywords are
alphabetically sorted. Each reference to a line is specified by a simple line
number.
Page 27
In c) a reverse alphabetic word index is produced. Each reference gives
information about "Page number" and "Line within page" assuming that P and L
references are included in the text for each page.
In d) an alphabetical word index is produced. No text references of any
kind are printed.
General Form
~~~~~~~ ~~~~
INDEX sorting criterion , reference
~~~~~~~ ~~~~~~~~~ ~ ~~~~~~~~~
where sorting criterion is one of ALPHA, DALPHA, REVALPHA, AFREQ, DFREQ,
~~~~~~~ ~~~~~~~~~
FIRST, LAST, ALENGTH, DLENGTH, AXLENGTH, DXLENGTH. The chosen keyword set is
sorted into this order before the word index is produced.
reference allows each word to be referenced with portions of references which
~~~~~~~~~
are embedded in the text data. The instruction takes the form:
Example
~~~~~~~
REFS A4P2L6
General Form
~~~~~~~ ~~~~
REFS letter number letter number ... letter number
~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~ ~~~~~~ ~~~~~~
The letter must be from A to Z, and identifies an embedded text
~~~~~~
reference. The number of characters printed for the reference is given by
number. When printed, each reference will be separated from the next by one
~~~~~~
space.
NOREFS
When this keyword is present no text references of any kind will be printed.
Defaults
~~~~~~~~
1. When sorting criterion is absent, ALPHA is assumed.
~~~~~~~ ~~~~~~~~~
2. When reference is absent, an absolute record number is used.
~~~~~~~~~
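A REFS specification is a sequence of letter and number pairs, which the following Python sketch takes apart. The names parse_refs and format_refs are hypothetical, and the right-justified, truncated layout of each printed reference is an assumption rather than documented behaviour.

```python
import re


def parse_refs(spec):
    """Parse a REFS specification such as 'A4P2L6' or 'P2 L2' into (letter, width) pairs."""
    compact = spec.replace(" ", "")
    pairs = re.findall(r"([A-Z])(\d+)", compact)
    # reject anything that is not purely letter-number pairs
    if "".join(letter + number for letter, number in pairs) != compact:
        raise ValueError(f"malformed REFS specification: {spec!r}")
    return [(letter, int(number)) for letter, number in pairs]


def format_refs(pairs, current):
    """Format the current reference values, one space between references."""
    # assumption: each value is right-justified and cut to the requested width
    return " ".join(str(current[letter])[-width:].rjust(width) for letter, width in pairs)


print(parse_refs("A4P2L6"))
print(format_refs(parse_refs("P2 L2"), {"P": 7, "L": 12}))
```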
(C) CONCORDANCE
This command instructs the package to print a concordance of the selected
words. The parameters in the specification field allow the user to presort
the keywords; to choose the citation style; to select a citation width; and
optionally supply a text reference.
Examples:
~~~~~~~~
a) CONCORDANCE
b) CONCORDANCE ALPHA,KWIC,CITE 4 BY 4
Page 28
c) CONCORDANCE REVALPHA,CITE 6 BY 6,REFS P2 L2
d) CONCORDANCE CITE FROM.TO.INCLUSIVE
e) CONCORDANCE REVALPHA,LEFT,CITE FROM/TO/EXCLUSIVE,REFS S2 P3 L2
f) CONCORDANCE REVALPHA,CITE 6 BY 6,NOREFS
g) CONCORDANCE CITE 4 BY 4, ABOUT NODE-1
Examples a) and b) produce the same concordance. The keywords are
alphabetically sorted. Keyword in context citations are printed which have
four words on either side of the word of interest. Each line is identified by
a simple record number.
In c) a reverse alphabetic KWIC concordance is produced with six words of
context on either side of the keyword. Each line printed is prefixed by text
reference information giving "Page number" and "Line within page" assuming
that P and L references are included in the text for each page.
In d) an alphabetical KWIC concordance is printed. The keyword is
surrounded by as many words as possible up to and including a "." character.
By this means a sentence of context is printed.
In e) assuming that "/" has been declared as a "newline" character (see
INPUT DETAILS command), this example will print a reverse alphabetic
concordance. Each citation will consist of a full line of context, left
justified and prefixed with "section", "page" and "line number" information,
assuming that S, P and L references have been included in the text.
In f) the same concordance as in c) will be produced but no text
references will precede the citations.
In g) the same concordance as in b) will be selected but the citations
will be printed with the word before the keyword centralised on the line.
General Form
~~~~~~~ ~~~~
CONCORDANCE sorting criterion, style, citation width, offset, reference
~~~~~~~ ~~~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~
where sorting criterion is one of ALPHA, DALPHA, REVALPHA, AFREQ, DFREQ,
~~~~~~~ ~~~~~~~~~
FIRST, LAST, ALENGTH, DLENGTH, AXLENGTH, DXLENGTH. The chosen keyword set is
sorted into this order before the concordance is produced.
style is the type of concordance required. This can be one of two kinds,
~~~~~
namely:
1. KWIC - key word in context, in which the word of interest is
centralised on the print line. CENT can be used as a synonym for
KWIC.
2. LEFT - in which the line of context is printed as far to the left as
possible.
citation width indicates the amount of context to be printed. This takes the
~~~~~~~~ ~~~~~
form:
Page 29
either: CITE integer1 BY integer2
~~~~~~~~ ~~~~~~~~
in which integer1 words are printed before the keyword, and
~~~~~~~~
integer2 words are printed after the keyword.
~~~~~~~~
or: CITE FROMchar1TOchar2INCLUSIVE
~~~~~ ~~~~~
or: CITE FROMchar1TOchar2EXCLUSIVE
~~~~~ ~~~~~
This option causes the package to print the citations between two given
characters char1 and char2. The left context begins with character char1, and
~~~~~ ~~~~~ ~~~~~
the right context ends with char2. When EXCLUSIVE is present the characters
~~~~~
char1 and char2 are removed from the printed line.
~~~~~ ~~~~~
offset is optional and when present takes the form
~~~~~~
either: ABOUT NODE+integer
~~~~~~~
or: ABOUT NODE-integer
~~~~~~~
This option allows citations to be printed about words near to the
keyword. For example, ABOUT NODE+1 causes all citations to be printed as if
the word to the right of the keyword was being used for citations. Similarly,
ABOUT NODE-1 chooses the word to the left of the keyword when citations are
printed. The default value of offset is ABOUT NODE+0.
~~~~~~
reference allows each citation to be prefixed with portions of references
~~~~~~~~~
which are embedded in the text data. The instruction takes the form:
Example
~~~~~~~
REFS A4P2L6
General Form
~~~~~~~ ~~~~
REFS letter number letter number ... letter number
~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~ ~~~~~~ ~~~~~~
The letter must be from A to Z, and identifies an embedded text
~~~~~~
reference. The number of characters printed for the reference is given by
number. When printed, each reference will be separated from the next by one
~~~~~~
space.
NOREFS
When this keyword is present no text references of any kind will be printed.
Defaults
~~~~~~~~
Page 30
1. When sorting criterion is absent, ALPHA is assumed.
~~~~~~~ ~~~~~~~~~
2. When style is absent, KWIC is assumed.
~~~~~
3. When citation width is absent, CITE 4 BY 4 is assumed.
~~~~~~~~ ~~~~~
4. When offset is absent, ABOUT NODE+0 is assumed.
~~~~~~
5. When reference is absent, an absolute record number is used.
~~~~~~~~~
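The effect of CITE integer1 BY integer2 and ABOUT NODE+integer on a KWIC citation can be sketched as below. This is one reading of the description above, not the package's implementation: the function name kwic and its parameters are hypothetical, the text is taken to be a simple list of words, and punctuation and text references are left out.

```python
def kwic(words, keyword, before=4, after=4, offset=0, width=60):
    """Return centred keyword-in-context lines for each occurrence of `keyword`."""
    half = width // 2
    lines = []
    for i, w in enumerate(words):
        if w != keyword:
            continue
        centre = i + offset                    # ABOUT NODE+offset shifts the centred word
        if not 0 <= centre < len(words):
            continue
        left = " ".join(words[max(0, centre - before):centre])
        right = " ".join(words[centre + 1:centre + 1 + after])
        left_part = left[-half:].rjust(half)   # right-justify so the centred words line up
        lines.append(f"{left_part} {words[centre]} {right}")
    return lines


text = "now is the winter of our discontent".split()
for line in kwic(text, "winter", before=2, after=2, width=20):
    print(line)
for line in kwic(text, "winter", before=2, after=2, offset=-1, width=20):
    print(line)
```

With offset=-1 the word before the keyword is placed at the centre, which corresponds to ABOUT NODE-1.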
(D)+ CO-OCCURRENCE
This command is used when you know two or more words and need to study
how they occur in a text. The parameters in the specification field allow you
to choose a style for presenting the results; to select a citation width and
to optionally supply a text reference. Subsidiary commands enable you to
choose word pairs or phrases of interest and also to choose a series of words
separated by an arbitrary word distance. Note: The CO-OCCURRENCE command
~~~~
need not be preceded by any word selection commands.
Examples:
a) CO-OCCURRENCE
*PHRASE you are
*PHRASE now is the winter
b) CO-OCCURRENCE CITE 8 BY 8, REFS P2 L2
*PHRASE just man
*SERIES a UPTO6 goat
*SERIES how GAP2 see
*SERIES he UPTO3 miner GAP1 and
c) CO-OCCURRENCE
*PATTERN *ing *ly
Example a) produces 4 BY 4 KWIC citations of positions in a text in which
the phrase "you are" occurs. After these have been found, all occurrences of
the phrase "now is the winter" are found. In both cases punctuation is
ignored, hence all examples are discovered.
In example b), 8 BY 8 citations are printed centralised on the page, each
prefixed with a 2 figure page number and 2 figure line number. After printing
all occurrences of the phrase "just man", the three *SERIES commands are
obeyed in order. The first command finds all occurrences of "a ... goat"
where "a" and "goat" are separated from each other by 0, 1, 2, 3, 4, 5, or 6
words of context. The second command finds all occurrences of "how" and "see"
separated from each other by precisely two arbitrary words. The third *SERIES
~~~~~~~~~
command shows how you can mix the UPTOnumber and GAPnumber options within one
~~~~~~ ~~~~~~
specification. This example will find all occurrences of "he" "miner" "and"
where "he" and "miner" are separated by 0, 1, 2 or 3 arbitrary words and with
"miner" and "and" separated by exactly 1 arbitrary word. Example c) shows how
you can also supply a list of patterns, which will be searched for in the order
they are given. This represents a skeletal form of a phrase. When citations are
printed the node for offset purposes is the first word of a *PHRASE or a
*SERIES or a *PATTERN. The offset option allows you to shift the citation
left or right as required.
Page 31
General Form
~~~~~~~ ~~~~
CO-OCCURRENCE style , citation width , offset, reference
~~~~~ ~ ~~~~~~~~ ~~~~~ ~ ~~~~~~~ ~~~~~~~~~
*PHRASE list
~~~~
*SERIES word type number word ... type number word
~~~~ ~~~~ ~~~~~~ ~~~~ ~~~ ~~~~ ~~~~~~ ~~~~
*PATTERN pattern1 pattern2 pattern3 etc.
(or) *PATTERN DUMMYaVARIABLEb pattern1 pattern2 pattern3 etc.
style is the type of citation required. This can be one of two kinds, namely:
~~~~~
1. KWIC - key word in context, in which the word of interest is
centralised on the print line. CENT can be used as a synonym for
KWIC.
2. LEFT - in which the line of context is printed as far to the left as
possible.
citation width indicates the amount of context to be printed. This takes the
~~~~~~~~ ~~~~~
form:
either: CITE integer1 BY integer2
~~~~~~~~ ~~~~~~~~
in which integer1 words are printed before the keyword, and
~~~~~~~~
integer2 words are printed after the keyword.
~~~~~~~~
or: CITE FROMchar1TOchar2INCLUSIVE
~~~~~ ~~~~~
or: CITE FROMchar1TOchar2EXCLUSIVE
~~~~~ ~~~~~
This option causes the package to print the citations between two given
characters char1 and char2. The left context begins with character char1, and
~~~~~ ~~~~~ ~~~~~
the right context ends with char2. When EXCLUSIVE is present the characters
~~~~~
char1 and char2 are removed from the printed line.
~~~~~ ~~~~~
offset is optional and when present takes the form
~~~~~~
either: ABOUT NODE+integer
~~~~~~~
or: ABOUT NODE-integer
~~~~~~~
This option allows citations to be printed about words near to the
keyword. For example, ABOUT NODE+1 causes all citations to be printed as if
the word to the right of the keyword was being used for citations. Similarly,
ABOUT NODE-1 chooses the word to the left of the keyword when citations are
printed. The default value of offset is ABOUT NODE+0 . The node for offset
~~~~~~
purposes is the first word of a *PHRASE or a *SERIES.
reference allows each citation to be prefixed with portions of references
~~~~~~~~~
which are embedded in the text data. The instruction takes the form:
Example
~~~~~~~
REFS A4P2L6
General Form
~~~~~~~ ~~~~
REFS letter number letter number ... letter number
~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~ ~~~~~~ ~~~~~~
The letter must be from A to Z, and identifies an embedded text
~~~~~~
reference. The number of characters printed for the reference is given by
number. When printed, each reference will be separated from the next by one
~~~~~~
space.
NOREFS
When this option is present no text references of any kind will be printed.
Defaults
~~~~~~~~
1. When style is absent, KWIC is assumed.
~~~~~
2. When citation width is absent, CITE 4 BY 4 is assumed.
~~~~~~~~ ~~~~~
3. When offset is absent, ABOUT NODE+0 is assumed.
~~~~~~
4. When reference is absent, an absolute record number is used.
~~~~~~~~~
type number is either UPTOnumber or GAPnumber, where
~~~~ ~~~~~~ ~~~~~~ ~~~~~~
UPTOnumber means that 0, 1, 2, 3, ... up to number arbitrary words may
~~~~~~ ~~~~~~
occur at this position.
GAPnumber means exactly number arbitrary words occur at
~~~~~~ ~~~~~~
this position.
list is one or more words separated from each other by spaces.
~~~~
Note
~~~~
1. Each word must be separated from the type and number by at least one
~~~~ ~~~~~~
space.
Page 33
2. The number can be 0, in which case UPTO0 and GAP0 are equivalent and
~~~~~~
indicate that the two words are to be adjacent. *PHRASE is the
degenerate case of a *SERIES in which all numbers are zero.
~~~~~~~
3. *PHRASE and *SERIES and *PATTERN commands can be repeated as often
as required.
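The meaning of UPTOnumber and GAPnumber in a *SERIES can be made concrete with a small matcher. The Python sketch below is an illustration under assumptions: the names parse_series and find_series are hypothetical, the text is a plain list of words with punctuation already removed, and the search simply tries every permitted gap.

```python
def parse_series(spec):
    """Parse e.g. 'he UPTO3 miner GAP1 and' into (words, gaps)."""
    tokens = spec.split()
    words, gaps = tokens[0::2], []
    for t in tokens[1::2]:
        if t.startswith("UPTO"):
            gaps.append((0, int(t[4:])))       # 0, 1, ... number arbitrary words
        elif t.startswith("GAP"):
            n = int(t[3:])
            gaps.append((n, n))                # exactly number arbitrary words
        else:
            raise ValueError(f"expected UPTOn or GAPn, got {t!r}")
    return words, gaps


def find_series(text, spec):
    """Return the positions in `text` (a list of words) at which the series starts."""
    words, gaps = parse_series(spec)

    def match_from(pos, k):
        # words[k] sits at `pos`; try to place words[k + 1] within the allowed gap
        if k == len(words) - 1:
            return True
        lo, hi = gaps[k]
        return any(
            pos + 1 + g < len(text)
            and text[pos + 1 + g] == words[k + 1]
            and match_from(pos + 1 + g, k + 1)
            for g in range(lo, hi + 1)
        )

    return [i for i, w in enumerate(text) if w == words[0] and match_from(i, 0)]


text = "he was a miner once and he saw a big goat".split()
print(find_series(text, "he UPTO3 miner GAP1 and"))
print(find_series(text, "a UPTO6 goat"))
```

A *PHRASE such as "a big" behaves like the series "a GAP0 big" in this model.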
(E) COLLOCATIONS
This command instructs the package to examine the collocates in the
context surrounding each selected word. Those collocates which are found to
have a significant affinity to the selected word will have their context
printed. This option allows closely associated pairs of words to have their
context printed. The occurrence of a collocate will be counted whenever it
occurs in a range several words to the left or to the right of the selected
word. This region is termed a span, the size of which can be chosen using a
~~~~
subsidiary command.
Examples
~~~~~~~~
COLLOCATIONS ALPHA,KWIC,CITE 4 BY 4
COLLOCATIONS REVALPHA,CITE 6 BY 6,REFS P2 L2
COLLOCATIONS CITE 5 BY 5, ABOUT NODE+1
COLLOCATIONS CITE 6 BY 6, ABOUT COLLOCATE
COLLOCATIONS CITE FROM.TO.EXCLUSIVE, ABOUT COLLOCATE-1
COLLOCATIONS FIRST,CONDENSED
General Form
~~~~~~~ ~~~~
a) COLLOCATIONS sorting criterion, style, citation width, offset, reference
~~~~~~~ ~~~~~~~~~~ ~~~~~~ ~~~~~~~~ ~~~~~~ ~~~~~~~ ~~~~~~~~~
b) COLLOCATIONS sorting criterion, CONDENSED
~~~~~~~ ~~~~~~~~~
where sorting criterion is one of ALPHA, DALPHA, REVALPHA, AFREQ, DFREQ,
~~~~~~~ ~~~~~~~~~
FIRST, LAST, ALENGTH, DLENGTH, AXLENGTH, DXLENGTH. The chosen keyword set is
sorted into this order before the collocations are produced.
style is the type of citation required. This can be one of two kinds, namely:
~~~~~
1. KWIC - key word in context, in which the word of interest is
centralised on the print line. CENT can be used as a synonym for
KWIC.
2. LEFT - in which the line of context is printed as far to the left as
possible.
citation width indicates the amount of context to be printed. This takes the
~~~~~~~~ ~~~~~
form:
either: CITE integer1 BY integer2
~~~~~~~~ ~~~~~~~~
in which integer1 words are printed before the keyword, and
~~~~~~~~
integer2 words are printed after the keyword.
~~~~~~~~
or: CITE FROMchar1TOchar2INCLUSIVE
~~~~~ ~~~~~
or: CITE FROMchar1TOchar2EXCLUSIVE
~~~~~ ~~~~~
This option causes the package to print the citations between two given
characters char1 and char2. The left context begins with character char1, and
~~~~~ ~~~~~ ~~~~~
the right context ends with char2. When EXCLUSIVE is present the characters
~~~~~
char1 and char2 are removed from the printed line.
~~~~~ ~~~~~
offset is optional and when present takes the form
~~~~~~
either: ABOUT NODE+integer
~~~~~~~
or: ABOUT NODE-integer
~~~~~~~
or: ABOUT COLLOCATE+integer
~~~~~~~
or: ABOUT COLLOCATE-integer
~~~~~~~
This option allows citations to be printed about words near to the node
or collocate. For example, ABOUT NODE+1 causes all citations to be printed as
if the word to the right of the node was being used for citations. Similarly,
ABOUT NODE-1 chooses the word to the left of the node when citations are
printed. The commands ABOUT COLLOCATE+1 or ABOUT COLLOCATE-1 do the same
except that centralisation is done using the collocate. When +integer is
~~~~~~~
absent, +0 is assumed (similarly for -integer).
~~~~~~~
reference allows each citation to be prefixed with portions of references
~~~~~~~~~
which are embedded in the text data. The instruction takes the form:
Example
~~~~~~~
REFS A4P2L6
General Form
~~~~~~~ ~~~~
REFS letter number letter number ... letter number
~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~ ~~~~~~ ~~~~~~
The letter must be from A to Z, and identifies an embedded text
~~~~~~
reference. The number of characters printed for the reference is given by
number. When printed, each reference will be separated from the next by one
~~~~~~
space.
NOREFS
When this keyword is present no text references of any kind will be printed.
Defaults
~~~~~~~~
1. When sorting criterion is absent, ALPHA is assumed.
~~~~~~~ ~~~~~~~~~
2. When style is absent, KWIC is assumed.
~~~~~
3. When citation width is absent, CITE 4 BY 4 is assumed.
~~~~~~~~ ~~~~~
4. When offset is absent, ABOUT NODE+0 is assumed.
~~~~~~
5. When reference is absent, an absolute record number is used.
~~~~~~~~~
For b):
The CONDENSED option causes the discovered collocates to be listed in a
simple tabular form. This gives one a quick look at an author's word
associations, allowing one to choose accurately which nodes to select and
which collocates to reject. The following few lines illustrate the format of
the table produced by this command.
    NODE           COLLOCATE  PAIR
    ~~~~           ~~~~~~~~~  ~~~~
 25 ship        99 the           3
 16 target       5 house         2
  8 weston       8 master        8
Subsidiary commands to the COLLOCATIONS command
~~~~~~~~~~ ~~~~~~~~ ~~ ~~~ ~~~~~~~~~~~~ ~~~~~~~
The *SPAN command
This command specifies the range of searching that is done when the
package performs a collocation analysis. The specification field of this
command defines the range which will be searched in terms of the number of
words to the left and right of the word of interest.
Example 1
~~~~~~~ ~
*SPAN 4 BY 4
Example 2
~~~~~~~ ~
*SPAN 4 BY 4 RESTRICTED
General form
~~~~~~~ ~~~~
*SPAN integer1 BY integer2 qualifier
~~~~~~~~ ~~~~~~~~ ~~~~~~~~~
where integer1 indicates the number of words to be searched before the word of
~~~~~~~~
interest and integer2 indicates the number of words after the word of
~~~~~~~~
interest. qualifier is either UNRESTRICTED or RESTRICTED. UNRESTRICTED means
~~~~~~~~~
that all words in the left and right span will be counted as collocates.
RESTRICTED means that when a pair of nodes is closer than leftspan+rightspan,
overlapping collocates will be counted once only, and the node will not be
counted as a collocate.
Note that the citation width could be narrower than the span. This will
~~~~~~~~ ~~~~~ ~~~~
cause some collocates to appear to be absent from the context; they will,
however, be found in the text. Normally the citation width should be greater
~~~~~~~~ ~~~~~
than the span.
~~~~
Default
~~~~~~~
When the *SPAN command is absent the value taken for it will be:
*SPAN 4 BY 4
When qualifier is absent UNRESTRICTED is assumed.
~~~~~~~~~
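The difference between the two qualifiers can be illustrated with a short Python sketch. It is not taken from the CLOC source. The function name count_collocates and the list-of-words input are invented for the illustration, and the reading of RESTRICTED used here (each text position is counted at most once, and occurrences of the node are never counted) is an assumption based on the description above.

```python
from collections import Counter


def count_collocates(words, node, left=4, right=4, restricted=False):
    """Count the words found within `left` words before and `right`
    words after each occurrence of `node` (compare *SPAN left BY right)."""
    counts = Counter()
    seen_positions = set()  # only consulted when restricted is True
    for position, word in enumerate(words):
        if word != node:
            continue
        start = max(0, position - left)
        stop = min(len(words), position + right + 1)
        for i in range(start, stop):
            if i == position:
                continue  # the node occurrence itself is not a collocate
            if restricted:
                # Assumed reading of RESTRICTED: overlapping positions are
                # counted once only, and the node is never a collocate.
                if i in seen_positions or words[i] == node:
                    continue
                seen_positions.add(i)
            counts[words[i]] += 1
    return counts


sample = "the ship left the ship yard".split()
print(dict(count_collocates(sample, "ship", 3, 3)))
print(dict(count_collocates(sample, "ship", 3, 3, restricted=True)))
```

In the sample the two occurrences of the node are closer together than the combined span, so the unrestricted count includes the overlapping words twice while the restricted count does not.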
The *FREQUENCY command
This command is used to select significant collocates according to their
frequency of occurrence. Only those collocates which occur within the
specified frequency limits will have their citations printed. Note that the
frequency of occurrence of a collocate is different from the frequency of
occurrence of the same object treated as a word.
~~~~
Example
~~~~~~~
*FREQUENCY (100 TO 500)
General Form
~~~~~~~ ~~~~
*FREQUENCY expression
~~~~~~~~~~
where expression is one or more terms connected by OR symbols. A
~~~~~~~~~~ ~~~~
term is one of the following:
~~~~
(a) integer for example 10
~~~~~~~
Only collocates occurring exactly integer times will be selected.
~~~~~~~
(b) >integer for example >10
~~~~~~~
Only collocates occurring more than integer times will be selected.
~~~~~~~
(c) <integer for example <10
~~~~~~~
Only collocates occurring less than integer times will be selected.
~~~~~~~
(d) (integer1 TO integer2) for example (100 TO 500)
~~~~~~~ ~~~~~~~
Only collocates lying in the range integer1 to integer2 inclusive
~~~~~~~ ~~~~~~~
will be selected. Note that integer1 must be smaller than
~~~~~~~
integer2.
~~~~~~~
Default
~~~~~~~
When the *FREQUENCY command is absent the package assumes the following value
for it.
*FREQUENCY >1
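The four kinds of term can be modelled by the following Python sketch, offered only as an aid to understanding. The function name frequency_test is invented, and the sketch assumes that the OR symbol is written as the word OR between terms; check the command language section for the symbol the package actually accepts.

```python
import re


def frequency_test(expression):
    """Turn a *FREQUENCY expression into a function that reports whether
    a given count satisfies at least one of its terms."""
    checks = []
    for term in re.split(r"\s+OR\s+", expression.strip()):
        term = term.strip()
        in_range = re.fullmatch(r"\(\s*(\d+)\s+TO\s+(\d+)\s*\)", term)
        relation = re.fullmatch(r"([<>])\s*(\d+)", term)
        if in_range:
            low, high = int(in_range.group(1)), int(in_range.group(2))
            if low >= high:
                raise ValueError("integer1 must be smaller than integer2")
            checks.append(lambda n, low=low, high=high: low <= n <= high)
        elif relation:
            limit = int(relation.group(2))
            if relation.group(1) == ">":
                checks.append(lambda n, limit=limit: n > limit)
            else:
                checks.append(lambda n, limit=limit: n < limit)
        elif term.isdigit():
            exact = int(term)
            checks.append(lambda n, exact=exact: n == exact)
        else:
            raise ValueError(f"unrecognised term: {term!r}")
    return lambda n: any(check(n) for check in checks)


default = frequency_test(">1")           # the default shown above
in_limits = frequency_test("(100 TO 500)")
print(default(1), default(2))
print(in_limits(100), in_limits(501))
```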
EVERY COLLOCATE
This command ensures that every collocate chosen by the SPAN and
FREQUENCY commands will be considered for selection.
Example and general form
~~~~~~~ ~~~ ~~~~~~~ ~~~~
EVERY COLLOCATE
SELECTCOLLOCATE
set description
~~~ ~~~~~~~~~~~
This command ensures that only those words in the set description
~~~ ~~~~~~~~~~~
will be considered as collocates.
Example
~~~~~~~
SELECTCOLLOCATE
*LIST OF WORDS father mother
REJECTING
set description
~~~ ~~~~~~~~~~~
This command allows one to specify an exclusion list for
collocates. Its function is to remove insignificant collocates
e.g. 'the' 'but' 'and' from those to be printed, thereby producing as
results only the interesting collocates.
Example
~~~~~~~
REJECTING
*PATTERN run*
ACCEPTING
set description
~~~ ~~~~~~~~~~~
This command allows you to add to the collection of possible collocates
those in the set description. It is most often used to supply words
~~~ ~~~~~~~~~~~
that were excluded because the REJECTING command was too restrictive.
Page 38
The above commands can only be supplied in a fixed order. The order is
similar to that for word selection described earlier but this time we
use the collocate selection commands. The commands EVERY COLLOCATE or
SELECTCOLLOCATE are alternatives: only one can be chosen, but if both
are absent the command EVERY COLLOCATE is assumed. Thus we can say:-
either: EVERY COLLOCATE
or: SELECTCOLLOCATE
set description
~~~ ~~~~~~~~~~~
The commands REJECTING or ACCEPTING can be chosen as often as necessary
so that you can get just those collocates that you want.
Examples
~~~~~~~~
a) REJECTING
*LIST OF WORDS the but and
b) REJECTING
*PATTERN *ing *ed
c) REJECTING
*FREQUENCY (100 TO 700)
d) SELECTCOLLOCATE
*PATTERN run*
e) SELECTCOLLOCATE
*PATTERN run*
REJECTING
*LIST OF WORDS running
In a) The specific collocates 'the' 'but' 'and' will not be printed.
In b) All collocates which end in 'ing' or 'ed' will not be printed.
In c) All collocates whose vocabulary frequency of occurrence is
between 100 and 700 inclusive will not be printed.
In d) Only those collocates which start with 'run' will be chosen.
In e) All collocates which start with 'run' but excluding the word 'running'
will be chosen.
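The way the selection commands combine can be pictured with the Python sketch below. It is an illustration only: the names matching and choose_collocates are invented, the vocabulary is made up, and the * in a pattern is assumed to stand for any run of letters (including none), which is how Python's fnmatchcase treats it.

```python
from fnmatch import fnmatchcase


def matching(vocabulary, patterns):
    """Words of the vocabulary that fit at least one pattern."""
    return {w for w in vocabulary if any(fnmatchcase(w, p) for p in patterns)}


def choose_collocates(vocabulary, select=None, steps=()):
    """select=None plays the part of EVERY COLLOCATE; otherwise it is the
    set description given to SELECTCOLLOCATE. `steps` is an ordered list
    of ("REJECTING" or "ACCEPTING", set of words) pairs."""
    vocabulary = set(vocabulary)
    chosen = set(vocabulary) if select is None else set(select) & vocabulary
    for command, words in steps:
        if command == "REJECTING":
            chosen -= set(words)
        elif command == "ACCEPTING":
            chosen |= set(words) & vocabulary
        else:
            raise ValueError(f"unknown command: {command}")
    return chosen


vocabulary = ["run", "runs", "running", "runner", "walk"]

# Corresponds to example e): SELECTCOLLOCATE *PATTERN run*,
# followed by REJECTING *LIST OF WORDS running.
example_e = choose_collocates(
    vocabulary,
    select=matching(vocabulary, ["run*"]),
    steps=[("REJECTING", {"running"})],
)
print(sorted(example_e))
```

Because the steps are applied in the order given, an ACCEPTING step placed after a REJECTING step can restore words that the rejection removed.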
(F)+ WRITETEXT
This command will cause CLOC to print out the itemised text. Each line
can be prefixed with a text reference to its first word.
Examples
~~~~~~~~
a) WRITETEXT
b) WRITETEXT REFS P3 L4
c) WRITETEXT NOREFS
in a) the text reference will be a simple record number.
in b) a 3 character page number P and a 4 figure line number L will
be used as the reference.
in c) no references of any kind will be printed.
Page 39
General form
~~~~~~~ ~~~~
WRITETEXT reference
~~~~~~~~~
where reference is optional, and when present takes the form:-
~~~~~~~~~
Example
~~~~~~~
REFS A4P2L6
General Form
~~~~~~~ ~~~~
REFS letter number letter number ... letter number
~~~~~~ ~~~~~~ ~~~~~~ ~~~~~~ ~~~ ~~~~~~ ~~~~~~
The letter must be from A to Z, and identifies an embedded text
~~~~~~
reference. The number of characters printed for the reference is given by
number. When printed, each reference will be separated from the next by one
~~~~~~
space.
NOREFS
When this keyword is present no text references of any kind will be printed.
Default
~~~~~~~
1. When reference is absent, an absolute record number is used.
~~~~~~~~~
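The letter and number pairs of a REFS specification can be pictured with the following Python sketch. It is a model only: the names parse_refs and format_reference are invented, and the treatment of a value that is shorter or longer than its field (right-justified, or cut to its last characters) is an assumption, since the text above states only the field width and the single separating space.

```python
import re


def parse_refs(spec):
    """Split a specification such as 'REFS A4P2L6' or 'P3 L4' into
    a list of (letter, width) pairs."""
    body = spec.strip()
    if body.upper().startswith("REFS"):
        body = body[4:]
    pairs = re.findall(r"([A-Z])\s*(\d+)", body)
    leftover = re.sub(r"[A-Z]\s*\d+", "", body).strip()
    if not pairs or leftover:
        raise ValueError(f"cannot interpret reference specification: {spec!r}")
    fields = [(letter, int(width)) for letter, width in pairs]
    if any(width == 0 for _, width in fields):
        raise ValueError("a width of zero is not meaningful")
    return fields


def format_reference(fields, values):
    """Print each reference in its own field, fields separated by one space.
    Padding and truncation rules here are assumed, not documented."""
    parts = []
    for letter, width in fields:
        text = str(values.get(letter, ""))
        parts.append(text[-width:].rjust(width))
    return " ".join(parts)


fields = parse_refs("REFS P3 L4")
print(fields)
print(format_reference(fields, {"P": 12, "L": 7}))
```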
(G)+ NEWLINE
This command will insert one or more newlines in the CLOC results file.
You can use this feature to widen the gap between the results from
successive tasks.
Examples
~~~~~~~~
a) NEWLINE
b) NEWLINE 1
c) NEWLINE 5
General form
~~~~~~~ ~~~~
NEWLINE integer
~~~~~~~
The command will cause integer newlines to be sent to the CLOC
~~~~~~~
results file.
Default
~~~~~~~
When integer is absent, a value of 1 is assumed.
~~~~~~~
(H)+ NEWPAGE
This command will cause a new page to be thrown on the CLOC results file.
Page 40
Example and general form.
~~~~~~~ ~~~ ~~~~~~~ ~~~~
NEWPAGE
(I)+ MESSAGE
This command will send the contents of the specification field, and any
continuation, to the CLOC results file.
Example
~~~~~~~
MESSAGE Henry the Fifth (part 1)
General form
~~~~~~~ ~~~~
MESSAGE character sequence
~~~~~~~~~ ~~~~~~~~
The character sequence will be sent to the CLOC results file.
~~~~~~~~~ ~~~~~~~~
(J)+ NOTE
This command can be used at your convenience to insert a
commentary about the following or preceding task. All characters
in the specification field of this command are ignored.
Example
~~~~~~~
NOTE THIS TEXT IS TAKEN FROM WORDSWORTH
General Form
~~~~~~~ ~~~~
NOTE character sequence
~~~~~~~~~ ~~~~~~~~
Where character sequence is totally ignored. This information will be
~~~~~~~~~ ~~~~~~~~
printed on the CLOC diagnostic and information channel along with the other
control statements.
(K)+ FINISH
This command must be the final one in the sequence. It informs the
package that no further commands are to follow.
Example and general form
~~~~~~~ ~~~ ~~~~~~~ ~~~~
FINISH
5 EXAMPLES
Before preparing a large volume of text, or before trying out the CLOC
package on some prepared text, you should run the example programs given
below. To do this you will need to read the documentation on using the
package on your local computer. This will provide the basic information you
need to know to run any CLOC job. You should compare the results produced by
the computer with those given in the example programs to check your
understanding of the command language. You are also recommended to vary the
given commands in order to gain some feel for their effect. Often the
examples will contain all the commands you need to solve your given problem,
in which case all you need do is supply your own text. All the examples in
this section use the following extract from "CAUTION: LOW FLYING DUCKS" by
the author.
The University ; "A society of individuals living and working
together for the advancement of learning and the dissemination of
knowledge". (University of York Development Plan).
In 1617 James I received a petition requesting a University
for York. This was followed by a petition to Parliament in 1652,
and a deputation to the University Grants Committee in 1947. The
University officially opened in 1963 with a student population
comprising 216 undergraduates and 12 postgraduates.
The site consisted of 190 acres of marshy land and a large
decrepit Elizabethan mansion, Heslington Hall, destined to become
the administration building. Draining the saturated ground was
accomplished by widening a natural stream and creating a fourteen
acre artificial lake around which the University was constructed.
If the above were coded on punched cards using the recommendations in
section 2 of this guide, it would look like the following:-
=THE =UNIVERSITY ; "=A SOCIETY OF INDIVIDUALS LIVING AND WORKING
TOGETHER FOR THE ADVANCEMENT OF LEARNING AND THE DISSEMINATION OF
=KNOWLEDGE". (=UNIVERSITY OF =YORK =DEVELOPMENT =PLAN) .
=IN 1617 =JAMES =I RECEIVED A PETITION REQUESTING A =UNIVERSITY
FOR =YORK. =THIS WAS FOLLOWED BY A PETITION TO =PARLIAMENT IN 1652,
AND A DEPUTATION TO THE =UNIVERSITY =GRANTS =COMMITTEE IN 1947. =THE
=UNIVERSITY OFFICIALLY OPENED IN 1963 WITH A STUDENT POPULATION
COMPRISING 216 UNDERGRADUATES AND 12 POSTGRADUATES.
=THE SITE CONSISTED OF 190 ACRES OF MARSHY LAND AND A LARGE
DECREPIT =ELIZABETHAN MANSION, =HESLINGTON =HALL, DESTINED TO BECOME
THE ADMINISTRATION BUILDING. =DRAINING THE SATURATED GROUND WAS
ACCOMPLISHED BY WIDENING A NATURAL STREAM AND CREATING A FOURTEEN
ACRE ARTIFICIAL LAKE AROUND WHICH THE =UNIVERSITY WAS CONSTRUCTED.
We will assume that you are using a computer that has both upper and
lower case and that you stored the text in the form that it was first written.
The coming examples show how CLOC commands are put together to perform the
following tasks:-
1. Alphabetic sorting
2. The pattern feature
3. The exclusion list
4. Producing a concordance
5. Finding collocations
Page 42
Example number 1 Alphabetic Sorting
~~~~~~~ ~~~~~~ ~~~~~~~~~~ ~~~~~~~
The following commands cause CLOC to read the above text and
sort the vocabulary into ascending alphabetic order.
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
EVERYWORD
WORDLIST)ALPHA
FINISH
The output produced by the computer is a listing of the commands,
including the defaults and comments, and the results of the
sorting process.
a) Control statement listing
default input details width80
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
default *separators !"#$%&'()*+,-./0123456789:;<=>?[\]^_`{|}~
the text contains :
113 running words
71 distinct words
and the maximum word length is 14 characters
default output details width120
EVERYWORD
WORDLIST)ALPHA
FINISH
b) The Results.
table of 71 words in ascending alphabetic order
===============================================
9 a 1 accomplished 1 acre 1 acres 1 administration 1 advancement
6 and 1 around 1 artificial 1 become 1 building 2 by
1 committee 1 comprising 1 consisted 1 constructed 1 creating 1 decrepit
1 deputation 1 destined 1 development 1 dissemination 1 draining 1 elizabethan
1 followed 2 for 1 fourteen 1 grants 1 ground 1 hall
1 heslington 1 i 4 in 1 individuals 1 james 1 knowledge
1 lake 1 land 1 large 1 learning 1 living 1 mansion
1 marshy 1 natural 6 of 1 officially 1 opened 1 parliament
2 petition 1 plan 1 population 1 postgraduates 1 received 1 requesting
1 saturated 1 site 1 society 1 stream 1 student 9 the
1 this 3 to 1 together 1 undergraduates 6 university 3 was
1 which 1 widening 1 with 1 working 2 york
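The figures in the listing above can be checked independently of the package. The Python sketch below is not part of CLOC; it assumes that a word is an unbroken run of the letters given on the *LETTERS command, that every other character acts as a separator, and that capitals are treated as the corresponding small letters. The name itemise is invented for the illustration.

```python
import re
from collections import Counter

TEXT = """\
The University ; "A society of individuals living and working
together for the advancement of learning and the dissemination of
knowledge". (University of York Development Plan).
In 1617 James I received a petition requesting a University
for York. This was followed by a petition to Parliament in 1652,
and a deputation to the University Grants Committee in 1947. The
University officially opened in 1963 with a student population
comprising 216 undergraduates and 12 postgraduates.
The site consisted of 190 acres of marshy land and a large
decrepit Elizabethan mansion, Heslington Hall, destined to become
the administration building. Draining the saturated ground was
accomplished by widening a natural stream and creating a fourteen
acre artificial lake around which the University was constructed.
"""


def itemise(text, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return the words of the text: maximal runs of the given letters."""
    pattern = "[" + re.escape(letters) + "]+"
    return re.findall(pattern, text.lower())


words = itemise(TEXT)
counts = Counter(words)

print(len(words), "running words")
print(len(counts), "distinct words")
print("maximum word length is", max(len(w) for w in counts), "characters")

# The word list in ascending alphabetic order, frequency first.
for word in sorted(counts):
    print(counts[word], word)
```

Under these assumptions the sketch gives 113 running words, 71 distinct words and a maximum word length of 14, in agreement with the control statement listing.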
Page 43
Example number 2 The PATTERN Feature
~~~~~~~ ~~~~~~ ~~~ ~~~~~~~ ~~~~~~~
The example illustrates how the pattern feature can be used to select, from
the above text, words which end in a standard way. The selected words are
then listed in alphabetic order.
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
SELECTWORDS
*PATTERN) *ing
WORDLIST)ALPHA
FINISH
The output produced is a listing of the commands and the results.
a) Control statement listing
default input details width80
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
default *separators !"#$%&'()*+,-./0123456789:;<=>?[\]^_`{|}~
the text contains :
113 running words
71 distinct words
and the maximum word length is 14 characters
default output details width120
SELECTWORDS
*PATTERN) *ing
WORDLIST)ALPHA
FINISH
b) The Results.
table of 9 words in ascending alphabetic order
==============================================
1 building 1 comprising 1 creating 1 draining 1 learning 1 living
1 requesting 1 widening 1 working
Page 44
Example Number 3 The Exclusion List
~~~~~~~ ~~~~~~ ~~~ ~~~~~~~~~ ~~~~
This example shows how one can exclude a set of words from a previously
selected set. The resultant collection is listed in ascending alphabetic
order.
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
SELECTWORDS
*FREQUENCY) >1
EXCLUDING
*LISTOFWORDS) a the of and
WORDLIST)ALPHA
FINISH
The output produced is a listing of the commands and the alphabetically
ordered list.
a) Control statement Listing
default input details width80
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
default *separators !"#$%&'()*+,-./0123456789:;<=>?[\]^_`{|}~
the text contains :
113 running words
71 distinct words
and the maximum word length is 14 characters
default output details width120
SELECTWORDS
*FREQUENCY) >1
EXCLUDING
*LISTOFWORDS) a the of and
WORDLIST)ALPHA
FINISH
b) The Results
table of 8 words in ascending alphabetic order
==============================================
2 by 2 for 4 in 2 petition 3 to 6 university
3 was 2 york
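The selection made in this example can be followed step by step in the Python sketch below. The frequencies of the repeated words are copied from the word list of Example 1, together with three of the words that occur once; the variable names are invented for the illustration.

```python
# Frequencies taken from the word list produced by Example 1.
counts = {
    "a": 9, "and": 6, "by": 2, "for": 2, "in": 4, "of": 6,
    "petition": 2, "the": 9, "to": 3, "university": 6, "was": 3, "york": 2,
    "acre": 1, "lake": 1, "site": 1,   # a few of the words that occur once
}

# SELECTWORDS with *FREQUENCY >1: keep the words occurring more than once.
selected = {word for word, n in counts.items() if n > 1}

# EXCLUDING with *LISTOFWORDS a the of and: remove the listed words.
selected -= {"a", "the", "of", "and"}

# WORDLIST ALPHA: list what remains in ascending alphabetic order.
for word in sorted(selected):
    print(counts[word], word)
```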
Page 45
Example number 4 Producing a concordance
~~~~~~~ ~~~~~~ ~~~~~~~~~ ~ ~~~~~~~~~~~
The following commands will produce a concordance of the words selected.
The output is centralised on the page and sorted in ascending alphabetic
order.
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
SELECTWORDS
*PATTERN) *ed
CONCORDANCE) KWIC,CITE 5 BY 5
FINISH
a) Control statement Listing
default input details width80
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
default *separators !"#$%&'()*+,-./0123456789:;<=>?[\]^_`{|}~
the text contains :
113 running words
71 distinct words
and the maximum word length is 14 characters
default output details width120
SELECTWORDS
*PATTERN) *ed
CONCORDANCE) KWIC,CITE 5 BY 5
FINISH
b) The results
concordance of 8 nodes
======================
node accomplished occurs 1 times
11 Draining the saturated ground was accomplished by widening a natural stream
node consisted occurs 1 times
8 undergraduates and 12 postgraduates. The site consisted of 190 acres of marshy land
node constructed occurs 1 times
12 around which the University was constructed.
node destined occurs 1 times
9 decrepit Elizabethan mansion, Heslington Hall, destined to become the administration building
node followed occurs 1 times
4 University for York. This was followed by a petition to Parliament
node opened occurs 1 times
6 Committee in 1947. The University officially opened in 1963 with a student population
node received occurs 1 times
3 Development Plan). In 1617 James I received a petition requesting a University
node saturated occurs 1 times
10 the administration building. Draining the saturated ground was accomplished by widening
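The effect of CITE 5 BY 5 can be imitated with the Python sketch below. It is an illustration, not the method used by the package: the name kwic is invented, words are assumed to be runs of letters, and the sketch makes no attempt to reproduce the record numbers, the centring of the node or the trailing punctuation of the real output.

```python
import re


def kwic(text, node, before=5, after=5):
    """Return one citation per occurrence of `node`, running from the word
    `before` places to its left to the word `after` places to its right."""
    words = list(re.finditer(r"[A-Za-z]+", text))
    citations = []
    for k, match in enumerate(words):
        if match.group().lower() != node.lower():
            continue
        first = words[max(0, k - before)]
        last = words[min(len(words) - 1, k + after)]
        # Slice the original text so numerals and punctuation lying
        # between the cited words are kept.
        citations.append(text[first.start():last.end()])
    return citations


sample = ("comprising 216 undergraduates and 12 postgraduates. The site "
          "consisted of 190 acres of marshy land and a large")
for line in kwic(sample, "consisted"):
    print(line)
```

Note that the numerals 12 and 190 appear in the citation but are not counted among the five words on either side, which matches the citation for the node consisted shown above.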
Page 46
Example number 5 Finding COLLOCATIONS
~~~~~~~ ~~~~~~ ~~~~~~~ ~~~~~~~~~~~~
The following commands cause the package to scan the context of
the selected words, and to print the examples of their collocations.
The output is centralised on the page and sorted in ascending alphabetic
order.
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
SELECTWORDS
*LISTOFWORDS) university
COLLOCATIONS) KWIC,CITE 5 BY 5
*FREQUENCY) >2
FINISH
a) Control statement Listing
default input details width80
ITEMISEUSING)CLOC
*LETTERS)abcdefghijklmnopqrstuvwxyz
default *separators !"#$%&'()*+,-./0123456789:;<=>?[\]^_`{|}~
the text contains :
113 running words
71 distinct words
and the maximum word length is 14 characters
default output details width120
SELECTWORDS
*LISTOFWORDS) university
COLLOCATIONS) KWIC,CITE 5 BY 5
default *span 4 by 4
*FREQUENCY) >2
FINISH
b) The Results
collocation analysis of 1 nodes (cited about node)
==================================================
node university occurs 6 times
collocate the occurs 9 times
node-collocate pair occurs 6 times
0 The University ; "A society of individuals living
2 and the dissemination of knowledge". (University of York Development Plan). In
5 and a deputation to the University Grants Committee in 1947. The University
5 and a deputation to the University Grants Committee in 1947. The University
6 University Grants Committee in 1947. The University officially opened in 1963 with a
12 artificial lake around which the University was constructed.
node university occurs 6 times
collocate a occurs 9 times
node-collocate pair occurs 4 times
0 The University ; "A society of individuals living
3 received a petition requesting a University for York. This was followed
3 received a petition requesting a University for York. This was followed
5 and a deputation to the University Grants Committee in 1947. The University
node university occurs 6 times
collocate of occurs 6 times
node-collocate pair occurs 3 times
0 The University ; "A society of individuals living
2 and the dissemination of knowledge". (University of York Development Plan). In
2 and the dissemination of knowledge". (University of York Development Plan). In
node university occurs 6 times
collocate in occurs 4 times
node-collocate pair occurs 3 times
5 and a deputation to the University Grants Committee in 1947. The University
6 University Grants Committee in 1947. The University officially opened in 1963 with a
6 University Grants Committee in 1947. The University officially opened in 1963 with a
Page 48
Appendix I
~~~~~~~~ ~
Messages Produced by the CLOC package
~~~~~~~~ ~~~~~~~~ ~~ ~~~ ~~~~ ~~~~~~~
Three categories of message are printed by the package: errors,
~~~~~~
warnings, and comments.
~~~~~~~~ ~~~~~~~~
Error Messages
~~~~~ ~~~~~~~~
These cause the run of the package to be abandoned. Where the error is
caused by a mistake in a command the symbol 1 is printed under the faulty
position on the command, 2 for the second error on the line, and so on up to
9. Error messages take the form ERROR - text of message, where the text is
~~~~ ~~ ~~~~~~~
one of the following:
a. MISSING MANDATORY STATEMENT
You have forgotten to include or have misspelt an
essential command.
b. CONTROL STATEMENT ENDS PREMATURELY
The continuation field of 15 spaces was expected but
not found.
c. INCORRECT CONTROL STATEMENT
A mistake has been found on the line; the symbol 1
points to it.
d. UNKNOWN SYMBOL
The item in the specification field has not been
recognised.
e. CHARACTER ALREADY DEFINED
The indicated character has occurred on this or an earlier
line.
f. NO LETTERS PROVIDED
The *LETTERS command exists, but no letters have been
put on it.
g. NUMBER IS TOO LARGE
The indicated number is too large for the CLOC
package to use.
h. UPPER VALUE DOES NOT EXCEED LOWER
The upper value in a frequency range is smaller than
the lower value.
i. NO WORDS FOUND
The combination of word selection commands has chosen
no words.
j. FILE NOT PRODUCED BY CLOC MARK mark
~~~~
The file used by the GET TEXT command was not
produced by an earlier run of the package.
k. ABOVE STATEMENT NOT EXPECTED
This line may be a misspelt or spurious command.
l. SYMBOL NOT ALLOWED IN THIS CONTEXT
The indicated symbol is not permitted there.
m. CAPACITY EXCEEDED
The text to be processed contains more words than
the package is able to handle.
n. NUMBER OF REFERENCES EXCEEDS number
~~~~~~
The text contains more text references than the
package can accept.
o. NO WORD SELECTION COMMANDS PROVIDED
The commands EVERY WORD or SELECT WORDS are
absent or misspelt.
p. NUMBER EXPECTED AT THIS POSITION
The previous CLOC keyword must be followed by a number.
q. A NUMBER CANNOT BE PLACED HERE
The previous CLOC keyword must not be followed by a number.
~~~
r. ZERO NUMBER NOT PERMITTED
s. SPACE NOT ALLOWED HERE
The package makes several checks on its operation and in certain
instances may fail with the message SYSTEM ERROR number. Such an occurrence
~~~~~~
should be reported to your local advisory service.
Warning Messages
~~~~~~~ ~~~~~~~~
These are produced when the package finds a simple mistake in a command
not important enough to cause a fatal error. The mistake will be ignored and
the next command will be examined. The symbol 1 is printed under the faulty
position on the line (and so on up to 9). Warning messages take the form
WARNING - text of message, where the text is one of the following.
~~~~ ~~ ~~~~~~~
a. CHARACTER ALREADY DEFINED
The *SEPARATORS, *PADDING commands etc. contain repeated
characters.
b. SPURIOUS CHARACTERS FOUND AND IGNORED
The specification field contains characters which should
not be present; they will be ignored.
c. SET DESCRIPTION SPECIFIES NO WORDS
The combination of word selection commands resulted in no
words selected.
d. WORD(S) NOT FOUND
The indicated words on the *LIST OF WORDS command
are not present in the vocabulary of the text.
e. ABOVE ITEM TOO LONG
The word printed is longer than the system can cope
with; trailing letters have been removed.
f. NO REFERENCE INFORMATION IN TEXT
The citation option REFS was chosen but no text references
were placed in the text.
g. NO WORDS MATCH THIS PATTERN
The current vocabulary does not contain words of this form.
The package makes several checks on its operation and in certain
instances may produce the message SYSTEM WARNING number. Such an occurrence
~~~~~~
should be reported to your local computer advisory service.
Comment messages
~~~~~~~ ~~~~~~~~
These are produced when the system has read the text and is about to read
the task selection commands. One or more of the following comments may be
produced.
a. TEXT FILE name SAVED ON date AT time
~~~~ ~~~~ ~~~~
The text has been read and stored in a permanent
file to be later used by the GET TEXT command.
b. TEXT FILE name ACCESSED. (SAVED ON date AT time)
~~~~ ~~~~ ~~~~
The GET TEXT command has accessed this file.
c. THE TEXT CONTAINS:
integer1 RUNNING WORDS
~~~~~~~
integer2 DISTINCT WORDS
~~~~~~~
AND THE MAXIMUM WORD LENGTH IS integer3 CHARACTERS.
~~~~~~~
The text under analysis is integer1 words in length, and the vocabulary
~~~~~~~
used contains integer2 different words. The longest word or words contain
~~~~~~~
integer3 characters.
~~~~~~~
Page 51
APPENDIX II
~~~~~~~~ ~~
References
~~~~~~~~~~
1. "The RUNCLOC macro", Computer Centre Users Manual, University of
Birmingham
2. "CLOC - An Applications Package in ALGOL 68R" Presented to
"Applications of ALGOL 68" Conference, April 1975, University of
Liverpool.
3. "English Lexical Studies", J. McH. Sinclair, S. Jones and R.
Daley.
4. "The COCOA Manual", D.B. Russell, ATLAS Computer Laboratory.
5. "Statistical Package for the Social Sciences (SPSS)", N.H. Nie, D.H.
Bent, and C.H. Hull, Publ. McGraw-Hill, New York, 1970.
6. "CLOC: A Collocation Package", ALLC Bulletin, Vol. 5, No. 2,
1977.
7. "RATS: A Middle-level Text Utility System": Smith; Computers and
the Humanities, Vol. 6, P.277.
8. "JEUDEMO: A Text-Handling System": Bratley, Lusignan, and
Ouellette, Computers in the Humanities. Pub. Edinburgh University
Press.
9. "Computer Analysis of Natural Language": Reed 1973, Birmingham
University Computer Centre, internal report 1973.
10. "CLOC User Guide", A. Reed, 1975, Computer Centre, University of
Birmingham.
11. "OXEYE: A Text Processing Package for the 1906A", L. Burnard,
1976, Oxford University Computing Service.
12. "CLOC: A General-purpose concordance and collocations generator";
A. Reed, J. L. Schonfelder, 1979, Aston University.
13. "OCP: Oxford Concordance Program", S. Hockey and I. Marriott,
October 1980, Oxford University Computing Service.
14. "Anatomy of a Text Analysis Package", Reed A, Computer Lang.,
Vol. 9, No. 2, pp 89-96, 1984.
Page 52
APPENDIX III
~~~~~~~~ ~~~
Glossary
~~~~~~~~
LETTER - One of an arbitrary collection of graphic signs, used to construct
words.
SEPARATOR - i) A graphic sign which is not a LETTER
ii) An arbitrary sequence of i) above.
WORD - An arbitrary sequence of LETTERs, generally contiguous, but may
contain graphic signs which are totally ignored during reading.
NODE - A particular word about which a concordance can be printed or a
collocation analysis performed.
SPAN - The context of words, surrounding a NODE, which is used during
collocation analysis.
COLLOCATE - One of the words of context in a SPAN.
Page 53
APPENDIX IV
~~~~~~~~ ~~
CLOC Global Syntax Rules
~~~~ ~~~~~~ ~~~~~~ ~~~~~
The following "railroad" diagram describes the syntax rules for CLOC
control statements. Follow the arrows from top to bottom and you will pass
through all compulsory commands. A diversion of route indicates optional
commands. A choice of route indicates a choice of commands at that position.
!
............... .............!..........
!INPUT DETAILS!<----! !
~~~~~~~!~~~~~~~ ! !
`----------->! !
..........!.......... .....!....
!ITEMISE USING CLOC! !GET TEXT!
~~~~~~~~~~!~~~~~~~~~~ ~~~~~!~~~~
.....!.... !
!*LETTERS! !
~~~~~!~~~~ !
.--------------------------------->! !
! .......... ! !
!--!*PADDING!<-------. ! !
! ~~~~~~~~~~ ! ! !
! ........... ! ! !
!--!*DEFERRED!<------! ! !
! ~~~~~~~~~~~ ! ! !
! ............. !<------------! !
!--!*SEPARATORS!<----! ! !
! ~~~~~~~~~~~~~ ! ! !
! ................ ! ! !
!--!*READ AS SPACE!<-! ! !
! ~~~~~~~~~~~~~~~~ ! ! !
! ......... ! ! !
`--!*IGNORE!<--------' ! !
~~~~~~~~~ ........... ! !
!SAVE TEXT!<-----! !
~~~~~!~~~~~ ! !
`---------->!<---------------------'
................ !
!OUTPUT DETAILS!<--!
~~~~~~~!~~~~~~~~ !
`---------->!
.--------------------------------->!<--------------------------------------.-.
! .---------.---------.---<---+-->--.--------.-------------. ! !
! ....!.... ....!.... ....!.... ! ...!.. .....!..... .......!....... ! !
! !NEWLINE! !NEWPAGE! !MESSAGE! ! !NOTE! !WRITETEXT! !CO-OCCURRENCE! ! !
! ~~~~!~~~~ ~~~~!~~~~ ~~~~!~~~~ ! ~~~!~~ ~~~~~!~~~~~ ~~~~~~~!~~~~~~~ ! !
`<-----'---------'---------' ! !<-------' !<---------! !
! ! .<--------.-----+--->. ! !
! ! .....!.... ....!.... ....!.... ! !
! ! !*PATTERN! !*PHRASE! !*SERIES! ! !
! ! ~~~~~!~~~~ ~~~~!~~~~ ~~~~!~~~~ ! !
! ! `---------`----------`---->' !
! `---------------------------------->'
!
.--------------------------------->!
! .-------------------------+----------------------.
! ......!....... ......!..... ....!...
! !SELECT WORDS! !EVERY WORD! !FINISH!
! ~~~~~~~!~~~~~~ ~~~~~~!~~~~~ ~~~~~~~~
! .......!......... !
! !Set description!-------------->!
! ~~~~~~~~~~~~~~~~~ !
! .------------------------>!<-----------------------.
! ! ........... ! ........... !
! ! !EXCLUDING!<-------!------>!INCLUDING! !
! ! ~~~~~!~~~~~ ! ~~~~~!~~~~~ !
! ! ........!........ ! ........!........ !
! `<--!Set description! ! !Set description!-->'
! ~~~~~~~~~~~~~~~~~ ! ~~~~~~~~~~~~~~~~~
!--------------------------------->!
! .-----------.----------.----+---------.-------------------.------------.
! ....!..... ....!.. ......!...... ......!....... ......!........ !
! !WORDLIST! !INDEX! !CONCORDANCE! !COLLOCATIONS! !CO-OCCURRENCE! !
! ~~~~!~~~~~ ~~~~!~~ ~~~~~~!~~~~~~ ~~~~~~!~~~~~~~ ~~~~~~!~~~~~~~~ !
!<-----'-----------'----------' ....... ! !<---------. !
! !*SPAN!<---! .----.<----+--->. ! !
! ~~~!~~~ ! ! ...!..... ....!.... ! !
! `------>! ! !*PHRASE! !*SERIES! ! !
! ............ ! ! ~~~~!~~~~ ~~~~!~~~~ ! !
! !*FREQUENCY!<-! ! `---------`---->' !
! ~~~~~~!~~~~~ ! ! .......... ! !
! `------>! !->!*PATTERN!-------->' !
! .<--------------------' ! ~~~~~~~~~~ !
! ................. ! ................. ! ........... !
! !EVERY COLLOCATE!<--+-->!SELECTCOLLOCATE! !<-!WRITETEXT!<---------!
! ~~~~~~~!~~~~~~~~~ ! ~~~~~~~~!~~~~~~~~ ! ~~~~~~~~~~~ !
! ! ! ........!........ ! ......... !
! ! ! !Set description! !<-!NEWLINE!<-----------!
! ! ! ~~~~~~~~!~~~~~~~~ ! ~~~~~~~~~ !
! `----------->!<----------' ! ......... !
! .------------------->!<--------------------. !<-!NEWPAGE!<-----------!
! ! ........... ! ........... ! ! ~~~~~~~~~ !
! ! !REJECTING!<---!--->!ACCEPTING! ! ! ...... !
! ! ~~~~~!~~~~~ ! ~~~~~!~~~~ ! !<-!NOTE!<--------------!
! ! .........!....... ! .......!......... ! ! ~~~~~~ !
! `-!Set description! ! !Set description!->' ! ......... !
! ~~~~~~~~~~~~~~~~~ ! ~~~~~~~~~~~~~~~~~ !<-!MESSAGE!<-----------'
`----------------------'<-----------------------------' ~~~~~~~~~
.<----------------------.
! ............ !
! .->!*FREQUENCY!------'!
Where ! ! ~~~~~~~~~~~~ !
! ! ................ !
Set description ---`-!->!*LIST OF WORDS!--'!
! ~~~~~~~~~~~~~~~~ !
! .......... !
`->!*PATTERN!-------->'
~~~~~~~~~~