NAME
       Format of semantic concordance file

DESCRIPTION
       A semantic concordance consists of texts that have been syntactically
       and semantically tagged. The semantic tagging was done by hand, using
       various tools to annotate the English text with WordNet senses. The
       "raw" data were reformatted and syntactically tagged before semantic
       tags were assigned. See semcor(7WN) for more information about
       semantic concordances.

       For historical reasons, semantic concordance files are referred to as
       context files. Some of the field elements within context files
       reflect that history as well.

   File Format
       Regardless of which files comprise a semantic concordance, and which
       words are tagged, the format of each concordance file is the same.

       The context file format follows SGML guidelines, using elements and
       attribute/value pairs to record information about the file, paragraph
       and sentence boundaries, and syntactic and semantic information. SGML
       elements give a file its structure; attribute/value pairs provide
       additional information. All SGML elements require both start and end
       tags. SGML attribute/value pairs are of the form:

              attribute=value

       Our SGML format deviates from the standard in that values are
       enclosed in quotation marks only when there may be embedded spaces.
       Because of the large number of attribute/value pairs, quotation marks
       around every value would have substantially increased the size of the
       concordances.

   Nomenclature
       The structure of a context file is specified below in pseudo-BNF
       notation. Each SGML element is on a separate line. Terminals are
       printed in boldface and are represented in the file as written.
       Items in italics are variables. Upper-case strings are non-terminals.

   File Structure
       CONTEXTFILE    ::= <contextfile concordance=conc>
                              CONTEXT+
                          </contextfile>
       CONTEXT        ::= <context filename=filename [paras=yes]>
                              PARA+ | SENT+
                          </context>
       PARA           ::= <p pnum=paragraph_number>
                              SENT+
                          </p>
       SENT           ::= <s snum=sentence_number>
                              SENT_TOK+
                          </s>
       SENT_TOK       ::= ( WORD_FORM | PUNC )+
       WORD_FORM      ::= <wf attribute/value pairs>word</wf>
       POS            ::= pos=POS_TAG
       POS_TAG        ::= CC | CD | DT | EX | FW | IN | JJ | JJR | JJS | LS
                          | MD | MD|VB | NN | NNP | NNPS | NNP|NP | NNP|VBN
                          | NNS | NN|SYM | NP | NPS | PDT | POS | PP | PR
                          | PRP | PRP$ | RB | RBR | RBS | RP | TO | UH | VB
                          | VBD | VBG | VBN | VBP | VBZ | WDT | WP | WP$
                          | WRB
       SEM_TAG        ::= LEMMA WNSN LEXSN PN | NULL
       LEMMA          ::= lemma=lemma
       WNSN           ::= wnsn=sense_number
       LEXSN          ::= lexsn=lex_sense
       PN             ::= pn=CATEGORY | NULL
       CATEGORY       ::= person | location | group | other
       RDF            ::= rdf=redefinition | NULL
       DC             ::= dc=distance | NULL
       SEP            ::= sep="separator_string" | NULL
       TAGNOTE        ::= tagnote=TAGNOTE_TYPE
       TAGNOTE_TYPE   ::= sns_miss | indist_sns | wd_miss | insuffctxt
                          | sense_lost | misc
       NOTE           ::= note="note"
       OT             ::= ot=OTHER_TAG | NULL
       OTHER_TAG      ::= notag | metaphor | idiom | complexprep
                          | foreignword | nonceword
       PUNC           ::= <punc>PUNC_CHARACTER</punc>
       PUNC_CHARACTER ::= [ , . ? ! , ; ( [ ) ] ` ' $ " : ]

   Interpretation of SGML Elements
       <contextfile concordance=conc>
              This element indicates the beginning of a context file. conc
              specifies the name of the semantic concordance that the file
              is in. A semantic concordance file contains one or more
              context elements from the same semantic concordance. See
              semcor(7WN) for a list of semantic concordance names.

       <context filename=filename [paras=yes]>
              This element indicates the beginning of a context. filename
              is the name of the original corpus file from which this
              context was extracted. paras indicates that this document
              contains paragraph delimiters.

       <p pnum=paragraph_number>
              Start of a new paragraph. paragraph_number is an integer. The
              first paragraph in a context is numbered 1, and paragraph
              numbers are incremented sequentially.

       <s snum=sentence_number>
              Start of a new sentence. sentence_number is an integer. The
              first sentence in each context is numbered 1, and sentence
              numbers are incremented sequentially throughout the context.
              Sentence numbers do not restart at 1 in each paragraph.

       <wf attribute/value pairs>word</wf>
              This element represents a word form. word is the orthographic
              form as it appears in the original document. All of the
              syntactic and semantic information is stored as
              attribute/value pairs, described below.

              cmd=cmd
                     Indicates the status of the wf element.

                     cmd      Meaning
                     tag      word is to be tagged
                     done     word is semantically tagged
                     ignore   word should not be tagged
                     update   used during semantic concordance
                              development only
                     retag    used during semantic concordance
                              development only

              pos=pos
                     pos is the syntactic tag assigned by Eric Brill's
                     stochastic part-of-speech tagger. See Syntactic Tags
                     below for a list of possible values.

              lemma=lemma
                     The base form of the word or collocation that the
                     other attribute/value pairs in this wf pertain to.
                     This is the form of the string used to search the
                     WordNet database. If rdf is present, lemma is the
                     base form of the redefinition. When pn is present,
                     redefinition, lemma and category all have the same
                     value.

              wnsn=sense_number
                     sense_number is the integer sense number corresponding
                     to the WordNet output display.

              lexsn=lex_sense
                     lex_sense, when concatenated onto lemma using % as the
                     concatenation character, creates a sense_key that
                     indicates the WordNet sense to which word should be
                     linked. This is the semantic tag for word. The format
                     of a sense_key is described in senseidx(5WN).

              pn=category
                     Indicates that word is a proper noun categorized as
                     one of the values of CATEGORY. When pn is present,
                     redefinition, lemma and category all have the same
                     value.

              rdf=redefinition
                     If present, word has been "redefined" to something
                     else. This is mainly used to define discontinuous
                     collocations, to correct typographical errors in the
                     text, or to enter a string that should be used to
                     search WordNet instead of word in order to find an
                     appropriate sense for the semantic tag. When pn is
                     present, redefinition, lemma and category all have
                     the same value.

              dc=distance
                     Indicates that word is part of a discontinuous
                     collocation, in which the words comprising the
                     collocation are not adjacent. distance is an integer
                     that specifies how many wf elements away the semantic
                     tag for the collocation is. It may be negative,
                     indicating wf elements prior to this one, or positive,
                     indicating wf elements following in the file.

              sep="separator_string"
                     Indicates that the space between this wf element and
                     the next should be displayed as separator_string. The
                     string may be one or more characters. The default
                     word separator is one space.

              tagnote=tagnote_type
                     A tagnote attribute/value pair is always present if
                     cmd is update or retag. This is used only during
                     semantic concordance development, and indicates the
                     type of problem encountered during semantic tagging.
                     See TAGNOTE_TYPE for a list of possible values.

              note="note"
                     A note attribute/value pair is always present with
                     tagnote. note may contain a string that provides
                     additional information about the tagnote, or may be
                     empty.

              ot=other_tag
                     If present, a semantic tag cannot be assigned to word
                     for one of the reasons listed in OTHER_TAG.

   Syntactic Tags
       The following tags are assigned by Brill's stochastic part-of-speech
       tagger.
       Syntactic Tag   Interpretation
       CC              Coordinating conjunction
       CD              Cardinal number
       DT              Determiner
       EX              Existential "there"
       FW              Foreign word
       IN              Preposition or subordinating conjunction
       JJ              Adjective
       JJR             Adjective, comparative
       JJS             Adjective, superlative
       LS              List item marker
       MD              Modal
       NN              Noun, singular or mass
       NNP             Proper noun, singular
       NNPS            Proper noun, plural
       NNS             Noun, plural
       NP              Proper noun, singular
       NPS             Proper noun, plural
       PDT             Predeterminer
       POS             Possessive ending
       PP              Personal pronoun
       PR              Pronoun
       PRP             Pronoun
       PRP$            Pronoun, possessive
       RB              Adverb
       RBR             Adverb, comparative
       RBS             Adverb, superlative
       RP              Particle
       SYM             Symbol
       TO              "to"
       UH              Interjection
       VB              Verb, base form
       VBD             Verb, past tense
       VBG             Verb, gerund or present participle
       VBN             Verb, past participle
       VBP             Verb, non-3rd person singular present
       VBZ             Verb, 3rd person singular present
       WDT             Wh-determiner
       WP              Wh-pronoun
       WP$             Possessive wh-pronoun
       WRB             Wh-adverb

SEE ALSO
       escort(1WN), senseidx(5WN), wndb(5WN), morphy(7WN), semcor(7WN).
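
EXAMPLES
       The attribute/value conventions and the sense_key construction
       described above can be illustrated with a short Python sketch. It is
       not part of the WordNet distribution. It assumes that each word form
       sits on one line in the form <wf attribute=value ...>word</wf> and
       that punctuation is wrapped in <punc> tags. The sample lines and the
       helper names parse_wf, sense_key and tagged_words are made up for
       illustration. A regular expression is used instead of an SGML or XML
       parser because values are quoted only when they may contain spaces.

```python
import re

# One <wf ...>word</wf> element; group 1 is the attribute string, group 2 the word.
WF_RE = re.compile(r"<wf\s+([^>]*)>(.*?)</wf>")
# attribute=value, where value is either "quoted text" or a bare token.
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_wf(line):
    """Return (word, attrs) for a line holding a wf element, else None."""
    m = WF_RE.search(line)
    if m is None:
        return None
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(m.group(1)):
        # Exactly one of quoted/bare took part in the match; quoted may be empty.
        attrs[name] = bare if bare else quoted
    return m.group(2), attrs


def sense_key(attrs):
    """Join lemma and lexsn with '%' for a semantically tagged (cmd=done) word."""
    if attrs.get("cmd") != "done":
        return None
    if "lemma" not in attrs or "lexsn" not in attrs:
        return None
    return attrs["lemma"] + "%" + attrs["lexsn"]


# Hypothetical sample lines, written to match the format described above.
SAMPLE = """\
<wf cmd=ignore pos=DT>The</wf>
<wf cmd=done pos=NN lemma=jury wnsn=1 lexsn=1:14:00::>jury</wf>
<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>
<punc>.</punc>
"""


def tagged_words(text):
    """Return (word, pos, sense_key) for every wf element in text."""
    rows = []
    for line in text.splitlines():
        parsed = parse_wf(line)
        if parsed is None:
            continue  # punctuation and structural tags are skipped
        word, attrs = parsed
        rows.append((word, attrs.get("pos"), sense_key(attrs)))
    return rows


if __name__ == "__main__":
    for row in tagged_words(SAMPLE):
        print(row)
```

       Words whose cmd is not done yield no sense key, which matches the
       cmd table: only done marks a word as semantically tagged. The sketch
       ignores dc, rdf and ot; a fuller reader would need to follow dc
       offsets to find the wf element that carries the tag for a
       discontinuous collocation.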