NCHLT: isiXhosa POS tag set

Tag set

For annotators' purposes, this tag set is largely taken over from Taljard et al. (2008) and from various documents compiled by G Faasz and U Heid of the IMS, Stuttgart, and by D J Prinsloo and E Taljard of the University of Pretoria. The information below describes the current state of the tag set; further development will probably necessitate a number of changes.

The tagset is mainly based on the lexical and morphological criteria defined by Lombard (1985) and Louwrens (1991). As described above, the logical structure of the tagset is divided into two layers of linguistic description (annotation levels):

The first annotation level includes all mandatory, or, according to EAGLES, obligatory information, namely up to three elements: an element hinting at the word class, a second one specifying functional or syntactic properties, and a third one giving morphological specifics, cf. e.g. PRO(noun)EMP(hatic)PERS(on).

 

The second level of annotation includes recommended and optional information. This level is in most cases used for a detailed description of closed class items described in the tagger lexicon. Compare the following excerpt:

 

Figure 1: Annotation levels

Description            Tag 1st level             Tag 2nd level
                       (mandatory information)   (optional/recommended information)

Pronouns:
  emphatic personal    PROEMPPERS                1sg, 2sg, 1pl, 2pl
Verbals:               V                         tr
Morphemes:
  deficient            MORPH                     def
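The two annotation levels can be separated mechanically. The following is a minimal sketch, assuming that the underscore separates the first-level part from the second-level features (as in tags such as PROEMPPERS_1sg or N03_dim_loc used in this tag set) and that the suffix _nil marks an empty second level. The function name split_tag is hypothetical and not part of the tag set.

```python
from typing import List, Tuple


def split_tag(tag: str) -> Tuple[str, List[str]]:
    """Split a full tag into its first-level part and second-level features."""
    level1, *features = tag.split("_")
    # Assumption: "_nil" means that no second-level information is given.
    features = [f for f in features if f != "nil"]
    return level1, features


print(split_tag("PROEMPPERS_1sg"))  # ('PROEMPPERS', ['1sg'])
print(split_tag("N03_dim_loc"))     # ('N03', ['dim', 'loc'])
print(split_tag("N09_nil"))         # ('N09', [])
```

A tag without any underscore, such as AUX, is returned with an empty feature list.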

 

As for the actual tagging, an additional first level of tagging is envisaged. On this level, linguistic words will be tagged. For Northern Sotho, this implies that the four orthographic units ke + a + mo + rata will be tagged as V, since together they constitute a linguistic verb.
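One possible way to represent this additional level is to store the orthographic units together with the single tag assigned to the linguistic word. The sketch below is only an illustration; the class name LinguisticWord and its fields are hypothetical, and no tags for the individual units are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinguisticWord:
    units: List[str]  # orthographic units as written in the text
    tag: str          # tag assigned to the linguistic word as a whole

    def surface(self) -> str:
        """Return the units joined as they appear in running text."""
        return " ".join(self.units)


# The Northern Sotho example from the text: four units, one linguistic verb.
word = LinguisticWord(units=["ke", "a", "mo", "rata"], tag="V")
print(word.surface(), word.tag)  # ke a mo rata V
```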

 

The tagset currently distinguishes 29 categories and different levels of annotation. The first part of the tag gives a general indication of the nature of the unit in question. These are as follows:

1.         $ = Punctuation

2.         ABBR = abbreviation

3.         ADJ = adjective

4.         ADV = adverb

5.         ASP = aspectual marker

6.         AUX = auxiliary verb

7.         CCOP = class-indicating copulative subject concord

8.         CDEM = class-indicating demonstrative

9.         CDEMCOP = class-indicating demonstrative copulative

10.      CN = class-indicating nominal prefix

11.      CO = class-indicating object concord

12.      CPOSS = class-indicating possessive concord

13.      CS = class-indicating subject concord

14.      ENUM = enumerative

15.      IDEO = ideophone

16.      INT = interjection

17.      JUNC = conjunction

18.      MNEG = negative morpheme

19.      N = noun

20.      NPP = place and brand name

21.      NUM = numerative

22.      PART = particle

23.      PROEMP = emphatic pronoun

24.      PROPOSS = possessive pronoun

25.      PROQUANT = quantitative pronoun

26.      QUE = question word

27.      TENSE = tense marker

28.      V = verbal

29.      VCOP = copulative verb
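Because some category labels are prefixes of others (N and NPP, V and VCOP, CDEM and CDEMCOP), a lookup of the general category has to prefer the longest matching label. The following is a minimal sketch of such a lookup; the names CATEGORIES and category_of are hypothetical, and the function only inspects the beginning of the tag, so it does not validate the remainder.

```python
CATEGORIES = {
    "$": "punctuation", "ABBR": "abbreviation", "ADJ": "adjective",
    "ADV": "adverb", "ASP": "aspectual marker", "AUX": "auxiliary verb",
    "CCOP": "class-indicating copulative subject concord",
    "CDEM": "class-indicating demonstrative",
    "CDEMCOP": "class-indicating demonstrative copulative",
    "CN": "class-indicating nominal prefix",
    "CO": "class-indicating object concord",
    "CPOSS": "class-indicating possessive concord",
    "CS": "class-indicating subject concord",
    "ENUM": "enumerative", "IDEO": "ideophone", "INT": "interjection",
    "JUNC": "conjunction", "MNEG": "negative morpheme", "N": "noun",
    "NPP": "place and brand name", "NUM": "numerative", "PART": "particle",
    "PROEMP": "emphatic pronoun", "PROPOSS": "possessive pronoun",
    "PROQUANT": "quantitative pronoun", "QUE": "question word",
    "TENSE": "tense marker", "V": "verbal", "VCOP": "copulative verb",
}


def category_of(tag: str) -> str:
    """Return the general category label that a tag starts with."""
    level1 = tag.split("_")[0]
    matches = [label for label in CATEGORIES if level1.startswith(label)]
    if not matches:
        raise ValueError(f"unknown category in tag: {tag!r}")
    # Prefer the longest label so that VCOP is not mistaken for V.
    return max(matches, key=len)


print(category_of("CDEMCOP_01"))      # CDEMCOP
print(category_of("PROEMPPERS_1pl"))  # PROEMP
print(category_of("NLOC"))            # N
```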

As we envisage going deeper into morphological analysis, we also plan to implement the following tags:

AS = adjectival stem

CA = class-indicating adjectival prefix

NS = noun stem

NSuf = nominal suffix

VEnd = verbal ending

VExt = verbal extension

VR = verb root

 

1.         PUNCTUATION

The tag $ is used for all punctuation marks. These include full stops, commas, colons, semi-colons, quotation marks, hyphens, exclamation marks, brackets, etc.

2.         ABBREVIATION

All abbreviations are tagged as ABBR.

 

3.         ADJECTIVE

The following tags are used:

Level 1: ADJ01-14, ADJLOC

Notes:

Examples:

            se segolo      ADJ07

            mo gobotse ADJLOC
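Tag lists in this document use a range notation such as ADJ01-14. The sketch below expands such a list into individual tags, assuming that a range is inclusive and that class numbers are always written with two digits (as in ADJ07). The function name expand_tag_spec is hypothetical.

```python
import re
from typing import List

# A label made of capitals or underscores, followed by "NN-NN".
_RANGE = re.compile(r"^([A-Z_]+?)(\d{2})-(\d{2})$")


def expand_tag_spec(spec: str) -> List[str]:
    """Expand a comma-separated tag list with ranges into single tags."""
    tags: List[str] = []
    for item in (part.strip() for part in spec.split(",")):
        match = _RANGE.match(item)
        if match:
            prefix = match.group(1)
            start, end = int(match.group(2)), int(match.group(3))
            # Assumption: ranges are inclusive and numbers are zero-padded.
            tags.extend(f"{prefix}{n:02d}" for n in range(start, end + 1))
        elif item:
            tags.append(item)
    return tags


print(expand_tag_spec("CS14-15, CSLOC"))  # ['CS14', 'CS15', 'CSLOC']
```

Under these assumptions, expanding "ADJ01-14, ADJLOC" gives fifteen tags, ADJ01 to ADJ14 plus ADJLOC.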

4.         ADVERB

The following tags are used:

Level 1:          ADV

Level 2:         ADV_loc

Notes:

Examples:   

impela                       ADV_nil

kwaButhelezi           ADV_loc

 

5.         ASPECTUAL MARKER

The following tags are used:

Level 1: ASP

Level 2: ASP_pot, ASP_prog

Note:

Examples:

ba fo bolela

ASP_nil

ba sa bolela

ASP_prog

ba ka bolela

ASP_pot

6.         AUXILIARY

The following tag is used:

Level 1: AUX

Notes:

Examples:

ba šetše ba fihlile

AUX

o ile bolela bjalo

AUX

 

7.         [CLASS-INDICATING] COPULATIVE SUBJECT CONCORD

The following tags are used:

Level 1: CCOP01-10, CCOP14-15, CCOPLOC, CCOPPERS

Level 2: CCOPPERS_1sg, CCOPPERS_1pl, CCOPPERS_2sg, CCOPPERS_2pl

Notes:

Examples:

le nna ke gona

CCOPPERS_1sg

borotho bo gona

CCOP14_nil

re mo toropong

CCOPPERS_1pl

 

8.         [CLASS-INDICATING] DEMONSTRATIVES

The following tags are used:

CDEM01-10, CDEM14-15, CDEMLOC

Notes:

Examples:   

            abantu laba  CDEM02

            isitsha leso   CDEM07

            khona laphaya        CDEMLOC

 

9.         [CLASS-INDICATING] COPULATIVE DEMONSTRATIVES

The following tags are used:

Level 1: CDEMCOP

Level 2: CDEMCOP_01-10, CDEMCOP_14-15, CDEMCOP_loc

Notes:

Examples:

            nanku            CDEMCOP_01

            nazi                CDEMCOP_08

            naku               CDEMCOP_loc

 

10.      [CLASS-INDICATING] NOMINAL PREFIX

11.      [CLASS-INDICATING] OBJECT CONCORD

The following tags are used:

Level 1: CO01-10, CO14-15, COLOC, COPERS

Level 2: COPERS_1pl, COPERS_2pl, COPERS_2sg

Notes:

Examples:

Ba re thušitše

COPERS_1pl

Re a go nyaka

COPERS_2sg

Ke a a rata

CO06

Ba tlo se reka

CO07

12.      [CLASS-INDICATING] POSSESSIVE CONCORD

The following tags are used:

Level 1: CPOSS01-10, CPOSS14-15, CPOSSLOC

Notes:

Examples:

bana ba gagwe

CPOSS02

diaparo tša bana

CPOSS08

fase ga tafola

CPOSSLOC

 

13.      [CLASS-INDICATING] SUBJECT CONCORD

The following tags are used:

Level 1: CS01-10, CS14-15, CSLOC, CSINDEF, CSNEUT, CSPERS

Level 2: CSPERS_1sg, CSPERS_1pl, CSPERS_2sg, CSPERS_2pl

Notes:

Examples:

se fihlile

CS07

ga di ešo tša fihla

CS10

fase go a tonya

CSLOC

go a fiša

CSINDEF

e be e le marega

CSNEUT

o a tshwenya

CSPERS_2sg

ra thoma mošomo

CSPERS_1pl

 

14.      ENUMERATIVE

The following tag is used:

Level 1:          ENUM

Note:

Examples:

ihashe  linye

ENUM

la mahashe  manye

ENUM

 

15.      IDEOPHONE

The following tag is used:

Level 1:          IDEO

Examples:   

mbo                                                   

IDEO

chuku

IDEO

cobo-cobo

IDEO

cwaka           

IDEO

 

16.      INTERJECTION

The following tags are used:

Level 1: INT

Level 2: INT_neg

Notes:          

Examples:

heke

INT_nil

hayi

INT_neg

 

17.      CONJUNCTION

The following tag is used:

Level 1:          JUNC

Notes:

Examples:

kodwa

JUNC

ngoba

JUNC

 

 

 

 

18.      NEGATIVE MORPHEME

The following tag is used:

Level 1: MNEG

Notes:

Examples:

ga ba bolele

MNEG

ba sa bolele

MNEG

gore ba se bolele

MNEG

 

19.      NOUN

The following tags are used:

Level 1: N01-10, N01a, N02b, N14, NLOC

Level 2: _aug, _dim, _loc, _name

Notes:

Examples:

inja

N09_nil

uSipho

N01a_name

injana

N09_dim

ekhaya

N09_loc

indlovukazi

N09_aug

emlilwaneni

N03_dim_loc

phansi

NLOC

ooSipho

N02a_name

 

20.      PLACE AND BRAND NAMES

The following tags are used:

Level 1: NPP

Level 2: NPP_place, NPP_brand

Notes:

Examples:

eThekwini

NPP_place

Coke

NPP_brand

 

21.      NUMERATIVE

The following tag is used:

NUM

Note:

22.      PARTICLE

The following tags are used:

Level 1:          PART

Level 2:         PART_cop, PART_agen, PART_hort, PART_loc, PART_que, PART_temp, PART_ins, PART_con

 

Notes:

Examples:

ke marega

PART_cop

e bonwa ke dimpša

PART_agen

a re bale

PART_hort

ka kua toropong

PART_loc

na ba tlile?

PART_que

ka Mokibelo

PART_temp

ka thipa

PART_ins

go na le kotsi

PART_con

 

23.      EMPHATIC PRONOUN

The following tags are used:

Level 1: PROEMP01-10, PROEMP14-15, PROEMPLOC, PROEMPPERS

Level 2: PROEMPPERS_1sg, PROEMPPERS_1pl, PROEMPPERS_2sg, PROEMPPERS_2pl

Notes:

Examples:

yena

PROEMP01

thina

PROEMPPERS_1pl

khona

PROEMPLOC

izincwadi zona

PROEMP10

kuyona

PROEMP09

24.      POSSESSIVE PRONOUNS

The following tags are used:

Level 1: PROPOSS01-10, PROPOSS14-15, PROPOSSLOC, PROPOSSPERS

Level 2: PROPOSSPERS_1sg, PROPOSSPERS_1pl, PROPOSSPERS_2sg, PROPOSSPERS_2pl

Notes:

Examples:

bana ba gagwe

PROPOSS01

bana ba gešo

PROPOSSPERS_1pl

bana ba rena

PROPOSSPERS_1pl

maoto a tšona

PROPOSS10

dikolo tša gona

PROPOSSLOC

 

25.      QUANTITATIVE PRONOUNS

The following tags are used:

PROQUANT01-10, PROQUANT14-15, PROQUANTLOC

Notes:

Examples:

Abantwana bonke

PROQUANT02

zonke ziqedile

PROQUANT10

bona bonke

PROQUANT02

Abantwana bodwa

PROQUANT02

Zona zodwa

PROQUANT10

Yinja yodwa

PROQUANT09

 

26.      QUESTION WORDS

The following tags are used:

Level 1: QUE

Level 2: QUE_N01a, QUE_N02b, QUE_loc, QUE_time, QUE_man, QUE_01-10, QUE_14-15

Notes:

 

Examples:

                 

bafike nini?

QUE_time

basebenza njani/kanjani?

QUE_man

basebenza kuphi?

QUE_loc

abantu baphi?

QUE_02

uthanda ubani?

QUE_N01a

uqedile na?

QUE_nil

 

27.      TENSE MARKER

The following tags are used:

Level 1: TENSE

Level 2: TENSE_fut, TENSE_pres, TENSE_past

Notes:

Examples:

ba tlo bolela

TENSE_fut

ba a bolela

TENSE_pres

ba ka se bolele

TENSE_fut

ga ba a bolela

TENSE_neg

 

28.      VERBAL

The following tag is used:

Level 1: V

Notes:

Examples:

(ndi)bona

V_tr

(uzi)shaya

V_tr

(bazo)funda

V_tr

(uba)phekela

V_dtr

(yi)dle

V_tr

(ku)facaka

V_itr


29.      COPULATIVE VERB

The following tags are used:

Level 1: VCOP

Level 2: VCOP_neg

Notes:

Examples:

ke na le

VCOP_nil

ge e le marega

VCOP_nil

ge a se gona

VCOP_neg

ya ba selemo

VCOP_nil