
A CHRG Analysis of ambiguity in Biological Texts (Extended Abstract)

Veronica Dahl and Baohua Gu

Logic and Functional Programming Group,

School of Computing Science, Simon Fraser University,

Burnaby, B.C. V5A 1S6, Canada

{veronica,bgu}@cs.sfu.ca

Abstract. We propose a methodology for analyzing human language sentences which can efficiently choose between alternative readings springing from the interaction between coordination and prepositional phrase attachment. We present a proof of concept in terms of an extremely succinct CHRG [3] analyzer for interpreting biological text titles. Our method uses expert knowledge on semantic types and compatibilities among them, as well as contextual facilities of CHRG to gain an overall view of the sentence components involved in disambiguation.

1 Introduction

This work was inspired by our efforts to automatically extract concepts from biological text, where one of the main challenges is the amount of information succinctly packed into titles. Typically, biological named entities appear as acronyms or in condensed versions, and the amount of information is maximized by heavy use of coordination within noun phrases. Titles also tend to contain several prepositional phrases with no clear indication of what antecedent they should attach to. For example, given the title sentence of a Medline abstract, “IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase”, it is hard to see whether “through CD28” modifies only NF-kappa B activation or both IL-2 gene expression and NF-kappa B activation.

The ambiguity involved in titles containing even one instance of coordination or prepositional phrase attachment is a challenge, and when two or more coexist in the same title, the number of possible interpretations explodes, making it extremely difficult for a naive automated system to cope.

However, it is not unusual to find, within the text or in related knowledge bases such as biological dictionaries and ontologies (e.g., the GENIA Ontology [8]), short descriptions of what the entities’ names refer to, or at the very least their semantic types (e.g., protein molecule, DNA family or group). The short descriptions are often contained in simple constructs named appositions (e.g., as in “Grf40, a novel Grb2 family member”), and their semantic types can very often be found directly in available taxonomies (e.g., the GENIA corpus [7]), or inferred from other related knowledge bases (e.g., [8]).

We have found that by extracting the semantic class to which each named entity refers, and by considering the relationships the sentence involves it in, we can discard many of the ambiguities that originate in the coordination constructions where they intervene. In this paper we propose an analysis of ambiguity in compact text in general, while focusing on biological texts’ titles in particular for ease of demonstration and exemplification.

2 Analysis of the Ambiguities Most Commonly Present in Biological Titles

2.1 An Example

The typical features of titles and similar corpora are:

a) the entities are referred to through abbreviations or acronyms;

b) coordination is quite common;

c) prepositional phrase attachment ambiguities are very common too.

The following sentence, taken from the GENIA corpus [7], illustrates: IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

In this sentence, coordination interacts with prepositional phrase attachment with highly ambiguous results. The sentence could mean either of the following (note that each reading is the conjunction of two simpler sentences, labeled a and b below):

– Reading (1):

1a. IL-2 gene expression through CD28 requires reactive oxygen production by 5-lipoxygenase.

1b. NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

– Reading (2):

2a. IL-2 gene expression requires reactive oxygen production by 5-lipoxygenase.

2b. NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

At first glance, deciding among these possible readings appears an insurmountable task. However, if we simply retrieve the semantic types or classes to which each entity belongs, and note whether such classes can meaningfully appear as arguments of the relationships in which they are involved, many of the possible readings fade away. Any remaining ones will in general be those that are also ambiguous for a human expert in the biological notions involved.
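The two readings listed above can be generated mechanically. The following Python sketch is only an illustration and is not part of the CHRG prototype: the function name `readings` and the plain-string representation of conjuncts, prepositional phrase, and predicate are assumptions made for this example. It either distributes the prepositional phrase over every conjunct or attaches it to the last conjunct alone.

```python
from typing import List


def readings(conjuncts: List[str], pp: str, predicate: str) -> List[List[str]]:
    """Return both candidate readings, each as a list of simple sentences."""
    # Reading (1): the prepositional phrase distributes over every conjunct.
    wide = [f"{c} {pp} {predicate}" for c in conjuncts]
    # Reading (2): the prepositional phrase attaches to the last conjunct only.
    narrow = [f"{c} {predicate}" for c in conjuncts[:-1]]
    narrow.append(f"{conjuncts[-1]} {pp} {predicate}")
    return [wide, narrow]


if __name__ == "__main__":
    title_readings = readings(
        ["IL-2 gene expression", "NF-kappa B activation"],
        "through CD28",
        "requires reactive oxygen production by 5-lipoxygenase",
    )
    for number, reading in enumerate(title_readings, start=1):
        for sentence in reading:
            print(f"Reading ({number}): {sentence}.")
```

With more conjuncts or more prepositional phrases the number of candidates grows quickly, which is why the semantic filtering described next matters.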

2.2 Our methodology

To analyze sentences such as the above, we first consult the GENIA corpus, a repository of annotations for every biological named entity in biological text, in order to attach each abbreviation or acronym to a) its full name and b) its semantic type. In those cases where this information is not present, we look for an apposition either in the text that accompanies the title we are analyzing, or in related texts.

In the previous section’s example, a lookup for IL-2 gene in the GENIA corpus yields the name IL-2 gene and the semantic type DNA domain or region. On the other hand, it could be that an entity is not annotated as a biological entity in the GENIA corpus, but we find it within the text or within some other corpus, in an apposition which can point us to the type. Even when an entity is found in the GENIA corpus, consultation of the appositions which further define it may be useful. For instance, from “Grf40, A novel Grb2 family member, is involved in T cell signaling through interaction with SLP-76 and LAT.”, we can infer that the protein molecule Grf40 further belongs to the subfamily Grb2.
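The lookup order described here (annotation first, apposition as a fallback) can be sketched in Python as follows. This is a hedged illustration, not the prototype's code: the dictionary `GENIA_TYPES`, the regular expression `APPOSITION`, and the function `describe` are all hypothetical names, and the pattern only covers the simple sentence-initial shape “X, a ..., ...” seen in the Grf40 example.

```python
import re
from typing import Optional

# Hypothetical stand-in for GENIA corpus annotations (entity -> semantic type).
GENIA_TYPES = {"IL-2 gene": "DNA_domain_or_region"}

# Assumed pattern: "<entity>, a/an <description>," at the start of a sentence.
APPOSITION = re.compile(
    r"^(?P<entity>[^,]+),\s+(?P<desc>an?\s+[^,]+),", re.IGNORECASE
)


def describe(entity: str, text: str) -> Optional[str]:
    """Return the annotated type if known, else an apposition found in text."""
    if entity in GENIA_TYPES:
        return GENIA_TYPES[entity]
    match = APPOSITION.match(text)
    if match and match.group("entity").strip() == entity:
        return match.group("desc").strip()
    return None


if __name__ == "__main__":
    title = ("Grf40, A novel Grb2 family member, is involved in T cell "
             "signaling through interaction with SLP-76 and LAT.")
    print(describe("IL-2 gene", title))
    print(describe("Grf40", title))
```

A real system would still have to map the extracted description onto a type in the ontology; the sketch stops at returning the apposition text.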

Our problem now reduces to encoding which semantic types make sense in each argument of each of the biological relationships we most commonly encounter in biological texts. This information has to be constructed from an expert’s knowledge. The Appendix shows a prototype implementation of our methodology as a first step in demonstrating its usefulness.

3 Exploiting ontological information

One of the files we consult in our system is the GENIA ontology, which expresses subtype relationships in the format exemplified by:

subtype('Natural_source','Source').

In another file we have the expert knowledge about compatibility between biomedical concepts, expressed in our system as constants or as types, e.g.

compatible('IL-2_gene_expression','Protein_molecule').

Such information will be consulted from our grammar rules for disambiguation. For instance, the example in Section 2.1 is easily taken care of by just one CHRG rule, namely

np(A), prep(P), np(B) /- (verb(_); prep(_); eos(_)) <:>
    subtype(A,'BioProcess'), subtype(B,'BioEntity'), compatible(A,B)
    | np(A+P+B).

This rule will create a noun phrase from two noun phrases (represented by A and B) joined by a preposition (P), provided that the second np is followed by either a verb, another preposition, or an end-of-sentence marker, and provided that A’s type and B’s type are compatible (as well as being subtypes of 'BioProcess' and 'BioEntity', respectively). The ontology is consulted, of course, when checking the desired subtype relationships.
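The conditions checked by this rule can be paraphrased procedurally. The Python sketch below is an illustration under stated assumptions, not a translation of the CHRG machinery: `PARENT` is a hand-copied excerpt of the subtype facts from Appendix B, and `is_subtype`, `attach_pp`, and the `next_kind` argument (standing in for the rule's right context) are hypothetical names. The set of compatible pairs is assumed to have been derived already.

```python
from typing import Optional, Set, Tuple

# Excerpt of the subtype facts (child -> parent) listed in Appendix B.
PARENT = {
    "BioEntity": "BioConcept",
    "BioProcess": "BioConcept",
    "Substance": "BioEntity",
    "Compound": "Substance",
    "Organic": "Compound",
    "Amino_acid": "Organic",
    "Protein": "Amino_acid",
    "Protein_molecule": "Protein",
    "CD28": "Protein_molecule",
    "IL-2_gene_expression": "BioProcess",
    "NF-kappa-B_activation": "BioProcess",
}


def is_subtype(t: str, ancestor: str) -> bool:
    """Follow parent links upwards; True if ancestor is reached."""
    while t in PARENT:
        t = PARENT[t]
        if t == ancestor:
            return True
    return False


def attach_pp(a: str, p: str, b: str, next_kind: str,
              compatible_pairs: Set[Tuple[str, str]]
              ) -> Optional[Tuple[str, str, str]]:
    """Build the combined noun phrase (a, p, b) if the rule's conditions hold."""
    # Right context: the second np must be followed by a verb, a preposition,
    # or the end-of-sentence marker.
    if next_kind not in {"verb", "prep", "eos"}:
        return None
    if (is_subtype(a, "BioProcess") and is_subtype(b, "BioEntity")
            and (a, b) in compatible_pairs):
        return (a, p, b)
    return None


if __name__ == "__main__":
    pairs = {("NF-kappa-B_activation", "CD28")}
    print(attach_pp("NF-kappa-B_activation", "through", "CD28", "verb", pairs))
```

Walking parent links one at a time plays the role of the transitive subtype inference that the prototype obtains from its CHR rules.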

4 Criticality of CHRG for our methodology

The use of CHRG is crucial to our approach, since

a) it allows us to work bottom-up, thereby heading more directly to the right analysis.

b) it allows us to put together into the same rule information coming from heterogeneous sources. For instance, we mine ontology information from the GENIA Ontology [8], and can consult or effect transformations of that information that suit disambiguation purposes. Left-hand sides of CHRG rules do not care which file (among the ones that have been read) the information to be put together comes from, which gives a great degree of modularity.

c) it lets us exploit appositions that define a given term, whether they occur in the text or even in some titles. If we consider for instance the sentence “Overexpression of DR-nm23, a protein encoded by a member of the nm23 gene family, inhibits granulocyte differentiation and induces apoptosis in 32Dc13 myeloid cells”, we can glean some definitional information for DR-nm23. The level of granularity with which we want to take advantage of this feature will vary according to our purposes. We might be content for instance with noting only that it is a protein, in which case the rest of the sentence can be ignored (CHRG includes a facility for disregarding intermediate strings which will not be analyzed), or that it is a protein and is encoded by a member of the nm23 gene family, or that it is a protein, is encoded by a member of the nm23 gene family, and induces apoptosis, and so on. Likewise, the information that nm23 is a gene can be gleaned from the same sentence. In other words, we can implement a specialized CHRG which only looks at appositions within a text, disregards the rest of the text, and decides how to usefully exploit the information in the apposition: will it be used to consult the type hierarchy, to expand it, or just to add a definition into the database we are working with?

d) the use of CHRGs allows for a straightforward coexistence with CHR [6] rules, and even for the same symbols to be considered both as grammar symbols and as constraints. This is exemplified by the coexistence of the grammar rule described in Section 3 with the CHR rule:

np(X), np(Y), compatible(A,B) ==>
    subtype(X,A), subtype(Y,B) | compatible(X,Y).

which extends the user’s definitions of compatibility by considering that if the user has described A and B as being compatible, and the parser has discovered X and Y as noun phrases, where X is a subtype of A and Y is a subtype of B, then X and Y are also compatible. This inferred compatibility information can then be used by the grammar rule in Section 3, since it is now in the constraint store.
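As a rough procedural analogue of this propagation, the following Python sketch lifts declared compatibilities down to the noun phrases found in a sentence. It is an illustration only: `PARENT` is a hand-copied excerpt of the facts in Appendix B, and `is_subtype`, `matches`, and `lift_compatibility` are hypothetical names. Unlike the CHR rule shown above, the sketch lets a type stand for itself, which is an assumption intended to cover the two one-sided rules listed in Appendix A.

```python
from typing import Iterable, Set, Tuple

# Excerpt of the subtype facts (child -> parent) listed in Appendix B.
PARENT = {
    "Protein": "Amino_acid",
    "Protein_molecule": "Protein",
    "CD28": "Protein_molecule",
    "5-lipoxygenase": "Protein_molecule",
    "IL-2_gene_expression": "BioProcess",
    "NF-kappa-B_activation": "BioProcess",
}


def is_subtype(t: str, ancestor: str) -> bool:
    """Follow parent links upwards; True if ancestor is reached."""
    while t in PARENT:
        t = PARENT[t]
        if t == ancestor:
            return True
    return False


def matches(x: str, declared_type: str) -> bool:
    """True if x is the declared type itself or one of its subtypes."""
    return x == declared_type or is_subtype(x, declared_type)


def lift_compatibility(noun_phrases: Iterable[str],
                       declared: Set[Tuple[str, str]]
                       ) -> Set[Tuple[str, str]]:
    """Add compatible(X, Y) for noun phrases X, Y covered by a declared pair."""
    nps = list(noun_phrases)
    inferred = set(declared)
    for a, b in declared:
        for x in nps:
            for y in nps:
                if matches(x, a) and matches(y, b):
                    inferred.add((x, y))
    return inferred


if __name__ == "__main__":
    declared = {("IL-2_gene_expression", "Protein_molecule"),
                ("NF-kappa-B_activation", "Protein")}
    nps = ["IL-2_gene_expression", "NF-kappa-B_activation", "CD28"]
    for pair in sorted(lift_compatibility(nps, declared)):
        print(pair)
```

The inferred pairs for CD28 correspond to the compatible constraints that appear in the execution trace of Appendix C.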

We know of no other system that would allow us to combine grammar rules and program rules so seamlessly for interactions of this kind.

As well, the facility of CHRG for looking at context allows us to implement the above-described methodology with extreme conciseness. The full prototype program takes only one page and is included in full in the Appendix.

It is to be noted that as a side effect of disambiguation, our implementation completes the meaning of coordinated sentences which do not overtly contain all the conjuncts. Previous work on reconstructing elided meanings within coordinated phrases in natural language typically takes more machinery to implement, e.g. computing parallelism in discourse, or further tools such as assumptions and Datalog grammars [4][5].

Let us exemplify with the same sample sentence, taken from a real-life title, which we showed in Section 1. Semantic types are described through binary predicates of the form subtype(Entity,Class); e.g., from the program in the Appendix we can see that the semantic type of “IL-2_gene_expression” is “BioProcess”. As well, our expert knowledge base includes information on compatibilities, from which we can know for instance that “IL-2_gene_expression” is compatible with “CD28”.

The rules that analyze conjoined noun phrases consult such information and take appropriate action by conjoining only those components that are compatible in type, and likewise appropriately attaching any prepositional phrases. Thus, in the above example, the second reading is simply not accessible from the grammar rules given, since it fails to satisfy the compatibility condition.
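The overall effect on the sample title can be imitated with a few lines of Python. This is a deliberately simplified sketch, not the grammar itself: `analyse` is a hypothetical function, noun phrases are plain strings, and the compatibility pairs are assumed to be given. Each conjunct receives the prepositional phrase only if it is compatible with the phrase's noun, and one simple sentence is produced per conjunct, which also reconstructs the elided verb and object.

```python
from typing import List, Set, Tuple

Subject = Tuple[str, ...]
Sentence = Tuple[Subject, str, str]


def analyse(conjuncts: List[str], prep: str, pp_np: str, verb: str, obj: str,
            compatible: Set[Tuple[str, str]]) -> List[Sentence]:
    """Distribute the PP over compatible conjuncts and complete each sentence."""
    sentences = []
    for head in conjuncts:
        if (head, pp_np) in compatible:
            subject = (head, prep, pp_np)   # the PP attaches to this conjunct
        else:
            subject = (head,)               # the conjunct stays bare
        sentences.append((subject, verb, obj))
    return sentences


if __name__ == "__main__":
    compatible = {("IL-2_gene_expression", "CD28"),
                  ("NF-kappa-B_activation", "CD28")}
    for sentence in analyse(
            ["IL-2_gene_expression", "NF-kappa-B_activation"],
            "through", "CD28", "require",
            "reactive_oxygen_production+by+5-lipoxygenase", compatible):
        print(sentence)
```

With both processes compatible with CD28, as in the sample knowledge base, both sentences carry “through CD28”, which is Reading (1); if only the second pair were present, the sketch would yield Reading (2) instead.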

5 Discussion

We have proposed a CHRG methodology to disambiguate multiple readings of sentences in biological text, on the basis of compatibilities between semantic types, which are calculated on the fly by consulting the GENIA ontology and dynamically extending user-defined basic relationships on compatibilities. Our parsing technique integrates semantics at the lexical level, exploiting an ontology for the application domain (biological texts).

We mix grammar rules and CHR proper rules to allow productive interaction between domain constraints and grammatical constraints. As explained in Section 4, this makes it easier to express our problem in directly executable terms.

Our approach uses expert knowledge on semantic types of named entities and on compatibility between entities based on their semantic types. Appositions can provide further information about an entity of interest, as we also saw. Some appositions may even throw light upon relationships between two entities.

We have shown that this approach allows us to very succinctly express within the grammar rules the conditions under which alternative readings originating in preposition attachment plus coordinating ambiguities should be chosen.

We have exemplified our methodology for the particular problem of PP-attachment in coordinate constructions, and tested it with a first running prototype which is shown in the Appendix. These first results show that much simpler machinery can be arrived at within our methodology than was previously the case in related work on coordination, including our own work with Datalog grammars and assumptions [4] and even CHR [5]. Part of this is due to the restriction of our domain to a well-investigated domain for which online corpora and ontologies exist, namely the biological domain, but as also pointed out, a bigger part is due to the use of CHRG rules which focus on the relevant context seen overall, check types and their compatibilities, and use this information to decide how to form meaning from the meanings of the involved parts.

However, we have yet to combine the advantages obtained in the present work with other long-distance dependency work around CHR [6], for a uniform, more encompassing treatment, perhaps along the lines suggested in [5]. Our present focus on titles allows us to get away with “just” allowing coordination among the possible long-distance dependency phenomena.

With this work we hope to stimulate further research into the subject.

Acknowledgements. This work is supported by the CONTROL project, funded by the Danish Natural Science Research Council, and by Veronica Dahl’s NSERC Discovery Grant.

References

[1] Aguilar Solis, D. and Dahl, V.: Coordination revisited: a CHR approach. In Proc. Iberamia’04, Mexico.

[2] CHRG User Manual. http://akira.ruc.dk/henning/chrg/

[3] Christiansen, H.: CHR grammars. Theory and Practice of Logic Programming, 5(4-5):467-501 (2005)

[4] Dahl, V.: On Implicit Meanings. In: Computational Logic: From Logic Programming into the Future. F. Sadri and T. Kakas (eds.) (invited contribution), volume in honour of Bob Kowalski, Springer-Verlag, 2002.

[5] Dahl, V.: Treating Long-Distance Dependencies through Constraint Reasoning. In Proc. of the 3rd International Workshop on Multiparadigm Constraint Programming Languages, 2004.

[6] Frühwirth, T.: Theory and Practice of Constraint Handling Rules, Special Issue on Constraint Logic Programming (P. Stuckey and K. Marriot, Eds.), Journal of Logic Programming, Vol 37(1-3), pp 95-138, October 1998.

[7] GENIA Corpus: http://www-tsujii.is.s.u-tokyo.ac.jp/genia/topics/Corpus/

[8] GENIA Ontology: http://www-tsujii.is.s.u-tokyo.ac.jp/genia/topics/Corpus/genia-ontology.html

Appendix A: the prototype CHRG implementation for title disambiguation

% the CHR rules and CHR grammar rules used for disambiguation

:- compile('chrg.txt').

handler simple_coordination_solver.

constraints np/1, compatible/2, subtype/2.

grammar_symbols sentence/1, subj/1, verb/1, obj/1,
                np/1, conj/1, prep/1, eos/1,
                compatible/2, subtype/2.

% to induce new subtype relations
np(A), subtype(A,B), subtype(B,C) ==> subtype(A,C).

% to induce new compatible relations
np(X), np(Y), compatible(A,B) ==>
    subtype(X,A), subtype(Y,B) | compatible(X,Y).
np(X), compatible(A,B) ==> subtype(X,A) | compatible(X,B).
np(Y), compatible(A,B) ==> subtype(Y,B) | compatible(A,Y).

%% grammar rules to group a np with a following preposition phrase
np(A), prep(P), np(B), conj(K), np(C) <:> np(A+P+B), conj(K), np(A+P+C).

np(A), prep(P), np(B) /- (verb(_); prep(_); eos(_)) <:>
    subtype(A,'BioProcess'), subtype(B,'BioEntity'), compatible(A,B)
    | np(A+P+B).

np(A), conj(K), np(B+P+C) <:>
    subtype(A,'BioProcess'), subtype(C,'BioEntity'), compatible(A,C)
    | np(A+P+C), conj(K), np(B+P+C).

%% rules to classify noun phrases as subjects or objects of verbs
np(A) /- verb(_) ::> subj(A).
verb(_) -\ np(A) ::> obj(A).

%% rules to handle coordinations
np(A), conj(_), subj(B) ::> subj(A), subj(B).
obj(A), conj(_), np(B) ::> obj(A), obj(B).

%% to identify a complete sentence
subj(A), verb(V), obj(B) ::> sentence(s(A,V,B)).
sentence(A), conj(_), sentence(B) <:> sentence(A+B).

%% to solve incomplete sentences
subj(A), verb(V) /- conj(_), sentence(s(_,_,B))
    ::> sentence(s(A,V,B)).
sentence(s(A,_,_)) -\ conj(_), verb(V), obj(B)
    ::> sentence(s(A,V,B)).
subj(A) -\ conj(_), sentence(s(_,V,B))
    ::> sentence(s(A,V,B)).
sentence(s(A,V,_)) -\ conj(_), obj(B)
    ::> sentence(s(A,V,B)).
subj(A), verb(V1), conj(_), verb(V2), obj(B)
    ::> sentence(s(A,V1,B)),
        sentence(s(A,V2,B)).

% include example sentence, ontology, concepts, and domain knowledge
:- include('test_example.txt').    % sentences for testing
:- include('genia_ontology.txt').  % the GENIA Ontology
:- include('genia_concepts.txt').  % annotations from GENIA corpus
:- include('compatibility.txt').   % compatibility between concepts

end_of_CHRG_source.

(N.B. for the referees: you can download the example1.txt and other title phrases or title sentences of the GENIA corpus we are considering here from www.cs.sfu.ca/bgu/personal/CSLP2007)

Appendix B: the Auxiliary Files

% the content of file "genia_ontology.txt"
subtype('BioEntity','BioConcept').
subtype('BioProcess','BioConcept').
subtype('Source','BioEntity').
subtype('Substance','BioEntity').
subtype('Natural_source','Source').
subtype('Artificial_source','Source').
subtype('Organism','Natural_source').
subtype('Body_part','Natural_source').
subtype('Tissue','Natural_source').
subtype('Cell_type','Natural_source').
subtype('Cell_component','Natural_source').
subtype('Other_artificial_source','Artificial_source').
subtype('Cell_line','Artificial_source').
subtype('Multi_cell','Organism').
subtype('Mono_cell','Organism').
subtype('Virus','Organism').
subtype('Compound','Substance').
subtype('Atom','Substance').
subtype('Organic','Compound').
subtype('Inorganic','Compound').
subtype('Amino_acid','Organic').
subtype('Nucleic_acid','Organic').
subtype('Lipid','Organic').
subtype('Carbohydrate','Organic').
subtype('Other_organic_compound','Organic').
subtype('Protein','Amino_acid').
subtype('Peptide','Amino_acid').
subtype('Amino_acid_monomer','Amino_acid').
subtype('DNA','Nucleic_acid').
subtype('RNA','Nucleic_acid').
subtype('Nucleotide','Nucleic_acid').
subtype('Polynucleotide','Nucleic_acid').
subtype('Protein_family_or_group','Protein').
subtype('Protein_complex','Protein').
subtype('Protein_molecule','Protein').
subtype('Protein_subunit','Protein').
subtype('Protein_substructure','Protein').
subtype('Protein_domain_or_region','Protein').
subtype('Protein_ETC','Protein').
subtype('DNA_family_or_group','DNA').
subtype('DNA_molecule','DNA').
subtype('DNA_domain_or_region','DNA').
subtype('DNA_substructure','DNA').
subtype('DNA_ETC','DNA').
subtype('RNA_family_or_group','RNA').
subtype('RNA_molecule','RNA').
subtype('RNA_domain_or_region','RNA').
subtype('RNA_substructure','RNA').
subtype('RNA_ETC','RNA').

% the content of file "genia_concepts.txt"
% sample domain knowledge about the types of bioconcepts
subtype('IL-2_gene_expression','BioProcess').
subtype('NF-kappa-B_activation','BioProcess').
subtype('reactive_oxygen_production','BioProcess').
subtype('CD28','Protein_molecule').
subtype('5-lipoxygenase','Protein_molecule').

% the content of file "compatibility.txt"
% sample domain knowledge about compatibility between bioconcepts
compatible('IL-2_gene_expression','Protein_molecule').
compatible('NF-kappa-B_activation','Protein').
compatible('reactive_oxygen_production','Protein_molecule').

% the content of file "test_example.txt"
% a sample sentence to disambiguate
% we assume that base NPs have been identified beforehand
s1 :- X = ['IL-2_gene_expression', and, 'NF-kappa-B_activation',
           through, 'CD28', requires, 'reactive_oxygen_production',
           by, '5-lipoxygenase', '.'],
      parse(X).

['IL-2_gene_expression'] <:> np('IL-2_gene_expression').
[and] <:> conj(and).
['NF-kappa-B_activation'] <:> np('NF-kappa-B_activation').
[through] <:> prep(through).
['CD28'] <:> np('CD28').
[requires] <:> verb(require).
['reactive_oxygen_production'] <:> np('reactive_oxygen_production').
[by] <:> prep(by).
['5-lipoxygenase'] <:> np('5-lipoxygenase').
['.'] <:> eos('.').

Appendix C: the Execution of Testing Sentences

% the execution results of the testing sentence on SICSTUS 3.8.4

| ?- s1.

<0>IL-2_gene_expression<1>and<2>NF-kappa-B_activation\
<3>through<4>CD28<5>requires<6>reactive_oxygen_production\
<7>by<8>5-lipoxygenase<9>.<10>

all(0,10),
begin(-1,0),
end(10,11),
subtype('NF-kappa-B_activation','BioProcess'),
subtype('CD28','BioEntity'),
compatible('NF-kappa-B_activation','CD28'),
subtype('IL-2_gene_expression','BioProcess'),
subtype('CD28','BioEntity'),
compatible('IL-2_gene_expression','CD28'),
subtype(reactive_oxygen_production,'BioProcess'),
subtype('5-lipoxygenase','BioEntity'),
compatible(reactive_oxygen_production,'5-lipoxygenase'),
verb(5,6,require),
np(0,5,'IL-2_gene_expression'+through+'CD28'),
subj(0,5,'IL-2_gene_expression'+through+'CD28'),
conj(0,5,and),
np(0,5,'NF-kappa-B_activation'+through+'CD28'),
subj(0,5,'NF-kappa-B_activation'+through+'CD28'),
obj(6,7,reactive_oxygen_production),
sentence(0,7,s('NF-kappa-B_activation'+through+'CD28',require,\
reactive_oxygen_production)),
sentence(0,7,s('IL-2_gene_expression'+through+'CD28',require,\
reactive_oxygen_production)),
eos(9,10,'.'),
np(6,9,reactive_oxygen_production+by+'5-lipoxygenase'),
obj(6,9,reactive_oxygen_production+by+'5-lipoxygenase'),
sentence(0,9,s('NF-kappa-B_activation'+through+'CD28',require,\
reactive_oxygen_production+by+'5-lipoxygenase')),
sentence(0,9,s('IL-2_gene_expression'+through+'CD28',require,\
reactive_oxygen_production+by+'5-lipoxygenase')) ?

yes

| ?-
