Korean J Lab Med. 2008 Feb;28(1):79-87. doi: 10.3343/kjlm.2008.28.1.79.

Development of a System for Extracting the Information of Candidate Tumor Markers Reported in Biomedical Literatures

Affiliations
  • 1Department of Computer Science Education, Korea University, Seoul, Korea.
  • 2Department of Laboratory Medicine, University of Ulsan College of Medicine and Asan Medical Center, Seoul, Korea. hboh@amc.seoul.kr

Abstract

BACKGROUND
Since the Human Genome Project was completed in 2003, there have been numerous reports on cancer and related markers. This study aimed to develop a system that automatically extracts information on the relationship between cancer and tumor markers from the biomedical literature.
METHODS
Named entities of tumor markers were recognized using both a dictionary-based method and a machine learning method, the support vector machine (SVM). Named entities of cancers were recognized using the MeSH dictionary.
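The dictionary-based part of this step can be illustrated with a minimal sketch. It is not the authors' implementation: the term lists, the `tag_entities` function, and the greedy longest-match strategy are assumptions made for illustration, and the SVM-based recognizer is not modeled here.

```python
import re

# Hypothetical toy dictionaries (assumption): the real system would draw on a
# curated tumor marker list and the MeSH vocabulary for cancer names.
MARKERS = {"prostate-specific antigen", "psa", "cea", "ca 125"}
CANCERS = {"prostate cancer", "ovarian cancer", "colorectal cancer"}


def tag_entities(sentence, dictionary, label, max_len=4):
    """Greedy longest-match lookup of dictionary terms over word tokens."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in dictionary:
                found.append((candidate, label))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return found


sentence = "Serum prostate-specific antigen is elevated in prostate cancer."
entities = (tag_entities(sentence, MARKERS, "MARKER")
            + tag_entities(sentence, CANCERS, "CANCER"))
print(entities)
```

Longest-match lookup is used so that a multi-word term such as "prostate cancer" is tagged as one entity rather than as separate words.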
RESULTS
Relational and filtering keywords were selected after annotating 160 abstracts from PubMed. Relational information was extracted only when a sentence contained both a tumor marker entity and a disease entity and one of the relational keywords occupied an appropriate position in the sentence's parse tree. The performance of the system was evaluated on a separate set of 77 abstracts. With the relational and filtering keywords, precision was 94.38% and recall was 66.14%; without this expert knowledge, precision was 49.16% and recall was 69.29%.
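For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed. The abstract does not report the underlying counts; the counts used here are illustrative assumptions, chosen only because they are consistent with the reported percentages, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(true_positives, extracted, gold):
    """Precision = correct / extracted; recall = correct / gold-standard relations."""
    precision = true_positives / extracted
    recall = true_positives / gold
    return precision, recall


# Illustrative counts (assumption, not taken from the paper).
with_keywords = precision_recall(true_positives=84, extracted=89, gold=127)
without_keywords = precision_recall(true_positives=88, extracted=179, gold=127)

for name, (p, r) in [("with keywords", with_keywords),
                     ("without keywords", without_keywords)]:
    print(f"{name}: precision={p:.2%}, recall={r:.2%}")
```

Under these assumed counts, the keyword filter discards many incorrect extractions (raising precision sharply) while losing only a few correct ones (lowering recall slightly), which matches the trade-off reported above.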
CONCLUSIONS
We developed a system that extracts relational information between a tumor and its markers by incorporating expert knowledge. A system that exploits expert knowledge in this way could serve as a reference for developing information extraction systems in other medical fields.

Keywords

Tumor; Tumor marker; Information extraction

MeSH Terms

Abstracting and Indexing as Topic
Algorithms
Database Management Systems
Humans
*Medical Informatics Computing
Neoplasms/metabolism
Programming Languages
*PubMed
Software
*Tumor Markers, Biological

Figures

  • Fig. 1. System overview for the relational information extraction. (A) Named entity recognition module. (B) Information extraction module.

  • Fig. 2. Named entity recognition (NER) module by support vector machine (SVM)

  • Fig. 3. Sentence annotation by medical experts for extracting relational and filtering keywords.

  • Fig. 4. Postorder traversal algorithm for extracting relational information using a parse tree.

  • Fig. 5. Example of extracted information.
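
Fig. 4 refers to a postorder traversal of a parse tree. The abstract does not spell out the algorithm, so the following is only a rough sketch of how such a traversal might work: children are visited before their parent, and a relation is emitted at the lowest node whose subtree contains a marker, a cancer, and a relational keyword. The `Node` class, the keyword list, and this reading of "appropriate position" are all assumptions; filtering keywords are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # syntactic tag or entity label
    text: str = ""                  # surface text for leaves
    children: list = field(default_factory=list)


# Hypothetical relational keywords (assumption, not the authors' list).
RELATION_KEYWORDS = {"elevated", "overexpressed", "associated"}
REQUIRED = {"marker", "cancer", "keyword"}


def classify(leaf):
    """Map a leaf to the piece of relational evidence it carries, if any."""
    if leaf.label == "MARKER":
        return {"marker": leaf.text}
    if leaf.label == "CANCER":
        return {"cancer": leaf.text}
    if leaf.text.lower() in RELATION_KEYWORDS:
        return {"keyword": leaf.text}
    return {}


def extract(node, results):
    """Postorder: process all children first, then combine findings at this node."""
    if not node.children:
        return classify(node)
    found = {}
    for child in node.children:
        for key, value in extract(child, results).items():
            found.setdefault(key, value)
    if REQUIRED <= set(found):
        results.append((found["marker"], found["keyword"], found["cancer"]))
        return {}                   # consumed, so ancestors do not report it again
    return found


# Hand-built parse tree for "PSA is elevated in prostate cancer".
tree = Node("S", children=[
    Node("NP", children=[Node("MARKER", "PSA")]),
    Node("VP", children=[
        Node("VBZ", "is"),
        Node("VP", children=[
            Node("VBN", "elevated"),
            Node("PP", children=[
                Node("IN", "in"),
                Node("NP", children=[Node("CANCER", "prostate cancer")]),
            ]),
        ]),
    ]),
])

results = []
extract(tree, results)
print(results)
```

Visiting children first means each node sees the complete evidence of its subtree before deciding, so the relation is attached to the smallest constituent that covers all three elements.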

