Searching Semi-Structured Data Using Landmarks

This paper introduces landmark search operators for extracting data from poorly formatted Web pages, plain text files, and XML/SGML documents that lack grammars. The emphasis is on ease of use and on a fast, simple implementation that can be readily ported to a wide variety of host languages.

There are two main operators: one uses unique textual landmarks to divide text regions into smaller regions suitable for further search; the other searches for XML/SGML tag pairs and returns the matches as regions. An iterator class allows a search to be carried out repeatedly.
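To make the two operators concrete, here is a minimal Python sketch of the idea. It is not the paper's implementation or API: the names Region, between, and tag_pairs are hypothetical, the tag search assumes non-nested tag pairs, and a Python generator stands in for the iterator class.

```python
from typing import Iterator, NamedTuple, Optional


class Region(NamedTuple):
    """A slice [start, end) of a shared source text (hypothetical type)."""
    text: str
    start: int
    end: int

    def content(self) -> str:
        return self.text[self.start:self.end]


def between(region: Region, left: str, right: str) -> Optional[Region]:
    """Landmark operator (assumed form): return the sub-region lying between
    the first `left` landmark and the next `right` landmark, or None."""
    i = region.text.find(left, region.start, region.end)
    if i == -1:
        return None
    begin = i + len(left)
    j = region.text.find(right, begin, region.end)
    if j == -1:
        return None
    return Region(region.text, begin, j)


def tag_pairs(region: Region, tag: str) -> Iterator[Region]:
    """Tag-pair operator (assumed form): yield the content of each
    <tag ...>...</tag> pair in the region. Nested pairs are not handled."""
    text, end = region.text, region.end
    open_prefix, close = "<" + tag, "</" + tag + ">"
    pos = region.start
    while True:
        i = text.find(open_prefix, pos, end)
        if i == -1:
            return
        after = i + len(open_prefix)
        # Skip longer tag names that merely share the prefix, e.g. <table> for "ta".
        if after >= end or not (text[after] == ">" or text[after].isspace()):
            pos = after
            continue
        gt = text.find(">", after, end)
        if gt == -1:
            return
        j = text.find(close, gt + 1, end)
        if j == -1:
            return
        yield Region(text, gt + 1, j)
        pos = j + len(close)


# Usage: narrow the page to the table with landmarks, then iterate over cells.
page = "junk <b>Prices</b> <table><tr><td>apple</td><td class='x'>1.20</td></tr></table> footer"
whole = Region(page, 0, len(page))
table = between(whole, "<table>", "</table>")
cells = [r.content() for r in tag_pairs(table, "td")] if table else []
```

Because every result is itself a region over the same text, searches can be chained: a landmark search narrows the text first, and a tag-pair search then runs only inside that narrowed region.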

Downloads

The PDF file for the paper (170 KB). Last updated: 5th July 2005.
Zipped source code and examples (71 KB). Last updated: 5th July 2005.

Dr. Andrew Davison
E-mail: ad@coe.psu.ac.th