
Currently Skimming:

Searching for Statistical Diagrams--Shirley Zhe Chen, Michael J. Cafarella, and Eytan Adar
Pages 69-78

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 69...
... Standard text-based search will only retrieve the diagrams' enclosing documents. Web image search engines may retrieve some diagrams, but they generally work by examining textual content that surrounds images, thus missing out on many important signals of diagram content (Bhatia et al., 2010; Carberry et al., 2006)
From page 70...
... They are embedded in PDFs with little to distinguish them from surrounding text, the text embedded in a diagram is highly stylized with meaning that is very sensitive to the text's precise role, and, because diagrams are often an integral part of a highly engineered document, they can have extensive "implicit hyperlinks" in the form of figure references from the body of the surrounding text. Our Diagram Extractor component attempts to recover all of the relevant text for a diagram and determine an appropriate semantic label (caption, y-axis label, etc.)
From page 71...
... Furthermore, we show that DiagramFlyer's hybrid snippet generator allows users to find results 33% more accurately than with a standard image-driven snippet. We also place DiagramFlyer's intellectual contributions in a growing body of work on domain-independent information extraction -- techniques that enable retrieval of structured data items from unstructured documents, even when the number of topics (or domains)
From page 72...
... It also looks for any surrounding text that mentions the figure, labeling the relevant sentences as ...
FIGURE 3 Diagram metadata labels for a sample diagram.
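The search for surrounding text that mentions a figure can be illustrated with a small sketch. This is not the paper's extractor: the function name `sentences_mentioning_figure`, the naive sentence splitter, and the "Figure N" / "Fig. N" reference pattern are all assumptions made for illustration.

```python
import re

def sentences_mentioning_figure(body_text, figure_number):
    """Return sentences from body_text that refer to the given figure number.

    Hypothetical helper: splits on sentence-ending punctuation (but not
    after the abbreviation "Fig.") and keeps sentences that contain
    "Figure N" or "Fig. N".
    """
    # Naive sentence split; an assumption, real documents need more care.
    sentences = re.split(r"(?<![Ff]ig\.)(?<=[.!?])\s+", body_text.strip())
    # \b after the number keeps "Figure 3" from matching "Figure 30".
    pattern = re.compile(
        rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b", re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]
```

For example, calling `sentences_mentioning_figure(text, 3)` on a document body returns only the sentences that cite Figure 3, which could then be attached to that diagram's metadata.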
From page 73...
... The index tracks each extracted field separately so that keyword matches on individual parts of the diagram can be weighted differently during ranking. As seen in Figure 4, DiagramFlyer's online search system is similar in appearance to traditional Web search engines.
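Field-level weighting of keyword matches can be sketched as follows. This is a minimal illustration, not DiagramFlyer's ranker: the weight values, the field names other than caption and y-axis label, and the simple term-count scoring are assumptions.

```python
from collections import Counter

# Hypothetical weights; the source does not state the actual values.
FIELD_WEIGHTS = {
    "caption": 2.0,
    "y_axis_label": 1.5,
    "x_axis_label": 1.5,  # assumed field name
    "context": 0.5,       # assumed field name for surrounding body text
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def score(query, diagram_fields, weights=FIELD_WEIGHTS):
    """Sum query-term matches per field, scaled by that field's weight."""
    terms = tokenize(query)
    total = 0.0
    for field, text in diagram_fields.items():
        counts = Counter(tokenize(text))
        matches = sum(counts[t] for t in terms)
        total += weights.get(field, 0.0) * matches
    return total

def rank(query, diagrams):
    """Order diagrams by descending score for the query."""
    return sorted(diagrams, key=lambda d: score(query, d["fields"]), reverse=True)
```

Because each field is scored separately, a match on "population" in a caption can count for more than the same match in a y-axis label, simply by changing the entries in `FIELD_WEIGHTS`.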
From page 74...
... FRONTIERS OF ENGINEERING
FIGURE 4 A screenshot of the DiagramFlyer search system.
From page 75...
... To target PDFs that are more likely to contain diagrams, we further restricted the crawl to the .edu domain. A query workload is critical for evaluating our Search Ranker and Snippet Generator components, but we do not discuss them in this abbreviated paper.
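A crawl restriction of this kind could be expressed as a simple URL filter. The sketch below is an assumption about how such a filter might look, not the authors' crawler: the function name `is_edu_pdf` is hypothetical, and checking for a `.pdf` path suffix is a simplification, since PDFs can be served from URLs without that extension.

```python
from urllib.parse import urlparse

def is_edu_pdf(url):
    """Return True if the URL is on a .edu host and its path ends in .pdf.

    Hypothetical filter for illustration only.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_edu = host == "edu" or host.endswith(".edu")
    return on_edu and parsed.path.lower().endswith(".pdf")
```

The host is checked separately from the path so that a URL such as `http://example.com/x.edu.pdf` is not mistaken for a .edu document.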
From page 76...
... RELATED WORK
There is a vast literature on text search, snippet generation, image search, and image processing; much of it is not relevant to the unusual demands of searching statistical diagrams. There has been some work in specialized diagram understanding, for example, in processing telephone system diagrams (Arias et al., 1995)
From page 77...
... Proceedings of the 29th Annual International ACM SIGIR Conference, Seattle, Wash., August 6–11, 2006. Huang, W., C


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.